
Companies

Every company in the AI evals space we've reviewed. Independent — we don't accept vendor sponsorships, and reviews are updated as products change.

25 companies

Arize AI

ML observability platform extended into LLMs, with the open-source Phoenix framework as a popular standalone trace viewer.

7.2
observability · ML monitoring · LLM evals · freemium

Braintrust

Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.

9.1
LLM evals · observability · prompt management · freemium

Comet (Opik)

Open-source LLM evaluation and observability from a mature MLOps team — credible Langfuse alternative.

7.4
observability · LLM evals · MLOps · open-source

Datadog

APM giant with bolted-on LLM observability for OpenAI and Anthropic calls.

6.4
observability · APM · paid

DeepEval (Confident AI)

pytest-style LLM evaluation framework with synthetic dataset generation and CI/CD-native testing.

7.6
LLM evals · freemium

Evidently AI

Open-source ML and LLM evaluation framework with strong methodology docs — building blocks, not a finished platform.

7.0
LLM evals · ML monitoring · open-source

Fiddler

Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.

7.2
AI governance · agent observability · enterprise

Galileo

Agent reliability platform with cheap, fast evaluators that can run on every request in production.

7.5
agent observability · LLM evals · freemium

HUD

Open-source platform for building RL environments and evals for computer-use agents — used by frontier labs, ships its own benchmarks.

7.8
agent observability · RL environments · benchmarks · open-source

Label Studio

Open-source data annotation platform with rubric enforcement, escalation workflows, and audit trails — extended to LLM review.

7.0
annotation · LLM evals · open-source

Langfuse

Open-source LLM observability with evals, prompt management, and best-in-class tracing.

8.4
observability · LLM evals · prompt management · open-source

LangSmith

Observability and evaluation built by the LangChain team — best-in-class if your stack is LangChain or LangGraph.

7.5
observability · LLM evals · prompt management · freemium

LiteLLM

Open-source Python SDK and proxy that lets you call 100+ LLM providers through a single OpenAI-format interface, translating each request to the provider's native API.

8.0
LLM gateway · multi-provider routing · open-source

Maxim AI

AI quality evaluation platform with prebuilt and custom scorers, designed to plug into existing observability stacks.

6.8
LLM evals · freemium

MLflow

Open-source MLOps standard with LLM tracing, evaluation, and prompt management bolted on top.

6.6
MLOps · observability · LLM evals · open-source

OpenRouter

Single OpenAI-compatible endpoint to 500+ models across 60+ providers, billed pay-as-you-go.

8.2
LLM gateway · multi-provider routing · paid

Portkey

Full-stack AI gateway with the broadest model catalog, built-in guardrails, and enterprise-grade governance.

7.8
LLM gateway · multi-provider routing · AI governance · freemium

Promptfoo

Open-source CLI for evaluating LLM prompts and red-teaming applications, with YAML/JSON configs that live next to your code.

7.4
LLM evals · red-teaming · open-source

PromptHub

Git-style version control for prompts — branch, commit, merge, and CI-gate prompt changes.

6.8
prompt management · freemium

PromptLayer

Visual prompt editor and version control built for non-technical teams.

7.0
prompt management · freemium

RAGAS

Open-source evaluation framework purpose-built for RAG pipelines, with reference-free metrics that became the industry standard.

7.5
LLM evals · RAG evaluation · open-source

SuperAnnotate

Annotation platform with strong tooling for measuring and resolving disagreements between human reviewers and automated scorers.

6.8
annotation · LLM evals · paid

Vellum

Visual workflow builder with built-in observability for low-code agent development.

7.0
prompt management · agent observability · freemium

Weights & Biases Weave

LLM tracing, evaluation, and prompt management embedded inside the Weights & Biases ML platform.

6.8
observability · LLM evals · prompt management · MLOps · freemium

ZenML

Open-source MLOps and LLMOps framework for building reproducible, infrastructure-agnostic AI pipelines.

6.8
MLOps · LLM evals · freemium