Reviews of AI eval tools — written for developers.
We test, compare, and review the tools shaping how engineering teams measure LLMs and agents in production.
- companies reviewed: 25
- last updated: May 22, 2026
Featured companies
all companies →
Braintrust
Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.
Fiddler
Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.
Galileo
Agent reliability platform with cheap, fast evaluators that can run on every request in production.
HUD
Open-source platform for building RL environments and evals for computer-use agents — used by frontier labs, ships its own benchmarks.
Langfuse
Open-source LLM observability with evals, prompt management, and best-in-class tracing.
LiteLLM
Open-source Python SDK and proxy that translates requests across 100+ LLM providers into the OpenAI format.
Recent editorial
all editorial →
LLM evals and observability company acquisitions
Eight acquisitions in fourteen months — Langfuse, Humanloop, Helicone, Promptfoo, Velvet, Weights & Biases, Statsig, Galileo. Who bought what, the three buyer patterns behind the deals, and what it means if you're picking a tool right now.
How to reduce LLM costs in production
A practical guide to finding where your LLM bill is actually going, fixing the expensive parts, and keeping the savings in place — with notes on the tools we'd reach for at each step.
How to actually lower your LLM bill (without shipping worse output)
Why aggregate dashboards stop being enough once your AI app is real, and the workflow engineering teams use to find the expensive steps, replace them, and ship the change without breaking quality.