Reviews of AI eval tools — written for developers.
We test, compare, and review the tools shaping how engineering teams measure LLMs and agents in production.
- Companies reviewed: 16
- Last updated: Apr 19, 2026
Featured companies
Braintrust
Eval-driven dev platform combining traces, datasets, scorers, and a playground in one product.
Fiddler
Enterprise ML governance platform extended to LLMs and generative AI, with audit-ready traces and in-environment evaluations.
Galileo
Agent reliability platform with cheap, fast evaluators that can run on every request in production.
Helicone
Proxy-based LLM observability — drop in by changing the base URL, no SDK changes needed (see the sketch after this list).
Langfuse
Open-source LLM observability with evals, prompt management, and best-in-class tracing.
Vellum
Visual workflow builder with built-in observability for low-code agent development.
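To make Helicone's "change the base URL" claim concrete, here is a minimal sketch of a proxy-style drop-in using the OpenAI Python SDK. The endpoint (oai.helicone.ai/v1) and the Helicone-Auth header follow Helicone's public docs as we understand them; treat both as assumptions and confirm against the current documentation.

```python
# Sketch: route OpenAI traffic through a logging proxy (Helicone-style).
# Assumptions: proxy endpoint oai.helicone.ai/v1 and a Helicone-Auth header,
# per Helicone's docs at the time of writing; verify before relying on them.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # the only integration change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

# Application code is unchanged; the proxy records the request and response.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```

The appeal of this pattern is that observability lives entirely in the transport layer: no wrapper SDK, no decorators, and removing it is a one-line revert.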
Recent editorial
The best prompt management tools (2026)
Seven prompt management tools, ranked by what they actually solve — from no-code editors to Git-style versioning to eval-first platforms.
The best AI agent observability tools (2026)
Five tools we'd actually pick for monitoring multi-step agents in production — what they cover, where they break, and who each one is for.
Arize AI alternatives (2026)
Five platforms to consider if Arize's ML-first architecture isn't the right fit for an LLM-only workflow — and one honest case for sticking with Arize.