ai-evals.tools

What we review

Tools that help engineering teams measure the quality of LLMs and agents in production. That includes offline evals, online evals, observability, prompt management, dataset tooling, and red-teaming.

Criteria

Developer experience. SDK ergonomics, docs, time-to-first-eval.
Coverage. Built-in metrics, custom evals, support for agents and tool use.
Production readiness. Latency, sampling, PII handling, on-prem options.
Pricing transparency. Public pricing, usage caps, free tier limits.
Open source posture. What's open, what's not, and how the OSS roadmap relates to the commercial product.

Scoring

Each product is scored from 0 to 10 against the criteria above, with developer experience and production readiness weighted most heavily. We recompute a score when a product changes meaningfully, and the reviewed date on each company page shows when we last re-tested.
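
As a rough illustration of what a weighted score looks like, here is a minimal Python sketch. It assumes each criterion is scored 0–10 on its own and then combined into one overall number. The weight values, the criterion keys, and the names `WEIGHTS` and `overall_score` are hypothetical, chosen only to show developer experience and production readiness counting for more; they are not the site's actual figures.

```python
# Hypothetical weights (percentages). The exact values are an assumption;
# they only illustrate a weighting that favors developer experience and
# production readiness.
WEIGHTS = {
    "developer_experience": 30,
    "coverage": 15,
    "production_readiness": 30,
    "pricing_transparency": 15,
    "open_source_posture": 10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0-10) into one weighted 0-10 score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the criteria in WEIGHTS")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")

    # Weighted average: sum of weight * score, divided by the total weight.
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return round(weighted_sum / sum(WEIGHTS.values()), 1)


example = {
    "developer_experience": 9,
    "coverage": 6,
    "production_readiness": 8,
    "pricing_transparency": 6,
    "open_source_posture": 5,
}
print(overall_score(example))  # 7.4
```

With these assumed weights, a product that is strong on developer experience and production readiness scores 7.4 overall even though its open source posture is middling.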