ai-evals.tools

Verdict

The annotation specialist. If your bottleneck is "we need rigorous, auditable human review with rubric enforcement across many reviewers," Label Studio is the most mature answer in the category. The cost of that specialization: it's an annotation platform, not an eval platform — bring your own tracing, scoring, and CI/CD, and plan to do real glue work to connect them.

What it is

Label Studio is the open-source annotation platform from HumanSignal (originally Heartex). It's been the de facto OSS choice for ML data labeling for years, and has been extended to support LLM review workflows — scoring agent outputs, annotating responses against rubrics, escalating disputed cases.

The open-source edition is licensed under Apache 2.0. HumanSignal sells an Enterprise tier that adds SSO, support for larger-scale deployments, and managed hosting.

Where it shines

Review operations. Assignment queues, rubric enforcement, multi-reviewer agreement, audit trails — all the boring parts of running real review at scale, all built out properly.
Maturity. The product has been in production at major ML teams for years. That kind of reliability is rare on the LLM-native side of this category.
Open source. Real OSS, not "open core with the good parts gated."
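
To make "multi-reviewer agreement" concrete, here is a minimal sketch of Cohen's kappa, one common way to score how often two reviewers agree beyond what chance would produce. It is an illustration only and not Label Studio's own agreement calculation; the function name, the pass/fail labels, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: share of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same label if each reviewer
    # labelled at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical rubric verdicts from two reviewers on six agent outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this sample the reviewers agree on four of six items, but half of that agreement is expected by chance, so kappa comes out well below the raw agreement rate. That gap is why review operations track a chance-corrected metric rather than a simple match percentage.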

Where it falls short

Not an eval platform. Label Studio handles annotation. It doesn't handle tracing, automated scoring, dataset management, or CI integration. Connecting to those is your problem.
Not LLM-native. The data model is "items to label," not "agent traces with spans and tool calls." Reviewing a multi-step agent run is awkward without another tool surfacing the trace first.
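
The glue work usually starts with flattening a trace into an item a reviewer can label. The sketch below assumes a hypothetical trace shape (trace_id, input, output, and a list of spans) and turns it into the list-of-tasks JSON that Label Studio can import, where each task carries its fields under a "data" key. The field names inside "data" are illustrative assumptions and would need to match whatever labeling configuration you define.

```python
import json


def trace_to_task(trace):
    """Flatten one agent trace (hypothetical shape) into a single labeling task."""
    # Render each span as one numbered line so a reviewer can read the run top to bottom.
    step_lines = []
    for index, span in enumerate(trace["spans"], start=1):
        step_lines.append(f"{index}. [{span['kind']}] {span['name']}: {span['output']}")

    return {
        "data": {
            # These keys are assumptions; they must match the variables
            # referenced by your own labeling configuration.
            "trace_id": trace["trace_id"],
            "user_input": trace["input"],
            "final_output": trace["output"],
            "steps": "\n".join(step_lines),
        }
    }


# Hypothetical trace as it might be exported from a separate tracing tool.
example_trace = {
    "trace_id": "run-001",
    "input": "What is the refund window?",
    "output": "Refunds are accepted within 30 days.",
    "spans": [
        {"kind": "tool", "name": "search", "output": "policy.md matched"},
        {"kind": "llm", "name": "answer", "output": "Refunds are accepted within 30 days."},
    ],
}

tasks = [trace_to_task(example_trace)]
tasks_json = json.dumps(tasks, indent=2)
print(tasks_json)
```

Collapsing spans into a single text field loses the tree structure of the trace, which is the awkwardness described above: the reviewer sees a readable summary, and the full trace still has to live in the tracing tool.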

Bottom line

The right pick if your team already has tracing and automated scoring solved, and your bottleneck is rigorous, auditable human review at scale. If you're building a complete eval workflow from scratch, the integrated platforms (Braintrust, Langfuse) will get you further faster.

Related

Arize AI

Braintrust

Comet (Opik)