$ ai-evals
Post · April 20, 2026 · Ethan

Why evals are finally the bottleneck

Models stopped being the bottleneck. Evals took the slot — and most teams are still flying blind.

Two years ago, the conversation was about which model to use. The answer flipped every six weeks, and the answer mattered: a 5-point jump on MMLU translated, roughly, into a usable feature versus an unusable one.

That era is over.

The frontier models are close enough that, for most production use cases, the choice between them is a rounding error. The bottleneck moved. It moved to the layer that decides whether your specific feature, with your specific prompts, on your specific user inputs, actually works.

The shape of the problem

You ship an LLM feature. It works on the demos. A week later, a customer support ticket comes in: the model said something wrong. You look at the trace. You can see what it said. You can't see why. You don't know if it's a one-off, or 1% of traffic, or 10%.

That's the eval problem. Not "which model scored higher on a public benchmark" but "is my feature getting better or worse, this week, according to my definition of better."
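
To put a number on that uncertainty: a minimal sketch, assuming you've hand-graded a random sample of production traces as pass/fail. The counts are hypothetical; the interval is a standard Wilson score interval.

  import math

  def wilson_interval(failures, n, z=1.96):
      # 95% Wilson score interval for a failure rate observed in n graded traces.
      p = failures / n
      denom = 1 + z**2 / n
      center = (p + z**2 / (2 * n)) / denom
      margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
      return max(0.0, center - margin), min(1.0, center + margin)

  # Hypothetical: 3 failures turn up in 200 graded traces.
  lo, hi = wilson_interval(3, 200)
  print(f"failure rate: 95% CI [{lo:.1%}, {hi:.1%}]")  # ~[0.5%, 4.3%]

Three bad traces in two hundred is consistent with anything from a near one-off to a 4%-of-traffic problem. Until you grade a bigger sample, you genuinely can't tell.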

What's missing

Most teams I talk to have one of these:

  • A spreadsheet of test cases that someone updates by hand.
  • A CI job that runs a few prompts against a few models and prints the diff.
  • Nothing.

What they're missing is the loop. Datasets that grow as production surfaces new cases. Online scoring that flags regressions before a human notices. Prompt experiments that compare like with like and say, with statistical confidence, that the new prompt is better.
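
"Statistical confidence" sounds heavier than it is. Here's a minimal sketch of the comparison step, assuming per-case 0/1 scores on the same eval set; the data is hypothetical, and a paired bootstrap stands in for whatever test a real tool runs.

  import random

  def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=0):
      # Fraction of resamples where prompt B beats prompt A on mean score:
      # a rough confidence that B's win is real rather than noise.
      assert len(scores_a) == len(scores_b), "must score the same cases"
      rng = random.Random(seed)
      diffs = [b - a for a, b in zip(scores_a, scores_b)]
      n, wins = len(diffs), 0
      for _ in range(iters):
          resample = [diffs[rng.randrange(n)] for _ in range(n)]
          if sum(resample) / n > 0:
              wins += 1
      return wins / iters

  # Per-case correctness on the same ten eval cases under each prompt:
  a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
  b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
  print(f"P(prompt B > prompt A) ≈ {paired_bootstrap(a, b):.0%}")  # ~89%

Note what ten cases buys you: a prompt that wins twice and never loses still only reaches about 89% confidence. That's why the datasets have to grow.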

That loop is what the companies we review here are building.

What we're going to do

We're going to test these tools. Not from press releases. By using them. Same dataset, same task, same model — and we'll write down what was annoying, what we couldn't do, and what made us close the tab. Independent, no sponsorships, scores that move when the products move.

If you're picking eval tooling this quarter, that's what we're for.

#opinion #evals