$ ai-evals
← all editorial
PostJune 23, 2026· Ethan

Evals aren't the new PRD. The rubric is.

"Evals are the new PRD" is half right. Evals replaced acceptance criteria — but the spec didn't disappear, it moved into a scoring prompt nobody reviews.

There's a tidy claim making the rounds: in the AI era, evals are the new PRD. The product requirements doc is dead because AI output is non-deterministic, so the spec becomes a suite of structured tests. The PM stops writing "make it better" and starts saying "make this number go up."

It's a good piece, and the core observation is right: a static document that nobody runs is a weak way to define quality for a system whose output changes every time you call it. But the slogan overshoots. Evals didn't replace the PRD. They replaced one half of it — and the other half didn't disappear. It got compressed into a scoring rubric that nobody reviews like a spec.

A PRD does two jobs, not one

Strip a good PRD down and you find two different documents wearing one cover.

The first is acceptance criteria: the testable conditions for "done." The recipe must list every ingredient. The summary must stay under 200 words. The agent must not call the refund tool without confirmation. This half should be executable, and for AI features a static checklist genuinely is worse than a running eval. Here the slogan holds. Evals are strictly better acceptance criteria — they run continuously, they catch regressions, they put a number on fuzzy quality.

The second job is intent: why this feature exists, who it's for, what we're deliberately not building, how it trades off against the next three things on the roadmap. None of that is testable. None of it is a number. And an eval suite is completely silent on it.

The "evals are the new PRD" claim quietly assumes the second job already happened — that someone decided what "good" means before anyone wrote a scorer. That decision is the hard part, and the eval can't make it for you.

You can pass every eval and ship the wrong thing

An eval defends a definition of good. It is bad at discovering one.

For a mature, narrow feature, that's fine — you know what good looks like, and the eval keeps you honest about staying there. But the moment a feature is genuinely new, you don't have the definition yet. You're going to learn it by putting something in front of users and watching what they do. Write the eval first and you've just encoded your best guess from before you knew anything, then spent a quarter optimizing toward it.

A green eval suite tells you the output matches the rubric. It does not tell you the rubric matches the user. Those are different claims, and conflating them is how teams ship a feature that scores 0.94 and still gets churned.

The number-going-up trap is in the slogan itself

The original piece nods at Goodhart's Law — optimize a metric and it stops being a good metric — and then, a few paragraphs later, hands the PM a new mantra: stop saying "make it better," start saying "make this number go up."

That is Goodharting, stated as a virtue. The whole risk of evals-as-spec is that the metric is a lossy compression of what you actually wanted, and the tighter you optimize it, the more the loss shows. Prose ambiguity in a PRD is annoying, but it's also load-bearing: it preserves intent the metric can't capture. "The tone should feel trustworthy" is unmeasurable and exactly the kind of thing that quietly dies when the spec becomes three AI-judge scores.

The rubric is the spec now — and it's the worst-documented spec you have

Here's the part that should bother people. When you replace a PRD with an eval, the product thinking doesn't vanish. It moves into the judge's rubric — the prompt that tells an LLM what counts as a good answer.

That rubric is a specification. It encodes every product decision: what matters, what's a dealbreaker, how to weigh clarity against completeness. It's just a specification that lives in a code repo, written by one engineer at 6pm, versioned (if you're lucky) in a commit message, and reviewed by no one. Design can't read it. Support can't argue with it. Legal never sees it. The whole audience a PRD exists to align is now locked out of the document that replaced it.

So the real shift isn't "evals replace PRDs." It's that the spec migrated into the rubric — and we stopped treating it like a spec. That's a downgrade dressed as progress.

What actually changed

The honest version of the claim:

  • Evals are the new acceptance test. They replace the testable half of the PRD, and they're better at it. Believe this part.
  • The rubric is the new requirements doc. That's where intent now lives, and it deserves the rigor a PRD used to get: versioned, reviewed by non-engineers, debated, owned by the PM. Almost nobody does this.
  • Strategy is still strategy. Why, for whom, and what not to build are not numbers and never will be. If your eval suite is your only spec, those decisions are still being made — just implicitly, by whoever wrote the scorer.

The weekly cadence — Monday review traces, Tuesday curate failures, Wednesday compare models — is a genuinely good operating loop for a feature that already knows what it's for. It's just not a substitute for deciding what it's for.

So what do you do with this

Keep the eval suite. It's the best thing to happen to AI product quality in two years, and we spend most of our time here testing the tools that run it. But stop pretending it's the whole spec.

Two concrete moves:

  1. Treat the rubric as a reviewed document. Pull it out of the prompt string. Put it somewhere design, support, and your PM can read and redline. The day someone outside engineering disagrees with a line in your scorer is the day it's doing its job.
  2. Write down what the eval can't measure. One short page of intent and non-goals next to the eval suite. Call it whatever you want. It's a PRD, and it's the part the evals were never going to cover.

Evals are the new acceptance criteria. The rubric is the new PRD. And the strategy was never going to fit in a number — it just got easier to pretend it did.

#opinion#evals#workflow