What it is
Vellum is a visual builder for LLM workflows and agents, with observability and evaluation built into the canvas. You compose an agent as a graph of nodes — prompts, tools, conditionals — and the same view shows traces, scores, and A/B test results once the workflow is live. Free tier with 30 credits/month; paid plans start at $25/month.
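To make the graph-of-nodes model concrete, here is a minimal sketch of how a workflow built from prompt, tool, and conditional nodes might be structured. This is a generic illustration in plain Python under assumed names (`Node`, `Workflow`, `execute` are all hypothetical), not Vellum's actual SDK or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical model of a workflow-as-graph; names are illustrative,
# not Vellum's real API. Each node transforms a shared state dict and
# chooses the next node (a conditional edge) based on that state.
@dataclass
class Node:
    name: str
    run: Callable[[dict], dict]                      # transform shared state
    next: Callable[[dict], Optional[str]] = lambda state: None  # pick next node

@dataclass
class Workflow:
    nodes: Dict[str, Node] = field(default_factory=dict)
    entry: str = ""

    def add(self, node: Node, entry: bool = False) -> None:
        self.nodes[node.name] = node
        if entry:
            self.entry = node.name

    def execute(self, state: dict) -> dict:
        current: Optional[str] = self.entry
        while current is not None:
            node = self.nodes[current]
            state = node.run(state)
            current = node.next(state)
        return state

# A "prompt" node that classifies intent, wired to a "tool" node via a
# conditional edge. Stand-ins for real LLM and tool calls.
wf = Workflow()
wf.add(Node("classify",
            run=lambda s: {**s, "intent": "billing" if "invoice" in s["query"] else "other"},
            next=lambda s: "lookup" if s["intent"] == "billing" else None),
       entry=True)
wf.add(Node("lookup",
            run=lambda s: {**s, "answer": "invoice found: paid"}))

result = wf.execute({"query": "where is my invoice?"})
```

The same traversal that runs the graph is what a canvas-based builder renders visually, which is why design and debugging can share one view.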
Developer experience
The "developer experience" framing fits Vellum a bit awkwardly: a meaningful chunk of its appeal is making agent development less code-centric. Engineers can drop into custom code nodes when they need to, but the product is happiest when most of your workflow lives in the visual graph.
Where it shines
- Cross-functional collaboration. PMs can read and modify the same workflow engineers built; that productivity unlock is hard to overstate.
- Coherent debug-and-iterate loop. The graph used to design the agent is the same one you debug it in.
- Built-in evaluation. Online evals run against the same workflow; you don't need a separate eval product.
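Conceptually, an online eval of the kind described above is just a set of scoring functions attached to the workflow's live outputs, with scores logged next to the trace. A hedged sketch in generic Python (the scorer names and `score_output` helper are hypothetical, not Vellum's eval API):

```python
from typing import Callable, Dict, List

# A scorer maps (workflow inputs, workflow output) to a score in [0, 1].
Scorer = Callable[[dict, str], float]

def contains_citation(inputs: dict, output: str) -> float:
    # Example scorer: reward answers that cite a source.
    return 1.0 if "[source]" in output else 0.0

def within_length(inputs: dict, output: str) -> float:
    # Example scorer: penalize overlong answers.
    return 1.0 if len(output) <= 500 else 0.0

def score_output(inputs: dict, output: str, scorers: List[Scorer]) -> Dict[str, float]:
    # Run every scorer against one live output; in an online setup these
    # results would be attached to the corresponding trace.
    return {fn.__name__: fn(inputs, output) for fn in scorers}

scores = score_output({"query": "refund policy"},
                      "Refunds are issued within 14 days [source].",
                      [contains_citation, within_length])
```

The point of "built-in" is that these scorers run against the same deployed graph, so there is no separate eval harness to keep in sync.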
Where it falls short
- Code-first teams hit walls. If most of your agent is custom Python with state and side effects, the visual model fights you.
- Lock-in. Workflows live in Vellum. Migrating off is non-trivial.
- Niche. The teams it fits, it fits well. Outside that audience it's an awkward choice.
Bottom line
If your AI org includes meaningful PM or domain-expert participation in agent design, Vellum deserves a serious look. For pure-engineering teams shipping code-first agents, the SDK-based platforms (Braintrust, Langfuse) are a better fit.