How to reduce LLM costs in production

TL;DR

Most LLM cost problems hide because the dashboard that tells you spend is up doesn't tell you which prompt, tool call, or model decision drove the increase. Aggregate spend numbers are a starting signal, not a debugging tool. The fix is trace-level cost observability — every model call, tool invocation, and retrieval step instrumented as its own span with token counts and estimated cost attached, rolled up through the trace tree. Once you can see where the money is going, three optimizations cover most of the wins: tighter prompts, the right model per step, and fixing workflow patterns (retrieval bloat, silent retries, runaway agent loops). And every cost-cutting change goes through evals before merge, so the savings don't quietly come back as quality regressions. Braintrust and Langfuse are the two tools we'd reach for to run this workflow end to end.

Where LLM costs hide and how to surface them

Cost problem	What it looks like in production	What surfaces it
Context bloat	System prompts that grew from 800 → 4,000 tokens, oversized retrieval chunks	Token-scaled timeline view
Expensive tool calls	Retrieval that pulls 12k tokens when 800 would do, agents that loop on failed tool calls	Per-tool-call cost attribution inside traces
Retry storms	Schema mismatches and malformed JSON triggering 2–3 retries of the full prompt	Trace tree with cost propagated to the parent span
Overpowered model	Frontier-tier model used on extraction, classification, or routing steps	Model experiments with cost + quality side by side
Verbose prompts	Few-shot examples and instructions accumulated over months that no longer earn their tokens	Prompt experiments in a side-by-side playground
Silent regressions	Cost-cutting changes that quietly lower output quality on edge cases	CI-gated evals on every PR

Why LLM costs balloon in production

LLM cost problems almost never show up at launch. The first version of a feature is small, well-scoped, and inexpensive per request. Real cost arrives later, when production traffic, accumulated prompt drift, and workflow complexity have had time to compound.

Three patterns drive almost every "where did this bill come from" investigation we run.

Workflow complexity. Modern AI applications don't send one prompt and return one answer. RAG calls retrievers, rerankers, and generators. Agents call tools, evaluate intermediate outputs, and loop. A single user request can fan out into ten or twenty model calls before a final answer reaches the user. An aggregate dashboard sees one feature; the invoice sees twenty calls.

Context growth. Prompts accumulate. System instructions get patched to handle new edge cases. Few-shot examples stick around after stronger models stop needing them. Retrieved chunks get pulled in with weak filtering. A prompt that used 800 input tokens at launch is using 4,000 six months later, often producing a similar answer at five times the input cost. Nobody decided to make it more expensive; it drifted.
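The arithmetic behind that drift is simple enough to sketch. The price below is a placeholder assumption, not any provider's actual rate; the point is that the ratio between the two prompt sizes does not depend on it.

```python
# Hypothetical price in USD per 1M input tokens; substitute your provider's rate.
INPUT_PRICE_PER_MTOK = 3.00


def input_cost(tokens: int, price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Estimated input cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_mtok


launch_cost = input_cost(800)        # prompt size at launch
drifted_cost = input_cost(4_000)     # prompt size six months later
growth = drifted_cost / launch_cost  # independent of the price chosen

print(f"launch:  ${launch_cost:.4f} per request")
print(f"drifted: ${drifted_cost:.4f} per request")
print(f"growth:  {growth:.1f}x on input tokens")
```

Output tokens are left out on purpose: prompt drift mostly inflates the input side, so that is where the multiplier shows up.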

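Retries and failures. Tool calls fail. JSON outputs break. Schemas change. Each retry usually resends the full prompt, so a workflow with a 10% retry rate costs roughly 10% more than the headline numbers suggest, and more when a failed call is retried two or three times. That overhead is rarely visible at the model-call level, because each retry looks like an ordinary, separate call.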
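A rough way to estimate the retry overhead, assuming failures are independent and every retry resends the full prompt (both are simplifying assumptions, and the $0.02 figure is only illustrative):

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of calls per request when failed attempts are retried.

    Attempt k (0-based) only happens if the previous k attempts all failed,
    which has probability failure_rate ** k under the independence assumption.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


headline_cost = 0.02  # USD per request as seen on a single model call (illustrative)
for retries in (1, 3):
    multiplier = expected_attempts(0.10, retries)
    print(f"max_retries={retries}: {multiplier:.3f}x, ${headline_cost * multiplier:.4f} per request")
```

With a 10% failure rate the multiplier stays close to 1.1 however many retries are allowed; it climbs quickly once the failure rate itself rises, which is what happens during a schema change.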

Once these stack, a request that looks like a $0.02 model call routinely becomes a $0.30 workflow — and you don't know that until you can see the trace.

Why aggregate dashboards aren't enough

A daily spend chart tells you costs are rising. It doesn't tell you which prompt, retrieval step, or retry pattern is responsible. That gap between "we have a problem" and "we know how to fix it" is the gap that trace-level observability closes.

Capability	Dashboard-only	Trace-level observability
Per-call cost	Aggregated by model or by day	Itemized on every span
Tool-call cost	Hidden inside totals	Visible per tool invocation
Context bloat	Not surfaced	Token-scaled timelines
Retry visibility	Counted as separate calls	Grouped under the parent trace
Root cause	Manual log review	Starts from the expensive span

Dashboards are necessary — they're how you notice spend is rising in the first place. But they're a starting signal, not a debugging tool. Lowering cost in a measurable way requires per-span evidence.

Use tracing as the foundation

Trace-level cost observability treats every LLM call, tool call, and retrieval step as an instrumented span, with token counts and estimated cost attached at that level. Costs roll up through the trace tree, so total workflow cost and individual-step cost are visible in the same view.

Three capabilities make this useful in practice:

Per-span cost attribution. Every model call, tool invocation, and retrieval step has its own token usage and estimated cost. No aggregation, no averaging.
Cost rollups. Child-span cost propagates to the parent, so the full cost of a multi-step workflow is visible at the top of the trace tree without rerunning anything.
Cost-scaled timelines. Spans are visually sized by tokens or estimated cost, so a 12,000-token retrieval step is obviously bigger than a 400-token classifier call. Eyeballing replaces grepping.
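The first two capabilities can be sketched with a small data structure. This is a minimal illustration, not the data model of any particular platform: the `Span` class, the model names, and the prices are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical prices in USD per 1M tokens; real rates vary by provider and model.
PRICES = {
    "big-model": {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.25},
}


@dataclass
class Span:
    name: str
    model: Optional[str] = None  # None for spans that make no model call
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def own_cost(self) -> float:
        """Per-span attribution: cost of this span's own model call, if any."""
        if self.model is None:
            return 0.0
        price = PRICES[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000

    def total_cost(self) -> float:
        """Rollup: this span's cost plus the cost of everything beneath it."""
        return self.own_cost() + sum(child.total_cost() for child in self.children)


root = Span("answer_question", children=[
    Span("classify", "small-model", input_tokens=400, output_tokens=10),
    Span("retrieve"),  # no model call of its own
    Span("generate", "big-model", input_tokens=12_000, output_tokens=500),
])

for child in root.children:
    print(f"{child.name:<10} ${child.own_cost():.6f}")
print(f"{'total':<10} ${root.total_cost():.6f}")
```

Note that the retrieval span costs nothing by itself here; its 12,000 tokens are billed on the `generate` span that consumes them.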

Most modern LLM observability platforms cover the first two. Cost-scaled timelines are less universal — they're the difference between "I can compute which step was expensive" and "the expensive step is screaming at me from the trace view." Braintrust ships all three with cost attached to every span automatically. Langfuse ships the strongest tracing in the open-source category — drilling into a multi-step agent trace there feels closer to a real APM than what most eval-first tools deliver. The full breakdown of which tool delivers which capability is in our LLM monitoring rankings.

Use trace trees to find expensive spans fast

A typical investigation starts by sorting production traces by total cost, opening the top of the list, and looking for child spans that are dramatically larger than their siblings. A span that costs ten times as much as the nearby steps is almost always the right place to start.

The shape of a good trace view:

The parent span shows the rolled-up total — full workflow cost in one number.
Child spans show their individual cost, with the largest contributors obvious from sorting or visual sizing.
You can drill down without leaving the trace — child spans expand inline, and the model call that's consuming the most tokens is one or two clicks away from the top of the trace.

This turns cost reduction from "guess and check" into a targeted investigation. The expensive span is visible before you change any prompts, models, or code, which means every fix is informed by evidence instead of intuition.
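If your tool can export per-span costs, the same investigation can be scripted. The sketch below assumes a hypothetical export shape (trace id mapped to child-span costs in USD) and uses the ten-times rule of thumb from above, measured against the median sibling.

```python
from statistics import median
from typing import Dict, List


def outlier_children(children: Dict[str, float], ratio: float = 10.0) -> List[str]:
    """Names of child spans costing at least `ratio` times the median sibling."""
    flagged = []
    for name, cost in children.items():
        siblings = [c for other, c in children.items() if other != name]
        if not siblings:
            continue  # a lone child has nothing to be compared against
        baseline = median(siblings)
        if baseline > 0 and cost >= ratio * baseline:
            flagged.append(name)
    return flagged


# Hypothetical export: trace id -> {child span name: estimated cost in USD}
traces = {
    "trace-a": {"classify": 0.0002, "retrieve": 0.0004, "generate": 0.0300},
    "trace-b": {"classify": 0.0002, "retrieve": 0.0003, "generate": 0.0020},
}

# Sort by total cost, most expensive first, then inspect the top of the list.
ranked = sorted(traces, key=lambda t: sum(traces[t].values()), reverse=True)
for trace_id in ranked:
    print(trace_id, outlier_children(traces[trace_id]))
```

The median is used instead of the mean so that the expensive span does not drag its own baseline upward.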

Inspect individual tool calls

Agent and RAG systems hide most of their cost inside tool calls. The usual suspects:

A retrieval tool that pulls 12,000 tokens of context when 800 would answer the question.
An agent that loops three times on a failed function call before giving up.
A summarization step that runs on a full document when only one section is relevant.
A reranker called twice in a single request because two upstream nodes ask for the same thing.

Patterns like these can double or triple an LLM bill without ever showing up on a top-level dashboard. With retrieval bloat in particular, the cost is paid by the next model call that consumes the bloated context, not by the tool call itself. Per-tool-call cost attribution makes this visible: every tool call is its own inspectable span with inputs, outputs, token count, and estimated cost. Once you can see a single tool call costing fifteen times its sibling that does the same job, the fix is usually mechanical: a narrower query, a tighter retrieval window, or a guardrail on the loop.
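One of these patterns, the same tool called twice with the same arguments inside one request, is easy to check for once tool calls are recorded as structured data. A minimal sketch, assuming each call is logged as a dict with hypothetical `tool` and `args` keys:

```python
import json
from collections import Counter


def duplicate_tool_calls(calls):
    """Return (tool, count) for tool calls repeated with identical arguments."""
    # Serialize args with sorted keys so key order does not hide a duplicate.
    seen = Counter(
        (call["tool"], json.dumps(call["args"], sort_keys=True)) for call in calls
    )
    return [(tool, count) for (tool, _), count in seen.items() if count > 1]


# Hypothetical tool-call log for a single request.
calls = [
    {"tool": "search", "args": {"query": "refund policy", "top_k": 5}},
    {"tool": "rerank", "args": {"query": "refund policy", "doc_ids": [3, 7, 9]}},
    {"tool": "rerank", "args": {"doc_ids": [3, 7, 9], "query": "refund policy"}},
]

print(duplicate_tool_calls(calls))
```

A hit here usually points to two upstream nodes asking for the same thing, which a per-request cache can often resolve.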

This is the area where Langfuse's tracing earns its reputation: agent runs render cleanly even with deeply nested spans, and tool calls are first-class in the UI rather than buried under a generic "model call" span.

Use timeline views to spot context bloat

The trace tree tells you which span is expensive. A token-scaled timeline tells you why without opening logs. Each span is sized by tokens or estimated cost, so an oversized system prompt, a runaway retrieval, or a retry loop becomes visible at a glance.
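The idea is simple enough to approximate in plain text. The sketch below is not how any platform renders its timeline; it only shows why scaling by tokens makes the heavy span hard to miss. Span names and token counts are made up.

```python
def render_timeline(spans, width=40):
    """Return one text row per span, with bar length proportional to tokens.

    `spans` is a list of (name, tokens) tuples in execution order.
    """
    if not spans:
        return []
    largest = max(tokens for _, tokens in spans)
    rows = []
    for name, tokens in spans:
        # Every span gets at least one character so small steps stay visible.
        bar = "#" * max(1, round(tokens / largest * width))
        rows.append(f"{name:<14} {bar} {tokens}")
    return rows


spans = [
    ("system_prompt", 4_000),
    ("classify", 400),
    ("retrieve", 12_000),
    ("generate", 900),
]
print("\n".join(render_timeline(spans)))
```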

A few patterns timeline views surface immediately:

A system prompt that grew to 4,000 tokens over months of patching shows up as an oversized block in every trace.
A retrieval step pulling 20 chunks when 5 would do appears as a heavy span surrounded by smaller siblings.
A retry loop — three consecutive identical-shaped heavy spans — is a visual signature you stop missing after seeing it once.
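The retry signature in the last item can also be detected programmatically. This is a heuristic sketch under stated assumptions: spans arrive as (name, input_tokens) tuples in execution order, and "identical-shaped" is taken to mean the same span name with input size within 5% of the first span in the run.

```python
def find_retry_runs(spans, min_run=3, tolerance=0.05):
    """Find runs of consecutive spans with the same name and similar input size.

    Returns a list of (name, run_length) for runs of at least `min_run` spans.
    """
    runs = []
    start = 0
    for i in range(1, len(spans) + 1):
        same_shape = (
            i < len(spans)
            and spans[i][0] == spans[start][0]
            and abs(spans[i][1] - spans[start][1]) <= tolerance * spans[start][1]
        )
        if not same_shape:
            # The run that began at `start` ends just before index i.
            if i - start >= min_run:
                runs.append((spans[start][0], i - start))
            start = i
    return runs


# Hypothetical trace: the extraction step ran three times with the same prompt.
trace = [
    ("plan", 500),
    ("extract_json", 3_000),
    ("extract_json", 3_010),
    ("extract_json", 2_990),
    ("respond", 800),
]
print(find_retry_runs(trace))
```

The tolerance allows for retries that append a short error message to the prompt; tighten or loosen it to match how your retry logic rebuilds the request.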

For most teams, this is the first view worth opening when an LLM bill spikes. It collapses what would otherwise be hours of log spelunking into seconds of pattern recognition.

Experiment with prompts to cut input tokens

Once you know which span is expensive, the next step is to test whether a shorter or cleaner prompt delivers the same quality at lower cost. The standard workflow:

Pull a real production trace from the expensive span.
Duplicate the prompt into a playground.
Edit a variant — remove a few-shot example, restructure the instruction block, move reference material into retrieval, trim redundant phrasing.
Run both versions against the same inputs and compare quality scores, token counts, and estimated cost side by side.
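The comparison in the last step reduces to two numbers: how many input tokens the variant saves, and whether the quality score held. A minimal sketch, with made-up averages and an allowed score drop of 0.01 chosen purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    name: str
    avg_input_tokens: float
    avg_score: float  # from whatever scorer you already trust, on a 0-1 scale


def compare(baseline: VariantResult, variant: VariantResult,
            max_score_drop: float = 0.01) -> dict:
    """Summarize token savings and whether quality stayed within the allowed drop."""
    savings = 1 - variant.avg_input_tokens / baseline.avg_input_tokens
    drop = baseline.avg_score - variant.avg_score
    return {
        "token_savings": savings,
        "score_drop": drop,
        "quality_held": drop <= max_score_drop,
    }


# Hypothetical results from running both prompts against the same inputs.
baseline = VariantResult("current prompt", avg_input_tokens=4_000, avg_score=0.90)
trimmed = VariantResult("two few-shot examples removed", avg_input_tokens=2_800, avg_score=0.895)

result = compare(baseline, trimmed)
print(f"input tokens saved: {result['token_savings']:.0%}")
print(f"quality held: {result['quality_held']}")
```

The threshold is a judgment call per step; a routing prompt may tolerate a larger drop than a customer-facing answer.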

Prompt experiments are usually the highest-ROI optimization because they carry the least risk. You're not changing models, you're not changing system behavior — you're shipping a tighter version of the prompt you already trust. In our experience this delivers 20–40% input-token savings on mature applications without any model change.

The right tool here is whichever one ties the playground to the eval system you already use. Braintrust's Playground is the cleanest of these — production traces feed directly into prompt variants, every variant runs against the same scorer, and PMs can iterate without filing an engineering ticket. Langfuse's Playground is functionally similar with the open-source guarantee that comes with the rest of the product.

Experiment with models per step

When prompt edits stop producing meaningful savings, model selection is usually the next lever. Many production workflows use the same frontier model on every step, even when half the steps are extraction, classification, or formatting that don't need frontier reasoning.

A model experiment looks like:

Build a dataset from real production inputs for the step under test.
Define a scorer that reflects what "good output" means for that step.
Run the same dataset across multiple candidate models.
Compare cost per request alongside the quality score for each model.
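In code, the four steps above amount to a loop over models and examples. The sketch below uses stub functions in place of real model calls so it runs without any API; the stub behavior, costs, and the exact-match scorer are all assumptions made for illustration.

```python
def run_experiment(dataset, models, scorer):
    """Run every example through every model and average cost and score.

    `models` maps a name to a callable that returns (output, cost_usd).
    """
    report = {}
    for name, call in models.items():
        scores, costs = [], []
        for example in dataset:
            output, cost = call(example["input"])
            scores.append(scorer(output, example["expected"]))
            costs.append(cost)
        report[name] = {
            "avg_cost": sum(costs) / len(costs),
            "avg_score": sum(scores) / len(scores),
        }
    return report


# Stub "models": stand-ins for real API calls, with made-up per-request costs.
def big_model(text):
    return ("positive" if "good" in text else "negative"), 0.024


def small_model(text):
    return "positive", 0.001  # deliberately naive: always the same answer


def exact_match(output, expected):
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "good service", "expected": "positive"},
    {"input": "bad service", "expected": "negative"},
]

report = run_experiment(dataset, {"big": big_model, "small": small_model}, exact_match)
for name, row in report.items():
    print(f"{name:<6} avg_cost=${row['avg_cost']:.3f} avg_score={row['avg_score']:.2f}")
```

In a real run the dataset would come from production inputs for the step under test, and the scorer would reflect what "good output" means for that step.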

What you usually find:

Model class	Avg cost per request	Quality score	Verdict
Frontier model	$0.024	0.91	Keep for complex reasoning steps
Mid-tier model	$0.006	0.88	Good enough for most user queries
Small open / fine-tuned model	$0.001	0.72	Use for routing and classification only

The right answer is almost never "use the cheap model everywhere." It's "use the cheap model where the cheap model is good enough." That's a per-step decision, and trace-level visibility is what makes it tractable. Braintrust and Langfuse both support this kind of multi-model experiment natively, with cost and quality reported side by side. For high-volume online evaluation (scoring live traffic instead of a sample), Galileo is worth a look because its Luna-2 evaluators are cheap enough to score every request rather than a statistical sample.
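The per-step rule can be written down as "cheapest model that clears the quality bar for this step." The sketch below reuses the illustrative numbers from the table above for every step and varies only the bar, which is a simplification: in practice each step gets its own dataset and its own scores. The step names and thresholds are hypothetical.

```python
# (model class, avg cost per request in USD, quality score) from the table above
CANDIDATES = [
    ("frontier", 0.024, 0.91),
    ("mid-tier", 0.006, 0.88),
    ("small", 0.001, 0.72),
]

# Hypothetical minimum quality score each step must reach.
STEP_THRESHOLDS = {
    "routing": 0.70,
    "answer": 0.85,
    "complex_reasoning": 0.90,
}


def cheapest_meeting(threshold, candidates=CANDIDATES):
    """Name of the cheapest candidate whose score meets the threshold, or None."""
    eligible = [c for c in candidates if c[2] >= threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c[1])[0]


for step, threshold in STEP_THRESHOLDS.items():
    print(f"{step:<18} needs >= {threshold:.2f}: {cheapest_meeting(threshold)}")
```

A `None` result is a useful signal in itself: no candidate is good enough, so the step needs a better prompt or a different design, not a cheaper model.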

Gate every cost-cutting change with evals

This is the part most teams skip and regret. Every "shorter prompt" is also a "prompt with one less instruction" — possibly the one that handled your hardest edge case. Every "cheaper model" performs well on the examples you remember and possibly badly on the cases you didn't think to test. Cost reduction without evals is unmeasured risk.

The pattern that works:

Observability identifies an expensive trace.
Convert the trace into a reusable eval case.
Test the cheaper variant (prompt, model, workflow change) against the eval suite.
Ship only if quality holds.
Keep the eval case in the suite so future regressions are caught automatically.

CI-gated evals turn this into one workflow: a PR runs the eval suite, posts results, and blocks merge if quality scores drop below threshold. Cost changes go through the same engineering discipline as any other production change. The two tools we'd reach for here are Braintrust and Langfuse. Braintrust's native GitHub Action posts eval results as PR comments and fails CI on regression, which is the part of the product with the most leverage in production. Langfuse's open-source eval runner integrates cleanly with whatever CI you already run. Promptfoo is also worth mentioning if you want config-as-code evals living in the repo next to the prompts.
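Stripped of any particular product, the gate itself is a comparison between two sets of scores and an exit code. The following is a generic sketch, not the Braintrust GitHub Action or the Langfuse runner; the case names, scores, and the 0.02 allowed drop are assumptions.

```python
def gate(baseline, candidate, max_drop=0.02):
    """Compare per-case scores; return a list of failures (empty means pass).

    `baseline` and `candidate` map eval case ids to quality scores.
    """
    failures = []
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            failures.append(f"{case_id}: missing from candidate run")
        elif base_score - new_score > max_drop:
            failures.append(f"{case_id}: {base_score:.2f} -> {new_score:.2f}")
    return failures


def exit_code(baseline, candidate):
    """Print regressions and return the process exit code CI should use."""
    failures = gate(baseline, candidate)
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


# Hypothetical scores from the main branch and from the PR under review.
main_scores = {"refund-edge-case": 0.90, "long-document": 0.80}
pr_scores = {"refund-edge-case": 0.91, "long-document": 0.70}

print("exit code:", exit_code(main_scores, pr_scores))
# In a CI job the last step would be something like: sys.exit(exit_code(...))
```

Gating per case rather than on the average matters here: a cheaper variant can raise the mean while breaking the one edge case the removed instruction existed for.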

The discipline matters more than the tool. Without the gate, you're not optimizing; you're cutting scope without measuring what was lost.

Close the loop: production findings become permanent eval cases

LLM cost optimization isn't a one-shot project. Traffic shifts, prompts grow, new features add new spans, and a workflow that's efficient today can be expensive in three months. The teams that keep their LLM bill in check long-term treat cost optimization as a continuous loop, not a quarterly cleanup.

The pattern:

Every costly or broken trace that gets investigated becomes a permanent eval case.
Over time, the eval suite accumulates as a record of every cost and quality issue already addressed.
Future prompt and model changes run against that record before they ship.
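Whatever tool does the conversion, the result is a small record that pairs a production input with an output someone has reviewed. A minimal sketch of that shape, with hypothetical field names (`trace_id`, `approved_output`) that are not taken from any platform's schema:

```python
def trace_to_eval_case(trace: dict, reason: str) -> dict:
    """Keep only what a regression test needs from an investigated trace."""
    return {
        "id": trace["trace_id"],
        "input": trace["input"],
        "expected": trace["approved_output"],  # output a reviewer signed off on
        "tags": ["from-production", reason],
    }


def add_to_suite(suite: list, case: dict) -> bool:
    """Append the case unless one with the same id is already in the suite."""
    if any(existing["id"] == case["id"] for existing in suite):
        return False
    suite.append(case)
    return True


# Hypothetical trace pulled from an investigation into a cost spike.
trace = {
    "trace_id": "tr_0142",
    "input": "Summarize section 3 of the attached contract.",
    "approved_output": "Section 3 covers termination and notice periods.",
    "total_cost_usd": 0.31,
}

suite = []
case = trace_to_eval_case(trace, reason="retrieval-bloat")
print(add_to_suite(suite, case), len(suite))
print(add_to_suite(suite, case), len(suite))  # second add is a no-op
```

Tagging each case with the reason it was added keeps the suite readable as a record of past incidents.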

Both Braintrust and Langfuse support one-click conversion of production traces into eval cases. This is where the workflow compounds — every fix you ship comes with a regression test attached, so savings don't quietly evaporate three deploys later.

The supporting infrastructure matters at scale. Querying large volumes of trace data, generating eval cases programmatically, and running evals from the terminal are all things you'll do more as the workflow matures. Braintrust ships Brainstore (a database optimized for trace queries), Loop (an AI assistant that generates eval datasets and scorers from logs in plain English), and the bt CLI for terminal workflows. Langfuse's open-source stack lets you wire equivalent capability into your own pipelines — slower to get to feature parity, but the source is yours.

The mental model

The teams that lower their LLM bill without breaking quality treat cost as a debuggable property of the system, not a line on an invoice.

The bill is a signal, not a target. The target is "the workflow does what it should at the lowest cost that maintains quality."
The trace tree is the unit of analysis. Aggregate stats are useful for noticing problems; traces are useful for fixing them.
Every cost-reduction change is gated by evals. Without that gate, a cheaper workflow is an unmeasured cut in scope, not an optimization.

The teams that struggle to lower their bill almost always have a tooling problem on the surface, but the deeper issue is that they don't yet treat cost as engineering. Once you do, the workflow takes care of itself — and the tools (Braintrust, Langfuse, or whatever fits your stack) become accelerants on a process you already trust.

FAQs

What's the fastest way to find where my LLM cost is coming from?

Open the most expensive production traces in whatever observability tool you use and look for child spans that are much larger than their siblings. A span that costs ten times the nearby steps is almost always the right place to start. If you don't have trace-level cost attribution yet, that's the first capability to add — aggregate dashboards will tell you spend is up but not which prompt, tool call, or model decision caused it. Braintrust and Langfuse both ship this out of the box.

Can I just switch to a cheaper model and call it done?

Usually not. Prompt length, context bloat, retry loops, and inefficient tool calls often contribute as much cost as model choice — sometimes more. And a blanket model swap is exactly the change most likely to cause silent quality regressions, because the cheaper model performs well on the easy cases and badly on the ones you didn't remember to test. Test per-step model swaps against a real eval suite before you ship.

Is open source enough, or do I need a paid tool?

Depends on the team. Langfuse is open source and self-hostable, and the tracing is genuinely best-in-class. For compliance-sensitive or cost-sensitive teams it's the right answer. Braintrust is closed source with a generous free tier, and the eval-driven workflow (especially the GitHub Action and the Loop assistant) is more polished than anything OSS currently delivers. Either tool will get you the cost-reduction workflow described here; the choice depends on whether self-hosting and source access matter to you.

How do I keep cost savings from quietly disappearing over time?

Convert every cost-related investigation into a permanent eval case, and gate future PRs on that eval suite. Without the gate, future prompt edits and feature additions will erode the savings — usually within a few deploys. The mechanism is the same in Braintrust, Langfuse, and Promptfoo: turn a production trace into a test case, add it to the suite, and have CI fail the merge if quality drops on that case.

How does this work for agents specifically?

Agent observability is where trace-level cost analysis pays off the most, because a single agent run can fan out into dozens of model calls and tool invocations, and any one of them can be the runaway expensive step. The structural fixes (per-tool-call inspection, retry detection, loop turn limits) all live in the trace view. Our list of the best agent observability tools covers which tools render agent traces best.
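A loop turn limit is the simplest of those fixes to put in code. The sketch below assumes a hypothetical `step` callable that runs one agent turn and returns whether the agent is done plus the estimated cost of that turn; the limits are illustrative defaults, not recommendations.

```python
def run_agent(step, max_turns=8, max_cost_usd=0.50):
    """Run agent turns until done, or stop at the turn limit or cost limit.

    `step(turn)` is assumed to return (done: bool, cost_usd: float).
    """
    spent = 0.0
    for turn in range(1, max_turns + 1):
        done, cost = step(turn)
        spent += cost
        if done:
            return {"turns": turn, "cost": spent, "stopped": "done"}
        if spent >= max_cost_usd:
            return {"turns": turn, "cost": spent, "stopped": "cost_limit"}
    return {"turns": max_turns, "cost": spent, "stopped": "turn_limit"}


# Stub steps standing in for a real agent turn.
def finishes_on_second_turn(turn):
    return turn == 2, 0.01


def never_finishes_cheap(turn):
    return False, 0.01


def never_finishes_expensive(turn):
    return False, 0.20


for step in (finishes_on_second_turn, never_finishes_cheap, never_finishes_expensive):
    print(step.__name__, run_agent(step)["stopped"])
```

Recording the `stopped` reason on the trace makes runaway loops searchable later, instead of something you only find by opening individual runs.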

Is this realistic for a small team without dedicated ML infra people?

Yes — that's the whole point of the platform tools. The free tiers on Braintrust and Langfuse are enough to get a complete cost-reduction loop running without committing to a paid plan. Most of the leverage comes from the workflow (trace → identify expensive span → test variant → gate with evals), not the tool's bells and whistles. A small team running that loop will outperform a larger team running ad-hoc cost cleanups twice a year.