A single LLM call rarely looks expensive. A few thousand input tokens, a few hundred output tokens, fractions of a cent — the per-request math always feels manageable. The monthly invoice tells a different story, because production AI applications stopped looking like single LLM calls a long time ago. Tool calls compound. Retrieval steps compound. Retries compound. Agent loops compound. Context windows grow.
Most teams find out about rising LLM cost the way they find out about most production problems: a billing alert, after the cost has already accumulated for weeks. The dashboards say "spend is up." They don't say which prompt, model call, or retry pattern is responsible. That gap between "we have a problem" and "we know how to fix it" is the gap trace-level observability is supposed to close.
This post is about how to close it.
Why LLM cost compounds at scale
Three patterns drive almost every "where did this bill come from" investigation we've seen.
Workflow complexity. Modern AI applications rarely send a single prompt and return a single answer. RAG calls retrievers, rerankers, and generators. Agents call tools, evaluate intermediate outputs, and loop. A single user request can produce twenty model calls before a final answer reaches the user. The aggregate dashboard sees one feature; the bill sees twenty calls.
Context growth. Prompts accumulate over time. System instructions get patched to handle new edge cases. Few-shot examples stay in production after stronger models stop needing them. Retrieved chunks come into the prompt without enough filtering. A prompt that used 800 tokens at launch is using 4,000 tokens six months later, often producing a similar answer at roughly five times the input cost. Nobody decided to make it more expensive; it drifted.
Retries and failures. Tool calls fail. JSON outputs break. Schemas change. Each retry usually resends the full prompt, so a workflow in which 10% of calls are retried once costs roughly 10% more than the headline numbers suggest, and more when the retries themselves fail. In aggregate usage a retry looks like any other model call, so without trace-level visibility that overhead stays hidden.
The compounding becomes obvious once you look at a single trace. A request that looks like a $0.02 model call frequently turns into a $0.30+ workflow once long context, additional LLM calls, tool calls, and retries are summed across the trace. The hidden multiplier is real, and it's why aggregate dashboards aren't enough by themselves.
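To make the multiplier concrete, here is a minimal sketch of the arithmetic. Everything in it is an assumption for illustration: the per-token prices, the step names, and the token counts are invented, and it treats every retry attempt as resending the full prompt and producing a full output.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass(frozen=True)
class Call:
    name: str
    input_tokens: int
    output_tokens: int
    attempts: int = 1  # 1 means no retries; each attempt resends the full prompt


def call_cost(call: Call) -> float:
    """Estimated USD cost of one step, counting every attempt."""
    per_attempt = (
        call.input_tokens * INPUT_PRICE_PER_M
        + call.output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return per_attempt * call.attempts


def workflow_cost(calls: list[Call]) -> float:
    """Estimated USD cost of a whole trace: the sum over its steps."""
    return sum(call_cost(c) for c in calls)


# What the per-request math assumes: a single model call.
headline = [Call("generate", 4_000, 500)]

# What one trace might actually contain (all numbers invented).
actual = [
    Call("classify_intent", 1_200, 50),
    Call("rerank", 6_000, 100),
    Call("tool_call_json", 5_000, 300, attempts=3),  # two failed parses, then success
    Call("generate", 30_000, 1_500),                 # long retrieved context
    Call("self_check", 34_000, 400),
]

print(f"headline: ${workflow_cost(headline):.4f}")
print(f"actual:   ${workflow_cost(actual):.4f}")
```

Under these assumed numbers the single call costs about two cents, while the full trace costs about thirty. Most of that comes from the two long-context steps and the retried tool call.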
Why dashboards alone don't get you there
Aggregate dashboards report spend after it has already accumulated. Total spend is up, token usage is rising, the most expensive model is the one used most often — those signals are real, but they don't tell you which prompt, retrieval step, tool call, or retry pattern needs to change.
| | Dashboard-only | Trace-level observability |
|---|---|---|
| Per-call cost | Aggregated by model or day | Itemized on every span |
| Tool-call cost | Hidden inside totals | Visible on each span |
| Context bloat | Not surfaced | Token-scaled timelines |
| Retry visibility | Counted as separate calls | Grouped under parent trace |
| Root cause | Manual log review | Starts from the expensive span |
Dashboards are necessary — they tell you spend is rising in the first place — but they're a starting signal, not a debugging tool. Lowering cost requires more granular evidence.
What trace-level cost observability actually shows you
Trace-level cost observability treats every LLM call, tool call, and workflow step as an instrumented span with token counts and estimated cost attached at that level. Cost rolls up through the trace tree so you see total workflow cost and individual-step cost in the same view.
Three capabilities make this useful in practice:
- Per-span cost attribution. Every model call, tool invocation, and retrieval step has its own token usage and estimated cost. No aggregation.
- Cost rollups. Child-span cost propagates to parent traces automatically, so a multi-step workflow's full cost is visible at the top of the trace tree.
- Cost-scaled timelines. Spans visually sized by tokens or cost, so a 12,000-token retrieval step is obviously bigger than a 400-token classifier call. Eyeballing a timeline replaces grepping logs.
Most modern LLM observability platforms cover the first two. Cost-scaled timelines are less universal — they're the difference between "I can compute which step was expensive" and "the expensive step is screaming at me from the trace view." The LLM monitoring listicle covers which tools deliver each capability.
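The three capabilities can be sketched with a small span tree. This is an illustrative data model, not any particular platform's API: the `Span` class, its field names, and the text rendering are all hypothetical, and the costs are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    own_cost: float = 0.0  # estimated USD for this span's own model call, if any
    children: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        """Rollup: this span's own cost plus the cost of everything beneath it."""
        return self.own_cost + sum(child.total_cost() for child in self.children)


def render(span: Span, trace_total: float, depth: int = 0, width: int = 40) -> list[str]:
    """Text stand-in for a cost-scaled timeline: bar length tracks share of trace cost."""
    share = span.total_cost() / trace_total if trace_total else 0.0
    bar = "#" * max(1, round(share * width))
    label = "  " * depth + span.name
    lines = [f"{label:<24} ${span.total_cost():.4f} {bar}"]
    for child in span.children:
        lines.extend(render(child, trace_total, depth + 1, width))
    return lines


trace = Span(
    "answer_question",
    children=[
        Span("rerank", own_cost=0.0195),
        Span(
            "tool_call",
            children=[
                Span("attempt_1", own_cost=0.0195),  # failed, retried
                Span("attempt_2", own_cost=0.0195),
            ],
        ),
        Span("generate", own_cost=0.1125),
    ],
)

print("\n".join(render(trace, trace.total_cost())))
```

Each leaf carries its own cost (attribution), parents report the sum of their subtree (rollup), and the bar lengths make the `generate` span stand out without reading any numbers.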
The cost-reduction workflow
Once you can see where cost is happening, three optimization moves cover most of the wins.
1. Prompt optimization
The most common prompt-cost problems are mechanical. Redundant system instructions accumulated over months of patching edge cases. Few-shot examples that newer models no longer need. Retrieved chunks injected into the prompt without filtering for relevance.
Test whether a shorter, cleaner, or better-structured prompt preserves output quality at lower input-token cost. A side-by-side playground (most monitoring and eval platforms ship one) lets you compare variants against the same inputs with token counts, estimated cost, and quality scores inline.
In our experience this delivers 20–40% input-token savings on mature applications without any model change. Worth doing first — it's the cheapest optimization with the lowest risk.
2. Model selection per step
Many production workflows use the same frontier model on every step, even when half the steps are extraction, classification, or formatting that don't need frontier reasoning. Trace-level cost data tells you which steps are doing the expensive work and which aren't.
Once each step has a known cost and a known quality bar, test cheaper models on the cheaper-task steps. Run the same eval across multiple models in a single experiment, see quality scores and estimated cost side by side, and route each step to the cheapest model that meets the bar.
The right answer is rarely "use the cheap model everywhere" — it's "use the cheap model where the cheap model is good enough." That's a per-step decision, and trace-level visibility is what makes it tractable.
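A minimal sketch of that per-step decision, assuming you already have an eval score and an estimated cost for each step and model pair. The model labels, scores, costs, and quality bars below are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    step: str             # workflow step name
    model: str            # hypothetical model label
    quality: float        # eval score in [0, 1]
    cost_per_call: float  # estimated USD


def route(results: list[EvalResult], quality_bar: dict[str, float]) -> dict[str, str]:
    """For each step, pick the cheapest model whose eval score meets that step's bar."""
    routing: dict[str, str] = {}
    for step, bar in quality_bar.items():
        passing = [r for r in results if r.step == step and r.quality >= bar]
        if not passing:
            raise ValueError(f"no model meets the quality bar for step {step!r}")
        routing[step] = min(passing, key=lambda r: r.cost_per_call).model
    return routing


results = [
    EvalResult("classify", "small-model", quality=0.97, cost_per_call=0.0004),
    EvalResult("classify", "frontier-model", quality=0.98, cost_per_call=0.0060),
    EvalResult("generate", "small-model", quality=0.81, cost_per_call=0.0020),
    EvalResult("generate", "frontier-model", quality=0.93, cost_per_call=0.0300),
]
quality_bar = {"classify": 0.95, "generate": 0.90}

print(route(results, quality_bar))
```

With these numbers the classifier moves to the small model and generation stays on the frontier model, because the small model misses the generation bar. Raising an error when nothing passes is a deliberate choice here: a step with no qualifying model should be a visible failure, not a silent fallback.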
3. Workflow fixes
Some cost isn't a prompt or model problem; it's an architecture problem. Common ones:
- Retrieval bloat. A retrieval step pulls 20 chunks when 5 would do. The cost lives in the prompt that consumes them, not the retrieval.
- Silent retries. Schema mismatches or malformed tool outputs trigger retries that resend full prompts. Fixing the schema is cheaper than absorbing the retry cost.
- Agent loops. Each iteration grows context. By the fourth iteration the cost is multiples of the first. Sometimes the fix is a hard turn limit; sometimes it's better tool design so fewer iterations are needed.
These are usually the highest-leverage wins, but they're also the most invasive — you're changing how the system works, not just what model it uses. Eval coverage matters most for these.
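The agent-loop case is worth working through numerically. The sketch below assumes a simple linear model in which every iteration resends the whole context plus a fixed amount of new tool output. Real loops grow less evenly, and the token counts here are invented.

```python
def iteration_input_tokens(base: int, growth: int, iteration: int) -> int:
    """Input tokens sent on a given iteration (1-indexed) under linear context growth."""
    return base + growth * (iteration - 1)


def loop_input_tokens(base: int, growth: int, iterations: int) -> int:
    """Total input tokens across the whole loop; grows quadratically with iterations."""
    return sum(iteration_input_tokens(base, growth, k) for k in range(1, iterations + 1))


BASE = 2_000    # assumed: system prompt plus user request
GROWTH = 3_000  # assumed: tokens appended to context per iteration

first = iteration_input_tokens(BASE, GROWTH, 1)
fourth = iteration_input_tokens(BASE, GROWTH, 4)
print(f"iteration 4 sends {fourth / first:.1f}x the input of iteration 1")

for limit in (4, 8):
    print(f"turn limit {limit}: {loop_input_tokens(BASE, GROWTH, limit):,} input tokens")
```

Under these assumptions, doubling the turn limit from 4 to 8 roughly quadruples total input tokens (26,000 versus 100,000). That is why a hard turn limit, or tools that return less text per call, can matter more than the per-token price.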
Why evals are the gate, not the suggestion
Cost reduction without evals is unmeasured risk. Every "shorter prompt" is also a "prompt with one less instruction" — possibly the one that handles your hardest edge case. Every "cheaper model" performs well on the examples you remember and possibly badly on the cases you didn't think to test.
The pattern that works in practice:
- Use observability to identify an expensive trace.
- Convert the trace into a reusable eval case.
- Test the cheaper variant (prompt, model, workflow) against the eval suite.
- Ship the change only if quality holds.
- Keep the eval case in the suite so future regressions are caught automatically.
CI-gated evals (most modern eval platforms ship native GitHub Actions or equivalent) turn this into a single workflow — a PR runs evals, posts results, and blocks merge if quality drops below threshold. Cost-saving changes go through the same engineering discipline as any other production change. That's the part that compounds: every cost reduction comes with a regression test, so the savings don't quietly evaporate three deploys later.
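The gate itself can be very small. The sketch below is a hypothetical check, not any eval platform's API: the `SuiteResult` fields, the `gate` function, and the 0.01 tolerance are all assumptions, and the scores are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    mean_quality: float  # average eval score across the suite, in [0, 1]
    mean_cost: float     # average estimated USD per eval case


def gate(
    baseline: SuiteResult,
    candidate: SuiteResult,
    max_quality_drop: float = 0.01,  # assumed tolerance; pick one that fits your suite
) -> tuple[bool, str]:
    """Allow a cost-saving change only if quality holds and cost actually falls."""
    quality_drop = baseline.mean_quality - candidate.mean_quality
    if quality_drop > max_quality_drop:
        return False, f"blocked: quality dropped by {quality_drop:.3f}"
    if candidate.mean_cost >= baseline.mean_cost:
        return False, "blocked: no cost saving"
    saving = 1 - candidate.mean_cost / baseline.mean_cost
    return True, f"ok: {saving:.0%} cheaper, quality within tolerance"


baseline = SuiteResult(mean_quality=0.920, mean_cost=0.30)
shorter_prompt = SuiteResult(mean_quality=0.915, mean_cost=0.21)
cheaper_model = SuiteResult(mean_quality=0.880, mean_cost=0.09)

print(gate(baseline, shorter_prompt))
print(gate(baseline, cheaper_model))
```

In this example the shorter prompt passes and the cheaper model is blocked despite the larger saving. In a CI job, the boolean would typically decide the process exit code, which is what blocks the merge.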
The mental model
The teams that lower their LLM bill without breaking quality treat cost as a debuggable property of the system, not a line on an invoice.
- The bill is a signal, not a target. The target is "the workflow does what it should at the lowest cost that maintains quality."
- The trace tree is the unit of analysis. Aggregate stats are useful for noticing problems; traces are useful for fixing them.
- Every cost-reduction change is gated by evals. Without that gate, you're not optimizing; you're cutting scope without measuring what was lost.
The teams that struggle to lower their bill almost always have a tooling problem, but the deeper issue is that they don't yet treat cost as an engineering problem. Once you do, the workflow above follows naturally.