Why are enterprise LLM bills exploding?

Q: How do you reduce context bloat?

Send less on each call, and send it deliberately. The highest-leverage moves are just-in-time retrieval (pull only the data a step needs instead of dumping files or full history), pruning or lazily loading tool schemas so idle definitions don’t ride on every request, caching stable prefixes like system prompts, and splitting work into phases so stale context from a failed attempt doesn’t recur on later turns. Each attacks the recurring per-call leak rather than the one-time prompt.

Q: Does prompt caching solve re-sent context costs?

Partly. Prompt caching can cut the price of re-sent tokens by roughly 90% by charging cached reads at a fraction of normal input cost, which materially lowers the bill. It does not remove the tokens, reduce context rot, or make the window’s composition visible. Caching is a discount on bloat, not a cure for it.

Q: Is context bloat the same as unbounded usage?

No. Unbounded usage is calling the model too often or without limits; the fix is caps and budgets. Context bloat is sending too much on each call you do make; the fix is curation and governance inside the window. A spending cap can halt a runaway bill without touching the structural waste in every surviving call.

Q: What is the difference between AI FinOps and LLM observability?

LLM observability tools such as LangSmith or Arize track traces, latency, and quality—what the system did and how well. AI FinOps tracks what the system spent and why. The two overlap in tooling but answer different questions, and neither, today, attributes cost to the components inside a single context window.

At Prove AI, we spend a lot of time tracking how AI agents are used and what prevents teams from getting the most out of them. One recent thing we’ve noticed is that enterprise LLM bills are rising due to two separate mechanisms, and most teams don’t have the observability tools needed to distinguish between them, set up sensible metrics, track down associated anomalies, and handle everything else required for robust AI testing.

The first mechanism is unbounded usage: in the absence of any limits on who is building and running an automated workflow, AI spend scales directly with employee enthusiasm (which, I think you’ll agree, is not in short supply these days). The second mechanism is context bloat, which refers to the accumulation of tokens coming from the fact that the same history, tool schemas, and prior results are re-sent on every API call.

The first is the louder of the two failure modes; recent stories abound of major tech companies blowing through an entire year’s budget for AI in just a few months as humans are frantically rehired and developers hunt around for cheaper (probably open-source) generative language models. Naturally, it makes headlines because the numbers involved are eye-watering and the fix (“set limits”) is obvious.

But unbounded usage is just the failure everyone is familiar with. Underneath it, context bloat quietly inflates costs and erodes the ROI on AI investment in a way that no spending cap touches and very few are even aware of.

Mechanism	What drives it	Can finance see it?
Unbounded usage	No per-seat or per-agent caps	Yes, in aggregate spend
Context bloat	History, schemas, and results re-sent every call	No

The rest of this post will talk about how context bloat works, and what steps can be taken today to get it under control.

Why does every AI agent call get billed for the whole context window?

The “context window” is the full set of tokens a model reads when generating a response; because LLM APIs are stateless, everything in the context window is billed as input on every call, whether or not the model actually needed it for that step. Whether you’re talking about an LLM as a judge setup or a sophisticated RAG system, the model itself retains nothing between calls.

Because an agent re-sends the entire context on every turn, the bill for the conversation grows concomitantly, as the context expands.

This is why agents cost far more than chatbots for the same task. A chatbot sends a message and gets an answer. An agent runs in a loop—reasoning, using tools, pulling in relevant files, performing validation on intermediate outputs—and each step re-sends everything that came before.

Tool results are the worst offenders because they linger. In a typical agent system of the sort that powers many AI applications, once the model calls a tool, that call and its result stay in the trajectory until the task finishes. So a single verbose command—opening a large file, dumping a query result—doesn’t cost you once, it hitchhikes along in the input of every subsequent step, compounding to the end of the task.

The result is a bill dominated by constant repetition. Most of what you pay for when running an automated workflow is not the model thinking—it is the same context, re-read and re-charged, turn after turn.

The next section puts more solid numbers on exactly how much this is costing you.

How much of an LLM bill is wasted on re-sent context?

62%

of agentic AI spend went to re-sent context

in one 30-team audit — vs. 11% on the model’s actual output

The short answer is: most of it. One 30-team audit attributed 62% of agentic AI spend to re-sent context, against just 11% for the model’s actual output. This means that a huge fraction of any particular bill is structural re-sending. Though your operation may have numbers that deviate in various ways, it likely has a similar shape.

And, tokenmaxxing and vibe coding aside, the cost is high enough to measurably impact behavior: in one study, 53% of participants named the cost of using AI agents as a barrier to relying on them.

What is context engineering, and why is it the real cost lever?

Context engineering is the practice of curating which tokens enter a model’s context window on each call. It is the real cost lever because, as we’ve seen, the expensive decision is no longer how you phrase a prompt, it is what you re-send every turn.

Context engineering differs from prompt engineering in several ways, but scope is perhaps the most important. Prompt engineering optimizes the wording of instructions; context engineering manages the entire state passed to the model—system prompt, tools, retrieved data, and message history—across many turns.

A model has a finite attention budget, and every token you include draws from that limited resource; good context engineering means finding the smallest set of high-signal tokens able to produce the result you want.

This makes bloat a loss on two fronts. The tokens you over-send don’t just inflate the bill; they also degrade the answer.

The tendency of a model’s accuracy to decline as its context window grows is usually called “context rot”. More tokens stretch attention thinner, so a window padded with stale results and idle schemas is both costlier and less reliable than a curated one.

In practice, the implication is that trimming context is not a chore that trades quality for savings, it improves both simultaneously. The hard part is figuring out what to trim: knowing what was dropped, and whether dropping it changed the answer. That is the subject of the next two sections.

Why can’t finance see token costs inside a context window?

The reason finance doesn’t have visibility into a context window is that billing tools meter calls, not content. Cloud billing and provider dashboards report aggregate spend—the what, not the why—and none of them parse what sits inside a context window. AI FinOps is the practice of attributing and controlling AI infrastructure spend. Newer AI-FinOps tools attribute cost per request, team, or feature, but not to whatever is sitting inside the window.

Where cloud FinOps measures server uptime and bandwidth, AI FinOps has to measure token velocity and context-window utilization—an entirely different unit of account.

The gap has a clear shape once you separate the tool categories:

Tool category	What it sees	What it misses
Cloud billing, provider dashboards	Aggregate spend	Why you spent it; anything inside the window
AI-FinOps platforms	Cost per request, team, or feature	Cost per component inside the window
LLM observability	Traces, latency, quality signals	Cost attribution inside the window

Put another way, finance teams are blind to granular token spend because legacy billing tools cannot parse a context window; they see a lump sum and lack the dimensions to break it apart. No category answers the question this post is about:

“Of the dollars spent on a given call, how many went to the system prompt, how many to re-sent history, how many to idle schemas, how many to stale tool results?”

Per-request attribution tells you which feature was expensive. It does not tell you that 60% of that feature’s cost was history you never needed to re-send.

That blind spot is about to widen, because the platforms are now shipping features that trim the window automatically—and trimming you can’t see is harder to govern than bloat you can.

Do compaction and memory features solve context bloat?

There are now a handful of native features able to automatically trim your context window, which include:

Compaction, which summarizes a conversation nearing the window limit and restarts it from the summary.
Memory (or structured note-taking), which writes durable facts to storage outside the window and reloads them on demand.
Tool-result clearing, which drops raw tool outputs from history once they are no longer needed.

This is genuinely useful, but it also quietly changes the problem. Automatic management is a black box, with four properties that matter to anyone accountable for spend or quality:

It drops content silently. The model decides what to discard; the application receives the compacted result, not a record of what was removed.
It emits no model regression signal. When a summarization step loses a detail that matters three steps later, nothing flags it—you see a wrong answer, not a dropped token.
It is single-provider. Each platform’s trimming is its own, with no cross-provider account of what was cut or saved.
It gives finance nothing. A compaction event lowers the next call’s token count without appearing as any line item an auditor could trace.

(The second property is an inference from how these features are built, not a measured finding—but it follows directly from the fact that none of them emit a log of what they removed.)

The native feature that removes the need to build a trimmer creates the need to govern one. Before, bloat was at least visible because you could count the tokens you were re-sending. After, the tokens are gone, along with any record of what was removed and whether it should have been. You have traded a cost you could measure for a risk you cannot.

Will bigger or cheaper models fix token costs?

No. A larger context window still suffers context rot, and cheaper tokens lower the unit price without making spend visible or attributable. A model upgrade changes the size of the bill, never its legibility.

Bigger windows don’t help because the constraint is attention, not capacity. Windows of every size remain subject to pollution and relevance decay; even if a model can hold a billion tokens, that doesn’t guarantee that it’ll attend to the ones that matter. More room to be wasteful is not less waste.

Cheaper tokens don’t help because price and visibility are separate problems. Halving the cost per token halves a bill you still can’t decompose. You spend less on context you can’t see, and understand it exactly as poorly as before. These are issues to which Prove AI pays much attention.

The durable problem is the pairing of finite attention with opaque spend—not the question of how to summarize. Summarization is getting cheaper; attribution and governance inside the window are not, and no model release on the roadmap changes that.

What is FinOps for the context window?

FinOps for the context window is cost attribution and governance at the level of what goes into each model call. It is the missing layer between LLM observability, which shows traces but not cost, and cloud billing, which shows cost but not the window. Neither sees how a single call’s spend breaks down across system prompt, history, schemas, and tool results.

The category barely exists yet, which is why most teams do context engineering by feel. Closing the gap takes three capabilities that no current tool category provides together:

Line-item attribution inside the window: how many tokens, and dollars, each component costs on each call.
A regression signal for automatic trimming: an alert when compaction or clearing drops something the next step needed.
Cross-provider accounting: one ledger for what was sent, cut, and cached, regardless of which model served the call.

Now, let us return to where we started. Spend is rising from two mechanisms, and they need different instrumentation. Usage caps stop the loud disasters (the weekends where a long-running task sets you back thousands of dollars) because those are problems of how often the model is called. Context bloat is a problem of what each call contains, and no cap touches it. Only attribution and governance inside the window do.

The teams that get this right will not be the ones with the cheapest tokens or the largest windows. They will be the ones who can answer a question their billing dashboard cannot: what, exactly, did we just pay to re-send? Prove AI will be monitoring this fast-changing category as it evolves. Follow along with us for timely updates!

Frequently asked questions

How do you reduce context bloat?

Does prompt caching solve re-sent context costs?

Is context bloat the same as unbounded usage?

What is the difference between AI FinOps and LLM observability?

Prove AI is building solutions to power more correct, explainable and auditable AI outcomes.

We’re always interested in learning about AI management challenges.

Get in Touch