Context Window Budgeting for AI Agents
On a long agent loop, most of your token bill is orientation, not work. Here is how to find it and cut it: cache the stable prefix, cap tool results at the source, and reach for compaction last.
Every agent you build has a hidden line item: how many tokens you spend keeping the model oriented versus how many you spend doing the actual work. On a one-shot call nobody notices. On a loop that runs forty turns, the orientation cost is most of your bill, and it is the first thing that quietly breaks an agent that worked fine in the demo.
This is the context window budget problem, and most of the agents I see treat it as an afterthought. They shove the whole transcript back in on every turn, watch latency climb, and blame the model. The model is fine. The budget is the bug.
What actually fills the window
The Messages API is stateless. There is no server-side conversation; you resend the entire history on every request. So on turn N, you are paying to re-process turns 1 through N minus 1, every time. For a chat that is trivial. For a tool-using agent it is not: each turn adds an assistant message plus one or more tool_result blocks, and tool results are where the tokens hide.
A single grep over a large file, a verbose API response, a stack trace: any of these can dump thousands of tokens into the transcript, and those tokens ride along on every subsequent turn until the conversation ends. I have watched a "small" agent loop carry a 40KB JSON blob through fifteen turns because nobody truncated the tool output before it went back into messages. That is the blob's token count multiplied by the number of turns after it landed. The fix was four lines: cap the tool_result content, keep a pointer, let the model ask for more if it needs it.
So the budget breaks into three buckets, roughly in this order of size for a long agent run:
- Tool results. Usually the biggest and the most compressible. You control what goes back in.
- The system prompt and tool definitions. Fixed per turn, but they ride on every single request, so a bloated system prompt is a tax you pay N times.
- The reasoning and the actual answers. The work. This is the part you do not want to starve.
The cheapest win: cache the stable prefix
Prompt caching is a prefix match. The cache key is the exact bytes of the rendered prompt up to a cache_control breakpoint, and the render order is tools, then system, then messages. Put a breakpoint on the last system block and you cache your tools and system prompt together. On a long agent loop where that prefix is identical every turn, cached reads are billed at a fraction of the normal input rate, so the savings on the fixed portion are large. Check Anthropic's current pricing for the exact multiplier before you model it; I am not going to quote a number that may be stale by the time you read this.
The catch is the word exact. One byte changes anywhere in the prefix and everything after it misses. The classic ways to wreck your own cache:
- A
datetime.now()or a request ID interpolated into the system prompt. Now every request has a unique prefix and nothing caches. json.dumps()on your tool schema withoutsort_keys=True, so the byte order wobbles between requests.- Adding or reordering a tool mid-session. Tools render first, so changing the tool set invalidates the entire cache for that request.
Verify it instead of trusting it. The response usage object reports cache_read_input_tokens. If that stays at zero across requests that should share a prefix, you have a silent invalidator and you go diff the rendered bytes of two consecutive requests until you find it.
# the only check that matters
print(response.usage.cache_read_input_tokens) # > 0 means the prefix hit
print(response.usage.input_tokens) # the uncached remainder you paid full price for
When the conversation outgrows the window
Caching makes the fixed cost cheap. It does not stop the transcript from growing. Eventually a long-running agent approaches the context window limit, and you have three levers, in increasing order of how much they touch your code.
Truncate at the source. Before a tool result goes into messages, cap it. Keep the first chunk plus a note that the rest is available, or write the full payload to a file and pass back the path. This is the highest-leverage change because it stops the tokens from ever entering the loop, and it is just code you already control. Start here.
Context editing. Clear stale tool results and old thinking blocks out of the transcript as turns accumulate. It prunes rather than summarizes, so the structure of the conversation stays intact and the dropped content is just gone. Good when old tool outputs are no longer relevant and you want the transcript lean without paying a model to rewrite it.
Compaction. When you are genuinely going to exceed the window, the API can summarize earlier context server-side into a compaction block. It is a beta feature and the request shape matters: you append the full response.content back onto your messages each turn, not just the extracted text. The compaction block is what the API uses to replace the compacted history on the next request, so if you strip it down to the string you silently lose the compacted state. Check the current docs for which models support it and the exact beta header before you wire it in; that surface moves.
Truncation and context editing both work within a session. Neither persists across sessions. If you need state to survive a process restart, that is a separate problem (a memory store or your own database), not a context-budget lever, and conflating the two is how people end up reaching for compaction when a 20-line truncation would have done it.
How I'd actually spend the budget
Concretely, on an agent loop I was shipping:
- Freeze the system prompt and tool list. No timestamps, no per-request IDs in the prefix. Sort the tool JSON deterministically.
- Put a single
cache_controlbreakpoint on the last system block. Confirmcache_read_input_tokensclimbs after the first turn. - Cap every tool result at a sane size at the point where I build the
tool_resultblock. Pointer-and-fetch for anything bigger. - Only then reach for context editing, and only reach for compaction if a real workload actually approaches the window.
The order is the point. Most of the budget pain I have hit was solved by steps one through three, which are plain code and cost nothing to try. Compaction is real and useful, but it is the heavy tool, and people reach for it first because it sounds like the answer. Truncating a tool result is the answer more often than not.
Measure before you optimize and after. The usage object on every response gives you input_tokens, cache_read_input_tokens, and cache_creation_input_tokens; the real prompt size on a cached turn is the sum of all three, not the first field alone. If your agent ran for an hour and input_tokens reads small, the rest came from cache, which is exactly what you want to see. That is the whole game: keep the orientation cost cached and small, spend the live tokens on the work.