A CLI coding agent feels free until the invoice arrives. You point Claude Code or Codex at a repo, ask it to refactor a module, and ten minutes later it has read forty files, run the test suite three times, and burned six figures of tokens on context you never needed it to see. Multiply that by a team of eight engineers running agents all day, and the bill stops being a rounding error. Token spend on coding agents is mostly waste, and most of that waste is fixable from the command line without changing models or accepting worse output.
TL;DR
Cut agent token costs by trimming context before it hits the model: scope the working set, keep memory files short, and compact long sessions. Turn on prompt caching for stable prefixes (about 90% off repeat reads). Route cheap subtasks to a small model. Cap tool output. Measure cost per run so you know what actually changed.
Introduction
The pain shows up two ways. Either you hit a hard wall mid-task because you blew through a weekly or session limit, or the monthly API bill lands and someone asks why an “AI assistant” costs more than a junior contractor. Both come from the same root cause: CLI agents are token-hungry by default. They read whole files when they need ten lines, replay the entire conversation on every turn, dump raw command output back into context, and re-send the same system prompt and repo map thousands of times a day.
None of that is inherent to the work. A refactor that genuinely needs to reason about 2,000 tokens of code does not need 180,000 tokens of context to do it. The gap between those two numbers is your savings, and almost all of it is recoverable with flags, config files, and habits you can adopt today.
This guide walks through where tokens actually go in a CLI agent run, then gives you concrete tactics to cut each bucket: context hygiene and memory files, prompt caching, model routing, trimming tool output and retrieval, and measuring cost per run so the savings are real and not a guess. The examples assume Claude Code and Codex, but the mechanics apply to any agent that talks to a token-billed API.
One adjacent cost worth naming early: a lot of agent token spend is debugging. An agent that calls a flaky internal API will retry, read error bodies, re-read docs, and loop, with every iteration paying full freight in tokens.
Where the tokens actually go in a CLI agent run
Before you optimize, you need a mental model of the bill. A single agent “turn” sends an input payload to the model and gets an output payload back. You pay for both, and on most providers output costs three to six times more per token than input. For one frontier model family in mid-2026, input runs around $3 per million tokens and output around $15; a cheaper model in the same family is roughly $1 input and $5 output. Treat those as illustrative, not quotes; check the live pricing pages, because providers revise them. The structural point holds regardless of the exact numbers: output is expensive, and input volume is what balloons.
Here is what fills the input payload on a typical run:
- System prompt and tool definitions. The agent’s instructions plus every tool’s JSON schema. Fixed per turn, often 5,000 to 15,000 tokens, re-sent every single turn.
- Memory and project files. Things like
CLAUDE.md, repo conventions, and persistent instructions. Loaded every turn whether relevant or not. - Conversation history. Every previous user message, model response, tool call, and tool result, replayed in full on every turn. This grows without bound and is usually the biggest line item in a long session.
- Retrieved file content. Files the agent read. A single
Readon a 1,200-line file is roughly 12,000 to 18,000 tokens, and agents love to read whole files. - Tool output. Test runner logs,
npm installnoise,git diffof a generated lockfile, stack traces. Raw and verbose by default.
The output payload is the model’s reasoning, code edits, and explanations. Smaller than input on most runs, but priced highest per token, so verbose “let me explain my plan in six paragraphs” behavior is costly.
The single most important fact: conversation history is replayed every turn. A 30-turn session does not cost 30 times one turn. It is closer to the sum of a growing prefix, so the later turns each carry the full weight of everything before them. That is why a long, meandering session is the most expensive thing you can do, and why the tactics below disproportionately target context that gets re-sent.
If you want a deeper look at how session and limit accounting works in practice, the breakdown in how the Claude Code token window resets is a useful companion to this section; it explains why a session that “feels short” can still exhaust a budget.
Context hygiene and memory files
The cheapest token is the one you never send. Context hygiene is the highest-impact habit because it shrinks the input payload on every turn for the rest of the session.
Scope the working set before you start. Do not open an agent at the repo root and say “refactor the billing logic.” It will crawl. Instead, tell it exactly which files matter:
# Instead of a vague prompt that triggers wide exploration:
claude "refactor the retry logic so it uses exponential backoff,
only in src/payments/retry.ts and its test file"
Naming the files keeps the agent from reading twenty candidates to find the two that matter. If you must let it explore, point it at a directory, not the root.
Keep memory files short and stable. A CLAUDE.md (or equivalent project memory file) is loaded into context on every turn. Teams treat it like a wiki and let it grow to 4,000 tokens of onboarding prose. At, say, 50 turns a day across 8 engineers, a bloated memory file is re-sent hundreds of times daily for no marginal benefit. Audit it:
# Rough token check on your memory file (chars / 4 is a decent estimate):
wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens per turn"}'
Aim for a tight file: build/test commands, hard conventions, and pointers to where deeper docs live, not the docs themselves. If a section is only relevant to one task a month, it does not belong in the always-loaded file. Move it to a doc the agent reads on demand.
Compact or reset long sessions. When a session has done its job and you are moving to an unrelated task, do not keep typing into the same context. Every new turn now drags the entire old transcript. Use the agent’s compaction or clear command:
# In Claude Code, when the conversation gets long:
/compact # summarizes history into a short digest, drops the raw transcript
# or, for a clean break on a new task:
/clear # starts fresh; old context no longer re-sent
/compact typically replaces tens of thousands of tokens of raw history with a summary one tenth the size, and that smaller prefix is then what every subsequent turn carries. The discipline is simple: one logical task per session, compact or clear between tasks. The workflow patterns in Claude Code workflows lean heavily on this scoping habit, and it is worth adopting wholesale.
Use a project ignore file. Keep generated artifacts, lockfiles, build output, and vendored dependencies out of the agent’s reach. If the agent never sees dist/ or node_modules/, it never spends tokens reading or diffing them. Most agents respect an ignore file; configure it once and the savings are permanent.
Prompt caching: stop paying full price for the same prefix
This is the single biggest lever for repeated runs, and it is mechanical rather than behavioral. Prompt caching lets the provider store a prefix of your request (tools, system prompt, stable context) so subsequent requests that share that prefix read it back at a steep discount instead of reprocessing it.
The economics, per Anthropic’s prompt caching documentation: a cache write costs more than a normal input token (about 1.25x base input for the default 5-minute cache, about 2x for a 1-hour cache), but a cache read costs roughly 0.1x base input; that is about a 90% discount on the cached portion. Because the write premium is small and the read discount is large, caching pays for itself after a single cache hit on the short-lived cache, and after about two hits on the long-lived one. The default cache lifetime is short (around 5 minutes, refreshed each time it is hit), with a 1-hour option available. There is a minimum cacheable size; small models and the largest models need a few thousand tokens before a prefix is eligible, so caching helps most exactly where it matters: large stable prefixes.
The structural rule is to put stable content first and volatile content last, then cache the boundary. Order is tools → system → messages, and changing anything invalidates that level and everything after it. So you want timestamps, the user’s incoming message, and freshly retrieved file content to come after your cache breakpoint, not before it.
If you drive a model directly from your own CLI wrapper, you set this explicitly:
# Cache the stable prefix (system + tool defs + repo conventions).
# The volatile user turn comes after and is NOT part of the cached prefix.
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT + REPO_CONVENTIONS, # stable across runs
"cache_control": {"type": "ephemeral"}, # cache breakpoint here
}
],
messages=[{"role": "user", "content": user_task}], # changes every run
)
# Inspect what actually got cached:
u = response.usage
print("cache write:", u.cache_creation_input_tokens)
print("cache read :", u.cache_read_input_tokens) # these tokens billed ~10%
print("fresh input:", u.input_tokens)
A daily refactor agent that runs the same system prompt and the same 8,000-token repo convention block across 60 invocations a day is the textbook case. Without caching you pay full input price for that 8,000-token block 60 times. With caching you pay the write premium once (or once per cache expiry) and the ~10% read price the other times. On the convention block alone that is close to a 90% reduction, and it stacks with every other tactic here.
Two operational notes. First, keep your prefix byte-stable; a single changed character before the breakpoint busts the cache and you pay a write again. Pin your system prompt and conventions; do not interpolate a timestamp into them. Second, the cache is short-lived by default, so batching related runs close together (rather than spreading them across the day) keeps you hitting a warm cache. OpenAI’s API applies a similar discount to cached input automatically on supported models; the principle is identical even though the knobs differ. The free-tier and routing tricks in running GPT-5.5 free through Codex are a useful complement when caching alone is not enough.
Model routing: cheap model for cheap work
Not every agent action needs a frontier model. Renaming a variable across three files, writing a commit message, summarizing a diff, or generating a boilerplate test does not require the same model that designs an architecture. Yet the default behavior of most CLI agents is to run everything through one expensive model for the whole session.
Routing means deliberately sending low-stakes subtasks to a smaller, cheaper model and reserving the expensive one for genuine reasoning. The price gap is large: a small model in a given family can be three to five times cheaper per token than the flagship, and for mechanical tasks the output quality difference is negligible.
Practical ways to route from the CLI:
# 1. Pick the model per invocation based on the task.
claude --model haiku "write a conventional-commit message for the staged diff"
claude --model sonnet "redesign the caching layer for the payments service"
# 2. Use a cheap model for the high-frequency, low-stakes loop
# (commit messages, changelog entries, quick lint explanations)
# and a strong model only when you explicitly invoke the hard task.
Set the default to the cheaper model and escalate consciously, rather than defaulting to the expensive model and never stepping down. Most teams have the polarity backwards: they run the flagship for everything “to be safe” and pay five times over for commit messages.
A second routing axis is sub-agents. If your agent framework supports delegating a narrow subtask to a child agent, give that child a cheap model and a tiny context. The child does the grunt work (search, summarize, draft) on pennies and reports a short result back to the expensive parent, instead of the expensive parent doing the grunt work itself at full price with full context. The autonomous-loop patterns in the goal command across Codex and Claude Code show how to structure that delegation so the expensive model only sees distilled results.
A note on limits, not just dollars. If you are on a usage-capped plan rather than pure pay-as-you-go, routing also stretches how far your allowance goes. Spending your premium-model budget on commit messages is how teams hit a wall by Thursday. The recent Claude Code weekly limit increase helps, but routing is still what makes the allowance last.
Trimming tool output and retrieval
Tool output is the silent budget killer because it is invisible until you look. Every command an agent runs returns text, and that text goes straight back into context, where it is then replayed on every subsequent turn. A single npm install can return thousands of lines. A test run with verbose logging can return tens of thousands of tokens. A git diff that includes a regenerated lockfile can be enormous. The agent rarely needs all of it; it needs the pass/fail and the relevant failure.
Tactics that cut this cleanly:
Make commands quiet at the source. The agent pays for whatever the command prints. Configure tools to be terse:
# Loud (agent pays for every line):
npm test
# Quiet (only failures and a summary come back):
npm test --silent -- --reporter=dot
# Loud:
npm install
# Quiet:
npm install --silent --no-audit --no-fund
Filter before the agent sees it. When you control the command the agent runs, pipe out the noise so only signal returns:
# Only the lines that matter come back into context:
pytest -q 2>&1 | tail -n 30
# Diff stats instead of a 4,000-line full diff:
git diff --stat
# Grep for the failure instead of dumping the whole log:
npm test 2>&1 | grep -E "(FAIL|✗|Error)" | head -n 20
Prefer targeted reads over whole-file reads. Reading a 1,500-line file to change one function is pure waste. Encourage the agent to grep for the symbol and read a window around it, not the entire file. Many agents do this when the prompt nudges them (“find and read only the function that handles retries, not the whole file”). On a large file that is the difference between ~18,000 tokens and ~800.
Constrain retrieval scope. If your agent does codebase search or RAG over docs, cap how many chunks it pulls back and how large they are. Ten 200-token snippets that answer the question beat fifty 800-token snippets that bury it; you pay for every retrieved token whether the model uses it or not.
These changes are mostly one-time configuration (test reporters, install flags, an ignore file) and they pay out on every run forever, which makes them some of the best return on effort in this entire list.
Measuring and attributing cost per run
You cannot manage what you do not measure, and “the bill went down” is not measurement. To know whether a tactic worked, you need cost attributed to a run, ideally to a task.
Start with the data the API already gives you. Every response includes a usage object. Capture it:
u = response.usage
# Approximate cost in dollars; substitute the live rates for your model.
INPUT_RATE = 3.00 / 1_000_000 # base input $/token (illustrative)
OUTPUT_RATE = 15.00 / 1_000_000 # output $/token (illustrative)
CACHE_READ = 0.30 / 1_000_000 # ~10% of base input
CACHE_WRITE = 3.75 / 1_000_000 # ~1.25x base input (5-min cache)
cost = (
u.input_tokens * INPUT_RATE +
u.output_tokens * OUTPUT_RATE +
u.cache_read_input_tokens * CACHE_READ +
u.cache_creation_input_tokens * CACHE_WRITE
)
print(f"run cost ≈ ${cost:.4f} "
f"(in={u.input_tokens} out={u.output_tokens} "
f"cr={u.cache_read_input_tokens})")
If you use the agent CLI rather than your own wrapper, three approaches work:
# 1. Most agent CLIs expose a usage/cost command for the session.
# Check it after a representative task and write the number down.
claude /cost
# 2. Provider consoles report per-API-key spend. Issue a dedicated
# API key per agent or per project so spend is attributable
# instead of pooled into one untraceable total.
# 3. Tag runs. Wrap the agent invocation in a script that logs
# timestamp, task label, and the reported token counts to a CSV.
# A week of that CSV tells you which tasks are expensive.
Estimate before you run for anything large. Paste the prompt and the files you intend to include into a tokenizer (OpenAI’s public tokenizer is a quick way to sanity-check size) and look at the count. If “include the whole module” is 90,000 tokens and the targeted version is 6,000, you just saw the decision before paying for it.
Track one number per representative task over time: cost per “daily refactor run,” cost per “PR review run.” When you turn on caching or switch a subtask to a cheap model, that number should move. If it does not, the tactic is not doing what you think, and you learned that for the price of one run instead of a month of bills.
Tactic comparison
| Tactic | Typical token savings | Effort |
|---|---|---|
| Scope the working set (name files, don’t crawl) | 30–60% on input per run | Low |
| Short, stable memory file | 5–15% per turn, every turn | Low |
/compact or /clear between tasks |
40–80% on long sessions | Low |
| Prompt caching on stable prefix | ~90% on the cached prefix | Medium |
| Model routing (cheap model for cheap work) | 50–80% on routed subtasks | Medium |
| Quiet/filtered tool output | 20–50% on tool-heavy runs | Low (one-time) |
| Targeted reads over whole-file reads | 70–95% on large-file edits | Low |
| Constrained retrieval scope | 30–60% on RAG-heavy agents | Medium |
| Per-run cost measurement | 0% directly; enables all the above | Low |
Savings ranges are illustrative and stack multiplicatively; the gain on any one tactic depends on your baseline waste.
Conclusion
Agent token costs are mostly self-inflicted, and the command line is where you fix them. The waste lives in context you re-send, output you do not read, and models that are too expensive for the task at hand. Address those and the bill drops without touching the quality of the work.
Do the low-effort items first; scoping, quiet output, and a lean memory file cost nothing and pay out on every run from now on. Layer caching and routing on top and the difference is the kind you can put in a budget.



