Pi's system prompt fits in under 1,000 tokens. Claude Code's runs over 10,000.

That gap, at 6.5 million monthly active developers, works out to more than $630 million per year in API cost. Before tool definitions. Before context growth. Before any multi-agent amplification.

One design decision. Made once. Compounding every month.

This is post six in the Agent Harnesses series. We've covered what harnesses are, how they fail, how system prompts work, and how to observe them. This post is about what they cost, and specifically which decisions move the number the most.


The token that gets sent every time

Every time your agent loop makes an API call, the system prompt goes with it. The full thing. Every iteration.

Most engineers think about the system prompt once, when they write it. The billing system thinks about it every single time the model is called.

Here's the base math. Say you're running on Claude Sonnet 4.6 at $3.00 per million input tokens. You have 10,000 daily active users, each triggering 10 agent calls per day. That's 3 million calls per month.

A 10,000-token system prompt at that volume:

text
3,000,000 calls × 10,000 tokens = 30 billion tokens/month
30,000,000,000 / 1,000,000 × $3.00 = $90,000/month
Annual: $1,080,000

A 1,000-token system prompt at the same volume:

text
3,000,000 calls × 1,000 tokens = 3 billion tokens/month
3,000,000,000 / 1,000,000 × $3.00 = $9,000/month
Annual: $108,000

The difference is $972,000 per year. For 10,000 users. From the system prompt alone.

This is not a rounding error in the budget. At this point it's a conversation with finance.
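
The arithmetic is worth keeping in a scratch script so you can plug in your own numbers. A minimal sketch in Python, using the illustrative figures above rather than anyone's real traffic:

python
# Input cost of the system prompt alone, re-sent on every call.
def monthly_prompt_cost(users, calls_per_user_per_day, prompt_tokens,
                        price_per_mtok=3.00, days=30):
    calls = users * calls_per_user_per_day * days
    return calls * prompt_tokens / 1_000_000 * price_per_mtok
big = monthly_prompt_cost(10_000, 10, 10_000)   # $90,000/month
small = monthly_prompt_cost(10_000, 10, 1_000)  # $9,000/month
print(f"gap: ${12 * (big - small):,.0f}/year")  # $972,000/year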


The 9x difference

Pi is an AI companion app. Its agent harness has been publicly analyzed, and the system prompt comes in at under 1,000 tokens. The whole thing. Claude Code, built by Anthropic, runs somewhere north of 10,000 tokens for its core system context, and that's before conditional blocks get added based on environment and config.

That's a 9x difference in system prompt size between two production harnesses, both built by serious engineering teams.

The gap is not an accident. Pi made a deliberate choice to keep its harness lightweight. Claude Code made a deliberate choice to give its agent deep context. Both are defensible. The point is: it's a choice. Token count is not a technical constraint. It's a design decision.

At OpenCode's reported scale of 6.5 million monthly developers, running 10 agent calls per day:

text
Monthly calls: 6,500,000 × 10 × 30 = 1,950,000,000
Annual calls:  23,400,000,000

10k system prompt:
  23.4B calls × 10,000 tokens × $3.00/1M = $702,000,000/year

1k system prompt:
  23.4B calls × 1,000 tokens × $3.00/1M = $70,200,000/year

Gap: $631,800,000/year

$631 million. From the size of a single text block that every engineer on the team probably read once and forgot about.

I want to be precise about what this is and isn't. Most teams are not operating at 6.5 million users. But the math scales linearly. At 100,000 users and 10 calls/day, a 10k vs 1k prompt difference is $9.7 million per year. At 10,000 users, it's $972,000. These are real numbers for mid-size products.


The quadratic problem nobody tells you about

Before going further, there's a cost structure that makes all of the above worse.

In naive agent loops, costs don't grow linearly as the conversation gets longer. They grow quadratically. Each iteration re-sends the full conversation history to that point. So at step 20 of a 20-step loop, you're billing for step 1's content again. And step 2's. And so on.

A 20-step loop where each step generates 1,000 tokens:

  • Naive estimate: 20,000 input tokens total
  • Actual billed: roughly 210,000 input tokens

That's 10x what most people assume. Every calculation in this post gets multiplied by this factor in agent loops without compaction or context trimming.
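
If you want to see where the 210,000 comes from, here's the loop spelled out. A rough sketch in Python, assuming each step adds 1,000 tokens of new content and the harness re-sends everything accumulated so far:

python
# Input tokens billed by a naive loop that re-sends the full history each step.
def billed_input_tokens(steps, tokens_per_step=1_000, system_tokens=0):
    total, history = 0, 0
    for _ in range(steps):
        history += tokens_per_step        # this step's content joins the context
        total += system_tokens + history  # and the whole context is billed again
    return total
print(billed_input_tokens(20))                        # 210,000 vs the naive 20,000
print(billed_input_tokens(20, system_tokens=10_000))  # 410,000 with a 10k prompt on top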

Well-designed harnesses handle this. Claude Code uses auto-compaction: it summarizes the conversation at the context limit rather than letting it grow unbounded. The system prompt itself gets re-sent each time, but the conversation portion gets compressed. This keeps costs from spiraling as tasks get longer.

If your harness doesn't do this, it should.
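
A bare-bones version of the idea looks something like this. It's a sketch against the Anthropic Python SDK; the model ID, the threshold, and the summarization prompt are placeholders, not what Claude Code actually does:

python
import anthropic
client = anthropic.Anthropic()
COMPACTION_THRESHOLD = 100_000  # placeholder; compact well before the context limit
def maybe_compact(messages, model="claude-sonnet-4-6"):  # placeholder model ID
    # Rough size estimate; a real harness would count tokens properly.
    approx_tokens = sum(len(str(m["content"])) for m in messages) // 4
    if approx_tokens < COMPACTION_THRESHOLD:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-2])
    summary = client.messages.create(
        model=model,
        max_tokens=1_000,
        messages=[{"role": "user", "content":
                   "Summarize this agent transcript, keeping decisions, open tasks, "
                   "and file paths:\n\n" + transcript}],
    )
    # Replace old history with the summary; keep the latest exchange verbatim.
    return [{"role": "user", "content": summary.content[0].text}] + messages[-2:]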


Prompt caching: the first lever

Once your system prompt is as tight as it reasonably can be, caching is the next lever.

Anthropic's prompt caching works by marking a static prefix in your prompt with cache_control. The first request writes to cache at 1.25x the normal input rate. Every subsequent request that hits that cached prefix within the TTL pays only 0.10x (a 90% discount).

The math on when this pays off:

text
1 write (1.25x) + N reads (0.10x each) < (1 + N) full reads (1.0x each)
1.25 + 0.10N < 1 + N
0.25 < 0.90N
N > 0.278

Caching breaks even on the second call. Every call after that is almost free for the cached portion.

At 100 calls with a 10,000-token system prompt on Sonnet 4.6:

text
Without caching: 100 × 10,000 × $3.00/1M = $3.00
With caching:    1 write × 10,000 × $3.75/1M + 99 reads × 10,000 × $0.30/1M
                 = $0.0375 + $0.2970 = $0.3345
Savings: 88.9%

ProjectDiscovery published data from their Neo security scanning platform that shows what this looks like in production. One task ran 67.5 million input tokens across 1,225 agent steps at a 91.8% cache hit rate. A comparable task at a 3.2% cache rate cost roughly 60 times more for the same token volume. Same code. Same model. Different cache hit rate.

That 60x difference is the cost of not setting cache_control.

A few things to know about Anthropic's caching setup:

  • The minimum cacheable block is 1,024 tokens on Sonnet 4.6 and above.
  • The cache is workspace-scoped (changed from org-level in February 2026).
  • TTL options are 5 minutes (1.25x write cost) or 1 hour (2.0x write cost). At high call volume, the 5-minute window is fine: you write the cache once and read it hundreds of times inside that window. At lower volume, 1 hour may make more sense.

OpenAI applies caching automatically with no cache_control needed, but the discount is 50% rather than 90%. For Anthropic's explicit caching, the setup cost is one afternoon of engineering. The return at any real scale is immediate.
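
For Anthropic, the change is one field on the static prefix. A minimal sketch with the Python SDK; the model ID and the prompt text are placeholders:

python
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = "You are a coding agent. ..."  # static prefix; must clear the 1,024-token minimum
response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model ID
    max_tokens=1_024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # 5-minute TTL; cached reads bill at 0.10x
    }],
    messages=[{"role": "user", "content": "Refactor the retry logic in sync.py"}],
)
print(response.usage)  # cache_creation_input_tokens on call one, cache_read_input_tokens after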


Tool definitions are the hidden system prompt

This is the part that surprises most people when they first look at the token bill.

Every API call that includes tool definitions sends the full schema for every tool (name, description, parameters, type annotations) as part of the input. This happens on every call, whether or not those tools get used.

A typical tool definition runs 300 to 600 tokens (median around 400, depending on how verbose the descriptions are and how many parameters there are).

If your harness has 19 tools:

text
19 tools × 400 tokens = 7,600 tokens of overhead per call

If it has 4:

text
4 tools × 400 tokens = 1,600 tokens per call

At 1 million calls per month on Sonnet 4.6, that 6,000-token difference is:

text
6,000 × 1,000,000 / 1,000,000 × $3.00 = $18,000/month
Annual: $216,000

At OpenCode's scale, the same calculation gives $421 million per year. From tool schema overhead alone.
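
The overhead is easy to measure before it hits the bill, assuming the token-counting endpoint in Anthropic's Python SDK. A sketch; tools_all and tools_core stand in for your full and trimmed tool lists:

python
import anthropic
client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"  # placeholder model ID
probe = [{"role": "user", "content": "hi"}]
def tool_overhead(tools):
    # Difference in counted input tokens with and without the tool schemas attached.
    with_tools = client.messages.count_tokens(model=MODEL, messages=probe, tools=tools)
    without = client.messages.count_tokens(model=MODEL, messages=probe)
    return with_tools.input_tokens - without.input_tokens
print("full toolset, per call:", tool_overhead(tools_all))      # your 19-tool schema
print("trimmed toolset, per call:", tool_overhead(tools_core))  # the 4 you actually need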

The MCP compounding problem

MCP (Model Context Protocol) servers inject their full tool schema on every message, regardless of whether those tools are relevant to the current task. Research from 2026 shows the range:

  • A 10-to-15-tool MCP server: 1,500 to 4,000 tokens per call
  • A large MCP server: 5,000 to 8,000+ tokens per call
  • At 508 tools: 1.15 million tokens per query just for tool definitions, which works out to $3.77 per call on Claude at standard rates

One benchmark compared native MCP against a lazy-loaded CLI approach over a 20-turn conversation with 30 tools:

text
Native MCP:  36,310 tokens for schemas alone
Optimized:    1,734 tokens total
Savings:        95.2%

The first fix is lazy loading: only inject tool definitions when a tool is actually needed, which cuts tool-definition overhead by up to 80-90% for large tool libraries. The second fix is caching tool definitions. They're static. Mark them with cache_control. They're going to get sent on every call, so they're high-value candidates for the cache prefix.
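
Lazy loading can be as simple as a routing table in front of the tool registry. A sketch; the categories and tool names are made up for illustration:

python
# Map task categories to the handful of tools each one actually needs.
TOOL_GROUPS = {
    "code":   ["read_file", "edit_file", "run_tests"],
    "search": ["web_search", "fetch_url"],
    "data":   ["query_db", "export_csv"],
}
def tools_for(task_category, registry):
    # registry maps tool name -> full JSON schema.
    selected = [registry[n] for n in TOOL_GROUPS.get(task_category, []) if n in registry]
    if selected:
        # Schemas are static, so mark the last one as a cache breakpoint for the whole list.
        selected[-1]["cache_control"] = {"type": "ephemeral"}
    return selected

The classifier that picks task_category can be a keyword match or a one-line call to a cheap model; either way it runs once per task, not once per call.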


Output tokens: the multiplier nobody budgets for

Input is expensive. Output costs 5x more.

On Sonnet 4.6, input is $3.00 per million tokens. Output is $15.00. The reason is architectural: input tokens get processed in one forward pass; output tokens require a separate forward pass per token (autoregressive generation). More compute per token, higher price per token.

This ratio is consistent across the frontier: Opus 4.6 is 5x ($5 in, $25 out), GPT-4o is 4x ($2.50 in, $10 out), Haiku 4.5 is 5x ($1 in, $5 out).

The verbosity of your model's output is a budget decision.

At 10,000 users, 10 calls/day, 30 days on Sonnet 4.6:

text
Verbose output (500 tokens/call):
  3,000,000 × 500 / 1,000,000 × $15.00 = $22,500/month = $270,000/year

Terse output (100 tokens/call):
  3,000,000 × 100 / 1,000,000 × $15.00 = $4,500/month = $54,000/year

Annual difference: $216,000

$216,000 per year from output length alone. No change to the model, the tools, the system prompt, or the user experience. Just how verbose the responses are.

A few design choices that move the number directly:

Format. Compact JSON schemas and YAML typically save 20-40% over verbose JSON with repeated keys or prose. If you're asking for structured data, be specific about the structure in the prompt.

Length constraints. "Return only the changed lines" or "Respond in three sentences maximum" can cut output tokens 50-80%. Models follow these instructions. They don't need encouragement to be concise, just instruction.

Chain-of-thought is expensive. Asking a model to reason through a problem before answering can multiply output tokens 5x to 20x. Reasoning models in thinking mode can add 10x to 30x output cost for complex tasks. Sometimes that's the right trade. Just know you're opting into it.

Tool call format is often more token-efficient than asking for prose. If the agent's job is to call a tool, ask for the tool call directly. A narrative description of what it's about to do costs tokens and adds nothing.
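
The first two levers fit directly in the request. A sketch with placeholder model ID and prompts; max_tokens is the hard ceiling, the system line is the instruction the model actually follows:

python
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model ID
    max_tokens=300,             # hard ceiling: output billing can't exceed this per call
    system="Return only the changed lines as a unified diff. No explanation.",
    messages=[{"role": "user", "content": "Rename fetch to fetch_page in scraper.py"}],
)
print(response.usage.output_tokens)  # the number priced at 5x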


Multi-agent amplification

Multi-agent architectures multiply costs in two ways at once.

First, each agent in the chain has its own system prompt, sent on every call in that agent's loop. Second, output from one agent becomes input for the next. Context passes between agents and the total token volume compounds.

Here's a rough single-task comparison:

Single agent (10k system prompt):

text
Input:  10,500 tokens (10k system + 500 task)
Output: 2,000 tokens
Cost (Sonnet 4.6): $0.0615

Three-agent chain (orchestrator + 2 subagents + aggregator):

text
Orchestrator:  10,500 input + 3,000 output
Subagent 1:     7,500 input + 3,000 output
Subagent 2:     9,500 input + 3,000 output
Aggregator:    16,000 input + 2,000 output

Total input:  43,500 tokens
Total output: 11,000 tokens
Cost: $0.2955
Multiplier: ~4.8x

That 4.8x figure is conservative. It doesn't include retry logic, verification passes, or the quadratic context growth described earlier. Real-world teams report 5x to 10x amplification for complex multi-agent pipelines.

The implications for harness design:

The orchestrator's system prompt matters most, because it gets sent on the most calls. If you're going to engineer one prompt for token efficiency, start there.

Subagents should default to smaller models. Route to Haiku or GPT-4o mini for simple subtasks, escalate to Sonnet or Opus only when the task requires it. The cost differential is 3x to 5x per token.

Summarize at handoff points. Before passing output from one agent to the next, compress it. The next agent doesn't need the full transcript. It needs the result.

Cache shared context across agents. If all agents in a chain share a task description, shared rules, or a common knowledge base, mark that prefix for caching. It's going to be sent many times.
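
Two of these, cheaper subagent models and compression at handoff, look roughly like this. A sketch against the Python SDK; the model ID and the summarization prompt are placeholders:

python
import anthropic
client = anthropic.Anthropic()
SUBAGENT_MODEL = "claude-haiku-4-5"  # placeholder ID; escalate to Sonnet only when needed
def run_subagent(task, context):
    result = client.messages.create(
        model=SUBAGENT_MODEL,
        max_tokens=2_000,
        messages=[{"role": "user", "content": f"{context}\n\nTask: {task}"}],
    )
    return result.content[0].text
def handoff(transcript):
    # Compress before passing to the next agent: the result, not the transcript.
    summary = client.messages.create(
        model=SUBAGENT_MODEL,
        max_tokens=300,
        messages=[{"role": "user", "content":
                   "Summarize the outcome in under 150 words, keeping file paths "
                   "and decisions:\n\n" + transcript}],
    )
    return summary.content[0].text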


When to reach for batch

Anthropic's Batch API gives 50% off all token prices. Same model, same completions, no quality difference. The tradeoff is latency: most batches complete within an hour, with a 24-hour maximum SLA.

Sonnet 4.6 batch pricing: $1.50 input / $7.50 output per million tokens (down from $3.00 / $15.00).

Good candidates for batch processing:

  • Overnight code analysis across a codebase
  • Bulk document summarization or classification
  • Evaluation pipeline runs
  • Data enrichment jobs
  • Non-urgent report generation
  • Training data generation

Not suitable for batch: anything a live user is waiting on, real-time agent tasks, streaming responses.

A quick calculation. 500,000 documents per month, 2,000 input tokens + 500 output tokens each:

text
Real-time (Sonnet 4.6):
  Input:  500k × 2,000 / 1M × $3.00  = $3,000
  Output: 500k × 500   / 1M × $15.00 = $3,750
  Total:  $6,750/month

Batch (50% off):
  Input:  $1,500
  Output: $1,875
  Total:  $3,375/month

Annual savings: $40,500

Combined with prompt caching on the shared system prompt, the system prompt's cost on high-volume batch runs approaches zero. Anthropic quotes up to a 95% cost reduction when batch pricing and caching are combined.
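
Submission is just a list of otherwise-normal requests. A minimal sketch using the Message Batches API in the Python SDK; the model ID, the shared prompt, and the documents variable are placeholders:

python
import anthropic
client = anthropic.Anthropic()
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-6",  # placeholder model ID
                "max_tokens": 500,
                "system": SUMMARY_PROMPT,      # shared prefix: also a caching candidate
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)
print(batch.id, batch.processing_status)  # poll later, then download the results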


The 78% optimization

Here's the full before-and-after. Same setup as before: 10,000 users, 10 calls/day, 30 days, Sonnet 4.6.

text
Configuration                                               Monthly cost    Annual cost
Baseline:  10k system prompt, no caching, 300-token output      $121,500     $1,458,000
Optimized: 1k system prompt, cached, 100-token output            $26,383       $316,596
Savings                                                          $95,117     $1,141,404
Reduction                                                            78%            78%

The optimized configuration is not a complex engineering project. It's three decisions:

Write a tighter system prompt. Cut it to 1,000 tokens instead of 10,000. Everything that doesn't need to be there on every call should come from tools or be retrieved on demand.

Add cache_control to the static prefix. One afternoon. Immediate 90% discount on cached reads from call two onward.

Set explicit length constraints on model output. "Return only the changed lines." "Respond in JSON with these four fields." The model follows these instructions and the bill reflects it.

Three decisions, made once. $1.1 million per year at 10,000 users. At 6.5 million active developers, the system prompt decision alone is worth $631 million.

This is why Pi's sub-1,000-token system prompt is interesting beyond philosophy. It's not minimalism for its own sake. It's cost discipline encoded structurally. The harness can't accidentally get expensive if there's nothing in it that shouldn't be there.


A note on open-weight models

Self-hosting changes the currency but not the relationship. GPU compute time replaces API dollars, but more tokens still means more cost. The token arithmetic above applies. It just points to H100 hours rather than an Anthropic invoice.

At roughly 5 million tokens per hour throughput on a single H100 at $3/hour, the self-hosted input cost is about $0.60 per million tokens vs $3.00 for Sonnet 4.6. That looks like an 80% saving. But the true cost of self-hosting includes a senior inference engineer ($250,000 to $360,000 loaded annually), real infrastructure TCO (multiply raw GPU cost by 1.3x to 2.0x), and the fact that average GPU utilization in production is 30-40% due to traffic spikes and quiet periods.
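
Folding those factors into the per-token price makes the comparison concrete. A sketch with illustrative numbers; the multipliers are assumptions drawn from the ranges above, not measurements, and the engineer's salary isn't included:

python
# Effective self-hosted cost per million tokens, after real-world overhead.
def effective_cost_per_mtok(gpu_hourly=3.00, tokens_per_hour=5_000_000,
                            tco_multiplier=1.5, utilization=0.35):
    useful_tokens_per_hour = tokens_per_hour * utilization  # GPUs sit idle between spikes
    return gpu_hourly * tco_multiplier / (useful_tokens_per_hour / 1_000_000)
print(effective_cost_per_mtok())                 # ~$2.57/M, vs $3.00/M for the API
print(effective_cost_per_mtok(utilization=0.9))  # ~$1.00/M if you can keep the GPU busy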

The practical break-even for chat-class tasks on a single-host model like Llama 4 Scout is around 5 million tokens per day of consistent traffic. For MoE models like DeepSeek that require multi-host setups, closer to 50 million tokens per day.

If your monthly API bill is under $2,000 to $3,000, the engineering overhead of self-hosting almost never pays off. Above that threshold, it's worth running the numbers. Below it, pay the API bill and go ship something.


What this means for harness design

Token cost is not an ops problem. It's an architecture problem.

The teams who treat the system prompt like a compiled binary, where every token earns its place, end up with harnesses that are cheaper, faster, and easier to reason about. The teams who let the system prompt accumulate, one feature here, one instruction there, each reasonable in isolation, end up with a bill that looks fine at 1,000 users and horrifying at 100,000.

The system prompt is the biggest lever. Cut it. Everything that doesn't need to be there on every call should not be there. Then add cache_control to the static prefix. One afternoon, 90% off your most repeated tokens from call two onward.

Tool schema overhead compounds in a way that's easy to miss until you look at the token breakdown. Lazy-load tools. Cache tool definitions. Keep tool count low per task type. These aren't heroic optimizations. They're one-time decisions that change the cost curve for the life of the product.

Output verbosity multiplies at 5x. Constrain it explicitly. The model follows the instruction.

If you're building multi-agent, model the cost before you commit to the architecture. A 3-agent chain can cost 5x to 10x more than a single agent per task. That's sometimes the right call. But it's not a free call, and the default subagents should be smaller models.

Batch anything that isn't interactive. 50% off, same quality.

At 100 users, bad token hygiene is a footnote. At 100,000 users, it's a budget crisis. The decisions are the same either way. The scale is what changes.

If you're building on a harness now, the next post in this series covers security in harness design: sandboxing, principle of least privilege for tools, and what prompt injection looks like at the harness level.


References

  1. Anthropic pricing documentation: platform.claude.com/docs/en/about-claude/pricing
  2. Anthropic prompt caching documentation: platform.claude.com/docs/en/build-with-claude/prompt-caching
  3. Anthropic batch processing documentation: platform.claude.com/docs/en/build-with-claude/batch-processing
  4. OpenAI API pricing: openai.com/api/pricing
  5. Finout: Anthropic API pricing complete guide (2026): finout.io/blog/anthropic-api-pricing
  6. ProjectDiscovery: "How We Cut LLM Costs by 59% With Prompt Caching": projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
  7. Du'An Lightfoot (Medium): "Prompt Caching Is a Must: How I Went from $720 to $72/Month on API Costs": medium.com/@labeveryday/prompt-caching-is-a-must
  8. Augment Code: AI agent loop token costs and context constraints: augmentcode.com/guides/ai-agent-loop-token-cost-context-constraints
  9. Zylos Research: AI agent cost optimization: token economics in production (2026): zylos.ai/research/2026-02-19-ai-agent-cost-optimization-token-economics
  10. MindStudio: AI agent token budget management and Claude Code: mindstudio.ai/blog/ai-agent-token-budget-management-claude-code
  11. MindStudio: Claude Code MCP servers and token overhead: mindstudio.ai/blog/claude-code-mcp-server-token-overhead
  12. Fazm.ai: Tokens used loading MCP tools: fazm.ai/blog/mcp-tool-token-overhead-optimization
  13. Debby Mckinney (Medium): Cutting MCP token costs by 92% at 500+ tools: medium.com/@hi.debmckinney/cutting-mcp-token-costs-by-92-at-500-tools
  14. Silicon Data / Digital Applied: Self-hosting frontier AI models: 2026 TCO analysis: digitalapplied.com/blog/self-host-frontier-models-tco-analysis-2026
  15. Revolution in AI: Self-hosting Llama 4 vs GPT-4o: monthly volume break-even: revolutioninai.com/2026/03/self-hosting-llama-4-vs-gpt4o-api-cost-breakeven
  16. Zylo: OpenAI API pricing: how to control costs: zylo.com/blog/openai-api-pricing
  17. DEV Community: Why your LLM agent costs 10x more than your estimate: dev.to/awxglobal/why-your-llm-agent-costs-10x-more-than-your-estimate-4o78
  18. CodeAnt.ai: Why output and reasoning tokens inflate LLM costs: codeant.ai/blogs/input-vs-output-vs-reasoning-tokens-cost
  19. Claude Code system prompt repo (Piebald AI): github.com/Piebald-AI/claude-code-system-prompts
  20. tokscale: token tracking CLI across Claude Code, Codex, OpenCode, Pi: github.com/junhoyeo/tokscale

Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.