Pi gives the model around 800 tokens. Claude Code gives it 27,000. Both ship working code.
That 33x gap is the whole puzzle of system prompt engineering. Either Pi is dangerously under-configured, or Claude Code is spending tokens it doesn't need. Probably neither is completely true. But the question is worth taking seriously, because most engineers treating the system prompt as a text field are missing what it actually is: the harness's primary configuration surface.
This is part three of the Agent Harnesses series. If you haven't read the intro post, the short version: the harness is everything in an AI agent except the model. It manages the agent loop, tools, memory, permissions, and state. The system prompt is where the harness does most of its configuration work.
Not a message. A separate field.
The most common misconception: engineers thinking of the system prompt as "the first message." It isn't.
Anthropic's Messages API takes a system parameter that sits entirely outside the messages array. Messages alternate between user and assistant roles. The system prompt is neither.
{
"model": "claude-opus-4-6",
"system": "You are a coding assistant...",
"messages": [
{"role": "user", "content": "Help me debug this function"}
]
}
OpenAI's API originally used "role": "system" inside the messages array. As of the o1 release, that's deprecated in favor of "role": "developer". The developer role signals the second-highest authority level in the instruction hierarchy. Google's Gemini uses a system_instruction field in the GenerateContent request. In Google's ADK, the agent's description and instruction fields together form the effective system prompt.
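For comparison, the other two shapes look roughly like this as raw request payloads. These are sketches rather than exact SDK calls, and field casing varies by API surface (Gemini's REST API also accepts camelCase systemInstruction):

openai_request = {
    "model": "o1",
    "messages": [
        # Developer role: second-highest authority, but inside the messages array
        {"role": "developer", "content": "You are a coding assistant..."},
        {"role": "user", "content": "Help me debug this function"},
    ],
}

gemini_request = {
    # system_instruction sits outside contents, like Anthropic's system field
    "system_instruction": {"parts": [{"text": "You are a coding assistant..."}]},
    "contents": [
        {"role": "user", "parts": [{"text": "Help me debug this function"}]},
    ],
}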
The structural distinction matters. When Anthropic separates the system field from messages, it's not just API design. It's an explicit signal about the priority hierarchy: system > developer/user > assistant. Models are trained to respect this. Instructions in the system prompt take precedence over user-turn requests when they conflict.
The catch: this enforcement is probabilistic, not deterministic. Well-crafted injection attempts can still influence model behavior through the user turn. More on that below.
The system prompt consumes tokens from the same shared context window as everything else. You can measure it with Anthropic's token counting endpoint:
import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-opus-4-6",
    system="Your system prompt here...",
    messages=[...]
)
print(count.input_tokens)
Request components are processed in a fixed order: tools first, then system, then messages. This same order determines cache invalidation. If your tool definitions change, the system and messages caches both invalidate. If only messages change, tool and system caches stay warm. That ordering shapes the hybrid caching strategy described later.
What harnesses actually put in there
The system prompt is not a greeting. It's the configuration surface. Everything that shapes model behavior gets encoded here.
What actually goes in: tool definitions and JSON schemas (the largest component by token count, by a wide margin), identity and persona, safety and permission rules, task context and output format instructions, and sometimes few-shot examples when the task is format-sensitive enough to justify the token cost. On top of those static components, harnesses inject runtime content: current date, user-specific config, session state, environment info.
Claude Code's anatomy
Claude Code's system prompt is not a monolithic string. The harness assembles it conditionally on every call, drawing from 110+ instruction blocks selected by session mode and config.
Core system prompt text runs about 2,300 to 3,600 tokens depending on mode, broken roughly into: identity and security (~100 tokens), output and tone (~320 tokens), executing actions carefully (~540 tokens), tool usage policy (~550 tokens), harness instructions for terminal markdown, permissions, and compaction (~195 tokens), and thinking frequency tuning (~119 tokens).
Then conditional blocks add on top: Auto mode adds ~188 tokens, Plan mode adds 142 to 1,297 tokens, Learning mode adds ~1,042 tokens, a git status snapshot adds ~97 tokens, and Hooks configuration can add ~1,493 more.
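A simplified sketch of how that kind of conditional assembly works. The block contents, names, and helper shape are invented for illustration, not Claude Code's actual implementation:

# Placeholder instruction blocks; the real ones are full paragraphs of prompt text.
STATIC_CORE = ["<identity>...</identity>", "<output_and_tone>...</output_and_tone>",
               "<tool_usage_policy>...</tool_usage_policy>"]
MODE_BLOCKS = {"plan": "<plan_mode>...</plan_mode>",
               "auto": "<auto_mode>...</auto_mode>",
               "learning": "<learning_mode>...</learning_mode>"}

def build_system_prompt(mode, git_status=None, hooks_block=None):
    blocks = list(STATIC_CORE)              # always present
    if mode in MODE_BLOCKS:
        blocks.append(MODE_BLOCKS[mode])    # conditional, per session mode
    if git_status:
        blocks.append(f"<git_status>\n{git_status}\n</git_status>")  # runtime snapshot
    if hooks_block:
        blocks.append(hooks_block)          # hooks configuration, if any
    return "\n\n".join(blocks)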
The core system text is not what breaks the token budget. Tool definitions are.
Claude Code has 23+ built-in tools, each with a name, description, and full JSON schema. Together they consume 14,000 to 17,000 tokens before you've connected a single MCP server. Add a lightweight MCP server and you're paying another 1,000 to 2,000 tokens. A heavy one like GitHub or Playwright can add 10,000+.
Baseline total before the conversation starts: around 27,000 tokens. This is documented in community reverse-engineering work at Piebald-AI/claude-code-system-prompts and Drew Breunig's detailed analysis.
Pi's counter-thesis
Pi's total system prompt, including all tool definitions, is under 1,000 tokens. Four tools: read, write, edit, bash. That's it.
Mario Zechner (with involvement from Armin Ronacher of Flask and Jinja2 fame) made a specific bet: frontier models have been RL-trained on enough agentic data that they already know what a coding agent does. A 10,000-token system prompt adds noise, not signal.
Pi's mechanism for staying lean is lazy skills. Each skill keeps only its name and description in context by default. Full instructions and tool schemas load only when the skill is invoked. This is architecturally different from MCP's approach of preloading all tool schemas at session start.
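A minimal sketch of the lazy-skill pattern; the class and method names are mine, not Pi's:

class LazySkill:
    def __init__(self, name, description, load_full):
        self.name = name
        self.description = description
        self._load_full = load_full   # reads full instructions/schema from disk
        self._full = None

    def summary(self):
        # This one line is all the context pays for until the skill is used
        return f"{self.name}: {self.description}"

    def instructions(self):
        # Full instructions and tool schema enter context only on invocation
        if self._full is None:
            self._full = self._load_full()
        return self._full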
Same model. 33x fewer tokens to configure it. It works, at least for the focused coding tasks Pi targets.
The honest question this raises: is Claude Code's larger prompt a sign of harness sophistication, or partly compensation for behaviors that frontier models already carry in their weights?
The budget competition
A 200,000-token context window sounds unlimited. Claude Code is spending 8.5 to 13.5% of it before the conversation begins.
| Component | Tokens | % of 200k |
|---|---|---|
| System prompt text (core) | ~2,500 | 1.25% |
| Tool definitions (built-in) | ~14,000-17,000 | 7-8.5% |
| MCP server additions | 0-30,000+ | 0-15%+ |
| CLAUDE.md (if loaded) | 0-5,000+ | 0-2.5% |
| Output buffer (reserved) | ~16,000 | 8% |
Claude Code reserves 16,000 tokens for the model's response before every API call. The harness enforces this. Before each call, five compaction strategies run in sequence (cheapest first): Budget Reduction, Snip, Microcompact, Context Collapse, Auto-Compact.
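The enforcement itself is simple arithmetic. A sketch of the kind of check that would trigger compaction, with the thresholds taken from above and the function shape invented:

CONTEXT_WINDOW = 200_000
OUTPUT_RESERVE = 16_000   # tokens held back for the model's response

def over_budget(prompt_tokens: int) -> bool:
    # If the assembled prompt would eat into the output reserve,
    # the harness runs its compaction strategies before calling the API.
    return prompt_tokens > CONTEXT_WINDOW - OUTPUT_RESERVE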
Research on multi-step agentic workflows finds teams consistently underestimate their real token costs by 3 to 5x. Every turn appends to history, tool call results can be large, and the system prompt (including tool definitions) repeats in full on every single call.
CLAUDE.md is not in the system prompt
This surprised me when I first dug into it. CLAUDE.md gets injected as user content, alongside the first message turn, not into the system field.
This has a real consequence: project-level CLAUDE.md instructions get probabilistic compliance. System prompt instructions get closer to deterministic compliance. A CLAUDE.md file cannot override harness-level safety rules because it lives in the user-turn layer, which has lower authority than the system layer.
CLAUDE.md loads from multiple locations (project root, subdirectories, global config, project-scoped config). A 5,000-token CLAUDE.md is a 5,000-token tax on every turn. HumanLayer's real-world implementation keeps it under 60 lines.
There's a structural asymmetry worth understanding. When you connect an MCP server, its tool descriptions go into the tool definitions array, which sits at system context level with high authority. When you write CLAUDE.md instructions, they go into user content with lower authority. Your project's configuration instructions have less weight than the tools you install.
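To make the asymmetry concrete, here's roughly how the two injection points land in a request. The wrapper tag and variable contents are invented; the structural point is which field each ends up in:

harness_instructions = "You are a coding assistant..."               # harness-level rules
tool_schemas = [...]                                                   # built-in + MCP tool definitions
claude_md_text = "Use pnpm, not npm. Run tests before committing."    # project config
user_message = "Add a retry wrapper around the API client."

request = {
    "model": "claude-opus-4-6",
    "system": harness_instructions,   # system-level authority
    "tools": tool_schemas,            # also sits at system context level
    "messages": [
        {
            "role": "user",
            # CLAUDE.md rides along as user content: lower authority
            "content": f"<project_instructions>\n{claude_md_text}\n</project_instructions>\n\n{user_message}",
        }
    ],
}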
MCP tool bloat
The MCP ecosystem grew from 0 to 4,400+ implementations in five months after Anthropic's November 2024 release, exceeding 17,000 servers by late 2025. Each server dumps its full tool schema list into the context at session start.
Research published as RAG-MCP (arxiv:2505.03275) proposes a fix: use a lightweight retriever to semantically match the user's task against a tool index, then inject only the relevant tool descriptions into context. Results: 50%+ fewer prompt tokens, and tool selection accuracy improving from 13.62% to 43.13% compared to the full-schema baseline.
This is structurally identical to what Pi's lazy skills accomplish. Google's ADK has formalized the same pattern. Ten skills at L1 metadata (name and description only) come to roughly 1,000 tokens; ten skills at full schema, roughly 10,000. That's a 90% token reduction per skill slot, just by deferring schema loading until invocation.
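A minimal sketch of the retrieval step RAG-MCP describes, with the embedding function and tool index format as placeholders:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_tool_schemas(task, tool_index, embed, top_k=3):
    # tool_index: [{"name", "description", "embedding", "schema"}, ...]
    # embed: any text-embedding function; only the top-k schemas enter the context.
    task_vec = embed(task)
    ranked = sorted(tool_index, key=lambda t: cosine(task_vec, t["embedding"]), reverse=True)
    return [t["schema"] for t in ranked[:top_k]]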
Injection and the priority hierarchy
The system > user > assistant priority order is enforced through training, not cryptography. The model has no reliable way to distinguish authentic system instructions from attacker content claiming to be system instructions. Training creates a statistical tendency to defer to system-level content. It's not a guarantee.
Two injection paths in practice.
Direct injection: a user message attempts to override system instructions. The "ignore all previous instructions" pattern. Models trained on this attack surface handle it reasonably well on average, but not perfectly.
Indirect injection: the harness reads external content (web pages, files, tool results) that contains embedded instructions. The model processes it as data but may treat parts as instructions.
The indirect path is where agentic coding systems are genuinely vulnerable in production. When Claude Code reads a file or fetches a URL and feeds the result back into context, that content can contain instructions the model might act on.
MCP servers make this worse. When you connect to an MCP server, its tool descriptions go directly into the system prompt context. A malicious MCP server's tool description functions as a system-prompt-level injection. This is a documented attack vector in arxiv:2601.17548. Connecting to untrusted MCP servers means connecting to a potential system-level manipulation surface.
OWASP's 2025 Top 10 for LLM Applications ranks prompt injection at #1, present in 73%+ of production AI deployments assessed in security audits.
Practical defenses at the harness level: label all tool results as untrusted data in the system prompt and instruct the model to treat any instructions found in tool results as suspicious; validate tool outputs before feeding them back into context; use sandboxed execution for code or commands suggested by tool results; keep system prompt permissions narrow (broad permissions create a larger attack surface).
XML structure (next section) also helps. Clear structural markers make it easier for the model to distinguish authoritative instruction from untrusted data.
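One way a harness might implement the labeling defense above, with the tag name purely illustrative; the system prompt carries the matching rule that content inside the tag is data, never instructions:

def wrap_tool_result(tool_name: str, result: str) -> str:
    # Mark tool output as untrusted data before it re-enters the context window.
    return (
        f'<untrusted_tool_result tool="{tool_name}">\n'
        f"{result}\n"
        f"</untrusted_tool_result>"
    )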
Prompt Control-Flow Integrity, or PCFI (arxiv:2603.18433), is a more formal approach: a priority-aware middleware that models each request as structured segments (system, developer, user, retrieved content) and applies a three-stage pipeline before forwarding to the LLM.
When long system prompts hurt
A longer system prompt is not a safer one. There's solid evidence that past a point, extra length actively degrades model performance.
A study on system prompts in code generation (arXiv:2602.15228) evaluated 360 configurations across 4 models, 5 system prompts, 3 prompting strategies, 2 languages, and 2 temperature settings. Adding a "helpful assistant" wrapper with explicit rules degraded extraction accuracy by 10% and RAG compliance by 13% on Llama 3 8B, while improving instruction-following by 13%. The net is context-dependent, but the degradation is real and measurable.
Research on instruction-following reliability across 20 proprietary and 26 open-source LLMs (arXiv:2512.14754) found nuance-oriented reliability drops by up to 61.8% with nuanced prompts. Persona prompts alone reduce reliable performance by up to 8.2%.
The mechanism is straightforward: competing instructions degrade each other. A long system prompt that accumulates conflicting rules from different sections produces inconsistent behavior. Extra tool descriptions compete with task instructions for attention allocation. The model has to filter noise it shouldn't have to filter.
Frontier models can follow roughly 150 to 200 instructions with reasonable consistency. This is not an argument for writing 200 instructions. It's a ceiling, and approaching it is how you get unpredictable behavior.
Practical guidance from teams running production agents: HumanLayer caps CLAUDE.md at under 60 lines with only rules that apply universally to every task. OpenAI on AGENTS.md: table of contents, not encyclopedia. Anthropic's guidance: keep persona prompts minimal, avoid personalization that harms reliability. Typical production cap: 5,000 tokens for behavioral instructions, tools added on top.
Pi's thesis pushes further: if the model has been RL-trained on enough agentic data, you can shrink the system prompt toward zero and let the weights carry the behavioral knowledge. This works on frontier models. On smaller models without that training, it would likely fail.
Structure: XML vs markdown vs plain text
Anthropic explicitly recommends XML tags for structuring system prompts. This isn't stylistic. Claude was trained specifically to pay attention to XML structure.
XML tags create unambiguous section boundaries. Markdown whitespace and headers are ambiguous: a line break might be significant or might not. XML has explicit open and close delimiters that survive across any content type. Tokenization artifacts don't disrupt XML parsing the way they can disrupt markdown indentation.
A practical structure for Claude:
<role>You are a senior software engineer...</role>
<instructions>
When writing code:
- Follow existing patterns in the repo
- Add error handling for all external calls
</instructions>
<tools_policy>
Only use the bash tool to run tests, not to modify files directly
</tools_policy>
<context>
<project_info>...</project_info>
<current_date>2026-05-07</current_date>
</context>
For nested content:
<documents>
<document index="1">...</document>
<document index="2">...</document>
</documents>
How formats compare across providers:
| Approach | Best for | Weakness |
|---|---|---|
| XML tags | Claude (native training) | Verbose, unfamiliar to some devs |
| Markdown headers | GPT-4, Gemini | Ambiguous section boundaries |
| Plain text | Short prompts, simple personas | Degrades as prompt grows |
| JSON | Structured config injection | Hard for models to read as instructions |
XML also has a secondary benefit for injection defense. You can instruct the model to treat content outside <instructions> tags as untrusted data. This doesn't fully prevent injection, but it makes the structural boundary explicit and auditable.
Dynamic vs static prompts
A static system prompt is written once at harness design time and sent identically on every call. A dynamic prompt gets rebuilt before each API call, with runtime content injected.
What harnesses typically inject dynamically: current date and time (critical for scheduling, "recent" queries, anything time-sensitive), user-specific rules from preferences or org config, project context from CLAUDE.md or AGENTS.md, session state (what has already happened), tool availability (which tools are enabled), conditional mode sections (Plan mode vs. Auto mode in Claude Code), git status (current branch, uncommitted changes), and environment info (OS, shell, available commands).
Claude Code is fully dynamic. It rebuilds the system prompt on every call, assembling conditional blocks based on session mode and config. The assembly is specifically structured to maximize the cacheable prefix.
| Dimension | Static | Dynamic |
|---|---|---|
| Cost | Lower, cacheable | Higher, cache invalidates on change |
| Consistency | High, predictable behavior | Lower, can drift with context changes |
| Relevance | Lower, context gets stale | Higher, model has current info |
| Debugging | Easy, fixed string | Hard, prompt varies per call |
| Caching compatibility | Excellent | Limited unless you isolate the static prefix |
You don't have to pick one.
The hybrid strategy
Keep behavioral instructions at the top: persona, safety rules, tool policies. Mark them for caching. Append dynamic context at the end, where cache invalidation is bounded.
[STATIC - CACHED]
<role>...</role>
<instructions>...</instructions>
<tools_policy>...</tools_policy>
[DYNAMIC - NOT CACHED]
<session_context>
<date>2026-05-07</date>
<project>...</project>
<current_task>...</current_task>
</session_context>
Cache invalidation flows in one direction: tools, then system, then messages. If tool definitions change, everything downstream invalidates. If only messages change, tool and system caches stay valid. Keeping your tool definitions stable across calls is worth engineering for. Every unnecessary tool change is flushing your system prompt cache.
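In Anthropic's API the split maps directly onto the request: the system field accepts a list of text blocks, and a cache_control marker on the last static block sets the cache boundary. A sketch, with the static text and session-context helper as stand-ins:

STATIC_INSTRUCTIONS = "<role>...</role>\n<instructions>...</instructions>\n<tools_policy>...</tools_policy>"

def build_session_context():
    # Hypothetical helper: date, project info, current task, rebuilt per call
    return "<session_context>\n<date>2026-05-07</date>\n<project>...</project>\n</session_context>"

system = [
    {
        "type": "text",
        "text": STATIC_INSTRUCTIONS,             # persona, safety rules, tool policies
        "cache_control": {"type": "ephemeral"},  # cache boundary: everything up to here is reusable
    },
    {
        "type": "text",
        "text": build_session_context(),         # changes per call, never cached
    },
]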
Caching
Anthropic prompt caching can cut costs by up to 90% on frequently repeated system prompts. It's not automatic. You have to annotate the call explicitly.
system = [
{
"type": "text",
"text": "Your long static system prompt...",
"cache_control": {"type": "ephemeral"} # 5-minute TTL
}
]
Pricing: standard input tokens at 1x base price; cache write (5-minute TTL) at 1.25x; cache write (1-hour TTL) at 2x; cache read at 0.1x (90% discount). The 5-minute cache breaks even after a single read. For any system prompt called more than once every five minutes, it pays for itself immediately.
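The break-even arithmetic, using those multipliers:

# Relative cost of the same system prompt across two calls inside the 5-minute TTL
uncached = 1.00 + 1.00   # two full-price reads
cached = 1.25 + 0.10     # one cache write, then one cache read
# 1.35x vs 2.00x: the cache pays for itself on the first read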
Latency impact: Anthropic reports up to 85% reduction for long prompts. A 100,000-token book prompt dropped from 11.5 seconds to 2.4 seconds in their testing.
At scale, caching shifts from an optimization to an architectural decision. ProjectDiscovery cut LLM costs by 59% through prompt caching. A platform calling Claude 10,000 times per day with a 50,000-token system prompt pays roughly $3,500/day without caching, and around $525/day with a 90% cache hit rate, at $7 per million input tokens. That's a meaningful budget line.
OpenAI caches automatically for inputs over 1,024 tokens, with a 50% discount on hits and no explicit annotation needed. Google Vertex AI and Gemini support context caching billed separately from inference.
Anthropic's model gives you more control (explicit annotation, TTL choice) at the cost of requiring that work upfront. Worth knowing: many third-party clients that wrap the Anthropic API don't implement caching at all. A perfectly static system prompt won't be cached unless the client explicitly requests it in the API call.
Provider comparison
| Dimension | Anthropic / Claude | OpenAI / Codex | Google ADK |
|---|---|---|---|
| System field name | system param (outside messages array) | "role": "developer" (o1+) or instructions param | system_instruction field |
| Recommended structure | XML tags | Markdown headers | description + instruction fields |
| Project config file | CLAUDE.md (user content, not system) | AGENTS.md (injected into context) | Handled in code via ADK |
| Caching | Explicit cache_control, 5-min or 1-hour TTL | Automatic for >1,024 tokens, 50% discount | Separate context caching |
| Progressive disclosure | Lazy skills (Pi pattern) | AGENTS.md as table of contents | L1 metadata / full schema per skill |
What this changes about how you build
Treat the system prompt like code, not a text field. Version-control it. Profile it for token cost. When agent behavior changes unexpectedly, check what the harness is assembling into the system prompt before assuming the model changed.
Tool definitions dominate the budget. If your agents are expensive or slow, audit how many tool schemas you're loading on every call. A single unused MCP server can cost 10,000+ tokens per call. At meaningful call volumes, lazy loading is the difference between a viable cost structure and an unviable one.
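One cheap way to run that audit is the token-counting endpoint from earlier, called with and without a given server's tool schemas. The tool lists and system text here are placeholders:

probe = [{"role": "user", "content": "hi"}]

with_server = client.messages.count_tokens(
    model="claude-opus-4-6",
    system=SYSTEM_PROMPT,                      # your static system text
    tools=builtin_tools + mcp_server_tools,    # everything currently loaded
    messages=probe,
)
without_server = client.messages.count_tokens(
    model="claude-opus-4-6",
    system=SYSTEM_PROMPT,
    tools=builtin_tools,                       # same call minus the MCP server
    messages=probe,
)
print(with_server.input_tokens - without_server.input_tokens)  # that server's per-call cost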
If you're on Anthropic's API and not using cache_control, you're paying full price on every call for content the model processed last time. The annotation is four lines of code. At any real call volume, the savings are not marginal.
The gap between Pi's 800 tokens and Claude Code's 27,000 is a real design decision, not carelessness on either side. Pi is optimized for frontier models on focused tasks with minimal safety surface area. Claude Code handles a wider surface area with stronger guarantees. The point isn't which is right. It's knowing what each token is paying for, and cutting the ones that aren't doing work.
Next in the series
The next post covers security in harness design: sandboxing strategies, least-privilege tool design, what a prompt injection attack on a production harness actually looks like at the request level, and what path traversal looks like when the harness is the attack surface.
Read it here: The Harness Is Your Last Line of Defense
If you haven't read the pillar post on agent harnesses, that's the right starting point for this series.
References
Anthropic documentation
OpenAI documentation
Claude Code reverse engineering
- Piebald-AI/claude-code-system-prompts
- How Claude Code Builds a System Prompt - Drew Breunig
- Claude Code Token Limits - Faros
- Dive into Claude Code - arXiv:2604.14228
Pi harness
- What I Learned Building a Minimal Coding Agent - Mario Zechner
- pi-mono GitHub repo
Research: length, structure, performance
- Empirical Study on System Prompts in Code Generation - arXiv:2602.15228
- Revisiting Reliability of LLMs in Instruction-Following - arXiv:2512.14754
- RAG-MCP: Mitigating Prompt Bloat via RAG - arXiv:2505.03275
Security and injection
- OWASP LLM Top 10 2025 - Prompt Injection #1
- Prompt Control-Flow Integrity - arXiv:2603.18433
- Prompt Injection on Agentic Coding Assistants - arXiv:2601.17548
Caching
- How We Cut LLM Costs 59% With Prompt Caching - ProjectDiscovery
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.