Learn/Agents, Tools & Harnesses/Lesson 03

Lesson 03

What a harness is and what it owns

The harness is the hidden layer that turns a language model into something that can work on real tasks. Same model, different harness, wildly different outcomes. This lesson maps what the harness actually controls.

The one idea

A harness is everything in an agent except the model: the loop, tools, memory, permissions, context assembly, logging, compaction, and stop rules. It is the operating system the model runs inside.

The shortest definition

If the model is a brain in a jar, the harness is the body and the staff. The brain proposes plans and language. The harness decides what the brain is allowed to see, what actions actually run, and when the session ends.

That is not metaphor for long. Production teams have independently converged on an OS analogy:

Model ≈ CPU: general reasoning, no direct I/O.
Context window ≈ RAM: limited, fast, volatile working set.
Tools ≈ system calls: controlled gateways to the outside world.
Harness ≈ kernel: scheduling, permissions, memory management, recovery.

The analogy breaks in places (the "CPU" is stochastic, context is not addressed byte-by-byte), but it is useful for design. You would not blame Intel because your process leaked file handles. You fix the OS layer.

What the harness owns

Concrete responsibilities, grouped:

The agent loop

Repeat until done: assemble prompt, call model, parse response, execute tools or return answer, append to state. Every product agent is a variation on that loop.

Tool system

Registry, schemas, descriptions, validation, execution, error formatting, timeouts, and idempotency policy for writes.

Memory

At least three kinds:

In-context: messages and tool results currently in the window.
External: vector DB, notes file, ticket system the harness retrieves into context.
Working: scratch plans the agent writes mid-task (todo lists, progress summaries).

Permission and budgets

Hard caps on iterations, tokens, spend, and tool scope. Path allowlists. Human approval gates for destructive actions. These belong in code. Prompts asking nicely are soft limits only.

Context management

What files, docs, and history get injected each turn. What gets summarized, snipped, or dropped when the window fills. This is where long tasks live or die.

Persistence and recovery

Session logs, crash resume, file snapshots before edits, revert commands. The harness makes multi-hour coding sessions possible, not the model's raw memory.

Observability

Structured logs per iteration: tool name, latency, token counts, stop reason. Without this you cannot debug agent runs that cost $12 and return nothing useful.

Reference architecture: every production agent is this stack. Vendor products differ in how much of each box they ship.

Destructive or high-risk tools (writes, sends, refunds) should queue in a pending state until a human approves. Flow:

Model emits tool call → harness validates schema and policy.
If risk class is requires_approval, persist pending call with diff preview (file patch, SQL, recipient list).
UI shows approve / reject; loop pauses with stop_reason=awaiting_human.
On approve, harness executes once with idempotency key; on reject, return structured error so the model can replan.

Never double-execute after resume: clear pending state atomically. Auto-approve read tools; default-deny exfiltration-shaped tools (webhooks, bulk export). Pair with Safety L03 for least-privilege scopes per approved action.

Safety and the harness

Instruction hierarchy, injection defense, and tool sandboxing are harness responsibilities—not model politeness requests. When you harden an agent, start with permissions and validation in this layer, then add model-level guardrails. The Safety & Guardrails course picks up where this lesson leaves off; lesson 03 there maps directly to tool least privilege.

The system prompt is harness configuration

Most engineers treat the system prompt as a personality blurb. In agents it is closer to a config file: role, safety rules, tool-use conventions, output format, project context, and sometimes megabytes of tool schemas.

Pi ships on the order of hundreds of tokens of system prompt. Claude Code ships tens of thousands. Both can ship working code. The gap is not "who cares about prompts." It is different harness philosophies: minimal surface vs exhaustive in-prompt guidance, often paired with prompt caching to make the cost workable.

The system field also sits outside the user/assistant message stream on major APIs. That separation matters for injection defense and for caching static prefix content across turns.

Long system prompts and tool schemas repeat every turn. Providers let you cache stable prefix tokens so later turns pay less for the same content. Harness design affects bill: what you keep static (cacheable) vs what you inject dynamically (fresh each turn). Putting volatile data in the cached block defeats the purpose.

Same model, different harness, different score

Reported benchmark jumps from harness changes alone are not edge cases. Teams have documented double-digit percentage point swings on coding benchmarks by changing only the agent wrapper around identical weights.

That has two implications:

When you evaluate vendors, you are often choosing a harness, not just a model API.
When your agent underperforms, upgrading the model is not always the first lever. Tool descriptions, compaction, and stop rules may move the needle more cheaply.

Frontier labs also co-train models on traces from their own harnesses. Claude gets better inside Claude Code-shaped loops. Codex gets better inside Codex-shaped loops. The alignment is real but harness-specific.

Engineering reality

Harness code is where security lives. The model cannot traverse directories; your path validation can fail open. The model cannot exfiltrate secrets; your tool that posts to Slack can. Threat model the harness as privileged infrastructure, not the model as the only risk surface.

Build vs buy

Using Claude Code or Codex means buying a mature harness plus co-trained model behavior. Building your own means owning the flywheel: your traces, your tool schemas, your evals, your compaction strategy.

Neither is universally correct. Internal tools with strict compliance needs often need custom harnesses. Developer velocity for a small team may favor a hosted agent until the shape of the product stabilizes.

The decision frame: is your differentiation in the harness workflow (how work gets done) or in the model's raw reasoning? Most product teams underestimate the first.

How harnesses grew up (2023 to 2026)

Early agent demos were thin: a while loop, three tools, no persistence. Production harnesses accumulated layers as workloads got real:

Context compaction when sessions exceeded a few dozen steps.
Permission systems when users lost trust after one bad rm.
Session replay when debugging took longer than fixing.
Multi-agent scheduling when single-context agents hit quality ceilings.

This mirrors OS history: kernels start minimal, then add paging, permissions, and schedulers when applications demand it. Scope creep that is actually responding to failure modes.

Multimodal harnesses (preview)

Text harnesses treat message content as strings. Multimodal harnesses treat content as typed blocks: text, image, audio, file references. Tool results may return image blocks (screenshots) or document blocks (PDF pages).

Everything that touches context (truncation, token counting, logging) must understand block types. Image tokens count toward the same budget as text. A screenshot tool without harness-side budgeting can burn thousands of tokens per turn.

Vision-heavy agents are still agents. The harness job list grows: capture, encode, store, reference by file ID, and compact visual history aggressively.

Instruction hierarchy (preview)

System prompt and harness policy sit at high trust. User messages at medium trust. Tool results and fetched web content at low trust: data to analyze, not commands to obey.

Encoding that hierarchy is harness work: XML tags, role separation, sanitization, and refusing to promote untrusted text into system fields. Models are trained to resist some injection, but you still gate dangerous tools in code.

What you control when building

If you build a custom harness, you own the roadmap: which tools exist, how context compacts, what gets logged, how approvals work, and how evals replay production failures. If you buy a hosted agent, you trade control for speed and inherit the vendor's co-training benefits.

Neither choice removes the need to understand this layer. Even hosted products break when your use case needs tools, policies, or observability the default harness does not ship.

Checklist: harness readiness

Before shipping an agent to real users, confirm the harness has: iteration cap, cost cap, tool validation, structured logging, compaction strategy, approval path for destructive tools, and at least one eval replayed from logs. Missing any one item is a common source of launch-week incidents.

Hosted agents ship some of this by default. Custom stacks must implement each line explicitly. Use the checklist as a gap analysis against your current repo.

Relationship to the rest of Track 2

RAG lessons cover how evidence enters context. This course covers what happens after: tools, loops, and side effects. Evaluation and safety courses later assume you know where the harness ends and the model begins. That boundary is the spine of applied AI engineering.

Reading traces like a harness engineer

Open any session log and scan in order: user goal, tools registered, iteration count, compaction events, stop reason. You should infer the story in two minutes. If the log cannot tell that story, fix logging before tuning prompts.

Traces are also training data if you build your own flywheel. Redact PII before storage. The harness owns retention policy.

Comparing hosted vs custom (honest tradeoffs)

Hosted coding agents ship compaction, permissions, and logging you would spend quarters building. Custom harnesses ship when you need domain tools, air-gapped deploys, or compliance controls vendors lack. The mistake is assuming hosted means "no harness thinking." You still configure tools, approvals, and data boundaries.

Custom without observability is worse than hosted with logs. Pick based on whether your edge is workflow or model choice.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the one-sentence definition of a harness?
Name four subsystems the harness owns besides the model call itself.
Why are hard iteration caps harness code, not prompt text?
Why can the same model score differently with different harnesses?

Quick check

Validating and executing tool calls
Deciding what history fits in context
Updating the model's neural network weights during a session
Writing session logs for replay

Physical memory chips
System calls into kernel-managed resources
The graphical desktop

Hard limits in code cannot be ignored by the model
System prompts cost less than JavaScript
Models always follow system prompt requests

A faster GPU
A full harness: tools, loop, permissions, and session management
Different pretraining data only