GPT-5.5 jumped from 61.5% to 87.2% on a standard functionality benchmark. The team didn't upgrade the model. They didn't fine-tune it. They didn't change the training data.

They changed the harness.

Same weights. Same model. 25.7 percentage points. That should make you question what you're actually measuring when you compare Claude to GPT or Gemini to Codex. Because a lot of what you attribute to the model is the harness.


Model, harness, agent: get these straight first

These three terms get used interchangeably. They mean different things.

| Term | What it is | Real example |
| --- | --- | --- |
| Model | The neural network that reasons and generates text | Claude Opus 4.6, GPT-5.5, Llama 4 |
| Harness | The infrastructure layer: the loop, tools, memory, permissions | The tool system and permission pipeline inside Claude Code |
| Agent | The complete system (model + harness) doing a task end to end | Claude Code, Codex, OpenCode, OpenClaw |

Every agent contains a harness. The harness is not the agent. It's the infrastructure inside the agent.

When you use Claude Code, you're running an agent. The 19 permission-gated tools, the conversation loop, the file-snapshot system before every edit. That's the harness. When you use Codex, same story.

Why does this matter? Because when someone says "Claude is better at coding than GPT," they're almost always comparing Claude Code to Codex. Two agents. Two harnesses. Not two raw models. The model difference is real, but a meaningful chunk of what you're observing is the harness.

This post is about harnesses. If you're evaluating which tool to use, you're choosing an agent. If you're deciding what to build on, you're designing a harness. Different question.


What a harness actually is

The simplest definition: a harness is everything in an AI agent except the model.

The model is a brain in a jar. It can reason. It can generate text. It cannot open a file. It cannot run a command. It cannot remember anything from last session. It cannot stop itself from running forever.

The harness gives the brain a body. It decides what tools the model can call, what information the model sees before it reasons, how long the model is allowed to keep working, what the model is and is not permitted to do, and what gets saved when the session ends.

That's the job.


The agent loop

The harness runs a loop. Every harness, from the simplest weekend project to Anthropic's production system, is a variation of this:

text
1. Take user input
2. Build a prompt: input + context + memory + tool definitions
3. Call the model
4. Model responds with either:
   a. A final answer → done
   b. A tool call request → run the tool, append the result, go back to step 3
5. Repeat until the model says it's done, or you hit a hard cap

The model never directly touches your filesystem. It never runs a bash command. What it does is output structured text that says "I want to call read_file with path index.ts." The harness reads that, runs the actual function, and hands the result back to the model.

The model decides what to try. The harness decides what's permitted and what's real.
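
Concretely, that structured text is a typed content block in the model's response, and the harness answers with a matching result block. A sketch of the round trip in the Anthropic Messages API's shapes (the id value is illustrative):

typescript
// What the model emits: a tool_use content block.
const toolCall = {
  type: "tool_use",
  id: "toolu_01A",              // assigned by the API; links request to result
  name: "read_file",
  input: { path: "index.ts" },
};

// What the harness sends back after running the real function.
const toolResult = {
  type: "tool_result",
  tool_use_id: "toolu_01A",     // must match the id above
  content: "export const answer = 42;\n",
};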


What lives inside a harness

Tool system

The harness maintains a registry of callable functions. It tells the model what tools exist, what they do, and what arguments they accept. When the model requests a tool call, the harness validates the arguments, executes the function, and returns a structured result.

The quality of your tool descriptions directly affects model performance. A vague description produces vague tool usage. A precise description with clear argument constraints produces consistent, recoverable tool calls.
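
To make that concrete, here are two descriptions for the same hypothetical issue-search tool; the names and constraints are illustrative:

typescript
// Vague: the model guesses at query formats, result shapes, and limits.
const vague = {
  name: "search",
  description: "Search for things",
  input_schema: {
    type: "object",
    properties: { q: { type: "string" } },
    required: ["q"],
  },
};

// Precise: the model knows exactly what to send and what comes back.
const precise = {
  name: "search_issues",
  description:
    "Full-text search over the project's issue tracker. Returns up to " +
    "`limit` matches as JSON, newest first. Plain-text queries only; " +
    "no boolean operators.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Plain-text search query" },
      limit: { type: "number", description: "Max results, 1-50. Default 10." },
    },
    required: ["query"],
  },
};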

Memory

Harnesses manage several kinds of memory at once. In-context memory is what's currently in the model's context window: all previous messages, tool calls, and results from the current session. External memory is a database the harness queries before each call, retrieving relevant past context and injecting it so the model appears to remember things it technically can't hold in its window. Working memory is the scratchpad for multi-step reasoning: intermediate outputs and plans the model writes to itself across steps.

Poor memory management is the most common reason a capable model fails on long tasks. The context window fills, the oldest and often most important context gets truncated, and the model loses track of what it was doing.
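
A minimal sketch of how a harness wires external memory into each call, assuming some retrieval function backed by whatever store you use (vector DB, SQLite, a flat file):

typescript
import type Anthropic from "@anthropic-ai/sdk";

// Hypothetical store lookup; the implementation is whatever your stack provides.
declare function retrieveRelevant(
  query: string,
  opts: { limit: number }
): Promise<string[]>;

async function buildMessages(
  userInput: string,
  history: Anthropic.MessageParam[]  // in-context memory: the current session
): Promise<Anthropic.MessageParam[]> {
  // External memory: queried before every call, injected into the prompt.
  const recalled = await retrieveRelevant(userInput, { limit: 5 });

  return [
    ...history,
    {
      role: "user",
      content:
        `Relevant notes from past sessions:\n${recalled.join("\n")}\n\n` +
        userInput,
    },
  ];
}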

Permission and budget enforcement

Production harnesses enforce limits in code, not prompts. This distinction matters more than it sounds.

You can write "please don't exceed 10 steps" in your system prompt. The model might ignore it, especially under pressure from a complex task. A counter in your harness loop that throws at iteration 10 cannot be ignored.

typescript
// In the system prompt: "Please complete in 10 steps or fewer"
// The model can decide this doesn't apply here.

// In the harness loop:
const MAX_ITERATIONS = 10;
while (iterations < MAX_ITERATIONS) {
  iterations++;
  // ... agent loop
}
throw new Error("Budget exceeded");
// The model has no say in this.

The same applies to tool permissions. "Only read files in the project directory" is a suggestion. Path validation in executeTool that throws before touching anything outside the allowed path is a guarantee.
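
A sketch of that guarantee, assuming a single allowed root directory:

typescript
import * as path from "path";

const ALLOWED_ROOT = path.resolve(process.cwd());

// Resolve first, then compare. Resolving defeats "../" traversal and
// absolute-path tricks before any filesystem call happens.
function assertInsideProject(requested: string): string {
  const resolved = path.resolve(ALLOWED_ROOT, requested);
  const inside =
    resolved === ALLOWED_ROOT ||
    resolved.startsWith(ALLOWED_ROOT + path.sep);
  if (!inside) {
    throw new Error(`Path outside project directory: ${requested}`);
  }
  return resolved;
}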

Context management

The harness controls what information enters each model call: which files are relevant, which prior messages are still needed, which docs to inject. On long-running tasks, old messages get compressed. The model doesn't need the full verbatim history of 40 tool calls from 20 minutes ago. It needs a compact summary of what's been decided. The harness handles that compression.
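
One common shape for that compression, sketched here with a hypothetical summarize helper (in practice, one cheap model call that condenses old turns):

typescript
import type Anthropic from "@anthropic-ai/sdk";

// Hypothetical: condenses old turns into a short progress summary.
declare function summarize(old: Anthropic.MessageParam[]): Promise<string>;

const KEEP_RECENT = 10;

async function compressHistory(
  messages: Anthropic.MessageParam[]
): Promise<Anthropic.MessageParam[]> {
  if (messages.length <= KEEP_RECENT) return messages;

  // Everything older than the last N turns becomes one compact summary turn.
  // Production: cut on turn boundaries so tool_use/tool_result pairs
  // are never split apart.
  const summary = await summarize(messages.slice(0, -KEEP_RECENT));

  return [
    { role: "user", content: `Summary of earlier progress:\n${summary}` },
    ...messages.slice(-KEEP_RECENT),
  ];
}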

State persistence

Claude Code writes every message and tool result to a JSONL file under ~/.claude/projects/. If the process crashes mid-task, nothing is lost. You resume from exactly where you stopped. It also snapshots affected files before every edit. Revert is always one command away. The harness made that possible, not the model.
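
The mechanism is simple enough to sketch. The file layout below is illustrative, not Claude Code's actual format:

typescript
import * as fs from "fs/promises";

const SESSION_FILE = ".agent/session.jsonl";  // illustrative path; assumes the directory exists

// Append-only: one JSON object per line, written as the session runs.
// A crash loses at most the line being written.
async function persist(entry: object): Promise<void> {
  await fs.appendFile(SESSION_FILE, JSON.stringify(entry) + "\n");
}

// Resume: replay every line to rebuild the message history.
async function restore(): Promise<object[]> {
  const raw = await fs.readFile(SESSION_FILE, "utf-8");
  return raw.split("\n").filter(Boolean).map((line) => JSON.parse(line));
}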


A minimal harness, annotated

The skeleton below runs the agent loop and handles tool calls. The comments explain the architecture, not the syntax.

typescript
import Anthropic from "@anthropic-ai/sdk";
import * as fs from "fs/promises";

const client = new Anthropic();

// ── Tool registry ──────────────────────────────────────────────────────
// The harness declares what tools exist and how they're shaped.
// The model never sees the implementation, only this description.
// The quality of the description determines the quality of tool usage.

const tools = [
  {
    name: "read_file",
    description: "Read the full contents of a file at the given path",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "Absolute or relative file path" }
      },
      required: ["path"],
    },
  },
  {
    name: "write_file",
    description: "Write content to a file, creating it if it does not exist",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        content: { type: "string" },
      },
      required: ["path", "content"],
    },
  },
];

// ── Tool execution ─────────────────────────────────────────────────────
// This is where tool calls become real.
// The model outputs { name: "read_file", input: { path: "index.ts" } }
// The harness decides whether to run it, runs it, formats the result.
// Production harnesses add: path validation, sandboxing, per-tool error handling.

async function executeTool(
  name: string,
  input: Record<string, string>
): Promise<string> {
  if (name === "read_file") {
    // No path validation here.
    // A production harness confirms input.path is inside an allowed directory.
    return await fs.readFile(input.path, "utf-8");
  }
  if (name === "write_file") {
    await fs.writeFile(input.path, input.content);
    return `Written ${input.content.length} chars to ${input.path}`;
  }
  throw new Error(`Unknown tool: ${name}`);
}

// ── The agent loop ─────────────────────────────────────────────────────
async function runAgent(userMessage: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage }
  ];

  // Hard cap. The model cannot negotiate around this.
  let iterations = 0;
  const MAX_ITERATIONS = 10;

  while (iterations < MAX_ITERATIONS) {
    iterations++;

    // Every call sends the full conversation history plus the tool registry.
    // The model sees all prior messages, all prior tool calls, all prior results.
    const response = await client.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });

    // stop_reason "end_turn": the model is satisfied. Return to the user.
    if (response.stop_reason === "end_turn") {
      return response.content;
    }

    // stop_reason "tool_use": the model wants to call one or more tools.
    // Append the model's response so it has a record of what it asked for.
    messages.push({ role: "assistant", content: response.content });

    // Execute every tool the model requested in this turn.
    // A single response can contain multiple tool_use blocks.
    const toolResults: Anthropic.ToolResultBlockParam[] = [];

    for (const block of response.content) {
      if (block.type === "tool_use") {
        const result = await executeTool(
          block.name,
          block.input as Record<string, string>
        );

        // tool_use_id links this result to the specific request.
        // Without it, the model cannot tell which result belongs to which call.
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: result,
        });
      }
    }

    // Defensive guard: stop_reason was not "end_turn" and no tool calls
    // arrived (e.g. "max_tokens"). Sending an empty user turn would be an
    // API error, so fail loudly instead.
    if (toolResults.length === 0) {
      throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
    }

    // Feed results back as a "user" turn.
    // On the next iteration, the model sees its request and the result,
    // then decides whether to call more tools or return a final answer.
    messages.push({ role: "user", content: toolResults });
  }

  throw new Error("Agent exceeded maximum iterations: task incomplete");
}

Here is what this skeleton is missing, and why each gap matters:

| Missing piece | What breaks without it |
| --- | --- |
| Path validation | The model can read or write anywhere on your filesystem |
| Per-tool error handling | One failed tool call kills the entire session |
| Context compression | The context window overflows on tasks longer than ~30 turns |
| State persistence | Any crash loses all progress |
| Sandboxing | A bash tool built this way can delete your filesystem |
| Observability | You have no visibility into what the model is doing or why it stopped |

This skeleton gives you a working agent loop in an afternoon. Getting from this to something production-ready takes weeks. That gap is where Anthropic and OpenAI have put serious engineering time.


How the big providers build their agents' harnesses

These are all agents, complete systems. What's worth understanding is the harness design inside each of them.

Anthropic: Claude Code

The harness exposes 19 permission-gated tools: file read and edit, bash, git operations, web fetch, and MCP calls. The permission system is a layered pipeline, not a single check. A tool call passes through general allow/deny rules, then tool-specific checks, then automated classifiers, and only reaches interactive user approval if nothing upstream resolves it.

Sessions are persisted as JSONL under ~/.claude/projects/. Every edit is preceded by a file snapshot. The three-phase inner loop (gather context, take action, verify results) blends throughout a task rather than running in strict sequence.

Anthropic's engineering blog has the most detailed public writeup on harness design from any major provider. Start with "Harness design for long-running application development".

OpenAI: Codex

Codex separates the agent into two layers. Codex Core is the library where all agent logic lives. The App Server is a long-lived JSON-RPC process that hosts Codex threads. All three Codex surfaces (CLI, Cloud, VS Code extension) run on the same Core underneath, which means the harness behavior is consistent across environments.

AGENTS.md is the repo-level config file the harness picks up automatically. You write it once: coding conventions, tool preferences, context about the project. Every Codex session reads it before starting. Shell and file tools run inside a sandbox. MCP servers integrate via a consistent policy layer.
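
A minimal illustrative AGENTS.md; the convention is real, the contents here are hypothetical:

markdown
# AGENTS.md

## Conventions
- TypeScript strict mode; no `any`
- Tests live next to source files as `*.test.ts`

## Commands
- Build: `npm run build`
- Test: `npm test` (run before proposing any change)

## Context
- `src/legacy/` is frozen; do not edit without asking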

OpenAI published three engineering posts worth reading: "Unrolling the Codex agent loop," "Harness engineering: leveraging Codex in an agent-first world," and "Unlocking the Codex harness: how we built the App Server."

Google: Agent Development Kit (ADK)

Introduced at Google Cloud NEXT 2025. Open-source, Python-based, model-agnostic in practice (though optimized for Gemini and Vertex AI). The distinguishing architectural choice is graph-based multi-agent orchestration: agents are nodes, task delegation is an edge. This makes complex multi-agent workflows explicit and inspectable rather than emergent and hard to debug.

Alongside ADK, Google introduced the A2A (Agent2Agent) protocol in 2025, a standard for how agents built on different harnesses communicate with each other. More than six trillion tokens flow through ADK monthly. GitHub: google/adk-python.

Meta and DeepSeek: models without a first-party harness

Meta ships Llama 4 as weights. No harness. DeepSeek V4 (1.6 trillion total parameters, 49 billion active) is explicitly designed for agentic workflows (long-context reasoning, coding, multi-step tool use) but also ships without a first-party harness. Both rely on the open-source ecosystem for the infrastructure layer.

This is a deliberate product choice. And it creates a real dynamic worth understanding: a team running DeepSeek V4 inside a well-designed custom harness can outperform a team running a stronger model inside a poorly designed one. The harness is that variable.


Open source: Pi and what it proves

Pi: a minimal harness framework

Pi is not a finished agent you install and run. It's a minimal harness core by Mario Zechner. The infrastructure layer you extend to build your own agent. The creator frames it explicitly as a harness design, not a product. That framing is the whole point.

The philosophy: frontier models have been trained extensively on agentic tasks. They know what a coding agent is. You don't need 10,000 tokens of system prompt to explain it to them.

So Pi ships with four tools (read, write, edit, bash), a system prompt under 1,000 tokens, no lock-in to any model provider, and conversations stored as branching tree structures in JSONL rather than a flat list.

That's it. And it works.
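
The branching storage is easiest to see as data. A hypothetical sketch of the idea, not Pi's actual on-disk format: every entry carries a parent id, so two branches can share a common prefix:

text
{"id":"a1","parent":null,"role":"user","content":"Refactor the parser"}
{"id":"a2","parent":"a1","role":"assistant","content":"Plan A: rewrite the lexer"}
{"id":"a3","parent":"a1","role":"assistant","content":"Plan B: patch the grammar"}
{"id":"a4","parent":"a3","role":"user","content":"Go with Plan B"}

A flat list can only hold one of those histories. The tree keeps both, and a session can resume from any node.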

The extension model is where it gets interesting. You write a TypeScript module. The harness loads it at runtime. Your extension can register new tools, subscribe to harness events, add keyboard shortcuts, override permission gates for specific paths, or spin up sub-agents without touching the core agent loop.

typescript
// A Pi extension that adds a GitHub tool and listens to events.
// PiRuntime comes from Pi's extension API; Octokit is the extension's
// own dependency. The harness core knows nothing about GitHub.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

export default function githubExtension(pi: PiRuntime) {

  // Register a new tool into the harness at runtime.
  // The model discovers this from the registry, not from instructions.
  pi.registerTool({
    name: "create_pull_request",
    description: "Create a GitHub PR for the current branch",
    input_schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        body: { type: "string" },
      },
      required: ["title"],
    },
    execute: async (input) => {
      const pr = await octokit.pulls.create({ /* ... */ });
      return `PR created: ${pr.data.html_url}`;
    }
  });

  // Subscribe to harness events.
  pi.on("tool:after", ({ name }) => {
    console.log(`[audit] ${name} executed`);
  });

  // Scoped permission override for a specific path.
  pi.allowWrite("/tmp/pi-sandbox/**");
}

Notice what's happening here. You're not prompting the model differently. You're changing what the harness can do. The tool exists in the registry. The model uses it. That's a different mental model from "add more instructions to your CLAUDE.md."

A 4-tool, 1,000-token harness can produce the same quality output as a 19-tool, 10,000-token one if the task is well-scoped and the model is strong. Harness size is not correlated with capability. Pi is a demonstration of what you can strip out without losing anything important. Claude Code's larger harness handles more surface area, more tool types, more permission complexity. For a focused coding task, Pi's minimal harness frequently matches it at a fraction of the token overhead.

Open-source agents worth knowing as harness case studies

Two agents are worth a brief mention here, not as harnesses in their own right, but for the harness decisions inside them.

OpenCode (150,000+ GitHub stars, 6.5M monthly developers) made one central harness decision: model-agnostic by design. You connect any model via API key. The harness makes no assumptions about your provider. It supports both CLAUDE.md and AGENTS.md conventions, so it picks up your project context regardless of which ecosystem you came from. That's a direct contrast to Claude Code's co-designed model-harness approach.

OpenClaw / ClawdBot (created November 2025 by Peter Steinberger) made a different one: full UI decoupling. The same harness logic runs regardless of whether you're talking through WhatsApp, Telegram, Discord, Slack, Signal, iMessage, or any of 50+ other integrations. Model, harness, and interface are three independent layers. The chat surface you see and the harness doing the work have nothing to do with each other.
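
The decoupling pattern itself is small enough to sketch. The names here are hypothetical, not OpenClaw's actual API:

typescript
// Every chat surface reduces to the same tiny interface.
interface ChatSurface {
  onMessage(handler: (userId: string, text: string) => Promise<void>): void;
  send(userId: string, text: string): Promise<void>;
}

// The harness logic is written once, against the interface.
function attachAgent(
  surface: ChatSurface,
  runAgent: (input: string) => Promise<string>
): void {
  surface.onMessage(async (userId, text) => {
    const reply = await runAgent(text);
    await surface.send(userId, reply);
  });
}

// Each integration is just an adapter:
//   attachAgent(new TelegramSurface(token), runAgent);
//   attachAgent(new SlackSurface(token), runAgent);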


Models are co-trained with their harnesses

Anthropic collects traces, tool call sequences, and feedback from real Claude Code sessions and feeds them back into RLHF training. The model gets progressively better at using Claude Code's specific tools, respecting its permission patterns, and producing outputs in the formats Claude Code expects.

OpenAI does the same with Codex. The model is fine-tuned against the tool formats, the AGENTS.md conventions, and the interaction patterns of the Codex harness.

The result is a feedback loop:

text
Harness generates real-world usage data
      |
      v
Data trains the model
      |
      v
Model performs better inside that harness
      |
      v
Harness collects more quality data
      |
      v
Repeat

A Claude model running inside Claude Code will outperform the same Claude model running inside a generic third-party harness. Not because of prompt engineering. Because the model was trained on Claude Code's harness patterns specifically. The harness and model are co-designed, not just co-deployed.

This has a direct implication for how you read benchmarks. Most "Claude vs. GPT" comparisons are not model vs. model. They're Claude Code vs. Codex: two agents, two harnesses, two separate training loops. You cannot cleanly separate the model difference from the harness difference without holding the harness constant and swapping only the model. Almost no benchmark does that.

The data backs this up. Claude Opus 4.5 paired with Haiku 4.5 sub-agents inside a multi-agent harness achieved 87% task completion vs. 74.8% for Opus alone. Same model generation. Different harness configuration. A 12.2-point gap.

Evaluate the model inside the harness you're actually going to use. Paper benchmarks run in single-turn, no-tool conditions. They measure model quality in isolation. They tell you very little about agentic performance, which is a function of model and harness together.


Build, customize, or adopt?

| Approach | When it makes sense | The real cost |
| --- | --- | --- |
| Use an existing agent (Claude Code, Codex, OpenCode) | You want something working today and don't have custom requirements | No control over the harness; the provider's system prompt competes with yours for context budget |
| Extend a harness framework (Pi) | You need specific tools, multi-model support, or want to own the loop | A few hours of setup; ongoing maintenance is yours |
| Build from scratch | Your domain has requirements nothing existing covers | You rebuild sandboxing, memory, permissions, and error recovery from zero; plan for weeks |

Teams that have done all three tend to land on the same rule: buy the commodity parts (managed runtimes, basic telemetry, observability infrastructure) and build the proprietary ones (domain-specific tools, custom evaluation datasets, environment context). Don't build what you can configure.

The case for building from scratch is narrow. It looks like: you're in a domain with unusual tool requirements (medical records, internal financial APIs, proprietary execution environments), you have security constraints that rule out managed harnesses, or you need sub-1,000-token system prompts at scale because the token cost becomes a real budget line.

For most teams, extending Pi covers 80% of what a custom build would give you, at a fraction of the engineering time.


MCP: worth knowing in one paragraph

Model Context Protocol (MCP), introduced by Anthropic and now adopted across the industry, is a standard for how harnesses expose tools to models. You build a tool server once. It works with Claude Code, Codex, Pi, OpenCode, and anything else that speaks MCP. USB for AI tools. All major harnesses support it. If you're building domain-specific tooling today, build it as MCP servers.
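
A minimal MCP tool server, following the shapes documented in the official TypeScript SDK; the server name and tool are hypothetical:

typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "ticket-tools", version: "1.0.0" });

// Registered once here; usable from any MCP-speaking harness.
server.tool(
  "lookup_ticket",
  { id: z.string().describe("Ticket ID, e.g. PROJ-123") },
  async ({ id }) => ({
    content: [{ type: "text" as const, text: `Ticket ${id}: status unknown (stub)` }],
  })
);

// stdio transport: the harness spawns this process and speaks MCP over pipes.
const transport = new StdioServerTransport();
await server.connect(transport);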


Where this is heading

In 2026, the harness is the new OS. The model is the CPU. Tools are system calls. Memory is RAM. The harness is the kernel. Your specific agent is the application running on top.

If that holds, the long-term competitive position in AI products is not which model you use. It's harness quality, domain-specific tooling, and the feedback loop between your harness and your model's training data. The providers who understood this earliest are already running that loop at scale. Everyone else is still treating model selection as the primary variable.


Agent Harnesses series

Start here, then use the rest of the series as the deeper map.


Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Delhi.