In September 2023, Andrej Karpathy posted a tweet that named something engineers were already starting to build:
"With many pieces dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System."
That framing stuck. Weeks later, a paper from UC Berkeley made the same case from a different angle, proposing "virtual context management" explicitly modeled on how operating systems page memory between RAM and disk. The paper was called MemGPT: Towards LLMs as Operating Systems.
Two independent threads, same analogy. That's usually a sign that something real is being named.
What follows is the full architectural breakdown: what each component maps to, where the analogy generates genuinely useful design intuitions, and where it falls apart. If you're building agent systems, the OS framing is the most useful mental model available right now. But it's a design tool. The gaps in the analogy are where the interesting engineering actually is.
Where the thesis came from
Karpathy's September 2023 tweet wasn't a fully-formed theory. He was describing what he was observing: an LLM at the center of a system, orchestrating inputs across text, audio, and vision, calling a code interpreter, browsing the internet, retrieving from an embeddings database.
By November 2023, he'd written a more complete version. His spec: GPT-4 Turbo as the CPU running at "256 cores (batch size) @ 20Hz (tokens/second)," a 128K context window as RAM, and Ada002 embeddings as the filesystem. By 2025, he'd extended it into what he called Software 3.0: model weights as the fixed processing substrate, context window as working memory, prompting as the new programming language.
At the same time, Charles Packer and colleagues at UC Berkeley were building MemGPT. Their problem was concrete: LLMs can only work with what's in their context window, and real tasks need more memory than any context window can hold. Their solution came directly from OS design. Page data in and out of the context window the same way an OS pages data between RAM and disk, and have the system manage that paging automatically, the same way a kernel manages physical memory on behalf of application processes.
The formalization came in March 2024 with AIOS from Rutgers University, which defined a complete kernel: scheduler, context manager, memory manager, storage manager, tool manager, and access manager. In March 2026, the first academic workshop on OS design for AI agents ran at ASPLOS. The field had moved from metaphor to research agenda.
The analogy, mapped
Model = CPU
The model is the stateless compute substrate. It executes instructions on given inputs and produces outputs. It has no persistent state between calls. The same operation class runs on radically different inputs. You can swap the model without changing the abstraction layer above it, the same way you could swap CPU architectures without rewriting the OS.
Karpathy's framing: model weights equal CPU. The weights are fixed. They don't change during inference. The computation happens on top of them.
The analogy maps most cleanly to the weights, not the inference process itself. A GPU running a generative model is actually a closer physical match than a CPU: massively parallel, operating on matrices, with behavior determined by the weight configuration. But for the architectural point, CPU is close enough to be useful.
Context window = RAM
Working memory. Finite. Everything the model can see right now, and nothing else.
The harness decides what goes into the context window, the same way a kernel decides what goes into physical RAM. Claude Code auto-compacts when token usage hits roughly 98% of the context window. That's a software-managed paging operation. The model doesn't trigger it. The harness does.
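A minimal sketch of that harness-side decision, using the threshold above. Here count_tokens and summarize stand in for whatever the real harness implements; this is the shape of the paging operation, not Claude Code's actual code:

```python
# Hypothetical harness-side compaction check. The 98% threshold is from
# the text; count_tokens() and summarize() stand in for real components.
CONTEXT_LIMIT = 200_000      # assumed model token budget
COMPACT_AT = 0.98            # compaction threshold

def maybe_compact(messages, count_tokens, summarize):
    """Software-managed paging: the harness decides, not the model."""
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACT_AT * CONTEXT_LIMIT:
        return messages                          # enough RAM left
    head, tail = messages[:-10], messages[-10:]  # keep recent turns resident
    summary = summarize(head)                    # page older turns out
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + tail
```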
Harness = Kernel
The harness is the layer that makes everything else work. It manages resources, allocates context window space, executes tools, controls which files and network endpoints the model can reach, handles retries and errors, and runs the loop.
A systematic analysis of Claude Code's architecture found that roughly 1.6% of the system is AI. The other 98.4% is infrastructure: 1,884 files, approximately 512,000 lines, seven safety layers, five compaction stages. The harness is the 98.4%. The model is the 1.6%.
This is exactly how modern OS kernels work. Most of the kernel's code manages resources. A tiny fraction handles the actual execution of user instructions.
Tools = System calls
Tools are the only sanctioned interface between the model and the external world. Like syscalls, they have a defined schema: name, parameters, return type. They're enumerated and finite. The model cannot call arbitrary code. Every tool call is mediated by the harness, not executed directly by the model.
Think of it this way: the model cannot touch the filesystem or network. It can only ask the harness to do it on its behalf. The harness decides whether to comply.
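A sketch of that mediation in a dozen lines. The tool names and table are invented, but the shape is the point: the model emits a request, and only the harness touches the machine.

```python
# Invented syscall-style dispatch: the model can only request an operation;
# the harness validates and executes it on the model's behalf.
import os

TOOL_TABLE = {
    "read_file": lambda path: open(path, encoding="utf-8").read(),
    "list_dir":  lambda path: "\n".join(os.listdir(path)),
}

def dispatch(tool_call: dict) -> str:
    """Every tool call passes through here; the model never touches the OS."""
    name = tool_call["name"]
    args = tool_call.get("arguments", {})
    if name not in TOOL_TABLE:
        return "ENOSYS: no such tool"   # the hallucination case, covered below
    return TOOL_TABLE[name](**args)
```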
MCP = POSIX
POSIX defined how system calls are named, parameterized, and returned across UNIX-like systems. Model Context Protocol does the same for AI tools: it standardizes how tools expose themselves, how the model invokes them, and how results come back.
OpenAI adopted MCP in March 2025. In December 2025, Anthropic donated MCP to the Linux Foundation-hosted Agentic AI Foundation, making the standard community-owned rather than vendor-owned. That's the AI equivalent of POSIX becoming an open standard. No single vendor owns the syscall ABI.
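What that standardization looks like in practice: an MCP server advertises each tool with a name, a description, and a JSON Schema for its parameters. A sketch of one such declaration; the field names follow the MCP spec, but the tool itself is made up:

```python
# A made-up tool declared in MCP's standard shape: a name, a description,
# and a JSON Schema for parameters. This is one entry in the syscall table.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to read"},
        },
        "required": ["path"],
    },
}
```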
External memory / RAG = Disk
Persistent. Slower. Retrieved on demand. The model doesn't access external memory directly. It issues a tool call that the harness services by fetching from the store. MemGPT made this concrete: archival memory as disk, recall memory as a searchable store, core memory as always-in-context facts pinned in the working set.
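A sketch of those three tiers as data structures, loosely after MemGPT's design. The names paraphrase the paper rather than reproduce its actual API, and the search here is naive substring matching rather than vector retrieval:

```python
# Memory tiers loosely after MemGPT; names paraphrase the paper and are
# not the library's actual API.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    core: dict = field(default_factory=dict)      # pinned: always in context
    recall: list = field(default_factory=list)    # searchable message history
    archival: list = field(default_factory=list)  # "disk": fetched via tool calls

    def page_in(self, query: str, k: int = 3) -> list:
        """Service a context miss: pull matching archival entries back in."""
        hits = [doc for doc in self.archival if query.lower() in doc.lower()]
        return hits[:k]
```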
Agent session = Process
A running instance with its own state: context window contents, tool history, accumulated memory, permissions. Multiple sessions are multiple processes, each with its own address space. Claude Code saves sessions as JSONL files under ~/.claude/projects/. That's a process checkpoint.
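Checkpointing and restoring such a "process" is then just serializing its state. A minimal sketch in the same JSONL spirit; the real on-disk schema under ~/.claude/projects/ is Claude Code's own, and this one is invented:

```python
# Invented JSONL checkpoint format; the real schema is Claude Code's own.
import json

def checkpoint(path: str, events: list) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for event in events:                 # one event per line
            f.write(json.dumps(event) + "\n")

def restore(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```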
System prompt = Kernel configuration
The system prompt initializes the agent's session. It defines available tools, behavioral constraints, and task parameters. Like kernel parameters passed at boot, it shapes the entire session's behavior without being visible to the application layer above.
What the framing gets right
Harness complexity follows the same curve as kernel complexity
Early operating systems were thin layers over hardware. As they matured, the kernel grew: scheduler, memory manager, filesystem, networking, security. The story of OS history is a story of the kernel absorbing more responsibility as the hardware it ran on became the platform people built real things on.
Harnesses show exactly the same progression. Early harnesses in 2023 were simple: call the model in a loop, give it a few tools. By 2025, production harnesses had context compaction, multi-agent scheduling, permission systems, session persistence, observability, and recovery logic. This isn't scope creep. It's what happens when a resource-management layer meets real workloads.
The syscall boundary is the right design principle for tool APIs
Treating tools as syscalls gives you the right design instincts. Keep the surface area minimal. Define precise parameter types and return values. Log every call. Enforce permissions at the tool boundary, not inside the model's context.
An MCP server is a syscall table: a defined list of operations the model can invoke. The discipline of syscall design, refined over fifty years of OS engineering, transfers directly.
Virtual memory concepts transfer to context management
The MemGPT insight is that the vocabulary of virtual memory maps cleanly onto context management:
- Physical RAM limit maps to context window token limit
- Virtual memory maps to virtual context (all data the agent could theoretically access)
- Page fault maps to context miss (data not in window, must be fetched)
- LRU page replacement maps to recency-based context summarization
- Memory-mapped files map to RAG and vector store retrieval
Harness designers now have a vocabulary and a set of proven algorithms to adapt. LRU, working set theory, prefetching. These aren't new problems. The context window is just a new kind of cache.
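A minimal sketch of that cache, with fetch_from_store standing in for the external-memory lookup that services a context miss (both names are assumptions for illustration):

```python
# Minimal LRU sketch for context "pages". fetch_from_store() is an assumed
# external-memory lookup; calling it is the page-fault path.
from collections import OrderedDict

class ContextCache:
    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens
        self.pages = OrderedDict()               # key -> (tokens, text), LRU order

    def get(self, key: str, fetch_from_store):
        if key in self.pages:
            self.pages.move_to_end(key)          # hit: refresh recency
            return self.pages[key][1]
        tokens, text = fetch_from_store(key)     # miss: "page fault"
        while self.pages and self._used() + tokens > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        self.pages[key] = (tokens, text)
        return text

    def _used(self) -> int:
        return sum(t for t, _ in self.pages.values())
```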
Multi-agent is multi-process scheduling
Once you have multiple agents, you need scheduling, isolation, inter-process communication, and synchronization. These are solved problems in OS theory. The solutions, scheduling algorithms, shared memory with locks, message queues, transfer directly to multi-agent systems. The OS framing doesn't just describe the problem. It points at the solutions.
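A sketch of what transfers: round-robin scheduling over agent "processes", with harness-mediated queues standing in for IPC. Everything here is invented for illustration; real harnesses differ in the details but not the shape.

```python
# Invented illustration: round-robin scheduling over agent "processes",
# with harness-mediated queues standing in for IPC.
from collections import deque
from queue import Queue

class Agent:
    """Stand-in process: step() runs one quantum and reports completion."""
    def __init__(self, name: str, steps: int):
        self.name, self.remaining = name, steps

    def step(self, inbox: Queue) -> bool:
        while not inbox.empty():         # drain kernel-mediated messages
            _ = inbox.get()
        self.remaining -= 1
        return self.remaining <= 0

def run_round_robin(agents: list) -> None:
    inboxes = {a.name: Queue() for a in agents}
    ready = deque(agents)
    while ready:
        agent = ready.popleft()          # next runnable process
        if not agent.step(inboxes[agent.name]):
            ready.append(agent)          # not finished: back of the queue
```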
Where it breaks down
The analogy is useful, not accurate. Here's where it fails.
The CPU has no opinions
A CPU executes instructions deterministically. Same binary state, same input, same output. Every time. This is the bedrock assumption of software engineering.
The model is stochastic. Even under temperature=0 with greedy decoding, recent research shows non-determinism persists. Floating-point arithmetic behaves differently across batch sizes and sequence lengths. BF16 precision shows variance compared to FP32. Hardware-level rounding differences compound across GPU runs.
The harness cannot guarantee that the same system prompt plus the same input produces the same output, and no kernel design can change that. Building a kernel that cannot rely on deterministic CPU behavior is a fundamentally different engineering problem from anything OS designers have had to solve before.
Models hallucinate syscalls
A CPU cannot invent a syscall that doesn't exist. If the instruction is invalid, it faults deterministically. The model can and does fabricate tool calls, invoking tools that aren't in the registry with parameters that don't match the schema.
The numbers here are real. GPT-4o achieves only 28% full sequence match accuracy on nested tool calls. Tool hallucinations increase as the tool count grows: more tools means more confusion. Stronger reasoning models can produce more elaborate hallucinated plans with more opportunities to call non-existent tools. One documented case had an agent hallucinating a bulk_search_users endpoint and retrying variations of it repeatedly.
This has no OS analogy. A kernel receiving an invalid syscall number returns ENOSYS and moves on. The harness has to catch a model mid-hallucination, parse a malformed tool call, decide what to do, and often re-prompt the model to correct itself.
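A sketch of the extra work this forces on the harness, relative to a kernel's one-line ENOSYS: validate the call against the registry and schema, and turn failures into feedback the model can correct from. The registry and messages here are invented:

```python
# Invented validation layer. Where a kernel returns ENOSYS and moves on,
# the harness must produce feedback the model can recover from.
TOOL_SCHEMAS = {"search_users": {"query"}}   # tool -> required parameters

def validate_call(name: str, args: dict):
    """Return a re-prompt message, or None if the call is valid."""
    if name not in TOOL_SCHEMAS:
        known = ", ".join(sorted(TOOL_SCHEMAS))
        return f"No tool named '{name}'. Available tools: {known}."
    missing = TOOL_SCHEMAS[name] - set(args)
    if missing:
        return f"Tool '{name}' is missing parameters: {sorted(missing)}."
    return None

# The documented failure mode from above:
print(validate_call("bulk_search_users", {}))  # -> "No tool named ..."
```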
The kernel cannot preempt the model mid-generation
A real kernel can preempt a running process at any moment. It can pause it, inspect its state, kill it, or reschedule it. The harness cannot preempt a model mid-generation. It can only observe the completed output and react.
This is cooperative scheduling, not preemptive. The model runs to completion. Then the harness gets control. There is no mechanism to interrupt a model going in the wrong direction based on a timeout, a higher-priority task arriving, or a user cancellation signal. You wait. Then you handle it.
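In code, the best a harness can do today looks roughly like this: the cancel flag is honored between generations, never during one. The loop below is invented for illustration:

```python
# Invented illustration of cooperative scheduling: the cancel flag is
# checked between generations; each generation runs to completion on the
# provider side before the harness regains control.
import asyncio

async def agent_loop(generate, cancelled: asyncio.Event):
    while not cancelled.is_set():      # the only scheduling point
        output = await generate()      # completes before we can react
        if "DONE" in output:           # stand-in termination check
            return output
    return None                        # cancelled between turns, never during one
```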
Prompt injection has no hardware equivalent
OS kernels enforce memory protection at the hardware level. One process cannot read another process's memory. The memory management unit enforces this in silicon.
The harness has no equivalent defense against prompt injection. Malicious content in the model's context can redirect its behavior. Everything in the context window is equally visible and equally influential to the model. There is no hardware-enforced separation between data and instructions.
This is an open problem. The CPU analogy actually makes it sound more solvable than it is.
The model's "instruction set" is natural language
CPU instruction sets are formally specified. Every opcode has a precise binary encoding, exact behavior, and defined side effects. The model's instruction set is natural language: fuzzy, ambiguous, culturally variable, context-sensitive.
You cannot formally specify model behavior with the rigor of a syscall specification. Use the OS framing to guide architecture. Don't use it to prove correctness.
What it implies for harness design
Microkernel vs. monolithic
OS design has been arguing about this since the 1980s. Monolithic kernels (Linux) run all kernel services in the same address space: fast, but a bug anywhere can bring the system down. Microkernels (Mach, seL4) push services to user space: higher inter-process communication overhead, but better isolation and easier verification.
Pi, a minimal open-source agent harness by Mario Zechner, takes the microkernel position deliberately. Its core has four tools: read, write, edit, and bash. Model-agnostic, extensible via plugins. The reasoning is explicit: "existing coding agent platforms had become feature-bloated spaceships." TerminalBench data supports this view. Terminus, which gives the agent only keystrokes to a terminal and reads back raw VT codes, placed at the top of the leaderboard. Excessive harness complexity may actively hurt performance.
Claude Code takes the monolithic position. Deeply integrated with Anthropic's models. Seven safety layers. Five compaction stages. Complex permission gates, all kernel services tightly coupled. Reliable and safe. Not portable to other models.
Neither has won. This is still the same unresolved debate, just running on different hardware.
MCP is the path to portability
If tools are syscalls, they need a standard ABI. MCP is that ABI. Community ownership through the Linux Foundation means no single vendor can fork it for competitive advantage. Harnesses built against MCP interfaces are portable. Harnesses built against bespoke tool APIs are not.
Write against MCP. That's the lesson from fifty years of POSIX.
Context management needs OS-grade algorithms
Current production harnesses use ad hoc compaction. Claude Code summarizes when the context hits 98% full. MemGPT and Letta implement explicit paging with archival memory and recall memory as disk equivalents, using LRU with pinning for critical context.
More sophisticated strategies exist: priority-based retention (keep high-relevance context, evict low-relevance), prefetching (load context the model will likely need before it requests it), and pinning (never evict certain context, the same way a kernel pins pages it needs for interrupt handling). These are proven algorithms. They just need adapting to a new cache layer.
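A sketch of priority-based retention with pinning, under an assumed relevance score per item. Everything here is illustrative, not any production harness's actual eviction policy:

```python
# Illustrative priority-based retention with pinning; relevance scores are
# assumed inputs, not any production harness's actual policy.
def retain(items: list, budget_tokens: int) -> list:
    """Each item: {"text": ..., "tokens": int, "relevance": float, "pinned": bool}."""
    pinned = [i for i in items if i["pinned"]]            # never evicted
    rest = sorted((i for i in items if not i["pinned"]),
                  key=lambda i: i["relevance"], reverse=True)
    kept, used = list(pinned), sum(i["tokens"] for i in pinned)
    for item in rest:                                     # fill by priority
        if used + item["tokens"] <= budget_tokens:
            kept.append(item)
            used += item["tokens"]
    return kept
```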
Permissions belong in the harness, not the model
The model cannot be trusted to enforce its own permission boundaries. It can be jailbroken, prompt-injected, or simply hallucinate into restricted territory. Permission enforcement must live in the harness, enforced at the tool call boundary.
Applications don't enforce their own memory bounds. The kernel does, via hardware page tables. Same principle applies here: the model shouldn't decide what it's allowed to touch. The harness should.
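A sketch of that boundary: the check lives in the dispatcher, outside anything the model's context can influence. The policy format is invented; deny rules win over allow rules by design:

```python
# Invented permission gate, enforced in the harness at the tool boundary,
# outside anything the model's context can rewrite. Deny rules win.
from fnmatch import fnmatch

POLICY = {
    "read_file":  {"allow": ["/workspace/*"], "deny": ["/workspace/.env"]},
    "write_file": {"allow": ["/workspace/out/*"], "deny": []},
}

def check_permission(tool: str, path: str) -> bool:
    rules = POLICY.get(tool)
    if rules is None:
        return False                                     # default deny
    if any(fnmatch(path, pat) for pat in rules["deny"]):
        return False
    return any(fnmatch(path, pat) for pat in rules["allow"])
```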
Fork/Explore/Commit for multi-agent parallelism
The "Fork, Explore, Commit" paper, presented at AgenticOS 2026, proposes the right primitive for multi-agent systems. A branch() syscall gives each parallel exploration path its own isolated filesystem view via copy-on-write semantics. First-commit-wins resolution: when one branch succeeds, sibling branches are automatically invalidated. BranchFS creates branches in O(1) time with atomic commits.
It's OS fork() adapted for agentic workloads. It solves the isolation problem that current multi-agent systems patch over with ad hoc containers and file locking.
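A toy version of the first-commit-wins control flow, with copy-on-write approximated by snapshotting an in-memory dict. The paper's BranchFS does this at the filesystem layer in O(1); this shows only the shape of the primitive:

```python
# Toy first-commit-wins, with copy-on-write approximated by dict snapshots.
class BranchManager:
    def __init__(self, base: dict):
        self.base = base
        self.committed = False

    def branch(self) -> dict:
        return dict(self.base)        # isolated view for one exploration path

    def commit(self, view: dict) -> bool:
        if self.committed:
            return False              # a sibling branch already won
        self.base.update(view)
        self.committed = True         # later commits are invalidated
        return True
```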
What's still missing
Production harnesses don't have equivalents for several core kernel features. These are the gaps that matter.
There's no preemptive control over the model mid-generation. Every harness today is cooperative: the model runs to completion and then the harness acts. There's no mechanism to interrupt a model mid-generation based on an external event. Every real-time production system eventually needs this.
There's no standard process isolation between concurrent agents. Two agents working on overlapping files can corrupt each other's work. Fork/Explore/Commit is the right proposal. It's a 2026 research paper, not a production primitive you can import.
There's no equivalent of Linux cgroups for agents. Token budgets exist but are implemented differently in every harness. A standard resource controller that enforces compute limits uniformly across multiple agents doesn't exist yet.
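What such a controller might look like, by analogy with cgroups. This is entirely hypothetical; no harness exposes this interface today:

```python
# Entirely hypothetical cgroup-style token controller.
class TokenBudget:
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0

    def charge(self, agent: str, tokens: int) -> None:
        if self.used + tokens > self.limit:
            raise RuntimeError(f"{agent}: token budget exceeded")  # the OOM-kill analogue
        self.used += tokens
```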
There's no standard for inter-agent communication. UNIX processes communicate via signals, pipes, shared memory, and sockets, all kernel-mediated and standardized. Agent systems pass messages through JSON returns, shared files, or shared database state. None of it is standardized.
There's no kernel-level prompt injection defense. OS kernels enforce memory protection in hardware. Harnesses have no equivalent. Architectural separation of instruction and data context, or model-level training for injection resistance: neither is solved.
There are no standard observability primitives. UNIX has /proc, strace, and ptrace. Harnesses have logs and traces, but no standard way to inspect the full internal state of a running agent session: what's in context, what tools have been called, what the pending queue looks like. Letta's Agent Development Environment is the closest thing. It's not a standard.
2027: who builds the Windows, who builds the Linux
The model layer is commoditizing. Training runs are getting cheaper. Open models are closing the gap with proprietary ones.
The harness layer is differentiating.
This is what happened with PC operating systems. Intel's chips became commodity. Microsoft's OS was the durable competitive advantage. The kernel captured the value because it was where the ecosystem locked in.
The current landscape: Claude Code is monolithic, model-coupled, safety-focused, and optimized for software engineering tasks. Karpathy called it the first convincing demonstration of what an LLM agent actually looks like. Cursor is tightly coupled to the IDE and recently made its harness portable via the Cursor SDK. Pi takes the microkernel position: minimal core, model-agnostic, open-source. AIOS (Rutgers) is the most complete academic kernel architecture and increasingly production-facing.
The split is already happening. The Windows play: lock developers to your harness through your tool ecosystem, your memory system, your session format. Cursor and Claude Code are both making this play. The Linux play: make the harness a shared standard. MCP is a step in that direction. Pi and OpenHarness are in this camp.
Model-coupled harnesses offer tight integration and high performance on their specific model, with the trade-off of vendor lock-in. Model-agnostic harnesses run any model and build on open standards like MCP: no lock-in, but none of the performance gains that tight coupling buys.
Anthropic's December 2025 donation of MCP to the Linux Foundation is worth reading carefully. It's the AI equivalent of AT&T open-sourcing UNIX System V: a move to prevent any single vendor from owning the syscall ABI, and to win developer trust on the way to winning the developer platform. That's a Windows strategy dressed up as a Linux move.
By 2027, the organizations with the most sophisticated harnesses will have advantages that a better training run alone can't replicate. The best context management. The most reliable tool execution. The most mature multi-agent scheduling. The deepest safety layers. These are engineering problems. They compound over time.
The harness is where the durable competitive moat is being built right now. Most people are still watching the model benchmarks.
What to read next
Blog 11 in this series covers the A2A Protocol: when agents talk to each other, what inter-harness communication actually looks like, and why it matters once your system has more than one agent.
For context management depth, Letta's documentation on virtual context and the MemGPT paper (arxiv 2310.08560) are the two most concrete treatments available. For harness architecture, the Claude Code analysis paper (arxiv 2604.14228) is a systematic breakdown of a production harness. The AIOS paper (arxiv 2403.16971) is the most complete academic kernel specification.
References
- Karpathy, A. - Tweet, September 2023 - "LLMs not as a chatbot, but the kernel process of a new OS" - https://x.com/karpathy/status/1707437820045062561
- Karpathy, A. - Tweet, November 2023 - "LLM OS" spec with RAM/CPU framing - https://x.com/karpathy/status/1723140519554105733
- Karpathy, A. - "2025 LLM Year in Review" - naming Claude Code as first convincing LLM agent - https://karpathy.bearblog.dev/year-in-review-2025/
- Packer, C. et al. - "MemGPT: Towards LLMs as Operating Systems" - arxiv 2310.08560, October 2023 - https://arxiv.org/abs/2310.08560
- Mei, K. et al. - "AIOS: LLM Agent Operating System" - arxiv 2403.16971, COLM 2025 - https://arxiv.org/abs/2403.16971
- Wang, C. et al. - "Fork, Explore, Commit: OS Primitives for Agentic Exploration" - arxiv 2602.08199, AgenticOS 2026 - https://arxiv.org/abs/2602.08199
- AgenticOS 2026 - 1st Workshop on OS Design for AI Agents, ASPLOS 2026 - https://os-for-agent.github.io/
- Letta documentation - Virtual context management and memory architecture - https://docs.letta.com/concepts/memgpt/
- "Dive into Claude Code" - systematic analysis of Claude Code harness architecture - arxiv 2604.14228 - https://arxiv.org/html/2604.14228v1
- Model Context Protocol - Wikipedia - MCP overview, adoption history, AAIF donation - https://en.wikipedia.org/wiki/Model_Context_Protocol
- "Composable OS Kernel Architectures for Autonomous Intelligence" - arxiv 2508.00604 - https://arxiv.org/html/2508.00604v1
- Yang, J. et al. - SWE-agent - agent-computer interface design - arxiv 2405.15793, NeurIPS 2024 - https://arxiv.org/abs/2405.15793
- "Phantom Tool Calls: When AI Agents Invoke Tools That Don't Exist" - TianPan.co - https://tianpan.co/blog/2026-04-14-phantom-tool-calls-when-ai-agents-invoke-tools-that-dont-exist
- Zechner, M. - Pi: A Minimal Agent Harness - lucumr.pocoo.org
- "Reducing Tool Hallucination via Reliability Alignment" - arxiv 2412.04141 - https://arxiv.org/html/2412.04141v1
- Augment Code - "Multi-Agent AI Production Requirements Beyond the Demo" - https://www.augmentcode.com/guides/multi-agent-ai-production-requirements
- "Cursor, Claude Code, and Codex are merging into one AI coding stack nobody planned" - The New Stack - https://thenewstack.io/ai-coding-tool-stack/
- "Fork, Explore, Commit" - Hacker News discussion - https://news.ycombinator.com/item?id=47095932
- "Illustrated LLM OS: An Implementational Perspective" - HuggingFace blog - https://huggingface.co/blog/shivance/illustrated-llm-os
- Claude Code: How Claude Code Works - Anthropic official documentation - https://code.claude.com/docs/en/how-claude-code-works
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.