Lesson 05

Multi-step orchestration and state management

Short agent demos hide the hard problem. Real tasks run dozens of steps, fill the context window, and must survive crashes. Orchestration is how the harness keeps a long job coherent when the model cannot hold everything in working memory at once.

The one idea

Long agent tasks need external state: summaries, scratchpads, checkpoints, and retrieval back into context. The harness orchestrates what the model sees each turn so the job stays on rails as history grows.

Why one context window is not enough

Every tool result and every assistant message stays in the transcript unless the harness removes or compresses it. A forty-step coding task can accumulate hundreds of thousands of tokens of file contents, command output, and diffs.

Two separate problems show up:

Hard limit: you hit the context ceiling and the API refuses or truncates.
Context rot: well before the ceiling, accuracy tends to drop as context length grows. Models often attend strongly to the start and end of context and weakly to the middle.

So "buy a 1M token window" does not solve orchestration. You still need policy for what deserves those tokens each turn.

Layers of state

Think in layers instead of one big chat log:

Ephemeral transcript

The live message list the model sees this turn. Grows fast. Subject to compaction.

Working memory

Explicit artifacts the agent maintains: todo lists, plan docs, decision logs. Often updated via dedicated tools so they survive compaction of raw tool dumps.

Session persistence

Append-only logs (JSONL is common) storing every event for replay, audit, and resume after crash.

External memory

Vector stores, wikis, ticket systems. Retrieved on demand rather than kept inline forever.

World state

The actual filesystem, database, and APIs the tools mutate. The source of truth. Context is just a view.

Context is the smallest layer. The harness bridges down to durable state.

Compaction strategies

When context grows too large, the harness must shrink what the model sees. Common stages, from cheap to expensive:

Snip: drop ancient tool output entirely when a summary exists.
Truncate: keep head and tail of long outputs, cut the middle.
Microcompact: collapse verbose sequences ("10 similar read_file calls") into one line.
Summarize: replace a range of messages with a summary block written by a model or a heuristic.
Retrieve: move detail to external store; inject pointers ("full log at id=...") instead of content.

Production systems often run multiple stages in order. Cheaper stages first; summarize only when needed.
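
The staged approach above can be sketched roughly as follows. This is a minimal illustration, not a production compactor: it measures size in characters rather than tokens, and the message shape (`role`, `content`, and a `summarized` flag) is an assumption made for the example.

```python
def truncate_middle(text: str, head: int = 200, tail: int = 200) -> str:
    """Keep the first `head` and last `tail` characters; mark what was cut."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    tail_part = text[-tail:] if tail > 0 else ""
    return f"{text[:head]}\n[... {omitted} chars omitted ...]\n{tail_part}"


def compact(messages: list, budget_chars: int) -> list:
    """Apply cheap stages first and stop as soon as the transcript fits."""
    def size(msgs: list) -> int:
        return sum(len(m["content"]) for m in msgs)

    if size(messages) <= budget_chars:
        return messages

    # Stage 1 (snip): drop tool output that already has a summary elsewhere.
    # The `summarized` flag is a hypothetical marker set by the harness.
    out = [m for m in messages
           if not (m["role"] == "tool" and m.get("summarized"))]
    if size(out) <= budget_chars:
        return out

    # Stage 2 (truncate): keep head and tail of the remaining tool outputs.
    out = [
        {**m, "content": truncate_middle(m["content"])} if m["role"] == "tool" else m
        for m in out
    ]
    # Costlier stages (summarize, retrieve) would follow here if still over budget.
    return out
```

Ordering the stages this way means the model-based summarizer only runs when dropping and trimming were not enough.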

Instead of re-injecting a 20k-token tool result every turn, store it externally and put a short pointer in context: "Error log summary: 3 failures in auth module; full log id=abc." The model can request the full log via a tool if needed. On long sessions, pointers can cut token load substantially compared with stuffing everything inline, and the shorter context can also improve task success.
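
A pointer scheme along these lines might look like the sketch below. `ResultStore` is a hypothetical in-memory stand-in for whatever external store the harness uses, and the 2000-character inline limit is an arbitrary assumption.

```python
import hashlib


class ResultStore:
    """In-memory stand-in for an external store (a file or database in practice)."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def put(self, content: str) -> str:
        # Content-derived id: storing the same output twice yields the same pointer.
        blob_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        self._blobs[blob_id] = content
        return blob_id

    def get(self, blob_id: str) -> str:
        return self._blobs[blob_id]


def to_pointer(store: ResultStore, content: str, summary: str,
               inline_limit: int = 2000) -> str:
    """Return small results inline; replace large ones with summary plus pointer."""
    if len(content) <= inline_limit:
        return content
    blob_id = store.put(content)
    return f"{summary}; full output id={blob_id} ({len(content)} chars)"
```

A companion tool that calls `store.get(blob_id)` would let the model fetch the full output only when it actually needs it.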

Checkpoints and resume

Long jobs should survive process death. Minimum viable checkpointing:

Append-only event log with session ID and monotonic step index.
Persist user goal and harness config at start.
Snapshot external world state when cheap (git stash, DB transaction boundaries).
On resume, rebuild context from log plus working memory artifacts, not by replaying every token.
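
The checklist above could be sketched as a small JSONL log plus a rebuild step. The event kinds (`start`, `todo_update`) and the payload fields are assumptions chosen for illustration, not a standard schema.

```python
import json
from pathlib import Path


class EventLog:
    """Append-only JSONL log with a session ID and a monotonic step index."""

    def __init__(self, path: Path, session_id: str) -> None:
        self.path = path
        self.session_id = session_id
        # Continue numbering after whatever an earlier process already wrote.
        self.next_step = len(self.read())

    def append(self, kind: str, payload: dict) -> int:
        event = {"session_id": self.session_id, "step": self.next_step,
                 "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        self.next_step += 1
        return event["step"]

    def read(self) -> list:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            events = [json.loads(line) for line in f if line.strip()]
        return [e for e in events if e["session_id"] == self.session_id]


def rebuild_context(events: list) -> dict:
    """Resume from the goal and the latest working memory, not a token replay."""
    state = {"goal": None, "todo": [], "last_step": -1}
    for e in events:
        if e["kind"] == "start":
            state["goal"] = e["payload"]["goal"]
        elif e["kind"] == "todo_update":
            state["todo"] = e["payload"]["todo"]  # later updates win
        state["last_step"] = e["step"]
    return state
```

A real implementation would also need to tolerate a partially written final line left behind by a crash; this sketch assumes every line is complete.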

Coding agents that snapshot files before edits make revert a harness feature. The model did not "remember" the old file version. The harness kept it.

Single-process agent loops die when the process crashes, deploys mid-run, or waits hours for human approval. Durable workflow engines (Temporal, Inngest, AWS Step Functions) persist orchestration state outside the LLM: which step you are on, what inputs each step received, retry policy, and timers.

Use them when:

Agent jobs run longer than one container lifetime (hours, days).
Human approval gates pause work for unpredictable durations.
You need effectively-once side effects (idempotent steps with retries) or compensating transactions when a later step fails.
Multiple services participate and must survive partial failures.

Pattern: the workflow owns the macro state machine; each "activity" invokes one agent iteration or a fixed tool batch. The LLM is a subroutine, not the scheduler. Inngest and Temporal both support sleeping and waiting for an external event, which is useful for approval emails and webhooks.

State graphs (LangGraph mental model)

LangGraph popularized modeling agent orchestration as a directed graph: nodes are steps (model call, tool batch, human review), edges are transitions conditioned on state. You do not need LangGraph to use the idea.

Benefits of graph thinking:

Explicit branches ("tests failed → fix node" vs "tests passed → summarize node").
Checkpointing at nodes maps cleanly to resume after crash.
Easier to visualize than an opaque while-loop in one file.

Your graph state object should hold: user goal, working memory, last tool results, iteration count, and phase enum. The model reads a projection of that state each turn; the harness owns the canonical object.
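
A plain-Python sketch of such a state object, with no LangGraph dependency, might look like this. The phase names, the transition rules, and the 20-iteration cap are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Phase(Enum):
    PLAN = "plan"
    EDIT = "edit"
    TEST = "test"
    FIX = "fix"
    SUMMARIZE = "summarize"
    DONE = "done"


@dataclass
class AgentState:
    """Canonical state owned by the harness."""
    user_goal: str
    working_memory: dict = field(default_factory=dict)
    last_tool_results: list = field(default_factory=list)
    iteration: int = 0
    phase: Phase = Phase.PLAN


def next_phase(state: AgentState, tests_passed: Optional[bool] = None,
               max_iterations: int = 20) -> Phase:
    """Edges of the graph: transitions conditioned on state."""
    if state.phase in (Phase.SUMMARIZE, Phase.DONE):
        return Phase.DONE
    if state.iteration >= max_iterations:
        return Phase.SUMMARIZE  # hard cap: stop looping and report
    if state.phase is Phase.PLAN:
        return Phase.EDIT
    if state.phase in (Phase.EDIT, Phase.FIX):
        return Phase.TEST
    # Phase.TEST: branch on the outcome of the test run.
    return Phase.SUMMARIZE if tests_passed else Phase.FIX


def project(state: AgentState) -> str:
    """The model-facing view: a projection, not the whole object."""
    return (f"Goal: {state.user_goal}\n"
            f"Phase: {state.phase.value}\n"
            f"Iteration: {state.iteration}\n"
            f"Todo: {state.working_memory.get('todo', [])}")
```

Persisting `AgentState` after each transition is what makes a node boundary a natural checkpoint.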

Orchestration patterns inside one agent

Before spawning multiple agents, single-agent orchestration often suffices:

Phased workflow: research phase (read-only tools), then edit phase (write tools enabled).
Tool gating: expose dangerous tools only after plan approval.
Sub-routines: fixed micro-loop for "run tests until pass or N tries" wrapped as one tool the main agent calls.
Explicit todo tool: model updates task list; harness always re-injects latest list near end of context where attention is strong.

These patterns reduce chaos without paying the coordination tax of multiple models talking to each other.
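
Phased workflow and tool gating can share one small check in the harness. In the sketch below the tool names and phase labels are hypothetical; the point is only that the harness, not the model, decides which tools exist at a given moment.

```python
# Hypothetical tool names, grouped by how dangerous they are.
READ_ONLY_TOOLS = {"read_file", "grep", "list_dir"}
WRITE_TOOLS = {"write_file", "run_shell"}


def allowed_tools(phase: str, plan_approved: bool) -> set:
    """Write tools appear only in the edit phase and only after plan approval."""
    if phase == "edit" and plan_approved:
        return READ_ONLY_TOOLS | WRITE_TOOLS
    return set(READ_ONLY_TOOLS)


def dispatch(tool_name: str, phase: str, plan_approved: bool) -> dict:
    """Reject gated tools with a structured error the model can react to."""
    if tool_name not in allowed_tools(phase, plan_approved):
        return {"ok": False,
                "error": f"tool '{tool_name}' is not available in phase '{phase}'"}
    # A real harness would execute the tool here.
    return {"ok": True, "tool": tool_name}
```

Filtering the tool list sent to the model and checking again at dispatch time gives two layers of protection against a write slipping through during research.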

Engineering reality

Plot input tokens per iteration against iteration number. A steady, roughly linear climb is normal as the transcript accumulates. A sudden jump in slope, or growth that keeps accelerating, usually means a tool result is being re-injected whole every turn or a summary failed. That chart is the fastest way to spot orchestration bugs before users complain about cost.
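
A rough way to automate this check, assuming the harness records input tokens for each iteration, is to compare each step's growth against a baseline taken from the first few iterations. The warmup length and the 3x factor are arbitrary assumptions to tune per agent.

```python
from statistics import median


def flag_token_jumps(tokens_per_iteration: list, warmup: int = 3,
                     factor: float = 3.0) -> list:
    """Return iteration indices whose growth exceeds factor x the early baseline."""
    deltas = [b - a for a, b in zip(tokens_per_iteration, tokens_per_iteration[1:])]
    if len(deltas) <= warmup:
        return []  # not enough data to establish a baseline
    # Baseline growth per iteration, taken from the first `warmup` steps.
    baseline = max(median(deltas[:warmup]), 1)
    return [i + 1 for i, d in enumerate(deltas)
            if i >= warmup and d > factor * baseline]
```

For example, `flag_token_jumps([1000, 1500, 2000, 2500, 23000, 43500])` returns `[4, 5]`: growth of 20,500 tokens per step against a baseline of 500.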

Virtual context and the OS analogy again

MemGPT-style designs treat the context window like RAM and external stores like disk. The harness pages data in and out: fetch relevant memories, evict stale chunks, promote urgent user goals to a pinned header.

Like real virtual memory, paging has failure modes: thrashing (constant retrieve/evict), wrong page brought in (irrelevant retrieval), and dirty state (world changed but summary is stale). Orchestration code must invalidate summaries when tools mutate the world.

Pinning what must not move

Compaction should never drop:

The original user goal (or a faithful compressed version).
Hard constraints ("do not touch prod DB").
Open decisions still in flight.
Latest working todo list.

Harnesses often pin these at the top or bottom of context where attention is strongest. Middle placement is where instructions go to die.
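
One simple way to honor pins during compaction is sketched below. The `pinned` flag on messages is an assumed convention, and this version places surviving pinned messages at the top, ahead of the recent tail.

```python
def compact_with_pins(messages: list, keep_recent: int = 4) -> list:
    """Drop old unpinned messages; keep pinned ones and the most recent tail."""
    cutoff = max(len(messages) - keep_recent, 0)
    # Pinned items from the older region survive compaction.
    pinned = [m for m in messages[:cutoff] if m.get("pinned")]
    # The most recent messages are kept verbatim, pinned or not.
    recent = messages[cutoff:]
    return pinned + recent
```

A harness that prefers the bottom of context for the todo list could append the latest list after `recent` instead.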

Sub-routines as tools

Instead of exposing twenty micro-steps to the main agent, wrap them as one tool: run_tests_and_return_summary. Internally that may be a fixed script or a nested loop with its own cap. The main agent observes one structured result.

This is orchestration without multi-agent overhead. You get deterministic middle logic and a simpler outer loop.
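
A minimal sketch of such a tool follows. The name `run_tests_and_return_summary` comes from the text above; the parameters, the optional `on_failure` hook (a stand-in for whatever fix step runs between attempts), and the result shape are assumptions.

```python
import subprocess
from typing import Callable, Optional


def run_tests_and_return_summary(
    command: list,
    max_attempts: int = 3,
    on_failure: Optional[Callable[[str], None]] = None,
    tail_lines: int = 20,
) -> dict:
    """Run the test command in a capped loop and return one structured result."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempts = 0
    passed = False
    output = ""
    while attempts < max_attempts and not passed:
        attempts += 1
        proc = subprocess.run(command, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        passed = proc.returncode == 0
        if not passed and on_failure is not None and attempts < max_attempts:
            on_failure(output)  # hypothetical fix step before the next attempt
    # Only the tail of the output reaches the main agent's context.
    tail = "\n".join(output.splitlines()[-tail_lines:])
    return {"passed": passed, "attempts": attempts, "output_tail": tail}
```

The main agent sees a single dictionary per call, however many attempts ran inside.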

Measuring orchestration health

Track metrics tied to state, not vibes:

Tokens per iteration (slope and spikes).
Compaction events per session.
Tool result bytes before/after truncation.
Resume success rate after crash.
Tasks abandoned after context errors.

When compaction frequency climbs, your agent is outgrowing its context strategy. Fix policy before buying a bigger window.

Worked scenario: long refactor

Imagine renaming an API across forty files. A naive agent keeps every file read in context. By file twenty, rot sets in and it re-edits file three wrong.

A stronger orchestration plan: maintain a todo file via tool, compact raw reads after each successful edit, pin the rename rule in a header, run a final grep tool to verify zero old symbols. State lives in the repo and logs, not only in chat history.

When to externalize memory

Move facts out of context when they are large, stable, and queryable: ticket databases, code indexes, runbooks. Retrieve on demand with citations the model can follow up. Keep in context only what informs the next one or two decisions.

RAG inside an agent is normal. The retrieval step is just another tool or harness pre-step before the model plans.

Time and staleness

State is not only size. It goes stale. A summary written before a deploy may be wrong after. Harnesses should timestamp summaries and invalidate them when mutating tools succeed. On resume after hours away, re-read critical facts instead of trusting compacted prose blindly.

Clocks and versions belong in tool results: fetched_at, git_sha, schema_version. Small fields, large debugging payoff.
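
Invalidation on mutation can be sketched with a version counter. This example deliberately uses a counter instead of a wall-clock timestamp so its behavior is deterministic, and it is coarse: any successful mutation invalidates every summary. Both choices, and the class names, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    text: str
    written_at_version: int  # world version when the summary was written


class SummaryCache:
    def __init__(self) -> None:
        self.world_version = 0
        self._summaries: dict = {}

    def put(self, key: str, text: str) -> None:
        self._summaries[key] = Summary(text, self.world_version)

    def record_mutation(self) -> None:
        """Call after any mutating tool succeeds."""
        self.world_version += 1

    def get(self, key: str) -> Optional[str]:
        """Return the summary only if the world has not changed since it was written."""
        s = self._summaries.get(key)
        if s is None or s.written_at_version != self.world_version:
            return None  # missing or stale: the caller should re-read the source
        return s.text
```

A finer-grained version would track which paths or resources each summary depends on and invalidate only those.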

Git as external state

Coding agents already use git as memory: commits mark progress, branches isolate experiments, diffs summarize change. The harness can require commits at checkpoints, attach git diff --stat to context instead of full files, and roll back on failed verification. Treat VCS as part of orchestration, not a side effect of the model "deciding" to commit.

Human edits mid-run

Users will edit files while the agent runs. The harness should detect filesystem drift: re-read before write, or lock files during agent edits. Conflicts should surface as tool errors the model can resolve, not silent overwrites.

Same for external API state: if a human closes a ticket the agent is updating, return not_found and let the loop replan.

Multiplayer agents (human plus bot on the same repo) are the norm in coding tools. Orchestration without drift detection is how you get overwritten changes and angry users.
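
The re-read-before-write check can be sketched with a content hash. The function names and the error code `file_changed_since_read` are hypothetical. Note that a small race window remains between the check and the write, which only a file lock would fully close.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash the file contents as they are on disk right now."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def guarded_write(path: Path, new_content: str, digest_when_read: str) -> dict:
    """Write only if the file still matches what the agent last read."""
    if file_digest(path) != digest_when_read:
        # Surface drift as a tool error instead of silently overwriting.
        return {"ok": False, "error": "file_changed_since_read", "path": str(path)}
    path.write_text(new_content, encoding="utf-8")
    return {"ok": True, "digest": file_digest(path)}
```

On a drift error the model can re-read the file, merge its intended change, and try again with the fresh digest.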

Backpressure

When compaction cannot keep up, the harness should refuse new tool reads until summarization completes, or switch to a cheaper summary model. Letting context grow while hoping the next turn works is how rot incidents start. Backpressure signals belong in metrics dashboards, not only in user error toasts.

Treat summarization itself as a budgeted operation: it costs tokens and latency. Run it when metrics cross thresholds, not on a fixed schedule that ignores task shape.

Document compaction decisions in logs (compaction_reason=context_budget) so replay explains sudden shifts in the model's behavior. Operators should never wonder why the agent "forgot" something without a log line explaining the snip.
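
Threshold-driven compaction with a logged reason might look like the sketch below. The 0.7 and 0.9 ratios, the action names, and the `context_budget_hard` reason are assumptions; only `compaction_reason=context_budget` comes from the text above.

```python
import json


def compaction_decision(context_tokens: int, budget_tokens: int,
                        soft_ratio: float = 0.7, hard_ratio: float = 0.9) -> dict:
    """Decide what to do from how full the context budget is."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    usage = context_tokens / budget_tokens
    if usage >= hard_ratio:
        # Backpressure: refuse new tool reads until compaction completes.
        action, reason = "block_reads", "context_budget_hard"
    elif usage >= soft_ratio:
        action, reason = "compact", "context_budget"
    else:
        action, reason = "none", None
    return {"action": action, "compaction_reason": reason, "usage": round(usage, 3)}


def log_line(decision: dict, step: int) -> str:
    """One JSON line per decision, so replay can explain what was dropped and why."""
    return json.dumps({"step": step, **decision}, sort_keys=True)
```

Emitting the same decision dictionary as a metric makes backpressure visible on dashboards as well as in the session log.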

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the difference between context rot and hitting the context limit?
Name three layers of state outside the live transcript.
Why are memory pointers cheaper than re-injecting full tool output?
What should a session log enable after a crash?

Quick check

The API rejects the request for being too long
Model quality drops as context grows, especially for middle content
The model stops accepting tool calls

Compaction to preserve budget and orientation
Deleting the only record of what happened
Fine-tuning the model

Only in the model's context window
On disk in the real filesystem the tools write to
In the tool schema

The same large tool output being re-injected every turn
You need a smaller model
Too many parallel read tools