Lesson 07

Multi-agent patterns: when they help, when they hurt

Adding more agents is not free parallelism. It is more context, more handoffs, more failure surfaces. Some tasks genuinely need specialists and orchestrators. Many do not. This lesson is how to tell the difference.

The one idea

Multi-agent systems trade coordination cost for role separation. They help when tasks decompose cleanly with narrow tool sets per role. They hurt when you have a bag of agents with no orchestrator, because errors compound down the chain.

What "multi-agent" actually means

At minimum: more than one model-driven loop, with messages or state passed between them. Common shapes:

Orchestrator + workers: one planner delegates subtasks to specialists.
Pipeline: agent A output feeds agent B feeds agent C (research → draft → review).
Parallel fan-out: multiple workers on subproblems, orchestrator merges.
Critic / verifier: second agent checks the first agent's work.

Each shape is still harness code. The models do not magically coordinate. Something assigns roles, passes context, enforces budgets per agent, and merges results.
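To make "harness code" concrete, here is a minimal sketch of an orchestrator-plus-workers shape. Everything in it is hypothetical: `run_worker` is a stub standing in for a real model-driven loop, the role names are invented, and a wall-clock timeout is used as a stand-in for a per-agent budget.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerResult:
    role: str
    output: str


async def run_worker(role: str, subtask: str) -> WorkerResult:
    # Stub: a real worker would run a model loop with that role's tools.
    await asyncio.sleep(0)
    return WorkerResult(role=role, output=f"{role} finished: {subtask}")


async def orchestrate(
    task: str, roles: list[str], per_worker_timeout_s: float = 5.0
) -> list[WorkerResult]:
    # The harness (not the model) assigns roles and scopes each worker's context.
    jobs = [
        asyncio.wait_for(
            run_worker(role, f"{task} [{role} part]"),
            timeout=per_worker_timeout_s,  # crude per-agent budget
        )
        for role in roles
    ]
    results = await asyncio.gather(*jobs)
    # Merge step: deterministic ordering here; a real merge would validate first.
    return sorted(results, key=lambda r: r.role)
```

Calling `asyncio.run(orchestrate("audit repo", ["reviewer", "researcher"]))` returns one result per role, sorted by role name. The point of the sketch is that role assignment, context passing, budgets, and merging all live in ordinary code around the model calls.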

The error amplification problem

In a pipeline, agent B treats agent A's output as ground truth. If each step is 90% accurate and errors are independent, ten dependent steps do not stay 90% accurate. Errors multiply: 0.9¹⁰ ≈ 35% end-to-end.
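The compounding can be checked with a few lines of arithmetic. This sketch assumes every step has the same accuracy and that errors are independent, which is a simplification; the function name is illustrative.

```python
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    # Probability that every dependent step is correct, assuming independence.
    return per_step ** steps


for n in (1, 3, 10):
    print(n, round(end_to_end_accuracy(0.9, n), 3))
# prints:
# 1 0.9
# 3 0.729
# 10 0.349
```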

Studies on unstructured multi-agent networks report error amplification an order of magnitude worse than single-agent baselines. Centralized orchestration contains the damage but does not remove it.

Anthropic's engineering guide Building effective agents points the same way: compose simple patterns with clear handoffs, and avoid open-ended agent debates without external verification. Orchestrator-workers works when the lead model delegates bounded subtasks and merges structured outputs, not when workers free-chat.

Implication: add agents only when each handoff is narrow, verifiable, and cheap to check. Vague handoffs ("make it better") amplify garbage.

Cost and latency fan-out math

Parallel workers multiply spend. Sketch before you ship:

Sequential pipeline A→B→C: latency ≈ L_A + L_B + L_C (each includes model + tools). Cost ≈ C_A + C_B + C_C. Errors compound: if each stage is 90% accurate, three stages yield 0.9³ ≈ 73% end-to-end.
Parallel fan-out (1 orchestrator + N workers): latency ≈ L_orch + max(L_worker) if workers run concurrently. Cost ≈ C_orch + N × C_worker + C_merge. Eight Haiku workers plus one Opus orchestrator can beat one Opus on wall-clock, but they win on cost only if C_orch + N × C_worker + C_merge stays below the single-agent alternative.

Worked example: one Opus session costs ~$2.00 and takes ~120s. Five worker sessions at $0.15 each ($0.75) + orchestrator at $0.50 + merge pass at $0.20 = $1.45 total and ~45s wall-clock if workers parallelize: a win on both axes. The same pattern at N=20 without caps costs $3.70 (20 × $0.15 + $0.50 + $0.20), well above the $2.00 baseline, and adds queueing delays; add coordination bugs and you may lose on success rate too.

Rule of thumb: model fan-out cost as N × (avg tokens per worker) × price plus orchestrator overhead. Require a 20–30% success or latency margin over single-agent before keeping the topology.
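The fan-out formulas above fit in a small estimator. This is a sketch, not a pricing tool: the function and field names are made up, and the split of the ~45s wall-clock into 15s of orchestrator time and 30s for the slowest worker is an assumption, since the worked example only gives the total.

```python
from dataclasses import dataclass


@dataclass
class FanOutEstimate:
    cost: float
    latency_s: float


def estimate_fan_out(
    worker_costs: list[float],
    worker_latencies_s: list[float],
    orch_cost: float,
    merge_cost: float,
    orch_latency_s: float,
) -> FanOutEstimate:
    # Cost ≈ C_orch + sum of worker costs + C_merge
    cost = orch_cost + sum(worker_costs) + merge_cost
    # Latency ≈ L_orch + max(L_worker), assuming workers run concurrently
    latency = orch_latency_s + max(worker_latencies_s)
    return FanOutEstimate(cost=cost, latency_s=latency)


# Five workers at $0.15 each; 15s/30s latency split is assumed.
five = estimate_fan_out([0.15] * 5, [30.0] * 5, 0.50, 0.20, 15.0)
# Twenty workers, same per-worker numbers, ignoring queueing delays.
twenty = estimate_fan_out([0.15] * 20, [30.0] * 20, 0.50, 0.20, 15.0)
print(round(five.cost, 2), five.latency_s)  # 1.45 45.0
print(round(twenty.cost, 2))                # 3.7
```

Comparing both numbers against the single-agent baseline ($2.00, 120s) before building anything is the whole exercise.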

When multi-agent helps

Good fits share traits:

Role-specific tool sets: researcher read-only, executor write access. Reduces accidental destructive calls.
Parallelizable subtasks: scan ten files independently, merge summaries.
Different model tiers: strong orchestrator, cheap workers for bulk steps.
Explicit verification gate: critic runs tests or lint before merge.
Long horizon with clean interfaces: subagent returns structured JSON, not prose essays.

Frontier coding products use orchestrator plus subagents not because one model cannot write code, but because splitting context and tools keeps each loop focused. Opus coordinating Haiku workers can beat Opus alone on some benchmarks when the harness matches training.

Workers get smaller, focused contexts and cheaper models. The orchestrator handles planning and integration. When the split is clean, total cost and context pressure can drop. The catch: orchestration is a trained skill tied to a specific harness shape. A model trained for solo work may not orchestrate well without co-training.

When multi-agent hurts

Red flags:

Agents debate in prose without external verification.
Every agent has the same tools and same system prompt copy-pasted.
No per-agent budget (one loop burns the whole pipeline budget).
Handoffs are full chat logs instead of structured artifacts.
Task could be a single agent with phased tool gating.

The $47k loop incident was multi-agent: two roles reinforced each other without caps. More agents would have made it worse without per-role guards.
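A per-role guard can be very small. The sketch below counts iterations per agent and across the session and refuses the next step once either cap is hit. The class and exception names are hypothetical, and iterations are used as the budget unit for simplicity; tokens or dollars would work the same way.

```python
class BudgetExceeded(Exception):
    """Raised when an agent or the whole session runs out of iterations."""


class Budget:
    def __init__(self, global_cap: int, per_agent_cap: int) -> None:
        self.global_cap = global_cap
        self.per_agent_cap = per_agent_cap
        self.used: dict[str, int] = {}

    def charge(self, agent: str) -> None:
        # Call once before every loop iteration of `agent`.
        agent_used = self.used.get(agent, 0)
        if agent_used >= self.per_agent_cap:
            raise BudgetExceeded(f"{agent} hit its per-agent cap")
        if sum(self.used.values()) >= self.global_cap:
            raise BudgetExceeded("session hit the global cap")
        self.used[agent] = agent_used + 1
```

With `Budget(global_cap=5, per_agent_cap=3)`, a planner that tries a fourth iteration is stopped even though the session still has budget left, and a critic is stopped once the session total reaches five. Two roles feeding each other cannot outrun both limits.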

Design rules that survive contact with production

Structured handoffs: schema for what crosses the boundary (findings, file paths, verdict enum).
Least privilege per role: minimum tools for that role's job.
Per-agent and global budgets: iteration caps at both levels.
Trust boundaries: treat subagent output like tool output (untrusted until validated).
Trace hierarchy: parent trace ID, child spans per agent for debugging.
Merge policy in harness code: orchestrator does not blindly paste worker prose into main context.
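The first and fourth rules combine naturally: define a schema for the handoff, then parse worker output as untrusted input and reject anything that does not fit. The sketch below uses only the standard library. The field names follow the example in the list (findings, file paths, verdict enum); the specific verdict values are assumptions.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs_review"


@dataclass(frozen=True)
class Handoff:
    findings: list[str]
    file_paths: list[str]
    verdict: Verdict


def parse_handoff(raw: str) -> Handoff:
    # Worker output is untrusted: any deviation from the schema raises ValueError.
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"findings", "file_paths", "verdict"}:
        raise ValueError("handoff must contain exactly findings, file_paths, verdict")
    for key in ("findings", "file_paths"):
        value = data[key]
        if not isinstance(value, list) or not all(isinstance(x, str) for x in value):
            raise ValueError(f"{key} must be a list of strings")
    return Handoff(
        findings=data["findings"],
        file_paths=data["file_paths"],
        verdict=Verdict(data["verdict"]),  # unknown verdicts raise ValueError
    )
```

A worker that replies with prose, or with a verdict outside the enum, fails at the boundary instead of leaking into the orchestrator's context.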

Multi-agent is multi-process scheduling. The OS analogy returns: you need a scheduler, not just more processes.

Engineering reality

Co-training flywheels favor vendor harnesses: models trained to orchestrate inside Claude Code or Codex-shaped loops. Rolling your own multi-agent stack means building your own eval traces and alignment. Budget for that or stay single-agent longer.

Start single-agent, split with evidence

Default path: one agent, good tools, strong compaction, clear evals. Split into multiple agents only when metrics show a bottleneck: context too noisy, roles conflict on tools, or parallel speedup wins after coordination overhead.

If you split, split one dimension at a time: add a verifier before adding a research specialist before adding parallel workers. Measure cost, latency, and success rate at each step.

Tracing multi-agent runs

OpenTelemetry-style traces help when agents span processes. Map:

Trace = full user session.
Parent span = orchestrator iteration.
Child spans = worker runs and individual tool calls.
Baggage = correlation IDs across services.

Without parent/child linkage you cannot tell which worker burned the budget.
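The linkage itself is simple to picture. The sketch below is a plain-Python stand-in for a span tree, not the OpenTelemetry API: the `Span` class, its fields, and the token counts are all illustrative. It shows how parent and child IDs let you roll up spend and find the worker responsible.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_next_id = count(1)


@dataclass
class Span:
    name: str
    trace_id: str                      # shared by every span in the session
    parent_id: Optional[int] = None    # None for the root span
    tokens: int = 0                    # tokens spent directly in this span
    span_id: int = field(default_factory=lambda: next(_next_id))
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, tokens: int = 0) -> "Span":
        span = Span(name=name, trace_id=self.trace_id,
                    parent_id=self.span_id, tokens=tokens)
        self.children.append(span)
        return span

    def total_tokens(self) -> int:
        # Own tokens plus everything spent underneath.
        return self.tokens + sum(c.total_tokens() for c in self.children)


def heaviest_child(span: Span) -> Span:
    return max(span.children, key=lambda c: c.total_tokens())


root = Span("orchestrator iteration 1", trace_id="session-123")
summarizer = root.child("worker:summarize", tokens=1200)
summarizer.child("tool:read_file", tokens=300)
searcher = root.child("worker:search", tokens=4000)
searcher.child("tool:grep", tokens=2500)
print(root.total_tokens(), heaviest_child(root).name)  # 8000 worker:search
```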

Co-training and vendor harnesses

Frontier labs train models on traces from their own agent products. Orchestration behavior is harness-specific. Opus coordinating Haiku inside Anthropic's stack may outperform Opus alone; the same Opus in your custom orchestrator may not without your own alignment work.

That does not mean you must use vendor agents forever. It means custom multi-agent stacks need custom evals and trace collection from day one if you expect similar reliability.

Where this course goes next

You now have the vocabulary: chatbot vs agent, tools, harness, loop, state, failures, multi-agent tradeoffs. The adjacent courses on evaluation, safety, and RAG fill in how to measure and harden what you built here.

When you read benchmark headlines, ask which harness ran the test. When you debug a runaway session, open the harness logs first. That habit is most of the job.

Pattern cheat sheet

Pattern	Use when	Avoid when
Single agent	Most tasks, early product	Never (start here)
Orchestrator + workers	Parallel subtasks, tiered models	Handoffs are vague prose
Pipeline A→B→C	Fixed stages with schemas	Each stage needs improvisation
Critic / verifier	High stakes, cheap checks exist	Critic shares same blind spots

Evaluating multi-agent vs single-agent

Run the same task set both ways on cost, success rate, and P95 latency. Multi-agent should clear a margin after coordination overhead. If success rate improves 2% but cost doubles, the business case may fail unless those tasks are extremely high value.
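A comparison like this can be scripted in a few lines. The sketch below makes several assumptions: the `Run` record shape is invented, P95 uses the nearest-rank method, the margin is interpreted as a relative improvement in success rate or P95 latency, and the cost ratio is reported rather than folded into the decision so the business call stays explicit.

```python
from dataclasses import dataclass


@dataclass
class Run:
    success: bool
    cost: float
    latency_s: float


def p95(values: list[float]) -> float:
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), nearest-rank
    return ordered[rank - 1]


def summarize(runs: list[Run]) -> dict[str, float]:
    return {
        "success_rate": sum(r.success for r in runs) / len(runs),
        "mean_cost": sum(r.cost for r in runs) / len(runs),
        "p95_latency_s": p95([r.latency_s for r in runs]),
    }


def compare(single: list[Run], multi: list[Run], margin: float = 0.20) -> dict:
    s, m = summarize(single), summarize(multi)
    success_gain = (
        (m["success_rate"] - s["success_rate"]) / s["success_rate"]
        if s["success_rate"] else 0.0
    )
    latency_gain = (s["p95_latency_s"] - m["p95_latency_s"]) / s["p95_latency_s"]
    return {
        "success_gain": success_gain,
        "latency_gain": latency_gain,
        "cost_ratio": m["mean_cost"] / s["mean_cost"],
        "clears_margin": success_gain >= margin or latency_gain >= margin,
    }
```

A result with `clears_margin` false and `cost_ratio` near 2.0 is the "2% better, twice the price" case described above.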

Log handoff payloads in evals. Inspect failure cases where worker output was syntactically fine but orchestrator merged wrong.

Security note for multi-agent

Treat subagents as untrusted peers. A compromised or prompt-injected worker can pass malicious instructions upstream. Orchestrators should not auto-execute tools that can exfiltrate data (network requests, email, file uploads) based on worker prose. Strip permissions at delegation boundaries the same way you sandbox MCP tools.
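One way to picture permission stripping is as a set intersection at the delegation boundary, plus a gate on sensitive tools when the request originates from worker output. The role names, tool names, and the `SENSITIVE_TOOLS` set below are all hypothetical.

```python
# Hypothetical allowlists: what each role may ever call.
ROLE_TOOLS: dict[str, set[str]] = {
    "researcher": {"read_file", "search"},
    "executor": {"read_file", "write_file", "run_tests"},
}

# Hypothetical tools that can move data out of the system.
SENSITIVE_TOOLS: set[str] = {"http_post", "send_email"}


def delegate_tools(parent_tools: set[str], role: str) -> set[str]:
    # A worker gets at most what the parent holds AND its role allows.
    return parent_tools & ROLE_TOOLS.get(role, set())


def may_auto_execute(tool: str, requested_by_worker: bool) -> bool:
    # Sensitive calls suggested by worker output need an explicit approval step.
    return not (requested_by_worker and tool in SENSITIVE_TOOLS)


parent = {"read_file", "write_file", "search", "http_post"}
print(sorted(delegate_tools(parent, "researcher")))  # ['read_file', 'search']
```

Unknown roles get an empty tool set by default, so a misconfigured worker fails closed rather than inheriting the orchestrator's permissions.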

Future: multimodal workers

Workers that return screenshots or audio chunks multiply context pressure. Orchestrators need merge policies that summarize media results before pasting into the main thread. Multi-agent without multimodal-aware compaction fails fast on vision tasks.

Single-agent with good tools still beats multi-agent with sloppy media handoffs. Fix compaction first.

Course wrap-up

You started with chatbot vs agent, moved through tools and harness ownership, the loop, long-run state, failures, and multi-agent tradeoffs. The through-line: the model proposes, the harness disposes. Benchmarks measure agents. Production incidents usually implicate harness policy.

Next steps on the curriculum path: evaluate agent quality systematically, then harden security and guardrails. Build or buy, but instrument either way.

Decision worksheet

Before adding agent B to help agent A, write down: exact input schema B receives, output schema B returns, tools B may call, iteration cap for B, global cap for the session, and how orchestrator validates B's output. If any field is "TBD," stay single-agent until it is not.
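If it helps to make the gate mechanical, the worksheet can live as a small record with a check for unfilled fields. The field names below mirror the six items in the paragraph above; the class and function names, and the example values, are illustrative.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentWorksheet:
    input_schema: str
    output_schema: str
    allowed_tools: str
    agent_iteration_cap: str
    global_session_cap: str
    output_validation: str


def unresolved_fields(ws: AgentWorksheet) -> list[str]:
    # Any blank or "TBD" field means: stay single-agent for now.
    return [
        f.name for f in fields(ws)
        if getattr(ws, f.name).strip().upper() in {"", "TBD"}
    ]


draft = AgentWorksheet(
    input_schema="list of file paths",
    output_schema="TBD",
    allowed_tools="read_file, search",
    agent_iteration_cap="8",
    global_session_cap="40",
    output_validation="",
)
print(unresolved_fields(draft))  # ['output_schema', 'output_validation']
```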

That worksheet takes ten minutes and prevents week-long detours into unstructured agent chat rooms.

Revisit the worksheet when someone proposes "let's add another agent." Most proposals fail the schema column. That is a feature, not a blocker.

Reading list (optional)

For deeper dives after this course: vendor system cards that separate model vs agent scores, OWASP LLM Top 10 for injection classes, OpenTelemetry GenAI conventions for tracing, and research on multi-agent error amplification. None replace building one harness and logging one bad session yourself.

Return to lesson one if jargon creeps back in. Model, harness, agent: keep the boundary sharp in every design review.

If you build one thing next, build session replay with iteration-level token counts. Everything else in this course gets easier once you can see the loop run.

Multi-agent is optional sophistication. A reliable single-agent harness is mandatory foundation. Most teams should not skip straight to orchestrator fantasies. Earn multi-agent complexity with eval numbers, not architecture diagrams.

When evals justify splitting, start with one worker role and one orchestrator. Measure for a week before adding a third agent. The worksheet from this lesson is the gate. If metrics regress, merge back to single-agent without shame. Simpler topology is a valid outcome of experimentation. Document the decision either way so the team does not relitigate it monthly. That discipline scales.

Checkpoint

You've finished the course when you can answer these from memory:

Why do errors compound in agent pipelines?
Name two situations where multi-agent is a good fit.
What is a structured handoff and why does it matter?
Why do per-agent budgets matter in addition to a global cap?
