Learn/Agents, Tools & Harnesses/Lesson 04

Lesson 04

The agent loop: plan, act, observe, recover

Every agent, from a weekend script to a production coding assistant, runs the same core cycle. Naming the phases helps you design stop conditions, recovery paths, and logs that actually debug real failures.

The one idea

The agent loop is plan (model decides), act (harness runs tools), observe (results enter context), recover (errors and compaction handled), repeat until done or stopped.

The canonical loop

Strip away vendor branding and you get:

Accept user goal and session state.
Assemble prompt: system config, memory, tool schemas, conversation history.
Call the model.
If the model returns a final answer, deliver it and exit.
If the model requests tools, validate and execute them.
Append tool results to state.
Check budgets and safety guards.
Go to step 2.

The model never skips straight from step 3 to the filesystem. Step 5 is always harness code.

Every iteration ends in either a user-visible answer or more context for the next model call.

Plan: what the model does each turn

"Plan" does not mean the model always emits a formal plan document. It means the model chooses the next move given everything in context: answer now, call a tool, call several tools, ask a clarifying question, or revise an earlier approach.

Some harnesses encourage explicit planning: write a todo list tool, think step-by-step in a scratchpad, or require a short rationale before destructive tools. Those patterns trade tokens for steerability. They help on long tasks where the model otherwise drifts after ten tool calls.

Planning quality is sensitive to what the harness puts in context. If the original user goal scrolled out of the window, "planning" becomes guesswork.

Act: what the harness does each turn

When tool calls arrive, the harness:

Parses structured output from the model response.
Validates against registry and policy.
Optionally asks the human to approve.
Runs implementations with timeouts.
Captures stdout, JSON, errors, and metadata.

Act is where side effects happen. A bad act phase cannot be undone by a better model on the next turn. Idempotent tools, dry-run modes, and snapshots before writes are harness concerns.

Observe: feeding reality back

Tool results become user-role or tool-role messages in the transcript. The model's next plan is only as good as those observations.

Good observation design:

Keep results focused on what decision comes next.
Label errors clearly so the model can branch.
Include stable identifiers (file path, record ID) the model can reuse.
Avoid dumping secrets or PII back into context.

Observation also includes non-tool events: compaction summaries ("here is what we decided so far"), retrieved memory snippets, and user corrections mid-run.

A doom loop is when the model repeats a self-check without external verification: read code, decide it looks fine, stop without running tests. The harness saw a clean end_turn and exited. The task was not actually done. Fixes include required verification tools, harness checks that gate completion, or eval hooks that reject premature stops.

Recover: errors are normal

Production loops treat failure as the default path, not an exception. Recovery strategies:

Retry with backoff for transient network errors on tools.
Return structured tool errors so the model can change arguments.
Circuit breakers when the same tool fails identically three times.
Compaction when context is full instead of crashing mid-run.
Checkpoint and resume after process crash or user pause.
Escalate to human when policy requires approval or confidence is low.

Recovery is not "make the model try harder." It is harness logic that changes state or stops spending.

Engineering reality

Log stop_reason every iteration: tool_use, end_turn, max_tokens, circuit_breaker, error. When debugging a bad run, stop reason tells you whether the model thought it finished, the window blew up, or your harness killed the loop. Without that field you are guessing.

When to stop

Exit conditions are harness policy:

Model returns final natural language answer with no pending tools.
Iteration or token budget exhausted.
Cost budget exhausted.
Stuck detection: identical tool call repeated N times.
User cancel.
Harness-specific completion checks (tests passed, ticket created).

Relying only on the model to know when it is done fails on long tasks. Pair model judgment with external verification where stakes are high.

Mapping phases to logs

Operators should see one log row per iteration with fields that map to plan/act/observe/recover:

Plan: model output tokens, tool names requested, or final answer flag.
Act: tool latency, success/failure, policy denials.
Observe: result size injected, compaction events.
Recover: retries, circuit breaker triggers, summarization runs.

When something breaks at 2 a.m., you want a timeline, not a prose novel.

Human-in-the-loop as a phase

Approval gates are part of the loop, not an interruption of it. The harness pauses act until the user accepts or rejects a pending tool batch. State must record pending calls cleanly so resume does not double-execute.

Good UX pairs approvals with diffs (what file will change) and reversible actions (snapshots). The model proposes; the human and harness share responsibility for act.

ReAct: the named pattern behind the loop

The loop you just saw has a research name: ReAct (Reason + Act), from Yao et al. (arXiv:2210.03629). The model interleaves reasoning traces with tool actions: think briefly, call a tool, observe the result, repeat. Modern APIs hide the "thought" in structured tool blocks, but the rhythm is the same.

ReAct-style loops are verbose (reasoning tokens cost money) but debuggable—you can read why the model chose a tool. Plan-then-act variants compress reasoning into a first pass; pick based on evals, not fashion.

Harness loads system config, tool schemas, pinned user goal, working memory, and recent transcript. Budget check: if over threshold, compact before calling the model.

Model returns either a final answer or one or more tool calls (ReAct: may include short reasoning text). Harness buffers streamed tokens until the message is complete.

Validate each tool call against registry and policy. Queue approvals if needed. Execute tools with timeouts. Capture results and latencies.

Append tool results (or structured errors) to state. Update working memory. Log iteration with stop_reason and token counts.

If caps hit, circuit breaker fires, or model signaled done, exit with user-visible summary. Otherwise loop to step 1.

Streaming and user feedback

Streaming tokens to the UI does not change loop semantics, but it changes perceived latency. Users tolerate a 90-second task if they see progress every few seconds.

UX patterns that work:

Stream reasoning text while tool JSON buffers at the end of the message.
Show tool cards as soon as the harness parses a complete tool block: "Running tests…" with spinner during act.
Emit harness events on a side channel (SSE/WebSocket): tool_start, tool_end, compaction, awaiting_approval.
Never parse partial tool JSON for execution—wait for the full assistant message or provider "tool_use stop" signal.

A silent gap during a two-minute shell command feels broken even when work is happening. Surface act-phase progress separately from model token stream.

Stopping gracefully

When a cap fires, return a partial result to the user: what was tried, what succeeded, what remains, and how to resume. Silent kills erode trust. The harness should assemble that summary from structured logs even if the model never gets another turn.

Latency inside the loop

Each iteration adds model latency plus tool latency. Sequential tools multiply wait time. Parallel reads help only when the API and harness support them and when results are independent.

Expose partial progress to users during long act phases (running tests, indexing repo). Perceived latency matters. A silent two-minute gap feels broken even if work is happening.

Testing the loop

Integration tests should drive the harness with scripted model responses (fake tool calls, fake final answers) without calling the real model. Assert state transitions, cap behavior, and error formatting. Model nondeterminism belongs in evals, not in CI for loop logic.

Common loop bugs (and fixes)

Double execution after approval: pending tool state not cleared. Lost tool results: wrong message role when appending. Early exit: treating empty assistant text as done while tools still pending. Runaway cost: no cap on parallel tool fan-out. Each bug is harness state machine logic, not model IQ.

Draw the loop as a state diagram on a whiteboard for your team. States: idle, awaiting_model, executing_tools, awaiting_human, stopped_ok, stopped_error. Transitions are code paths you can test.

Streaming and the loop

See the streaming UX section above for user-facing patterns. Implementation reminder: buffer until tool blocks are complete before act. Partial JSON from early stream flush is a common parser bug.

Variants you will see in the wild

ReAct-style: model interleaves reasoning text and tool calls in one transcript. Easy to debug, verbose on tokens. Named in the ReAct section above.

Plan-then-act: first pass produces a plan object; harness executes phases with different tool sets. Good for governance, harder to improvise.

Tool-only turns: some APIs separate assistant text from tool blocks strictly; harness must handle empty user-visible replies while work continues.

Pick based on evals and UX, not Twitter diagrams. The phases plan/act/observe/recover still apply underneath.

Recovery patterns in practice

When a tool times out, the harness can retry once with backoff, then return a structured error. When validation fails, return which field failed and the allowed enum values. When the model hits max_tokens, trigger compaction and retry the same user goal without asking the human to resend. Each path should be a named branch in code, not an improvised prompt tweak during an incident.

Developers sometimes want full traces for debugging. The model usually needs a short error class and message. Log the full trace server-side with a correlation ID. If the model must see details, truncate to the frames that suggest fix action. Long traces eat context and encourage hallucinated fixes.

Iteration budgets in product terms

Pick defaults users can override: soft warning at iteration 15, hard stop at 40, cost cap at $5 unless enterprise tier. Surface remaining budget in the UI so users understand why a task stopped. Hidden caps feel like random failures.

Document those numbers in your runbook. On-call should not grep code to learn the iteration limit during an outage.

The loop is boring infrastructure until it is not. Most of agent reliability is making the boring path correct every time. Ship the loop before you ship the demo video.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What are the four phases plan, act, observe, recover?
Which phase executes side effects?
What belongs in a structured tool error for the next turn?
Name two harness-level stop conditions besides "model said done."

Quick check

During the model's forward pass
In the act phase when the harness runs tools
In the observe phase when results are appended

To tell whether the model finished, hit limits, or was stopped by the harness
To fine-tune the model automatically
Because users want to read it in the UI

A successful completion
A doom loop / premature stop the harness should guard against
Fixed by raising max iterations only

Making the model smarter
Recovering from context growth on long multi-step tasks
Eliminating the need for tools