Tool use and function calling in production
Tool use is how a language model reaches outside itself. The model does not call your API. It emits a structured request. The harness validates it, runs the function, and feeds the result back. Most production pain lives in that handoff.
Function calling is a contract between model and harness: JSON schemas describe what tools exist, the model outputs a tool name plus arguments, and the harness is the gatekeeper that validates, executes, and formats results.
From text completion to structured actions
Early "tool use" hacks asked the model to print JSON inside markdown fences. That works in demos and breaks in production. Modern APIs expose tools as first-class objects with names, descriptions, and parameter schemas sent alongside the conversation.
On each turn the model sees:
- The conversation so far (user messages, prior assistant messages, prior tool results).
- A list of available tools and what each one accepts.
- Optionally a system prompt defining role, safety rules, and output format.
The model's reply is not always plain text. It may be a tool use block: a structured payload saying "call search_orders with customer_id=...." The harness parses that payload. If it is valid and permitted, it runs the real function and appends a tool result message. Then the loop continues.
This separation is deliberate. The model is good at choosing among described actions. The harness is good at enforcing types, permissions, timeouts, and audit logs.
The tool registry
Every production harness maintains a registry: the set of tools the model is allowed to know about on this session. Each entry typically includes:
- Name: stable identifier the model must emit exactly.
- Description: when to use this tool and what it returns. This text is part of the prompt budget and part of the UX for the model.
- Input schema: JSON Schema (or equivalent) for required and optional fields, types, and enums.
- Implementation: code the harness runs. The model never sees this.
Registry design is product design. Too few tools and the model improvisates in text. Too many similar tools and the model confuses them. A registry with forty overlapping HTTP wrappers will hallucinate parameters more often than a registry with eight well-named tools.
Descriptions are part of the interface
Engineers often treat tool descriptions as documentation leftovers. For agents they are API docs the model reads on every turn. A vague description produces vague calls. A precise one names edge cases: "Use search_users by email only when the input contains @; otherwise use search_users_by_name."
Good descriptions answer:
- What does this tool do in one sentence?
- When should the model choose it over neighbors?
- What does each argument mean, including units and format?
- What failures look like (empty result vs error)?
Descriptions also cost tokens. A bloated registry eats context before the user task starts. The art is being specific without dumping your entire internal wiki into the system prompt.
Models generalize from training data about APIs. If you expose get_user, fetch_user, and load_user_profile, the model may pick the wrong one or invent arguments that fit a different API shape it saw in pretraining. Prefer one canonical tool per action. Merge synonyms in your application layer instead of exposing them all.
Validation before execution
Never trust a tool call because the JSON looks plausible. The harness should check:
- Tool name exists in the registry.
- All required fields are present.
- Types match the schema (string vs integer vs enum).
- No extra fields the schema did not allow (hallucinated parameters).
- Values pass domain rules (path inside allowed directory, ID format, rate limits).
OpenAI's strict: true mode and similar features constrain argument shape at decode time. That catches syntax-level mistakes early. Semantic mistakes still slip through: right-shaped call, wrong customer ID, destructive action on the wrong resource.
Validation is cheap insurance. A large share of tool errors never reach your database if the gatekeeper rejects malformed calls and returns a structured error the model can read on the next turn.
Tool results become part of the next prompt. A 50 KB JSON blob from a careless list_all_records tool can blow your context budget in one turn. Truncate, summarize, or paginate at the harness. The model does not need every column of every row to decide the next step.
Errors the model can recover from
When a tool fails, return something the model can use, not a stack trace aimed at humans. A useful pattern:
- error_type:
not_found,permission_denied,timeout,invalid_argument - message: short, factual explanation
- hint: optional suggestion (check ID format, try narrower search)
The model often retries with corrected arguments. If you only return "500 Internal Server Error," the loop spins.
Also distinguish tool execution failed from tool returned empty but valid. Empty search results are data. Exceptions are signals to change approach.
Parallel vs sequential tools
Some APIs let the model request multiple tools in one turn. That can cut latency when reads are independent (fetch user profile and fetch order history in parallel). Writes are riskier in parallel: race conditions, partial failures, harder audit trails.
Production harnesses often allow parallel read-only tools but serialize mutations. That policy belongs in code, not in a polite sentence in the system prompt.
The round trip, step by step
One tool-use iteration in detail:
- Harness sends messages plus tool definitions to the API.
- Model responds with a stop reason indicating tool use and one or more tool blocks.
- Harness parses blocks, validates each against schema and policy.
- Harness executes approved tools, collecting results and timings.
- Harness appends assistant tool-use message and user tool-result message to state.
- Loop continues with updated state.
Any break in steps 3-5 surfaces as "the agent is dumb." Often the model chose a reasonable intent; the handoff failed.
Writes need idempotency thinking
Read tools are safe to retry. Write tools are not unless designed that way. If the model retries create_ticket after a timeout, you may get duplicate tickets unless the harness deduplicates by idempotency key or the tool checks for existing records.
Design write tools with:
- Explicit confirmation or dry-run mode.
- Idempotency keys the harness can reuse on retry.
- Clear success payloads the model can cite on later turns.
Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open standard for connecting agents to external tools and data. Instead of every IDE and agent framework inventing its own plugin wire format, MCP defines how a host (your harness) discovers tools from an MCP server, calls them, and streams results back into context.
Why it matters in production:
- Interoperability: one MCP server (Postgres, GitHub, Slack) can plug into multiple clients without bespoke adapters per product.
- Composable tool surface: teams ship capabilities as servers; the harness merges them into the registry per session.
- Explicit trust boundaries: each server declares what it can access; the host decides which servers are enabled for which user.
Permission model (one paragraph): the user (or admin) approves which MCP servers run in a session. The host maps each tool to least-privilege credentials: read-only DB user, scoped API token, sandboxed filesystem root. The model never holds those secrets; it only sees tool names and schemas the host registered. If a server updates its tool list after approval—a "rug pull"—the host should re-prompt or block until the user re-approves. Treat MCP like installing browser extensions: convenient, but each new server expands blast radius.
MCP does not replace harness validation. You still schema-check every call, truncate results, and log executions. For sandboxing patterns and per-tool least privilege, see Safety L03: Sandboxing and least-privilege tools.
Use native function calling when tools are first-party, few, and tightly coupled to your backend. Add MCP when you need a growing catalog of third-party integrations, when security teams want auditable server boundaries, or when the same tool servers must work across multiple agent products. Many stacks do both: core business tools in-process, long tail via MCP.
Tool descriptions as prompt engineering
Descriptions are not documentation leftovers—they are prompt text the model reads every turn. Weak descriptions cause wrong-tool selection; overlong descriptions eat context. Treat each description like a micro-prompt: one-sentence purpose, when-to-use vs neighbors, argument semantics, and failure shapes. A/B description variants in evals the same way you A/B system prompts. See Prompting L06 for schema authoring; this course owns what the harness does after the model emits a call.
Choosing tools for the registry
Start with fewer tools than you think you need. Each tool is a branch in the model's decision tree. Prefer composable primitives:
searchplusreadinstead of ten bespoke lookup tools.run_sql_readonlyseparate fromrun_sql_write.- Domain actions with clear nouns:
create_refund, nothandle_customer_issue.
Add tools when evals show repeated failure modes that a new capability fixes, not when brainstorming whiteboard columns.
Testing tool contracts
Unit-test the harness gatekeeper: feed valid calls, malformed calls, hallucinated fields, and out-of-policy values. Snapshot the error messages returned to the model. Those strings are part of your UX for recovery.
Also test token weight: mock a huge tool result and assert the harness truncates before the next API call. That test prevents 3 a.m. pages about runaway bills.
Anti-patterns in tool design
Avoid tools that are really "run arbitrary code" unless heavily sandboxed. Avoid mega-tools that take ten parameters with optional everything. Avoid returning HTML pages when JSON summaries suffice. Each anti-pattern increases hallucination rate and recovery time.
Prefer tools that mirror stable internal APIs your team already tests. The model learns the contract from the schema; your backend team already knows the edge cases.
Document tool contracts in the same repo as the harness. When the API changes, update schemas in the same pull request. Drift between real API and schema is a top source of silent tool failures.
Provider differences (conceptual)
All major APIs converge on the same idea: tools in the request, tool calls in the response, tool results in the next request. Names differ (function calling, tool use, plugins). Message shapes differ slightly. Your harness abstraction should normalize those differences so product code does not fork per vendor.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What three pieces does each tool entry in a registry need?
- Why are descriptions part of the model interface, not just docs?
- What should the harness validate before running a tool?
- Why should tool errors be structured for the model?
Quick check
- The model, using weights learned during training
- The harness, using code registered for that tool
- The LLM API provider's servers
- Tool name validation at the harness gatekeeper
- The model temperature was too high
- The context window was too small
- Better SEO for your API docs
- So the model picks the right tool with the right arguments
- To minimize JSON payload size on the wire
- Pass the entire payload into the next model call unchanged
- Truncate or summarize before injecting into context
- Drop the result and tell the user the tool failed