Structured output
Downstream code wants JSON, enums, and dates it can parse without praying. Models want to chat. Structured output is the bridge: schemas in the prompt, constrained decoding where available, and validation plus retry when the model still drifts.
Asking for JSON is not enough. Production structured output is schema + generation constraints + validation + a recovery path when parsing fails.
Why structure breaks
Left alone, an LLM emits the most likely tokens. That often includes friendly preamble ("Sure! Here is your JSON:"), markdown fences, trailing commas, comments, or a second JSON object when the model keeps talking. Your parser throws. The feature breaks. Users notice.
Structure fails for predictable reasons:
- Ambiguous instructions: "return json" without a schema.
- Competing goals: "be thorough" plus "max 50 tokens."
- Schema too large or nested for the model to track reliably.
- Chatty fine-tuning: models trained to be helpful add prose around data.
- Token cutoffs: generation stops mid-bracket when max_tokens is tight.
Your job is to shrink the failure surface: smaller schemas, clearer delimiters, API modes that restrict tokens, and code that never trusts raw strings.
JSON in the prompt
Start with an explicit schema in the system or user message. JSON Schema snippets, TypeScript interfaces, or a single minimal example object all work. Shorter schemas parse more reliably than ten-page OpenAPI dumps in the prompt.
Rules that help:
- Demand JSON only, no markdown fences, no preamble.
- Name required fields and allowed enums.
- Specify null vs omit for optional fields.
- Give one valid example object in the prompt.
Even with perfect instructions, assume a non-zero error rate at scale. One request in five hundred will sprout ```json anyway.
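The rules above can be folded into a small prompt builder. This is a minimal sketch: the ticket schema, its field names, and `build_system_prompt` are hypothetical, chosen only to show a schema, the null-vs-omit rule, and one valid example sitting together in the system message.

```python
import json

# Hypothetical schema for a support-ticket classifier; field names are illustrative.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "summary": {"type": "string"},
        "order_id": {"type": ["string", "null"]},
    },
    "required": ["category", "summary", "order_id"],
    "additionalProperties": False,
}

# One valid example object, including an explicit null for an unknown value.
EXAMPLE = {"category": "billing", "summary": "Charged twice in March", "order_id": None}


def build_system_prompt(schema: dict, example: dict) -> str:
    """Assemble a system prompt that states the rules, the schema, and one example."""
    return "\n".join([
        "Return a single JSON object and nothing else.",
        "Do not use markdown fences. Do not add text before or after the object.",
        "Include every required key. Use null when a value is unknown; never omit the key.",
        "JSON Schema:",
        json.dumps(schema, indent=2),
        "Example of a valid response:",
        json.dumps(example),
    ])
```

Serializing the schema with `json.dumps` rather than pasting it by hand means the prompt and the validator can share one Python object.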
Constrained decoding and API modes
Stronger than prompting: APIs and open-source servers that restrict the next token to those that keep the output valid against a grammar or schema. Names vary: JSON mode, structured outputs, guided generation, or the name of a library such as Outlines.
JSON mode usually forces valid JSON syntax but not your field names or types. Good for "parseable object," not enough for strict contracts.
Schema-locked modes map generation to a JSON Schema or similar. Invalid tokens get probability zero. This cuts syntax errors dramatically. You still need semantic validation (is this ID real?).
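To make "invalid tokens get probability zero" concrete, here is a toy sketch of token masking. It is not how any particular provider or server implements it: real engines compile the schema into an automaton and mask logits over the model's vocabulary. The vocabulary, scores, and function names below are invented for illustration, and the "grammar" is just a set of complete strings.

```python
import math
from typing import Callable, Iterable, Set


def constrained_greedy_decode(
    allowed_outputs: Set[str],
    vocab: Iterable[str],
    score_token: Callable[[str, str], float],
    max_steps: int = 32,
) -> str:
    """Greedy decoding that skips any token which cannot lead to an allowed output."""
    vocab = list(vocab)
    prefix = ""
    for _ in range(max_steps):
        if prefix in allowed_outputs:
            return prefix
        best_token, best_score = None, -math.inf
        for token in vocab:
            candidate = prefix + token
            # Mask: candidate must still be a prefix of at least one allowed output.
            if not any(out.startswith(candidate) for out in allowed_outputs):
                continue  # same effect as giving this token probability zero
            score = score_token(prefix, token)
            if score > best_score:
                best_token, best_score = token, score
        if best_token is None:
            raise ValueError(f"no valid continuation for prefix {prefix!r}")
        prefix += best_token
    raise ValueError("max_steps reached before a complete output")


# Toy "model": it loves preamble and leans negative at the enum slot.
TOY_SCORES = {"Sure! ": 5.0, "negative": 3.0, "positive": 2.0}


def toy_score(prefix: str, token: str) -> float:
    return TOY_SCORES.get(token, 1.0)


VOCAB = ["Sure! ", '{"', "sentiment", '":"', "positive", "negative", "neutral", '"}']
ALLOWED = {'{"sentiment":"' + label + '"}' for label in ("positive", "negative", "neutral")}
```

Unconstrained, this toy model would open with `"Sure! "`. With the mask, the preamble is never selectable, and the model's preference only matters where the grammar leaves a choice (the enum value). The output is well-formed either way; whether `negative` is the right label is a separate question.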
Grammar-based constraints (regex, EBNF, context-free grammars) help non-JSON formats: SQL subsets, domain-specific languages, phone numbers. Useful when JSON is the wrong shape.
Constraints guarantee syntactic validity, not truth. A schema can force {"sentiment":"positive"} while the text was negative. Semantic evals still matter.
Very large schemas can exceed what the constraint engine or model handles cleanly. Split tasks: extract facts in one call, classify in another. Smaller outputs beat one giant blob.
Four ways to enforce structure
Providers overlap on naming, but the mechanisms differ. Pick based on schema complexity, provider support, and whether you self-host.
| Approach | What it guarantees | Typical tradeoff |
|---|---|---|
| Prompt-only JSON | Nothing syntactic; relies on instruction-following | Cheapest to ship; highest parse-failure rate at scale |
| JSON mode | Valid JSON syntax (object/array), not your field names or enums | Good first API flag; still needs schema validation downstream |
| Grammar / constrained decoding | Tokens restricted to a grammar (JSON Schema subset, regex, EBNF) | Strong syntax; schema size limits vary; more common in OSS servers and libraries (vLLM, Outlines) |
| Native structured output | Provider maps generation to your JSON Schema (OpenAI structured outputs, Anthropic tool/schema modes, etc.) | Best syntax compliance on supported models; vendor-specific APIs and caps on schema depth |
Production stacks usually combine layers: native or grammar mode for generation, JSON Schema validation in code, repair retry on failure. Prompt-only JSON is fine for prototypes, not for billing pipelines.
Validation and recovery
Always run JSON.parse (or equivalent) inside try/catch. Then validate with JSON Schema, Pydantic, Zod, or custom rules: required keys, numeric ranges, enum membership.
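A minimal sketch of parse-then-validate using only the Python standard library. In practice a JSON Schema, Pydantic, or Zod validator would replace the hand-written checks; the `status` and `amount_cents` fields, the allowed values, and the range are hypothetical.

```python
import json

ALLOWED_STATUS = {"approved", "denied", "needs_review"}  # hypothetical enum
EXPECTED_KEYS = {"status", "amount_cents"}               # hypothetical contract


class ValidationError(Exception):
    """Raised when model output is not valid JSON or breaks the contract."""


def parse_and_validate(raw: str) -> dict:
    # Step 1: parsing is allowed to fail; convert the failure into our own error type.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValidationError("top-level value must be an object")

    # Step 2: collect every contract violation so logs show the full picture.
    errors = []
    for key in sorted(EXPECTED_KEYS - set(data)):
        errors.append(f"missing required key: {key}")
    for key in sorted(set(data) - EXPECTED_KEYS):
        errors.append(f"unexpected key: {key}")

    status = data.get("status")
    if "status" in data and (not isinstance(status, str) or status not in ALLOWED_STATUS):
        errors.append(f"status not in enum: {status!r}")

    amount = data.get("amount_cents")
    if "amount_cents" in data:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, int):
            errors.append("amount_cents must be an integer")
        elif not 0 <= amount <= 1_000_000:
            errors.append("amount_cents out of range")

    if errors:
        raise ValidationError("; ".join(errors))
    return data
```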
Recovery ladder:
- Strip wrappers: regex or heuristic removal of markdown fences and leading prose.
- Repair pass: a small deterministic fixer (close brackets) or a cheap second model call: "fix this JSON to match schema."
- Retry generation: same prompt with lower temperature and a note "previous output invalid."
- Fail closed: return an error to the user or route to human review. Never silently pass garbage to billing code.
Log failure types: fence detected, truncate, wrong enum, extra keys. Those metrics guide prompt and schema changes.
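The ladder above can be sketched as one function. Everything here is an assumption made for illustration: `generate` stands in for whatever model client you use (it takes a prompt and a temperature and returns text), the temperatures are arbitrary, and `metrics` is a plain `Counter` standing in for a real metrics client.

```python
import json
import re
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

Repair = Tuple[str, Callable[[str], str]]  # (name used in logs, repair function)


class StructuredOutputError(Exception):
    """Raised when every rung fails, so callers fail closed."""


def try_parse(raw: str) -> Optional[dict]:
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def drop_trailing_commas(raw: str) -> str:
    # Deterministic fixer: remove a comma that directly precedes } or ].
    # Deliberately naive: it would also alter such a sequence inside a string value.
    return re.sub(r",\s*([}\]])", r"\1", raw)


def generate_with_recovery(
    generate: Callable[[str, float], str],
    prompt: str,
    repairs: Sequence[Repair],
    metrics: Counter,
    max_retries: int = 1,
) -> dict:
    note = ""
    for attempt in range(max_retries + 1):
        temperature = 0.7 if attempt == 0 else 0.0  # retries run colder
        raw = generate(prompt + note, temperature)
        parsed = try_parse(raw)
        if parsed is not None:
            return parsed
        # Repair rung: try each allowed fixer and record which one worked.
        for name, repair in repairs:
            parsed = try_parse(repair(raw))
            if parsed is not None:
                metrics[f"repaired:{name}"] += 1
                return parsed
        metrics["unrepairable"] += 1
        # Retry rung: tell the model its previous output was invalid.
        note = "\n\nYour previous output was invalid JSON. Return only the JSON object."
    # Fail closed: raise instead of returning a partial or guessed object.
    raise StructuredOutputError(f"no valid JSON after {max_retries + 1} attempts")
```

Schema validation would run on the returned dict before anything downstream sees it; this sketch covers only the parse-level rungs.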
Markdown fences and preamble stripping
Models love to wrap JSON in ```json fences because training data is full of tutorials. Production parsers should strip common patterns before parse: leading whitespace, fenced blocks, trailing explanation after the closing brace.
Keep an allowlist of repair heuristics and log when each fires. If fence stripping hits 40% of traffic, tighten the system prompt and enable constrained decoding rather than relying on regex forever.
Never strip aggressively on security-sensitive pipelines without validation. Stripping should expose JSON, not invent it.
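One way to strip wrappers without inventing JSON is to locate the first object and let the standard parser decide where it ends. The sketch below uses `json.JSONDecoder.raw_decode`, which parses a single value and returns the index where it stopped. The function name and heuristic labels are illustrative, and the brace search is naive: prose that itself contains `{` before the object will make the parse fail rather than be repaired.

```python
import json
import re
from typing import Any, List, Tuple

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
FENCE_RE = re.compile(FENCE + r"[a-zA-Z]*\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(raw: str) -> Tuple[Any, List[str]]:
    """Return (parsed_object, heuristics_fired). Raises ValueError if nothing parses."""
    fired: List[str] = []
    text = raw

    fence = FENCE_RE.search(text)
    if fence:
        fired.append("fence")
        text = fence.group(1)  # keep only the fenced body

    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    if text[:start].strip():
        fired.append("preamble")

    # raw_decode parses exactly one value starting at `start`; trailing prose is
    # ignored rather than parsed. JSONDecodeError is a subclass of ValueError.
    value, end = json.JSONDecoder().raw_decode(text, start)
    if text[end:].strip():
        fired.append("trailing_text")
    return value, fired
```

The returned list of fired heuristics is what you would send to metrics. The parsed value still goes through schema validation afterward.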
Enums, dates, and numeric fields
Free-form strings look valid in JSON but fail business rules. Use enums in schema for small closed sets. For dates, specify ISO-8601 in the prompt and validate with a real date parser afterward. Models confuse US vs EU date order constantly.
Numeric fields need ranges: probabilities in [0,1], counts non-negative. Schema validation catches outliers before they hit SQL. For floats, decide rounding rules in code, not in the model.
Nullable fields should be explicit in both the schema and the few-shot examples. Ambiguity between null, "", and missing keys breaks downstream TypeScript and Python models alike.
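A sketch of these field-level checks with the standard library. The field names (`sentiment`, `due_date`, `confidence`, `note`) and the three-decimal rounding are assumptions made for the example, not a recommended contract.

```python
from datetime import date
from enum import Enum


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNKNOWN = "unknown"  # explicit sentinel for "the model was unsure"


def check_fields(data: dict) -> dict:
    """Validate one parsed object; raise ValueError on any rule violation."""
    for key in ("sentiment", "due_date", "confidence", "note"):
        if key not in data:
            raise ValueError(f"missing key: {key} (use null, do not omit)")

    sentiment = Sentiment(data["sentiment"])  # ValueError if not an enum member

    if not isinstance(data["due_date"], str):
        raise ValueError("due_date must be an ISO-8601 string")
    # A real date parser rejects 03/04/2024-style strings instead of guessing the order.
    due = date.fromisoformat(data["due_date"])

    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")

    note = data["note"]
    if note is not None and (not isinstance(note, str) or note == ""):
        raise ValueError("note must be null or a non-empty string")

    return {
        "sentiment": sentiment,
        "due_date": due,
        "confidence": round(float(confidence), 3),  # rounding decided in code
        "note": note,
    }
```

Rejecting `""` for `note` is one possible policy. What matters is that null, empty string, and missing key each have one documented meaning that the parser enforces.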
Streaming and partial JSON
Streaming UX wants tokens early; parsers want complete objects. Common compromise: stream prose to users but buffer JSON until validation passes, or use newline-delimited JSON one record at a time.
Partial JSON during stream is not safe to act on. Wait for parse success before triggering side effects. Show a spinner, not half a transaction.
Some APIs stream only after schema-locked generation completes internally. Read provider docs before designing UX around incremental JSON fields.
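For the newline-delimited option, a small buffer is enough: accumulate chunks, and only hand a record downstream once a full line has arrived and parsed. This is a sketch that assumes the stream arrives as an iterable of text chunks; `iter_ndjson_records` is an illustrative name, not part of any provider SDK.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson_records(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per complete line; never yield partial JSON."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A record is complete only once its terminating newline has arrived.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)  # raises on a malformed record
    # Leftover text is either a final record without a newline or a truncated one;
    # json.loads raises on truncation, so no side effect fires on half a record.
    if buffer.strip():
        yield json.loads(buffer)
```

Splitting on `\n` is safe here because a newline inside a JSON string must be escaped, so a raw newline can only be a record boundary.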
Versioning output schemas
Treat JSON field sets like API versions. Adding a required field is a breaking change for mobile clients and data pipelines. Prefer optional new fields with defaults, or bump a schema version the parser checks explicitly.
Document migration in the system prompt when old and new shapes coexist briefly. Models can learn dual formats from examples, but parsers should accept only the version they implement.
Store schema version in logs alongside model and prompt hash. Analytics on parse failures often correlate with an undeployed schema mismatch, not model quality.
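A sketch of an explicit version check plus the log fields mentioned above. The `schema_version` key, the version number, the `priority` default, and the twelve-character hash prefix are all assumptions made for illustration.

```python
import hashlib
import json

SUPPORTED_SCHEMA_VERSION = 2  # hypothetical: the only version this parser implements


class SchemaVersionError(Exception):
    """Raised when output declares a version this parser does not implement."""


def parse_versioned(raw: str) -> dict:
    data = json.loads(raw)
    found = data.get("schema_version") if isinstance(data, dict) else None
    if found != SUPPORTED_SCHEMA_VERSION:
        # Accept only the implemented version; do not guess at other shapes.
        raise SchemaVersionError(
            f"expected schema_version {SUPPORTED_SCHEMA_VERSION}, got {found!r}"
        )
    # Hypothetical optional field added in v2: absent means the default.
    data.setdefault("priority", "normal")
    return data


def log_context(model: str, prompt: str, schema_version: int) -> dict:
    """Fields to attach to every parse success or failure log line."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    return {"model": model, "prompt_hash": prompt_hash, "schema_version": schema_version}
```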
Human review queue
When parse or validation fails after retries, route output to a human queue instead of blocking the user silently. Show the raw model text to reviewers with schema errors highlighted. Those samples become gold for prompt fixes and few-shot additions.
Track failure taxonomy over time: fences, truncation, wrong types, hallucinated keys. Product and eng priorities become obvious when 60% of failures are trailing commas vs wrong enums.
Pair the queue with automatic clustering on failure type so PMs see trends, not individual JSON blobs. The goal is a feedback loop from production misparses to schema and prompt changes with owners and dates.
Design schemas models can hit
Prefer flat objects over deep nesting. Use enums instead of free-form strings when you can. Split arrays of complex items into separate steps if needed. Match field names to natural language in the task ("refund_reason" next to instructions about refunds).
Avoid asking for huge arrays in one shot unless you have token budget and evals proving it works. Streaming partial JSON is possible but complicates parsers. Many teams generate row-by-row for large tables.
For arrays of objects, cap max items in schema and prompt ("at most five items"). Models happily emit fifty-element lists that blow output limits. Server-side truncation of arrays is safer than hoping the model self-limits.
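A sketch of capping on both sides: `maxItems` in the schema you send, and a server-side truncation that reports when it fired. The `items` key and the cap of five are placeholders.

```python
from typing import Tuple

MAX_ITEMS = 5  # hypothetical cap; mirror it in the prompt ("at most five items")

# Schema fragment for the array field: maxItems states the same cap declaratively.
ITEMS_SCHEMA = {
    "type": "array",
    "maxItems": MAX_ITEMS,
    "items": {"type": "object"},
}


def cap_items(data: dict, key: str = "items", max_items: int = MAX_ITEMS) -> Tuple[dict, bool]:
    """Return a copy of data with the array truncated, plus whether truncation happened."""
    items = data.get(key)
    if not isinstance(items, list):
        raise ValueError(f"{key} must be an array")
    was_truncated = len(items) > max_items
    capped = {**data, key: items[:max_items]}  # copy; the input dict is not mutated
    return capped, was_truncated
```

Logging `was_truncated` tells you how often the model ignores the cap, which is a signal to tighten the prompt or split the task.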
Interop with typed client code
Generate parsers from the same JSON Schema you send to the API. Drift between prompt schema and Pydantic/Zod models causes green evals in notebooks and red production when someone adds a field on one side only.
Fail closed in client code: if validation fails, do not coerce types silently. Silent coercion hides model regressions until finance or compliance notices wrong numbers.
Keep a golden file of valid and invalid model outputs in tests. Validators should pass one and reject the other deterministically. That suite runs in CI when schema or prompt changes.
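A golden suite can be as small as a list of (raw output, expected verdict) pairs. In this sketch the cases are inline rather than loaded from a file, and `is_valid` is a stand-in for your real validator; both are assumptions made to keep the example self-contained.

```python
import json
from typing import Callable, List, Sequence, Tuple

FENCE = "`" * 3  # three backticks

# Each case: (raw model output, should the validator accept it?)
GOLDEN_CASES: List[Tuple[str, bool]] = [
    ('{"status": "approved"}', True),
    ('{"status": "APPROVED"}', False),                        # wrong enum casing
    ('{"status": "approved", "extra": 1}', False),            # hallucinated key
    (FENCE + 'json\n{"status": "approved"}\n' + FENCE, False),  # fenced, unstripped
    ('{"status": "approved"', False),                         # truncated
]


def is_valid(raw: str) -> bool:
    """Stand-in validator: exactly one key, with a value from a closed set."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"status"}:
        return False
    status = data["status"]
    return isinstance(status, str) and status in {"approved", "denied"}


def golden_mismatches(
    validator: Callable[[str], bool], cases: Sequence[Tuple[str, bool]]
) -> List[str]:
    """Return the raw outputs where the validator disagrees with the golden verdict."""
    return [raw for raw, expected in cases if validator(raw) != expected]
```

In CI, assert that `golden_mismatches(is_valid, GOLDEN_CASES)` is empty. A validator that accepts everything fails on every negative case, which is the point of keeping invalid samples in the suite.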
When multiple services consume the same model output, publish the schema as a shared package. One team's optional field is another team's breaking change if parsers are not coordinated.
Prefer explicit unknown or sentinel enum values over inventing new strings when the model is unsure. Parsers can branch on sentinels; they cannot recover from surprise free text in a typed field.
Record which constraint mode was active (prompt-only, JSON mode, schema-locked) in logs so parse-rate regressions map to the right knob.
Never eval() model JSON. Treat output as untrusted input. It can contain strings that break out of SQL, shell, or HTML contexts downstream.
Latency of retries. A 2% parse failure rate with one retry adds roughly 2% to average latency and cost, and doubles both on the requests that retry. Monitor tail latency (p95 and p99), not just success rate.
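The arithmetic behind that estimate, as a sketch. It assumes at most one retry, that a retry costs and takes as long as the first attempt, and that failures are independent of request size; the 1.5 second latency and the unit cost are made-up inputs.

```python
from typing import Tuple


def retry_overhead(
    failure_rate: float, base_latency_s: float, base_cost: float
) -> Tuple[float, float, float]:
    """Return (mean latency, mean cost, latency of a retried request).

    Assumes one retry at most, with the same latency and cost as the first call.
    """
    # Every request pays once; the failing fraction pays a second time.
    mean_latency = base_latency_s * (1 + failure_rate)
    mean_cost = base_cost * (1 + failure_rate)
    retried_latency = 2 * base_latency_s
    return mean_latency, mean_cost, retried_latency


mean_latency, mean_cost, retried_latency = retry_overhead(0.02, 1.5, 1.0)
```

With these inputs the mean moves from 1.5 s to about 1.53 s, while each retried request takes 3.0 s. The average barely changes; the affected users see double.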
Contract tests. Snapshot expected schema versions. Mobile clients and ETL jobs depend on stable keys. Treat schema changes like API semver.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- Why is "return JSON" alone insufficient for production?
- What is the difference between JSON mode and schema-locked structured output?
- When would you use grammar-based constrained decoding vs a provider native schema API?
- Name three common ways structured generation fails.
- What belongs in a validation and recovery ladder?
- How should you shape schemas to improve reliability?