Lesson 05

Output validation and guardrails

Guardrails sit on the path between model output and the rest of your system. Some block bad text. Some enforce shape. Some pause for a human before money moves. This lesson covers the stack and where each layer actually helps.

The one idea

Treat all LLM output as untrusted user input. Validate before it hits a browser, a database, a shell, or the next model turn. Guardrails are not a product category; they are validation gates on every outbound edge.

Inbound vs outbound

Teams say "guardrails" and mean one box in a diagram. In production you need gates in both directions.

Inbound (before the model or before re-entry to context): user messages, uploads, retrieved chunks, tool results. Scan for injection, PII you should not process, and malware in attachments.

Outbound (after the model, before side effects): completions shown to users, structured JSON passed to code, tool call arguments, HTML rendered in a browser.

OWASP LLM05:2025 Improper Output Handling is the outbound story: XSS when raw model HTML hits a page, SQL injection when model text concatenates into queries, SSRF when URLs are followed blindly.

Validate on the way in and on the way out. Tool results re-enter the loop as inbound data.

Policy layers: NeMo Guardrails and Llama Guard

Rules and vendor moderation APIs cover part of the stack. Production teams also use dedicated policy frameworks that sit between user input, model calls, and tool execution.

NVIDIA NeMo Guardrails expresses policies in Colang: topical rails (stay on subject), moderation rails (block toxic content), and tool rails (which tools may fire given dialog state). The runtime intercepts turns before and after the model, so policy is code-adjacent rather than a paragraph buried in the system prompt. Good fit when you need auditable, versioned rules and multi-step dialog constraints.

Llama Guard (Meta) and similar safety classifiers score input/output for policy violations (violence, PII echo, jailbreak patterns). Run them as a lightweight model call on inbound and outbound text. They complement NeMo or homegrown rules; they do not replace schema validation on tool arguments.

Guardrails AI focuses on structured output: validators attached to Pydantic-style schemas so completions must pass shape and content checks before your app sees them.

Architecture pattern: rules (fast, deterministic) → classifier rail (Llama Guard or vendor moderation) → schema validation on structured/tool output → HITL on irreversible actions. Pick layers by blast radius, not vendor marketing.
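The layer ordering above can be sketched as a single gate function. This is a minimal illustration, not a reference design: `check_outbound`, the tool names in `IRREVERSIBLE_TOOLS`, the injection regex, and the 0.8 threshold are all assumptions, and `classifier` and `approve` are stand-in callables for whatever classifier rail and review UI you actually run.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    stage: str   # which layer made the decision
    reason: str

# Hypothetical rule and tool names, for illustration only.
INJECTION_MARKER = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
IRREVERSIBLE_TOOLS = {"send_email", "charge_card", "delete_data"}

def check_outbound(
    tool_name: str,
    raw_args: str,
    classifier: Callable[[str], float],
    approve: Callable[[str], bool],
    threshold: float = 0.8,
) -> Verdict:
    # 1. Rules: cheap and deterministic, so they run first.
    if len(raw_args) > 10_000:
        return Verdict(False, "rules", "payload too long")
    if INJECTION_MARKER.search(raw_args):
        return Verdict(False, "rules", "known injection marker")
    # 2. Classifier rail: a score from a safety model or moderation endpoint.
    score = classifier(raw_args)
    if score >= threshold:
        return Verdict(False, "classifier", f"score {score:.2f}")
    # 3. Schema: the arguments must at least be a JSON object.
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return Verdict(False, "schema", f"invalid JSON: {exc.msg}")
    if not isinstance(args, dict):
        return Verdict(False, "schema", "arguments must be a JSON object")
    # 4. HITL: only irreversible tools reach a human.
    if tool_name in IRREVERSIBLE_TOOLS and not approve(tool_name):
        return Verdict(False, "hitl", "human reviewer declined")
    return Verdict(True, "passed", "all gates passed")
```

Returning the stage alongside the decision keeps the most expensive layer (the human) at the end and makes it obvious in logs which layer stopped a call.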

Rules, schemas, and classifiers

Rule-based guards are fast and auditable: blocklists, regex for known injection markers, max length, allowed character sets for tool arguments. They miss novel attacks but catch stupid ones cheaply.
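A rule-based guard for a single tool argument might look like the sketch below. The length limit, the character set, and the blocklist entries are illustrative assumptions; real values depend on what your tools accept.

```python
import re

# All three limits are assumptions chosen for the example.
MAX_ARG_LENGTH = 256
SAFE_ARG = re.compile(r"[A-Za-z0-9_./-]+")
BLOCKLIST = ("[system]", "ignore previous instructions")

def check_argument(value: str) -> list[str]:
    """Return the names of every rule the value violates (empty list = pass)."""
    violations = []
    if len(value) > MAX_ARG_LENGTH:
        violations.append("too_long")
    if not SAFE_ARG.fullmatch(value):
        violations.append("disallowed_characters")
    lowered = value.lower()
    for marker in BLOCKLIST:
        if marker in lowered:
            violations.append(f"blocklisted:{marker}")
    return violations
```

Returning rule names instead of a bare boolean keeps the guard auditable: the caller can log exactly which rule fired.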

Schema validation applies to structured output. If the model must return JSON matching a schema, reject anything that does not parse or that violates the declared types. Do not "fix" bad JSON by passing it through eval() or a permissive deserializer. CVE-2025-68664 (LangGrinch) shows the failure mode: reserved keys inside serialized tool output caused the deserializer to instantiate arbitrary LangChain objects.
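A strict parser can be sketched with the standard library alone. The schema here (`ticket_id`, `summary`, `priority`) is hypothetical; in practice Pydantic or a JSON Schema validator would do the same job with less code. The point is that unknown keys are rejected rather than ignored, and nothing is ever evaluated or instantiated from the payload.

```python
import json

# Hypothetical schema for illustration.
EXPECTED_FIELDS = {"ticket_id": int, "summary": str, "priority": str}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

class OutputRejected(ValueError):
    """Raised when model output fails validation."""

def parse_model_output(raw: str) -> dict:
    try:
        data = json.loads(raw)  # json.loads only builds plain dicts, lists, and scalars
    except json.JSONDecodeError as exc:
        raise OutputRejected(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise OutputRejected("top level must be an object")
    unknown = set(data) - set(EXPECTED_FIELDS)
    if unknown:
        raise OutputRejected(f"unknown fields: {sorted(unknown)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            raise OutputRejected(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise OutputRejected(f"wrong type for {name}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise OutputRejected("priority not in allowed set")
    return data
```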

Classifiers score text for injection, toxicity, PII, or off-topic content. Small models or vendor moderation endpoints run in tens of milliseconds. Use them where rules are too brittle and human review is too slow.

Stack them: rules first (cheap), classifier second (broader), human third (high stakes). A single vendor "safety API" is not a complete outbound strategy for app-specific tools.

Tool-result sanitization

Tool output is inbound data for the next turn. Strip role-switch prefixes ([SYSTEM], fake assistant/user dialog). Truncate oversized payloads so one HTML page cannot fill the context window. Redact patterns that look like secrets before re-prompting.
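The three steps above (strip role markers, redact, truncate) can be sketched as one function. The regexes and the 4,000-character cap are assumptions for illustration; the secret patterns in particular only catch a few well-known shapes and are not a complete scanner.

```python
import re

MAX_TOOL_RESULT_CHARS = 4_000  # assumed budget per tool result

# Lines that open with a fake role marker such as "[SYSTEM]" or "assistant:".
ROLE_PREFIX = re.compile(
    r"^[ \t]*(?:\[(?:system|assistant|user)\]|(?:system|assistant|user)[ \t]*:)[ \t]*",
    re.IGNORECASE | re.MULTILINE,
)

# A few illustrative secret shapes; extend for your own environment.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"\b(?:api[_-]?key|token|password)\s*[=:]\s*\S+", re.IGNORECASE),
]

def sanitize_tool_result(text: str) -> str:
    text = ROLE_PREFIX.sub("", text)
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    # Truncate last so the cap applies to what actually re-enters the context.
    if len(text) > MAX_TOOL_RESULT_CHARS:
        text = text[:MAX_TOOL_RESULT_CHARS] + "\n[truncated]"
    return text
```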

For HTML and Markdown destined for a browser, run through DOMPurify, bleach, or equivalent. Never render raw model HTML.

For SQL, parameterized queries only. The model proposes parameters; your code binds them.
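A minimal sketch of that split, using `sqlite3` from the standard library. The `orders` table, its columns, and the 1 to 100 limit range are hypothetical; the query text is fixed in code and the model only supplies values.

```python
import sqlite3

def find_orders(conn: sqlite3.Connection, customer_email: str, limit: int) -> list[tuple]:
    # Validate model-proposed values before binding them.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 100:
        raise ValueError("limit out of range")
    # The SQL string is a constant; "?" placeholders are bound by the driver.
    cursor = conn.execute(
        "SELECT id, total FROM orders WHERE customer_email = ? ORDER BY id LIMIT ?",
        (customer_email, limit),
    )
    return cursor.fetchall()
```

A classic payload such as `x' OR '1'='1` is then compared as a literal string and matches no rows.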

For shell, parse into argv arrays, validate against allowlists, use shell=False. String interpolation into bash is how injection becomes RCE even when the model "only wanted to run tests."
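One way to sketch this with the standard library: `shlex.split` produces the argv list, an allowlist checks the program and every argument, and `subprocess.run` executes without a shell. The allowlist contents are hypothetical.

```python
import shlex
import subprocess

# Hypothetical allowlist: program name -> arguments it may receive.
ALLOWED_COMMANDS = {
    "pytest": {"-q", "-x"},
    "git": {"status", "diff", "--stat"},
}

def build_argv(command_line: str) -> list[str]:
    argv = shlex.split(command_line)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    program, *args = argv
    if program not in ALLOWED_COMMANDS:
        raise ValueError(f"program not allowlisted: {program}")
    for arg in args:
        if arg not in ALLOWED_COMMANDS[program]:
            raise ValueError(f"argument not allowlisted: {arg}")
    return argv

def run_validated(command_line: str) -> subprocess.CompletedProcess:
    argv = build_argv(command_line)
    # shell=False: argv goes straight to the program, so ";", "|" and "$()" are inert.
    return subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=60, check=False
    )
```

With this shape, `pytest -q; rm -rf /` fails validation because `-q;` is not an allowlisted argument, and nothing is executed.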

When schema validation fails, teams sometimes ask the model to "fix" its JSON in a follow-up turn. That works for benign formatting errors. It also trains the loop to accept progressively malformed structures and encourages deserializers that try to be helpful. Safer pattern: reject, log, return a structured error to the orchestrator, optionally retry once with a smaller schema. Never instantiate objects from attacker-influenced dicts with magic keys unless you have a strict allowlist of types.
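The reject, log, retry-once pattern could be sketched as below. `generate` stands in for your model call and `validate` for your schema check; both are hypothetical callables, and `validate` is assumed to raise `ValueError` on failure.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("guardrails.schema")

def call_with_single_retry(
    generate: Callable[[str], str],
    validate: Callable[[str], dict],
    full_prompt: str,
    reduced_prompt: str,
) -> dict:
    """Try the full schema once, then a smaller one once; never loop further."""
    last_error = ""
    for attempt, prompt in enumerate((full_prompt, reduced_prompt), start=1):
        raw = generate(prompt)
        try:
            return {"ok": True, "data": validate(raw), "attempts": attempt}
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            logger.warning("schema rejection on attempt %d: %s", attempt, exc)
            last_error = str(exc)
    # Structured error for the orchestrator; the bad output is never repaired.
    return {
        "ok": False,
        "error": "schema_validation_failed",
        "detail": last_error,
        "attempts": 2,
    }
```

The rejected text is never fed back for repair, so the loop cannot drift toward accepting malformed structures.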

Human-in-the-loop done right

Human review is the last gate on irreversible actions: send email, charge card, delete data, force-push main, grant admin role, connect to a new external domain.

The failure mode is approval fatigue. If every file read triggers a modal, users click yes without reading. Anthropic reported sandboxing cut permission prompts by 84% internally by auto-allowing low-risk actions inside the sandbox and reserving prompts for genuine judgment calls.

Good HITL design: auto-allow reads inside workspace, gate writes and network egress, escalate after repeated denials, show diff previews for destructive ops. Bad HITL design: ask the same question every turn with no memory of prior approvals.
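The good-design list can be sketched as a small policy object. The class name, the action labels, and the threshold of three denials are assumptions; the path check is lexical only and does not resolve symlinks, which a real sandbox would also need to handle.

```python
import posixpath
from dataclasses import dataclass, field
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    ESCALATE = "escalate"

@dataclass
class ApprovalPolicy:
    workspace: str          # sandbox root, assumed not to be "/"
    max_denials: int = 3    # assumed escalation threshold
    denials: int = 0
    approved: set = field(default_factory=set)

    def decide(self, action: str, target: str) -> Decision:
        # Reads inside the workspace never prompt.
        if action == "read" and self._inside_workspace(target):
            return Decision.ALLOW
        # Remember prior approvals instead of asking again every turn.
        if (action, target) in self.approved:
            return Decision.ALLOW
        # Repeated denials suggest something is wrong; hand off to a stronger review.
        if self.denials >= self.max_denials:
            return Decision.ESCALATE
        return Decision.ASK

    def record(self, action: str, target: str, approved: bool) -> None:
        if approved:
            self.approved.add((action, target))
        else:
            self.denials += 1

    def _inside_workspace(self, target: str) -> bool:
        root = posixpath.normpath(self.workspace)
        path = posixpath.normpath(posixpath.join(root, target))
        return path == root or path.startswith(root + "/")
```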

In November 2025 Anthropic disclosed an espionage campaign, first detected in mid-September, in which attackers manipulated Claude Code toward infiltration targets. HITL and outbound monitoring were part of the reason the campaign was detected. Treat HITL as a security control, not UX friction to minimize away.

Engineering reality

Guardrails add latency and cost. A 50 ms classifier on every tool result, at 20 tool calls per task, adds a full second per task when the checks run sequentially. Budget guardrail spend like inference: run expensive checks only on paths that can cause harm. Static answers in a FAQ bot need light moderation; an agent with production DB write access needs schema validation plus HITL plus outbound DLP on every mutation tool call.
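The arithmetic is simple enough to keep in a helper next to your latency budget. The second figure is a hypothetical comparison: it assumes only 5 of the 20 calls are mutations that need the classifier.

```python
def guardrail_latency_ms(check_ms: float, checked_calls: int) -> float:
    """Added latency when each checked call waits on one sequential check."""
    return check_ms * checked_calls

every_call = guardrail_latency_ms(50, 20)      # the case in the text: 1000 ms
mutations_only = guardrail_latency_ms(50, 5)   # hypothetical: 5 mutating calls, 250 ms
```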

Agent firewalls and DLP on egress

Open-source projects like Pipelock treat the agent as an untrusted process behind a proxy: outbound HTTP passes through SSRF checks, prompt-injection scanning on inbound tool payloads, and DLP on bodies leaving the network zone. The agent never holds both secrets and raw network in the same address space.

Commercial DLP integrations do similar work at the corporate proxy layer. They help when agents run on employee laptops with VPN egress. They do not replace schema validation on tool args; they catch encoded secrets in HTTP bodies your app forgot to scan.

Wire egress scanning on mutation tools first: send_email, create_issue, http_post. Read-only tools still matter for SSRF, but exfiltration volume usually exits through writes.
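An egress check in front of a mutation tool might be sketched as follows. This is not how Pipelock or any commercial DLP product works internally; it is a standard-library illustration with assumed rules. It inspects IP literals only and does not resolve DNS, so a hostname that resolves to a private address would need a second check at connection time.

```python
import ipaddress
import re
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumed policy

# Illustrative secret shapes only.
SECRET_SHAPES = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def egress_violations(url: str, body: str) -> list[str]:
    violations = []
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        violations.append("scheme_not_allowed")
    host = parts.hostname or ""
    if not host:
        violations.append("missing_host")
    else:
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            address = None  # a hostname, not an IP literal
        # Loopback, link-local (cloud metadata), and private ranges are not global.
        if address is not None and not address.is_global:
            violations.append("non_public_address")
        if host == "localhost" or host.endswith(".internal"):
            violations.append("internal_hostname")
    if any(pattern.search(body) for pattern in SECRET_SHAPES):
        violations.append("secret_in_body")
    return violations
```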

Product guardrails vs security guardrails

Product guardrails shape experience: stay on brand, refuse legal advice, do not discuss competitors. Security guardrails stop harm: block SSRF URLs, prevent credential echo, stop SQL concatenation.

Confusing the two leads to green checkmarks on brand tone while shell access remains wide open. Separate configs, separate owners, separate evals. Your marketing team should not be the authority on whether bash is allowlisted.

Measuring guardrail effectiveness

Track false positive rate and false negative rate separately. A guardrail that blocks 100% of attacks but rejects 30% of legitimate support chats will get disabled in production.

Build a labeled set of real user prompts plus synthetic attacks. Run it nightly against staging. Report precision and recall per guardrail stage the same way you report retrieval recall for RAG.
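Per-stage metrics over the labeled set can be computed in a few lines. The input format, a list of `(is_attack, was_blocked)` pairs, is an assumption about how your eval harness records results.

```python
def guardrail_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """results holds (is_attack, was_blocked) for every example in the labeled set."""
    tp = sum(1 for attack, blocked in results if attack and blocked)
    fn = sum(1 for attack, blocked in results if attack and not blocked)
    fp = sum(1 for attack, blocked in results if not attack and blocked)
    tn = sum(1 for attack, blocked in results if not attack and not blocked)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Reported separately because each one fails in a different way.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }
```

Run it once per guardrail stage so a regression in the classifier is not hidden by a strong rules layer.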

When a guardrail fires, log the rule or classifier score, not only "blocked." You need that detail to tune without flying blind.

Structured output in agent loops

Agents that emit JSON for tool selection need strict parsers: Pydantic, JSON Schema, protobuf. Reject unknown fields. Do not instantiate arbitrary classes from parsed dicts.

When the model picks a tool, validate that the tool exists in the registry, that the arguments match the schema, and that path arguments pass filesystem checks before execution. Treat the model's tool call as a suggestion that is rejected by default until it passes every check. Tool argument validation is where LLM05 meets LLM06: malformed or over-privileged args must never reach the adapter. Cross-link: Agents L02: Tool use and function calling for registry design.
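A default-deny validator could be sketched like this. The registry contents, the `ToolSpec` shape, and the `/workspace` root are hypothetical; the path check is lexical and does not resolve symlinks.

```python
import posixpath
from dataclasses import dataclass

WORKSPACE = "/workspace"  # hypothetical sandbox root

@dataclass(frozen=True)
class ToolSpec:
    params: dict             # argument name -> required Python type
    path_params: tuple = ()  # arguments that must stay inside WORKSPACE

# Hypothetical registry for illustration.
REGISTRY = {
    "read_file": ToolSpec({"path": str}, ("path",)),
    "search": ToolSpec({"query": str, "limit": int}),
}

class ToolCallRejected(ValueError):
    pass

def validate_tool_call(name: str, args: dict) -> dict:
    spec = REGISTRY.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name}")
    # Exact match: no missing arguments and no unknown ones.
    if not isinstance(args, dict) or set(args) != set(spec.params):
        raise ToolCallRejected("argument names do not match the schema")
    for key, expected in spec.params.items():
        value = args[key]
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ToolCallRejected(f"wrong type for {key}")
    checked = dict(args)
    for key in spec.path_params:
        resolved = posixpath.normpath(posixpath.join(WORKSPACE, args[key]))
        if not resolved.startswith(WORKSPACE + "/"):
            raise ToolCallRejected(f"path escapes workspace: {key}")
        checked[key] = resolved
    return checked  # only this validated copy should reach the adapter
```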

Streaming partial JSON is convenient for UX but complicates validation. Buffer until the schema validates, or time out and abort. Half a tool call is worse than a slow one.
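A buffering sketch, assuming the stream arrives as an iterable of text chunks and that a tool call is always a JSON object. The function name, the 5-second timeout, and the size cap are assumptions. The deadline is checked between chunks only, so a stream that stalls without yielding would need a timeout at the transport layer as well.

```python
import json
import time
from typing import Callable, Iterable

def collect_tool_call(
    chunks: Iterable[str],
    timeout_s: float = 5.0,
    max_chars: int = 20_000,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    deadline = clock() + timeout_s
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) > max_chars:
            raise ValueError("tool call exceeds size limit")
        if clock() > deadline:
            raise TimeoutError("tool call did not complete in time")
        try:
            parsed = json.loads(buffer)
        except json.JSONDecodeError:
            continue  # still partial; keep buffering, execute nothing
        if isinstance(parsed, dict):
            return parsed  # complete object; run schema validation on it next
    raise ValueError("stream ended before a complete JSON object arrived")
```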

Rendering model output safely

Chat UIs often render Markdown with raw HTML passthrough for convenience. Model-generated links and images become XSS vectors when a user opens a shared conversation link.

Sanitize HTML, force rel="noopener noreferrer" on external links, and block javascript: URLs. If you embed model output in emails or PDFs, run the same pipeline. OWASP LLM05 exists because teams skipped this on "internal only" tools that later went external.
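For the link-handling part specifically, a scheme allowlist is more robust than a `javascript:` blocklist. The sketch below uses only the standard library and covers a single link; it does not replace a full HTML sanitizer such as DOMPurify or bleach. The set of safe schemes is an assumption, and relative URLs are dropped because they carry no scheme.

```python
from html import escape
from urllib.parse import urlsplit

SAFE_SCHEMES = {"http", "https", "mailto"}  # assumed allowlist

def render_link(text: str, url: str) -> str:
    cleaned = url.strip()
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        scheme = ""  # malformed URL: treat as unsafe
    if scheme not in SAFE_SCHEMES:
        # Drop the link entirely and keep only the escaped label.
        return escape(text)
    href = escape(cleaned, quote=True)
    return f'<a href="{href}" rel="noopener noreferrer">{escape(text)}</a>'
```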

For code blocks, syntax highlighting is fine; executing snippets client-side is not. Keep run buttons wired to sandboxed backends, not eval in the browser.

When to block vs when to warn

Hard blocks fit irreversible or high-volume exfil paths: shell, bulk email, payment APIs. Soft warnings fit ambiguous policy edges: mentioning a competitor, mild profanity in internal bots.

Make the mode explicit in config per route. Customer-facing support should fail closed on PII echo; internal codegen can warn and continue with audit log.
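Per-route modes can live in a plain lookup table. The route names, finding names, and the `enforce` helper are hypothetical; the one deliberate choice is that any route or finding missing from the table falls back to blocking.

```python
from enum import Enum
from typing import Optional

class Mode(Enum):
    BLOCK = "block"
    WARN = "warn"

# Hypothetical routes and findings, for illustration only.
ROUTE_MODES = {
    ("customer_support", "pii_echo"): Mode.BLOCK,
    ("internal_codegen", "pii_echo"): Mode.WARN,
    ("internal_codegen", "shell_injection"): Mode.BLOCK,
}

def enforce(route: str, finding: str, text: str, audit_log: list) -> Optional[str]:
    """Return the text if it may continue, or None if it is blocked."""
    mode = ROUTE_MODES.get((route, finding), Mode.BLOCK)  # unknown pairs fail closed
    audit_log.append({"route": route, "finding": finding, "mode": mode.value})
    if mode is Mode.BLOCK:
        return None
    return text
```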

Users tolerate friction when the message explains risk ("this request would send 400 ticket rows to an external URL"). Generic "content policy violation" erodes trust and invites override attempts.

Log every soft warning with user ID and session ID so abuse patterns are visible even when nothing was blocked.

Review warning rates monthly. A spike often means a new feature bypassed outbound validation, not that users got rowdier.

Treat guardrail bypass tickets as security incidents until proven otherwise. Document who can temporarily disable a rule and for how long.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the difference between inbound and outbound guardrails?
Why treat tool results as inbound data?
What went wrong in insecure deserialization CVE patterns like LangGrinch?
What causes approval fatigue and how do you avoid it?
When do you use rules vs classifiers vs human review?

Quick check

Using too much training data
Sending LLM output directly to interpreters or browsers without validation
Exceeding the context window

They may contain indirect injection payloads for the next model turn
They always exceed token limits
Models cannot parse JSON from tools

Users see one prompt per month
Users approve hundreds of low-risk prompts and stop reading
Every prompt includes a detailed diff preview

Pass through eval() to coerce types
Reject the output, log the failure, retry or abort under orchestrator control
Skip validation if the JSON looks mostly correct