
Lesson 06

Security in harness design

This lesson pulls the course together. The harness is the policy enforcement point for your agent: every layer assumes the others will fail, and you prove it with tests that attack the real tool surface, not just the chat box.

The one idea

Defense in depth is not a checklist of products. It is a posture: deny by default, validate every crossing between trust zones, assume injection succeeds, and make the blast radius boring when it does.

The harness as enforcement point

Everything outside the weights is harness: the loop, tool registry, memory, permission prompts, logging, sandbox launcher, MCP client. The model proposes. The harness disposes.

When you read CVE writeups for Langflow, LangChain, Cursor, or MCP SDK bugs, the pattern repeats. Untrusted input reached an execution primitive because the wrapper assumed good intent. Your design goal is the opposite: explicit validation at every junction, fail closed, no implicit trust between subsystems.

The prior five lessons each added one layer. Threat modeling named the surface. Injection defenses handled untrusted text. Sandboxing and least privilege capped actions. Data controls kept secrets out of context. Output validation blocked bad payloads on the way out. This lesson wires them into one architecture you can ship and test.

Each ring assumes inner rings fail. Outer layers limit blast radius when injection or validation slips through.

Path traversal and filesystem tools

When a model calls read_file("../../etc/passwd"), the harness is the last gate before the OS. Passing paths straight to open() without normalization is how agents read system files in security demos and in real misconfigs.

Correct pattern: resolve the requested path with os.path.realpath(), which returns an absolute path with ../ segments collapsed and symlinks followed (abspath() alone collapses ../ but leaves symlinks untouched). Then verify the result starts with an allowed base directory plus trailing separator, so /workspace/project-evil does not match /workspace/project.

Decode URL-encoded and double-encoded paths before resolving them, reject null bytes, and follow symlink chains to their final target. Application validation should be layer two; OS sandbox boundaries should be layer one, so both must fail for an escape.
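The validation steps above can be sketched in Python as follows. This is a minimal illustration, not a hardened implementation: the names `resolve_inside` and `PathDenied` are hypothetical, it assumes a POSIX filesystem, and it assumes tool arguments may arrive URL-encoded (if your filenames legitimately contain `%`, drop the decoding step). It uses os.path.commonpath() for the containment check, which compares whole path components and so has the same effect as the trailing-separator prefix test.

```python
import os
from urllib.parse import unquote


class PathDenied(Exception):
    """Raised when a requested path falls outside the allowed base."""


def resolve_inside(base_dir: str, requested: str, max_decode_rounds: int = 3) -> str:
    """Return the canonical path for `requested`, or raise PathDenied."""
    # Decode repeatedly so double-encoded input such as %252e%252e collapses.
    decoded = requested
    for _ in range(max_decode_rounds):
        step = unquote(decoded)
        if step == decoded:
            break
        decoded = step
    else:
        # Still changing after the last round: fail closed.
        raise PathDenied("path still changes after repeated URL decoding")

    if "\x00" in decoded:
        raise PathDenied("null byte in path")

    base = os.path.realpath(base_dir)
    # realpath collapses ../ and follows symlinks. An absolute `decoded`
    # replaces `base` in the join and is then rejected by the check below.
    candidate = os.path.realpath(os.path.join(base, decoded))

    # Component-wise comparison: /workspace/project-evil does not
    # match /workspace/project.
    if os.path.commonpath([base, candidate]) != base:
        raise PathDenied(f"{requested!r} resolves outside {base_dir!r}")
    return candidate
```

A tool adapter would call `resolve_inside("/workspace/project", model_supplied_path)` and open only the returned path. There is still a window between the check and the open in which a symlink can be swapped, which is one reason the OS sandbox remains layer one.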

Similar bugs appear outside file tools: adapters that fetch file:// URLs without checking resolved paths have enabled arbitrary file read from chat attachments.
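For fetch-style adapters, a first gate is a scheme allowlist plus a check on IP literals. The sketch below is an assumption-laden example: `check_fetch_url` and `ALLOWED_SCHEMES` are hypothetical names, and it assumes the tool only ever needs HTTPS. It deliberately does not resolve hostnames, so it is incomplete on its own.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumption: the fetch tool only needs HTTPS


def check_fetch_url(url: str) -> str:
    """Return the URL unchanged if it passes basic checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        # Rejects file://, ftp://, gopher://, and scheme-less input.
        raise ValueError(f"scheme {parts.scheme!r} is not allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than an IP literal; not checked in this sketch.
        return url
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        raise ValueError(f"address {ip} is not publicly routable")
    return url
```

In a real harness the hostname also has to be resolved and the resolved address checked at connect time, because a public-looking name can point at an internal or metadata address.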

MCP as supply chain

Model Context Protocol connects agents to tools through standardized servers. Wide adoption means wide attack surface: poisoned tool descriptions, rug pulls after install, command injection in STDIO launchers, registries with no authentication.

Tool poisoning. Descriptions are model-facing context. A malicious description can instruct the model to exfiltrate data when the tool is invoked. MCPTox tested many real servers; most agents were vulnerable to description manipulation.

Rug pulls. Clean descriptions at install time, malicious updates later. Hash descriptions at registration and alert on drift between sessions.

STDIO RCE. In CVE-2026-30623 in Anthropic's MCP SDK and related issues, untrusted strings reached shell launchers. Pin server versions, sandbox MCP processes, and never pass unvalidated URLs or commands from config into exec.

Registry trust. Public MCP marketplaces include malicious packages alongside legitimate ones. Treat third-party servers like third-party npm deps in 2015: vet authors, pin hashes, run isolated, monitor outbound calls.

Audit and hash tool descriptions at connect time. Strip injection markers before showing descriptions to the model. Sandbox MCP server processes with filesystem and network restrictions. In production, do not let end users attach arbitrary servers. Network-isolate each server to only the APIs it needs. Log tool call patterns and flag anomalies (new tools, unusual argument shapes, spike in fetch volume). Review description diffs on upgrade the same way you review lockfile changes.
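Hashing descriptions at connect time and checking for drift can be sketched as below. The function names are hypothetical, and the sketch assumes each tool is a plain dict carrying the `name`, `description`, and `inputSchema` fields that an MCP tool listing returns. Where the pinned digests live (a lockfile, a database) is left to the harness.

```python
import hashlib
import json


def description_digest(tool: dict) -> str:
    """Hash the model-facing fields of one tool definition."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "inputSchema": tool.get("inputSchema", {}),
        },
        sort_keys=True,             # stable key order so equal content hashes equally
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(pinned: dict[str, str], current_tools: list[dict]) -> dict[str, list[str]]:
    """Compare the tools a server reports now against digests pinned at registration."""
    current = {t["name"]: description_digest(t) for t in current_tools}
    return {
        "changed": sorted(n for n in current if n in pinned and current[n] != pinned[n]),
        "added": sorted(n for n in current if n not in pinned),
        "removed": sorted(n for n in pinned if n not in current),
    }
```

A fail-closed harness would refuse to expose any tool listed under `changed` or `added` until a human has reviewed the diff.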

Multi-agent trust boundaries

When agents delegate to subagents, instructions and tool results cross trust tiers. An orchestrator compromised by injection should not automatically inherit production credentials in downstream workers.

Claude Code's model is a useful reference: operator trust (files you control) above user trust (foreground chat) above agent trust (MCP and inter-agent messages). Agent-tier content cannot escalate without explicit human re-authorization.

Google's A2A Protocol pushes signed AgentCards and consent metadata for sensitive data sharing. Research on multi-agent security stresses the same rule: no agent treats another's output as instructions without validation.

Capability separation by role: the web-research subagent should not hold payment API keys. Audit trails should record which orchestrator spawned which subagent with which permission set.

MCP still lacks a standard policy layer for deep delegation chains. That gap is open as of this writing; design explicit permission stripping in your harness rather than waiting for the protocol to solve it.
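One way to express explicit permission stripping is to grant a subagent only the intersection of what its parent holds and what its role is allowed to hold, and to record every spawn. The sketch below is illustrative only: `PermissionSet`, `SpawnRecord`, `ROLE_CEILINGS`, and `delegate` are hypothetical names, and the roles and credential labels are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionSet:
    tools: frozenset[str] = frozenset()
    credentials: frozenset[str] = frozenset()


@dataclass(frozen=True)
class SpawnRecord:
    parent_id: str
    child_id: str
    role: str
    granted: PermissionSet


# Hypothetical per-role ceilings: the most a role may ever hold.
ROLE_CEILINGS = {
    "web_research": PermissionSet(
        tools=frozenset({"web_search", "web_fetch"}),
        credentials=frozenset({"search_api_key"}),
    ),
}


def delegate(parent_id: str, parent: PermissionSet, child_id: str,
             role: str, audit_log: list[SpawnRecord]) -> PermissionSet:
    """Grant the child no more than the parent has and no more than the role allows."""
    ceiling = ROLE_CEILINGS.get(role)
    if ceiling is None:
        raise PermissionError(f"unknown role {role!r}")  # deny by default
    granted = PermissionSet(
        tools=parent.tools & ceiling.tools,
        credentials=parent.credentials & ceiling.credentials,
    )
    audit_log.append(SpawnRecord(parent_id, child_id, role, granted))
    return granted
```

With this shape, an orchestrator that holds a payment key cannot hand it to a web-research worker even if injected text tells it to, because the key is not in that role's ceiling.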

Red-team and regression testing

Security controls rot when the tool surface changes every sprint. Automate attacks against staging the same way you automate functional evals. Wire the same golden sets into CI so harness changes cannot merge without passing safety probes.

Garak (NVIDIA): probe library for injection, leakage, jailbreaks. Good baseline scan on model-facing strings.

PyRIT (Microsoft): multi-turn attack orchestration, crescendo patterns. Finds issues single-shot probes miss.

AgentDojo: realistic agent tasks plus adversarial cases. Benchmark used by defenses like MELON and Progent. Use it to measure utility vs robustness tradeoffs when you tighten permissions.

Promptfoo and similar tools integrate OWASP agentic presets into CI. Run on every harness change that touches tools, not only on model swaps.

CI regression for safety. Maintain a frozen set of injection strings, indirect HTML fixtures, and path-traversal tool calls alongside your functional golden set. Run them in the same pipeline as prompt evals. Block merges when attack success rate (ASR) rises or utility drops past threshold. The workflow for fixtures, flaky handling, and thresholds is spelled out in Eval L02: Task evals and golden sets and Eval L05: Regression testing and CI for prompts and harnesses.
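The merge gate described above reduces to two ratios and two thresholds. The sketch below is a minimal example with hypothetical names (`SuiteResult`, `gate`) and placeholder default thresholds. How the counts are produced, whether by AgentDojo, your own fixtures, or another runner, is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    attacks_total: int
    attacks_succeeded: int
    tasks_total: int
    tasks_passed: int

    def __post_init__(self) -> None:
        if self.attacks_total <= 0 or self.tasks_total <= 0:
            raise ValueError("suite must contain at least one attack and one task")

    @property
    def asr(self) -> float:
        """Attack success rate: fraction of attack cases that achieved their goal."""
        return self.attacks_succeeded / self.attacks_total

    @property
    def utility(self) -> float:
        """Fraction of benign tasks completed successfully."""
        return self.tasks_passed / self.tasks_total


def gate(baseline: SuiteResult, candidate: SuiteResult,
         max_asr_increase: float = 0.0, max_utility_drop: float = 0.03) -> list[str]:
    """Return the reasons to block the merge; an empty list means pass."""
    failures = []
    if candidate.asr - baseline.asr > max_asr_increase:
        failures.append(f"ASR rose from {baseline.asr:.3f} to {candidate.asr:.3f}")
    if baseline.utility - candidate.utility > max_utility_drop:
        failures.append(f"utility fell from {baseline.utility:.3f} to {candidate.utility:.3f}")
    return failures
```

A CI step would load the baseline from the main branch's last run, compute the candidate on the same frozen set, and exit non-zero when `gate` returns anything.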

Unit tests for path validation and schema rejection are boring and essential. Red-team fuzzing finds novel strings; unit tests keep regressions from shipping when someone refactors the tool adapter.
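A boring regression test for schema rejection might look like the sketch below. The tool spec and `validate_args` are hypothetical stand-ins for whatever validator your adapter already uses. The point is that the rejected cases are pinned in a test file that survives refactors.

```python
import unittest

# Hypothetical schema for a read_file tool: argument name -> expected type.
READ_FILE_SPEC = {"path": str, "max_bytes": int}
REQUIRED = {"path"}


def validate_args(args: dict) -> dict:
    """Fail closed: reject unknown keys, missing required keys, and wrong types."""
    unknown = set(args) - set(READ_FILE_SPEC)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    missing = REQUIRED - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    for key, value in args.items():
        # No field in this spec is a bool, and bool would otherwise pass as int.
        if isinstance(value, bool) or not isinstance(value, READ_FILE_SPEC[key]):
            raise ValueError(f"wrong type for {key!r}")
    return args


class SchemaRejectionTests(unittest.TestCase):
    def test_accepts_valid_call(self):
        self.assertEqual(validate_args({"path": "a.txt"}), {"path": "a.txt"})

    def test_rejects_unknown_argument(self):
        with self.assertRaises(ValueError):
            validate_args({"path": "a.txt", "shell": "rm -rf /"})

    def test_rejects_missing_path(self):
        with self.assertRaises(ValueError):
            validate_args({"max_bytes": 10})

    def test_rejects_wrong_type(self):
        with self.assertRaises(ValueError):
            validate_args({"path": ["a.txt"]})
        with self.assertRaises(ValueError):
            validate_args({"path": "a.txt", "max_bytes": True})
```

Saved in a test module, these run with `python -m unittest` in the same pipeline as the golden set.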

Engineering reality

Attack success rate is the metric that matters, not "we ran Garak." Track ASR before and after each change on a fixed AgentDojo subset. If tightening filesystem allowlists drops task success 40% while ASR falls 2%, you picked the wrong allowlist. If ASR hits zero and utility drops 3%, ship it. Security and product tradeoffs should be measured, not argued in abstract.

Observability as a security control

You cannot prove controls work if you cannot see tool calls. Structured traces should include: tool name, argument hash (not always raw args if sensitive), allow/deny decision, sandbox ID, permission tier of the instruction source, and latency per layer.
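A trace record with those fields could be shaped as in the sketch below. Field names, tier labels, and the JSON-lines output are assumptions for illustration. Arguments are stored only as a SHA-256 digest of their canonical JSON form.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolTrace:
    tool: str
    args_sha256: str
    decision: str                      # "allow" or "deny"
    sandbox_id: str
    source_tier: str                   # e.g. "operator", "user", "agent"
    layer_latency_ms: dict[str, float]
    ts: float


def hash_args(args: dict) -> str:
    """Digest of the arguments; key order does not affect the result."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def trace_line(tool: str, args: dict, decision: str, sandbox_id: str,
               source_tier: str, layer_latency_ms: dict[str, float]) -> str:
    """Serialize one tool call as a JSON line without the raw arguments."""
    record = ToolTrace(
        tool=tool,
        args_sha256=hash_args(args),
        decision=decision,
        sandbox_id=sandbox_id,
        source_tier=source_tier,
        layer_latency_ms=layer_latency_ms,
        ts=time.time(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```

A plain hash lets you correlate repeated calls without storing arguments, but short or guessable values can be recovered by brute force. If that matters, a keyed hash (HMAC with a secret held outside the log store) is the usual substitute.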

Alert on anomalies: first use of a tool in a session, spike in fetch bytes, repeated path denials, escalation to HITL followed by immediate user approval (possible social engineering). Logs must be scrubbed, but metadata is often enough to detect abuse without storing secrets.
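Two of those signals, first use of a tool and repeated path denials, need nothing more than per-session counters over metadata. The sketch below uses a hypothetical `SessionMonitor` class with a placeholder threshold, and keeps state in memory, which a real deployment would replace with its tracing backend.

```python
from collections import defaultdict


class SessionMonitor:
    """Raise alerts from tool-call metadata only; no arguments are inspected."""

    def __init__(self, denial_threshold: int = 3):
        self.denial_threshold = denial_threshold
        self._seen_tools: dict[str, set[str]] = defaultdict(set)
        self._path_denials: dict[str, int] = defaultdict(int)

    def observe(self, session_id: str, tool: str, decision: str, reason: str = "") -> list[str]:
        """Record one tool call and return any alerts it triggers."""
        alerts = []
        if tool not in self._seen_tools[session_id]:
            self._seen_tools[session_id].add(tool)
            alerts.append(f"first_use:{tool}")
        if decision == "deny" and reason == "path":
            self._path_denials[session_id] += 1
            # Fire once, when the count reaches the threshold.
            if self._path_denials[session_id] == self.denial_threshold:
                alerts.append("repeated_path_denials")
        return alerts
```

Fetch-volume spikes and suspiciously fast HITL approvals follow the same pattern with byte counters and timestamps in place of the denial count.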

Retention policy is part of security. Keeping full prompts for seven years in a shared SIEM recreates the leakage problem you scrubbed at write time. Align log TTL with product privacy promises.

Posture summary

Deny by default on permissions. Validate on every trust boundary crossing. Keep secrets out of context via proxies. Sandbox code and MCP servers. Sanitize tool results inbound and model output outbound. Gate irreversible actions with meaningful human review. Test the harness, not just the prompt.

That is the course arc. The model will keep getting smarter. The CVE list will keep growing in frameworks, not weights. Your job is to build a harness that stays boring when the model misbehaves.

Before you ship checklist

Run through this on every agent feature that touches tools or external data:

Threat model row exists with owner and mitigations for top three failure modes.
Filesystem and network paths are allowlisted; shell runs inside a sandbox.
Secrets never appear in context; APIs called via scoped proxy.
Tool results sanitized inbound; model output validated outbound.
MCP servers pinned, hashed, and sandboxed; no arbitrary user-supplied servers in prod.
HITL gates on irreversible actions; low-risk ops auto-approved inside sandbox.
AgentDojo or equivalent ASR measured in CI on harness changes.

Missing one item is not always a launch blocker. Missing half the list on a production shell tool is.

## LLM agent security review

### Threat model
- [ ] Top 3 assets and entry points documented with OWASP 2025 IDs
- [ ] Worst-case harness action identified ("if model obeys last message")

### Input and context
- [ ] User, retrieved, and tool content treated as untrusted data
- [ ] Indirect injection fixtures in eval set (RAG, email, web HTML)

### Tools and permissions
- [ ] Least privilege per tool; no god-mode service account
- [ ] Path validation (realpath + base prefix) on filesystem tools
- [ ] SSRF controls on fetch (block metadata IPs, no file://)
- [ ] MCP servers pinned, description hashes checked on upgrade

### Output and egress
- [ ] Schema validation on tool args and structured output
- [ ] HTML/Markdown sanitized before render
- [ ] DLP or egress scan on mutation tools

### Testing and ops
- [ ] Golden set + injection probes in CI (Eval L05 pattern)
- [ ] ASR and utility tracked on staging before prod deploy
- [ ] Structured traces with allow/deny decisions (no raw secrets in logs)

Where to go next

This course focused on security posture for LLM apps and agents. Pair it with RAG L06 for retrieval-specific failure modes (LLM08), Agents L03 for harness architecture, and Eval L05 when you wire traces and regression suites into CI.

The harness blog series on rahulkashyap.dev goes deeper on implementation war stories and CVE timelines. Use it as reference when you need incident names and vendor writeups; use these lessons when you need the mental model.

Security is not a gate at the end of a sprint. It is a property of how the loop, tools, and permissions are shaped from the first prototype.

Keeping the posture current

Schedule a quarterly harness review: new tools, new MCP servers, changed model behavior, updated CVEs in dependencies. Fifteen minutes with a diff of the tool registry catches most drift.

Assign an owner. "Everyone owns security" means nobody updates the allowlist when a PM adds a shortcut.

Checkpoint

You have finished the course if you can answer these from memory:

Why is the harness the policy enforcement point, not the model?
What is the correct path validation pattern for filesystem tools?
Name two MCP supply chain attacks and one control for each.
How should subagents differ in trust and credentials?
Which tests prove harness security vs model politeness?
