Lesson 03

Sandboxing and least-privilege tool design

Prompt defenses stop some bad ideas from reaching the model. Sandboxing and least privilege stop the bad ideas that get through from touching production systems. This lesson covers isolation technology choices and how to scope tools so an injected agent cannot do more than the task allows.

The one idea

A sandbox limits what code can technically do. Least privilege limits what it is allowed to do. You need both: isolation without tight scopes still lets an agent exfiltrate through approved channels; tight scopes without isolation still fail when a bug in application-level validation lets a request through.

Why Docker alone is not enough

Docker is where most teams start, and it is a reasonable start. Containers are fast, familiar, and already in CI pipelines. The structural problem is that containers share the host kernel. A container escape does not trap the attacker inside a box. It lands them on your host with whatever access that host has.

Agent frameworks have shipped critical RCE bugs where code execution endpoints ran with no sandbox at all. Docker would have helped. Docker plus seccomp and AppArmor helps more. For high-stakes agents (arbitrary code, customer data, network access), kernel-shared isolation is often not enough.

Docker is fast and weak on isolation. Firecracker microVMs add a dedicated kernel per sandbox. Pick based on blast radius, not blog posts.

Isolation options in plain terms

Firecracker microVMs spin up a lightweight VM with its own kernel via KVM. An escape from inside reaches only that VM's guest kernel, not your host. Boot time can be as low as ~125ms, and pre-warmed pools keep sandbox starts close to that. E2B uses this in production for code sandboxes at scale.

gVisor intercepts syscalls in a user-space kernel called the Sentry. There is no separate VM kernel, so overhead is lower than Firecracker and isolation is slightly weaker. Google runs it on internal workloads.

Kata Containers wraps VM isolation in OCI-compatible container tooling. Useful when you want VM-grade separation without rewriting a Docker-based deploy pipeline.

Process sandboxes without full containers. Claude Code on Linux uses bubblewrap for filesystem and network boundaries plus seccomp BPF filters. On macOS it uses Seatbelt. OpenAI Codex disables network by default in cloud runs and uses seccomp plus Landlock on Linux locally. Anthropic open-sourced a sandbox-runtime library for arbitrary processes and MCP servers.

WASM and language-level isolation. For untrusted user code (notebooks, plugins, formula evaluators), WebAssembly runtimes and embedded interpreters (Pyodide, Deno with permission flags) offer lighter isolation than full VMs when the blast radius is bounded to a single function call. They complement, not replace, container sandboxes for agents with filesystem and network access.

SSRF and URL-fetch tools

Server-side request forgery (SSRF) is classic web security wearing an agent costume. The model asks a fetch tool to load a URL. An injected prompt steers it toward http://169.254.169.254/ (cloud metadata), internal admin panels, or file:// paths that your adapter resolves without checks.

Controls belong in the harness, not the prompt:

Block link-local, private RFC1918, and metadata IP ranges at the proxy.
Resolve hostnames before connect; reject redirects to forbidden targets.
Disable file:// and custom schemes unless explicitly allowlisted.
Cap response size and content types; HTML from arbitrary URLs is untrusted input for injection (lesson 02).
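The checks in this list can be sketched as a harness-side validator. This is a minimal illustration, not a complete SSRF defense: the function names and the injectable `resolve` parameter are assumptions made for the example, and only the scheme and IP-range checks are shown.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # file:// and custom schemes are rejected


def is_forbidden_ip(addr: str) -> bool:
    """True for loopback, link-local (incl. 169.254.169.254), RFC1918, etc."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before classifying.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def check_url(url: str, resolve=socket.getaddrinfo) -> list:
    """Resolve the host first and return the vetted IPs, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    addrs = sorted({info[4][0] for info in infos})
    for addr in addrs:
        if is_forbidden_ip(addr):
            raise ValueError(f"{parts.hostname} resolves to forbidden {addr}")
    return addrs
```

For the check to hold, the fetcher should connect to one of the returned IPs rather than resolving the hostname a second time, and it should run `check_url` again on every redirect hop instead of letting the HTTP client follow redirects on its own.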

Mediated fetchers (agent has no socket; proxy fetches) are the right default for production agents. OWASP maps unchecked fetch behavior to LLM06: Excessive Agency when tools can reach networks the task should not touch.

MCP permissions and trust tiers

Model Context Protocol servers extend the tool surface through a standard wire format. Security is only as good as how your harness registers and runs them. See Agents L03 for where MCP sits in the loop; this lesson covers permissions.

Registration-time controls. Hash tool names and descriptions at connect time. Pin server versions. Do not let end users attach arbitrary servers in production without the same vetting you apply to npm dependencies.
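One way to hash tool definitions at connect time is sketched below. The dict shape (`name`, `description`, `inputSchema`) follows the fields an MCP server advertises for a tool; including the schema in the hash, and the helper names, are assumptions of this example rather than anything a specific harness mandates.

```python
import hashlib
import json


def tool_fingerprint(tool: dict) -> str:
    """SHA-256 over a canonical JSON form of the fields the model will see."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool["description"],
            "inputSchema": tool.get("inputSchema", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(pinned: dict, advertised: list) -> list:
    """Names of tools that are new or whose fingerprint no longer matches."""
    return [
        tool["name"]
        for tool in advertised
        if pinned.get(tool["name"]) != tool_fingerprint(tool)
    ]
```

Store the pinned fingerprints when an operator reviews the server. On each later connect, a non-empty result from `changed_tools` is a reason to block the server until someone reviews the change.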

Runtime permissions. Run each MCP server in a process sandbox with its own filesystem and network allowlist. A filesystem MCP server should not share the orchestrator's credentials or home directory. STDIO launchers must never pass unvalidated config strings to shell (exec injection CVEs have hit official SDKs).
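A minimal sketch of a STDIO launcher that never hands config strings to a shell. The allowlisted launcher names and the argument character set are hypothetical; a real harness would tune both and still wrap the child in a process sandbox.

```python
import re
import subprocess

ALLOWED_LAUNCHERS = {"node", "python3"}  # hypothetical operator-approved binaries
SAFE_ARG = re.compile(r"[A-Za-z0-9_./=:@-]+")  # no spaces, quotes, or metacharacters


def build_argv(command: str, args: list) -> list:
    """Validate a server config entry and return an argv list."""
    if command not in ALLOWED_LAUNCHERS:
        raise ValueError(f"launcher not allowlisted: {command!r}")
    for arg in args:
        if not SAFE_ARG.fullmatch(arg):
            raise ValueError(f"argument rejected: {arg!r}")
    return [command, *args]


def launch(command: str, args: list, env: dict) -> subprocess.Popen:
    """Start the server with an explicit argv and a minimal environment."""
    argv = build_argv(command, args)
    return subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=env,      # pass only what the server needs, not the orchestrator's env
        shell=False,  # argv goes to exec directly; nothing is shell-parsed
    )
```

Passing an explicit `env` keeps the orchestrator's credentials out of the child process; it usually still needs a `PATH` entry so the launcher can be found.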

Trust tiers. Operator-configured servers (you installed and reviewed) rank above user-supplied servers (marketplace one-click). Agent-generated server URLs rank lowest: treat as untrusted until re-authorized. Lesson 06 expands supply-chain attacks (tool poisoning, rug pulls).

Least privilege for tools

Classic least privilege assumes you design access up front. Agents break that assumption: the same binary might read docs in one session and write to a repo in the next. Static broad permissions are either too open (worst case planning) or too tight (tasks fail randomly).

AWS's Generative AI Well-Architected Lens (GENSEC05-BP01) says to enforce least privilege per tool, per dataset, and per action, not through one god-mode service account for the whole agent.

Filesystem tools. Explicit allowlists for read and write paths. Deny lists override allows for sensitive subtrees (SSH keys, config dirs). Empty allowWrite means no writes anywhere.

Shell tools. Command allowlists, not full bash. Even empty allowlists can miss shell builtins that bypass restrictions (Cursor CVE-2026-22708). Run shell inside a process sandbox too.

Web fetch tools. Domain allowlists. Stronger pattern: mediated fetcher where the agent process has no network socket. A proxy holds network access, scans outbound payloads for secrets, and blocks non-approved destinations.
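The outbound scan in a mediated fetcher can start as a few regular expressions run over each request body before the proxy forwards it. The patterns below are illustrative assumptions, not an exhaustive secret detector.

```python
import re

# Illustrative patterns only; real deployments use a maintained ruleset.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]{20,}"),
}


def scan_outbound(payload: str) -> list:
    """Return the names of every pattern found in an outbound payload."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(payload)]


def should_forward(payload: str) -> bool:
    """The proxy forwards a request only when the scan finds nothing."""
    return not scan_outbound(payload)
```

Pattern scans catch accidental and low-effort leaks. An attacker who encodes or splits the secret will get past them, which is why the destination allowlist stays in place alongside the scan.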

MiniScope (UC Berkeley) treats finding minimal tool permissions as an optimization problem over permission hierarchies, with roughly 1-6% latency overhead versus vanilla tool calling. Progent enforces JSON-defined privilege policies at runtime and can update them as agent state changes. On AgentDojo and related benchmarks it drove attack success to 0% while preserving task utility. LLMs can draft Progent policies, which lowers the authoring burden. These are not drop-in libraries for every stack yet, but they show where production harnesses are heading: dynamic, minimal permissions instead of one static tool bundle.

Runtime-scoped credentials

The emerging pattern for high-stakes deploys: an identity gateway evaluates the task, mints a short-lived token scoped to exactly the APIs needed, and expires it when the task ends. Closer to OAuth delegated access than a permanent API key in environment variables.
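To make the idea concrete, here is a toy gateway that mints an HMAC-signed token carrying a task ID, a scope list, and an expiry. The token format and function names are assumptions for illustration; a production gateway would more likely issue standard OAuth or JWT credentials.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def mint_token(key: bytes, task_id: str, scopes: list, ttl_s: int,
               now: Optional[float] = None) -> str:
    """Issue a token limited to the given scopes that expires after ttl_s seconds."""
    now = time.time() if now is None else now
    claims = {"task": task_id, "scopes": sorted(scopes), "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def authorize(key: bytes, token: str, scope: str,
              now: Optional[float] = None) -> bool:
    """Allow a call only if the signature is valid, unexpired, and in scope."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return now < claims["exp"] and scope in claims["scopes"]
```

The agent only ever holds the token. The signing key stays in the gateway, so an injected agent cannot widen its own scopes or extend the expiry.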

Claude Code's permission modes illustrate the idea at harness level. Different subagents hold different tool access. A security review subagent may reach sensitive endpoints; a formatting subagent may not. Operator trust (settings files you control) ranks above user trust (foreground chat) above agent trust (instructions from MCP or other agents). Agent-tier content cannot escalate without explicit human re-authorization.
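The ordering of tiers can be written down as a comparison. This is a sketch of the rule as stated here, not Claude Code's implementation; the enum and function names are made up for the example.

```python
from enum import IntEnum


class Trust(IntEnum):
    AGENT = 0     # instructions from MCP servers or other agents
    USER = 1      # foreground chat
    OPERATOR = 2  # settings files the operator controls


def may_invoke(source: Trust, required: Trust, human_reauthorized: bool = False) -> bool:
    """A request passes if its source tier meets the tool's required tier.

    Lower-tier content can only cross the line with explicit human
    re-authorization; it can never raise its own tier.
    """
    if source >= required:
        return True
    return human_reauthorized
```

The harness, not the model, assigns `source` based on where the instruction came from, so text inside a tool result cannot claim operator trust for itself.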

What production sandboxes look like in practice

Claude Code's October 2025 sandbox shipped with two enforced boundaries on supported platforms: filesystem allowlists and network egress through an approved proxy. Internal metrics reported an 84% drop in permission prompts because low-risk reads and writes inside the sandbox no longer nag the user. That is the design pattern: automate the boring safe stuff, surface prompts only when the action crosses a real trust boundary.

OpenAI Codex cloud disables network during task execution by default. Local runs combine Seatbelt on macOS with seccomp and Landlock on Linux. Neither replaces your app's business logic filters, but they shrink the syscall surface when injection wins.

When picking technology, ask: what happens after escape? Shared-kernel container escape hits the host. gVisor escape hits Sentry. Firecracker escape hits a guest kernel. Match isolation tier to data classification of whatever lives inside the sandbox.

What sandboxes still cannot stop

A sandbox constrains syscalls and paths. It does not rewrite business logic.

If the agent legitimately has network access to three approved APIs, injection can still exfiltrate through those APIs by encoding data in request bodies the proxy does not inspect.

If secrets are mounted into the sandbox filesystem, reading them is inside the sandbox's allowed behavior.

If the user approves a destructive action because the harness asked on every trivial read, approval fatigue makes human gates useless.

Sandboxing is necessary. It is not sufficient. Pair it with scoped credentials, output scanning, and selective human review for irreversible actions.

Engineering reality

Cold start dominates sandbox economics. E2B pre-warms Firecracker VM pools to hit ~150ms starts; naive on-demand VM boot can blow your latency budget for interactive agents. Model sandbox choice as a product decision: internal codegen batch jobs tolerate 500ms spin-up; pair-programming agents do not. Measure p95 sandbox creation time the same way you measure model time-to-first-token.
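Measuring p95 creation time needs nothing more than a timer around the create call and a percentile over the samples. In this sketch `create` stands in for whatever call your sandbox provider exposes; the nearest-rank percentile is one common definition among several.

```python
import time


def time_creation(create, runs: int) -> list:
    """Call create() `runs` times and return each duration in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        create()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Run it against both a warm pool and a deliberately drained pool. The gap between the two p95 values is the latency you pay when the pool runs dry.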

Choosing a tier for your app

Use a simple decision tree. If the agent only calls read-only HTTP APIs with no customer data, Docker plus strict domain allowlists may be enough. If it runs user-supplied code or clones untrusted repos, move to gVisor or Firecracker. If it touches production databases or payment systems, combine microVM isolation with task-scoped tokens and human-in-the-loop (HITL) approval on writes.
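The same tree written as a function makes the tier reviewable in code. The field names are hypothetical, and the final fallback covers a case the tree above does not address, so it is flagged for manual review rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    touches_production_systems: bool  # production databases or payment systems
    runs_untrusted_code: bool         # user-supplied code or untrusted repos
    handles_customer_data: bool
    read_only_http_only: bool


def pick_tier(profile: AgentProfile) -> str:
    """Walk the checks from highest blast radius to lowest."""
    if profile.touches_production_systems:
        return "microVM + task-scoped tokens + HITL on writes"
    if profile.runs_untrusted_code:
        return "gVisor or Firecracker"
    if profile.read_only_http_only and not profile.handles_customer_data:
        return "Docker + strict domain allowlists"
    # Assumption: anything the tree does not cover gets a human decision.
    return "unclassified: review manually"
```

Checking the highest-risk condition first means an agent that both runs untrusted code and touches payments lands in the stricter tier.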

Document the tier in an architecture decision record. When product asks to "just add a bash tool for convenience," you can point to the tier and the blast radius instead of debating from scratch.

Re-evaluate when you add MCP servers, enable browsing, or widen filesystem paths. Sandbox scope is not set-and-forget.

Filesystem allowlists in code

Config shape matters as much as isolation tech. A typical pattern: explicit allowRead and allowWrite globs, with denyWrite overriding allows for sensitive subtrees like .ssh or config/. Empty write allowlist means read-only everywhere.
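A minimal sketch of that config shape and its checks. For brevity it matches directory prefixes rather than globs, and the class and function names are assumptions of the example. Paths are resolved before matching so that `..` segments and symlinks cannot step around a deny entry.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FsPolicy:
    allow_read: tuple = ()
    allow_write: tuple = ()   # empty means no writes anywhere
    deny_write: tuple = ()    # always wins over allow_write


def _inside(path: Path, roots: tuple) -> bool:
    """True if the resolved path sits at or under any of the given roots."""
    return any(path.is_relative_to(Path(root).resolve()) for root in roots)


def can_read(policy: FsPolicy, raw_path: str) -> bool:
    return _inside(Path(raw_path).resolve(), policy.allow_read)


def can_write(policy: FsPolicy, raw_path: str) -> bool:
    path = Path(raw_path).resolve()
    if _inside(path, policy.deny_write):
        return False  # deny overrides allow for sensitive subtrees
    return _inside(path, policy.allow_write)
```

A policy with `allow_write=("/workspace",)` and `deny_write=("/workspace/.ssh",)` permits writes to `/workspace/src/a.py`, refuses `/workspace/.ssh/id_rsa`, and also refuses `/workspace/src/../.ssh/id_rsa` because the path is resolved first.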

For bash tools, pair command allowlists with argument validators. Git and npm may be fine; curl with arbitrary URLs is not. Cursor CVE-2026-22708 showed shell builtins bypass empty command lists, which is why process-level sandboxes still matter when bash exists.
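Pairing a command allowlist with per-command argument validators might look like the sketch below. The allowed subcommands are illustrative assumptions, and the validated argv is meant to be executed without a shell, inside a process sandbox.

```python
import shlex

GIT_READ_ONLY = {"status", "diff", "log"}  # illustrative subset
NPM_ALLOWED = {"test", "ci"}               # illustrative subset


def _git_ok(args: list) -> bool:
    # Require a read-only subcommand and refuse option flags after it.
    return bool(args) and args[0] in GIT_READ_ONLY and not any(
        a.startswith("-") for a in args[1:]
    )


def _npm_ok(args: list) -> bool:
    return len(args) == 1 and args[0] in NPM_ALLOWED


VALIDATORS = {"git": _git_ok, "npm": _npm_ok}  # curl is deliberately absent


def parse_command(line: str) -> list:
    """Return a vetted argv list, or raise ValueError."""
    argv = shlex.split(line)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    validator = VALIDATORS.get(argv[0])
    if validator is None:
        raise ValueError(f"command not allowlisted: {argv[0]!r}")
    if not validator(argv[1:]):
        raise ValueError(f"arguments rejected for {argv[0]!r}")
    return argv
```

Because the result is an argv list handed to exec rather than a string handed to bash, tokens like `;` or `&&` never reach a shell parser. That removes one class of bypass, and the process sandbox is still needed for the ones the validators miss.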

Domain allowlists for fetch tools should use exact hostnames, not substring matches. api.github.com.evil.com is an old trick that still works on naive parsers.
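The difference between substring and exact matching fits in a few lines. The allowlist contents are an assumption for the example; `urlsplit` is from the Python standard library and returns the hostname lowercased, without port or userinfo.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = frozenset({"api.github.com"})  # example allowlist


def host_allowed_naive(url: str) -> bool:
    """Broken on purpose: substring matching accepts look-alike hosts."""
    return any(host in url for host in ALLOWED_HOSTS)


def host_allowed(url: str) -> bool:
    """Parse the URL and compare the exact hostname against the allowlist."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    return host.rstrip(".") in ALLOWED_HOSTS
```

`host_allowed_naive` accepts `https://api.github.com.evil.com/`, `https://api.github.com@evil.com/`, and `https://evil.com/?u=api.github.com`. `host_allowed` rejects all three because it compares the parsed hostname, not the raw string.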

Operational costs to plan for

Sandbox fleets need capacity planning like inference GPUs. Pre-warmed pools cost money idle; cold starts cost money in churned users. Track sandbox lifetime: long-lived sandboxes accumulate temp files and cached secrets if you are not careful.

Rotate sandboxes per task or per session for untrusted code paths. A single VM reused across customers without reset is a cross-tenant leak waiting to happen.

Tag sandbox metrics in the same dashboard as model latency so regressions in pool size show up before users time out.

Alert when pool exhaustion forces cold starts above your SLO threshold.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is Docker-only isolation weak for high-blast-radius agents?
Name two stronger isolation options and one tradeoff each.
What is a mediated fetcher and why split agent vs proxy zones?
Why do command allowlists fail for bash tools without a process sandbox?
What can a sandbox not prevent even when configured correctly?
