Lesson 04

Data leakage and PII in context

The context window is working memory. Anything you put there can resurface in the next completion, land in logs, or ride along to a vendor API. This lesson traces how secrets and PII get in, and how to keep them out without breaking the product.

The one idea

The agent should rarely need the secret itself. Credentials flow through a mediation layer that holds the value and returns only the result of an authenticated call. What never enters context cannot leak through context.

The context window is a liability

LLM apps feel stateless because each API call is a POST with a JSON body. But the body carries the whole conversation, retrieved chunks, tool results, and sometimes file contents. That bundle is the context window, and it is copied to GPUs, logged by observability tools, and stored in session caches.

Once a secret appears there, three bad things become likely: prompt injection exfiltrates it in the next turn, the model quotes it in a user-visible answer without "meaning to," or your tracing pipeline writes it to a log store with weak retention controls.

GitGuardian's 2026 secrets sprawl report found tens of millions of new hardcoded secrets on public GitHub, with AI-service credentials and MCP config files growing fast. Separate estimates put eventual exposure risk for secrets stored in LLM context very high when injection, logging, and hallucination are combined. Treat context as a broadcast channel, not a vault.

How sensitive data gets in

Coding agents ingest config by default. .env files, docker-compose YAML, CI logs, and error stack traces get pulled into context because they look relevant to the task. They often contain live tokens.

RAG indexes everything. Internal wikis, ticket exports, and Slack archives get chunked and embedded without redaction. A support bot retrieves another customer's PII because chunk boundaries split poorly and access control stopped at the index, not the chunk.

Tool results echo secrets. An API debug endpoint returns full headers including Authorization. A database tool returns rows with emails and phone numbers. All of it goes back into context for the next model call.

MCP and agent configs hardcode credentials. Thousands of MCP server configs in public repos contain API keys. The agent reads the config to know how to connect, which puts the key in context before the first tool call.

Logging and tracing. Frameworks that log full prompts and tool payloads copy context verbatim into Datadog, LangSmith, or S3. Incident response later searches those logs with broad permissions.

The fix for most of these paths is the secrets proxy pattern. The agent passes intent and a short-lived token. The proxy attaches the real credential and returns only the API response body.
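The proxy pattern described above can be sketched in a few lines. This is a minimal illustration, not a production design: `SecretsProxy`, `issue_token`, `call`, and `fake_upstream` are hypothetical names, the token store is an in-memory dict, and the upstream call is a stand-in for a real authenticated request.

```python
import secrets
import time
from typing import Callable, Dict, Tuple


class SecretsProxy:
    """Holds real credentials; the agent only sees task tokens and response bodies."""

    def __init__(self, credentials: Dict[str, str],
                 upstream: Callable[[str, str, str], str]) -> None:
        self._credentials = credentials          # never returned to the caller
        self._upstream = upstream                # hypothetical authenticated client
        self._tokens: Dict[str, Tuple[str, float]] = {}

    def issue_token(self, service: str, ttl_seconds: float = 60.0) -> str:
        """Mint a short-lived token scoped to one service; safe to place in context."""
        token = secrets.token_urlsafe(16)
        self._tokens[token] = (service, time.monotonic() + ttl_seconds)
        return token

    def call(self, token: str, request: str) -> str:
        """Resolve the token, attach the real credential, return only the body."""
        entry = self._tokens.get(token)
        if entry is None or time.monotonic() > entry[1]:
            raise PermissionError("unknown or expired task token")
        service = entry[0]
        return self._upstream(service, self._credentials[service], request)


def fake_upstream(service: str, credential: str, request: str) -> str:
    # Stand-in for a real authenticated HTTP call; uses the credential, never echoes it.
    if not credential:
        raise ValueError("missing credential")
    return f"{service} handled: {request}"


proxy = SecretsProxy({"billing": "sk-live-EXAMPLE"}, fake_upstream)
task_token = proxy.issue_token("billing")
body = proxy.call(task_token, "get invoice 42")
```

Only `task_token` and `body` would ever reach the context window. If either leaks, the damage is bounded by the token's scope and lifetime rather than by the real key.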

PII in retrieval and multi-tenant RAG

PII leakage is not always a Hollywood hack. Often it is retrieval returning the wrong user's record because metadata filters were missing or embeddings ignored ACLs.

Every chunk needs tenant ID, data classification, and expiry in metadata. Filter on those fields before vector search, not after. Post-filtering top-k results wastes slots and still leaks if the model summarizes "near misses."
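To make the ordering concrete, here is a small in-memory sketch of filtering before ranking. The `Chunk` fields, the `retrieve` function, and the toy two-dimensional embeddings are assumptions for illustration; a real vector store would express the same filter in its own query language.

```python
import math
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Chunk:
    text: str
    tenant_id: str
    classification: str
    expires_at: float          # Unix timestamp after which the chunk is stale
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks: List[Chunk], query: List[float], tenant_id: str,
             allowed: Set[str], now: float, k: int = 3) -> List[Chunk]:
    # Filter first: ineligible chunks never compete for a top-k slot.
    eligible = [
        c for c in chunks
        if c.tenant_id == tenant_id
        and c.classification in allowed
        and c.expires_at > now
    ]
    # Rank only what the caller is allowed to see.
    eligible.sort(key=lambda c: cosine(c.embedding, query), reverse=True)
    return eligible[:k]


chunks = [
    Chunk("A: order shipped", "tenant-a", "internal", 2000.0, [1.0, 0.0]),
    Chunk("B: refund issued", "tenant-b", "internal", 2000.0, [0.9, 0.1]),
    Chunk("A: stale note", "tenant-a", "internal", 500.0, [1.0, 0.0]),
    Chunk("A: HR record", "tenant-a", "restricted", 2000.0, [1.0, 0.0]),
]
hits = retrieve(chunks, [1.0, 0.0], "tenant-a", {"public", "internal"}, now=1000.0)
```

Here tenant B's chunk is a close semantic match, but it is dropped before similarity is ever computed, so it cannot occupy a slot or be summarized.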

Minimize fields in retrieved text. If the model only needs "order status," do not pass full billing addresses. Structured extraction calls can pull a single field from a document so the rest never enters context.
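Field minimization can be as simple as projecting a record onto an allowlist before it is rendered into the prompt. The `minimize` helper and the sample order record below are hypothetical.

```python
from typing import Any, Dict, Iterable


def minimize(record: Dict[str, Any], needed: Iterable[str]) -> Dict[str, Any]:
    """Keep only the fields the current task needs; the rest never enters context."""
    return {key: record[key] for key in needed if key in record}


order = {
    "order_id": "A-1001",
    "status": "shipped",
    "billing_address": "1 Example Street",
    "email": "pat@example.com",
}
context_snippet = minimize(order, ["order_id", "status"])
```

An allowlist fails safe: a new sensitive field added to the record later stays out of context until someone deliberately asks for it.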

Regulatory frames (GDPR, HIPAA) care about purpose limitation and retention. Logging full RAG context for debugging can violate both if logs live longer than the chat session.

Customer chats used as fine-tuning data without scrubbing become permanent memorization targets. Models can regurgitate phone numbers and emails seen often enough in training. Even if you never fine-tune, a vendor may train on your traffic by default and offer only an opt-out, so know what your DPAs say. For internal RL or preference tuning, run PII detection on every row before it enters a dataset. Redact, do not just hash, if the model still sees surrounding text that re-identifies the person.

Controls that work in production

Short-lived credentials. OAuth with narrow scopes, workload identity for cloud resources, Vault dynamic secrets that expire when the task ends.

No secrets in sandbox filesystems. Mounting .aws/credentials into an agent VM because it is convenient is how prompt injection becomes account takeover.

Log scrubbing before storage. Regex plus entropy detection for AWS keys (AKIA...), GitHub tokens (ghp_...), JWT-shaped strings. Tools like detect-secrets, truffleHog, and GitLeaks integrate into CI and log pipelines.
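A minimal sketch of the pattern-plus-entropy approach follows. The three regexes are approximations of the token shapes named above, and the 20-character minimum and 4.0 bits-per-character threshold are assumptions to tune against your own traffic. Dedicated scanners cover far more formats.

```python
import math
import re
from collections import Counter

# Approximate shapes only; real scanners ship much larger, maintained rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}
# Long runs of key-like characters are candidates for the entropy check.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())


def scrub(text: str, entropy_threshold: float = 4.0) -> str:
    # Pass 1: known formats get a labeled placeholder.
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)

    # Pass 2: unknown formats are caught by entropy alone.
    def mask(match: "re.Match[str]") -> str:
        token = match.group(0)
        if shannon_entropy(token) >= entropy_threshold:
            return "[REDACTED:high_entropy]"
        return token

    return CANDIDATE.sub(mask, text)


clean = scrub("key=AKIAIOSFODNN7EXAMPLE ok")
```

Known patterns run first so that the placeholder records which class of secret was found, which helps when triaging an incident.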

Outbound completion scanning. Scan model output before returning to the user or downstream systems. An agent that read a secret in turn two may paste it in turn five while answering something unrelated.
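An outbound gate can be sketched as a function that sits between the model and the caller. `gate_completion` and its two patterns are illustrative assumptions. A real deployment would share its rule set with the log scrubber and decide per finding whether to redact, block, or alert.

```python
import re
from typing import List, Tuple

# Approximate token shapes, for illustration only.
OUTBOUND_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def gate_completion(completion: str) -> Tuple[str, List[str]]:
    """Redact secret-shaped strings and report which classes were found."""
    findings: List[str] = []
    for name, pattern in OUTBOUND_PATTERNS.items():
        completion, count = pattern.subn(f"[REDACTED:{name}]", completion)
        if count:
            findings.append(name)
    return completion, findings


safe_text, findings = gate_completion(
    "Earlier we tried AKIAIOSFODNN7EXAMPLE and the call failed."
)
```

Returning the findings alongside the text lets the harness raise an alert and trigger rotation, since a secret that reached the output was already in context.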

RAG hygiene at ingest. Blocklist credential patterns during indexing. Tag chunks with sensitivity labels. Re-index when classification rules change.

Engineering reality

Scrubbing logs after the fact is cheaper until it is not. One missed pattern during an incident exports every session to a SIEM forever. Put scrubbing on the write path: transform spans before they leave the app process. Test with synthetic secrets in staging weekly. Entropy-only detectors false-positive on base64 blobs; combine pattern lists with allowlists for known benign formats (UUIDs, commit SHAs).
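One way to put scrubbing on the write path in Python is a `logging.Filter` attached to the handler, so records are rewritten before any exporter sees them. The `RedactingFilter` class, the `agent.trace` logger name, and the single combined regex are assumptions for this sketch.

```python
import io
import logging
import re

# Approximate token shapes; extend with your own pattern list.
SECRET = re.compile(r"\b(?:AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36})\b")


class RedactingFilter(logging.Filter):
    """Rewrite each record in-process, before the handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render args into the message first so secrets passed as args are caught.
        record.msg = SECRET.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True  # keep the record, now scrubbed


stream = io.StringIO()                      # stands in for a remote log sink
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("agent.trace")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False                    # avoid an unscrubbed copy via root handlers

logger.info("tool result: %s", "AKIAIOSFODNN7EXAMPLE")
written = stream.getvalue()
```

The weekly staging test mentioned above amounts to logging a synthetic secret like this one and asserting that the sink never contains it.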

Redaction at ingest vs at query time

Some teams redact PII when documents enter the index: replace emails with tokens, strip SSN patterns, hash customer IDs. That reduces retrieval risk but can break answers that legitimately need a masked identifier ("your order #TOKEN_4821").
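The ingest-time option might look like the sketch below. The `CUST-<digits>` identifier format, the placeholder shapes, and the use of a keyed HMAC instead of a bare hash are all assumptions. The keyed variant is chosen here because short numeric IDs are easy to reverse from an unkeyed hash.

```python
import hashlib
import hmac
import re
from typing import Dict

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CUSTOMER_ID = re.compile(r"\bCUST-\d+\b")   # hypothetical ID format


def redact_for_index(text: str, key: bytes) -> str:
    """Replace PII with placeholders before the text is chunked and embedded."""
    emails: Dict[str, str] = {}

    def email_token(match: "re.Match[str]") -> str:
        # Same address (case-insensitive) maps to the same token within a document.
        return emails.setdefault(match.group(0).lower(), f"<EMAIL_{len(emails) + 1}>")

    def customer_token(match: "re.Match[str]") -> str:
        digest = hmac.new(key, match.group(0).encode(), hashlib.sha256).hexdigest()
        return f"<CUST_{digest[:12]}>"

    text = EMAIL.sub(email_token, text)
    text = SSN.sub("<SSN>", text)
    return CUSTOMER_ID.sub(customer_token, text)


indexed = redact_for_index(
    "CUST-77 (pat@example.com, SSN 123-45-6789) wrote in; cc PAT@example.com",
    key=b"demo-key-rotate-me",
)
```

Stable per-document tokens keep answers readable ("the address on file is `<EMAIL_1>`") while the raw value stays out of the index.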

Query-time redaction keeps richer text in storage but requires a policy engine on every retrieval path. Hybrid approaches tag fields: public marketing copy unredacted, support macros redacted at index, HR docs in a separate index with stricter ACLs.

Whatever you choose, document it for legal. "We do not store PII" is false the moment support tickets enter RAG. "We store PII with classification X, retention Y, access via role Z" is auditable.

Vendor and cross-border exposure

When context goes to a third-party API, you inherit their retention, subprocessors, and breach model. Enterprise tiers often offer zero-retention and regional routing. Read the contract, then verify settings in the dashboard. "We use Azure OpenAI" does not automatically mean your prompts never train the public model.

For highly regulated data, self-hosted inference or private endpoints remove one hop, but you still have context inside your harness logs. Data residency is not the same as data minimization.

Session and memory hygiene

Long sessions accumulate secrets in context even when individual turns look clean. A developer pastes an API key once to debug; ten turns later the model summarizes "things we tried" and quotes it back.

Mitigations: session TTLs, periodic context summarization that runs through a redaction pass, separate short-lived sub-sessions for sensitive tasks, and UI warnings when clipboard-detected secrets enter the chat.
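Two of these mitigations, a session TTL and a redaction pass ahead of summarization, can be sketched together. The `Session` class, the injected clock, and the join-based stand-in for a summarizer are assumptions; in practice `summarize` would be a model call.

```python
import re
import time
from typing import Callable, List

KEY_SHAPE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")   # approximate, for illustration


def redact_keys(text: str) -> str:
    return KEY_SHAPE.sub("[REDACTED]", text)


class Session:
    def __init__(self, ttl_seconds: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._expires_at = now() + ttl_seconds
        self.turns: List[str] = []

    def add_turn(self, text: str) -> None:
        if self._now() > self._expires_at:
            raise TimeoutError("session expired; start a new one")
        self.turns.append(text)

    def compact(self, summarize: Callable[[List[str]], str],
                redact: Callable[[str], str]) -> None:
        # Redact first so the summarizer never sees the raw secret.
        self.turns = [summarize([redact(turn) for turn in self.turns])]


clock = [0.0]                                      # fake clock for the demo
session = Session(ttl_seconds=10.0, now=lambda: clock[0])
session.add_turn("my key is AKIAIOSFODNN7EXAMPLE")
session.compact(lambda turns: " | ".join(turns), redact_keys)
```

Redacting before summarizing matters: a summary written from raw turns can paraphrase or re-quote the secret in a form the patterns no longer match.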

Memory products (persistent user profiles across chats) amplify leakage. Anything written to long-term memory should pass the same classification rules as your RAG index.

Incident response when leakage happens

Plan for when, not if, and rehearse with breach drills. When a secret hits a vendor log or a user screenshot, rotate the credential first, then trace: which turn introduced it, which log pipeline stored it, which retention bucket holds it.

Notify using your DPA timelines. Model vendors often have separate security contact paths from billing support. Keep those contacts in the runbook before you need them at 2am.

Post-incident, add a detector for that secret class and a harness rule that blocks the tool path that exfiltrated. Leakage response without code change repeats the same story next quarter.

Customer-facing transparency

Users ask whether their data trains models, how long chats are stored, and who can see logs. Your privacy page should match harness reality: if support staff can replay traces in LangSmith, say so and say for how long.

Enterprise customers will ask for data processing agreements that list subprocessors and retention. Building secret mediation and log scrubbing first makes those questionnaires boring instead of panic-driven.

Classification labels that stick

Define a small set of internal labels (for example public, internal, confidential, restricted) and require them on RAG chunks, log streams, and tool outputs. Restricted data never goes to external model APIs without legal sign-off.

Automate enforcement: if a chunk is restricted, block retrieval unless the session identity has a matching role. Prompts telling the model "be careful" do not substitute for metadata checks.
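A metadata check of this kind can be sketched as a pure function the retrieval layer calls for every chunk. `LabeledChunk`, `may_release`, and the `required_role` field are hypothetical names, and the rule set is only the two rules stated above plus failing closed on unknown labels.

```python
from dataclasses import dataclass
from typing import FrozenSet

LABELS = frozenset({"public", "internal", "confidential", "restricted"})


@dataclass(frozen=True)
class LabeledChunk:
    text: str
    label: str
    required_role: str = ""    # only consulted for restricted chunks


def may_release(chunk: LabeledChunk, session_roles: FrozenSet[str],
                external_api: bool, legal_signoff: bool = False) -> bool:
    """Decide from metadata alone whether a chunk may enter this session's context."""
    if chunk.label not in LABELS:
        return False           # fail closed on missing or unknown labels
    if chunk.label == "restricted":
        if chunk.required_role not in session_roles:
            return False       # session identity lacks the matching role
        if external_api and not legal_signoff:
            return False       # restricted data stays off external model APIs
    return True


payroll = LabeledChunk("salary bands", "restricted", required_role="hr")
faq = LabeledChunk("return policy", "public")
```

Because the decision never consults the model, a prompt injection cannot talk its way past it.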

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is the context window a poor place to store secrets?
Name three paths by which PII enters a RAG answer.
What is the secrets proxy pattern?
Why filter on metadata before vector search, not after?
Where should log scrubbing run in the pipeline?

Quick check

Paste the API key in the system prompt for the session
Agent calls an internal proxy with a task-scoped token; proxy holds the real key
Mount production credentials read-only in the sandbox

Retrieval skips tenant metadata filters before vector search
Embeddings use the wrong vector dimension
Temperature is too high during generation

Higher cloud storage bills only
Secrets and PII persisting in third-party log stores
Slower model inference

Secrets the model saw earlier in the session and repeats later
Only malicious user uploads
Overfitting on the training set