Lesson 04

Synthetic data

Synthetic data consists of examples produced by a model, a template, or a simulator rather than drawn from logged human traffic. It scales fast. It also copies every mistake the generator already makes.

The one idea

Synthetic data is a multiplier, not a substitute. It amplifies the teacher's coverage and its blind spots. Use it to fill gaps you can verify, not to avoid hard labeling work on policy-critical slices.

Where synthetic data helps

Cold start: You need 500 examples before human labelers have a rubric to follow. A teacher model drafts seed cases you edit into shape.
Rare intents: Fraud, safety refusals, or edge-case tool calls that almost never appear in logs.
Format diversity: Vary phrasing, length, and input noise while keeping the target behavior fixed.
Distillation: A large teacher generates targets for a smaller student. The synthetic set is the transfer medium.

The pattern that works: seed from real failures, generate variations, human-review the slices that matter, deduplicate aggressively (see Lesson 05), then mix with real examples so the distribution does not collapse.

Where it hurts

Variety collapse: One prompt template plus one teacher model produces thousands of near-identical dialogues. The student learns the template, not the task.
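
One rough way to spot template lock-in before training is to measure how many word n-grams in a batch are unique. The sketch below is a minimal illustration, not a standard metric implementation: the function name `distinct_ngram_ratio` and the sample strings are made up for this example, and any threshold you act on needs calibrating on your own data.

```python
from collections import Counter


def distinct_ngram_ratio(texts, n=3):
    """Fraction of word n-grams that are unique across all texts (0.0 to 1.0)."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Illustrative batches: one template with a swapped noun vs. three different phrasings.
templated = [
    "Hi, I would like a refund for my order please",
    "Hi, I would like a refund for my ticket please",
    "Hi, I would like a refund for my book please",
]
varied = [
    "Hi, I would like a refund for my order please",
    "my package never showed up, can I get my money back?",
    "Charged twice this month. Fix it.",
]

print(distinct_ngram_ratio(templated))  # 0.5
print(distinct_ngram_ratio(varied))     # 1.0
```

A low ratio on a large batch suggests the student will mostly see the template. It says nothing about whether the labels are right.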

Error compounding: Wrong labels at generation time become training signal. At 50k rows, a 2% error rate is 1,000 bad lessons.
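
The arithmetic behind that figure is just rows times error rate. A small helper (the name `expected_bad_rows` is made up for this sketch) makes it easy to rerun for your own dataset size and measured error rate:

```python
def expected_bad_rows(total_rows, error_rate):
    """Expected number of mislabeled rows for a given label error rate."""
    return round(total_rows * error_rate)


print(expected_bad_rows(50_000, 0.02))  # 1000
```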

Policy blind spots: If the teacher was not trained on your refund policy, synthetic support data will quietly invent one.

Eval contamination: Generating eval-like questions from the same teacher that also wrote training data can leak benchmark structure even without exact string matches.
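
A cheap first check for this kind of leak is word n-gram overlap between each eval item and the training set. The sketch below is an assumption-laden illustration: the helper names and the example sentences are hypothetical, and surface overlap only catches shared phrasing, so a low score does not prove the eval is clean.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_score(eval_text, train_texts, n=4):
    """Share of the eval item's n-grams that also appear in any training text."""
    eval_grams = word_ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    train_grams = set()
    for text in train_texts:
        train_grams |= word_ngrams(text, n)
    return len(eval_grams & train_grams) / len(eval_grams)


# Hypothetical rows: no exact match, but the same generated scaffold.
train = ["A customer asks how to cancel an annual plan after 40 days"]
eval_item = "A customer asks how to cancel a monthly plan after 10 days"

score = contamination_score(eval_item, train)
print(round(score, 2))  # 0.33
```

Items with high overlap are candidates for manual inspection or for replacement with questions written from real traffic.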

Synthetic collapse

Teams sometimes report great offline metrics after training on mostly synthetic data, then watch the model fail on messy real user phrasing. The eval looked good because it was drawn from the same synthetic distribution.
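
One way to catch this before launch is to report accuracy separately for eval items drawn from real traffic and items that were generated. The sketch below assumes each eval result carries a `source` tag and a `correct` flag; both field names and the sample results are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_source(results):
    """Accuracy per value of the 'source' field ('synthetic', 'real', ...)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for result in results:
        totals[result["source"]] += 1
        hits[result["source"]] += int(result["correct"])
    return {source: hits[source] / totals[source] for source in totals}


# Made-up results, only to show the shape of the report.
results = [
    {"source": "synthetic", "correct": True},
    {"source": "synthetic", "correct": True},
    {"source": "synthetic", "correct": True},
    {"source": "synthetic", "correct": False},
    {"source": "real", "correct": True},
    {"source": "real", "correct": False},
]

print(accuracy_by_source(results))  # {'synthetic': 0.75, 'real': 0.5}
```

A large gap between the two numbers is the signal described above: the aggregate looks healthy only because most of the eval shares the training distribution.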

Keeping the teacher honest

Treat synthetic generation like a pipeline with QA gates:

Seed from real anchors: Start from logged failures or expert-written cases, not from blank prompts alone.
Vary generators: Multiple prompt templates, temperatures, and paraphrase passes reduce template lock-in.
Filter automatically: Drop rows that fail schema checks, exceed length limits, hit blocklists, or fail a cheap classifier.
Human-review critical slices: Safety, legal, medical, billing. Budget review there, not uniformly across easy rows.
Track provenance: Tag every row as synthetic, which model version produced it, and which seed it came from.
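
The automatic filter and the provenance tag can be sketched as one small gate. Everything specific here is an assumption made for illustration: the `prompt`/`response` schema, the length limit, the blocklist phrase, and the `teacher-v1` and `ticket-0001` identifiers. The cheap-classifier gate is left out to keep the example short.

```python
from dataclasses import asdict, dataclass

MAX_CHARS = 2000                            # assumed length limit
BLOCKLIST = ("as an ai language model",)    # assumed blocklist phrase
REQUIRED_KEYS = ("prompt", "response")      # assumed row schema


@dataclass
class Provenance:
    synthetic: bool
    generator_model: str  # model version that produced the row
    seed_id: str          # real failure or expert case the row was derived from


def passes_gates(row):
    """Schema, length, and blocklist checks for one candidate row."""
    for key in REQUIRED_KEYS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    if len(row["prompt"]) + len(row["response"]) > MAX_CHARS:
        return False
    text = (row["prompt"] + " " + row["response"]).lower()
    return not any(phrase in text for phrase in BLOCKLIST)


def tag_and_filter(rows, generator_model, seed_id):
    """Drop rows that fail a gate and attach provenance to the survivors."""
    tag = asdict(Provenance(True, generator_model, seed_id))
    return [{**row, "provenance": dict(tag)} for row in rows if passes_gates(row)]


rows = [
    {"prompt": "Where is my order?", "response": "Let me check the tracking status."},
    {"prompt": "Cancel my plan", "response": ""},
    {"prompt": "Refund?", "response": "As an AI language model, I cannot help."},
    {"prompt": "Hi", "response": "x" * 5000},
]
kept = tag_and_filter(rows, generator_model="teacher-v1", seed_id="ticket-0001")
print(len(kept))  # 1
```

With the tag stored on every row, a bad generator version or a bad seed can be traced and removed without rebuilding the whole set.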

For preference pairs, generate two candidates from different prompts or models when possible. Pairs where both answers share the same failure mode are worse than useless: they teach the ranker to prefer one fluent wrong answer over another.
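
If you have any automatic correctness check for a slice, you can use it to drop pairs that carry no usable signal. In the sketch below, `is_correct` stands in for whatever verifier you have (a unit test, a policy lookup, a human label); the toy arithmetic checker and the candidate answers are invented for the example. It also skips pairs where both answers pass, since this check alone cannot rank them.

```python
def usable_pairs(candidates, is_correct):
    """Keep only pairs where exactly one candidate passes the check."""
    kept = []
    for prompt, answer_a, answer_b in candidates:
        ok_a = is_correct(prompt, answer_a)
        ok_b = is_correct(prompt, answer_b)
        if ok_a == ok_b:
            continue  # both wrong or both right: no clear preference from this check
        chosen, rejected = (answer_a, answer_b) if ok_a else (answer_b, answer_a)
        kept.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return kept


# Toy verifier: the expected number must appear in the answer.
EXPECTED = {"What is 17 + 5?": "22", "What is 9 * 6?": "54"}


def toy_check(prompt, answer):
    return EXPECTED[prompt] in answer


candidates = [
    ("What is 17 + 5?", "17 + 5 equals 22.", "The sum is 23."),
    ("What is 9 * 6?", "That comes to 56.", "9 * 6 is 63."),  # both wrong
]
pairs = usable_pairs(candidates, toy_check)
print(len(pairs))  # 1
```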

Mixing ratios

There is no universal ratio. A practical starting point for domain fine-tunes:

Keep a floor of 20–40% high-quality human or expert-reviewed examples in the mix.
Never let synthetic rows dominate safety or compliance slices.
Measure slice-level eval, not just aggregate accuracy. Synthetic-heavy sets often lift easy slices and hide regression on rare cases.
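
A floor like this can be enforced mechanically by capping how many synthetic rows enter the mix. The sketch below is one possible approach, assuming rows are plain dicts; the function name `mix_with_floor` and the 30% default are illustrative choices, not recommendations. The floor is passed as an integer percentage to avoid floating-point rounding in the cap.

```python
import random


def mix_with_floor(human_rows, synthetic_rows, min_human_pct=30, seed=0):
    """Subsample synthetic rows so human rows make up at least min_human_pct percent."""
    if not 0 < min_human_pct <= 100:
        raise ValueError("min_human_pct must be in (0, 100]")
    # Largest synthetic count that keeps the human share at or above the floor.
    max_synthetic = len(human_rows) * (100 - min_human_pct) // min_human_pct
    rng = random.Random(seed)
    synthetic = list(synthetic_rows)
    if len(synthetic) > max_synthetic:
        synthetic = rng.sample(synthetic, max_synthetic)
    mixed = list(human_rows) + synthetic
    rng.shuffle(mixed)
    return mixed


human = [{"id": i, "source": "human"} for i in range(300)]
synthetic = [{"id": i, "source": "synthetic"} for i in range(5000)]

mixed = mix_with_floor(human, synthetic, min_human_pct=30)
print(len(mixed))  # 1000
```

Calling it once per slice, with a higher floor for safety or compliance slices, keeps synthetic rows from dominating exactly where the lesson says they should not.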

Storage is cheap relative to review. A million tokens of JSONL is only a few megabytes, which costs a fraction of a cent per month on object storage. The expensive part is knowing which synthetic rows to trust.

Engineering reality

Generating 10k synthetic chat examples with a frontier API might cost $50–$300 depending on length and model, plus human review on a 5–10% sample. That can beat $20k of pure human labeling for cold start, but only if you validate on real traffic slices before you ship.
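
To redo this estimate for your own project, multiply examples by tokens per example by the price per token. The numbers in the sketch below are placeholders chosen only so the result lands inside the range quoted above; they are not real prices for any provider, and the helper names are hypothetical.

```python
def generation_cost_usd(n_examples, tokens_per_example, usd_per_million_tokens):
    """API cost for generating n_examples at a flat per-token price."""
    return n_examples * tokens_per_example * usd_per_million_tokens / 1_000_000


def review_count(n_examples, sample_fraction):
    """How many rows a human reviews at the given sample fraction."""
    return round(n_examples * sample_fraction)


# Placeholder inputs: 10k examples, 1,500 tokens each, $10 per million tokens.
print(generation_cost_usd(10_000, 1_500, 10))  # 150.0
print(review_count(10_000, 0.05))              # 500
print(review_count(10_000, 0.10))              # 1000
```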

Checkpoint

You're ready for the next lesson if you can answer these from memory:

When is synthetic data a good idea vs a dangerous shortcut?
What is variety collapse and how do you reduce it?
Why should critical policy slices get human review even if most rows are synthetic?
What metadata should you store with every synthetic row?
