Lesson 02

Build a teacher-student distillation dataset

The student only learns the distribution you show it. A good distillation dataset is built from the inputs the product actually has to handle, not from a random pile of teacher completions.

The one idea

The dataset is the compression plan. It decides which teacher behavior survives in the student, which cases get ignored, and which failures become permanent.

Start from traffic, not imagination

The best prompts for distillation come from the task the small model will actually serve. Use production logs when you can. If you do not have traffic yet, use eval cases, support tickets, documents, or hand-written cases that match the expected product shape.

Coverage matters more than raw count at the start. A thousand near-duplicate examples teach less than a few hundred examples that span easy cases, hard cases, empty inputs, policy boundaries, weird formatting, and user mistakes. Distillation is not magic averaging. It is pattern copying.
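A minimal sketch of checking coverage before you count rows. It collapses near-duplicates with a crude text normalization and then counts what is left per slice. The `slice` field and the example prompts are hypothetical, and real near-duplicate detection usually needs something stronger than this (for example, fuzzy or embedding-based matching).

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedupe(prompts: list[dict]) -> list[dict]:
    """Keep only the first prompt for each normalized text."""
    seen, kept = set(), []
    for prompt in prompts:
        key = normalize(prompt["text"])
        if key not in seen:
            seen.add(key)
            kept.append(prompt)
    return kept

# Hypothetical prompts tagged with the slice they are meant to cover.
prompts = [
    {"text": "My invoice is wrong!", "slice": "billing"},
    {"text": "my invoice is  wrong", "slice": "billing"},
    {"text": "", "slice": "empty_input"},
    {"text": "App crashes on login", "slice": "bug"},
]

unique = dedupe(prompts)
coverage = Counter(p["slice"] for p in unique)
```

Here four rows shrink to three, and `coverage` shows one example per slice. A slice with a count of zero is a gap the student will never see.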

Make the target narrow. "Be a smaller general assistant" is not a dataset spec. "Classify these ticket messages into this taxonomy and return JSON with confidence and escalation reason" is a dataset spec.
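One way to make the spec concrete is to write the target output as a type that rejects anything outside it. This is a sketch under assumptions: the taxonomy in `LABELS` and the field names are hypothetical stand-ins for whatever your product defines.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical taxonomy; replace with your own label set.
LABELS = {"billing", "bug", "account", "other"}

@dataclass(frozen=True)
class TicketLabel:
    category: str
    confidence: float
    escalation_reason: Optional[str]

    def __post_init__(self):
        # Reject anything outside the spec instead of silently keeping it.
        if self.category not in LABELS:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")

def parse_label(raw: str) -> TicketLabel:
    """Parse one JSON answer into the target shape."""
    data = json.loads(raw)
    return TicketLabel(
        category=data["category"],
        confidence=float(data["confidence"]),
        escalation_reason=data.get("escalation_reason"),
    )

label = parse_label(
    '{"category": "billing", "confidence": 0.92, "escalation_reason": null}'
)
```

If you cannot write the target as a type like this, the spec is probably still too broad.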

Generate teacher outputs deliberately

The teacher prompt should describe exactly what you want the student to learn. If the final product needs strict JSON, make the teacher produce strict JSON. If the product needs short answers, do not let the teacher write essays. If refusals matter, include refusal cases in the teacher job.

For ambiguous tasks, generate more than one teacher answer or ask the teacher to grade its own answer against a rubric. You are not doing this because the teacher is always right. You are doing it because disagreement shows you where the boundary is soft.
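A small sketch of using disagreement as a signal. It assumes you already sampled the teacher several times per input and reduced each answer to a label; the IDs and labels below are made up for illustration.

```python
from collections import Counter

def agreement(labels: list[str]) -> tuple[str, float]:
    """Return the majority label and the fraction of samples that chose it."""
    top, count = Counter(labels).most_common(1)[0]
    return top, count / len(labels)

# Hypothetical teacher samples: three answers per ticket.
samples = {
    "ticket-1": ["billing", "billing", "billing"],
    "ticket-2": ["billing", "account", "bug"],
}

# Anything short of unanimous goes to review rather than straight to training.
needs_review = [
    ticket_id
    for ticket_id, labels in samples.items()
    if agreement(labels)[1] < 1.0
]
```

The unanimity threshold is a choice, not a rule. A looser threshold sends fewer rows to review but lets more soft-boundary cases through.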

Keep the teacher prompt, model version, sampling settings, and output filters with the dataset. Six months later, you should be able to explain why an example exists and which teacher produced it.

Practical pattern

Store each row as input, teacher output, teacher metadata, filter status, source, and split. The teacher metadata column in particular often saves hours when an eval regression points back to bad labels.
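The row layout above could look like this as one JSONL line. The field values, the `teacher-v1` model name, and the `prompt_id` scheme are hypothetical; the point is that teacher prompt, model version, sampling settings, and filters travel with every row.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TeacherMeta:
    model: str          # which teacher version produced the output
    prompt_id: str      # pointer to the exact teacher prompt text
    temperature: float  # sampling setting used for this output
    filters: list[str]  # output filters applied after generation

@dataclass
class Row:
    input: str
    teacher_output: str
    teacher: TeacherMeta
    filter_status: str
    source: str
    split: str

row = Row(
    input="My invoice is wrong",
    teacher_output='{"category": "billing"}',
    teacher=TeacherMeta(
        model="teacher-v1",
        prompt_id="ticket-prompt-003",
        temperature=0.0,
        filters=["valid_json"],
    ),
    filter_status="passed",
    source="support_tickets",
    split="train",
)

# One JSON object per line keeps the dataset easy to diff and stream.
line = json.dumps(asdict(row), sort_keys=True)
```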

Filter harder than you generate

Teacher output is not ground truth. Large models can be verbose, inconsistent, overconfident, or just wrong. Bad teacher labels are worse than missing labels because the student will train toward them.

Use automatic filters first: valid JSON, schema match, length bounds, forbidden phrases, citation presence, allowed labels, and duplicate detection. Then sample by slice for human review. Do not only review random rows. Review the weird rows: low-confidence cases, long outputs, rare labels, failed parses, and cases where two teacher samples disagree.
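A sketch of the automatic pass, covering only some of the checks listed above. It returns reasons instead of a bare pass or fail, so you can later sample rejected rows by reason. The label set, required keys, forbidden phrase, and length bound are all placeholder assumptions.

```python
import json

ALLOWED_LABELS = {"billing", "bug", "account", "other"}  # hypothetical
REQUIRED_KEYS = {"category", "confidence", "escalation_reason"}
FORBIDDEN_PHRASES = ("as an ai",)  # hypothetical
MAX_CHARS = 400  # hypothetical length bound

def filter_reasons(raw: str) -> list[str]:
    """Return every reason this teacher output should be rejected."""
    reasons = []
    if len(raw) > MAX_CHARS:
        reasons.append("too_long")
    if any(phrase in raw.lower() for phrase in FORBIDDEN_PHRASES):
        reasons.append("forbidden_phrase")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return reasons + ["invalid_json"]
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return reasons + ["schema_mismatch"]
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_LABELS:
        reasons.append("unknown_label")
    return reasons

good = '{"category": "billing", "confidence": 0.9, "escalation_reason": null}'
drifted = '{"type": "billing", "confidence": 0.9, "escalation_reason": null}'
```

An empty list means the row passed. Store the reasons in the filter status field so the failed parses are easy to pull for review.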

If people correct the teacher, keep the corrected answer as the label and record that it was edited. Those edited rows are valuable because they mark places where the teacher alone was not enough.
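One possible way to record corrections without losing what the teacher originally said. The field names and the `reviewer_1` editor ID are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class LabeledRow:
    input: str
    label: str           # the label the student trains on
    teacher_label: str   # what the teacher originally produced
    edited: bool = False
    editor: Optional[str] = None

def apply_correction(row: LabeledRow, corrected: str, editor: str) -> LabeledRow:
    """Swap in the human label and mark the row as edited."""
    if corrected == row.label:
        return row  # reviewer agreed with the teacher; nothing to record
    return replace(row, label=corrected, edited=True, editor=editor)

row = LabeledRow(input="Charged twice", label="bug", teacher_label="bug")
fixed = apply_correction(row, "billing", editor="reviewer_1")
```

Keeping `teacher_label` next to the corrected `label` lets you measure how often, and on which slices, the teacher needed help.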

Common mistake

Do not train on every teacher answer just because it was expensive to generate. The cheapest bad label is the one you delete before training.

Split by source and difficulty

Random train and test splits can leak near-duplicates. If the same customer issue appears in both train and test with tiny wording changes, your student looks smarter than it is. Split by source, account, document, time window, or task family when possible.
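A minimal sketch of a grouped split: hash the group key (here an account ID, which is an assumed field) and assign the whole group to one side. Every row from the same account lands in the same split, and the assignment stays stable across dataset refreshes.

```python
import hashlib

def assign_split(group_key: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a group to 'train' or 'test'."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"

# Hypothetical rows: two near-duplicate tickets from the same account.
rows = [
    {"account": "acct-17", "text": "Invoice total is wrong"},
    {"account": "acct-17", "text": "invoice total wrong again"},
    {"account": "acct-42", "text": "Cannot log in"},
]
for r in rows:
    r["split"] = assign_split(r["account"])
```

Hashing gives roughly the requested fraction of groups, not of rows. If a few groups are very large, check the resulting row counts.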

Hold out hard cases on purpose. A compressed model often looks fine on easy examples and collapses on edge cases. Your eval split should include rare labels, long contexts, short contexts, ambiguous inputs, adversarial wording, and cases where the safe answer is "I cannot answer from the given context."
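A quick sketch of checking that the eval split actually contains the hard cases. It assumes each eval row carries a hypothetical `tags` list; the slice names mirror the list above.

```python
from collections import Counter

REQUIRED_SLICES = {
    "rare_label", "long_context", "short_context",
    "ambiguous", "adversarial", "unanswerable",
}

def missing_slices(eval_rows: list[dict], min_per_slice: int = 1) -> list[str]:
    """List required slices that have fewer than min_per_slice eval rows."""
    counts = Counter(tag for row in eval_rows for tag in row["tags"])
    return sorted(s for s in REQUIRED_SLICES if counts[s] < min_per_slice)

eval_rows = [
    {"id": 1, "tags": ["rare_label", "short_context"]},
    {"id": 2, "tags": ["ambiguous"]},
    {"id": 3, "tags": ["long_context", "adversarial"]},
]
gaps = missing_slices(eval_rows)
```

Here `gaps` reports that no eval row covers the unanswerable case.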

Also keep a small golden set that never enters training. That set is the judge you come back to after each dataset refresh, model change, or quantization pass.
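A sketch of a leak check to run before each training job. It only catches inputs that match after lowercasing and whitespace cleanup, so paraphrases still need a fuzzier comparison. The example strings are made up.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def leaked(golden: list[str], train: list[str]) -> list[str]:
    """Return golden inputs whose fingerprint also appears in training data."""
    train_prints = {fingerprint(t) for t in train}
    return [g for g in golden if fingerprint(g) in train_prints]

golden = ["Reset my password", "Why was I charged twice?"]
train = ["reset  my PASSWORD", "App crashes on login"]
overlap = leaked(golden, train)
```

A non-empty result should block training until the overlapping rows are removed from the training split.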

Distill the output shape you need

Students learn habits. If the teacher output includes rambling rationales, the student may ramble. If the teacher sometimes returns `"category"` and sometimes `"type"`, the student may drift. If the teacher says "probably" in labels, the student may emit uncertainty where your parser expects a fixed enum.

Before training, normalize the answer shape. Use stable keys, stable label names, clear refusal text, and a consistent amount of explanation. This is boring data work, but it is exactly what turns distillation from a demo into a product asset.
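A minimal normalization sketch for the drift described above. The alias tables are hypothetical; in practice you would build them from the variants you actually see in teacher output.

```python
import json

# Hypothetical alias tables built from observed teacher drift.
KEY_ALIASES = {"type": "category"}
LABEL_ALIASES = {"probably billing": "billing"}

def normalize_output(raw: str) -> str:
    """Rewrite one teacher answer into the stable shape the parser expects."""
    data = json.loads(raw)
    # Map drifting key names onto the canonical key.
    data = {KEY_ALIASES.get(key, key): value for key, value in data.items()}
    # Map hedged or differently cased labels onto the fixed enum.
    label = str(data["category"]).strip().lower()
    data["category"] = LABEL_ALIASES.get(label, label)
    return json.dumps(data, sort_keys=True)

clean = normalize_output('{"type": "Probably billing", "confidence": 0.6}')
# clean == '{"category": "billing", "confidence": 0.6}'
```

Run normalization before the filters, so a row that only had a key-name drift is repaired rather than thrown away.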

Engineering reality

The dataset should be versioned like code. Changing the teacher prompt, label schema, or filtering rule changes the model you are building. Put those changes in release notes for the model artifact.
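One way to make those changes visible is to derive a short version ID from the things that define the dataset. This is a sketch; the inputs shown are placeholders, and a real pipeline might also hash the data files themselves.

```python
import hashlib
import json

def dataset_version(teacher_prompt: str, label_schema: list[str],
                    filter_rules: dict) -> str:
    """Short fingerprint that changes when the prompt, schema, or filters change."""
    payload = json.dumps(
        {
            "teacher_prompt": teacher_prompt,
            "label_schema": sorted(label_schema),
            "filter_rules": filter_rules,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

v1 = dataset_version("Classify the ticket.", ["billing", "bug"], {"max_chars": 400})
v2 = dataset_version("Classify the ticket.", ["billing", "bug"], {"max_chars": 300})
```

Here only a filter threshold changed, and `v1` and `v2` differ. Record the ID in the model artifact's release notes next to a description of what changed.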

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • Why should distillation prompts come from real task traffic?
  • What metadata should you keep with teacher-generated labels?
  • Why are random splits risky for distillation evals?
  • What does it mean to normalize the output shape?

Quick check

  • A public list of generic assistant prompts
  • Representative support tickets with clear target labels and hard cases
  • One million unrelated teacher completions

  • So future regressions can be traced to the label source
  • So the student has more tokens to train on
  • So filtering is no longer needed