What is model distillation?

Distillation is the move from "this big model works" to "can a smaller model copy the parts we actually need?" It is one of the cleanest ways to cut latency and cost when the task is stable.

The one idea

Model distillation trains a smaller student model on signals produced by a larger teacher model. The goal is not to clone every capability. The goal is to transfer the behavior your product needs.

The teacher-student setup

In normal supervised fine-tuning, examples usually come from people or existing product data: prompt in, expected answer out. In distillation, the expected answer often comes from a stronger model. The large model acts as the teacher. The smaller model is the student.

For LLMs, the teacher might be a frontier API model, a large open model, or an internal model that is too slow for the final product. You ask the teacher to solve many examples, clean those outputs, then train the student to produce similar answers.
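The ask, clean, and collect loop can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_teacher`, `is_clean`, and `ALLOWED_LABELS` are hypothetical stand-ins, and a real teacher would be a call to whatever large model you use.

```python
from typing import Callable

# Hypothetical label set for a toy intent-classification task.
ALLOWED_LABELS = {"refund", "shipping", "other"}


def toy_teacher(prompt: str) -> str:
    """Stand-in for a call to a large teacher model (assumption, not a real API)."""
    return "refund" if "money back" in prompt.lower() else ""


def is_clean(prompt: str, answer: str) -> bool:
    """Cleaning rule: keep only answers that are valid labels."""
    return answer in ALLOWED_LABELS


def build_distillation_set(
    prompts: list[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list[dict[str, str]]:
    """Ask the teacher to answer each prompt and keep the outputs that pass cleaning."""
    examples = []
    for prompt in prompts:
        answer = teacher(prompt).strip()
        if keep(prompt, answer):
            examples.append({"prompt": prompt, "completion": answer})
    return examples


prompts = ["I want my money back", "hello???"]
train_set = build_distillation_set(prompts, toy_teacher, is_clean)
print(train_set)
```

The second prompt is dropped because the stand-in teacher returns an empty answer for it. In practice the cleaning step is where most of the judgment goes: format checks, deduplication, and spot review of teacher mistakes.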

This is useful because the teacher output can include choices that are hard to write by hand: tone, reasoning style, refusal behavior, tool call shape, extraction judgment, and edge-case handling. The student does not become the teacher. It learns a cheaper approximation over the distribution you show it.

The teacher turns your task distribution into labels. The student learns the behavior that shows up in those labels.

What gets distilled?

The simple version is answer distillation: train the student on the teacher's final answers. This is common for classification, extraction, rewrite, routing, and support-response tasks. The student sees the input and the teacher answer, then learns to imitate the answer.
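As a toy illustration of answer distillation, the sketch below trains a very small student on labels produced by a stand-in teacher, then measures how often the two agree on held-out inputs. Everything here is hypothetical: `teacher_label` is a keyword rule pretending to be a large model, and `WordVoteStudent` is a deliberately simple classifier, not a recommended student architecture.

```python
from collections import Counter, defaultdict


def teacher_label(text: str) -> str:
    """Hypothetical stand-in for an expensive teacher model call."""
    billing_words = ("refund", "charge", "invoice")
    return "billing" if any(w in text.lower() for w in billing_words) else "other"


class WordVoteStudent:
    """Tiny student: each word votes for the labels it appeared with in training."""

    def __init__(self) -> None:
        self.word_votes: dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()

    def fit(self, texts: list[str], labels: list[str]) -> None:
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for word in set(text.lower().split()):
                self.word_votes[word][label] += 1

    def predict(self, text: str) -> str:
        votes: Counter = Counter()
        for word in set(text.lower().split()):
            votes.update(self.word_votes.get(word, {}))
        if not votes:
            # No known words: fall back to the most common training label.
            return self.label_counts.most_common(1)[0][0]
        return votes.most_common(1)[0][0]


train_texts = [
    "please refund my order",
    "wrong charge on my card",
    "invoice is missing",
    "app crashes on login",
    "how do I reset my password",
    "love the new design",
]
# The teacher, not a person, supplies the training labels.
teacher_labels = [teacher_label(t) for t in train_texts]

student = WordVoteStudent()
student.fit(train_texts, teacher_labels)

held_out = ["refund the charge", "password reset"]
agreement = sum(student.predict(t) == teacher_label(t) for t in held_out) / len(held_out)
print(f"student/teacher agreement: {agreement:.0%}")
```

Agreement with the teacher on held-out inputs is the natural first metric for this setup, though it only tells you how well the student imitates the teacher, not whether the teacher was right.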

Some setups also distill intermediate behavior. A teacher might produce a rationale, a plan, a tool call trace, or a set of rejected alternatives. That extra signal can help, but it can also teach the student to copy noisy reasoning. Use it only when the intermediate text maps to behavior you want at inference time.

Classic distillation can also train on probability distributions, not just the winning label. If a teacher says class A has 0.55 probability and class B has 0.40, with the rest spread over other classes, that softness tells the student the case is ambiguous. In many LLM workflows you do not get clean token-level probabilities from the teacher, so you approximate the same idea with multiple samples, rankings, or rubrics.
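One way to approximate soft targets without access to teacher probabilities is to sample the teacher several times on the same input and use the answer frequencies. The sketch below assumes you already have the sampled answers as strings; the 20 samples are made up to mirror the 0.55 / 0.40 example.

```python
from collections import Counter


def soft_targets_from_samples(samples: list[str]) -> dict[str, float]:
    """Turn repeated teacher answers for one input into an empirical distribution."""
    if not samples:
        raise ValueError("need at least one teacher sample")
    counts = Counter(samples)
    total = len(samples)
    return {label: n / total for label, n in counts.items()}


# Hypothetical: 20 teacher samples for a single ambiguous input.
samples = ["A"] * 11 + ["B"] * 8 + ["C"] * 1
targets = soft_targets_from_samples(samples)
print(targets)
```

More samples give a smoother estimate but multiply teacher cost, so this is usually reserved for inputs where you expect ambiguity.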

Vocabulary

A hard label says "this is the answer." A soft target carries more shape: confidence, alternatives, or multiple acceptable outputs. Soft targets can teach the student where the task boundary is fuzzy.
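The difference shows up in the loss. In this small sketch, cross-entropy is computed against a hard label and against a soft target for two hypothetical student predictions. The distributions are illustrative numbers, not measurements.

```python
import math


def cross_entropy(target: dict[str, float], predicted: dict[str, float]) -> float:
    """Cross-entropy of a predicted distribution against a target distribution."""
    eps = 1e-12  # avoids log(0) when the student assigns zero probability
    return -sum(
        p * math.log(predicted.get(label, 0.0) + eps)
        for label, p in target.items()
        if p > 0
    )


hard = {"A": 1.0}
soft = {"A": 0.55, "B": 0.40, "C": 0.05}

# Two hypothetical student predictions for the same ambiguous input.
overconfident = {"A": 0.98, "B": 0.01, "C": 0.01}
hedged = {"A": 0.55, "B": 0.40, "C": 0.05}

for name, target in (("hard", hard), ("soft", soft)):
    print(
        name,
        "overconfident:", round(cross_entropy(target, overconfident), 3),
        "hedged:", round(cross_entropy(target, hedged), 3),
    )
```

Against the hard label, the overconfident student scores better. Against the soft target, the hedged student does. That is the sense in which soft targets tell the student where the boundary is fuzzy.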

Distillation is not the same as quantization

Distillation trains a new or adapted model. Quantization changes how numbers inside a trained model are stored and computed. Both can make inference cheaper, but they work at different levels: one changes which model you run, the other changes how that model's numbers are represented.

If you distill a 70B teacher into a 7B student, you are changing the model. If you quantize a 7B model from 16-bit weights to 4-bit weights, you are changing the representation of the same model. In production, teams often combine them: pick or train a smaller student, then quantize it for the target hardware.
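A rough back-of-the-envelope calculation makes the two effects easy to compare. The sketch below counts weight storage only, using 1 GB = 10^9 bytes. It ignores activations, the KV cache, and the small overhead that real quantization formats add for scales and similar metadata.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


teacher_16bit = weight_memory_gb(70e9, 16)  # 70B teacher, 16-bit weights
student_16bit = weight_memory_gb(7e9, 16)   # distilled 7B student, 16-bit weights
student_4bit = weight_memory_gb(7e9, 4)     # same student, quantized to 4-bit

print(f"70B @ 16-bit: {teacher_16bit:.1f} GB")
print(f" 7B @ 16-bit: {student_16bit:.1f} GB")
print(f" 7B @  4-bit: {student_4bit:.1f} GB")
```

Under these assumptions, distillation accounts for the 10x drop from teacher to student, and quantization adds a further 4x on top of it.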

The mistake is treating all compression as one knob. It is not. Distillation changes behavior. Quantization changes precision. Pruning changes structure. The right mix depends on which constraint is hurting: memory, latency, throughput, quality, privacy, or device limits.

When distillation is worth it

Distillation shines when the task is narrow, repeated, and expensive at scale. Think high-volume classification, extraction, moderation, routing, structured summarization, customer support drafts, or a voice-agent turn where every 100 milliseconds matters.

It is weaker when the product needs open-ended reasoning across many domains. A small student can copy a teacher's style on examples, but it will not inherit the full breadth of a much larger model. The training data bounds the behavior. Outside that zone, the student may fail on inputs the teacher would still handle, and it may do so with confident-sounding answers.

Engineering reality

Distillation only pays off if the new model will serve enough traffic to repay the work. You are buying lower inference cost with training cost, data work, eval work, and ongoing ownership of another model artifact.
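The trade can be written as a simple break-even estimate. This is a sketch under stated assumptions: every number below is hypothetical, costs are treated as flat per request, and quality differences between teacher and student are ignored.

```python
import math


def months_to_break_even(
    one_time_cost: float,
    monthly_requests: float,
    teacher_cost_per_request: float,
    student_cost_per_request: float,
    monthly_ownership_cost: float = 0.0,
) -> float:
    """Months until inference savings repay the up-front distillation work.

    Returns math.inf when the student never pays for itself.
    """
    monthly_saving = (
        monthly_requests * (teacher_cost_per_request - student_cost_per_request)
        - monthly_ownership_cost
    )
    if monthly_saving <= 0:
        return math.inf
    return one_time_cost / monthly_saving


# Hypothetical project: training, data, and eval work cost 20,000 up front.
high_traffic = months_to_break_even(20_000, 1_000_000, 0.01, 0.001, 1_000)
low_traffic = months_to_break_even(20_000, 10_000, 0.01, 0.001, 1_000)

print(f"high traffic: {high_traffic:.1f} months")
print(f"low traffic:  {low_traffic} months")
```

With these made-up numbers, the high-traffic case repays the work in a few months, while the low-traffic case never does because ownership cost exceeds the savings.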

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What role does the teacher model play in distillation?
Why does distillation transfer behavior, not every ability of the teacher?
How is distillation different from quantization?
What kind of production task is a good fit for distillation?
