Track 3 · Training & adapting models

Distillation & Compression

Big models are useful teachers, but they are not always the right thing to run in production. This course teaches how to make models smaller, faster, and cheaper without lying to yourself about the quality loss.

6 lessons · Intermediate · After Fine-tuning

What is model distillation?

How a smaller student model learns from a larger teacher, and why distillation is a behavior-transfer problem.

Build a teacher-student distillation dataset

How to choose prompts, teacher outputs, rationales, hard cases, and filters that make the student worth training.

Quantization: int8, int4, and GGUF

Why lower precision saves memory, where quality drops, and how formats like GGUF fit local inference.

Pruning, sparsity, and low-rank compression

The structural compression ideas: remove weights, factor matrices, or skip computation, then check whether inference actually runs faster on real hardware.

Evaluate compressed LLMs

How to measure the trade-off: task quality, calibration, latency, memory, throughput, cost, and failure modes.

Deploy small, fast models

Routing, fallback, monitoring, versioning, and when a compressed model should serve traffic.