Lesson 06

Deploy small, fast models

A compressed model rarely needs to replace the large model everywhere. The stronger pattern is to let the small model own the traffic it can handle and escalate the rest.

The one idea

Deploy compressed models as part of a routing system. The small model should handle predictable work, fallback should protect high-risk cases, and monitoring should tell you when the boundary moves.

Choose the serving role

A small model can play several roles. It can be the primary model for a narrow task, a pre-filter before a larger model, a router that chooses tools or prompts, a local privacy-preserving model, or a fallback when the expensive model is unavailable.

Each role has a different risk profile. A router can be wrong but still send the request to a larger model later. A primary model owns the answer users see. An on-device model may trade quality for privacy and low latency. Be explicit about the role before tuning infrastructure.

For many products, the best first deployment is a shadow run. The small model receives the same requests as production, but its answers are logged rather than shown. This gives you traffic-shaped eval data without user exposure.
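
A shadow run can be sketched in a few lines. This is a minimal illustration, not a production design: handle_request, the two model callables, and the in-memory log are all hypothetical names, and a real system would usually call the shadow model asynchronously so it adds no user-facing latency.

```python
import json
from typing import Callable, List, Optional


def handle_request(
    prompt: str,
    large_model: Callable[[str], str],
    small_model: Callable[[str], str],
    log: List[str],
) -> str:
    """Serve the large model's answer and log the small model's answer for offline eval."""
    served = large_model(prompt)

    shadow: Optional[str]
    error: Optional[str]
    try:
        shadow = small_model(prompt)
        error = None
    except Exception as exc:
        # A shadow failure must never change what the user sees.
        shadow, error = None, repr(exc)

    # One JSON line per request: the pair (served, shadow) is the eval record.
    log.append(json.dumps({
        "prompt": prompt,
        "served": served,
        "shadow": shadow,
        "shadow_error": error,
    }))
    return served
```

The user only ever receives the large model's output. The logged pairs can later be scored to estimate how the small model would have performed on real traffic.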

Route by confidence and task shape

Routing rules can be simple. Send short, common, well-formed cases to the small model. Send long, rare, high-risk, or low-confidence cases to the large model. If the small model fails schema validation, returns low confidence, or hits a policy boundary, escalate.
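
The rules above might look like the following sketch. The thresholds, field names, and the Request and SmallResult types are assumptions made up for illustration; a real router would also need a notion of "rare" inputs, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_CHARS = 2000       # hypothetical length cutoff for the fast path
MIN_CONFIDENCE = 0.8   # hypothetical confidence floor


@dataclass
class Request:
    text: str
    high_risk: bool = False


@dataclass
class SmallResult:
    output: str
    confidence: float
    schema_ok: bool
    policy_flag: bool = False


def pre_route(req: Request) -> str:
    """Decide before any model call: long or high-risk requests skip the small model."""
    if req.high_risk or len(req.text) > MAX_CHARS:
        return "large"
    return "small"


def should_escalate(res: SmallResult) -> bool:
    """Decide after the small model call: escalate on any failed check."""
    return (not res.schema_ok) or res.confidence < MIN_CONFIDENCE or res.policy_flag


def answer(
    req: Request,
    small_model: Callable[[str], SmallResult],
    large_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (output, which model served it)."""
    if pre_route(req) == "small":
        res = small_model(req.text)
        if not should_escalate(res):
            return res.output, "small"
    return large_model(req.text), "large"
```

Returning the serving path alongside the output makes it easy to count how often each path is taken.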

Do not hide fallback rate. It is one of the core economics of the system, because an escalated request often pays for both models: the small model's attempt and the large model's answer. If 95 percent of requests stay on the small model, you probably have a strong cost win. If only 40 percent stay there, the blended latency and cost may not justify the extra complexity.
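
A rough way to check the economics is to compute the blended per-request cost. The sketch below assumes every request tries the small model first and escalated requests then also pay for the large model. The cost units (1 for small, 10 for large) are invented for illustration, not measured figures.

```python
def blended_cost(small_cost: float, large_cost: float, stay_rate: float) -> float:
    """Expected per-request cost when all requests try the small model first
    and the fraction (1 - stay_rate) is then escalated to the large model."""
    if not 0.0 <= stay_rate <= 1.0:
        raise ValueError("stay_rate must be between 0 and 1")
    return small_cost + (1.0 - stay_rate) * large_cost


def savings_fraction(small_cost: float, large_cost: float, stay_rate: float) -> float:
    """Fraction of cost saved versus sending everything to the large model."""
    return 1.0 - blended_cost(small_cost, large_cost, stay_rate) / large_cost


# Hypothetical costs: small = 1 unit, large = 10 units per request.
for stay in (0.95, 0.40):
    saved = savings_fraction(1.0, 10.0, stay)
    print(f"stay rate {stay:.0%}: about {saved:.0%} saved")
```

Under these assumed costs, a 95 percent stay rate saves about 85 percent, while a 40 percent stay rate saves about 30 percent, before counting the operational overhead of running two models.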

[Figure: user traffic enters a router that applies a cost-aware policy based on shape, risk, and confidence. The small model serves the fast path; the large fallback serves the hard or risky path.]
Small models are easiest to trust when the system has a defined escape hatch.

Version everything together

A compressed model artifact is more than weights. It includes the base model, distillation data, teacher model, training recipe, quantization method, runtime, prompt template, tokenizer, safety rules, and eval report. Changing any of those can change behavior.

Version the whole bundle. A rollback should restore the weights, prompt template, parser, and routing policy together. If you roll back weights but keep a new parser or routing policy, you may not have rolled back the system users experience.
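
One way to make the bundle concrete is a manifest whose identifier changes whenever any component changes. The field names below are hypothetical and cover only part of the list above; the point is that deploys and rollbacks reference one bundle id rather than a weights file.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseBundle:
    weights_sha256: str
    tokenizer_version: str
    prompt_template_version: str
    quantization: str
    runtime_version: str
    routing_policy_version: str
    parser_version: str
    eval_report_id: str

    def bundle_id(self) -> str:
        """Short hash over every field, so changing any component yields a new id."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]
```

Because the id is derived from all fields, a new prompt template with unchanged weights still produces a new bundle, and rolling back means redeploying a previous id as a whole.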

Keep the release note short but concrete: what changed, which eval gates passed, known weak slices, fallback policy, and the expected cost or latency gain.

Monitor for boundary drift

The safe zone for a compressed model can drift. Product traffic changes, users discover new prompts, policies change, and the large fallback may improve. Monitor the same slices you used during evaluation, and for each one track parse failures, fallback (escalation) rate, latency percentiles, refusal rate, user corrections, and sampled quality review.

When fallback rate climbs, ask whether the router got stricter, traffic got harder, or the small model is stale. When quality drops in one slice, add examples to the next distillation dataset instead of blindly retraining on everything.
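
A rolling-window check on fallback rate is one simple way to notice the boundary moving. This is a sketch for a single slice; the window size, baseline, and tolerance are placeholder values you would set from your own evaluation data.

```python
from collections import deque


class FallbackMonitor:
    """Tracks the fallback rate over the most recent `window` requests."""

    def __init__(self, window: int = 1000, baseline: float = 0.05, tolerance: float = 0.05):
        self.events = deque(maxlen=window)  # True means the request was escalated
        self.baseline = baseline            # fallback rate observed at release time
        self.tolerance = tolerance          # allowed increase before flagging drift

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifted(self) -> bool:
        """Flag drift only once the window is full, to avoid noisy early alerts."""
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.rate() > self.baseline + self.tolerance
```

Running one monitor per evaluation slice helps separate "traffic got harder everywhere" from "one slice degraded", which points to which examples belong in the next distillation dataset.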

Engineering reality

Small models create operational work. They reduce per-request cost only if the monitoring, fallback, and refresh loop stay cheaper than sending everything to the larger model.

Know when not to deploy it

Do not deploy a compressed model just because it is technically impressive. Skip it if the quality budget is unclear, eval coverage is weak, fallback rate destroys the savings, or the task is so broad that the small model will constantly hit its limits.

Also skip it when the product is early and traffic is low. Compression work has fixed cost: data, training, evals, serving, monitoring, and incident response. It is often smarter to pay the API bill until usage proves the bottleneck is real.

The best compressed models feel boring in production. They handle a known slice cheaply, escalate when uncertain, and give you clear numbers showing why they exist.

Checkpoint

You're ready to use this course if you can answer these from memory:

  • What serving roles can a small model play?
  • Why is fallback rate part of the cost model?
  • What has to be versioned besides the model weights?
  • When should you avoid deploying a compressed model?
