Evaluate compressed LLMs

A smaller model is not a win by default. It has to keep the behavior that matters while actually improving cost, latency, or deployment reach.

The one idea

Evaluate the trade, not the model in isolation. Compression is successful only when quality loss stays inside an acceptable budget and the serving gains are real.

Start with a baseline

Before testing a compressed model, freeze the baseline. Record the teacher or original model, prompt template, sampling settings, runtime, hardware, and eval scores. Without that record, you cannot tell whether a score change came from compression or from a different prompt, sampler, or runtime.

Use the same inputs for each candidate. Run the baseline, the distilled model, the quantized model, and any combined version against the same eval set. If sampling is involved, run enough repeats to see variance. A small student that is fast but unstable may create more downstream retries than it saves.
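One way to make "enough repeats to see variance" concrete is to run the same eval under several seeds and report the spread, not just the mean. The sketch below assumes a hypothetical `run_eval(seed)` callable that returns one score per run; `fake_eval` is only a stand-in that adds sampling noise and is not a real model.

```python
import random
import statistics

def score_runs(run_eval, seeds):
    """Run the eval once per seed and summarize the spread of scores."""
    scores = [run_eval(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        # stdev needs at least two samples
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def fake_eval(seed):
    # Hypothetical stand-in for a real eval: a score of 0.80 plus sampling noise.
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.03, 0.03)

summary = score_runs(fake_eval, seeds=range(5))
print(summary)
```

If the spread between `min` and `max` is larger than the gap between two candidates, the comparison needs more repeats before it means anything.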

The baseline should include product metrics too: first-token latency, tokens per second, peak memory, cost per request, throughput under load, error rate, and parse failure rate.
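A frozen baseline can be as simple as one immutable record written to disk next to the eval results. The following is a minimal sketch; the field names and every value in it are illustrative placeholders, not measurements or a required schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class BaselineRecord:
    model_id: str
    prompt_template: str
    sampling: dict
    runtime: str
    hardware: str
    eval_scores: dict
    serving_metrics: dict

# All values below are hypothetical placeholders.
baseline = BaselineRecord(
    model_id="original-model-v1",
    prompt_template="support_reply_v3",
    sampling={"temperature": 0.0, "max_tokens": 512},
    runtime="example-runtime 1.0",
    hardware="1x example-gpu",
    eval_scores={"overall": 0.0, "rare_labels": 0.0},
    serving_metrics={
        "first_token_ms_p95": 0.0,
        "tokens_per_second": 0.0,
        "peak_memory_gb": 0.0,
        "cost_per_request": 0.0,
        "error_rate": 0.0,
        "parse_failure_rate": 0.0,
    },
)

# Sorted keys keep the serialized record stable, so diffs between runs stay readable.
baseline_json = json.dumps(asdict(baseline), sort_keys=True, indent=2)
print(baseline_json)
```

Each compressed candidate can then be stored in the same shape, which keeps the comparison field by field.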

Measure by slice

Average scores hide compression damage. Split the eval set into slices that matter for the product: easy cases, hard cases, long context, short context, rare labels, multilingual inputs, strict JSON, refusals, safety boundaries, and cases that need tool calls.

A compressed model might be only two points worse on average but twenty points worse on rare labels. If those rare labels trigger billing, medical triage, fraud review, or customer escalation, the model is not acceptable.
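The per-slice comparison can be sketched in a few lines. This example assumes each eval row has already been reduced to a `(slice_name, is_correct)` pair; the slice names and toy rows are made up for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(rows):
    """rows: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in rows:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}

def slice_deltas(baseline_rows, candidate_rows):
    """Candidate accuracy minus baseline accuracy, per slice present in both."""
    base = accuracy_by_slice(baseline_rows)
    cand = accuracy_by_slice(candidate_rows)
    return {name: cand[name] - base[name] for name in base if name in cand}

# Toy data: the candidate matches the baseline on easy cases but loses rare labels.
baseline_rows = ([("easy", True)] * 9 + [("easy", False)]
                 + [("rare_label", True)] * 8 + [("rare_label", False)] * 2)
candidate_rows = ([("easy", True)] * 9 + [("easy", False)]
                  + [("rare_label", True)] * 6 + [("rare_label", False)] * 4)

deltas = slice_deltas(baseline_rows, candidate_rows)
# Print the worst slice first so regressions are hard to miss.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name}: {delta:+.2f}")
```

Sorting by delta puts the most damaged slice at the top of the report, which is where the review should start.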

For generation tasks, combine automatic checks with human or rubric-based review. Exact match is useful for structured extraction. It is weak for summaries, support replies, and reasoning traces. The metric has to match the job.

Common mistake

Do not approve compression from a leaderboard score alone. A product eval should reflect the prompts, output shape, and failure cost of your application.

Watch confidence and calibration

Compression can change how confidently a model fails. A smaller model may produce shorter answers, skip caveats, or force a label on ambiguous inputs. If your product depends on confidence thresholds, abstention, or escalation, evaluate those behaviors directly.

For classifiers, compare confidence buckets against real accuracy. For extraction, track empty-field handling and "not found" behavior. For assistants, check whether the model admits uncertainty when context is missing. A cheaper model that confidently invents missing data is not cheaper once humans clean up the damage.
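For the classifier case, comparing confidence buckets against real accuracy can look like the sketch below. It assumes each prediction is available as a `(confidence, is_correct)` pair and summarizes the gap with expected calibration error, a common single-number summary; the bucket count and toy predictions are arbitrary choices for illustration.

```python
def calibration_table(predictions, n_buckets=5):
    """predictions: list of (confidence in [0, 1], is_correct) pairs.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, count, mean_confidence, accuracy).
    """
    buckets = [[] for _ in range(n_buckets)]
    for confidence, is_correct in predictions:
        # Clamp so that confidence == 1.0 lands in the last bucket.
        index = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[index].append((confidence, is_correct))
    rows = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_buckets, (i + 1) / n_buckets, len(items), mean_conf, accuracy))
    return rows

def expected_calibration_error(predictions, n_buckets=5):
    """Count-weighted average of |mean confidence - accuracy| over buckets."""
    total = len(predictions)
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, _, count, mean_conf, accuracy in calibration_table(predictions, n_buckets)
    )

# Toy predictions: the high-confidence bucket is right only half the time.
toy = [(0.95, True), (0.95, False), (0.95, False), (0.95, True),
       (0.55, True), (0.55, False)]
for row in calibration_table(toy):
    print(row)
print("ECE:", expected_calibration_error(toy))
```

Running this for the baseline and the compressed model on the same inputs shows whether compression made the model more confident without making it more accurate.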

Benchmark the serving path

Run latency tests in the runtime you will deploy. A notebook benchmark is useful for exploration, but production has batching, concurrency, network overhead, cold starts, KV cache growth, and prompt formatting.

Measure at the percentiles users feel: p50, p90, p95, and p99. Voice agents care about first-token latency. Batch extraction cares about throughput. Interactive coding tools care about both first-token latency and sustained tokens per second.
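A small sketch of the percentile report, using the nearest-rank definition (other definitions interpolate between samples and give slightly different numbers). The latency samples are invented to show how a single slow request hides in the median but dominates the tail.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(samples_ms):
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 95, 99)}

# Illustrative first-token latencies in milliseconds, with one slow outlier.
first_token_ms = [120, 130, 125, 140, 135, 150, 128, 132, 900, 138]
report = latency_report(first_token_ms)
print(report)
```

With only ten samples the upper percentiles collapse onto the single outlier, so a real benchmark needs far more requests, collected under production-like concurrency, before p95 and p99 are trustworthy.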

Also measure failure cost. If a small model needs a fallback to the large model on 30 percent of requests, those requests may pay for both models, and the blended cost can erase much of the expected saving. Routing and fallback belong in the evaluation, not just the final deployment.

Engineering reality

The correct comparison is usually a system comparison: large model only versus small model plus router plus fallback plus monitoring. That blended system is what users and bills will see.
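The blended cost of such a system can be estimated with simple arithmetic. This sketch assumes every request hits the small model first and that fallback requests pay for the large model on top of that; a router that sends requests straight to the large model would need a different formula. The per-request prices are hypothetical.

```python
def blended_cost(small_cost, large_cost, fallback_rate, router_cost=0.0):
    """Expected cost per request when fallbacks pay for both models."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return router_cost + small_cost + fallback_rate * large_cost

def break_even_fallback_rate(small_cost, large_cost, router_cost=0.0):
    """Fallback rate at which the blended system costs the same as large-only."""
    return (large_cost - small_cost - router_cost) / large_cost

# Hypothetical per-request prices.
small, large = 0.002, 0.010
cost_at_30 = blended_cost(small, large, fallback_rate=0.30)
break_even = break_even_fallback_rate(small, large)
print(f"blended: {cost_at_30:.4f} vs large-only: {large:.4f}")
print(f"break-even fallback rate: {break_even:.0%}")
```

Cost is only one axis: fallback requests also add the small model's latency to the large model's, so the same blending should be applied to the latency percentiles.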

Define the acceptance budget

Before looking at results, decide how much quality you can trade for speed or cost. For one task, a 2 percent quality drop may be fine if latency halves. For another, any increase in failures to refuse unsafe requests may be unacceptable.

Write the acceptance rule in product language: "JSON parse failures must stay under 0.5 percent," "rare-label recall cannot drop by more than 1 point," "p95 first-token latency must stay below 400 ms," or "fallback rate must stay under 10 percent."

This makes the decision less emotional. The compressed model either clears the gate or it does not.
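The example rules above can be written down as a gate that runs against every candidate. This is a sketch under assumed metric names (`json_parse_failure_rate` and so on are hypothetical keys), with rates expressed as fractions and the candidate numbers invented for illustration.

```python
import operator

# (metric name, comparison, limit), mirroring the example rules in the text.
GATES = [
    ("json_parse_failure_rate", operator.lt, 0.005),       # under 0.5 percent
    ("rare_label_recall_drop_points", operator.le, 1.0),   # at most 1 point
    ("p95_first_token_ms", operator.lt, 400.0),            # below 400 ms
    ("fallback_rate", operator.lt, 0.10),                  # under 10 percent
]

def check_gates(metrics, gates=GATES):
    """Return a list of failure messages; an empty list means the candidate passes."""
    failures = []
    for name, compare, limit in gates:
        if name not in metrics:
            # A missing metric is a failure, not a silent pass.
            failures.append(f"{name}: missing")
        elif not compare(metrics[name], limit):
            failures.append(f"{name}: {metrics[name]} vs limit {limit}")
    return failures

# Invented candidate: fast and cheap, but rare-label recall dropped 2.5 points.
candidate = {
    "json_parse_failure_rate": 0.003,
    "rare_label_recall_drop_points": 2.5,
    "p95_first_token_ms": 310.0,
    "fallback_rate": 0.08,
}
failures = check_gates(candidate)
print("PASS" if not failures else "FAIL: " + "; ".join(failures))
```

Treating a missing metric as a failure keeps a candidate from passing simply because a measurement was never run.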

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why do you need a frozen baseline before compression experiments?
Why are slice metrics more useful than one average score?
What serving metrics should be measured beyond quality?
What is an acceptance budget?
