Lesson 06

Evaluate and ship preference-tuned models

Preference tuning can make a model feel much better. It can also hide regressions under a nicer surface. Shipping requires evals that look for both the win and the damage.

The one idea

A preference-tuned model is ready only when it beats the baseline on target behavior and holds steady on regressions, safety, latency, cost, and boring production cases.

Measure win rate, not vibes

The core eval is a head-to-head comparison against the baseline. For each held-out prompt, generate an answer from both models, hide which model wrote which answer, and ask reviewers or a calibrated judge to pick the better one (or call a tie) using the same rubric that was used to label the training preferences.
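One way to hide authorship is to randomize which answer appears first and keep the mapping in a separate key. The sketch below is only an illustration: the function names, the dictionary fields, and the 'a' / 'b' / 'tie' choice labels are assumptions, not part of any particular eval framework.

```python
import random


def make_blind_pair(prompt, baseline_answer, tuned_answer, rng):
    """Build a reviewer-facing item plus a private key for unblinding.

    The item shown to the reviewer carries no model names; the key is
    stored separately and used only after the judgment is recorded.
    """
    tuned_first = rng.random() < 0.5
    if tuned_first:
        first, second = tuned_answer, baseline_answer
    else:
        first, second = baseline_answer, tuned_answer
    shown = {"prompt": prompt, "answer_a": first, "answer_b": second}
    key = {
        "a": "tuned" if tuned_first else "baseline",
        "b": "baseline" if tuned_first else "tuned",
    }
    return shown, key


def unblind(choice, key):
    """Map a reviewer's choice ('a', 'b', or 'tie') back to a model name."""
    if choice == "tie":
        return "tie"
    return key[choice]


# Hypothetical usage with a seeded generator so the shuffle is reproducible.
rng = random.Random(0)
shown, key = make_blind_pair("How do I reset my password?", "B-text", "T-text", rng)
winner = unblind("a", key)
```

Randomizing position per prompt also guards against a reviewer or judge that tends to favor whichever answer is shown first.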

Report win, loss, and tie rates by category. A single average hides the interesting part. The tuned model might win on style and lose on factuality. It might improve easy support answers and fail policy edge cases. Segment the results so you can act on them.
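A minimal sketch of segmented reporting, assuming each judgment has already been unblinded into a (category, outcome) pair where the outcome is recorded from the tuned model's point of view. The category names and counts below are made up for illustration.

```python
from collections import Counter, defaultdict

OUTCOMES = ("win", "loss", "tie")


def win_rates_by_category(judgments):
    """Return per-category win/loss/tie rates and sample counts.

    judgments: iterable of (category, outcome) pairs, where outcome is
    'win', 'loss', or 'tie' for the tuned model against the baseline.
    """
    counts = defaultdict(Counter)
    for category, outcome in judgments:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts[category][outcome] += 1

    report = {}
    for category, tally in counts.items():
        total = sum(tally.values())
        row = {outcome: tally[outcome] / total for outcome in OUTCOMES}
        row["n"] = total  # keep the sample size next to the rates
        report[category] = row
    return report


# Hypothetical judgments: the tuned model wins on style, loses on factuality.
judgments = (
    [("style", "win")] * 3 + [("style", "loss")]
    + [("factuality", "win")] + [("factuality", "loss")] * 2 + [("factuality", "tie")]
)
report = win_rates_by_category(judgments)
for category, row in sorted(report.items()):
    print(category, row)
```

Keeping `n` in each row matters because a category with only a handful of prompts can show a dramatic rate that means very little.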

Keep regression suites separate

Preference win rate is not enough. You also need hard checks for format validity, refusal boundaries, tool-use contracts, citation behavior, latency, token count, and known failure cases. These should run automatically on every candidate model.

Some regressions are not preference questions. If JSON must parse, it must parse. If a tool argument must match a schema, it must match. If a model must refuse a disallowed request, a warmer answer is not a win.
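Checks like these are pass/fail, so they can be plain functions rather than judged comparisons. The sketch below uses only the standard library; the tool name, argument names, and the simple "required keys with expected types" schema are hypothetical stand-ins for whatever contract your system actually enforces.

```python
import json


def check_json_parses(output):
    """True only if the model output string is valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def check_tool_args(args, required):
    """True only if args has exactly the required keys with the expected types.

    required: dict mapping argument name to a Python type. This is a
    deliberately simple stand-in for a real schema validator.
    """
    if set(args) != set(required):
        return False
    return all(isinstance(args[name], typ) for name, typ in required.items())


def failed_checks(results):
    """results: list of (check_name, passed). Any failure should block the candidate."""
    return [name for name, passed in results if not passed]


# Hypothetical model output for a tool call.
output = '{"tool": "lookup_order", "args": {"order_id": "A123"}}'
results = [("json_parses", check_json_parses(output))]
if results[0][1]:
    call = json.loads(output)
    results.append(("tool_args", check_tool_args(call["args"], {"order_id": str})))
print(failed_checks(results))
```

The design point is that these results never get averaged into a win rate: a non-empty failure list stops the candidate regardless of how well it scored on preference.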

Check for reward hacking

Preference tuning can teach shortcuts. Watch for answers that get longer without getting better, refusals that appear in harmless cases, overconfident claims, fake citations, repeated boilerplate, or excessive self-justification.

Compare length, citation validity, abstention rate, tool-call rate, and human escalation rate before and after tuning. If those metrics move a lot, read samples. Aggregate numbers tell you where to look, but samples tell you what happened.
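A before/after comparison can be as simple as flagging any metric whose relative change exceeds a tolerance. In this sketch the 10% tolerance, the metric names, and the numbers are all assumptions chosen for illustration; the right threshold depends on how noisy each metric is in your eval set.

```python
def flag_shifts(before, after, tolerance=0.10):
    """Return {metric: relative_change} for metrics that moved more than tolerance.

    before / after: dicts of metric name to value, measured on the same
    eval set for the baseline and the tuned model respectively.
    """
    flagged = {}
    for name, old in before.items():
        new = after[name]
        if old == 0:
            # Relative change is undefined at zero; treat any movement as a shift.
            change = 0.0 if new == 0 else float("inf")
        else:
            change = (new - old) / old
        if abs(change) > tolerance:
            flagged[name] = change
    return flagged


# Hypothetical measurements: answers got about 30% longer, other metrics held.
before = {"mean_tokens": 200, "abstention_rate": 0.05, "citation_validity": 0.90}
after = {"mean_tokens": 260, "abstention_rate": 0.05, "citation_validity": 0.88}
for metric, change in flag_shifts(before, after).items():
    print(f"{metric}: {change:+.0%} -> read samples")
```

A flag here is a prompt to open the transcripts, not a verdict. Longer answers may be a real improvement or padding, and only the samples can tell you which.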

Shipping risk

A tuned model can win blind preference tests because it sounds nicer while becoming less grounded. Always pair preference evals with factuality and grounding checks.

Roll out like any risky model change

Version the base model, adapter or tuned weights, preference dataset, rubric, training config, and eval set. If the model serves production traffic, deploy behind a routing flag. Start with shadow traffic or a small percentage of low-risk requests, then expand only if metrics hold.
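One way to keep those versions together is a single manifest record per candidate, with a short fingerprint that can be attached to eval reports and logs. The field values below are placeholders, and the class name and fingerprint scheme are assumptions rather than a standard format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Everything needed to reproduce or audit one tuned candidate."""
    base_model: str
    tuned_weights: str       # adapter or full tuned weights
    preference_dataset: str
    rubric: str
    training_config: str
    eval_set: str

    def fingerprint(self):
        """Short, stable hash of all fields; changes if any version changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Placeholder identifiers; real ones would be registry tags or content hashes.
manifest = ReleaseManifest(
    base_model="base-model@v1",
    tuned_weights="adapter@v7",
    preference_dataset="prefs@2024-05",
    rubric="rubric@v3",
    training_config="config@v7",
    eval_set="heldout@v2",
)
print(manifest.fingerprint())
```

Because the dataclass is frozen, a manifest cannot be edited after creation; a new rubric or dataset produces a new manifest and a new fingerprint.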

Keep rollback boring. You should be able to route traffic back to the previous model without retraining anything. If you merged adapters into base weights, store the unmerged artifact too so debugging remains possible.
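A percentage-based routing flag is one way to get both gradual rollout and a boring rollback. This is a sketch under stated assumptions: the function name, the `low_risk` signal, and the 100-bucket hashing scheme are illustrative, not a description of any specific serving stack.

```python
import hashlib


def route(request_id, rollout_percent, low_risk):
    """Return 'candidate' or 'baseline' for a request.

    Routing is deterministic per request_id, so the same request always
    lands on the same model at a given rollout percentage. Setting
    rollout_percent to 0 sends all traffic back to the baseline.
    """
    if not low_risk or rollout_percent <= 0:
        return "baseline"
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "candidate" if bucket < rollout_percent else "baseline"


# Hypothetical usage: 5% of low-risk traffic goes to the tuned model.
for request_id in ("req-001", "req-002", "req-003"):
    print(request_id, route(request_id, rollout_percent=5, low_risk=True))
```

Rollback under this scheme is a config change, not a deployment: the previous model keeps serving throughout, and dropping the percentage to zero needs no retraining and no new artifact.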

Engineering reality

Open-model preference tuning is worth owning when the behavior is central to the product, traffic volume makes prompt-only fixes expensive, and you have the eval discipline to catch regressions. Otherwise, a better prompt, RAG change, or hosted model may be the cheaper answer.

The decision boundary

Run preference tuning when you have stable preference data, a model that already has the needed capability, and a clear reason the behavior should live in the model rather than in prompt context or system code. Do not run it because "RL" sounds more advanced.

The best teams treat preference tuning as one part of the product loop: collect traces, label real tradeoffs, tune a candidate, evaluate against the baseline, ship carefully, and feed production failures back into the next dataset.

Checkpoint

You're ready to leave this course if you can answer these from memory:

Why should win rate be segmented by task category?
What belongs in a regression suite besides preference judgments?
Name three signs of reward hacking in generated answers.
When is open-model preference tuning worth owning?

Quick check

Ship it because users prefer it
Block or fix the rollout until the regression is solved
Remove the JSON check from evals

The behavior is central, repeated, stable, and measurable
The model needs frequently changing facts
The current prompt has not been cleaned up yet