Lesson 06

Deploy fine-tuned models

Training creates an artifact. Shipping creates an obligation. A fine-tuned model needs serving choices, versioning, monitoring, rollback, and a refresh plan just like any other production dependency.

The one idea

Deployment is part of fine-tuning. The model, adapter, dataset, prompt, evals, and routing rules must be versioned together or you cannot explain production behavior.

Adapter or merged model

If you trained an adapter, you have two common serving options. You can load the base model and adapter separately, which makes it easy to swap adapters for different tasks or customers. Or you can merge the adapter into the base weights and serve one combined model, which can simplify inference infrastructure.

Separate adapters give flexibility. Merged weights give simplicity. The right choice depends on how many tuned variants you serve, how your inference engine handles adapters, and whether you need hot-swapping without redeploying the base model.
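To make the two options concrete, here is a minimal sketch that assumes a LoRA-style adapter, where the tuned weight is the base weight plus a scaled low-rank product. The lesson does not name an adapter type, so that assumption, the helper names, and the toy numbers are all illustrative. It shows that serving the adapter separately and serving merged weights give the same output for the same input.

```python
# Sketch only: assumes a LoRA-style adapter, where the tuned weight is
# W + SCALE * (B @ A). All shapes and values are toy numbers.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


def apply_weights(w, x):
    """Apply weight matrix w to input vector x."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in w]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # base weight, 2 x 3
B = [[0.5], [-1.0]]                      # adapter factor, 2 x 1 (rank 1)
A = [[1.0, 2.0, 0.0]]                    # adapter factor, 1 x 3
SCALE = 2.0                              # stands in for alpha / rank
x = [1.0, 1.0, 1.0]

# Option 1: keep the adapter separate and add its contribution at inference.
base_out = apply_weights(W, x)
adapter_out = apply_weights(matmul(B, A), x)
separate = [b + SCALE * a for b, a in zip(base_out, adapter_out)]

# Option 2: fold the adapter into the base weights once, then serve one matrix.
W_merged = add_scaled(W, matmul(B, A), SCALE)
merged = apply_weights(W_merged, x)

print(separate, merged)
```

In option 1 the base weights stay untouched, so another adapter can be swapped in without reloading them. In option 2 the adapter no longer exists as a separate object at inference time, which is the simplicity the merged route buys.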

Version the whole bundle

A deployed fine-tune is not just a file of weights. It is a bundle:

Base model name and revision
Adapter or merged model artifact
Dataset version and filtering code
Training configuration and random seed where available
Prompt and system instructions used at inference
Eval suite version and scores
Routing rules that decide when this model is used

If one of those changes, you have a new production candidate. Treat it that way. Otherwise you will not know whether a regression came from data, training, prompt changes, or routing.
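One way to enforce "any change is a new candidate" is to derive the bundle ID from every field in the bundle. The sketch below is an assumption about how you might do that, not a standard format: the class name, field names, and example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FineTuneBundle:
    """Hypothetical manifest covering the bundle items listed above."""
    base_model: str
    base_revision: str
    artifact_sha256: str          # adapter or merged model artifact
    dataset_version: str          # includes the filtering code version
    training_config_sha256: str
    seed: Optional[int]           # None when the seed was not recorded
    prompt_version: str
    eval_suite_version: str
    eval_report_sha256: str       # hash of the stored eval scores
    routing_rules_version: str

    def bundle_id(self) -> str:
        """Derive a short ID from every field, so any change yields a new ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Placeholder values for illustration only.
current = FineTuneBundle(
    base_model="example-org/base-model",
    base_revision="rev-a",
    artifact_sha256="artifact-hash-1",
    dataset_version="dataset-v3",
    training_config_sha256="config-hash-1",
    seed=42,
    prompt_version="prompt-v7",
    eval_suite_version="evals-v2",
    eval_report_sha256="report-hash-1",
    routing_rules_version="routing-v1",
)

# Changing only the prompt still produces a different bundle ID.
candidate = replace(current, prompt_version="prompt-v8")
print(current.bundle_id(), candidate.bundle_id())
```

Logging the bundle ID with every production response gives you one key to look up when you need to trace a regression back to data, training, prompt, or routing.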

Roll out gradually

Do not flip all traffic at once unless the model is low-risk and easy to revert. Start with shadow evaluation, then a small percentage of traffic, then expand if metrics and human review hold. Keep the previous model available for quick rollback.
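A percentage rollout can be sketched as a deterministic split on a stable key. This is a minimal illustration under assumptions: the function names and model labels are hypothetical, the key is assumed to be something stable like a user ID, and the shadow stage (where the candidate runs but its output is never returned) is not shown.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_model(request_key: str, candidate_percent: int,
                 candidate: str = "candidate", previous: str = "previous") -> str:
    """Send roughly candidate_percent of keys to the candidate model."""
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    return candidate if bucket(request_key) < candidate_percent else previous


# Expanding the rollout: the same key is evaluated at each stage.
for percent in (0, 5, 25, 100):
    print(percent, choose_model("user-1234", percent))
```

Because the bucket depends only on the key, a user who sees the candidate at 5% keeps seeing it at 25%. Setting the percentage back to 0 is the rollback, provided the previous model is still deployed.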

Monitor the task metric, format validity, refusal behavior, latency, cost, and user-visible escalations. For generated text, sample outputs regularly for human review. Aggregate metrics hide qualitative failures, and a fine-tune can fail on a narrow slice of inputs while the averages still look healthy.
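Two of these checks are easy to sketch: a format-validity rate and a reproducible sample for human review. The example assumes the model is supposed to return a JSON object with certain keys, which is an assumption for illustration. Your format check will differ, and the function names are hypothetical.

```python
import json
import random


def is_valid_format(output: str, required_keys: set) -> bool:
    """Assumed format rule: a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def format_validity_rate(outputs: list, required_keys: set) -> float:
    """Fraction of outputs that pass the format check (0.0 for no outputs)."""
    if not outputs:
        return 0.0
    return sum(is_valid_format(o, required_keys) for o in outputs) / len(outputs)


def sample_for_review(outputs: list, k: int, seed: int) -> list:
    """Pick up to k outputs for human review; the seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


outputs = [
    '{"label": "refund", "confidence": 0.9}',
    "not json",
    '{"label": "billing"}',
    "[1, 2]",
]
print(format_validity_rate(outputs, {"label"}))
print(sample_for_review(outputs, k=2, seed=0))
```

The rate goes on a dashboard. The sample goes to a person, because a response can be perfectly valid JSON and still be wrong.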

Cost and latency change

A fine-tuned model may let you use a smaller base model or shorter prompt, which can reduce cost and latency. It can also increase operational cost if you now self-host a model, keep multiple adapters loaded, or run extra routing logic. Measure cost and latency across the whole request path, not just token counts.

For adapter serving, watch memory. One base with many adapters can be efficient, but active adapter switching, batching, and cache behavior depend on the serving stack. A design that looks cheap on paper can become slow if every request uses a different adapter and prevents batching.
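The batching problem is easy to see in a toy model. The sketch below assumes a serving stack that can only batch requests that share an adapter. That is an assumption, since some stacks can mix adapters within a batch, so check how yours behaves. The function name and adapter IDs are hypothetical.

```python
from collections import defaultdict


def plan_batches(adapter_ids: list, max_batch_size: int) -> list:
    """Group request indices by adapter, then split each group into batches.

    Assumes requests can only be batched together when they use the same adapter.
    """
    by_adapter = defaultdict(list)
    for index, adapter_id in enumerate(adapter_ids):
        by_adapter[adapter_id].append(index)

    batches = []
    for adapter_id, indices in by_adapter.items():
        for start in range(0, len(indices), max_batch_size):
            batches.append((adapter_id, indices[start:start + max_batch_size]))
    return batches


# Eight requests that all use one adapter fill two batches of four.
same_adapter = plan_batches(["support"] * 8, max_batch_size=4)

# Eight requests with eight different adapters need eight batches of one.
all_different = plan_batches([f"customer-{i}" for i in range(8)], max_batch_size=4)

print(len(same_adapter), len(all_different))
```

Same request count, four times as many forward passes. That is the gap between the design on paper and the design under real traffic.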

Refreshes are normal

Fine-tuned models age. Product policy changes, user behavior shifts, new failure modes appear, and the base model ecosystem improves. A refresh is not a failure. It is maintenance.

The healthiest loop is continuous but gated: collect failures, label or rewrite examples, update the dataset version, rerun training, compare against the current production model, and ship only if the eval gate passes. Do not stream raw production mistakes into training without review. That turns incidents into future behavior.
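The "ship only if the eval gate passes" step can be a small, explicit function. This sketch assumes every metric is a score where higher is better and uses hypothetical metric names and thresholds. The actual gate rules are a product decision, not something this code settles.

```python
def eval_gate(candidate: dict, production: dict, primary_metric: str,
              min_gain: float = 0.0, max_regression: float = 0.01):
    """Compare candidate scores against production scores.

    Passes only if the primary metric gains at least min_gain and no other
    metric drops by more than max_regression. Assumes higher is better.
    Returns (passed, list_of_failure_reasons).
    """
    if primary_metric not in production:
        raise ValueError(f"production scores lack {primary_metric!r}")

    failures = []
    for name, prod_score in production.items():
        if name not in candidate:
            failures.append(f"{name}: missing from candidate scores")
            continue
        delta = candidate[name] - prod_score
        if name == primary_metric:
            if delta < min_gain:
                failures.append(f"{name}: gain {delta:+.3f} is below {min_gain:+.3f}")
        elif delta < -max_regression:
            failures.append(f"{name}: regressed by {-delta:.3f}")
    return (not failures), failures


# Hypothetical scores for illustration only.
production = {"task_accuracy": 0.80, "format_validity": 0.99}
better = {"task_accuracy": 0.84, "format_validity": 0.99}
lopsided = {"task_accuracy": 0.86, "format_validity": 0.90}

print(eval_gate(better, production, "task_accuracy"))
print(eval_gate(lopsided, production, "task_accuracy"))
```

The second candidate scores higher on the task metric and still fails, because it breaks format validity. A gate that only checks the headline number would have shipped it.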

Engineering reality

The best deployment plan is boring: explicit model IDs, immutable datasets, repeatable evals, slow rollout, quick rollback. Fine-tuning feels like model work, but keeping it healthy is ordinary production engineering.

Checkpoint

You have this lesson if you can answer these from memory:

What is the tradeoff between serving adapters separately and merging them?
What belongs in a production fine-tune bundle?
Why should rollout be gradual?
Why are dataset refreshes normal?

Quick check

Which change creates a new production candidate?
Only changing the model weights
Changing the model, dataset, prompt, eval suite, or routing rule
Changing a comment in unrelated code

Why should raw production failures be reviewed before they go into training data?
Failures may contain bad, private, or misleading signal that should be curated first
Because automation is always impossible
Because failures never contain useful examples