Lesson 04

Run a fine-tuning job

Once the dataset and method are chosen, training becomes an experiment loop. You are not just trying to lower loss. You are trying to produce a model that wins your task evals without breaking the rest of the product.

The one idea

A fine-tuning run is only successful if a checkpoint improves held-out task behavior. Training loss is a useful instrument, not the goal.

Start with the base model

Fine-tuning starts from a base or instruction-tuned model. Pick the smallest model that can already do the task reasonably well with a strong prompt. If the base model cannot follow the instruction at all, fine-tuning may not rescue it. If a smaller model can do the job after tuning, it may be cheaper and faster to serve than a larger general-purpose model.

The base model is part of the experiment record. A dataset that produces good results on one base model may not produce the same behavior on another. If you switch the base model, rerun the evals and treat the result as a new model line.

The knobs that matter

The main hyperparameters are familiar from training: learning rate, batch size, number of epochs, sequence length, and the adapter settings if you use LoRA. You do not need to memorize every option. You need to understand what each failure looks like.

Learning rate too high: the model changes too aggressively. Loss may spike, outputs become unstable, and the tune can erase useful behavior.
Learning rate too low: the run barely moves. Loss falls slowly and task behavior does not change enough.
Too many epochs: the model memorizes the training set and gets worse on held-out data.
Sequence length too short: examples get truncated and the model learns from broken conversations.
Batch size too small: updates are noisy. This can work, but the run may need a lower learning rate or gradient accumulation.
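
Two of the failure modes above, truncation and noisy small-batch updates, can be checked before a run starts. The following is a minimal sketch, not any library's API: RunConfig, its field names, and its default values are hypothetical, and the token counts are assumed to come from whatever tokenizer your base model uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical field names and values, chosen for illustration only.
    learning_rate: float = 1e-4
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 8
    num_devices: int = 1
    num_epochs: int = 3
    max_seq_length: int = 2048


def effective_batch_size(cfg: RunConfig) -> int:
    """Number of examples that contribute to each optimizer update."""
    return (
        cfg.per_device_batch_size
        * cfg.gradient_accumulation_steps
        * cfg.num_devices
    )


def truncated_fraction(token_counts: list[int], max_seq_length: int) -> float:
    """Share of examples that would be cut off at max_seq_length."""
    if not token_counts:
        return 0.0
    too_long = sum(1 for n in token_counts if n > max_seq_length)
    return too_long / len(token_counts)


cfg = RunConfig()
token_counts = [512, 1800, 2300, 4096]  # made-up example lengths, in tokens

print(effective_batch_size(cfg))                             # 32
print(truncated_fraction(token_counts, cfg.max_seq_length))  # 0.5
```

If the truncated fraction is not close to zero, raise the sequence length or shorten the examples before training. Gradient accumulation raises the effective batch size without needing more memory per device.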

Watch validation, not just training

Training loss should usually go down. That only proves the model is fitting examples it sees. Validation loss measures the same objective on held-out examples. If training loss keeps falling while validation loss rises, the model is overfitting. It is getting better at the training set and worse at generalizing.
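
The divergence described above can be spotted mechanically from logged loss values. This sketch assumes you have one training-loss and one validation-loss value per evaluation step. The function name, the patience parameter, and the loss values are illustrative assumptions, and the rule is a simple heuristic rather than a standard definition of overfitting.

```python
from typing import Optional


def overfit_point(
    train_loss: list[float], val_loss: list[float], patience: int = 2
) -> Optional[int]:
    """Return the index of the best validation loss if the curves then diverge.

    Divergence here means: at least `patience` evaluations followed the
    validation minimum without beating it, and training loss ended lower
    than it was at that minimum. Returns None otherwise.
    """
    if len(train_loss) != len(val_loss):
        raise ValueError("loss curves must have the same length")
    if not val_loss:
        return None

    # Index of the first (earliest) minimum of the validation curve.
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    evals_since_best = len(val_loss) - 1 - best

    if evals_since_best >= patience and train_loss[-1] < train_loss[best]:
        return best
    return None


# Made-up curves: validation bottoms out at index 2, training keeps falling.
train = [2.0, 1.5, 1.1, 0.8, 0.6]
val = [2.1, 1.7, 1.6, 1.8, 2.0]
print(overfit_point(train, val))  # 2

# Validation is still improving at the last evaluation: nothing to flag.
print(overfit_point([2.0, 1.5, 1.1], [2.1, 1.7, 1.5]))  # None
```

The returned index is a natural place to start looking for a checkpoint candidate, since later steps fit the training set more closely without improving held-out loss.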

Still, validation loss is not enough. A model can have a slightly worse loss but better product behavior if it produces valid JSON, follows policy, or chooses safer labels. That is why every serious run needs task evals alongside loss curves.
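
Format validity is one of the easiest task evals to automate. As a rough sketch, assuming the product expects a JSON object with certain keys (the key name "label" and the sample outputs below are invented for illustration), a validity rate over a batch of model outputs could be computed like this:

```python
import json


def json_validity_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that parse as a JSON object with all required keys."""
    if not outputs:
        return 0.0

    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not parseable as JSON at all
        if isinstance(parsed, dict) and required_keys.issubset(parsed.keys()):
            valid += 1
    return valid / len(outputs)


# Invented sample outputs: one valid, one cut off, one wrapped in chatty
# text, and one that parses but is missing the required key.
outputs = [
    '{"label": "refund", "confidence": 0.9}',
    '{"label": "refund"',
    'Sure! Here is the JSON: {"label": "refund"}',
    '{"confidence": 0.4}',
]
print(json_validity_rate(outputs, {"label"}))  # 0.25
```

Tracking this number per checkpoint, next to the loss curves, shows whether a checkpoint with slightly worse loss is nonetheless more usable in the product.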

Two dashboards

Loss tells you whether training is numerically healthy. Task evals tell you whether the model is useful. You need both.

Use checkpoints as candidates

A fine-tuning job can save checkpoints along the way. The last checkpoint is not automatically the best. Earlier checkpoints often generalize better because they learned the pattern before memorizing quirks.

Evaluate multiple checkpoints on the same held-out set. Track exact model ID, adapter ID, dataset version, hyperparameters, and eval score. You want to be able to say "checkpoint 800 won because it improved extraction accuracy by 6 points and did not regress refusal tests," not "the last run looked fine."
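
One way to keep that record is a small structure per checkpoint, plus a selection rule that refuses any candidate that regresses on refusal tests. This is a sketch under assumptions: the field names, model and adapter IDs, and scores are all hypothetical, and the selection rule is one reasonable choice among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointResult:
    # Everything needed to reproduce and compare the candidate.
    base_model_id: str
    adapter_id: str
    dataset_version: str
    step: int
    learning_rate: float
    task_accuracy: float       # score on the fixed held-out eval set
    refusal_pass_rate: float   # share of refusal tests passed


def pick_checkpoint(
    candidates: list[CheckpointResult], baseline_refusal_pass_rate: float
) -> Optional[CheckpointResult]:
    """Best task accuracy among candidates that do not regress on refusals."""
    eligible = [
        c for c in candidates
        if c.refusal_pass_rate >= baseline_refusal_pass_rate
    ]
    return max(eligible, key=lambda c: c.task_accuracy, default=None)


# Hypothetical results for three checkpoints from one run.
common = dict(
    base_model_id="example-base-model",
    dataset_version="v3",
    learning_rate=1e-4,
)
candidates = [
    CheckpointResult(adapter_id="run7-step400", step=400,
                     task_accuracy=0.81, refusal_pass_rate=0.99, **common),
    CheckpointResult(adapter_id="run7-step800", step=800,
                     task_accuracy=0.87, refusal_pass_rate=0.98, **common),
    CheckpointResult(adapter_id="run7-step1200", step=1200,
                     task_accuracy=0.88, refusal_pass_rate=0.93, **common),
]

best = pick_checkpoint(candidates, baseline_refusal_pass_rate=0.98)
print(best.step)  # 800
```

In this made-up run the final checkpoint has the highest accuracy but fails the refusal check, so the step-800 checkpoint is selected instead.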

A minimal eval gate

A practical first gate can be small and strict:

Task success must improve over the base model with the best prompt.
Output format validity must not regress.
Safety, refusal, or policy cases must not regress.
Latency and cost must stay within the product budget.
Human review of sampled failures must show understandable mistakes, not new, unexplained behavior.

If a tuned model cannot beat the base model plus prompt on a fixed eval, do not ship it. Improve the dataset, the prompt, or the base model choice.
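
The automatable parts of the gate above can be written as a single function that returns the reasons a candidate fails. This is a minimal sketch: EvalReport, its fields, the budget parameters, and all numbers are assumptions made for illustration. Human review of sampled failures is not covered here and still has to happen separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    # Hypothetical metrics, all measured on the same fixed eval set.
    task_success: float
    format_validity: float
    safety_pass_rate: float
    p95_latency_ms: float
    cost_per_1k_requests: float


def gate_failures(
    candidate: EvalReport,
    baseline: EvalReport,  # base model with the best prompt
    max_latency_ms: float,
    max_cost_per_1k: float,
) -> list[str]:
    """Return the reasons the candidate fails the gate (empty list = pass)."""
    failures = []
    if candidate.task_success <= baseline.task_success:
        failures.append("task success did not improve over the baseline")
    if candidate.format_validity < baseline.format_validity:
        failures.append("output format validity regressed")
    if candidate.safety_pass_rate < baseline.safety_pass_rate:
        failures.append("safety, refusal, or policy cases regressed")
    if candidate.p95_latency_ms > max_latency_ms:
        failures.append("latency exceeds the product budget")
    if candidate.cost_per_1k_requests > max_cost_per_1k:
        failures.append("cost exceeds the product budget")
    return failures


# Made-up numbers: better task success, but a safety regression.
baseline = EvalReport(0.78, 0.99, 0.97, 420.0, 1.10)
candidate = EvalReport(0.84, 0.99, 0.95, 390.0, 0.60)

failures = gate_failures(candidate, baseline,
                         max_latency_ms=500.0, max_cost_per_1k=1.00)
print(failures)  # ['safety, refusal, or policy cases regressed']
```

Returning reasons rather than a bare pass or fail makes the decision easy to record next to the checkpoint, and points at what to fix in the dataset or prompt.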

Engineering reality

Fine-tuning without an eval gate creates a model you cannot reason about. You will end up arguing from cherry-picked examples, which is how regressions reach production.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why should the base model already be capable of the task?
What does rising validation loss while training loss falls usually mean?
Why might an earlier checkpoint beat the final checkpoint?
What should a minimal eval gate include?

Quick check

Training loss keeps falling while validation loss rises. What does that usually mean?

Overfitting
The learning rate is definitely too low
The prompt needs more examples

Which checkpoint should you ship?

Always the last one
The checkpoint that performs best on held-out task and safety evals
The checkpoint with the lowest training loss, no other checks needed