Lesson 01

When to fine-tune LLMs

Fine-tuning is powerful, but it is also one of the most over-prescribed fixes in AI engineering. Before you spend GPU time, learn the exact kind of problem fine-tuning solves.

The one idea

Fine-tuning is for changing a model's learned behavior: format, tone, domain habits, task reflexes, and decision boundaries. It is usually the wrong tool for adding fresh facts, private documents, or one-off instructions.

Fine-tuning changes the weights

A prompt is temporary context. A retrieval system brings in external information. A tool call gives the model a way to act. Fine-tuning is different: it runs more training steps and changes the model's weights, or adds adapter weights that sit beside them. That means the model starts carrying a new habit even when the exact example is not in the prompt.

This is why fine-tuning can feel like teaching. If every answer from your support assistant must follow a very specific escalation policy, or every extraction must emit one strict JSON shape, you can show the model many examples until that pattern becomes its default. The prompt can get shorter because the behavior is now partly in the model.

But the same permanence is the danger. If you train on bad examples, the model absorbs bad habits. If the facts change next week, the tuned model does not magically know that. You have created a new model artifact that must be evaluated, versioned, and refreshed.

The decision tree

Most fine-tuning decisions get easier if you ask what kind of missing capability you are fixing.

Missing instructions? Improve the prompt first. If the behavior works with a clear prompt, fine-tuning is probably premature.
Missing facts? Use RAG, a database, or a tool. Fine-tuning is a poor knowledge store because updates require another training run.
Missing actions? Give the model tools. Training it to pretend it checked inventory is worse than letting it call the inventory service.
Repeated behavior that the model keeps drifting from? Fine-tuning may fit. This is the zone: style, structure, classification boundaries, domain language, and task-specific judgment.

Fine-tuning is not the universal next step after prompting. It is one option in the stack, and it should be chosen for the right failure mode.

Good fine-tuning targets

The best fine-tuning targets are stable patterns that appear over and over. A model that must classify legal clauses into a company-specific taxonomy, rewrite raw notes into a house style, produce strict function-call arguments, or answer customer tickets with a fixed tone can benefit because the desired behavior is repeated across many examples.

Another good target is domain adaptation. A general model may know English, but not your internal language: abbreviations, document shapes, ticket categories, compliance phrases, or the difference between two concepts that look similar outside your company. Fine-tuning can shift the model toward those distinctions if your data demonstrates them cleanly.

Fine-tuning also helps with latency and cost when a huge prompt is only there to remind the model how to behave. If a 3,000-token policy prompt becomes a 300-token prompt because the model has learned the policy shape, every request gets cheaper. That saving matters at scale.

Rule of thumb

If you can describe the target as "always respond like this kind of expert, in this format, with this taste," fine-tuning may be a good fit. If you describe it as "know these documents," use retrieval first.

Bad fine-tuning targets

Do not fine-tune because your prompt is vague. A tuned model trained on vague examples becomes vague permanently. Tighten the prompt, define the expected output, and build a small eval set before training anything.

Do not fine-tune to store rapidly changing facts. Product prices, policies, account balances, medical references, and live inventory belong in systems that can be queried and updated. A fine-tuned model may repeat old facts confidently, which is worse than admitting it needs a lookup.

Do not fine-tune to bypass capability limits you cannot demonstrate in data. If the base model cannot do the reasoning with high-quality examples in context, a small fine-tune probably will not create the missing reasoning ability. Fine-tuning usually steers and specializes a capability that already exists.

Common mistake

"The model answered wrong, let's fine-tune it" is not a diagnosis. First ask whether the wrong answer came from missing context, missing tools, ambiguous instructions, weak base model capability, or unstable desired behavior.

The minimum proof before training

Before fine-tuning, prove three things. First, the target behavior is valuable enough to own as a model artifact. Second, you can create or collect enough high-quality examples. Third, you can measure whether the tuned model is better without making other important behavior worse.

A small manual eval set is enough to start. Take 50 to 200 representative cases, write the expected answer or grading rubric, and run the base model with your best prompt. If the prompt gets most cases right, you may not need a tune. If it fails in a repeated, nameable way, fine-tuning becomes a real option.

Engineering reality

The decision to fine-tune is mostly a product and evaluation decision, not a GPU decision. The training command is the easy part. The hard part is knowing what behavior you want, capturing it in examples, and proving the result did not quietly regress.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is the difference between prompt context and fine-tuned behavior?
Why is fine-tuning a weak way to store changing facts?
Name two problems that are better solved with RAG or tools.
What should you prove with an eval set before training?

Quick check

Fine-tune on the whole handbook
RAG or a database-backed lookup
A shorter system prompt

A repeated, stable behavior pattern
A set of facts that changes daily
A live action like checking a user's account balance