Prepare a fine-tuning dataset
A fine-tune does not learn from your intention. It learns from examples. Dataset work is where the model gets told what "good" looks like.
A fine-tuning dataset is a collection of demonstrations. Each example should show the input the model will see, the behavior you want, and the boundary between acceptable and unacceptable output.
Examples are the product spec
In normal software, a product spec describes behavior in prose and tests enforce it. In fine-tuning, examples do most of that work. If the examples are inconsistent, the model learns inconsistency. If they contain shortcuts, the model learns shortcuts. If they only show easy cases, the model looks good until production supplies the cases you avoided.
A useful example contains the same ingredients the model will see at inference time: the instruction, any user input, any relevant context, and the target answer. For chat models, that often means a list of messages. For completion-style models, it may be a prompt and completion pair. The exact file format changes by provider, but the principle is stable: train on the same interaction shape you expect to serve.
{
  "messages": [
    {"role": "system", "content": "You classify support tickets."},
    {"role": "user", "content": "I was charged twice for my plan."},
    {"role": "assistant", "content": "{\"category\":\"billing\", \"priority\":\"medium\"}"}
  ]
}
This example teaches more than a label. It teaches the task framing, output format, category naming, and the kind of judgment expected for a billing complaint.
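Before a record like this goes into a training file, it can help to check its shape mechanically. The following is a minimal sketch, not any provider's official validator: the function name `check_chat_example`, the set of allowed roles, and the choice of one JSON object per line (JSONL) are all assumptions made for illustration.

```python
import json

# Assumed role names; check your provider's documentation for the real list.
ALLOWED_ROLES = {"system", "user", "assistant"}

def check_chat_example(example: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should be the assistant target")
    return problems

example = {
    "messages": [
        {"role": "system", "content": "You classify support tickets."},
        {"role": "user", "content": "I was charged twice for my plan."},
        {"role": "assistant",
         "content": json.dumps({"category": "billing", "priority": "medium"})},
    ]
}

problems = check_chat_example(example)
jsonl_line = json.dumps(example)  # assumed layout: one example per line
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole file in a single pass.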
Coverage beats volume
People often ask how many examples they need. The honest answer is: enough to cover the behavior. One thousand near-duplicates are weaker than two hundred examples that span the real decision space. A dataset should cover common cases, edge cases, refusal cases, ambiguous cases, and examples where two labels look similar.
For classification, coverage means every label has enough examples and confusing label pairs are represented. For generation, coverage means different lengths, tones, input shapes, and hard constraints. For extraction, coverage means missing fields, messy phrasing, weird ordering, and inputs that should produce an empty answer.
If you cannot name the subgroups in your dataset, you do not know what behavior you are training. Split examples by intent, source, length, language, format, customer segment, and known failure mode before counting them.
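Counting by subgroup can be as simple as a tally per metadata field. The sketch below assumes each example carries metadata keys such as `intent` and `language`; those field names, the helper `coverage_report`, and the `min_count` threshold are hypothetical and should be adapted to your own data.

```python
from collections import Counter

def coverage_report(examples, key, min_count):
    """Count examples per subgroup and list subgroups below min_count."""
    counts = Counter(ex.get(key, "<missing>") for ex in examples)
    thin = sorted(group for group, n in counts.items() if n < min_count)
    return counts, thin

# Toy metadata; real examples would also carry the input and target.
examples = [
    {"intent": "billing", "language": "en"},
    {"intent": "billing", "language": "en"},
    {"intent": "billing", "language": "de"},
    {"intent": "refund", "language": "en"},
    {"intent": "outage", "language": "en"},
    {"intent": "outage", "language": "en"},
]

counts, thin = coverage_report(examples, key="intent", min_count=2)
print(dict(counts))  # examples per intent
print(thin)          # intents that need more examples
```

Running the same report for each field (source, length bucket, language, and so on) shows where the dataset is thin before any training time is spent.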
Quality filters matter more than clever training
Fine-tuning amplifies dataset quality. Remove examples with wrong answers, inconsistent formatting, duplicated inputs, hidden private data, and stale policies. If the target answer includes reasoning, make sure the reasoning is actually valid and not a polished hallucination. If the answer is JSON, validate it mechanically before training.
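For JSON targets, mechanical validation might look like the sketch below. It assumes the target is an object with exactly the `category` and `priority` keys seen in the ticket example; the allowed label sets are hypothetical placeholders, not a real taxonomy.

```python
import json

# Hypothetical label sets; replace them with your own taxonomy.
ALLOWED = {
    "category": {"billing", "technical", "account"},
    "priority": {"low", "medium", "high"},
}

def target_problems(target_text: str) -> list[str]:
    """Return problems found in an assistant target that should be JSON."""
    try:
        target = json.loads(target_text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(target, dict):
        return ["target must be a JSON object"]
    problems = []
    if set(target) != set(ALLOWED):
        problems.append(f"expected keys {sorted(ALLOWED)}, got {sorted(target)}")
    for field, allowed in ALLOWED.items():
        value = target.get(field)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field} has unexpected value {value!r}")
    return problems

good = '{"category":"billing", "priority":"medium"}'
bad_case = '{"category":"Billing", "priority":"medium"}'  # inconsistent casing
print(target_problems(good))
print(target_problems(bad_case))
```

A check like this catches inconsistent label casing and truncated JSON, two problems that are easy to miss when reading examples by eye.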
Deduplication is especially important. Duplicates overweight one behavior and can leak validation examples into training. If the model sees the same example during training and validation, validation loss looks good for the wrong reason: you did not measure generalization, you measured memory.
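One simple way to catch duplicates and train/validation leaks is to fingerprint each input after light normalization. This is a sketch under assumptions: the `input` field name and the helper names are hypothetical, and the normalization (lowercasing and collapsing whitespace) only catches exact and near-exact copies, not paraphrases.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first example for each distinct input fingerprint."""
    seen = set()
    unique = []
    for ex in examples:
        fp = fingerprint(ex["input"])
        if fp not in seen:
            seen.add(fp)
            unique.append(ex)
    return unique

def leaked_inputs(train: list[dict], validation: list[dict]) -> list[dict]:
    """Return validation examples whose input also appears in train."""
    train_fps = {fingerprint(ex["input"]) for ex in train}
    return [ex for ex in validation if fingerprint(ex["input"]) in train_fps]

train = [
    {"input": "I was charged twice for my plan."},
    {"input": "I was  charged twice for my PLAN."},  # near-exact copy
    {"input": "The app crashes on login."},
]
validation = [
    {"input": "i was charged twice for my plan."},  # leaks from train
    {"input": "How do I change my email?"},
]

train = deduplicate(train)
leaks = leaked_inputs(train, validation)
```

Paraphrased duplicates need a fuzzier comparison than a hash, so treat an empty leak list as a minimum bar rather than proof of a clean split.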
Support logs, chat transcripts, and human edits are useful raw material, not a clean dataset. They contain user secrets, policy mistakes, frustrated tone, partial conversations, and old instructions. Curate them before they touch training.
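Part of that curation can be automated as a first pass. The sketch below masks email addresses and long digit runs with regular expressions. Both patterns and the placeholder tokens are illustrative assumptions, they will miss many kinds of personal data, and they do not replace a proper privacy review.

```python
import re

# Illustrative patterns only; real logs need a broader, reviewed rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 12 to 19 digits, optionally separated by spaces or hyphens (card-like numbers).
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")

def redact(text: str) -> str:
    """Replace email addresses and long digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)

raw = "Contact jo@example.com about card 4111 1111 1111 1111."
clean = redact(raw)
print(clean)
```

Short numbers such as order or ticket IDs are left alone by these patterns, which keeps context the model may need while removing the riskier values.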
Train, validation, and test splits
Keep separate data for separate jobs. The training set updates the model. The validation set tells you whether the model is improving during training. The test set is held back until you need an honest final comparison.
Splits should be grouped by real-world unit when possible. If ten examples come from the same customer ticket, do not put eight in train and two in validation. The validation examples will be too similar to training. Group by ticket, document, customer, or conversation so validation resembles unseen work.
For small projects, a practical starting point is 80 percent train, 10 percent validation, and 10 percent test. For very small datasets, use a larger validation set than feels comfortable. You need signal more than you need another dozen training examples.
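A grouped split can be sketched by hashing the group identifier, so every example from the same ticket lands in the same split. The `ticket_id` field and the function names are assumptions for illustration. The fractions apply to groups rather than individual examples, so actual split sizes will only approximate 80/10/10, especially on small datasets.

```python
import hashlib

def assign_split(group_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a group ID to a split deterministically via its hash."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # number between 0 and 1
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"

def split_by_group(examples: list[dict], group_key: str = "ticket_id") -> dict:
    """Place every example in the split chosen for its group."""
    splits = {"train": [], "validation": [], "test": []}
    for ex in examples:
        splits[assign_split(str(ex[group_key]))].append(ex)
    return splits

# Toy data: 50 tickets with 3 examples each.
examples = [
    {"ticket_id": f"T{i}", "text": f"message {j}"}
    for i in range(50)
    for j in range(3)
]
splits = split_by_group(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the group ID, rerunning the split after adding new data never moves an existing ticket from one split to another.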
What to leave out
Leave out facts that should be retrieved at runtime. Leave out secrets and personal data unless there is a strong, compliant reason and a retention plan. Leave out examples where even humans disagree unless you label the ambiguity clearly. Leave out prompts that only exist to patch a bad product flow.
Also leave out examples that teach the wrong interface. If the served model will receive tool outputs as structured JSON, train it on that structure. If it will receive plain English summaries, train it that way. Mismatched format is one of the easiest ways to waste a tune.
Your dataset needs version control. Track the source, filtering rules, schema, split assignment, and generation method for every example. When a tuned model behaves oddly, the first useful question is usually "which examples taught it that?"
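One lightweight way to track this is a provenance record per example plus a digest over the whole manifest. The sketch below is only one possible layout: the `ExampleRecord` fields mirror the list above, but their names, the sample values, and the `manifest_digest` helper are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ExampleRecord:
    example_id: str
    source: str                # where the raw example came from
    filter_rules_version: str  # which filtering rules it passed
    schema_version: str        # which example schema it follows
    split: str                 # train, validation, or test
    generation_method: str     # e.g. human-written or human-edited

def manifest_digest(records: list[ExampleRecord]) -> str:
    """Hash the manifest so any change to any record changes the digest."""
    lines = sorted(json.dumps(asdict(r), sort_keys=True) for r in records)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

records = [
    ExampleRecord("ex-001", "support-logs", "v3", "v1", "train", "human-edited"),
    ExampleRecord("ex-002", "support-logs", "v3", "v1", "validation", "human-written"),
]
digest = manifest_digest(records)
```

Storing the digest next to each tuned model makes it possible to tell later exactly which version of the dataset produced it.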
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- Why are fine-tuning examples closer to a product spec than ordinary logs?
- What does "coverage beats volume" mean?
- Why should validation examples be grouped away from similar training examples?
- Name three kinds of data you should usually leave out of a fine-tuning dataset.
Quick check
- 10,000 near-duplicate examples from one easy case
- 800 curated examples covering common cases, edge cases, and confusing boundaries
- All production chat logs with no filtering
- To measure improvement on examples the model did not learn from directly
- To give the optimizer more gradients
- To reduce the prompt length at inference time