What is preference tuning?
A model can produce many acceptable answers for the same prompt. Preference tuning is how we teach it which answer is better for the product, the user, and the policy we want it to follow.
Supervised fine-tuning says, "imitate this answer." Preference tuning says, "given these two answers, prefer this one." That comparison signal is a better fit than imitation for qualities such as taste, helpfulness, refusal boundaries, and judgment.
Why imitation is not enough
Supervised fine-tuning works well when the target answer is fairly clear. Extract this field. Classify this ticket. Rewrite this paragraph in a known style. The dataset gives the model one good answer and training pushes probability mass toward that answer.
But many LLM behaviors are not that clean. Two answers can both be factually correct while one is more useful. One answer may be shorter, safer, clearer, more honest about uncertainty, or better aligned with the product's voice. It is easier for a human reviewer to compare two candidates than to write the perfect target answer from scratch.
Preference tuning turns that comparison into a training signal. The raw unit is usually a prompt, a chosen answer, and a rejected answer. Over many examples, the model learns what kinds of outputs win.
The shape of a preference example
A preference row is small, but it carries a lot of meaning:
- Prompt: the user request and any system context.
- Chosen answer: the response the reviewer preferred.
- Rejected answer: the response the reviewer preferred less.
- Rubric metadata: why the chosen answer won, who labeled it, and whether the pair was close.
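The fields above can be sketched as a small record type. This is a minimal illustration, not a standard schema: the class name `PreferenceExample`, the field names, and the `validate` helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PreferenceExample:
    """One preference row. All field names here are illustrative assumptions."""

    prompt: str            # user request plus any system context
    chosen: str            # response the reviewer preferred
    rejected: str          # response the reviewer preferred less
    reason: str = ""       # rubric metadata: why the chosen answer won
    labeler_id: str = ""   # rubric metadata: who labeled the pair
    is_close: bool = False  # rubric metadata: whether the pair was a near tie

    def validate(self) -> None:
        """Reject rows that cannot carry a usable comparison signal."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.chosen.strip() == self.rejected.strip():
            raise ValueError("chosen and rejected answers must differ")


example = PreferenceExample(
    prompt="Summarize our refund policy in two sentences.",
    chosen="Refunds are available within 30 days. Contact support to start one.",
    rejected="We have a refund policy. Please read the full terms for details.",
    reason="Chosen answer states the concrete window and the next step.",
    labeler_id="reviewer-01",
    is_close=False,
)
example.validate()  # raises ValueError if the row is malformed
```

Keeping the metadata next to the pair makes it possible to filter out close calls or audit a single labeler later without re-collecting data.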
What preference tuning changes
Preference tuning is usually about behavior at the margin. It nudges the model toward answers that humans or automated rubrics prefer. That can mean better helpfulness, cleaner formatting, fewer unsupported claims, better refusals, stronger reasoning traces, or a house style that is hard to express as a single instruction.
It is not magic reasoning juice. If the base model cannot solve the task with a good prompt and examples, preference tuning may only teach it to sound confident while still being wrong. The base capability still matters.
Use supervised fine-tuning to teach the task shape. Use preference tuning to teach the ranking between plausible answers.
Why RL shows up
Classical RLHF (reinforcement learning from human feedback) uses reinforcement learning because the final thing we care about is not next-token likelihood. We care about a score: did humans prefer the whole answer? A reward model estimates that score, then an RL algorithm updates the model to produce answers with higher estimated reward.
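The text does not say how the reward model is trained. One common choice, assumed here, is a Bradley-Terry style pairwise loss: the reward model should score the chosen answer above the rejected one, and the loss is the negative log-sigmoid of the score gap. The sketch below works on plain floats for a single pair; the function name `pairwise_reward_loss` is hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair.

    Assumption: a Bradley-Terry style objective. The loss is small when the
    reward model scores the chosen answer well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A reward model that ranks the pair correctly gets a lower loss
# than one that ranks it the wrong way round.
loss_correct = pairwise_reward_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_wrong = pairwise_reward_loss(reward_chosen=0.0, reward_rejected=2.0)
```

In a real pipeline the two rewards would come from a neural network scoring the full prompt and answer, and the loss would be averaged over a batch.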
Newer methods like DPO (Direct Preference Optimization) avoid a separate RL loop for many use cases. They train directly on chosen and rejected answers with a simpler objective. That is why the term "preference tuning" now often covers both RLHF-style methods and non-RL methods. The shared idea is the preference signal.
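To make the "simpler objective" concrete, here is a minimal sketch of the DPO loss for a single pair. It assumes you already have the summed log-probability of each answer under the model being tuned (the policy) and under a frozen reference copy. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the total log-probability of an answer given the prompt.
    beta controls how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each answer became relative to the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written to avoid overflow for large |logits|.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Before any training the policy equals the reference, so the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy shifts probability toward the chosen answer, the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

No reward model and no sampling loop appear here: the chosen and rejected answers from the dataset supply the signal directly, which is what makes the method cheaper to run than an RL loop.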
The expensive part is rarely the training command. It is collecting reliable comparisons, keeping labelers calibrated, preventing reward hacking, and proving that the tuned model did not get worse on boring but important cases.
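One simple way to watch labeler calibration, offered here as an assumption rather than a prescribed process, is to have two labelers judge the same pairs and track how often they agree. The function name `agreement_rate` and the "A"/"B" label encoding are hypothetical.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of shared pairs on which two labelers picked the same answer."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelers must judge the same pairs")
    if not labels_a:
        raise ValueError("need at least one labeled pair")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Each entry records which candidate ("A" or "B") the labeler preferred.
labeler_1 = ["A", "A", "B", "A", "B"]
labeler_2 = ["A", "B", "B", "A", "B"]
rate = agreement_rate(labeler_1, labeler_2)  # 4 of 5 pairs match
```

A falling agreement rate is a hint that the rubric is ambiguous or that labelers have drifted apart. Raw agreement does not correct for chance, so treat it as a rough signal only.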
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- How is a preference pair different from a supervised fine-tuning example?
- Why are comparisons often easier to label than perfect answers?
- What kinds of behavior does preference tuning usually improve?
- Why does base model capability still matter?