Lesson 04

DPO: direct preference optimization

DPO is popular because it removes much of the machinery of classic RLHF. You still need good preference data, but you do not need to train a separate reward model or run an online RL loop.

The one idea

DPO trains the model to make chosen answers more likely than rejected answers, while a reference model keeps the update from drifting too far.

What DPO skips

Classic RLHF has two stages after supervised tuning: train a reward model on preference data, then optimize a policy against that reward model with RL. DPO collapses both into a single step by using the preference pairs directly in the loss that updates the model. The dataset still contains prompts, chosen answers, and rejected answers.
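To make the dataset shape concrete, here is a minimal sketch of one preference record and a basic sanity check. The class name PreferencePair, the helper validate_pair, and the sample text are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example (hypothetical container for illustration)."""
    prompt: str
    chosen: str
    rejected: str


def validate_pair(pair: PreferencePair) -> list[str]:
    """Return a list of problems found in the pair; an empty list means it looks usable."""
    problems = []
    if not pair.prompt.strip():
        problems.append("empty prompt")
    if not pair.chosen.strip():
        problems.append("empty chosen answer")
    if not pair.rejected.strip():
        problems.append("empty rejected answer")
    # A pair with identical answers carries no preference signal.
    if pair.chosen.strip() == pair.rejected.strip():
        problems.append("chosen and rejected are identical")
    return problems


# Made-up example record.
example = PreferencePair(
    prompt="Explain what a mutex is in one sentence.",
    chosen="A mutex is a lock that lets only one thread enter a critical section at a time.",
    rejected="It is a thing used in programming.",
)
print(validate_pair(example))
```

Running a check like this before training is cheap, and it catches pairs that cannot teach the model anything.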

This makes DPO feel closer to supervised fine-tuning from an engineering point of view. You can train offline on a static dataset. There is no rollout loop where the policy samples fresh answers and asks a reward model to score them during training.

The reference model still matters

DPO compares the tuned model against a reference model, usually the supervised-tuned model before preference training. The goal is not simply "make the chosen answer likely at any cost." The goal is to prefer the chosen answer relative to the rejected one without destroying broad behavior.
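For readers who want the formula, the standard DPO objective can be written as below. The symbols are not defined elsewhere in this lesson and are introduced here: \pi_\theta is the model being tuned, \pi_{\mathrm{ref}} is the frozen reference model, x is the prompt, y_w is the chosen answer, y_l is the rejected answer, \sigma is the logistic function, and \beta is the beta parameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Both terms are log-ratios against the reference model, so the loss rewards raising the chosen answer relative to the rejected one only insofar as the tuned model moves away from the reference in that direction.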

The beta parameter controls how tightly the tuned model is tied to the reference model. With a large beta, the model stays close to the reference and may barely change. With a small beta, it can drift further, and it may overfit the preference data, lose diversity, or exaggerate the patterns reviewers happened to like.
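A minimal sketch of the per-pair DPO loss shows where beta enters. It assumes you already have summed sequence log-probabilities for the chosen and rejected answers under both the tuned model and the reference model. The function name, argument names, and the default beta of 0.1 are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (hypothetical helper for illustration)."""
    # How much the tuned model has moved relative to the reference on each answer.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the margin between the two log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Tuned model identical to the reference: the margin is zero.
loss_at_start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0, beta=0.1)

# Tuned model has shifted toward the chosen answer and away from the rejected one.
loss_after_shift = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0, beta=0.1)
```

When the tuned model equals the reference, the margin is zero and the loss is log 2 regardless of beta. In a real training loop the same arithmetic is applied to batched tensors, with the reference model's log-probabilities computed without gradients.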

Mental model

DPO is like saying: compared with the old model, make this answer win more often than that answer. Do it without turning the model into a weird specialist.

Why teams often start here

DPO is a practical first preference-tuning method because it has fewer moving parts than RLHF. You can reuse much of the supervised fine-tuning setup: dataset loading, batch training, validation, checkpoints, and adapter training. That lowers the operational burden.

It also fits many product tuning problems: style preference, answer helpfulness, format discipline, refusal tone, and domain-specific ranking. If the data is mostly offline human preference pairs, DPO is often the simplest serious baseline.

Where DPO is not enough

DPO is not a free replacement for every RL setup. It depends on the quality and coverage of static preference pairs. If the behavior needs long exploration, tool feedback, verifiable multi-step outcomes, or rewards that depend on running code, a pure offline pairwise method may miss the important signal.

It can also inherit label bias. If reviewers consistently prefer a shallow signal, such as verbosity or confidence, DPO can teach the model that shortcut efficiently. Simpler training does not remove the need for careful evals.
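One cheap way to look for a verbosity shortcut is to measure how often the chosen answer is simply the longer one. The sketch below is a rough diagnostic under stated assumptions: pairs are (chosen, rejected) string tuples, length is counted in whitespace-separated words, and the function name and sample data are made up for illustration.

```python
def chosen_longer_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of pairs whose chosen answer has more words than the rejected one."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    longer = sum(
        1 for chosen, rejected in pairs
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(pairs)


# Made-up (chosen, rejected) pairs for illustration.
sample_pairs = [
    ("Use a context manager so the file always closes.", "Use open()."),
    ("Yes.", "It depends on many factors, probably yes."),
    ("Restart the service, then check the logs.", "Restart it."),
]
rate = chosen_longer_rate(sample_pairs)
```

A rate far above one half does not prove the labels are biased, since longer answers are sometimes genuinely better, but it is a signal to inspect samples and to include length-controlled cases in your evals.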

Engineering reality

DPO is attractive because it is easier to run, not because it removes alignment risk. Treat it like a strong baseline. Keep RLHF or outcome-based RL for cases where static preference pairs do not capture the reward you need.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What parts of the RLHF pipeline does DPO skip?
Why does DPO still use a reference model?
What does the beta parameter control at a high level?
When might static preference pairs be too weak?

Quick check

It skips the separate reward model and online RL loop
It does not need labeled data
It removes inference cost in production

It improves held-out preference win rate
The training method has fewer moving parts, so evals matter less
It passes regression tests on important cases