Track 3 · Training & adapting models
Preference Tuning & RL
Supervised fine-tuning teaches a model what good answers look like. Preference tuning teaches it which answer is better when several answers are plausible. This course covers RLHF, DPO, GRPO, reward models, and the eval work needed before you trust the result.
01 What is preference tuning?
Why pairwise preferences are different from demonstration data, and how they shape taste, refusal behavior, and judgment.
02 Build preference datasets
How to collect prompts, compare candidate answers, write rubrics, handle ties, and avoid teaching the model lazy shortcuts.
03 Reward models and RLHF
The classic RLHF loop: train a reward model, optimize the policy against it, and keep the tuned policy close to the base model so it retains its useful behavior.
04 DPO: direct preference optimization
Why DPO skips the separate reward model and turns preference pairs into a simpler supervised training objective.
05 GRPO and RL for reasoning
How group-relative rewards remove the need for a separate critic model, where GRPO fits, and why verifiable tasks change the RL story.
06 Evaluate and ship preference-tuned models
Win rates, regression suites, reward hacking checks, rollout plans, and when RL on an open model is worth owning.