Track 3 · Training & adapting models

Preference Tuning & RL

Supervised fine-tuning teaches a model what good answers look like. Preference tuning teaches it which answer is better when several answers are plausible. This course covers RLHF, DPO, GRPO, reward models, and the eval work needed before you trust the result.

6 lessons · Intermediate · After Fine-tuning

What is preference tuning?

Why pairwise preferences are different from demonstration data, and how they shape taste, refusal behavior, and judgment.

Build preference datasets

How to collect prompts, compare candidate answers, write rubrics, handle ties, and avoid teaching the model lazy shortcuts.

Reward models and RLHF

The classic RLHF loop: train a reward model on preference pairs, optimize the policy against it, and keep the tuned model close to the base model's useful behavior, typically with a KL penalty.
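
As a rough illustration of the first step in that loop, reward models are commonly trained with a pairwise (Bradley-Terry style) loss that pushes the score of the preferred answer above the score of the rejected one. The sketch below assumes the two scores are already available as plain floats; the function name `pairwise_reward_loss` is hypothetical and not taken from any particular library.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)) for one preference pair.

    Hypothetical helper for illustration: a real reward model would
    produce these scores from a prompt and a candidate answer.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the chosen answer already scores higher,
# and large when the ranking is inverted.
well_ranked = pairwise_reward_loss(3.0, 0.0)
mis_ranked = pairwise_reward_loss(0.0, 3.0)
```

When both answers receive the same score the loss is log 2, which is the value for a model that cannot tell the pair apart.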

DPO: direct preference optimization

Why DPO skips the separate reward model and turns preference pairs into a simpler supervised training objective.
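
To make the "supervised objective" idea concrete, here is a minimal sketch of the per-pair DPO loss. It assumes you have already computed the summed log-probability of each full response under the policy being trained and under a frozen reference model. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, not a specific library's API.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Per-pair DPO loss from sequence log-probabilities (illustrative sketch)."""
    # How much more (or less) likely each answer became relative to the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raised the chosen answer and lowered the rejected one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model appears anywhere in the computation: the log-probability ratios against the reference play that role, which is why the training loop looks like ordinary supervised fine-tuning over pairs.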

GRPO and RL for reasoning

How scoring each sampled answer relative to its own group removes the need for a separate critic model, where GRPO fits, and why verifiable tasks change the RL story.
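
The group-relative idea can be sketched in a few lines: sample several answers to the same prompt, score them, and use each answer's distance from the group mean as its advantage instead of asking a learned critic for a baseline. The sketch below is an assumption-laden illustration. The name `group_relative_advantages`, the `eps` value, and the use of population standard deviation are choices made here; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Illustrative sketch: the group mean acts as the baseline, so no
    separate value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt, scored by a verifier (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers get positive advantages and incorrect ones negative. If every answer in the group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason tasks with verifiable, varied outcomes suit this approach.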

Evaluate and ship preference-tuned models

Win rates, regression suites, reward hacking checks, rollout plans, and when RL on an open model is worth owning.