Lesson 03

Reward models and RLHF

RLHF is the classic preference tuning loop: learn a reward function from human comparisons, then tune the language model to earn more reward without drifting too far from the useful base model.

The one idea

RLHF turns preference labels into a learned reward model, then uses reinforcement learning to make the LLM produce outputs that the reward model scores highly.

The three-model picture

RLHF usually involves three models. The first is the base or supervised-tuned model that already knows how to answer. The second is a reward model trained to score answers. The third is the policy model being updated by RL.

The reward model is not the final assistant. It is a judge. It reads a prompt and an answer, then predicts a scalar reward. During RL, the policy generates an answer, the reward model scores it, and the optimizer nudges the policy toward answers with higher scores.

Training the reward model

A reward model is trained on preference pairs. Given the same prompt, it should score the chosen answer higher than the rejected answer. The absolute number is less important than the ordering.
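One common way to express "chosen should score higher than rejected" is a pairwise logistic loss, often called a Bradley-Terry loss. The lesson does not name a specific loss, so treat this as a minimal sketch under that assumption. The function name and the scalar scores are illustrative; in practice the scores would come from the reward model's output head.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(score_chosen - score_rejected)).

    Assumed loss form (not specified by the lesson). Only the score
    difference matters, so adding a constant to both scores changes nothing.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Correct ordering gives a small loss, reversed ordering gives a large one.
loss_correct = pairwise_preference_loss(2.0, 0.0)
loss_reversed = pairwise_preference_loss(0.0, 2.0)
# Shifting both scores by the same amount leaves the loss unchanged.
loss_shifted = pairwise_preference_loss(102.0, 100.0)
```

Because the loss depends only on the margin between the two scores, the reward model is free to pick any absolute scale, which is why the ordering matters more than the number itself.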

This sounds simple, but it can break in quiet ways. If labelers prefer longer answers, the reward model may learn "longer is better." If labelers reward confident tone, the reward model may learn to favor answers that sound certain. If the dataset under-samples refusal boundaries, the reward model may score unsafe answers as helpful.
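A cheap first check for the length problem is to measure how often the chosen answer is simply the longer one. The sketch below is a hypothetical diagnostic, not a standard tool: the function name, the whitespace word count, and the toy pairs are all made up for illustration.

```python
def chosen_longer_rate(pairs):
    """Fraction of (chosen, rejected) pairs where the chosen answer has more words.

    Uses a whitespace word count as a rough length proxy (an assumption;
    a real pipeline would more likely count tokens).
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    longer = sum(
        1 for chosen, rejected in pairs
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(pairs)


# Hypothetical toy data: (chosen, rejected) answers to the same prompts.
toy_pairs = [
    ("Paris is the capital of France.", "Paris."),
    ("Use a context manager so the file closes automatically.", "Use open()."),
    ("No.", "I think the answer is probably no, but it depends."),
    ("Restart the service, then check the logs for errors.", "Restart it."),
]
rate = chosen_longer_rate(toy_pairs)  # 3 of the 4 chosen answers are longer
```

A rate far above one half does not prove the labels are wrong, since longer answers are sometimes better, but it signals that the reward model could pick up length as a shortcut.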

Reward model trap

A reward model is a lossy proxy for human judgment. Once you optimize against it hard enough, the policy can find weird pockets where the proxy gives high scores for bad answers.

Optimizing the policy

After the reward model exists, the policy model is updated with an RL algorithm. In LLM work, PPO is the classic choice. The policy samples responses, gets reward scores, and updates its token probabilities so high-reward responses become more likely.
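The core of PPO's update is a clipped surrogate objective, which limits how much a single step can change the probability of a sampled response. The sketch below computes that objective for one sample. It assumes an advantage value is already available (how much better the response scored than a baseline), which the lesson does not cover, and the function name and default clip range are illustrative.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (higher is better).

    logp_new / logp_old: log-probability of the sampled response under the
    current policy and under the policy that generated the sample.
    advantage: assumed to be precomputed, e.g. reward minus a baseline.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# A high-reward response whose probability has already doubled: the gain is capped.
capped = ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)
# A low-reward response whose probability doubled: the full penalty applies.
penalized = ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0)
```

In training, the optimizer maximizes the average of this quantity over many sampled responses, so high-advantage responses become more likely, but only within the clip range per update.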

There is a catch: if the policy chases reward too aggressively, it can lose broad language ability, become repetitive, or exploit reward-model quirks. RLHF systems usually add a KL penalty, a cost that grows as the policy's token probabilities diverge from those of a frozen reference model, so the policy stays close to that reference. The reference is often the supervised-tuned model you started from.
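One common way to apply the KL penalty is to subtract it from the reward-model score before the RL update. The sketch below assumes a sequence-level penalty estimated from the log-probabilities of the sampled tokens. The function name and the value of beta are illustrative, and real systems differ in whether they apply the penalty per token or per sequence.

```python
def kl_shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference model.

    policy_logprobs / reference_logprobs: per-token log-probabilities of the
    sampled response under each model. Their summed difference is a
    single-sample estimate of the KL divergence (an assumed estimator).
    beta: penalty strength; 0.1 is a placeholder, not a recommended value.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# The policy is more confident than the reference on both tokens, so it pays a penalty.
shaped = kl_shaped_reward(1.0, [-0.1, -0.2], [-0.5, -0.6], beta=0.1)
# If the policy matches the reference exactly, the reward is unchanged.
unchanged = kl_shaped_reward(1.0, [-0.5, -0.6], [-0.5, -0.6], beta=0.1)
```

A larger beta holds the policy closer to the reference at the cost of slower reward gains; a smaller beta allows more movement and more room for reward hacking.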

RLHF is an optimization loop around a learned judge, with drift control to keep the policy useful.

Reward hacking

Reward hacking means the policy learns to maximize the reward signal without actually getting better for users. In text models, that might look like overlong answers, fake citations, excessive hedging, formulaic safety disclaimers, or confident nonsense that the reward model likes.

The fix is not one trick. You need better preference data, adversarial evals, length-normalized checks, human review, and limits on how far the policy can move. The reward model should be treated as one noisy instrument, not as ground truth.
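As one example of a length check, you can measure how strongly reward-model scores track response length on a batch of policy outputs. The sketch below computes a plain Pearson correlation. The function name is hypothetical, and what counts as a worrying correlation is a judgment call that the lesson does not specify.

```python
import math


def length_reward_correlation(lengths, rewards):
    """Pearson correlation between response lengths and reward-model scores."""
    if len(lengths) != len(rewards) or len(lengths) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    n = len(lengths)
    mean_len = sum(lengths) / n
    mean_rew = sum(rewards) / n
    cov = sum((l - mean_len) * (r - mean_rew) for l, r in zip(lengths, rewards))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_rew = sum((r - mean_rew) ** 2 for r in rewards)
    if var_len == 0 or var_rew == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_len * var_rew)


# Hypothetical batch where reward rises in lockstep with length.
suspicious = length_reward_correlation([50, 100, 150], [0.2, 0.4, 0.6])
```

If the correlation climbs as training progresses while human ratings stay flat, the policy may be earning reward through length rather than quality.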

Engineering reality

RLHF has more moving parts than supervised tuning. You own the policy, reward model, rollout sampling, KL settings, evals, and rollback plan. Use it when the preference signal is valuable enough to justify that surface area.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What does a reward model learn from preference pairs?
Why does RLHF need drift control?
What is reward hacking in LLM tuning?
Why is a reward model not the same as human judgment?

Quick check

It generates the final answer shown to users
It scores prompt-answer pairs during optimization
It converts text into tokens

To keep the tuned policy close to useful base behavior
To make generation faster
To add new documents into the model