Reward models and RLHF
RLHF is the classic preference tuning loop: learn a reward function from human comparisons, then tune the language model to earn more reward without drifting too far from the useful base model.
RLHF turns preference labels into a learned reward model, then uses reinforcement learning to make the LLM produce outputs that the reward model scores highly.
The three-model picture
RLHF usually involves three models. First is the base or supervised-tuned model that already knows how to answer. Second is a reward model trained to score answers. Third is the policy model, which typically starts as a copy of the first model and is the one being updated by RL.
The reward model is not the final assistant. It is a judge. It reads a prompt and an answer, then predicts a scalar reward. During RL, the policy generates an answer, the reward model scores it, and the optimizer nudges the policy toward answers with higher scores.
Training the reward model
A reward model is trained on preference pairs. Given the same prompt, it should score the chosen answer higher than the rejected answer. The absolute number is less important than the ordering.
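One common way to express "chosen should outscore rejected" is a pairwise loss in the Bradley-Terry style. The lesson does not name a specific loss, so treat this as a minimal sketch under that assumption; the function name `pairwise_preference_loss` is made up for illustration.

```python
import math


def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Negative log-probability that the chosen answer beats the rejected one.

    Assumption: P(chosen wins) = sigmoid(chosen_score - rejected_score),
    i.e. a Bradley-Terry style preference model.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss depends only on the difference between the two scores, which is why the ordering matters more than the absolute numbers: shifting both scores by the same constant leaves the loss unchanged.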
This sounds simple, but it can break in quiet ways. If labelers prefer longer answers, the reward model may learn "longer is better." If labelers reward confident tone, the model may learn to sound certain. If the dataset under-samples refusal boundaries, the reward model may make unsafe answers look helpful.
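A quick audit of the preference data can surface the length problem before any training happens. The sketch below assumes the data is available as `(chosen_text, rejected_text)` tuples and uses whitespace word counts as a rough length measure; both choices, and the function name, are illustrative assumptions.

```python
def fraction_chosen_longer(pairs):
    """Share of preference pairs whose chosen answer is longer than the rejected one.

    `pairs` is a list of (chosen_text, rejected_text) tuples.
    Pairs with equal word counts are skipped.
    Returns 0.5 (no evidence either way) if no pair has a length difference.
    """
    longer = 0
    shorter = 0
    for chosen, rejected in pairs:
        chosen_len = len(chosen.split())
        rejected_len = len(rejected.split())
        if chosen_len > rejected_len:
            longer += 1
        elif chosen_len < rejected_len:
            shorter += 1
    decided = longer + shorter
    return longer / decided if decided else 0.5
```

A value far above 0.5 suggests that length alone predicts the label, which is the kind of shortcut a reward model can pick up.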
A reward model is a lossy proxy for human judgment. Once you optimize against it hard enough, the policy can find weird pockets where the proxy gives high scores for bad answers.
Optimizing the policy
After the reward model exists, the policy model is updated with an RL algorithm. In LLM work, PPO is the classic choice. The policy samples responses, gets reward scores, and updates its token probabilities so high-reward responses become more likely.
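To make "high-reward responses become more likely" concrete, here is a toy update on a policy that chooses among a fixed set of candidate responses. It uses a plain REINFORCE-style step rather than PPO, which adds clipping and a value model on top of the same idea. The names `softmax` and `reinforce_step`, the baseline, and the learning rate are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.1):
    """One REINFORCE-style update on a toy categorical policy.

    `logits` has one entry per candidate response.
    `action` is the index of the response that was sampled and scored.
    `baseline` is a reference reward level; only reward above it is reinforced.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    # d/d logit_j of log pi(action) is (1 if j == action else 0) - probs[j].
    return [
        logit + lr * advantage * ((1.0 if j == action else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]
```

Calling `reinforce_step([0.0, 0.0, 0.0], action=2, reward=0.9, baseline=0.5)` raises the probability of response 2, because its reward is above the baseline. A reward below the baseline would push that probability down instead.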
There is a catch: if the model chases reward too aggressively, it can lose broad language ability, become repetitive, or exploit reward-model quirks. RLHF systems usually add a KL (Kullback-Leibler) divergence penalty that keeps the policy's output distribution close to that of a reference model. That reference is often a frozen copy of the supervised-tuned model you started from.
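One common way to apply the penalty is to subtract it from the reward-model score before the RL update. The sketch below assumes you have per-token log-probabilities of the sampled response under both the policy and the reference model; the function name `kl_shaped_reward` and the default `beta` are illustrative, not values from any particular library.

```python
def kl_shaped_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty for one sampled response.

    `policy_logprobs` and `reference_logprobs` hold per-token log-probabilities
    of the same sampled tokens under the policy and the reference model.
    `beta` controls how strongly drift from the reference is penalized.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    # Summed log-ratio on the sampled tokens: a single-sample estimate of
    # KL(policy || reference) for this response.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * log_ratio
```

When the policy assigns the sampled tokens the same probability as the reference, the penalty is zero and the reward passes through unchanged. The more the policy favors a response that the reference finds unlikely, the more reward it gives up, and a larger `beta` makes that trade steeper.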
Reward hacking
Reward hacking means the policy learns to maximize the reward signal without actually getting better for users. In text models, that might look like overlong answers, fake citations, excessive hedging, formulaic safety disclaimers, or confident nonsense that the reward model likes.
The fix is not one trick. You need better preference data, adversarial evals, length-normalized checks, human review, and limits on how far the policy can move. The reward model should be treated as one noisy instrument, not as ground truth.
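As one example of a length check, you can measure how strongly reward-model scores track response length on a held-out set. This is a minimal sketch using a hand-written Pearson correlation and whitespace word counts; the function names are illustrative, and what counts as "too correlated" is a judgment call for your own data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses, scores):
    """Correlation between response word count and reward-model score."""
    lengths = [len(text.split()) for text in responses]
    return pearson(lengths, scores)
```

A correlation near 1.0 on responses that humans rate as equally good is a warning sign that the policy can earn reward simply by writing more.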
RLHF has more moving parts than supervised tuning. You own the policy, reward model, rollout sampling, KL settings, evals, and rollback plan. Use it when the preference signal is valuable enough to justify that surface area.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What does a reward model learn from preference pairs?
- Why does RLHF need drift control?
- What is reward hacking in LLM tuning?
- Why is a reward model not the same as human judgment?
Quick check
What does the reward model do in RLHF?
- It generates the final answer shown to users
- It scores prompt-answer pairs during optimization
- It converts text into tokens
Why do RLHF systems add a KL penalty?
- To keep the tuned policy close to useful base behavior
- To make generation faster
- To add new documents into the model