Lesson 05

GRPO and RL for reasoning

Reasoning tasks change the preference-tuning story because some rewards can be checked. A math answer, unit test, parser result, or game outcome can provide stronger signal than a vague human preference.

The one idea

GRPO scores a group of answers for the same prompt and updates the model based on how each answer did relative to the group, which avoids training a separate critic.

Why reasoning rewards are different

For open-ended chat, "better" is often subjective. For reasoning tasks, you can sometimes check the outcome. Did the final answer match? Did the code pass tests? Did the SQL query return the expected rows? Did the plan obey the constraints?

That makes RL more attractive. Instead of asking a human which answer sounds better, you can sample several attempts and reward the ones that actually solve the problem. This is still not easy, but the reward is less squishy.
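To make "sample several attempts and reward the ones that actually solve the problem" concrete, here is a minimal sketch of an exact-match reward for a math-style task. The `Answer:` marker, the function names, and the sample attempts are all assumptions made for illustration; a real task would define its own output format and checker.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def exact_match_reward(completion: str, expected: str) -> float:
    """Reward 1.0 when the extracted final answer equals the expected string."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == expected.strip() else 0.0

# Three hypothetical attempts at the same prompt.
attempts = [
    "12 * 4 = 48, minus 6 is 42. Answer: 42",
    "12 * 4 = 46, minus 6 is 40. Answer: 40",
    "I think it is probably 42.",
]
rewards = [exact_match_reward(a, "42") for a in attempts]
print(rewards)  # [1.0, 0.0, 0.0]
```

Note that the third attempt names the right number but earns no reward because it ignores the expected format, which is one reason supervised fine-tuning on the format usually comes first.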

Group-relative policy optimization

GRPO samples a group of outputs for the same prompt and scores each one with a reward. It then turns each reward into a group-relative advantage, typically by subtracting the group's mean reward and dividing by the group's standard deviation. Outputs that beat the group average get pushed up. Outputs that fall below it get pushed down.
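The comparison against the group can be sketched in a few lines. This is a simplified illustration of the scoring step only, not a full training loop; the function name, the `eps` term, and the choice of population standard deviation are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group.

    rewards: rewards for several completions of the SAME prompt.
    eps: small constant (an assumption here) to avoid dividing by zero
         when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical completions: two solved the task, two did not.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advs])  # [1.0, -1.0, -1.0, 1.0]

# If every completion gets the same reward, there is nothing to compare.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)  # [0.0, 0.0, 0.0]
```

The second call shows a practical consequence: a group where every attempt succeeds, or every attempt fails, contributes no learning signal, so prompts that are far too easy or far too hard are largely wasted compute.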

The practical appeal is that GRPO does not need a separate value model or critic, because the group's average reward serves as the baseline a critic would otherwise estimate. That can save memory and simplify training, especially when tuning large language models where every extra model copy is expensive.

GRPO uses multiple samples from the same prompt so the model can learn from relative success within the group.

Verifiable rewards are not free

A verifier can be wrong or incomplete. Unit tests can miss edge cases. A math checker can validate the final number while ignoring a brittle chain of reasoning. A sandbox can be too slow or too permissive. Outcome rewards give a stronger signal than subjective preferences, but the verifier still needs careful design, because the model is optimized against whatever it actually checks.
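As a small illustration of an incomplete verifier, the sketch below rewards a buggy solution because the test cases happen to miss an edge case. The task (computing a median), both verifiers, and the candidate function are hypothetical and chosen only to make the failure easy to see.

```python
def candidate_median(values):
    """Hypothetical model-written solution: wrong for even-length lists."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def weak_verifier(fn) -> float:
    """Reward 1.0 if fn passes tests that only cover odd-length inputs."""
    cases = [([3, 1, 2], 2), ([5], 5), ([9, 7, 8, 1, 4], 7)]
    return 1.0 if all(fn(list(inp)) == out for inp, out in cases) else 0.0

def stricter_verifier(fn) -> float:
    """Same idea, but the cases also include even-length inputs."""
    cases = [([3, 1, 2], 2), ([1, 2, 3, 4], 2.5), ([4, 2], 3)]
    return 1.0 if all(fn(list(inp)) == out for inp, out in cases) else 0.0

print(weak_verifier(candidate_median))      # 1.0
print(stricter_verifier(candidate_median))  # 0.0
```

Under the weak verifier, RL would reinforce the buggy solution at full reward. Auditing a sample of rewarded outputs by hand is one cheap way to catch this kind of gap before a long run.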

For many production teams, the best use of reasoning RL is narrow: domains where success can be checked cheaply and repeatably. Coding tasks, math tasks, constrained planning, data extraction with exact schemas, and tool workflows with clear end states are better fits than broad chat personality.

Where GRPO fits in the stack

Use supervised fine-tuning to teach the model the format and basic task. Use DPO when you have offline human preferences. Reach for GRPO-style RL when you can score multiple attempts and the score captures real success.

That does not mean every app needs GRPO. If a smaller prompt fix, RAG change, eval suite, or DPO run solves the problem, take the simpler route. RL becomes worth it when exploration can find better solutions than imitation alone would.

Engineering reality

Reasoning RL can burn a lot of tokens because each prompt may sample many completions. Budget for generation, verification, failed attempts, and repeated eval runs before assuming the tuning run is cheap.
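A rough back-of-the-envelope estimate shows how quickly generation adds up. Every number below is a made-up placeholder rather than a recommendation, and the function counts sampled completion tokens only; prompt tokens, verification, and eval runs come on top.

```python
def rollout_tokens(prompts_per_step, group_size, avg_completion_tokens, steps):
    """Estimate sampled completion tokens for a run (generation only)."""
    return prompts_per_step * group_size * avg_completion_tokens * steps

# Hypothetical run: 64 prompts per step, 8 samples per prompt,
# about 1,000 tokens per completion, 500 training steps.
total = rollout_tokens(
    prompts_per_step=64,
    group_size=8,
    avg_completion_tokens=1_000,
    steps=500,
)
print(total)  # 256000000
```

Doubling the group size or the completion length doubles this figure, so these two settings are usually the first to revisit when a run is over budget.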

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why are verifiable tasks different from open-ended chat?
What does GRPO compare within a group?
Why is skipping a critic useful for large-model tuning?
What can go wrong with an outcome verifier?

Quick check

Compare several answers for the same prompt and learn from relative rewards
Train only on one human-written answer per prompt
Retrieve more documents before answering

Making a chatbot sound more friendly
Generating code that must pass a test suite
Knowing the latest product prices