Lesson 02

Build preference datasets

Preference tuning is only as good as the comparisons you feed it. This lesson is about building preference data that reflects real product tradeoffs instead of noisy vibes.

The one idea

A preference dataset should make the desired boundary visible: not just good vs bad, but why one plausible answer beats another for a specific prompt.

Start with prompts, not answers

The prompt distribution decides what behavior you are tuning. If the dataset is full of clean, simple prompts, the model gets better at clean, simple prompts. If production traffic has partial context, weird phrasing, missing fields, and adversarial asks, those cases need to appear in the data too.

Collect prompts from real usage when privacy and policy allow it. Otherwise, write synthetic prompts from real task categories, then sample hard cases deliberately. Include ordinary cases as well. A model trained only on edge cases can become jumpy and over-refuse.
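One way to keep ordinary and hard cases in deliberate proportion is to sample per category. The sketch below is a minimal illustration, not a prescribed pipeline: the `category` and `text` keys, the category names, and the target shares are all assumptions chosen for the example.

```python
import random
from collections import defaultdict

def sample_prompt_mix(prompts, target_share, total, seed=0):
    """Sample up to `total` prompts so each category gets roughly its target share.

    prompts: list of dicts with hypothetical keys "text" and "category".
    target_share: dict mapping category -> fraction of the final set (assumed values).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for p in prompts:
        by_category[p["category"]].append(p)

    sampled = []
    for category, share in target_share.items():
        pool = by_category.get(category, [])
        # Never ask for more prompts than the category actually has.
        k = min(len(pool), round(total * share))
        sampled.extend(rng.sample(pool, k))
    rng.shuffle(sampled)
    return sampled
```

With shares of 0.6 ordinary, 0.3 hard, and 0.1 adversarial and `total=20`, this draws 12, 6, and 2 prompts, provided each pool is large enough. A category that runs short simply contributes fewer prompts, which is worth logging in a real pipeline.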

Generate candidates with variety

Each prompt needs answers to compare. You can generate candidates from the base model with different sampling settings, from multiple models, from older product versions, or from human-written responses. Variety matters because the comparison should expose meaningful differences.

A weak pair is too easy: one answer is nonsense and the other is fine. Those pairs teach a little, but not much. A strong pair is close: both answers are plausible, but one wins because it acknowledges uncertainty, follows the requested format, refuses correctly, uses better domain language, or avoids a subtle factual mistake.

Dataset habit

Keep the rejected answer. It is not trash. It explains the boundary the model needs to learn.

Use a rubric reviewers can follow

Reviewers need more than "pick the better answer." They need an ordering of priorities. For a support assistant, accuracy might beat warmth. For a medical triage assistant, safety and escalation rules beat brevity. For a code assistant, correct behavior beats a polished explanation.

A good rubric names the dimensions and gives examples. It also tells reviewers what to do with ties. Some pairs should be marked as equal or discarded. Forcing a winner when both answers are basically the same adds label noise.

Correctness: Does the answer solve the user's actual request?
Grounding: Does it avoid claims not supported by the context?
Format: Does it follow the required output shape?
Safety: Does it refuse or redirect when needed?
Usefulness: Is it direct, specific, and not padded?
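The rubric above can be turned into an explicit priority order with a tie outcome. The following is a rough sketch under stated assumptions: the dimension names come from the rubric, but the ordering in `PRIORITY`, the numeric score scale, and the `compare` function are hypothetical and would be set by each product team.

```python
# Dimension names mirror the rubric; this particular ordering is an assumption.
PRIORITY = ["safety", "correctness", "grounding", "format", "usefulness"]

def compare(scores_a, scores_b, priority=PRIORITY):
    """Return "a", "b", or "tie" by checking dimensions in priority order.

    scores_a / scores_b: dicts mapping dimension -> reviewer score
    (hypothetical scale where higher is better).
    """
    for dim in priority:
        if scores_a[dim] > scores_b[dim]:
            return "a"
        if scores_a[dim] < scores_b[dim]:
            return "b"
    # Equal on every dimension: record a tie instead of forcing a winner.
    return "tie"
```

The first dimension on which the answers differ decides the pair, so a lower-priority strength never outweighs a higher-priority weakness. Pairs that return `"tie"` can then be stored as equal or dropped, as the rubric instructs.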

Track metadata and disagreement

Preference data without metadata ages badly. Store the prompt source, candidate model, generation settings, reviewer ID or reviewer pool, rubric version, timestamp, and reason for the choice. That metadata helps you debug strange training results later.
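As an illustration, the fields listed above could be stored as one record per label and written out as JSON Lines. The class name, field names, and example values here are assumptions for the sketch, not a required schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PreferenceRecord:
    # Field names are hypothetical; they follow the metadata listed in the text.
    prompt_id: str
    prompt_source: str
    chosen: str
    rejected: str            # keep the rejected answer alongside the chosen one
    chosen_model: str
    rejected_model: str
    generation_settings: dict
    reviewer_id: str         # or a reviewer pool name
    rubric_version: str
    timestamp: str           # ISO 8601 string in this sketch
    reason: str

def to_jsonl_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storing the rubric version and the reason next to each label makes it possible to filter or relabel later when the rubric changes.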

Disagreement is also signal. If reviewers often disagree on a category, the problem may be ambiguous, the rubric may be weak, or the desired product behavior may not be decided yet. Do not hide that by averaging everything into a single label too early.
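A simple way to surface disagreement before collapsing labels is to compute, per category, how often reviewers side with the majority on each pair. This is a minimal sketch; the `pair_id`, `category`, and `choice` keys are assumed names, and majority share is only one of several possible agreement measures.

```python
from collections import Counter, defaultdict

def agreement_by_category(labels):
    """Mean majority share per category, over pairs with at least two votes.

    labels: list of dicts with hypothetical keys "pair_id", "category", "choice".
    """
    votes = defaultdict(list)
    category_of = {}
    for row in labels:
        votes[row["pair_id"]].append(row["choice"])
        category_of[row["pair_id"]] = row["category"]

    per_category = defaultdict(list)
    for pair_id, choices in votes.items():
        if len(choices) < 2:
            continue  # a single vote says nothing about agreement
        top_count = Counter(choices).most_common(1)[0][1]
        per_category[category_of[pair_id]].append(top_count / len(choices))

    return {cat: sum(v) / len(v) for cat, v in per_category.items()}
```

Categories with a low score are candidates for a rubric review or a product decision, not for silent majority voting.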

Engineering reality

Preference labels are product decisions captured as data. If product, policy, and engineering do not agree on the rubric, training will turn that confusion into model behavior.

Split by prompt, not by row

Validation and test splits must avoid leakage. If the same prompt appears in train and test with different candidate answers, the model has already seen that request during training, so the test score overstates how well it generalizes. Split by prompt or by conversation source so the model is judged on unseen requests.
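One common way to split by prompt is to derive the split from a hash of the prompt ID, so every row that shares a prompt lands in the same split. The sketch below assumes each row carries a stable `prompt_id`; the function name and the 10/10/80 percentages are illustrative choices.

```python
import hashlib

def split_for(prompt_id, valid_pct=10, test_pct=10):
    """Assign a prompt to "train", "valid", or "test" deterministically.

    Hashing the prompt ID (not the row) keeps all candidate pairs for one
    prompt in the same split, regardless of row order or dataset size.
    """
    digest = hashlib.sha256(prompt_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

To split by conversation source instead, pass the conversation ID in place of the prompt ID. Because the assignment depends only on the ID, adding new rows later does not move existing prompts between splits.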

Keep a frozen, human-readable eval set outside the training pipeline. It should include simple cases, hard cases, policy boundaries, and cases where the old model failed. You will use it again after RLHF, DPO, or GRPO.
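One possible guard for "frozen" is to record a fingerprint of the eval set and check it before each evaluation run. This is a sketch under assumptions: the examples are JSON-serializable dicts with a hypothetical unique `id` key, and the function name is made up for illustration.

```python
import hashlib
import json

def eval_set_fingerprint(examples):
    """Return a SHA-256 hex digest that changes if any example changes.

    examples: list of JSON-serializable dicts, each with a hypothetical "id" key.
    Sorting by id makes the fingerprint independent of file order.
    """
    canonical = json.dumps(
        sorted(examples, key=lambda e: e["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest next to each set of eval results makes it easy to confirm that scores from before and after a tuning run were measured on the same examples.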

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why should preference prompts resemble production traffic?
What makes a pairwise comparison useful rather than obvious?
Why should ties sometimes be discarded?
What metadata should you store with each preference label?

Quick check

Which pair is most useful for preference training?

One perfect answer and one answer full of random tokens
Two plausible answers where one wins for a clear rubric reason
One answer with no comparison

Why split by prompt instead of by row?

To reduce leakage between train and eval
To make the dataset easier to download
To reduce output length