Lesson 03

Labeling and rubrics

A label is a decision frozen in a spreadsheet. Rubrics turn vague taste into rules annotators can follow, disagree on productively, and audit later.

The one idea

Labeling is not data entry. It is policy design with a per-row price tag. Without a rubric and agreement checks, you are paying people to smuggle inconsistent judgment into the training set.

What a rubric actually contains

A rubric answers the questions an annotator will ask at 11pm:

What counts as a valid input? (length limits, language, PII rules)
What are the allowed labels or output formats?
What should happen on ambiguity? (ask a clarifying question, refuse, escalate)
What are positive and negative examples for each edge case?
When should an annotator skip or flag a row?

For preference data, the rubric defines what "better" means: more accurate, more concise, safer, more on-brand. Those are not the same criterion. Mixing them without naming which one wins teaches the model that any of them might win.

Good rubrics include boundary cases. "Refund within 30 days with receipt: approve. After 30 days: escalate. Missing receipt: ask once, then escalate." That is more useful than "be helpful."
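A boundary rule like the refund example is precise enough to write as code, which is a useful test of whether a rubric is complete. The sketch below is illustrative only: the names Decision and refund_label are hypothetical, and it assumes the 30-day rule is checked before the receipt rule, a precedence the example text does not state.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    ASK_FOR_RECEIPT = "ask_for_receipt"
    ESCALATE = "escalate"


def refund_label(days_since_purchase: int, has_receipt: bool, already_asked: bool) -> Decision:
    """Return the label the example refund rubric prescribes."""
    # Assumption: the 30-day rule wins, so a late request escalates
    # whether or not a receipt is present.
    if days_since_purchase > 30:
        return Decision.ESCALATE
    if has_receipt:
        return Decision.APPROVE
    # Missing receipt: ask once, then escalate.
    return Decision.ESCALATE if already_asked else Decision.ASK_FOR_RECEIPT
```

Writing the rule this way exposes the gap a prose rubric can hide: what happens to a late request with no receipt. Whichever answer the team picks, it belongs in the rubric.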

Measuring agreement

If two annotators see the same example, how often do they agree? You measure that before you trust the labels at scale.

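Percent agreement is the simple version: the number of rows where annotators chose the same label, divided by total rows. It overstates quality when one class is rare (everyone agrees "not fraud" on 99% of rows, so agreement looks high even if nobody agrees on the fraud cases).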

Cohen's kappa adjusts for chance agreement. Rough guide: below 0.4 is weak, 0.4–0.6 is moderate, above 0.6 is solid for many text tasks. It is not magic, but a low kappa is a strong signal that the rubric is underspecified.
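To see why the two numbers diverge, here is a minimal from-scratch sketch for two annotators. The function names are hypothetical, and returning 1.0 when chance agreement is exactly 1 is a convention chosen for this sketch, since kappa is undefined in that case.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of rows where both annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    p_o = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label if each
    # annotator labels independently at their own observed rates.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and 1.0 is this sketch's convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# 100 rows, fraud is rare, and the annotators never agree on a fraud row.
annotator_1 = ["ok"] * 98 + ["fraud", "ok"]
annotator_2 = ["ok"] * 98 + ["ok", "fraud"]
agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case percent agreement is 0.98 while kappa comes out at about -0.01, because nearly all of the agreement is what chance alone would produce. For real projects, scikit-learn ships an equivalent as sklearn.metrics.cohen_kappa_score.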

Run a pilot: have two or more annotators label the same 50–100 examples, then review disagreements as rubric bugs, not as "bad annotators." The output of the pilot is an updated rubric, not just a kappa number.

Gold sets

Embed a small set of pre-labeled "gold" rows in every batch. Annotators who drift from gold answers get flagged early. Gold rows also survive rubric updates so you can detect label drift over time.
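A gold-set check can be as small as a per-annotator accuracy count. The sketch below assumes submissions arrive as (annotator, row_id, label) tuples and that gold labels live in a dict; the function name, the 0.9 accuracy threshold, and the minimum of 5 gold rows are all placeholder assumptions, not recommendations.

```python
def flag_drifting_annotators(gold, submissions, min_accuracy=0.9, min_gold_rows=5):
    """Return {annotator: gold accuracy} for annotators below the threshold.

    gold: {row_id: label}
    submissions: iterable of (annotator, row_id, label)
    """
    seen = {}
    correct = {}
    for annotator, row_id, label in submissions:
        if row_id not in gold:
            continue  # ordinary row, not part of the gold set
        seen[annotator] = seen.get(annotator, 0) + 1
        correct[annotator] = correct.get(annotator, 0) + (label == gold[row_id])

    flagged = {}
    for annotator, n in seen.items():
        accuracy = correct[annotator] / n
        # Skip annotators with too few gold rows to judge fairly.
        if n >= min_gold_rows and accuracy < min_accuracy:
            flagged[annotator] = accuracy
    return flagged
```

A flag is a prompt to look at the rows, not a verdict: if several annotators miss the same gold row, the rubric or the gold label is the more likely culprit.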

Resolving disagreement

When annotators disagree, three outcomes are valid:

Rubric gap: The rules were silent. Add a rule and re-label affected rows.
Product gap: The team has not decided what should happen. Escalate to PM or legal, then encode the decision.
Genuine ambiguity: Multiple answers are defensible. Either exclude the row or keep it as a multi-label / soft-target example if your training setup supports that.

Averaging disagreeing labels is usually wrong. If one annotator says "refuse" and another says "answer fully," the model should not learn to do both half the time.
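One way to keep averaging from happening by accident is to make the soft-target path opt-in. In the sketch below, a row only becomes a soft target when a reviewer has explicitly marked it as genuinely ambiguous; everything else with disagreement goes back to review. The function name, the return shape, and the genuinely_ambiguous flag are hypothetical.

```python
from collections import Counter


def resolve_row(votes, genuinely_ambiguous=False):
    """Return ("hard", label), ("soft", {label: weight}), or ("review", None)."""
    if not votes:
        raise ValueError("row has no votes")
    counts = Counter(votes)
    if len(counts) == 1:
        return ("hard", votes[0])
    if genuinely_ambiguous:
        # Only rows a reviewer has ruled ambiguous become soft targets.
        total = len(votes)
        return ("soft", {label: n / total for label, n in counts.items()})
    # Disagreement without a ruling: treat it as a rubric or product gap.
    return ("review", None)
```

The default matters here: a "refuse" versus "answer fully" split lands in the review queue unless someone decides otherwise.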

LLM-assisted labeling

Using a model to draft labels is tempting. It can work as a first pass if a human reviews every row that matters. The failure modes are predictable:

Bias propagation: The labeler inherits the model's errors at scale.
Rubric drift: The model fills gaps with its own taste, not your policy.
False confidence: Fluent wrong answers look "done" and skip human review.

A sane workflow: model proposes, human accepts or edits, disagreements feed rubric updates. For preference pairs, never let the same model rank both answers without a human spot-check on safety and policy slices.
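The propose-then-review loop can be sketched in a few lines. This version assumes model proposals and human decisions are both dicts keyed by row ID; the function name and data shapes are hypothetical. Rows without a human decision are left out of the final set rather than auto-accepted, which is a deliberately conservative assumption.

```python
def review_proposals(proposals, human_labels):
    """Merge model proposals with human review.

    proposals, human_labels: {row_id: label}
    Returns (final_labels, rubric_queue), where rubric_queue lists
    (row_id, proposed, human) for every row the human changed.
    """
    final = {}
    rubric_queue = []
    for row_id, proposed in proposals.items():
        if row_id not in human_labels:
            continue  # unreviewed rows are never auto-accepted
        final[row_id] = human_labels[row_id]  # the human label always wins
        if human_labels[row_id] != proposed:
            rubric_queue.append((row_id, proposed, human_labels[row_id]))
    return final, rubric_queue
```

The rubric queue is the valuable output: each edited row is a candidate rule the rubric did not state clearly enough for the model to follow.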

This matters directly for preference dataset work. Preference tuning amplifies whatever the pairs encode.

Engineering reality

Labeling cost is often $1–$8 per complex dialogue example after review rounds, and $0.10–$0.50 for simple classification at vendor scale. A 10k-row project at $2/row is $20k before infrastructure. Budget agreement pilots and gold-set QA into that line item, not as an afterthought.
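The arithmetic is simple enough to keep in a small helper so the QA overhead stays visible. This is a rough sketch under stated assumptions: pilot and gold rows cost the same per label as ordinary rows, every pilot annotator labels every pilot row, and the pilot size and 5% gold fraction in the example are placeholders rather than recommendations.

```python
def labeling_budget(rows, cost_per_row, pilot_rows=0, pilot_annotators=2, gold_fraction=0.0):
    """Estimate labeling spend, broken down by main labeling, pilot, and gold QA."""
    main = rows * cost_per_row
    # Each pilot row is labeled once by every pilot annotator.
    pilot = pilot_rows * pilot_annotators * cost_per_row
    # Gold rows embedded in batches are paid labels that add no new training rows.
    gold = rows * gold_fraction * cost_per_row
    return {"main": main, "pilot": pilot, "gold": gold, "total": main + pilot + gold}


budget = labeling_budget(10_000, 2.00, pilot_rows=100, pilot_annotators=2, gold_fraction=0.05)
```

With those placeholder inputs the main labeling is $20,000, the pilot adds $400, and gold rows add $1,000, for a total of $21,400.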

Landmark reference

Book

Chip Huyen, Designing Machine Learning Systems (data chapters)

The data collection and labeling chapters are the best single reference for how production teams think about schemas, pipelines, and iteration. Read them for the workflow mental model.

Take from it: How data flows from sources to training, why feedback loops matter, and how to treat datasets as versioned artifacts.

It skips: LLM-specific rubrics, preference pairs, benchmark contamination, and JSONL fine-tuning quirks. That is what the rest of this course covers.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What belongs in a labeling rubric beyond the list of allowed labels?
Why is Cohen's kappa more informative than raw percent agreement?
What should you do when two annotators disagree on a policy-heavy case?
Why is LLM-assisted labeling risky without human review on critical slices?
