Lesson 03

LLM-as-judge: calibration, bias, when it fails

When you cannot unit-test the answer, you may ask another model to grade it. That works often enough to be tempting and badly enough to be dangerous if you skip calibration. This lesson is how to use judges without fooling yourself.

The one idea

An LLM judge is a noisy instrument. Calibrate it against human labels on your task, prefer binary rubrics over vague scales, and never let it be the only grader on high-stakes outcomes.

What LLM-as-judge is good for

Code graders need deterministic truth. Many product questions do not have it: "Did this summary cover the risks?" "Is the refusal appropriate?" "Does the answer stay faithful to the retrieved passages?" Humans can judge these; they do not scale to every CI run.

A judge prompt supplies a rubric, the candidate output, and sometimes reference material. The judge model returns pass/fail or a score. In practice, well-prompted strong models align with human judgment on simple criteria roughly seventy to eighty percent of the time. That is enough to prioritize review and catch obvious failures. It is not enough to treat the judge as ground truth without checking.

Best fits: faithfulness to provided context, coverage of required points, toxicity or policy violations, pairwise preference between two drafts, and triage ("send to human review" vs "auto-pass").

The judge has no more authority than you give it. Rubric quality dominates model choice.

Writing rubrics that survive contact with reality

Vague rubrics produce vague scores. "Rate helpfulness 1 to 5" will drift with judge temperature, prompt ordering, and Monday moods. Prefer binary questions: "Does every bullet cite a source ID from the context?" "Does the answer refuse when context is empty?"

Strong rubrics include:

The criterion in one sentence.
One pass example and one fail example from your domain.
Explicit instruction to answer only from supplied context when checking faithfulness.
Instruction to output structured JSON: {"pass": true, "reason": "..."}.

For pairwise comparison ("which draft is better?"), randomize order across runs to fight position bias. Run multiple judges or duplicate with swapped order when the decision gates a release.

Example rubric for a faithfulness check:

Criterion: Every factual claim in the answer must be supported by the retrieved passages provided. If a claim has no support, fail.

Pass example: Answer quotes policy section 4.2; passage 4.2 is in context.

Fail example: Answer states "30-day refund" but no passage mentions refund windows.

Output: JSON with pass boolean and one-sentence reason.
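A minimal sketch of how a rubric like the one above might be assembled into a judge prompt and how the JSON verdict might be parsed defensively. The function names, the passage numbering format, and the error-verdict shape are assumptions for illustration; the actual model call is left out.

```python
import json

RUBRIC = (
    "Criterion: Every factual claim in the answer must be supported by the "
    "retrieved passages provided. If a claim has no support, fail.\n"
    "Pass example: Answer quotes policy section 4.2; passage 4.2 is in context.\n"
    'Fail example: Answer states "30-day refund" but no passage mentions refund windows.\n'
    "Output: JSON with pass boolean and one-sentence reason."
)


def build_judge_prompt(rubric: str, passages: list, answer: str) -> str:
    """Assemble rubric, retrieved passages, and candidate answer into one prompt."""
    # Hypothetical layout: passages are numbered [1], [2], ... in order.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"{rubric}\n\n"
        f"Retrieved passages:\n{context}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        'Respond with JSON only: {"pass": true or false, "reason": "..."}'
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge reply; malformed output becomes an explicit error verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"pass": None, "reason": "unparseable judge output"}
    if not isinstance(data, dict) or not isinstance(data.get("pass"), bool):
        return {"pass": None, "reason": "missing or non-boolean 'pass' field"}
    return {"pass": data["pass"], "reason": str(data.get("reason", ""))}
```

Returning `None` for `pass` on malformed output keeps parse failures from being silently counted as passes or fails; they can be routed to review instead.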

Calibration: trust, but verify

Before a judge runs at scale, build a calibration set: fifty to two hundred examples labeled by human experts. Run the judge blind, without access to the human labels. Measure agreement with standard metrics, not vibes.

Percent agreement. The simplest score: what fraction of cases did judge and human pick the same pass/fail label? Easy to explain to stakeholders. Misleading when the classes are imbalanced: if ninety percent of cases pass, a judge that always says pass gets ninety percent agreement while missing every fail.

Cohen's kappa (κ). Agreement adjusted for chance. κ = 1 is perfect; κ = 0 is chance-level; κ < 0 is worse than chance. Rule of thumb from inter-rater literature: κ ≥ 0.80 is strong, 0.60–0.80 is moderate, below 0.60 is weak. For a release gate, many teams require κ ≥ 0.70 on the calibration set before trusting a judge in CI. Below that, fix the rubric or add human review. Do not scale the judge just because it is cheaper than annotators.
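The two metrics above can be sketched in a few lines of plain Python. This is an illustrative implementation for binary pass/fail labels, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's marginal pass rate. Returning 0.0 when κ is undefined is a choice made here, not a convention.

```python
def percent_agreement(human: list, judge: list) -> float:
    """Fraction of cases where judge and human chose the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa for two raters with boolean (pass/fail) labels."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    h_pass = sum(human) / n  # human marginal pass rate
    j_pass = sum(judge) / n  # judge marginal pass rate
    # Chance agreement: both say pass, or both say fail, independently.
    p_e = h_pass * j_pass + (1 - h_pass) * (1 - j_pass)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# The imbalanced case from the text: 90% of cases pass, judge always says pass.
human_labels = [True] * 90 + [False] * 10
always_pass = [True] * 100
agreement = percent_agreement(human_labels, always_pass)
kappa = cohens_kappa(human_labels, always_pass)
```

On this example the judge reaches ninety percent agreement while κ comes out at zero, which is the gap between the two metrics that the imbalance warning describes.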

Also track precision and recall on the fail class. A judge with high agreement but low recall on failures will green-light the bugs you care about most.
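A sketch of precision and recall with the fail class treated as the positive class. Labels are booleans where `True` means pass; the function name and the zero fallback for empty denominators are assumptions for illustration.

```python
def fail_class_metrics(human: list, judge: list) -> dict:
    """Precision and recall where the 'positive' class is a FAIL verdict."""
    pairs = list(zip(human, judge))
    tp = sum((not h) and (not j) for h, j in pairs)  # real fail, judge said fail
    fp = sum(h and (not j) for h, j in pairs)        # real pass, judge said fail
    fn = sum((not h) and j for h, j in pairs)        # real fail, judge said pass
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


# 90 true passes, 10 true fails; the judge catches only 2 of the 10 fails.
human_labels = [True] * 90 + [False] * 10
judge_labels = [True] * 90 + [False] * 2 + [True] * 8
metrics = fail_class_metrics(human_labels, judge_labels)
```

Here the judge agrees with humans on 92 of 100 cases, yet recall on failures is only 0.2: eight of the ten real failures would be green-lit.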

If agreement is below your bar, fix the rubric before swapping models. Most failures I have seen were rubric ambiguity, not "wrong judge model." When agreement is good on the calibration set, lock the judge prompt version and treat changes like code changes: rerun calibration.

Route low-confidence or disagreement cases to humans. A practical pattern: auto-pass high-confidence passes, auto-fail clear policy violations, queue the middle for review. The judge becomes a filter, not a verdict.
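The routing pattern above might look like the following sketch. It assumes the judge (or a wrapper around it) reports a confidence in [0, 1] and a separate policy-violation flag; both fields and the 0.9 threshold are hypothetical and would need to be set from your own calibration data.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    confidence: float            # assumed: 0.0 to 1.0, reported with the verdict
    policy_violation: bool = False  # assumed: set by a rule or a policy rubric


def route(verdict: Verdict, pass_threshold: float = 0.9) -> str:
    """Return 'auto_fail', 'auto_pass', or 'human_review'."""
    if verdict.policy_violation:
        return "auto_fail"
    if verdict.passed and verdict.confidence >= pass_threshold:
        return "auto_pass"
    # Low-confidence passes and all non-policy fails go to the review queue.
    return "human_review"
```

Only two outcomes are automated; everything else lands with a reviewer, which is what keeps the judge a filter rather than a verdict.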

Suppose five hundred golden cases each need one GPT-4.1 judge call. Typical judge prompt: ~800 input tokens (rubric + context + candidate answer) + ~80 output tokens (JSON verdict). Assuming illustrative prices of $2.50 / 1M input and $10 / 1M output tokens (check your provider's current price list), one call ≈ $0.0028: $0.0020 for input plus $0.0008 for output. Five hundred cases ≈ $1.40 per full judge sweep. That is cheap enough for nightly runs and expensive enough to notice at ten sweeps per day for each of five teams (fifty sweeps, $70/day). Add a second judge for pairwise or panel agreement and double it. This is why lesson 05 tiers suites: code graders on every PR, full judge sweeps on merge or nightly.
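The arithmetic above, written out so the inputs can be swapped for your own token counts and prices. The prices are the illustrative figures from the paragraph, not a quoted rate.

```python
def judge_call_cost(input_tokens: int, output_tokens: int,
                    usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one judge call given per-million-token prices."""
    return (input_tokens / 1_000_000 * usd_per_m_input
            + output_tokens / 1_000_000 * usd_per_m_output)


per_call = judge_call_cost(800, 80, usd_per_m_input=2.50, usd_per_m_output=10.00)
per_sweep = per_call * 500       # five hundred golden cases
per_day = per_sweep * 10 * 5     # ten sweeps per day for each of five teams

print(f"per call:  ${per_call:.4f}")
print(f"per sweep: ${per_sweep:.2f}")
print(f"per day:   ${per_day:.2f}")
```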

Engineering reality

Judges cost money and add latency. Budget them: run full judge suites nightly, use code graders on every PR, sample ten percent of production traffic for judge scoring, not one hundred percent unless the stakes demand it. Cache judge results per (prompt version, input hash, output hash) when outputs are deterministic enough. Track judge spend alongside model spend so it does not quietly exceed the model bill.
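One way to sketch the caching idea: derive a key from the judge prompt version plus hashes of the input and output, and only call the judge on a miss. The in-memory dict and the `judge_fn` callable are stand-ins for whatever store and judge client you actually use.

```python
import hashlib

_judge_cache = {}  # stand-in for a real cache or database table


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(prompt_version: str, model_input: str, model_output: str) -> str:
    """Key on (prompt version, input hash, output hash)."""
    return f"{prompt_version}:{_sha256(model_input)}:{_sha256(model_output)}"


def cached_judge(prompt_version: str, model_input: str, model_output: str, judge_fn):
    """Call judge_fn only when this exact triple has not been judged before."""
    key = cache_key(prompt_version, model_input, model_output)
    if key not in _judge_cache:
        _judge_cache[key] = judge_fn(model_input, model_output)
    return _judge_cache[key]
```

Including the prompt version in the key means a rubric change invalidates old verdicts automatically instead of serving stale ones.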

Known biases and failure modes

Position bias. In pairwise tasks, judges often favor whichever answer appears first. Mitigate by: (1) randomizing or swapping order across runs, (2) running two passes (A-first and B-first) and counting only consistent wins, (3) using length-normalized rubrics so a longer answer in first position does not auto-win. For release decisions, require agreement across both orderings.
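Mitigation (2) can be sketched as follows. The `judge` callable is a hypothetical stand-in that takes the two drafts in display order and returns `"first"` or `"second"`; a real one would wrap a model call.

```python
from typing import Callable, Optional


def consistent_winner(draft_a: str, draft_b: str,
                      judge: Callable[[str, str], str]) -> Optional[str]:
    """Run both orderings; return 'A' or 'B' only if the same draft wins twice."""
    a_first = judge(draft_a, draft_b)  # A shown in first position
    b_first = judge(draft_b, draft_a)  # B shown in first position
    a_won_when_first = a_first == "first"
    a_won_when_second = b_first == "second"
    if a_won_when_first and a_won_when_second:
        return "A"
    if not a_won_when_first and not a_won_when_second:
        return "B"
    return None  # verdict flipped with position: treat as no decision


def position_biased_judge(first: str, second: str) -> str:
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def longer_is_better_judge(first: str, second: str) -> str:
    """Toy judge with a position-independent (if naive) preference."""
    return "first" if len(first) >= len(second) else "second"
```

A purely position-driven judge yields `None` for every pair, so its wins never count; a judge with a stable preference yields the same winner in both orderings.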

Length bias. Longer answers score higher even when wrong. Mitigate with rubrics that penalize unsupported claims regardless of length.

Verbosity and fluency bias. Polished prose hides hallucinations. Mitigate with faithfulness checks tied to context, not style.

Self-preference. Models may favor outputs that are stylistically similar to their own. Mitigate by using a different judge model family than the generator when possible.

Rubric gaming. Optimizing prompts to satisfy the judge without helping users. Mitigate with holdout sets, human audits, and production metrics.

Missing tools. Judges cannot verify database state or API side effects unless you pass that evidence in. A confident "record updated" summary can pass the judge while the DB is untouched. Use code graders for side effects; judges for language-layer claims.

Stale context. If the judge does not see the same retrieval the user saw, faithfulness scores are meaningless. Pass retrieval snippets into the judge prompt or grade at the layer that had the evidence.

When not to use a judge

Deterministic correctness: math, JSON schema, unit tests, permission checks. Use code.
High-stakes domains: legal, medical, financial advice where error cost is asymmetric. Humans sign off.
Subtle brand or taste calls that executives care about. Sample with humans periodically.
Adversarial or safety-critical policies where false passes are unacceptable. Layer rules and humans; do not rely on a single judge pass.

The hybrid that usually ships: code graders for structure and side effects, judges for semantic checks, humans for calibration and edge cases. Anthropic-scale pairwise Elo is overkill for most teams. A weekly annotation queue on the bottom five percent of judge scores gets most of the value.

Judges in the eval pipeline

Slot judges after cheap graders filter obvious junk. Log judge version, rubric hash, verdict, and reason string. When production disagrees with the judge, add that case to the calibration set. Judges improve when failures become labeled data, not when you tweak adjectives in the rubric without measurement.
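A sketch of that ordering: cheap code checks run first, the judge runs only on outputs that survive, and every judge verdict is logged with its version and rubric hash. The check list, `judge_fn`, and the record fields are illustrative assumptions.

```python
import hashlib


def rubric_hash(rubric: str) -> str:
    """Short, stable fingerprint of the rubric text for log records."""
    return hashlib.sha256(rubric.encode("utf-8")).hexdigest()[:12]


def evaluate(output: str, rubric: str, judge_version: str,
             cheap_checks: list, judge_fn) -> dict:
    """Run cheap graders first; call the judge only if all of them pass."""
    for name, check in cheap_checks:
        if not check(output):
            # Obvious junk is rejected without spending a judge call.
            return {"stage": "code", "check": name, "pass": False,
                    "reason": f"failed cheap check: {name}"}
    verdict = judge_fn(rubric, output)
    return {
        "stage": "judge",
        "judge_version": judge_version,
        "rubric_hash": rubric_hash(rubric),
        "pass": verdict["pass"],
        "reason": verdict["reason"],
    }


# Hypothetical cheap checks: non-empty output and a length cap.
CHEAP_CHECKS = [
    ("non_empty", lambda text: bool(text.strip())),
    ("under_2000_chars", lambda text: len(text) <= 2000),
]
```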

Lesson 05 covers gating merges on aggregate pass rate. If the gate uses judges, the gate threshold must be set from calibrated agreement, not from a number that "felt strict."

Pairwise judging and leaderboards

Sometimes the question is not "pass or fail" but "which of two drafts is better?" Pairwise comparisons power internal model shootouts and prompt tournaments. Run them blind, swap order, aggregate with Elo or Bradley-Terry if you have volume. For small teams, a simple win/loss tally across twenty pairs is enough to pick between two prompt candidates.
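For reference, a simple tally and the standard Elo update might be sketched like this. The K-factor of 32 and the starting ratings are conventional defaults chosen for illustration, not values from this lesson; Bradley-Terry fitting is omitted.

```python
from collections import Counter


def tally(winners: list) -> dict:
    """Count wins per candidate from a list of winner labels."""
    return dict(Counter(winners))


def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Small-team version: just count wins across the pairs you judged.
wins = tally(["prompt_v2", "prompt_v1", "prompt_v2", "prompt_v2"])

# Higher-volume version: fold each result into running ratings.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)
```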

Do not confuse pairwise wins with production readiness. A draft can win on style while failing safety checks a binary rubric would catch. Layer pairwise preference under hard policy graders.

Multi-judge panels and disagreement

For borderline content, run two judge prompts or two model families. When they disagree, route to human review and log the disagreement as training signal. Agreement rate between judges on your calibration set predicts how noisy your automated pipeline will feel in production.
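A sketch of a two-or-more judge panel where any disagreement goes to a human, plus the inter-judge agreement rate mentioned above. The return labels are hypothetical.

```python
def panel_decision(verdicts: list) -> str:
    """Unanimous verdicts are automated; any split goes to human review."""
    if not verdicts:
        raise ValueError("panel needs at least one verdict")
    if all(verdicts):
        return "pass"
    if not any(verdicts):
        return "fail"
    return "human_review"


def inter_judge_agreement(judge_a: list, judge_b: list) -> float:
    """Fraction of calibration cases where two judges gave the same verdict."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
```

One minus the agreement rate is a rough estimate of the share of traffic that will land in the review queue, which helps size that queue before launch.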

Keep a "judge of judges" human sample: five percent of automated verdicts reviewed weekly. Drift in judge agreement is often the first sign that your product changed underneath a frozen rubric.

RAG-specific judge pitfalls

Faithfulness judges fail when retrieved context in production differs from what you paste into the judge prompt. Always pass the same chunks the user model saw, or grade retrieval and generation in separate steps. Citation judges need source IDs that match your chunking scheme; if chunks split mid-sentence, a correct quote may look uncited.
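Part of the citation check can be done with a code grader before any judge runs: confirm that every cited ID refers to a chunk the model actually saw. The bracketed `[doc-1]` citation format and the function name are assumptions for illustration; adapt the pattern to your own scheme.

```python
import re

# Assumed citation format: an ID in square brackets, e.g. [doc-1].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_problems(bullets: list, context_ids: set) -> list:
    """Return (bullet, problem) pairs for uncited or unknown-source bullets."""
    problems = []
    for bullet in bullets:
        cited = set(CITATION.findall(bullet))
        if not cited:
            problems.append((bullet, "no citation"))
            continue
        unknown = sorted(cited - context_ids)
        if unknown:
            problems.append((bullet, "unknown source: " + ", ".join(unknown)))
    return problems


seen_chunks = {"doc-1", "doc-2"}  # IDs of the chunks passed to the generator
answer_bullets = [
    "Refunds are processed in 5 days [doc-1]",
    "Shipping is free [doc-9]",
    "Returns are accepted",
]
issues = citation_problems(answer_bullets, seen_chunks)
```

This only verifies that the cited IDs exist in the supplied context; whether the cited chunk actually supports the claim is still a question for the faithfulness judge.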

For fine-tuned models, judges trained on generic web text may disagree with your domain conventions. Recalibrate when you change model family or add a fine-tune, even if the rubric text is unchanged.

Escalate to humans for regulated advice, one-off high-value contracts, novel failure modes the judge has never seen, and any case where the cost of a false pass exceeds the cost of a human review. Route the bottom five percent of judge scores and one hundred percent of policy-flagged outputs to humans until calibration catches up.

Evaluators as traced spans

When judges run in production or CI, log them as EVALUATOR spans with rubric version, candidate hash, verdict, and latency. When a judge disagrees with a human label later, you can trace which rubric version fired without guessing. Treat judge prompts like application code: versioned, reviewed, deployed.
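A minimal sketch of such a span record, independent of any particular tracing library. The dataclass fields mirror the list above; the `EvaluatorSpan` name, the 16-character hash prefix, and the `judge_fn` callable are illustrative assumptions.

```python
import hashlib
import time
from dataclasses import asdict, dataclass


@dataclass
class EvaluatorSpan:
    span_kind: str
    rubric_version: str
    candidate_hash: str
    verdict: bool
    latency_ms: float


def run_traced_judge(candidate: str, rubric_version: str, judge_fn) -> EvaluatorSpan:
    """Time one judge call and package the result as a loggable span."""
    start = time.perf_counter()
    verdict = judge_fn(candidate)  # assumed to return True (pass) or False (fail)
    latency_ms = (time.perf_counter() - start) * 1000
    candidate_hash = hashlib.sha256(candidate.encode("utf-8")).hexdigest()[:16]
    return EvaluatorSpan("EVALUATOR", rubric_version, candidate_hash, verdict, latency_ms)
```

`asdict(span)` gives a plain dict that can be written to whatever log or trace sink you already use.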

Freeze judge temperature and model id in CI the same way you freeze the system under test. A judge that samples creatively is an unreliable gate.

Export calibration disagreements to a spreadsheet reviewers can sort by rubric clause. Patterns in disagreements tell you which bullet in the rubric is ambiguous faster than debating abstract "judge quality."
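The export step might be sketched with the standard `csv` module. It assumes each calibration row already records which rubric clause the judge's reason pointed at; the `rubric_clause` field and the column names are hypothetical.

```python
import csv
import io
from collections import Counter

FIELDS = ["case_id", "rubric_clause", "human_pass", "judge_pass"]


def disagreement_rows(rows: list) -> list:
    """Keep only cases where judge and human differ, sorted by rubric clause."""
    diffs = [r for r in rows if r["human_pass"] != r["judge_pass"]]
    return sorted(diffs, key=lambda r: r["rubric_clause"])


def disagreements_csv(rows: list) -> str:
    """Render the disagreements as CSV text for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(disagreement_rows(rows))
    return buffer.getvalue()


def clause_counts(rows: list) -> Counter:
    """How many disagreements each rubric clause accounts for."""
    return Counter(r["rubric_clause"] for r in disagreement_rows(rows))
```

A clause that dominates the counts is usually the first candidate for a rewrite or an extra pass/fail example.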

Checkpoint

You are ready for the next lesson if you can answer these from memory:

What must you do before trusting a judge at scale?
What is Cohen's kappa and when is it too low to gate releases?
Why are binary rubrics usually better than 1-to-5 scales?
Name two judge biases and one mitigation for each.
When should a code grader replace a judge entirely?

Quick check

Deploy to CI immediately; kappa is optional
Stop trusting the judge for merge gates; tighten rubric and recalibrate
Switch judge model only; skip rubric changes
Remove human labels to raise agreement

Immediately switch to the largest available model
Tighten the rubric with pass/fail examples and re-run calibration
Ship it; 62% is fine
Stop using humans entirely

Randomizing or swapping answer order across runs
Setting temperature to zero only
Using longer rubrics without examples
Always using the same model as the generator

LLM judge on the final message only
Code grader that queries the database for the expected row state
Human reading the agent's summary
Pairwise preference against last week's answer

Whether 847 × 19 equals 16093
Whether an answer's claims are supported by retrieved context
Whether JSON matches a fixed schema
Whether medical dosing advice is correct for a patient