Lesson 05

Deduplication and contamination

Exact duplicates waste compute. Near-duplicates distort metrics. Train-eval overlap is the quietest way to ship a model you cannot trust. Check before you train.

The one idea

Deduplication saves money. Contamination checks save your reputation. If eval prompts appear in training data, your metrics measure memorization, not generalization.

Exact vs near duplicates

Exact duplicates are identical strings (or identical hashes). Easy to catch with a hash set. Common sources: exporting the same ticket twice, merging datasets without keys, or synthetic generation loops.
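
A minimal sketch of the hash-set approach, assuming each row is a dict with `prompt` and `completion` string fields. The field names and the normalization (sorted keys, collapsed whitespace) are illustrative choices, not a fixed standard.

```python
import hashlib
import json

def content_hash(row: dict) -> str:
    """Hash a row's content after light normalization (assumed scheme)."""
    # Collapse runs of whitespace in string fields; leave other values alone.
    normalized = {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in row.items()
    }
    # sort_keys makes the hash independent of field order.
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def drop_exact_duplicates(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, dropped), keeping the first row seen for each hash."""
    seen: set[str] = set()
    kept, dropped = [], []
    for row in rows:
        digest = content_hash(row)
        if digest in seen:
            dropped.append(row)
        else:
            seen.add(digest)
            kept.append(row)
    return kept, dropped

rows = [
    {"prompt": "Reset my password", "completion": "Go to Settings."},
    # Same content as the first row, with different field order and spacing.
    {"completion": "Go to Settings.", "prompt": "Reset  my password "},
    {"prompt": "Cancel my plan", "completion": "Open Billing."},
]
kept, dropped = drop_exact_duplicates(rows)
print(len(kept), len(dropped))  # 2 1
```

Hashing after normalization is what catches the "same ticket exported twice" case when the two exports differ only in spacing or key order.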

Near duplicates are paraphrases, truncated copies, or boilerplate with small edits. They matter because the model still sees the repeated signal, and because train-eval overlap can be partial rather than exact.

Large pretraining corpora like The Pile showed how much web text repeats. Training on near-dupes wastes tokens and can overweight spammy templates. The same logic applies at fine-tune scale with smaller files.

MinHash and LSH in plain terms

You cannot compare every row to every other row at scale. That is O(n²): a million-row set has roughly 500 billion pairs. MinHash compresses each document into a short fingerprint (a signature) so that similar documents get similar fingerprints.

Turn each document into a set of shingles (overlapping word or character n-grams).
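Hash every shingle with many different hash functions; for each function, keep the minimum value over the document's shingles. The list of minimums is the MinHash signature.
Compare signatures: the fraction of positions where two signatures agree estimates the Jaccard similarity between the shingle sets.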
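
The three steps can be sketched in pure Python. This is an illustration under stated assumptions, not how any particular library does it: word 3-grams as shingles, 128 hash functions, and a salted `blake2b` standing in for a hash family.

```python
import hashlib

NUM_HASHES = 128  # assumed signature length; more hashes tighten the estimate

def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping word k-grams from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def seeded_hash(seed: int, shingle: str) -> int:
    """One member of the hash family: a 64-bit hash salted with the seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set: set[str]) -> list[int]:
    """For each hash function, keep the minimum value over all shingles."""
    return [
        min(seeded_hash(seed, s) for s in shingle_set)
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

doc_a = "please reset my password because I forgot it yesterday"
doc_b = "please reset my password because I forgot it today"
doc_c = "how do I cancel my subscription before the renewal date"

set_a, set_b, set_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (set_a, set_b, set_c))

# doc_a and doc_b share 6 of 8 distinct shingles, so exact Jaccard is 0.75.
print("exact a/b:    ", exact_jaccard(set_a, set_b))
print("estimated a/b:", estimated_jaccard(sig_a, sig_b))
print("estimated a/c:", estimated_jaccard(sig_a, sig_c))
```

The estimate is noisy. With 128 hashes it usually lands within a few hundredths of the exact value, which is enough to separate "probably a near-duplicate" from "unrelated".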

Locality-Sensitive Hashing (LSH) buckets similar signatures together so you only compare candidates inside the same bucket, not the whole corpus.
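
One common way to build those buckets is banding: cut each signature into bands and bucket documents by each band's values. The sketch below assumes 32 bands of 4 rows over a 128-value signature and reuses a salted `blake2b` as the hash family. Both are illustrative choices, and the helper names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BANDS, ROWS = 32, 4  # assumed split of a 128-value signature

def signature(text: str, num_hashes: int = BANDS * ROWS, k: int = 3) -> list[int]:
    """MinHash signature over word k-grams (compact version for this sketch)."""
    words = text.lower().split()
    grams = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for g in grams
        ))
    return sig

def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """Pairs of doc IDs that share at least one identical band."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        sig = signature(text)
        for band in range(BANDS):
            chunk = tuple(sig[band * ROWS:(band + 1) * ROWS])
            # The band index is part of the key so bands never mix.
            buckets[(band, chunk)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

docs = {
    "t1": "my order arrived damaged and I would like a full refund please",
    "t2": "my order arrived damaged and I would like a full refund today",
    "t3": "how do I change the email address on my account settings page",
}
pairs = candidate_pairs(docs)
print(pairs)
```

Only pairs that collide in some band are compared further. More rows per band makes the filter stricter; more bands makes it more forgiving. With b bands of r rows, the similarity at which pairs start to collide reliably is roughly (1/b)^(1/r), about 0.42 for these settings.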

You do not need to implement this from scratch. Libraries like datasketch in Python, or deduplication tooling in the Hugging Face ecosystem, apply the same idea. What you need is the habit: run dedup on the training pool before the split, and again after merging new batches.

Near-duplicates land in the same bucket so you can flag or drop them without comparing every pair.

Contamination and benchmark leakage

Contamination means eval or benchmark content influenced training. Classic cases:

MMLU or other benchmark questions appearing in pretraining or fine-tuning corpora.
Your golden eval prompts copied into the JSONL training file.
Held-out test rows that leaked through a random split on duplicate-heavy data (same prompt in train and val because dedup ran after the split).

Order matters: deduplicate first, then split. If you split first, near-duplicates of the same user question can land in both train and validation. Offline loss looks great. Production does not.
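
A small sketch of the ordering, assuming rows are dicts with a `prompt` field and that a lowercased, whitespace-collapsed prompt is the dedup key. It only handles exact matches after normalization; near-duplicates still need the MinHash pass before the split.

```python
import random

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def dedup_then_split(rows, val_fraction=0.2, seed=13):
    """Deduplicate on the normalized prompt first, then split."""
    # Step 1: keep the first row for each normalized prompt.
    unique = {}
    for row in rows:
        unique.setdefault(normalize(row["prompt"]), row)
    deduped = list(unique.values())

    # Step 2: shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(deduped)
    n_val = max(1, int(len(deduped) * val_fraction))
    return deduped[n_val:], deduped[:n_val]

# 200 rows, but only 50 distinct prompts (each repeated four times).
rows = [{"prompt": f"Question {i % 50}", "answer": str(i % 50)} for i in range(200)]

train, val = dedup_then_split(rows)
train_keys = {normalize(r["prompt"]) for r in train}
val_keys = {normalize(r["prompt"]) for r in val}
print(len(train), len(val), len(train_keys & val_keys))  # 40 10 0
```

Splitting the 200 raw rows at random instead would almost certainly put copies of the same prompt on both sides, which is exactly the leak described above.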

For fine-tuning, also check whether eval prompts appear as substrings in training completions. Partial overlap still leaks signal.
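
A rough scan for this, assuming training rows with `prompt` and `completion` fields. It flags a full substring match first, then falls back to shared word 8-grams to catch partial overlap. The window size of 8 is an arbitrary starting point, and punctuation stays attached to words, so treat hits as candidates for review rather than a verdict.

```python
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def ngrams(text: str, n: int = 8) -> set[str]:
    """Word n-grams of normalized text; empty if the text is shorter than n words."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contamination(train_rows, eval_prompts, n: int = 8):
    """Return (row_index, field, match_type, eval_prompt) for each hit."""
    prepared = [(p, normalize(p), ngrams(p, n)) for p in eval_prompts]
    hits = []
    for idx, row in enumerate(train_rows):
        for field in ("prompt", "completion"):
            text = normalize(row.get(field, ""))
            text_grams = ngrams(text, n)
            for original, norm, grams in prepared:
                if norm and norm in text:
                    hits.append((idx, field, "substring", original))
                elif grams & text_grams:
                    hits.append((idx, field, "ngram", original))
    return hits

eval_prompts = ["What is the refund window for annual plans purchased in the EU?"]
train_rows = [
    {"prompt": "Summarize the ticket",
     "completion": "The customer asked: what is the refund window for annual "
                   "plans purchased in the EU? We replied 14 days."},
    {"prompt": "Tell me what is the refund window for annual plans purchased in Canada",
     "completion": "30 days."},
    {"prompt": "How do I change my email?", "completion": "Open Profile."},
]

hits = find_contamination(train_rows, eval_prompts)
for idx, field, kind, _ in hits:
    print(idx, field, kind)
# 0 completion substring
# 1 prompt ngram
```

This loop is quadratic in rows times eval prompts, which is fine for a golden set of a few hundred prompts. At larger scale, the same MinHash and LSH machinery applies.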

Engineering reality

A fine-tune looked strong on an internal golden set until someone noticed 12% of eval prompts appeared verbatim in training data. After removal, accuracy dropped 18 points on the hard slice. The model had not generalized; it had seen the test. Read "Prepare a fine-tuning dataset" and "Task evals and golden sets" with this lesson in mind.

Worked flow: 10k rows to train-ready

Ingest JSONL, validate schema.
Normalize text (strip whitespace, canonicalize field order).
Drop exact dupes on a content hash.
Run near-dup detection (MinHash/LSH or embedding clustering at smaller scale).
Remove rows that overlap your golden eval set (exact and high-similarity).
Split train / val / test with stratification on label or slice.
Log counts at each step. If you remove 30% as dupes, your "10k examples" were never 10k.

Store the dedup manifest: which IDs were dropped and why. You will need it when someone asks why row counts changed between versions.
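
A sketch of what per-step counts and a manifest could look like, assuming every row carries an `id` and a `prompt`. The reason labels and manifest fields are hypothetical, and the near-dup step is left out to keep the example short.

```python
import json

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def run_pipeline(rows, eval_prompts):
    """Return (clean_rows, manifest, counts) for a two-step cleaning pass."""
    manifest = []  # one entry per dropped row: which ID, and why
    counts = {"ingested": len(rows)}

    # Step 1: exact dedup on the normalized prompt.
    first_seen = {}
    kept = []
    for row in rows:
        key = normalize(row["prompt"])
        if key in first_seen:
            manifest.append({
                "id": row["id"],
                "reason": "exact_duplicate",
                "duplicate_of": first_seen[key],
            })
        else:
            first_seen[key] = row["id"]
            kept.append(row)
    counts["after_exact_dedup"] = len(kept)

    # Step 2: drop rows whose prompt matches a golden eval prompt.
    eval_keys = {normalize(p) for p in eval_prompts}
    clean = []
    for row in kept:
        if normalize(row["prompt"]) in eval_keys:
            manifest.append({"id": row["id"], "reason": "eval_overlap"})
        else:
            clean.append(row)
    counts["after_eval_overlap"] = len(clean)
    return clean, manifest, counts

rows = [
    {"id": "r1", "prompt": "Reset my password"},
    {"id": "r2", "prompt": "reset my  password"},
    {"id": "r3", "prompt": "Cancel my plan"},
    {"id": "r4", "prompt": "What is the refund window?"},
    {"id": "r5", "prompt": "Cancel my plan"},
]
clean, manifest, counts = run_pipeline(rows, ["What is the refund window?"])

print(counts)  # {'ingested': 5, 'after_exact_dedup': 3, 'after_eval_overlap': 2}
manifest_jsonl = "\n".join(json.dumps(entry) for entry in manifest)
print(manifest_jsonl)
```

Writing the manifest as JSONL next to the dataset version makes the "why did the row count change" question a lookup instead of an investigation.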

Pre-flight checklist

Before you submit a training job

Run these checks:

Exact dedup on normalized text.
Near-dup pass on training pool.
Overlap scan against golden eval and public benchmarks you care about.
Split only after dedup.
Document row counts and hash of final train file.
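
Recording row counts and a hash of the final train file can be sketched as follows, assuming a JSONL file with one example per line. The file name and the record layout are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path

def fingerprint_file(path) -> dict:
    """SHA-256 of the file bytes plus a count of non-empty lines."""
    digest = hashlib.sha256()
    rows = 0
    with open(path, "rb") as handle:
        for line in handle:  # streams line by line, so large files are fine
            digest.update(line)
            if line.strip():
                rows += 1
    return {"path": str(path), "sha256": digest.hexdigest(), "rows": rows}

# Demo on a throwaway file; in practice, point this at the real train file.
content = b'{"prompt": "a"}\n{"prompt": "b"}\n'
with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train.jsonl"
    train_path.write_bytes(content)
    record = fingerprint_file(train_path)

print(record["rows"], len(record["sha256"]))  # 2 64
```

Store the record with the training job config. If the hash differs between two runs, the data changed, whatever the file name says.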

Easy to skip:

Checking completions, not just prompts.
Deduping after the split.
Ignoring synthetic batches merged in at the last minute without re-running overlap checks.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why deduplicate before train/val/test splits?
What problem does MinHash + LSH solve at scale?
How can fine-tune evals inflate without exact test strings in training data?
What should you log when rows are dropped as duplicates?
