Lesson 06

Dataset operations

Models ship. Data drifts. Without versioning, lineage, and slice tracking, you cannot reproduce a result, roll back safely, or explain an incident to compliance.

The one idea

A dataset version is a release artifact. Tie it to the model version, the eval set, and the filters that produced it. When behavior changes, you need to diff data, not guess.

Versioning basics

At minimum, every training snapshot needs:

A content hash (SHA-256 of the normalized file or manifest of row IDs)
Row count and byte size after dedup
Git commit or pipeline run ID that built it
Filter and dedup config (thresholds, blocklists, date ranges)
Parent version if this is a delta on top of an earlier set

Tools like DVC, LakeFS, or plain object storage with immutable prefixes all work. The habit matters more than the brand: never overwrite train.jsonl in place. Write train-v2026-06-29-a3f9.jsonl or store by hash.

For LLM fine-tuning, a million tokens of JSONL is often 2–8 MB compressed. Storage is cheap. Debugging an unversioned file is not.

Dataset cards and lineage

A dataset card documents what is inside and how it was built. Google's Data Cards playbook and Hugging Face dataset READMEs follow the same idea: provenance, collection process, known biases, sensitive fields, and intended use.

Include:

Sources (logs, vendors, synthetic generator + model version)
Labeling rubric version and agreement stats from pilots
Dedup method and how many rows were removed
Known gaps ("weak on non-English," "no medical cases")
License and retention constraints

Lineage links upstream and downstream: raw export → filtered → labeled → deduped → split → training run → deployed model ID. When a user reports a bad answer, you should trace which data version taught that behavior.

Privacy and compliance review

Before a dataset enters training or a shared bucket:

Scan for PII, secrets, and internal identifiers.
Confirm consent and retention policy for logged data.
Restrict access (who can download, who can add rows).
Document redaction rules (what was stripped, what was generalized).

Privacy review is not a one-time gate. A refresh that pulls new logs needs the same checks. Automated scanners catch obvious emails and API keys; policy review catches contextual leaks (customer names in free text, account numbers in ticket subjects).

Engineering reality

Incident pattern: model v3 regressed on billing. Team diffed prompts and weights, then discovered dataset v3 accidentally included pre-redaction exports. Rollback meant redeploying model v2 and pinning data v2. Without version links, that rollback takes days instead of hours.

Refresh cadence and slice tracking

Production traffic shifts. Plan refreshes on a schedule tied to risk:

High-risk domains (finance, health, safety): review failures weekly, full refresh monthly or on policy change.
Stable domains: quarterly refresh plus ad-hoc batches from active learning.

Track slices in metadata: intent, locale, source, labeler, synthetic flag. Eval dashboards should read the same slice keys. When aggregate accuracy is flat but the "refund_escalation" slice drops, you want to see it before users do.

Feed production failures back through the pipeline from lesson 02: filter, label, dedup, version, train, eval, deploy. Close the loop or the model freezes in time.

Linking data version to model version

Your model card or release notes should list:

Base model checkpoint
Training data hash and dataset card link
Eval set version and pass/fail per slice
Training config (LR, epochs, LoRA rank if applicable)

In production, store the data version in the same config object as the model endpoint. When you A/B test, you are often testing model + data + prompt together. Name that bundle explicitly.

Rollback habit

Keep the previous model checkpoint and the previous dataset version deployable for at least one release cycle. Data bugs are as common as weight bugs.

Where this course goes next

You now have the full data path: what counts as a dataset, how to label it, when to use synthetic data, how to deduplicate and check contamination, and how to operate datasets in production.

Next stops in the curriculum:

Prepare a fine-tuning dataset applies this to JSONL and training jobs.
Task evals and golden sets defines how you measure before you ship.
Build preference datasets when alignment needs pairwise judgment.

Checkpoint

You're ready to move on if you can answer these from memory:

What belongs in a minimum dataset version record?
Why is a dataset card more than documentation fluff?
When should privacy review run again after launch?
What should a model release note link to from the data side?

Quick check

To compress the JSONL file
To reproduce and verify exactly which data trained a given model
To replace access control lists

Track and evaluate performance per product workflow or segment
Speed up tokenization
Eliminate the need for deduplication

GPU utilization reporting
Reproducibility and incident rollback
JSONL parsing in the trainer