Lesson 01

Why data quality dominates

A model does not learn what you meant. It learns from the examples, labels, omissions, duplicates, and weird edge cases you put in front of it. Data quality is not prep work. It is the product getting written down.

The one idea

The dataset is the strongest product spec a model ever sees. If the data is noisy, biased, stale, duplicated, or mislabeled, the model will treat those mistakes as instructions.

Data is behavior, not just input

Developers often talk about data as if it were fuel: pour in more and the model gets better. That metaphor is too generous. Data is closer to the source code for the behavior you want, except the compiler is statistical and the bugs are harder to spot.

In supervised learning, every example says, "when the input looks like this, prefer an output like that." In preference tuning, every pair says, "this answer is better than that one." In retrieval systems, every indexed document says, "this is allowed to enter the model's context." Even when you are not training weights, data still defines what the system can see, repeat, cite, and optimize for.

This is why data quality dominates. Better model architecture helps. More compute helps. Clever prompting helps. But if the examples teach inconsistent behavior, the model will learn inconsistency. If the labels reward shortcuts, the model will find shortcuts. If the eval set leaks into the training set, you will think the model improved when it mostly memorized the test.

The pipeline is full of product decisions

A dataset is not found in the wild fully formed. Someone decides what to collect, what to exclude, how to label it, how to split it, how long to keep it, and which examples are too risky to use. Those are product decisions with a spreadsheet interface.

For a support assistant, a label might encode whether the model should answer directly, ask for clarification, escalate to a human, or refuse. For a code agent, data might encode when to edit files, when to run tests, and when to stop. For a medical or finance workflow, data might encode what kind of uncertainty must be surfaced instead of hidden under a polished answer.

When those decisions are implicit, the model gets a messy lesson. One annotator writes terse answers. Another writes long explanations. One example cites policy. Another guesses. One example refuses a dangerous request. Another tries to be helpful anyway. The model sees all of it as signal unless you clean the contradiction.

  • Collect: logs, docs, cases
  • Filter: remove junk and risk
  • Label: define good behavior
  • Split: train, validation, test
  • Train or tune: weights absorb patterns
  • Evaluate: measure by slice
  • Production: new failures feed back

Every stage changes model behavior. Treat each one like an engineering decision.

Data quality is not one cleanup step. It is a chain of choices, and each choice can become visible model behavior later.

Quality beats volume when signal is scarce

Large models made "more data" famous, but product teams usually do not have internet-scale data. They have tickets, call transcripts, documents, examples from internal tools, eval failures, and a small pile of hand-written cases. In that setting, quality matters more than raw count.

A thousand examples with consistent labels can beat fifty thousand noisy ones. The smaller dataset teaches the decision boundary clearly. The noisy dataset teaches the model to average contradictions. You see this in classification tasks, extraction tasks, tool calling, customer support tone, and refusal behavior. The model is not stubborn. The training signal is.

Quality also means coverage. A dataset full of easy happy-path examples trains a model that looks fine in demos and falls apart on real traffic. You need the boring middle, the frequent edge cases, and the failure cases users actually hit. Rare but expensive cases deserve deliberate sampling because random sampling may almost never pick them.
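
One way to make "deliberate sampling" concrete is to reserve a floor for every slice before filling the rest of the budget at random. The sketch below assumes each example is a dict with a hypothetical `slice` field; the function name and parameters are illustrative, not a standard API.

```python
import random
from collections import defaultdict

def sample_with_slice_floor(examples, total, min_per_slice, seed=0):
    """Pick about `total` examples, keeping up to `min_per_slice` from every slice."""
    rng = random.Random(seed)
    by_slice = defaultdict(list)
    for i, ex in enumerate(examples):
        by_slice[ex["slice"]].append(i)  # "slice" is an assumed field name

    # First pass: reserve a floor for each slice, however rare it is.
    picked = []
    for name in sorted(by_slice):
        idxs = by_slice[name]
        picked.extend(rng.sample(idxs, min(min_per_slice, len(idxs))))

    # Second pass: fill the remaining budget uniformly from what is left.
    chosen = set(picked)
    rest = [i for i in range(len(examples)) if i not in chosen]
    remaining = max(0, total - len(picked))
    picked.extend(rng.sample(rest, min(remaining, len(rest))))
    return [examples[i] for i in picked]
```

With 1,000 happy-path rows and 3 rows from a rare slice, a purely random draw of 50 would usually miss the rare slice entirely. The floor keeps all 3. If the floors alone exceed `total`, this sketch returns more than `total` rows rather than dropping a slice.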

Useful habit

Describe each dataset slice in product language: "refund escalation," "ambiguous medical symptom," "malformed JSON repair," "out-of-policy request." If you cannot name the slice, you probably cannot measure whether the model handles it.

Three dimensions you can actually measure

Data quality sounds vague until you name what you are checking. Three dimensions show up in almost every AI project:

  • Completeness: Does every example have the fields the model needs? Missing labels, truncated context, or empty assistant turns teach the model that gaps are normal.
  • Consistency: Do similar inputs get similar treatment? If two annotators label the same refund case differently, the model learns that either answer is fine.
  • Representativeness: Does the dataset match the traffic you will serve? A set built from internal testers will miss the messy phrasing real users bring.

You do not need a perfect score on each axis. You need to know where you are weak before you train. A dataset that is complete but unrepresentative will pass schema checks and still fail in production.
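
Completeness and consistency are the two dimensions that are easiest to check mechanically. The sketch below assumes rows are dicts with hypothetical `input` and `label` string fields; your schema will differ. Representativeness is not covered here because it needs a sample of production traffic to compare against.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("input", "label")  # hypothetical schema, adjust to yours

def incomplete_rows(rows):
    """Return indices of rows with a missing or blank required field."""
    bad = []
    for i, row in enumerate(rows):
        if any(not str(row.get(f) or "").strip() for f in REQUIRED_FIELDS):
            bad.append(i)
    return bad

def conflicting_labels(rows):
    """Return normalized inputs that appear with more than one label."""
    labels = defaultdict(set)
    for row in rows:
        # Lowercase and collapse whitespace so trivial variants compare equal.
        key = " ".join(str(row.get("input") or "").lower().split())
        if key and row.get("label"):
            labels[key].add(row["label"])
    return {k: sorted(v) for k, v in labels.items() if len(v) > 1}
```

The consistency check only catches inputs that match after light normalization. Paraphrased duplicates with different labels need a fuzzier comparison.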

Named failure modes

These show up often enough that it helps to recognize them by name:

  • Contaminated eval: Benchmark questions or golden-set prompts appear in pretraining or fine-tuning data. Leakage of MMLU-style benchmark questions has been reported to make several open models look stronger than they were on academic tests that were supposed to be held out. The fix is overlap checks, not a bigger model.
  • Label drift: Policy changes in April but half your examples still encode March rules. Aggregate eval scores stay flat while users hit the exact cases that changed.
  • Synthetic collapse: A team generates 50k examples from one teacher model with one prompt template. Variety drops, errors compound, and the student model inherits the teacher's blind spots at scale.
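
A rough way to notice the loss of variety behind synthetic collapse is a distinct n-gram ratio: the share of word n-grams in a batch that are unique. This is a minimal sketch with whitespace tokenization, and the function name is illustrative. Any threshold for "too low" is something you would have to calibrate on your own data.

```python
def distinct_n(texts, n=2):
    """Fraction of word n-grams that are unique across all texts (1.0 means no repeats)."""
    total, unique = 0, set()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            unique.add(tuple(words[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Comparing the ratio for a synthetic batch against a sample of real user inputs is more informative than reading the number on its own.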

Later lessons in this course cover labeling (L03), synthetic data (L04), and deduplication (L05). Before you fine-tune, read "Prepare a fine-tuning dataset" with those lessons in mind.

Bad data hides as good metrics

Dataset problems often look like model wins until you inspect the source. Duplicates can make a model look better because the same pattern appears in training and validation. Contamination can put test examples into the training set. Label leakage can give away the answer through a filename, timestamp, category ID, or prompt phrase that never appears in production.
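
A basic overlap check between splits can be sketched with normalized hashes. The helper names below are illustrative. This version only catches exact and near-exact copies (case, punctuation, and spacing differences), so paraphrased leaks would need n-gram or embedding comparison on top.

```python
import hashlib
import re

def fingerprint(text):
    """Hash text after lowercasing and collapsing punctuation and whitespace."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_leaks(train_texts, eval_texts):
    """Return eval texts whose fingerprint also appears in the training set."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_prints]
```

Running the same check on a split against itself, by counting fingerprints that occur more than once, gives a first pass at duplicate detection.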

Another common failure is stale truth. If policies changed in April but half the examples encode the March rule, the model learns a blended policy. It may pass examples from both eras in a loose eval, then behave unpredictably when a real user asks the exact thing that changed.
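
If each example records when it was labeled, stale truth becomes a query instead of a surprise. The sketch below assumes hypothetical `slice` and `labeled_on` fields and a hand-maintained mapping from slice to the date its current policy took effect.

```python
from datetime import date

def stale_examples(rows, policy_effective):
    """Return rows labeled before the current policy for their slice took effect."""
    stale = []
    for row in rows:
        cutoff = policy_effective.get(row["slice"])  # None if the slice has no dated policy
        if cutoff is not None and row["labeled_on"] < cutoff:
            stale.append(row)
    return stale
```

Flagged rows are candidates for relabeling or removal, not automatic deletions. Some labels written under the old policy may still be correct under the new one.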

There is also taste drift. A dataset created by one team may reward being concise. A later batch may reward being warm and verbose. Neither is universally wrong, but mixing them without a rubric teaches the model that either behavior is acceptable. Then engineers try to fix the inconsistency with a longer prompt, even though the contradiction came from the examples.

Dataset smell

If the model gets strong aggregate scores but fails the same customer-visible slice repeatedly, stop tuning prompts for a minute. Check whether that slice is missing, mislabeled, duplicated, stale, or leaking into the wrong split.
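
A small sketch of the check described above: compute accuracy per slice next to the aggregate, so a weak slice cannot hide behind a strong average. The input shape, pairs of slice name and correctness, is an assumption for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs.

    Returns (overall_accuracy, {slice_name: accuracy}).
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for name, ok in results:
        totals[name] += 1
        correct[name] += int(ok)
    if not totals:
        raise ValueError("no results to score")
    per_slice = {name: correct[name] / totals[name] for name in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_slice
```

For example, 90 correct general cases plus a refund-escalation slice with 2 of 10 correct gives an overall score of 0.92, while the slice itself sits at 0.2.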

Labels are compressed judgment

A label is never just a field in a CSV. It is compressed human judgment. For a classifier, it says which category matters. For a preference pair, it says which answer better fits the product. For a safety dataset, it says what the system should refuse, redirect, or answer carefully. For a tool-calling dataset, it says what action the model should take and which arguments are valid.

Because labels carry judgment, labeling needs a rubric. A rubric turns "good answer" into observable criteria: cite the source, ask one clarifying question when required, refuse requests for credentials, use the correct schema, do not invent policy, escalate when confidence is low. Without that, labelers fill the gap with personal taste.

Rubrics also make disagreements useful. If two reviewers disagree, the goal is not to average them. The goal is to find the missing rule. Maybe the task needs a new category. Maybe the prompt is underspecified. Maybe the product team has not decided what should happen. Data work exposes those gaps early.
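
To make disagreements visible, it helps to compute an agreement score and, more importantly, list the items reviewers split on. The sketch below uses Cohen's kappa for two annotators who labeled the same items. The function names are illustrative.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def disagreements(items, labels_a, labels_b):
    """Return (item, label_a, label_b) for every item the annotators split on."""
    return [(x, a, b) for x, a, b in zip(items, labels_a, labels_b) if a != b]
```

The kappa value is only a summary. The list from `disagreements` is what you bring to the rubric discussion, because each entry points at a rule that may be missing.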

Data operations matter after launch

The dataset is not finished when the model ships. Real usage changes the distribution. Users discover prompts the team did not imagine. Policies change. New products appear. Attack patterns move. A dataset that was good in January can be stale by June.

Good teams treat datasets like versioned artifacts. They know which model was trained on which data version, which filters ran, which examples were excluded, which eval set guarded the release, and which production failures fed the next refresh. This does not require a huge platform at the start. It requires the discipline to avoid anonymous files named final_v7_cleaned_really.csv.
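
A lightweight starting point, short of a full platform, is a manifest written next to every dataset export. The sketch below derives the version from a content hash, so the same rows always get the same identifier regardless of row or key order. The manifest fields are assumptions chosen to mirror the list above.

```python
import hashlib
import json

def dataset_manifest(rows, filters, excluded_ids):
    """Build a manifest whose version is a content hash of the rows."""
    # Canonical JSON per row, then sort, so ordering does not change the hash.
    canonical = sorted(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return {
        "version": digest[:12],        # short content-derived identifier
        "num_rows": len(rows),
        "filters": list(filters),      # names of the filters that ran
        "excluded_ids": sorted(excluded_ids),
    }
```

Storing the manifest version alongside each trained model and eval run is enough to answer "which data produced this behavior" later.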

Engineering reality

The expensive part of data is usually not storage. It is review time, privacy handling, stale labels, unclear ownership, and the cost of debugging behavior that should have been caught as a dataset issue. Data versioning and slice-level evals pay for themselves the first time a release looks good on average but breaks one important workflow.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • Why is a dataset closer to a product spec than a pile of inputs?
  • How can duplicates or contamination make a model look better than it is?
  • Why do labels need a rubric instead of only examples?
  • What should change in your process after a model reaches production?
