Lesson 02

What counts as a dataset?

A dataset is any curated collection of examples the system is allowed to learn from, retrieve, or measure against. Logs, labels, eval sets, and indexed docs are all datasets. They just teach different things.

The one idea

Not every pile of text is a training set. Name what each collection is for: teaching weights, grounding answers, or measuring quality. Mixing those roles is how leakage and stale truth sneak in.

Four dataset roles in a real product

Most AI products juggle more than one kind of data at once:

  • Training data: Examples used to update model weights (fine-tuning, preference tuning, distillation).
  • Retrieval corpus: Documents indexed for RAG. The model does not memorize them as weights, but they define what it is allowed to cite.
  • Eval / golden sets: Held-out cases with known-good answers or rubrics. Used to decide whether a release is safe.
  • Production logs: Live traffic. Not a dataset until someone filters, labels, and versions it for the next training round.

Calling everything "the dataset" blurs ownership. A support bot might have 8k fine-tune examples, 40k indexed help articles, 200 golden eval cases, and millions of raw chat logs. Only the first three are ready to use. The logs are raw material.

Training updates weights Retrieval grounds context Eval / golden measures quality Prod logs raw traffic Curated next-version dataset filter, label, dedup, version
Same company, four different datasets. Each needs its own schema, owners, and refresh rules.

Raw material vs ready examples

Support tickets, call transcripts, PDFs, and database exports are not training data yet. They are sources. Turning them into examples means deciding what the input is, what the target output is, what to exclude, and what metadata to keep.

For LLM fine-tuning, a ready example usually has a stable shape: messages with roles, or prompt/completion pairs in JSONL. For classification, it might be text plus a label. For preference tuning, chosen and rejected completions tied to the same prompt. For RAG, chunked passages with source IDs and access controls.

Schema matters early. If one row stores the user question in input and another in messages[1].content, your pipeline will break silently or produce uneven training signal.

Schema habit

Write a one-page dataset contract: required fields, allowed values, max lengths, and what "done" means for an example. Review new rows against it before they enter train or eval.

Collection strategies

Where examples come from shapes what the model can do:

  • Human-written: Highest control, highest cost. Good for policy-heavy tasks and refusal behavior.
  • Logged and filtered: Cheapest at scale if you already have traffic. Needs PII scrubbing and quality filters.
  • Expert or vendor labeling: Domain specialists or annotation vendors turn raw cases into labels. Cost is often $0.50–$5+ per example depending on task complexity and review rounds.
  • Active learning: Ship a model, log failures and low-confidence cases, label those first. Puts budget where the model is weakest.
  • Stratified sampling: Deliberately oversample rare intents (fraud escalation, medical edge cases) instead of hoping random sampling catches them.

Scraping public web text is a different beast. Consent, terms of use, robots.txt, and PII at collection time all matter. A corpus that is legally or ethically risky is not cheaper because it was free to download.

Engineering reality

A team once fine-tuned on exported Slack threads without redacting internal URLs and project codenames. The model started echoing naming patterns that never appear in customer-facing prompts. Collection-time PII and secret scanning is cheaper than post-hoc behavior debugging.

Training data is not product data

Product data is what users actually send at runtime: messy, partial, sometimes adversarial. Training data is what you chose to show the model during adaptation. They should overlap in shape, not necessarily in content.

If training examples always include a polished system prompt and clean JSON, but production users paste walls of unstructured text, the model will look worse live than offline evals suggest. If eval cases are written by engineers and production users spell differently, slice metrics will lie.

Golden sets should be representative of product risk, not of what was easy to label. That often means pulling real (consented, redacted) failures into eval, not only writing synthetic happy paths.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What is the difference between training data, a retrieval corpus, and an eval set?
  • Why is a schema contract worth writing before collection scales up?
  • When would you use stratified sampling instead of random sampling?
  • Why should training examples match the interaction shape you serve, not just the topic?

Quick check

  • The RAG document index
  • The held-out golden eval set
  • Unlabeled production chat logs
  • They are always larger
  • They are raw material until curated for a specific role
  • They are already labeled and deduplicated
  • So the model learns the same message structure and fields it will see live
  • To reduce JSONL file size
  • Because OpenAI requires it exclusively