Lesson 03

Features vs learned representations

A model never sees your raw photo or audio clip. It sees numbers. The question that defined two eras of AI is simple: who picks those numbers, you or the model?

The one idea

A feature is a number you extract from raw input to feed the model. For decades people picked the features by hand, which took expertise and capped how far you could get. Deep learning's win was letting the model discover its own features straight from raw data, layer by layer.

A model only eats numbers

Whatever you're working with (an email, a photo, a second of speech), the model can't read it directly. It needs numbers. A feature is one of those numbers: something you measure or compute from the raw input and feed in. The length of an email, the brightness of a pixel, how often the word "free" shows up. Stack a bunch of features into a list and you get a feature vector, which is the actual thing the model trains on.
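Here is a minimal sketch of turning one raw email into a feature vector. The function name `email_features` and the choice of three features are assumptions made up for illustration; a real spam filter would use many more.

```python
def email_features(text: str) -> list[float]:
    """Turn one raw email into a small feature vector (hypothetical recipe)."""
    words = text.lower().split()
    return [
        float(len(text)),            # length of the email in characters
        float(words.count("free")),  # whitespace-separated tokens equal to "free"
        float(text.count("!")),      # number of exclamation marks
    ]


# One email in, one list of numbers out. The model only ever sees the list.
vector = email_features("FREE gift! Claim your free prize now")
print(vector)
```

Note that every choice here (lowercasing, splitting on whitespace, which word to count) was made by a person, not learned from data.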

So before any learning happens, someone has to decide which numbers to extract. Get that wrong and no model, however big, can recover. Get it right and even a simple model works well. For most of machine learning's history, choosing those numbers was a human job, and it was the hard part.

The old way: pick the features by hand

This was called feature engineering, and whole careers were built on it. You used domain knowledge to design a recipe that turned messy raw data into a tidy list of numbers a classifier could chew on. The recipe was different for every kind of data, and it was where the expertise lived:

  • Vision: you didn't feed raw pixels in. You ran feature extractors like SIFT and HOG that summarized edges, corners, and gradient directions into a compact descriptor, then fed that to the model.
  • Audio: you converted the waveform into MFCCs (mel-frequency cepstral coefficients), a handful of numbers per slice of sound designed to capture what the human ear cares about. Speech recognition ran on these for years.
  • Text: you turned a document into word counts, usually weighted by TF-IDF so common words counted for less and distinctive ones counted for more.

Each of these is clever, and each took experts a long time to invent and tune. That's exactly the problem. Your model could only be as good as the features a human thought to hand it, and a human can only think of so much.
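To make the text recipe concrete, here is a rough sketch of TF-IDF in plain Python. It uses the textbook unsmoothed formula (term frequency times log of inverse document frequency); the function name `tf_idf` and the toy documents are assumptions for illustration, and library implementations such as scikit-learn's typically use smoothed variants by default.

```python
import math
from collections import Counter


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Weight each word in each document by term frequency times inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear at least once?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" appears in every document, so its weight is zero, while "sat" appears in only one and gets the largest weight in the first document. That is the "common words count for less" behavior, written down by hand.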

Why it capped out

Hand-crafted features only capture what their designer thought to encode. SIFT can describe a gradient, but it has no way to notice a higher-level pattern nobody wrote a rule for. Past a point, you couldn't push accuracy higher by adding data, only by inventing better features, and that well runs dry.

The new way: let the model find them

Deep learning's move was to delete the hand-crafted step. You feed the raw-ish input (the actual pixels, an audio spectrogram) straight into a deep network, and the network learns its own features as part of training. Nobody writes the edge detector. The first layer learns to spot edges on its own because edges turn out to be useful for the final task. Roughly speaking, the next layer combines those edges into corners and textures, the next into shapes, the next into whole objects.
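For contrast, this is roughly what a hand-written edge detector looks like: a small grid of fixed numbers slid across the image. The sketch below uses a Sobel-style kernel on a toy image; the helper name `convolve2d` and the image are assumptions for illustration. In a convolutional network the kernel has the same shape, but its numbers start random and are set by training instead of being typed in.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Sobel-style kernel: responds where brightness changes from left to right.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 4x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

edges = convolve2d(image, SOBEL_X)
print(edges)  # [[0, 4, 4, 0], [0, 4, 4, 0]]
```

The output is zero over the flat regions and large where the dark half meets the bright half, which is all an edge feature is.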

Those internal, learned numbers are what we call representations. A representation is just a learned feature vector: the model's own internal description of the input, shaped by training to be what the rest of the model needs. The field's name for this whole idea is representation learning. Instead of you handing the model good numbers, the model can invent better ones than you could, because it's free to encode patterns no human would think to name.

[Figure: two pipelines. Top row, hand-engineered features: raw input (pixels, audio) → human picks features (SIFT, MFCC, TF-IDF) → model → prediction. Bottom row, learned representations: raw input (pixels, audio) → deep network learns its own features (layer 1 edges, layer 2 textures, layer 3 shapes, layer 4 objects) → prediction.]
The human-built step in the middle of the top row disappears. In the bottom row the network builds its own features in layers, simple ones first, and the whole stack trains together toward the final answer.

One detail makes this powerful: the features and the final classifier train together, end to end. In the old pipeline the feature recipe was fixed before the model ever saw it, so the features didn't know what task they'd be used for. Here, the learned features are tuned by the same training signal as everything else, so they end up shaped specifically for the job.
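A toy sketch of what "train together" means in code: one loss, and a single gradient step that updates both the feature layer and the classifier on top of it. Everything here (the XOR data, three hidden units, squared-error loss, the learning rate, the variable names) is an assumption chosen to keep the example small, not a recipe from any particular system.

```python
import math
import random

random.seed(0)

# XOR is not linearly separable on the raw inputs, so the hidden layer has to learn features.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
HIDDEN = 3

# Feature layer (w1, b1) and classifier (w2, b2), both starting from random values.
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [random.uniform(-1, 1) for _ in range(HIDDEN)]
b2 = 0.0


def forward(x):
    """Return the learned features (hidden activations) and the prediction."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, 1.0 / (1.0 + math.exp(-logit))


def mean_loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in DATA) / len(DATA)


def train_step(lr=0.5):
    """One full-batch gradient step that updates features and classifier from the same loss."""
    global b2
    g_w1 = [[0.0, 0.0] for _ in range(HIDDEN)]
    g_b1 = [0.0] * HIDDEN
    g_w2 = [0.0] * HIDDEN
    g_b2 = 0.0
    for x, y in DATA:
        hidden, pred = forward(x)
        # Gradient of the mean squared error with respect to the output logit.
        d_logit = 2 * (pred - y) * pred * (1 - pred) / len(DATA)
        g_b2 += d_logit
        for j in range(HIDDEN):
            g_w2[j] += d_logit * hidden[j]
            # The same error signal flows back through the classifier into the feature layer.
            d_pre = d_logit * w2[j] * (1 - hidden[j] ** 2)
            g_b1[j] += d_pre
            for i in range(2):
                g_w1[j][i] += d_pre * x[i]
    for j in range(HIDDEN):
        w2[j] -= lr * g_w2[j]
        b1[j] -= lr * g_b1[j]
        for i in range(2):
            w1[j][i] -= lr * g_w1[j][i]
    b2 -= lr * g_b2


initial_w1 = [row[:] for row in w1]
initial_loss = mean_loss()
for _ in range(5000):
    train_step()
final_loss = mean_loss()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In the old pipeline, `w1` would be a fixed recipe and only `w2` and `b2` would change. Here the feature layer moves too, pushed by the same error signal as the classifier.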

The moment it flipped

This wasn't an obvious win at first. The turning point most people point to is 2012, when a deep network called AlexNet entered the ImageNet image-recognition contest. Every other strong team that year used hand-crafted features like SIFT fed into a classical classifier. AlexNet skipped all of that and learned its features from the pixels.

It won by a landslide: a top-5 error rate around 15.3%, roughly 11 percentage points ahead of the runner-up at about 26.2%. In a contest where year-over-year gains had been on the order of a couple of points, that was a shock. Within a couple of years basically every serious entry was a deep network. The same shift then rolled through speech and text. The lesson the whole field took away: given enough data and compute, learned features beat hand-crafted ones, and the gap tends to widen as you scale up.

So did feature engineering die? No, it just moved. You rarely hand-craft SIFT-style descriptors anymore, but you still make big choices about how raw data is presented: how audio gets turned into a spectrogram, how text gets split into tokens, how tabular columns get normalized. And on small or tabular datasets, classical features plus a simple model often still win. Deep representation learning is the default when you have lots of data, not a law of nature.

The tradeoff nobody mentions in the hype

Letting the model find its own features sounds strictly better, but it isn't free. You're asking the network to discover, from scratch, things a human expert spent years encoding. That takes a lot more raw material and a lot more compute.

Engineering reality

The costs of learned features show up in production in three ways:

They need scale. A network learning its own features wants far more labeled data and far more compute than a hand-crafted pipeline. On a few thousand examples, a deep net often loses to a classical model fed good features. This is why tabular business data (fraud, churn, pricing) is still mostly gradient-boosted-tree territory, not deep learning.

The data prep doesn't vanish, it just changes shape. Teams build feature pipelines and even feature stores: systems that compute, version, and serve features consistently between training and live traffic. A feature that's computed one way in training and another way in production is a classic, painful source of silent model bugs.
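A small sketch of the training/serving mismatch and the usual guard against it: compute the feature with one shared function and reuse the statistics stored at training time. The names `fit_scaler` and `amount_feature` and the transaction amounts are hypothetical, made up for this example.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute normalization statistics once, on training data only."""
    return {"mean": mean(values), "std": pstdev(values) or 1.0}


def amount_feature(amount, scaler):
    """Single shared feature function, called by both training and serving code."""
    return (amount - scaler["mean"]) / scaler["std"]


# Training: fit the statistics, then compute features with them.
train_amounts = [10.0, 20.0, 30.0, 40.0]
scaler = fit_scaler(train_amounts)
train_features = [amount_feature(a, scaler) for a in train_amounts]

# Serving done right: same function, same stored statistics.
live_amount = 40.0
served = amount_feature(live_amount, scaler)

# Serving done wrong: statistics recomputed from whatever live traffic looks like.
skewed_scaler = fit_scaler([40.0, 400.0])
skewed = amount_feature(live_amount, skewed_scaler)

print(served, skewed)
```

The same 40.0 transaction comes out positive with the stored statistics and negative with the recomputed ones. Nothing crashes and no error is logged, which is why this kind of bug is silent.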

Learned features are harder to read. A hand-crafted feature like "word count" is obvious. A learned representation is a vector of numbers with no built-in meaning, which makes debugging, auditing, and explaining a model's decision much harder.

So the choice is real. Small data, need to explain it, tight compute budget: hand-crafted features and a simple model are often the right call. Lots of data, hard perceptual problem (images, audio, language): let the model learn its own representations and pay the cost. Most of modern AI bet on the second, which is why the rest of this course is mostly about deep networks.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

  • What is a feature, and why does every model need them no matter what the raw input is?
  • What was "feature engineering", and name one classic hand-crafted feature for vision, audio, and text.
  • What is a learned representation, and what does "representation learning" mean?
  • Why is AlexNet in 2012 treated as a turning point?
  • Give one concrete situation where hand-crafted features still beat a deep network, and say why.

Quick check

  • A human picks the input numbers in one; the model discovers its own during training in the other
  • Learned representations use numbers and hand-engineered features don't
  • Hand-engineered features are always simpler and less accurate by definition

  • It was the first model to run on a GPU
  • It learned features from raw pixels and crushed the hand-crafted-feature competition, triggering a field-wide switch
  • It was the first large language model

  • A large deep network, so it can learn its own features
  • Hand-crafted features feeding a simpler, more interpretable model
  • Whichever one you can throw the most GPUs at