Training vs inference
A model has two halves to its life. One is expensive, happens once, and produces the thing. The other is cheap per call, happens forever, and is where you actually live. Almost every cost and failure decision you make later comes from telling these two apart.
Training is fitting the knobs: a big, one-time (or occasional) job that turns data into a finished model. Inference is running that finished model on new input, over and over. They cost different amounts, run on different schedules, and break in different ways.
The two halves
Lesson 01 said a model is a flexible shape squeezed to fit examples. Squeezing it is one job. Using it afterward is a completely different job, and people new to this often blur them together.
Training is the squeezing. You take a pile of labeled data and an objective, and you grind the knobs (the parameters) until the model's outputs match the known answers well enough. This is slow and heavy: many passes over a large dataset, lots of math, lots of hardware. When it finishes you have an artifact, a frozen set of numbers, the weights. That artifact is the model.
Inference is using the artifact. You hand the finished model one new input, it does a single forward pass, and out comes one prediction. No learning happens. The knobs do not move. It is just arithmetic running the input through the frozen weights. Then you do it again for the next input, and the next, a billion times.
The clearest way to hold this: training writes the model once, inference reads it forever.
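To make the split concrete, here is a minimal sketch using a toy two-knob model (a straight line). The data, the `train` and `predict` names, the learning rate, and the step count are all made up for illustration. Real training uses far larger models and libraries, but the shape of the two jobs is the same.

```python
# Toy model: y = w * x + b. Two knobs, w and b.

def train(xs, ys, steps=2000, lr=0.05):
    """Training: grind the knobs until outputs match the known answers."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):  # many passes over the dataset
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w    # the knobs move
        b -= lr * grad_b
    return (w, b)           # the frozen artifact: the weights


def predict(weights, x):
    """Inference: one forward pass. Reads the weights, never changes them."""
    w, b = weights
    return w * x + b


# Made-up labeled data that follows y = 3x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]

weights = train(xs, ys)  # slow and heavy, done once
outputs = [predict(weights, x) for x in (5.0, 6.0, 7.0)]  # cheap, done repeatedly
```

The weights come back as a tuple, which Python will not let `predict` modify. That is the "knobs do not move" rule enforced by the data type.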
Why the costs are so lopsided
Because the two halves run on totally different schedules, their bills look nothing alike, and the surprise is which one wins.
Training is a large fixed cost you pay up front. A frontier model gets trained on thousands of GPUs running for weeks or months. The public estimates for GPT-4's training compute land around $100 million, give or take. That is a big, scary, one-time number, so people assume training is where the money goes.
It usually is not. Each single inference call is cheap, but you make a staggering number of them. A model trained once has to answer every user query, every day, for as long as it is deployed. Those tiny costs pile up and pass the training bill fast. Reported figures put GPT-4's cumulative inference spend in the billions, far above what training cost, with the running tab overtaking the one-time tab within months of going live at scale.
Training is like writing a book: expensive, done once. Inference is like printing a copy for each reader: cheap each time, but across millions of readers, forever, the total ink and paper dwarfs the cost of writing it.
This split drives a lot of later decisions. It is why a team will spend serious effort shaving a few milliseconds or a fraction of a cent off each inference call. At a billion calls, a tiny per-call saving is a real budget. It is also why training and inference often run on different hardware: training wants raw throughput across a huge cluster, while inference wants low latency on a single request and the cheapest box that still hits your speed target.
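The arithmetic behind this is simple enough to sketch. Every number below is a hypothetical placeholder chosen for round results, not a real price or traffic figure for any model; swap in your own.

```python
# All three figures are hypothetical placeholders, not real prices.
TRAINING_COST = 100_000_000   # dollars, paid once
COST_PER_CALL = 0.01          # dollars per inference call
CALLS_PER_DAY = 100_000_000   # inference calls served per day


def cumulative_inference_cost(days):
    """Total inference spend after the model has been live for `days` days."""
    return COST_PER_CALL * CALLS_PER_DAY * days


def break_even_days():
    """Days until cumulative inference spend equals the one-time training bill."""
    return TRAINING_COST / (COST_PER_CALL * CALLS_PER_DAY)


def yearly_saving(saving_per_call):
    """Dollars saved per year by shaving `saving_per_call` off every call."""
    return saving_per_call * CALLS_PER_DAY * 365
```

With these placeholder numbers, inference costs about $1 million a day, so it matches the training bill after roughly 100 days, and shaving a tenth of a cent off each call (`yearly_saving(0.001)`) is worth about $36.5 million a year.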
The input contract
Here is the part that bites people. A trained model does not accept "audio" or "an image" in the loose human sense. It accepts exactly the shape of input it saw during training, and nothing else.
Take OpenAI's Whisper, the speech-to-text model. It is built around 16 kHz mono audio. That phrase is a precise contract. "16 kHz" means the sound was sampled 16,000 times per second. "Mono" means one channel, not stereo. Whisper computes its features (a spectrogram, the subject of a later audio lesson) assuming that exact rate, so the numbers only line up if the audio really is 16 kHz mono. Hand it raw 44.1 kHz stereo samples from a music file without converting them first, and the model does not understand "this is the same speech, just higher quality." The array of numbers is simply wrong, and the transcript comes out garbled.
This is not unique to audio. An image classifier expects a fixed pixel size and channel order. A language model expects text run through one specific tokenizer, not raw characters. In every case the contract is the same: feed the model the exact shape it trained on, or the output is meaningless.
The cruel detail: breaking the contract rarely throws an error. The model takes whatever numbers you give it and dutifully runs them through. You get an answer; it just happens to be wrong. There is no red flag, no stack trace, only quietly bad output that you might not notice until a user does.
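Since the model will not complain, one option is to complain on its behalf before the audio ever reaches it. The sketch below is an assumption-laden illustration: `AudioClip`, `InputContractError`, `check_contract`, and `apparent_duration` are hypothetical names, not part of Whisper or any library. Only the 16 kHz mono figures come from the contract described above.

```python
from dataclasses import dataclass

EXPECTED_SAMPLE_RATE = 16_000  # Hz; the "16 kHz" half of the contract
EXPECTED_CHANNELS = 1          # the "mono" half of the contract


@dataclass
class AudioClip:
    """Hypothetical container for decoded audio (not a Whisper type)."""
    num_frames: int    # samples per channel
    sample_rate: int   # Hz
    channels: int


class InputContractError(ValueError):
    """Raised when audio does not match what the model was trained on."""


def check_contract(clip):
    """Turn a silent contract violation into a loud one."""
    if clip.sample_rate != EXPECTED_SAMPLE_RATE:
        raise InputContractError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz, got {clip.sample_rate} Hz"
        )
    if clip.channels != EXPECTED_CHANNELS:
        raise InputContractError(
            f"expected {EXPECTED_CHANNELS} channel(s), got {clip.channels}"
        )
    return clip


def apparent_duration(clip, assumed_rate=EXPECTED_SAMPLE_RATE):
    """Seconds of audio per channel if the samples are read at assumed_rate."""
    return clip.num_frames / assumed_rate


# One real second of 44.1 kHz stereo, as it might come from a music file.
music = AudioClip(num_frames=44_100, sample_rate=44_100, channels=2)
```

Calling `check_contract(music)` raises instead of letting the clip through. `apparent_duration(music)` shows why the unchecked path goes wrong: read at 16 kHz, one real second of 44.1 kHz audio looks like about 2.76 seconds, so every sound arrives stretched and shifted down in pitch.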
Preprocessing has to match on both ends
Raw input almost never arrives in the model's required shape, so there is always a preprocessing step in front of it: resample the audio, resize the image, run text through the tokenizer, scale the numbers. The rule that catches teams is this: the preprocessing at inference time must be identical to the preprocessing used during training.
If during training you converted everything to 16 kHz mono, then at serving time you have to convert everything to 16 kHz mono the same way. If you scaled a numeric feature by subtracting the training set's average, you must subtract that same average in production, not recompute a fresh one. The model learned to expect inputs that went through one specific pipeline. Hand it inputs that went through a slightly different pipeline and you have lied to it.
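The "same average" rule can be sketched in a few lines. The feature values and the `fit_scaler` and `scale` helpers below are invented for illustration, and storing the statistics as JSON next to the weights is one possible choice, not a prescribed one.

```python
import json
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute scaling statistics. Run this on the training set only."""
    return {"mean": mean(values), "std": pstdev(values)}


def scale(value, stats):
    """Apply stored statistics to one value. Used in training and serving."""
    return (value - stats["mean"]) / stats["std"]


# --- Training time (made-up feature values) ---
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
stats = fit_scaler(train_values)
saved = json.dumps(stats)  # ship this alongside the weights

# --- Serving time ---
loaded = json.loads(saved)
production_values = [30.0, 32.0, 34.0]

# Right: reuse the training statistics.
correct = [scale(v, loaded) for v in production_values]

# Wrong: recompute fresh statistics from production data.
fresh = fit_scaler(production_values)
skewed = [scale(v, fresh) for v in production_values]
```

With the training statistics, the production values come out as large positive numbers, which correctly tells the model they sit far above anything it trained on. With freshly computed statistics, the same values come out centered on zero and look perfectly ordinary. Both paths run without error; only one tells the model the truth.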
This mismatch has a name: training/serving skew. It happens when the data or preprocessing during training differs from what the model sees in production. This is not the drift over time described in lesson 01. It is a deployment bug, a difference you introduced.
How it sneaks in. Training often runs in a notebook with pandas and one set of helper functions. Serving runs in a different service, sometimes in a different language, with reimplemented preprocessing. The two pipelines drift apart by a small detail: a different default sample rate, a tokenizer version bump, channels averaged one way here and dropped another way there. Each looks harmless on its own.
Why it is so expensive to catch. Like the input-contract failure, skew is silent. The model does not crash. Your offline evaluation, which uses the training pipeline, still looks great. Production accuracy is just quietly worse, and because nothing errors, the alert never fires. Teams have shipped with confidence and only found the skew weeks later by digging into why real-world numbers never matched the demo.
The defense. Share one preprocessing implementation between training and serving instead of writing it twice. Pin versions of tokenizers and feature code. And test the model on data pushed through the actual production pipeline, not just the training one, so any gap shows up before users do.
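A parity test along these lines can be sketched as follows. It uses the channel-handling example from earlier: one pipeline averages the two stereo channels, while a reimplementation keeps only the left one. All function names and sample data here are hypothetical; the idea is simply to run the same inputs through both pipelines and flag any disagreement.

```python
def downmix_training(frames):
    """Training pipeline: average the left and right channels."""
    return [(left + right) / 2 for left, right in frames]


def downmix_serving_reimplemented(frames):
    """A serving-side rewrite that drifted: keeps only the left channel."""
    return [left for left, _ in frames]


def find_skew(train_fn, serve_fn, samples, tol=1e-9):
    """Return the inputs on which the two pipelines disagree."""
    mismatches = []
    for sample in samples:
        a, b = train_fn(sample), serve_fn(sample)
        differs = len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b))
        if differs:
            mismatches.append(sample)
    return mismatches


# Made-up stereo frames as (left, right) pairs.
samples = [
    [(0.5, 0.5), (0.2, 0.2)],  # identical channels: both pipelines agree
    [(1.0, 0.0), (0.0, 1.0)],  # channels differ: the skew shows up
]

skewed_inputs = find_skew(downmix_training, downmix_serving_reimplemented, samples)
shared_inputs = find_skew(downmix_training, downmix_training, samples)
```

The first sample hides the bug because its channels happen to match, which is how skew survives a casual spot check. The second sample exposes it. When both sides call the same function, as in `shared_inputs`, there is nothing to find.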
Two more distinctions worth holding
Inference itself comes in two flavors, and the difference shapes how you serve a model.
- Batch inference runs the model over a big pile of inputs at once, offline, where finishing the whole job matters more than any single result's speed. Think scoring every user in a database overnight.
- Realtime inference answers one request at a time, right now, with a person or system waiting. Here latency on the individual call is everything. A voice assistant is realtime: nobody waits five seconds for a transcript.
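The two flavors can call the exact same model; what changes is what you measure. In this minimal sketch, `model` is a stand-in for a frozen model's forward pass, and the 0.2 second latency budget is an arbitrary placeholder, not a recommended target.

```python
import time


def model(x):
    """Stand-in for a frozen model's forward pass."""
    return 2 * x + 1


def batch_inference(inputs):
    """Offline: score everything. What matters is the total job time."""
    start = time.perf_counter()
    outputs = [model(x) for x in inputs]
    total_seconds = time.perf_counter() - start
    return outputs, total_seconds


def realtime_inference(x, budget_seconds=0.2):
    """Online: answer one request. What matters is this call's latency."""
    start = time.perf_counter()
    output = model(x)
    latency = time.perf_counter() - start
    return output, latency, latency <= budget_seconds


scores, job_time = batch_inference([0, 1, 2])   # e.g. an overnight scoring job
answer, latency, on_time = realtime_inference(3)  # e.g. one waiting user
```

A batch job reports one number for the whole pile, so a single slow item barely matters. A realtime call is judged alone against its budget, so every slow call is a user who waited.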
And a clarification on "once." Training is not always literally one event. Models get retrained periodically as new data arrives or the world drifts. But each retrain is still the same kind of heavy, occasional job that produces a fresh frozen artifact, which inference then reads many times until the next retrain. The pattern holds: rare and expensive to make, frequent and cheap to use.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- In one sentence each, what happens during training and what happens during inference?
- Do a model's parameters change during inference? Why does that matter?
- Training is a huge one-time cost, yet inference usually costs more in total. Why?
- What does "the model expects 16 kHz mono audio" actually mean, and what happens if you feed it 44.1 kHz stereo?
- What is training/serving skew, and why is it so hard to notice?
Quick check
What happens to a model's parameters during inference?
- Nothing, they stay frozen; inference just runs input through the fixed weights
- They update a little with each request so the model keeps improving
- They reset to random values and re-fit for every new input
Why does inference usually cost more in total than training?
- Because each inference call costs more than the whole training run
- Because the cheap per-call cost is paid for billions of calls over years, and it adds up past the one-time training cost
- Because every inference call secretly retrains the whole model again
What happens if you feed 44.1 kHz stereo audio to a model that expects 16 kHz mono?
- The model throws an error and refuses to run
- It runs without error and returns a garbled, wrong transcript
- It produces an even more accurate transcript because the audio is higher quality