Lesson 03

GPU cost per token

A GPU price per hour is not a product cost. You need to translate hardware price, utilization, and workload shape into a cost for the tokens your users actually consume.

The one idea

Cost per token is hardware cost divided by useful tokens served. Utilization and workload shape matter as much as the sticker price of the GPU.

The basic formula

Start with a simple estimate:

cost per 1M tokens =
  (hourly serving cost / useful tokens per hour) * 1,000,000

"Hourly serving cost" includes GPU rental or depreciation, host CPU and memory, storage, networking, orchestration, monitoring, and the serving overhead you can actually measure. "Useful tokens per hour" means the tokens that satisfy user requests, not synthetic peak throughput from a benchmark.

If a node costs $4 per hour and serves 2 million useful tokens per hour, the rough serving cost is $2 per 1 million tokens. If utilization drops by half, the same $4 buys only 1 million useful tokens, so the cost doubles to $4 per 1 million tokens even though the GPU price did not change.
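The arithmetic above can be written as a small helper. This is a minimal sketch: the function name `cost_per_million_tokens` and its parameters are illustrative, and the numbers are the ones from the example in the text.

```python
def cost_per_million_tokens(hourly_cost_usd: float, useful_tokens_per_hour: float) -> float:
    """Serving cost in USD per 1 million useful tokens."""
    if useful_tokens_per_hour <= 0:
        raise ValueError("useful_tokens_per_hour must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return hourly_cost_usd * 1_000_000 / useful_tokens_per_hour


# Node from the example: $4 per hour, 2 million useful tokens per hour.
full_cost = cost_per_million_tokens(4.0, 2_000_000)

# Same node, same bill, half the useful tokens.
half_cost = cost_per_million_tokens(4.0, 1_000_000)

print(f"full utilization: ${full_cost:.2f} per 1M tokens")
print(f"half utilization: ${half_cost:.2f} per 1M tokens")
```

The only input that changed between the two calls is the denominator, which is why the second result is twice the first.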

Utilization is the trap

Most optimistic serving spreadsheets assume the GPU is busy all the time. Real systems have traffic valleys, cold starts, long-tail prompts, failed requests, overprovisioning for p95 latency, and reserved headroom for bursts.

Low utilization is not always bad. You may choose idle capacity to keep latency stable. The mistake is pretending that idle capacity is free.

The hourly bill is fixed for the billing window. What varies is the denominator: how many useful tokens you serve in that time.
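One way to see the effect is to replace the "always busy" assumption with an hourly traffic profile. The sketch below is illustrative only: the 24-hour utilization profile, the $4 node, and the 2 million tokens per hour peak are made-up inputs, and `effective_cost_per_million` is a hypothetical helper, not a standard API.

```python
def effective_cost_per_million(
    hourly_cost_usd: float,
    peak_tokens_per_hour: float,
    utilization_by_hour: list[float],
) -> float:
    """Cost per 1M useful tokens when the node is billed for every hour
    but only serves a fraction of its peak throughput in each one."""
    total_cost = hourly_cost_usd * len(utilization_by_hour)
    useful_tokens = sum(peak_tokens_per_hour * u for u in utilization_by_hour)
    if useful_tokens <= 0:
        raise ValueError("profile serves no useful tokens")
    return total_cost * 1_000_000 / useful_tokens


# Hypothetical day: 8 quiet hours, 12 busy hours, 4 moderate hours.
profile = [0.2] * 8 + [0.8] * 12 + [0.5] * 4

# What the spreadsheet assumes: the GPU is busy all the time.
benchmark_cost = effective_cost_per_million(4.0, 2_000_000, [1.0] * 24)

# What the traffic profile actually delivers.
effective_cost = effective_cost_per_million(4.0, 2_000_000, profile)

print(f"benchmark: ${benchmark_cost:.2f} per 1M tokens")
print(f"effective: ${effective_cost:.2f} per 1M tokens")
```

With this profile the average utilization is 55%, so the effective cost is the benchmark cost divided by 0.55. The idle hours were paid for either way.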

Input and output tokens cost differently

A single blended "tokens per second" number hides the shape of the work. Long prompts create prefill work and KV cache memory pressure. Long answers create decode work over time. Some providers price input and output differently because the serving cost profile is different.

Your internal model should separate input tokens, output tokens, cached input tokens, rejected requests, and retried requests. A prompt-template change that adds 2,000 hidden tokens can hurt cost before anyone notices the product got slower.
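A rough sketch of what separate tracking might look like. Everything here is an assumption for illustration: the `Rates` and `RequestUsage` types, the per-million rates, and the simplification that a retry repeats the full request. In practice you would derive internal rates from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical internal rates in USD per 1M tokens."""
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class RequestUsage:
    input_tokens: int          # uncached prompt tokens (prefill work)
    cached_input_tokens: int   # prompt tokens served from a cache
    output_tokens: int         # generated tokens (decode work)
    attempts: int = 1          # 1 plus the number of retries


def request_cost(usage: RequestUsage, rates: Rates) -> float:
    """Cost of one logical request, assuming each retry repeats all the work."""
    one_attempt = (
        usage.input_tokens * rates.input_per_million
        + usage.cached_input_tokens * rates.cached_input_per_million
        + usage.output_tokens * rates.output_per_million
    ) / 1_000_000
    return one_attempt * usage.attempts


# Made-up rates: output is priced higher than input, cached input lower.
rates = Rates(input_per_million=0.50, cached_input_per_million=0.05, output_per_million=2.00)

# The same request before and after a template change adds 2,000 prompt tokens.
before = request_cost(RequestUsage(input_tokens=1_000, cached_input_tokens=0, output_tokens=300), rates)
after = request_cost(RequestUsage(input_tokens=3_000, cached_input_tokens=0, output_tokens=300), rates)

print(f"before: ${before:.4f}  after: ${after:.4f}  ratio: {after / before:.2f}x")
```

A blended tokens-per-second number would not show this change clearly, because the output side of the request is identical in both cases.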

Add the cost of reliability

Production serving needs more than the happy path. Add headroom for retries, health checks, rolling deploys, replicas in another zone, monitoring, logs, and the capacity you reserve for traffic spikes. Add engineering time when comparing against an API. People are not free infrastructure.

This does not mean self-hosting is bad. It means the decision should include all the work the hosted provider used to do for you.
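To make those extras visible, you can fold them into the hourly figure before dividing by tokens. The sketch below uses invented numbers and a hypothetical `loaded_hourly_cost` helper. The split into standby nodes, a flat overhead fraction, and amortized engineering hours is one possible model, not a standard one.

```python
def loaded_hourly_cost(
    node_hourly_usd: float,
    serving_nodes: int,
    standby_nodes: int,
    overhead_fraction: float,
    engineer_hours_per_month: float,
    engineer_hourly_usd: float,
    hours_per_month: float = 730.0,
) -> float:
    """Hourly serving cost including standby capacity, a flat overhead for
    monitoring, logs, and deploys, and engineering time spread over the month."""
    infra = node_hourly_usd * (serving_nodes + standby_nodes)
    infra *= 1.0 + overhead_fraction
    people = engineer_hours_per_month * engineer_hourly_usd / hours_per_month
    return infra + people


# Happy path: two $4 nodes and nothing else.
naive_cost = 4.0 * 2

# Hypothetical production setup: one standby replica in another zone,
# 10% overhead, and 40 engineer hours per month at $100 per hour.
loaded_cost = loaded_hourly_cost(
    node_hourly_usd=4.0,
    serving_nodes=2,
    standby_nodes=1,
    overhead_fraction=0.10,
    engineer_hours_per_month=40,
    engineer_hourly_usd=100.0,
)

print(f"naive:  ${naive_cost:.2f} per hour")
print(f"loaded: ${loaded_cost:.2f} per hour")
```

The loaded figure is the one to feed into the cost per 1 million tokens formula when comparing against a hosted API price.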

Engineering reality

The easiest way to lie with cost per token is to benchmark one fully loaded GPU for five minutes, then apply that number to a service with uneven traffic, retries, deploy headroom, and on-call requirements.

Use cost by route

Average cost across the whole product is useful for finance, but it is too blurry for engineering. Break cost down by route or feature. A support summary, agent planning step, code review, title generator, and voice response can have completely different token shapes.

Route-level cost tells you where to optimize. You might trim context on one feature, cap output on another, route easy cases to a smaller model, or keep a hosted API for a rare expensive workflow.
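A minimal sketch of a route-level rollup, assuming you already log a route name and an estimated cost for each request. The route names and costs below are invented, and `cost_by_route` is a hypothetical helper.

```python
from collections import defaultdict


def cost_by_route(records: list[tuple[str, float]]) -> list[tuple[str, int, float]]:
    """Aggregate (route, cost_usd) records into (route, requests, total_cost_usd),
    sorted so the most expensive route comes first."""
    requests: dict[str, int] = defaultdict(int)
    totals: dict[str, float] = defaultdict(float)
    for route, cost_usd in records:
        requests[route] += 1
        totals[route] += cost_usd
    rows = [(route, requests[route], totals[route]) for route in totals]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented per-request costs for three features.
records = [
    ("title_generator", 0.0002),
    ("title_generator", 0.0002),
    ("title_generator", 0.0002),
    ("support_summary", 0.0030),
    ("support_summary", 0.0030),
    ("code_review", 0.0120),
]

report = cost_by_route(records)
for route, count, total in report:
    print(f"{route:<16} requests={count}  total=${total:.4f}  avg=${total / count:.4f}")
```

In this made-up data the busiest route is the cheapest one, and a single code review costs more than all the title requests combined. A product-wide average would hide that.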

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How do you estimate cost per 1 million tokens from an hourly GPU bill?
Why does lower utilization raise cost per token?
Why should input and output tokens be tracked separately?
What production costs are missing from a simple GPU benchmark?

Quick check

If utilization drops by half and the hourly bill stays the same, what happens to cost per token?

It roughly doubles
It roughly halves
It stays the same

What does breaking cost down by route do for engineering?

It hides noisy feature differences
It points optimization at the routes driving the bill
It only matters to finance