Lesson 03

GPU cost per token

A GPU price per hour is not a product cost. You need to translate hardware price, utilization, and workload shape into a cost for the tokens your users actually consume.

The one idea

Cost per token is hardware cost divided by useful tokens served. Utilization and workload shape matter as much as the sticker price of the GPU.

The basic formula

Start with a simple estimate:

cost per 1M tokens =
  (hourly serving cost / useful tokens per hour) × 1,000,000

"Hourly serving cost" includes GPU rental or depreciation, host CPU and memory, storage, networking, orchestration, monitoring, and the serving overhead you can actually measure. "Useful tokens per hour" means the tokens that satisfy user requests, not synthetic peak throughput from a benchmark.

If a node costs $4 per hour and serves 2 million useful tokens per hour, the rough serving cost is $2 per 1 million tokens. If utilization drops by half, the cost doubles even though the GPU price did not change.
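
As a minimal sketch of the formula above (the function name is illustrative, not from any library), the $4-per-hour example works out like this:

```python
def cost_per_million_tokens(hourly_serving_cost: float, useful_tokens_per_hour: float) -> float:
    """Dollars per 1M useful tokens for one hour of serving."""
    if useful_tokens_per_hour <= 0:
        raise ValueError("useful_tokens_per_hour must be positive")
    return hourly_serving_cost / useful_tokens_per_hour * 1_000_000


full = cost_per_million_tokens(4.00, 2_000_000)  # the node from the example above
half = cost_per_million_tokens(4.00, 1_000_000)  # same bill, half the useful tokens
print(f"${full:.2f} vs ${half:.2f} per 1M tokens")
```

The hourly cost is the same in both calls. Only the denominator moved.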

Worked example: A100 to dollars per million tokens

Here is a step-by-step estimate you can copy into a spreadsheet. Numbers are illustrative (Lambda Labs on-demand A100 80GB at about $1.29/hr, Llama-class 8B model, FP16, continuous batching with 32 concurrent sequences). Your benchmark will differ; the method is what matters.

Measure throughput. Run your traffic shape on one GPU. Example result: 2,400 output tokens/sec sustained.
Convert to tokens per hour. 2,400 × 3,600 = 8,640,000 output tokens/hour.
Divide hourly cost by hourly tokens. $1.29 ÷ 8.64M = $0.000000149 per output token.
Scale to one million. $0.000000149 × 1,000,000 ≈ $0.15 per 1M output tokens at full utilization.
Haircut for real utilization. If the GPU is busy 70% of the hour (traffic valleys, deploys, headroom), divide by 0.7: ≈ $0.21 per 1M output tokens.

Compare that to a hosted API on the OpenAI pricing page: a mid-tier model might charge $2–15 per 1M output tokens. Self-hosting wins on marginal cost only when utilization and quality are high enough to absorb the fixed fleet cost. The Anyscale LLM economics post walks through the same arithmetic with more hardware tiers.

Input	Value
GPU hourly rate (on-demand A100)	$1.29
Output throughput (benchmark)	2,400 tok/s
Output tokens / hour	8,640,000
Cost / 1M output tokens (100% util)	$0.15
Effective util factor	0.70
Cost / 1M output tokens (realistic)	$0.21
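
The same arithmetic as a small script, in case a spreadsheet is not handy. This is a sketch that uses only the illustrative inputs from the table; the function and variable names are made up for this example.

```python
SECONDS_PER_HOUR = 3600


def worked_example(gpu_hourly_rate: float, output_tokens_per_sec: float, utilization: float):
    """Return (tokens per hour, $/1M at 100% utilization, $/1M at the given utilization)."""
    tokens_per_hour = output_tokens_per_sec * SECONDS_PER_HOUR
    cost_full_util = gpu_hourly_rate / tokens_per_hour * 1_000_000
    cost_realistic = cost_full_util / utilization  # idle time still gets billed
    return tokens_per_hour, cost_full_util, cost_realistic


tokens_per_hour, cost_full, cost_real = worked_example(1.29, 2_400, 0.70)
print(f"{tokens_per_hour:,.0f} output tokens/hour")
print(f"${cost_full:.2f} per 1M at 100% utilization")  # $0.15
print(f"${cost_real:.2f} per 1M at 70% utilization")   # $0.21
```

Swap in your own measured throughput and utilization. The structure stays the same.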

On-demand, spot, and reserved pricing

The same GPU math changes sharply with how you buy capacity. Cloud list prices move; treat these as sensitivity bands, not quotes.

Purchase model	Illustrative A100 $/hr	Cost / 1M output tok (2,400 tok/s, 70% util)
On-demand	$1.29	≈ $0.21
Spot / interruptible	≈ $0.55 (often 40–60% off)	≈ $0.09
1-year reserved	≈ $0.85	≈ $0.14

Spot saves money when your scheduler can tolerate preemption or you have a fallback API. Reserved saves money when traffic is steady enough to keep the machine busy most of the month. Spiky products often lose on reserved GPUs unless you share a pool across features.
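
One way to make "steady enough" concrete is a break-even check. The sketch below uses the illustrative rates from the table and assumes, for simplicity, that on-demand capacity can be shut down whenever it is idle while reserved capacity is billed for every hour. Real contracts and autoscaling behavior will differ.

```python
HOURS_PER_MONTH = 720  # 30-day month


def on_demand_bill(busy_hours: float, rate: float = 1.29) -> float:
    """Pay only for the hours the instance is actually running."""
    return busy_hours * rate


def reserved_bill(rate: float = 0.85) -> float:
    """Pay for every hour of the month, busy or idle."""
    return HOURS_PER_MONTH * rate


def break_even_busy_hours(on_demand_rate: float = 1.29, reserved_rate: float = 0.85) -> float:
    """Busy hours per month above which reserved is cheaper than on-demand."""
    return reserved_bill(reserved_rate) / on_demand_rate


hours = break_even_busy_hours()
print(f"Reserved wins above {hours:.0f} busy hours/month "
      f"({hours / HOURS_PER_MONTH:.0%} of the month)")
```

With these rates a machine that is needed less than about two thirds of the month is cheaper on-demand, which is why spiky products often lose on reserved GPUs.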

FP8 and hardware tier

On H100-class GPUs, FP8 inference can nearly double effective throughput versus FP16 for some model sizes, which cuts $/token if quality holds. TensorRT-LLM and modern serving stacks expose FP8 paths; always run task evals after switching precision.

H100 hourly rates are higher than A100 rates (often $2–3+/hr on-demand), but tokens per dollar can still win when FP8 throughput and utilization are high. Do not compare an H100 FP8 benchmark against an A100 FP16 benchmark or a hosted API bill without normalizing for quality and batch shape.
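
A quick way to sanity-check a hardware upgrade is to ask how much throughput the pricier GPU needs just to break even. The sketch below assumes equal utilization and equal output quality on both GPUs, and uses a hypothetical $2.50/hr H100 rate picked from the range above. It is not a quote.

```python
def break_even_tokens_per_sec(
    baseline_tokens_per_sec: float,
    baseline_hourly_rate: float,
    candidate_hourly_rate: float,
) -> float:
    """Throughput the pricier GPU needs to match the baseline's $/token at equal utilization."""
    return baseline_tokens_per_sec * candidate_hourly_rate / baseline_hourly_rate


# A100 baseline from the worked example; the H100 rate is an assumption.
needed = break_even_tokens_per_sec(2_400, 1.29, 2.50)
print(f"H100 must sustain more than {needed:,.0f} tok/s to beat the A100 on $/token")
```

At that assumed price the H100 needs roughly 1.9× the A100's measured throughput before it saves anything, so a "nearly double" FP8 gain lands close to break-even rather than a clear win.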

Mixture-of-experts (MoE) models

MoE checkpoints advertise large total parameter counts, but per-token compute tracks the active experts, not the headline size. A 70B-total / 8B-active MoE can sit closer to an 8B dense model on FLOPs per token, while its weights still need roughly the GPU memory of a 70B model, because in a typical deployment every expert stays loaded.

When estimating cost, use measured tokens/sec for your MoE model at your batch size, not a naive "params divided by GPU FLOPs" guess. Routing overhead, expert imbalance, and the memory needed to keep all experts resident still matter. If one expert becomes a hotspot on your traffic slice, latency and cost spike together.
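
A back-of-envelope sketch of why total and active parameter counts pull in different directions. It assumes all weights are kept resident in GPU memory and uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. It ignores attention, KV cache, and routing overhead, so treat it as intuition only and rely on measured tokens/sec for real estimates.

```python
def weight_memory_gb(total_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (2 bytes per parameter for FP16)."""
    return total_params_billion * bytes_per_param


def forward_gflops_per_token(active_params_billion: float) -> float:
    """Rule of thumb: about 2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params_billion


models = {
    "8B dense": (8, 8),    # (total params, active params), in billions
    "70B MoE, 8B active": (70, 8),
    "70B dense": (70, 70),
}
for name, (total, active) in models.items():
    print(f"{name}: ~{weight_memory_gb(total):.0f} GB weights, "
          f"~{forward_gflops_per_token(active):.0f} GFLOPs/token")
```

The MoE row matches the 8B dense model on compute per token and the 70B dense model on weight memory, which is why a parameter count alone cannot predict its cost.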

Scale check: 10M tokens per day

Product teams often ask for a single headline number. Here is how to answer it without hand-waving.

Suppose 10M total tokens per day (about 116 tokens/sec average) with a 75% / 25% input–output split → 7.5M input and 2.5M output daily. Use the realistic self-host figure from above ($0.21 / 1M output tokens) and assume an input token costs roughly half as much as an output token on the same GPU (≈ $0.105 / 1M input tokens):

Self-host variable (GPU only): 7.5M × $0.105/1M + 2.5M × $0.21/1M ≈ $1.31 / day (~$39 / month), or ≈ $0.13 / 1M blended, before fixed replica cost.
Hosted API (illustrative $3 in / $15 out per 1M): 7.5M × $3/1M + 2.5M × $15/1M = $60 / day (~$1,800 / month).

At 10M tokens/day the usage-based GPU cost is tiny, but a GPU kept on 24/7 for availability bills roughly $930/month ($1.29/hr × 720 hr) whether it is busy or idle, and that bill already covers the usage-based share. Spread over about 300M tokens a month, the all-in cost is ≈ $3.10 / 1M tokens. That is about half the illustrative API bill (~$1,800/month), not orders of magnitude below it. Past roughly 50–100M tokens/day of steady traffic on one model, the fixed cost is spread over far more tokens and the gap widens. Re-run with your measured split, cache-hit rate, and quality bar.
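
The scale check is easy to re-run with your own numbers. Here is a sketch using this lesson's illustrative rates, where the input rate is assumed to be half the $0.21 output rate and a month is taken as 30 days.

```python
DAYS_PER_MONTH = 30
HOURS_PER_MONTH = 720


def daily_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Dollars per day given token counts and $/1M rates."""
    return (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000


input_tok, output_tok = 7_500_000, 2_500_000           # 75% / 25% split of 10M tokens/day
self_host_daily = daily_cost(input_tok, output_tok, 0.105, 0.21)
api_daily = daily_cost(input_tok, output_tok, 3.00, 15.00)

# One on-demand A100 left running all month is billed whether busy or idle.
dedicated_gpu_monthly = 1.29 * HOURS_PER_MONTH
monthly_tokens_m = (input_tok + output_tok) * DAYS_PER_MONTH / 1_000_000
all_in_per_million = dedicated_gpu_monthly / monthly_tokens_m

print(f"Self-host, usage-based: ${self_host_daily:.2f}/day")
print(f"Hosted API:             ${api_daily:.2f}/day")
print(f"Dedicated GPU, all-in:  ${all_in_per_million:.2f} per 1M tokens")
```

The gap between the usage-based figure and the all-in figure is the price of idle capacity at this traffic level.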

Utilization is the trap

Most optimistic serving spreadsheets assume the GPU is busy all the time. Real systems have traffic valleys, cold starts, long-tail prompts, failed requests, overprovisioning for p95 latency, and reserved headroom for bursts.

Low utilization is not always bad. You may choose idle capacity to keep latency stable. The mistake is pretending that idle capacity is free.

The hourly bill is fixed for the window. The denominator is how much useful work you get out of it.
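
To see the denominator at work, compare what a GPU actually served against what it could have served. The sketch below uses a hypothetical 24-hour window with made-up traffic (100M useful output tokens) on the illustrative $1.29/hr GPU with a 2,400 tok/s benchmark peak.

```python
SECONDS_PER_HOUR = 3600


def effective_utilization(useful_tokens: float, peak_tokens_per_sec: float, hours: float) -> float:
    """Fraction of benchmark capacity that turned into useful tokens."""
    capacity = peak_tokens_per_sec * SECONDS_PER_HOUR * hours
    return useful_tokens / capacity


def window_cost_per_million(hourly_rate: float, hours: float, useful_tokens: float) -> float:
    """The whole bill for the window divided by the useful tokens it produced."""
    return hourly_rate * hours / useful_tokens * 1_000_000


util = effective_utilization(100_000_000, 2_400, 24)
cost = window_cost_per_million(1.29, 24, 100_000_000)
print(f"utilization {util:.0%}, ${cost:.2f} per 1M useful tokens")
```

Measuring utilization from served tokens, rather than assuming it, keeps the spreadsheet honest about traffic valleys and headroom.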

Input and output tokens cost differently

A single blended "tokens per second" number hides the shape of the work. Long prompts create prefill work and KV cache memory pressure. Long answers create decode work over time. Some providers price input and output differently because the serving cost profile is different.

Your internal model should separate input tokens, output tokens, cached input tokens, rejected requests, and retried requests. A prompt-template change that adds 2,000 hidden tokens can hurt cost before anyone notices the product got slower.
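
A minimal sketch of that separation follows. The per-category rates are placeholders (the cached-input rate in particular is invented), and the class and field names are illustrative. Plug in rates derived from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int = 0         # uncached prompt tokens (prefill work)
    cached_input_tokens: int = 0  # prompt tokens served from a prefix cache
    output_tokens: int = 0        # generated tokens (decode work)


# Hypothetical $/1M rates for illustration only.
RATES = {"input": 0.105, "cached_input": 0.02, "output": 0.21}


def usage_cost(usage: TokenUsage, rates: dict = RATES) -> float:
    """Dollar cost of one usage bucket."""
    return (
        usage.input_tokens * rates["input"]
        + usage.cached_input_tokens * rates["cached_input"]
        + usage.output_tokens * rates["output"]
    ) / 1_000_000


useful = TokenUsage(input_tokens=7_500_000, output_tokens=2_500_000)
retried = TokenUsage(input_tokens=300_000, output_tokens=100_000)  # work redone after failures
template_bloat = TokenUsage(input_tokens=2_000 * 5_000)            # 2,000 hidden tokens x 5,000 requests

for label, bucket in [("useful", useful), ("retried", retried), ("template bloat", template_bloat)]:
    print(f"{label}: ${usage_cost(bucket):.2f}/day")
```

Keeping retries and template overhead in their own buckets makes a silent prompt change show up as a line item instead of a mystery increase.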

Add the cost of reliability

Production serving needs more than the happy path. Add headroom for retries, health checks, rolling deploys, replicas in another zone, monitoring, logs, and the capacity you reserve for traffic spikes. Add engineering time when comparing against an API. People are not free infrastructure.
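
A rough way to keep those items visible is to add them up explicitly. Every input below is a made-up placeholder (replica count, overhead fraction, engineering hours, and hourly rate are assumptions, not recommendations). The point is the shape of the sum.

```python
HOURS_PER_MONTH = 720


def monthly_serving_cost(
    gpu_hourly_rate: float,
    replicas: int,
    overhead_fraction: float,  # monitoring, logs, networking as a share of the GPU bill
    engineer_hours: float,
    engineer_hourly_cost: float,
) -> dict:
    """Break a monthly self-hosting bill into GPU, overhead, and people."""
    gpu = gpu_hourly_rate * HOURS_PER_MONTH * replicas
    overhead = gpu * overhead_fraction
    people = engineer_hours * engineer_hourly_cost
    return {"gpu": gpu, "overhead": overhead, "people": people,
            "total": gpu + overhead + people}


# Hypothetical: two replicas in different zones, 15% overhead, 10 engineer-hours at $100/hr.
bill = monthly_serving_cost(1.29, replicas=2, overhead_fraction=0.15,
                            engineer_hours=10, engineer_hourly_cost=100)
for item, dollars in bill.items():
    print(f"{item}: ${dollars:,.2f}")
```

With these placeholder inputs the total passes $3,000 a month, well above the single-GPU figure a benchmark would suggest.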

This does not mean self-hosting is bad. It means the decision should include all the work the hosted provider used to do for you.

Engineering reality

The easiest way to lie with cost per token is to benchmark one fully loaded GPU for five minutes, then apply that number to a service with uneven traffic, retries, deploy headroom, and on-call requirements.

Use cost by route

Average cost across the whole product is useful for finance, but it is too blurry for engineering. Break cost down by route or feature. A support summary, agent planning step, code review, title generator, and voice response can have completely different token shapes.

Route-level cost tells you where to optimize. You might trim context on one feature, cap output on another, route easy cases to a smaller model, or keep a hosted API for a rare expensive workflow.
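
If your request logs carry a route label and token counts, a route-level breakdown takes only a few lines. The sketch below uses invented log rows and this lesson's illustrative $/1M rates. The tuple layout is an assumption about your logging, not a standard format.

```python
from collections import defaultdict


def cost_by_route(requests, input_rate_per_m: float, output_rate_per_m: float) -> dict:
    """Sum cost per route from (route, input_tokens, output_tokens) rows, most expensive first."""
    totals = defaultdict(float)
    for route, input_tokens, output_tokens in requests:
        totals[route] += (
            input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
        ) / 1_000_000
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical log rows.
sample_requests = [
    ("support_summary", 6_000, 400),
    ("title_generator", 300, 20),
    ("code_review", 12_000, 1_500),
    ("support_summary", 5_000, 300),
]

for route, dollars in cost_by_route(sample_requests, 0.105, 0.21).items():
    print(f"{route}: ${dollars:.6f}")
```

Sorting by cost puts the routes worth optimizing at the top, whatever their request count.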

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How do you estimate cost per 1 million tokens from an hourly GPU bill?
Walk through the A100 worked example: what is the realistic $/1M at 70% utilization?
How does spot pricing change the sensitivity table?
Why does MoE compute cost track active experts rather than total parameters, and what still scales with the total?
Why should input and output tokens be tracked separately?
What production costs are missing from a simple GPU benchmark?

Quick check

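If utilization drops by half and the hourly bill stays the same, what happens to cost per token?
It roughly doubles
It roughly halves
It stays the same

Why break cost down by route instead of using one product-wide average?
It hides noisy feature differences
It points optimization at the routes driving the bill
It only matters to finance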