Lesson 04

API vs self-hosting break-even

Self-hosting is tempting when the API bill grows. Break-even is real, but the math has to include quality, operations, utilization, reliability, and the cost of being wrong.

The one idea

The break-even point is not only token volume. Self-hosting wins when total cost, quality, latency, privacy, and operational risk beat the hosted API for the workload you actually have.

Start with the two cost curves

A hosted API usually has a low fixed cost and a cost that grows with tokens. Self-hosting usually has a higher fixed cost and a lower marginal cost when hardware is well used. That creates the classic break-even picture.

The simple version looks like this:

hosted API monthly cost =
  monthly input tokens * input price
  + monthly output tokens * output price

self-hosted monthly cost =
  hardware + cloud infrastructure + engineering + monitoring + risk buffer

The danger is treating this as pure arithmetic. If the self-hosted model is worse, slower, less reliable, or harder to operate, a lower token price may still lose.

Sample numbers

Suppose 2 million requests per month, 1,200 input tokens and 400 output tokens each, API prices of $3 / 1M input and $15 / 1M output, and self-host fixed cost of $6,000/month with $0.40 / 1M tokens variable:

Hosted API: input $7,200 + output $12,000 = $19,200 / month
Self-hosted: $6,000 fixed + (3.2B tokens × $0.40 / 1M) = $6,000 + $1,280 = $7,280 / month

At this volume the API path is more expensive on paper, but the crossover depends on your real fixed cost (engineering, replicas, idle headroom) and whether the open model passes evals. Break-even request volume for these prices: about 670k requests/month (where the two lines cross).

Here is the arithmetic part on its own. Set your traffic and prices and watch where the two curves cross. The hosted line starts at zero and climbs with every token. The self-hosted line starts high (the fixed cost) and climbs slowly. Below the crossover the API is cheaper, above it self-hosting is, assuming the quality and operational caveats in the rest of this lesson hold.

Line item	Formula	Example value
Monthly requests	—	2,000,000
Input tokens / request	—	1,200
Output tokens / request	—	400
API input $/1M	—	$3.00
API output $/1M	—	$15.00
API monthly cost	req × (in×pin + out×pout) / 1M	$19,200
Self-host fixed $/mo	GPU + infra + eng share	$6,000
Self-host $/1M tokens	from L03 benchmark	$0.40
Self-host monthly cost	fixed + req×(in+out)×p / 1M	$7,280
Break-even requests	fixed / (API_per_req − self_per_req)	≈ 670,000 / mo

Copy into Google Sheets: put inputs in column B, reference them in the cost formulas, then plot monthly volume on the x-axis and both cost lines on the y-axis.

Cached input pricing on APIs

Hosted providers increasingly discount repeated input prefixes. That changes break-even math when your system prompt, tool definitions, or RAG context are stable across requests. Structure prompts as static prefix + volatile suffix to maximize cache hits (see Context budgeting).

Illustrative 2025-style tiers (check current pages before budgeting):

Provider / tier	Fresh input	Cached input	Notes
OpenAI GPT-4o class	≈ $2.50 / 1M	≈ $1.25 / 1M (50% off)	Automatic on repeated prefixes above a minimum length
Anthropic Claude	List input price	≈ 90% cheaper on cache read	Cache write costs more; best for stable long system blocks

If 80% of your 1,200 input tokens hit cache at half price, effective input cost drops materially and the API line in the chart flattens. Re-run break-even with your measured cache-hit rate, not list input price.

Quality can erase the savings

Hosted models and open models may not be interchangeable. If a cheaper self-hosted model needs longer prompts, more retries, a larger fallback rate, or more human review, the real cost rises. If it answers poorly, the cost model is a distraction.

Run task evals before price evals. Compare accuracy, refusal behavior, schema validity, latency, token count, and escalation rate. Then price the passing systems.

Common mistake

Do not compare a top hosted model against a smaller open model only on token price. Compare the full workflow cost required to reach the same product quality.

Volume helps only if traffic is usable

High monthly token volume does not guarantee self-hosting wins. The traffic must fit the hardware well. Smooth, predictable traffic can keep GPUs busy. Spiky traffic either creates queues or forces you to pay for idle headroom.

Look at hourly and daily shape, not just monthly totals. A workload with huge weekday bursts and quiet nights can be worse for self-hosting than a smaller workload that runs steadily.

The same monthly token count can produce different hardware economics.

Know the non-cost reasons

Sometimes self-hosting is not mainly about price. You may need data residency, custom models, offline operation, strict logging control, lower variance in latency, or a deployment boundary that a hosted API cannot provide.

Those reasons are valid. Just write them down separately from the cost case. Mixing them together makes the decision hard to audit later.

Use a staged decision

The strongest path is usually staged. Start with a hosted API to prove the product. Add tracing and token accounting early. When a feature gets enough stable traffic, run an open-model eval. If quality passes, shadow self-hosted inference. If shadow numbers pass, route a small slice. Expand only after latency, quality, and cost hold up under real traffic.

This avoids the worst version of self-hosting: buying complexity before the product has a stable workload.

Engineering reality

Break-even is a moving target. API prices change, open models improve, GPU supply changes, traffic shifts, and product prompts grow. Revisit the decision on a schedule instead of treating it as a one-time migration.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

At 2M req/mo with the sample prices, which path is cheaper on paper?
How does cached input pricing change the API cost line?
How can lower model quality erase infrastructure savings?
Why does spiky traffic hurt self-hosting economics?
What non-cost reasons can justify self-hosting?

Quick check

Only the listed token price
The end-to-end passing workflow cost
Only maximum GPU memory

It gives traffic-shaped latency, quality, and cost data without user exposure
It replaces evaluation
It always cuts cost immediately