API vs self-hosting break-even
Self-hosting is tempting when the API bill grows. Break-even is real, but the math has to include quality, operations, utilization, reliability, and the cost of being wrong.
The break-even point is not only token volume. Self-hosting wins when total cost, quality, latency, privacy, and operational risk beat the hosted API for the workload you actually have.
Start with the two cost curves
A hosted API usually has a low fixed cost and a cost that grows with tokens. Self-hosting usually has a higher fixed cost and a lower marginal cost when hardware is well used. That creates the classic break-even picture.
The simple version looks like this:
hosted API monthly cost =
monthly input tokens * input price
+ monthly output tokens * output price
self-hosted monthly cost =
hardware + cloud infrastructure + engineering + monitoring + risk buffer
The danger is treating this as pure arithmetic. If the self-hosted model is worse, slower, less reliable, or harder to operate, a lower token price may still lose.
Sample numbers
Suppose 2 million requests per month, 1,200 input tokens and 400 output tokens each, API prices of $3 / 1M input and $15 / 1M output, and self-host fixed cost of $6,000/month with $0.40 / 1M tokens variable:
- Hosted API: input $7,200 + output $12,000 = $19,200 / month
- Self-hosted: $6,000 fixed + (3.2B tokens × $0.40 / 1M) = $6,000 + $1,280 = $7,280 / month
At this volume the API path is more expensive on paper, but the crossover depends on your real fixed cost (engineering, replicas, idle headroom) and whether the open model passes evals. Break-even request volume for these prices: about 670k requests/month (where the two lines cross).
Here is the arithmetic part on its own. Set your traffic and prices and watch where the two curves cross. The hosted line starts at zero and climbs with every token. The self-hosted line starts high (the fixed cost) and climbs slowly. Below the crossover the API is cheaper, above it self-hosting is, assuming the quality and operational caveats in the rest of this lesson hold.
| Line item | Formula | Example value |
|---|---|---|
| Monthly requests | — | 2,000,000 |
| Input tokens / request | — | 1,200 |
| Output tokens / request | — | 400 |
| API input $/1M | — | $3.00 |
| API output $/1M | — | $15.00 |
| API monthly cost | req × (in×pin + out×pout) / 1M | $19,200 |
| Self-host fixed $/mo | GPU + infra + eng share | $6,000 |
| Self-host $/1M tokens | from L03 benchmark | $0.40 |
| Self-host monthly cost | fixed + req×(in+out)×p / 1M | $7,280 |
| Break-even requests | fixed / (API_per_req − self_per_req) | ≈ 670,000 / mo |
Copy into Google Sheets: put inputs in column B, reference them in the cost formulas, then plot monthly volume on the x-axis and both cost lines on the y-axis.
Cached input pricing on APIs
Hosted providers increasingly discount repeated input prefixes. That changes break-even math when your system prompt, tool definitions, or RAG context are stable across requests. Structure prompts as static prefix + volatile suffix to maximize cache hits (see Context budgeting).
Illustrative 2025-style tiers (check current pages before budgeting):
| Provider / tier | Fresh input | Cached input | Notes |
|---|---|---|---|
| OpenAI GPT-4o class | ≈ $2.50 / 1M | ≈ $1.25 / 1M (50% off) | Automatic on repeated prefixes above a minimum length |
| Anthropic Claude | List input price | ≈ 90% cheaper on cache read | Cache write costs more; best for stable long system blocks |
If 80% of your 1,200 input tokens hit cache at half price, effective input cost drops materially and the API line in the chart flattens. Re-run break-even with your measured cache-hit rate, not list input price.
Quality can erase the savings
Hosted models and open models may not be interchangeable. If a cheaper self-hosted model needs longer prompts, more retries, a larger fallback rate, or more human review, the real cost rises. If it answers poorly, the cost model is a distraction.
Run task evals before price evals. Compare accuracy, refusal behavior, schema validity, latency, token count, and escalation rate. Then price the passing systems.
Do not compare a top hosted model against a smaller open model only on token price. Compare the full workflow cost required to reach the same product quality.
Volume helps only if traffic is usable
High monthly token volume does not guarantee self-hosting wins. The traffic must fit the hardware well. Smooth, predictable traffic can keep GPUs busy. Spiky traffic either creates queues or forces you to pay for idle headroom.
Look at hourly and daily shape, not just monthly totals. A workload with huge weekday bursts and quiet nights can be worse for self-hosting than a smaller workload that runs steadily.
Know the non-cost reasons
Sometimes self-hosting is not mainly about price. You may need data residency, custom models, offline operation, strict logging control, lower variance in latency, or a deployment boundary that a hosted API cannot provide.
Those reasons are valid. Just write them down separately from the cost case. Mixing them together makes the decision hard to audit later.
Use a staged decision
The strongest path is usually staged. Start with a hosted API to prove the product. Add tracing and token accounting early. When a feature gets enough stable traffic, run an open-model eval. If quality passes, shadow self-hosted inference. If shadow numbers pass, route a small slice. Expand only after latency, quality, and cost hold up under real traffic.
This avoids the worst version of self-hosting: buying complexity before the product has a stable workload.
Break-even is a moving target. API prices change, open models improve, GPU supply changes, traffic shifts, and product prompts grow. Revisit the decision on a schedule instead of treating it as a one-time migration.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- At 2M req/mo with the sample prices, which path is cheaper on paper?
- How does cached input pricing change the API cost line?
- How can lower model quality erase infrastructure savings?
- Why does spiky traffic hurt self-hosting economics?
- What non-cost reasons can justify self-hosting?
Quick check
- Only the listed token price
- The end-to-end passing workflow cost
- Only maximum GPU memory
- It gives traffic-shaped latency, quality, and cost data without user exposure
- It replaces evaluation
- It always cuts cost immediately