Lesson 04

API vs self-hosting break-even

Self-hosting is tempting when the API bill grows. Break-even is real, but the math has to include quality, operations, utilization, reliability, and the cost of being wrong.

The one idea

Break-even depends on more than token volume. Self-hosting wins when total cost, quality, latency, privacy, and operational risk beat the hosted API for the workload you actually have.

Start with the two cost curves

A hosted API usually has a low fixed cost and a variable cost that grows with token volume. Self-hosting usually has a higher fixed cost and a lower marginal cost per token, but only when the hardware is well utilized. Plotted against volume, the two curves cross at the classic break-even point.

The simple version looks like this:

hosted API monthly cost =
  monthly input tokens * input price
  + monthly output tokens * output price

self-hosted monthly cost =
  hardware + cloud infrastructure + engineering + monitoring + risk buffer
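As a rough illustration, the two formulas above can be sketched in Python. Every number below is a made-up placeholder, not a real price quote, and the function names are hypothetical. The sketch also assumes the self-hosted cost stays flat for the month (traffic fits the provisioned capacity) while the hosted cost scales linearly with tokens.

```python
def hosted_monthly_cost(input_tokens, output_tokens,
                        input_price_per_m, output_price_per_m):
    """Hosted API cost: token counts times prices quoted per million tokens."""
    return ((input_tokens / 1e6) * input_price_per_m
            + (output_tokens / 1e6) * output_price_per_m)


def self_hosted_monthly_cost(hardware, cloud_infra, engineering,
                             monitoring, risk_buffer):
    """Self-hosted cost: the sum of the monthly line items from the formula."""
    return hardware + cloud_infra + engineering + monitoring + risk_buffer


def break_even_scale(hosted_cost, self_hosted_cost):
    """Factor by which traffic must grow before the hosted bill equals the
    self-hosted bill. Assumes the self-hosted cost is fixed and the hosted
    cost is linear in tokens, which is a simplification."""
    if hosted_cost <= 0:
        raise ValueError("hosted_cost must be positive")
    return self_hosted_cost / hosted_cost


# Placeholder workload: 2B input tokens and 500M output tokens per month,
# at hypothetical prices of $3 and $15 per million tokens.
hosted = hosted_monthly_cost(2e9, 5e8, 3.0, 15.0)

# Placeholder monthly line items, in dollars.
self_hosted = self_hosted_monthly_cost(
    hardware=9_000, cloud_infra=2_000, engineering=12_000,
    monitoring=1_000, risk_buffer=3_000,
)

print(hosted)                                  # 13500.0
print(self_hosted)                             # 27000
print(break_even_scale(hosted, self_hosted))   # 2.0
```

With these placeholder numbers, traffic would have to double before the hosted bill matches the self-hosted one, and that is before quality or utilization enter the picture.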

The danger is treating this as pure arithmetic. If the self-hosted model is worse, slower, less reliable, or harder to operate, a lower per-token price can still lose overall.

Quality can erase the savings

Hosted models and open models may not be interchangeable. If a cheaper self-hosted model needs longer prompts, more retries, a higher fallback rate, or more human review, the real cost rises. If it answers poorly, the cost comparison is moot.

Run task evals before price evals. Compare accuracy, refusal behavior, schema validity, latency, token count, and escalation rate. Then price the passing systems.

Common mistake

Do not compare a top hosted model against a smaller open model only on token price. Compare the full workflow cost required to reach the same product quality.
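One way to make "full workflow cost" concrete is to price each resolved task rather than each token. The sketch below is a simplified model under stated assumptions: attempts are independent, a failed attempt is retried up to a limit, and anything still failing is escalated to a fallback model or human review at a flat cost. The class, the function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    cost_per_attempt: float   # model cost of one attempt, in dollars
    pass_rate: float          # fraction of attempts that pass the task eval
    max_attempts: int         # attempts allowed before escalating
    escalation_cost: float    # cost of the fallback model or human review


def cost_per_resolved_task(w: Workflow) -> float:
    """Expected cost to resolve one task, including retries and escalation."""
    fail = 1.0 - w.pass_rate
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_attempts = sum(fail ** k for k in range(w.max_attempts))
    # The task escalates only if every allowed attempt failed.
    escalation_rate = fail ** w.max_attempts
    return (expected_attempts * w.cost_per_attempt
            + escalation_rate * w.escalation_cost)


# Placeholder numbers: the self-hosted model is cheaper per attempt
# but passes the eval less often.
hosted = Workflow(cost_per_attempt=0.010, pass_rate=0.98,
                  max_attempts=2, escalation_cost=0.50)
self_hosted = Workflow(cost_per_attempt=0.004, pass_rate=0.80,
                       max_attempts=2, escalation_cost=0.50)

print(f"{cost_per_resolved_task(hosted):.4f}")       # 0.0104
print(f"{cost_per_resolved_task(self_hosted):.4f}")  # 0.0248
```

In this made-up case the model that is 60% cheaper per attempt ends up more than twice as expensive per resolved task, because escalations dominate the bill.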

Volume helps only if traffic is usable

High monthly token volume does not guarantee self-hosting wins. The traffic must fit the hardware well. Smooth, predictable traffic can keep GPUs busy. Spiky traffic either creates queues or forces you to pay for idle headroom.

Look at hourly and daily shape, not just monthly totals. A workload with huge weekday bursts and quiet nights can be worse for self-hosting than a smaller workload that runs steadily.

The same monthly token count can produce very different hardware economics depending on when the tokens arrive.
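The effect of traffic shape can be sketched with two made-up daily profiles that carry the same total tokens. The sketch assumes capacity is provisioned for the peak hour (no queuing), that each GPU sustains a fixed token throughput, and that GPUs are billed for every hour whether busy or idle. The throughput and price figures are placeholders, not benchmarks.

```python
import math


def gpus_for_peak(hourly_tokens, tokens_per_gpu_hour):
    """GPUs needed to serve the busiest hour without queuing."""
    return max(1, math.ceil(max(hourly_tokens) / tokens_per_gpu_hour))


def utilization(hourly_tokens, tokens_per_gpu_hour):
    """Share of provisioned token capacity that is actually used."""
    gpus = gpus_for_peak(hourly_tokens, tokens_per_gpu_hour)
    capacity = gpus * tokens_per_gpu_hour * len(hourly_tokens)
    return sum(hourly_tokens) / capacity


def cost_per_million_tokens(hourly_tokens, tokens_per_gpu_hour, gpu_hour_price):
    """Effective GPU cost per million tokens served, idle hours included."""
    total = sum(hourly_tokens)
    if total <= 0:
        raise ValueError("traffic profile has no tokens")
    gpus = gpus_for_peak(hourly_tokens, tokens_per_gpu_hour)
    gpu_bill = gpus * len(hourly_tokens) * gpu_hour_price
    return gpu_bill / (total / 1e6)


TOKENS_PER_GPU_HOUR = 5_000_000   # placeholder throughput
GPU_HOUR_PRICE = 2.0              # placeholder price in dollars

# Both profiles carry 240M tokens per day.
steady = [10_000_000] * 24                 # flat traffic around the clock
spiky = [40_000_000] * 6 + [0] * 18        # six busy hours, then silence

print(utilization(steady, TOKENS_PER_GPU_HOUR))   # 1.0
print(utilization(spiky, TOKENS_PER_GPU_HOUR))    # 0.25
print(cost_per_million_tokens(steady, TOKENS_PER_GPU_HOUR, GPU_HOUR_PRICE))  # 0.4
print(cost_per_million_tokens(spiky, TOKENS_PER_GPU_HOUR, GPU_HOUR_PRICE))   # 1.6
```

Under these assumptions the spiky profile needs four times the hardware for the same daily volume, so its effective cost per token is four times higher. Allowing queues or autoscaling would narrow the gap, at the price of latency or added operational complexity.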

Know the non-cost reasons

Sometimes self-hosting is not mainly about price. You may need data residency, custom models, offline operation, strict logging control, lower variance in latency, or a deployment boundary that a hosted API cannot provide.

Those reasons are valid. Just write them down separately from the cost case. Mixing them together makes the decision hard to audit later.

Use a staged decision

The strongest path is usually staged. Start with a hosted API to prove the product. Add tracing and token accounting early. When a feature gets enough stable traffic, run an open-model eval. If quality passes, shadow self-hosted inference. If shadow numbers pass, route a small slice. Expand only after latency, quality, and cost hold up under real traffic.

This avoids the worst version of self-hosting: buying complexity before the product has a stable workload.
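The staged path can be written down as an ordered list of gates, where each stage is entered only if its entry check passed and every earlier one did too. This is a minimal sketch: the stage names and the `furthest_stage` helper are hypothetical, and in practice each boolean would come from an eval report or a dashboard rather than a hard-coded value.

```python
# Stages in the order described above. Entry conditions:
#   open_model_eval: the feature has enough stable traffic
#   shadow:          the open model passed the task eval
#   small_slice:     shadow latency, quality, and cost numbers passed
#   expand:          the numbers held up under real traffic
STAGES = ["hosted_api", "open_model_eval", "shadow", "small_slice", "expand"]


def furthest_stage(gates):
    """Return the last stage whose entry check, and all earlier ones, passed.

    `gates` maps a stage name to True or False. A missing entry counts as
    not passed, so the default is to stay on the hosted API.
    """
    reached = STAGES[0]
    for stage in STAGES[1:]:
        if not gates.get(stage, False):
            break
        reached = stage
    return reached


# Example: quality passed, but the shadow numbers did not.
gates = {
    "open_model_eval": True,
    "shadow": True,
    "small_slice": False,
    "expand": True,   # ignored, because an earlier gate failed
}
print(furthest_stage(gates))   # shadow
```

Keeping the gates ordered makes it hard to skip a step: a passing later check cannot compensate for a failed earlier one.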

Engineering reality

Break-even is a moving target. API prices change, open models improve, GPU supply changes, traffic shifts, and product prompts grow. Revisit the decision on a schedule instead of treating it as a one-time migration.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why is token volume alone not enough for a self-hosting decision?
How can lower model quality erase infrastructure savings?
Why does spiky traffic hurt self-hosting economics?
What non-cost reasons can justify self-hosting?

Quick check

Only the listed token price
The end-to-end passing workflow cost
Only maximum GPU memory

It gives traffic-shaped latency, quality, and cost data without user exposure
It replaces evaluation
It always cuts cost immediately