Choose an LLM serving stack
An LLM serving stack is not just a model behind an HTTP endpoint. It is the whole path from weights to tokens, including scheduling, hardware, tracing, release safety, and cost control.
Choose the stack from the workload backward. Prompt length, output length, concurrency, privacy, latency target, and cost target should drive runtime and hardware choices.
The stack has layers
A serving decision usually starts with "which model should we run?" That matters, but it is only the first layer. The production system also needs a runtime that executes the model, a scheduler that decides which requests run together, hardware that fits the memory and throughput needs, an API surface, observability, and a rollout path.
If one layer is wrong, the rest of the stack pays for it. A great model on a runtime with weak batching can waste a GPU. A fast runtime with no request-level tracing leaves you guessing during incidents. A cheap instance can become expensive if it sits half empty all day.
Start with workload shape
The first question is not "which runtime is fastest?" It is "what traffic do we need to serve?" A chat assistant with long history stresses prefill and KV cache memory. A coding assistant with long outputs stresses decode. A classifier with tiny outputs may care more about request overhead and batching than raw tokens per second.
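To make the KV cache pressure concrete, here is a rough back-of-the-envelope estimate. It assumes a standard transformer that caches one key and one value vector per token per layer. The model dimensions are illustrative placeholders, not the config of any specific model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    Each token stores one key and one value vector per layer,
    hence the factor of 2. bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
one_chat = kv_cache_bytes(8_000, LAYERS, KV_HEADS, HEAD_DIM)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{one_chat / 2**30:.2f} GiB for one 8,000-token conversation")
```

With these assumed dimensions, the cache costs 128 KiB per token, so a single 8,000-token conversation holds just under 1 GiB of GPU memory before any batching. Multiply by expected concurrency to see why long histories constrain hardware choice.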
Write down the expected input tokens, output tokens, requests per second, concurrency, p95 latency target, privacy requirement, and availability target. Those numbers narrow the choices quickly.
- Long input, short output: prioritize prefill throughput, prefix caching, and prompt trimming.
- Short input, long output: prioritize decode throughput, streaming, and output limits.
- Bursty traffic: prioritize queue control, autoscaling, and hosted capacity.
- Private or regulated data: prioritize deployment boundary, logs, access control, and encryption.
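The workload sheet and the four rules above can be sketched as a small data structure plus a lookup. The field names and the thresholds (`long_tokens`, `burst_ratio`) are assumptions made for illustration. Tune them to your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    input_tokens: int      # typical prompt length
    output_tokens: int     # typical completion length
    peak_rps: float        # requests per second at peak
    mean_rps: float        # requests per second on average
    p95_latency_s: float   # latency target in seconds
    regulated_data: bool   # private or regulated inputs


def priorities(w: Workload, long_tokens: int = 2_000,
               burst_ratio: float = 3.0) -> list[str]:
    """Map a workload to the priorities listed in the lesson.

    long_tokens and burst_ratio are illustrative cutoffs, not standards.
    """
    found = []
    long_in = w.input_tokens >= long_tokens
    long_out = w.output_tokens >= long_tokens
    if long_in and not long_out:
        found.append("prefill throughput, prefix caching, prompt trimming")
    if long_out and not long_in:
        found.append("decode throughput, streaming, output limits")
    if w.mean_rps > 0 and w.peak_rps / w.mean_rps >= burst_ratio:
        found.append("queue control, autoscaling, hosted capacity")
    if w.regulated_data:
        found.append("deployment boundary, logs, access control, encryption")
    return found


# Example: a support assistant with long history and spiky traffic.
support_bot = Workload(input_tokens=6_000, output_tokens=300,
                       peak_rps=20, mean_rps=4,
                       p95_latency_s=2.0, regulated_data=True)
for item in priorities(support_bot):
    print("-", item)
```

A workload can match several rows at once. The example matches the long-input, bursty, and regulated-data rules, which is already enough to rule out some operating models.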
Pick the operating model
There are four common operating models. You can call a hosted API, run on serverless GPU infrastructure, deploy on managed inference, or operate your own serving fleet. Each one moves different work onto your team.
Hosted APIs are fast to integrate and remove most infrastructure work. Serverless GPU platforms (Modal, Baseten, Replicate) sit in the middle: you bring a container or model artifact, they handle scaling and cold starts, and you pay per GPU-second instead of reserving machines. Managed inference gives more model and deployment control while still hiding some operations. Self-hosting gives the most control over weights, data boundary, and optimization, but it also gives you capacity planning, upgrades, incident response, and idle hardware risk.
Serverless GPU is a good fit when traffic is bursty, the team lacks a platform group, and you want open weights without owning a fleet. Watch cold-start latency on large models and egress costs on big downloads. Before committing, compare total cost at your actual duty cycle against a reserved A100 and against a hosted API.
Self-hosting is not automatically cheaper. It only wins when utilization, traffic shape, model quality, operations skill, and hardware pricing line up. If GPUs sit idle or the team spends weeks debugging serving issues, the spreadsheet was incomplete.
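The duty-cycle comparison is simple arithmetic, sketched below. Every price here is a made-up placeholder, not a quote from any provider. The sketch also ignores cold starts, egress, and engineering time, which is exactly the part of the spreadsheet that tends to go missing.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def serverless_monthly(gpu_second_price: float, duty_cycle: float) -> float:
    """Serverless: pay only for the seconds the GPU is busy."""
    return gpu_second_price * 3600 * HOURS_PER_MONTH * duty_cycle


def reserved_monthly(gpu_hour_price: float) -> float:
    """Reserved: pay for every hour, busy or idle."""
    return gpu_hour_price * HOURS_PER_MONTH


def hosted_api_monthly(tokens_per_month: float,
                       price_per_million_tokens: float) -> float:
    """Hosted API: pay per token, no GPU to manage."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_duty_cycle(gpu_second_price: float,
                          gpu_hour_price: float) -> float:
    """Duty cycle above which the reserved GPU becomes cheaper."""
    return gpu_hour_price / (gpu_second_price * 3600)


# Hypothetical prices for illustration only.
SERVERLESS_PER_SECOND = 0.001   # dollars per GPU-second
RESERVED_PER_HOUR = 1.80        # dollars per GPU-hour

for duty in (0.1, 0.2, 0.5, 0.8):
    s = serverless_monthly(SERVERLESS_PER_SECOND, duty)
    r = reserved_monthly(RESERVED_PER_HOUR)
    print(f"duty {duty:.0%}: serverless ${s:,.0f} vs reserved ${r:,.0f}")

print("break-even duty cycle:",
      round(break_even_duty_cycle(SERVERLESS_PER_SECOND, RESERVED_PER_HOUR), 2))
```

Under these assumed prices the reserved GPU only wins once it is busy more than half the time. Plug in your real prices and measured utilization before drawing a conclusion.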
Decision matrix
Use this as a first-pass filter. Your workload numbers from "Start with workload shape" should override a generic row when they disagree.
| Operating model | Best when | Team needs | Cost shape |
|---|---|---|---|
| Hosted API | Product still uncertain, low volume, need frontier quality fast | Integration, prompts, evals, budgets | Low fixed, high marginal per token |
| Serverless GPU | Bursty traffic, open weights, no platform team yet | Container packaging, model serving basics | Per GPU-second; no idle reservation |
| Managed inference | Steady traffic on open models, want fewer sharp edges than DIY | Deployment manifests, routing, monitoring | Reserved capacity + platform fee |
| Self-hosted fleet | High stable volume, strict data boundary, custom optimization | GPU ops, schedulers, on-call, capacity planning | High fixed, low marginal at high utilization |
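One way to apply the matrix as a filter is to encode the "Best when" column as trait sets and count the matches. The trait names are shorthand invented for this sketch, and counting matches is a crude heuristic, not a substitute for benchmarking.

```python
# "Best when" column of the matrix, encoded as hypothetical trait labels.
MATRIX = {
    "Hosted API": {"uncertain_product", "low_volume", "needs_frontier_quality"},
    "Serverless GPU": {"bursty_traffic", "open_weights", "no_platform_team"},
    "Managed inference": {"steady_traffic", "open_weights"},
    "Self-hosted fleet": {"high_stable_volume", "strict_data_boundary",
                          "custom_optimization"},
}


def rank(traits: set[str]) -> list[tuple[str, int]]:
    """Rank operating models by how many of your traits each row matches.

    sorted() is stable, so ties keep the order of the matrix.
    """
    scores = [(name, len(traits & best_when))
              for name, best_when in MATRIX.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


my_traits = {"bursty_traffic", "open_weights", "no_platform_team"}
for name, score in rank(my_traits):
    print(f"{score} match(es): {name}")
```

Treat the top of the ranking as a shortlist. Rows with zero matches can usually be dropped, and the remaining candidates still need a benchmark on your own workload.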
Stack decision flowchart
Walk this top to bottom with your workload sheet in hand. The goal is to eliminate bad fits early, not to skip benchmarking the survivors.
Make compatibility boring
Before tuning performance, make sure the model can actually run where you want it. Check architecture support, tokenizer compatibility, weight format, quantization support, maximum context length, license, and any custom code requirements.
Compatibility issues often show up late because the first demo uses one happy-path prompt. Run a small acceptance suite before committing: short prompt, long prompt, structured output, tool-call style output, refusal case, Unicode text, and max context behavior.
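A minimal sketch of such an acceptance suite follows. It assumes you can wrap your runtime in a `generate(prompt) -> str` callable. That wrapper, the prompts, and the checks are all hypothetical. Tool-call, refusal, and max-context cases are left out because their checks depend on the model and API you pick.

```python
import json
from typing import Callable


def is_json(text: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Each case: (name, prompt, check applied to the output text).
CASES = [
    ("short prompt", "Say hello.", lambda out: len(out) > 0),
    ("long prompt", "word " * 4000 + "Summarize the text above.",
     lambda out: len(out) > 0),
    ("structured output", 'Reply with JSON only: {"ok": true}', is_json),
    ("unicode", "Repeat exactly: café 東京 🚀", lambda out: "東京" in out),
]


def run_suite(generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case; an exception counts as a failure."""
    results = {}
    for name, prompt, check in CASES:
        try:
            results[name] = bool(check(generate(prompt)))
        except Exception:
            results[name] = False
    return results


def fake_generate(prompt: str) -> str:
    """Stand-in for a real client so the sketch runs without a model."""
    if "JSON" in prompt:
        return '{"ok": true}'
    if "Repeat" in prompt:
        return prompt
    return "ok"


for name, passed in run_suite(fake_generate).items():
    print("PASS" if passed else "FAIL", name)
```

Swap `fake_generate` for a function that calls your candidate runtime. Running the same suite against every candidate stack makes compatibility gaps show up before the commitment, not after.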
Choose for rollback
The stack should let you roll back quickly. That means versioned models, stable API contracts, request tracing, prompt versioning, and a way to drain or shift traffic. LLM serving changes can fail in subtle ways: quality drops, latency spikes, token counts grow, or a parser starts failing on one customer slice.
A good stack makes those changes visible and reversible. A weak stack makes every model upgrade feel like a migration.
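One way to make those failures visible is to compare a canary release against the current baseline on the four signals named above. The metric names and thresholds in this sketch are assumptions. Real limits should come from your own latency and quality targets.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    p95_latency_s: float       # seconds
    mean_output_tokens: float  # average tokens per response
    parse_failure_rate: float  # fraction of responses the parser rejects
    eval_score: float          # offline or online quality score, higher is better


def rollback_reasons(baseline: ReleaseStats, canary: ReleaseStats,
                     max_latency_ratio: float = 1.2,
                     max_token_ratio: float = 1.3,
                     max_parse_failure_increase: float = 0.01,
                     max_eval_drop: float = 0.02) -> list[str]:
    """Return the reasons to roll back; an empty list means the canary passes.

    All thresholds are illustrative defaults.
    """
    reasons = []
    if canary.p95_latency_s > baseline.p95_latency_s * max_latency_ratio:
        reasons.append("latency spike")
    if canary.mean_output_tokens > baseline.mean_output_tokens * max_token_ratio:
        reasons.append("token count growth")
    if (canary.parse_failure_rate - baseline.parse_failure_rate
            > max_parse_failure_increase):
        reasons.append("parser failures")
    if baseline.eval_score - canary.eval_score > max_eval_drop:
        reasons.append("quality drop")
    return reasons


baseline = ReleaseStats(1.0, 200, 0.005, 0.90)
canary = ReleaseStats(1.1, 320, 0.004, 0.89)
print(rollback_reasons(baseline, canary))
```

In this example only the token growth check trips. Running the same comparison per customer slice, not just on the aggregate, is what catches the case where a parser fails for one segment while overall numbers look healthy.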
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What are the main layers of an LLM serving stack?
- What are the four common operating models for LLM serving?
- When is serverless GPU a better fit than a reserved fleet?
- Why should workload shape come before runtime choice?
- What work does self-hosting move onto your team?
- Why does rollback matter for serving, not just training?