Lesson 01

Choose an LLM serving stack

An LLM serving stack is not just a model behind an HTTP endpoint. It is the whole path from weights to tokens, including scheduling, hardware, tracing, release safety, and cost control.

The one idea

Choose the stack from the workload backward. Prompt length, output length, concurrency, privacy, latency target, and cost target should drive runtime and hardware choices.

The stack has layers

A serving decision usually starts with "which model should we run?" That matters, but it is only the first layer. The production system also needs a runtime that executes the model, a scheduler that decides which requests run together, hardware that fits the memory and throughput needs, an API surface, observability, and a rollout path.

If one layer is wrong, the rest of the stack pays for it. A great model on a runtime with weak batching can waste a GPU. A fast runtime with no request-level tracing leaves you guessing during incidents. A cheap instance can become expensive if it sits half empty all day.

A serving stack is a chain. Model quality is only one part of the production outcome.

Start with workload shape

The first question is not "which runtime is fastest?" It is "what traffic do we need to serve?" A chat assistant with long history stresses prefill and KV cache memory. A coding assistant with long outputs stresses decode. A classifier with tiny outputs may care more about request overhead and batching than raw tokens per second.
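To make the KV cache pressure from long history concrete, here is a rough sizing sketch. It assumes a standard attention cache that stores one key and one value vector per layer for every token, in 16-bit precision. The layer and head counts are illustrative placeholders, not figures for any particular model, and runtimes with paged or quantized caches will differ.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache held for one token: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, concurrent_sequences: int,
                 bytes_per_token: int) -> float:
    """Total cache size in GiB if every sequence fills its context."""
    return context_tokens * concurrent_sequences * bytes_per_token / 2**30


# Illustrative config (assumed): 32 layers, 32 KV heads, head_dim 128, 16-bit cache.
per_token = kv_cache_bytes_per_token(32, 32, 128)
print(per_token)                         # 524288 bytes, i.e. 0.5 MiB per token
print(kv_cache_gib(4096, 8, per_token))  # 16.0
```

Under these assumptions, eight concurrent chats at 4,096 tokens each need about 16 GiB for cache alone, before weights. That is why long-history workloads push on memory as well as prefill compute.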

Write down the expected input tokens, output tokens, requests per second, concurrency, p95 latency target, privacy requirement, and availability target. Those numbers narrow the choices quickly.
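One way to keep those numbers in a single place is a small record type. This is a minimal sketch: the `WorkloadSheet` name, its fields, and the sample values are hypothetical, and privacy is reduced to a single flag for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSheet:
    input_tokens: int            # expected prompt tokens per request
    output_tokens: int           # expected generated tokens per request
    requests_per_second: float
    concurrency: int             # requests in flight at the same time
    p95_latency_s: float         # latency target in seconds
    private_data: bool           # privacy requirement, simplified to a flag
    availability_target: float   # for example 0.999

    def prefill_tokens_per_second(self) -> float:
        """Prompt tokens the stack must ingest each second."""
        return self.input_tokens * self.requests_per_second

    def decode_tokens_per_second(self) -> float:
        """Generated tokens the stack must produce each second."""
        return self.output_tokens * self.requests_per_second


# Made-up numbers for a chat assistant with long history.
chat = WorkloadSheet(3000, 300, 5.0, 40, 4.0, False, 0.999)
print(chat.prefill_tokens_per_second())  # 15000.0
print(chat.decode_tokens_per_second())   # 1500.0
```

The two derived rates give a first read on whether the workload leans toward prefill or decode.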

Long input, short output: prioritize prefill throughput, prefix caching, and prompt trimming.
Short input, long output: prioritize decode throughput, streaming, and output limits.
Bursty traffic: prioritize queue control, autoscaling, and hosted capacity.
Private or regulated data: prioritize deployment boundary, logs, access control, and encryption.
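The four rules above can be sketched as a lookup. The function name and the 1,000-token cutoff for "long" are assumptions made for illustration; pick a threshold that matches your own traffic.

```python
def serving_priorities(input_tokens: int, output_tokens: int,
                       bursty: bool = False, private_data: bool = False,
                       long_threshold: int = 1000) -> list[str]:
    """Map a workload shape to the priorities listed above.

    Shapes the list does not cover (for example long input and long
    output) get no shape-specific priorities from this sketch.
    """
    long_input = input_tokens >= long_threshold
    long_output = output_tokens >= long_threshold
    priorities = []
    if long_input and not long_output:
        priorities += ["prefill throughput", "prefix caching", "prompt trimming"]
    if long_output and not long_input:
        priorities += ["decode throughput", "streaming", "output limits"]
    if bursty:
        priorities += ["queue control", "autoscaling", "hosted capacity"]
    if private_data:
        priorities += ["deployment boundary", "logs", "access control", "encryption"]
    return priorities


print(serving_priorities(4000, 100, bursty=True))
```

For a long-input, short-output, bursty workload this prints the prefill items followed by the burst items.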

Pick the operating model

There are four common operating models. You can call a hosted API, run on serverless GPU infrastructure, deploy on managed inference, or operate your own serving fleet. Each one moves different work onto your team.

Hosted APIs are fast to integrate and remove most infrastructure work. Serverless GPU platforms (Modal, Baseten, Replicate) sit between hosted APIs and managed inference: you bring a container or model artifact, they handle scaling and cold starts, and you pay per GPU-second instead of reserving machines. Managed inference gives more model and deployment control while still hiding some operations. Self-hosting gives the most control over weights, data boundary, and optimization, but it also hands you capacity planning, upgrades, incident response, and idle hardware risk.

Serverless GPU

Modal, Baseten, and similar providers charge per GPU-second with autoscaling built in. They are a good fit when traffic is bursty, the team has no platform group, and you want open weights without owning a fleet. Watch cold-start latency on large models and egress costs on large downloads. Compare total cost at your actual duty cycle against a reserved A100 and against a hosted API.
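The duty-cycle comparison is simple arithmetic. The sketch below assumes one GPU, ignores cold starts and egress, and uses placeholder prices rather than any provider's real rates. All function names are hypothetical.

```python
HOURS_PER_MONTH = 730.0  # average month length, an assumption


def serverless_monthly_cost(price_per_gpu_second: float, duty_cycle: float) -> float:
    """Cost when you pay only for the fraction of the month the GPU is busy."""
    busy_seconds = duty_cycle * HOURS_PER_MONTH * 3600
    return price_per_gpu_second * busy_seconds


def reserved_monthly_cost(price_per_gpu_hour: float) -> float:
    """Cost of a reserved GPU, which bills whether or not it is busy."""
    return price_per_gpu_hour * HOURS_PER_MONTH


def hosted_api_monthly_cost(tokens_per_month: float,
                            price_per_million_tokens: float) -> float:
    """Cost of a hosted API billed purely per token."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_duty_cycle(price_per_gpu_second: float,
                          reserved_price_per_gpu_hour: float) -> float:
    """Duty cycle above which the reserved GPU becomes the cheaper option."""
    return reserved_price_per_gpu_hour / (price_per_gpu_second * 3600)


# Placeholder prices: 0.001 per GPU-second serverless, 1.80 per GPU-hour reserved.
print(round(break_even_duty_cycle(0.001, 1.80), 3))
print(serverless_monthly_cost(0.001, 0.2) < reserved_monthly_cost(1.80))
```

With these placeholder prices the break-even duty cycle is one half: below it serverless is cheaper, above it the reserved GPU is. The hosted API figure depends on your token volume, so compute it from the same workload sheet.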

Engineering reality

Self-hosting is not automatically cheaper. It wins only when utilization, traffic shape, model quality, operations skill, and hardware pricing line up. If GPUs sit idle or the team spends weeks debugging serving issues, the spreadsheet was incomplete.
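A quick way to see the utilization effect is to compute cost per million tokens at different utilization levels. This sketch assumes a fixed hourly GPU price and a fixed peak throughput, both placeholder values, and leaves out engineering time, which the paragraph above says belongs in the spreadsheet too.

```python
def self_hosted_cost_per_million_tokens(gpu_hourly_cost: float,
                                        peak_tokens_per_second: float,
                                        utilization: float) -> float:
    """Hardware cost per million tokens at a given average utilization (0 to 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# Placeholder inputs: 2.00 per GPU-hour, 1000 tokens/s at full load.
for u in (1.0, 0.5, 0.1):
    cost = self_hosted_cost_per_million_tokens(2.0, 1000.0, u)
    print(f"utilization {u:.0%}: {cost:.2f} per million tokens")
```

Cost per token scales inversely with utilization, so a GPU that is busy one tenth of the time costs ten times as much per token as one that is always busy.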

Decision matrix

Use this as a first-pass filter. When your workload numbers from "Start with workload shape" disagree with a generic row, the numbers win.

Operating model	Best when	Team needs	Cost shape
Hosted API	Product still uncertain, low volume, need frontier quality fast	Integration, prompts, evals, budgets	Low fixed, high marginal per token
Serverless GPU	Bursty traffic, open weights, no platform team yet	Container packaging, model serving basics	Per GPU-second; no idle reservation
Managed inference	Steady traffic on open models, want fewer sharp edges than DIY	Deployment manifests, routing, monitoring	Reserved capacity + platform fee
Self-hosted fleet	High stable volume, strict data boundary, custom optimization	GPU ops, schedulers, on-call, capacity planning	High fixed, low marginal at high utilization

Stack decision flowchart

Walk this top to bottom with your workload sheet in hand. The goal is to eliminate bad fits early, not to skip benchmarking the survivors.

Privacy, volume certainty, and traffic shape narrow the stack before you benchmark runtimes.
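As a rough illustration, the same narrowing can be written as a function. The branch order below is one reading of the decision matrix (privacy first, then volume certainty, then traffic shape) and not a transcription of the original flowchart. The function name, flag names, and return strings are assumptions.

```python
def first_pass_operating_model(strict_data_boundary: bool,
                               volume_is_certain: bool,
                               traffic_is_bursty: bool,
                               high_volume_with_gpu_ops: bool) -> str:
    """Return a starting candidate to benchmark, not a final decision."""
    if strict_data_boundary:
        # The matrix lists a strict data boundary under the self-hosted fleet.
        return "self-hosted fleet"
    if not volume_is_certain:
        # Product still uncertain: keep fixed cost low.
        return "hosted API"
    if traffic_is_bursty:
        # Bursty traffic on open weights without idle reservation.
        return "serverless GPU"
    if high_volume_with_gpu_ops:
        # High stable volume and a team that can run GPUs.
        return "self-hosted fleet"
    # Steady traffic, but fewer sharp edges than DIY.
    return "managed inference"


print(first_pass_operating_model(False, True, True, False))  # serverless GPU
```

Treat the result as the first option to test against your workload sheet. It does not replace benchmarking the survivors.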

Make compatibility boring

Before tuning performance, make sure the model can actually run where you want it. Check architecture support, tokenizer compatibility, weight format, quantization support, maximum context length, license, and any custom code requirements.

Compatibility issues often show up late because the first demo uses one happy-path prompt. Run a small acceptance suite before committing: short prompt, long prompt, structured output, tool-call style output, refusal case, Unicode text, and max context behavior.
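A small acceptance suite along those lines might look like the sketch below. It assumes you can wrap your candidate stack in a `generate(prompt) -> str` callable, which is a hypothetical interface and not any specific runtime's API. The prompts and pass checks are placeholders to replace with your own cases.

```python
import json
from typing import Callable, Dict, Tuple


def _non_empty(text: str) -> bool:
    return bool(text.strip())


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def build_cases(max_context_words: int) -> Dict[str, Tuple[str, Callable[[str], bool]]]:
    """One placeholder prompt and pass check per case named above."""
    return {
        "short_prompt": ("Say hello.", _non_empty),
        "long_prompt": ("Summarize: " + "lorem ipsum " * 500, _non_empty),
        "structured_output": ('Reply with only this JSON: {"ok": true}', _is_json),
        "tool_call_style": ('Reply with only JSON like {"tool": "search", "args": {}}', _is_json),
        "refusal_case": ("<a request your policy says to decline>", _non_empty),
        "unicode_text": ("Repeat exactly: 你好 مرحبا 🙂", _non_empty),
        "max_context": ("word " * max_context_words, _non_empty),
    }


def run_acceptance(generate: Callable[[str], str],
                   max_context_words: int = 1000) -> Dict[str, bool]:
    """Run every case; an exception from the stack counts as a failure."""
    results = {}
    for name, (prompt, check) in build_cases(max_context_words).items():
        try:
            results[name] = check(generate(prompt))
        except Exception:
            results[name] = False
    return results


# Stand-in for a real client, used only to show the call shape.
def fake_generate(prompt: str) -> str:
    return '{"ok": true}'


print(run_acceptance(fake_generate))
```

The checks here only confirm that the stack returns something parseable. Real suites usually compare against expected content as well, and size the max-context case in tokens for the model under test.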

Choose for rollback

The stack should let you roll back quickly. That means versioned models, stable API contracts, request tracing, prompt versioning, and a way to drain or shift traffic. LLM serving changes can fail in subtle ways: quality drops, latency spikes, token counts grow, or a parser starts failing on one customer slice.

A good stack makes those changes visible and reversible. A weak stack makes every model upgrade feel like a migration.
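One way to make those failure modes visible is to compare a candidate release against the current baseline on a few metrics and list reasons to roll back. This is a sketch under assumed names and thresholds: the `ReleaseMetrics` fields and the default ratios are illustrative, and they presume request tracing and evals already produce these numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    eval_score: float          # higher is better, from your own eval set
    p95_latency_s: float
    mean_output_tokens: float
    parse_failure_rate: float  # fraction of responses the parser rejects


def rollback_reasons(baseline: ReleaseMetrics, candidate: ReleaseMetrics,
                     max_eval_drop: float = 0.02,
                     max_latency_ratio: float = 1.2,
                     max_token_ratio: float = 1.3,
                     max_parse_failure_increase: float = 0.01) -> list[str]:
    """Return the reasons to roll back; an empty list means keep the candidate."""
    reasons = []
    if baseline.eval_score - candidate.eval_score > max_eval_drop:
        reasons.append("quality drop")
    if candidate.p95_latency_s > baseline.p95_latency_s * max_latency_ratio:
        reasons.append("latency spike")
    if candidate.mean_output_tokens > baseline.mean_output_tokens * max_token_ratio:
        reasons.append("token count growth")
    if candidate.parse_failure_rate - baseline.parse_failure_rate > max_parse_failure_increase:
        reasons.append("parser failures")
    return reasons


baseline = ReleaseMetrics(0.80, 2.0, 300.0, 0.01)
candidate = ReleaseMetrics(0.81, 3.0, 310.0, 0.05)
print(rollback_reasons(baseline, candidate))  # ['latency spike', 'parser failures']
```

Running the same comparison per customer slice helps catch the case where a parser starts failing for only one segment of traffic.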

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What are the main layers of an LLM serving stack?
What are the four common operating models for LLM serving?
When is serverless GPU a better fit than a reserved fleet?
Why should workload shape come before runtime choice?
What work does self-hosting move onto your team?
Why does rollback matter for serving, not just training?

Quick check

Prefill cost and input token count
Only output tokens per second
Temperature and top-p settings

Because it always has the lowest latency
Because it removes infrastructure work while the product is still uncertain
Because it removes the need for observability