Lesson 01

Choose an LLM serving stack

An LLM serving stack is not just a model behind an HTTP endpoint. It is the whole path from weights to tokens, including scheduling, hardware, tracing, release safety, and cost control.

The one idea

Choose the stack from the workload backward. Prompt length, output length, concurrency, privacy, latency target, and cost target should drive runtime and hardware choices.

The stack has layers

A serving decision usually starts with "which model should we run?" That matters, but it is only the first layer. The production system also needs a runtime that executes the model, a scheduler that decides which requests run together, hardware that fits the memory and throughput needs, an API surface, observability, and a rollout path.

If one layer is wrong, the rest of the stack pays for it. A great model on a runtime with weak batching can waste a GPU. A fast runtime with no request-level tracing leaves you guessing during incidents. A cheap instance can become expensive per token served if it sits half empty all day.

A serving stack is a chain. Model quality is only one part of the production outcome.

Start with workload shape

The first question is not "which runtime is fastest?" It is "what traffic do we need to serve?" A chat assistant with long history stresses prefill and KV cache memory. A coding assistant with long outputs stresses decode. A classifier with tiny outputs may care more about request overhead and batching than raw tokens per second.
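To make "KV cache memory" concrete, here is a rough back-of-the-envelope sketch. It assumes a standard transformer that stores one key and one value vector per layer and per KV head for every cached token, with no cache quantization or paging overhead. The model shape below is hypothetical and only there to show the arithmetic.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence holding `tokens` cached tokens.

    bytes_per_value=2 assumes a 16-bit cache; adjust for other formats.
    """
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
long_chat = kv_cache_bytes(8000, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"{per_token} bytes per cached token")
print(f"{long_chat / 1e9:.2f} GB for one 8,000-token history")
```

Multiply the per-sequence figure by expected concurrency to see how quickly long histories eat into the memory left over after the weights are loaded.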

Write down the expected input tokens, output tokens, requests per second, concurrency, p95 latency target, privacy requirement, and availability target. Those numbers narrow the choices quickly.
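One way to keep those numbers honest is to write them down as a small record and derive the token rates from it. The sketch below is only an illustration: the class name, field names, and example values are made up for this lesson, and the concurrency estimate uses Little's law with p95 latency standing in for mean latency, which overestimates on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical record of the numbers a serving decision depends on."""
    input_tokens: int           # typical prompt length
    output_tokens: int          # typical completion length
    requests_per_second: float
    p95_latency_s: float        # latency target, in seconds
    private_data: bool
    availability_target: float  # e.g. 0.999

    def prefill_tokens_per_second(self) -> float:
        # Prompt tokens the stack must ingest every second.
        return self.input_tokens * self.requests_per_second

    def decode_tokens_per_second(self) -> float:
        # Output tokens the stack must generate every second.
        return self.output_tokens * self.requests_per_second

    def estimated_concurrency(self) -> float:
        # Little's law (in-flight = arrival rate x time in system), using the
        # p95 target as a conservative stand-in for mean latency.
        return self.requests_per_second * self.p95_latency_s


# Made-up numbers for a chat assistant with long history.
chat = WorkloadProfile(
    input_tokens=3000,
    output_tokens=200,
    requests_per_second=5.0,
    p95_latency_s=4.0,
    private_data=False,
    availability_target=0.999,
)
print(chat.prefill_tokens_per_second(), chat.decode_tokens_per_second())
print(chat.estimated_concurrency())
```

In this example the prefill rate is fifteen times the decode rate, which already points toward the long-input priorities rather than decode tuning.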

Long input, short output: prioritize prefill throughput, prefix caching, and prompt trimming.
Short input, long output: prioritize decode throughput, streaming, and output limits.
Bursty traffic: prioritize queue control, autoscaling, and hosted capacity.
Private or regulated data: prioritize deployment boundary, logs, access control, and encryption.
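The four rules above can be sketched as a small lookup. The thresholds here (a 4x token ratio and a 3x peak-to-mean request rate) are arbitrary assumptions chosen for illustration, not recommendations; the priority lists are taken directly from the rules.

```python
def serving_priorities(input_tokens, output_tokens,
                       peak_to_mean_rps=1.0, private_data=False):
    """Return the priorities suggested by a workload's shape.

    The 4x and 3x thresholds are illustrative assumptions only.
    """
    priorities = []
    if input_tokens >= 4 * output_tokens:
        # Long input, short output.
        priorities += ["prefill throughput", "prefix caching", "prompt trimming"]
    elif output_tokens >= 4 * input_tokens:
        # Short input, long output.
        priorities += ["decode throughput", "streaming", "output limits"]
    if peak_to_mean_rps >= 3.0:
        # Bursty traffic.
        priorities += ["queue control", "autoscaling", "hosted capacity"]
    if private_data:
        # Private or regulated data.
        priorities += ["deployment boundary", "logs", "access control", "encryption"]
    return priorities


# A retrieval-heavy assistant: long prompts, short answers.
print(serving_priorities(6000, 150))
# A bursty code generator handling regulated data.
print(serving_priorities(200, 2000, peak_to_mean_rps=5.0, private_data=True))
```

A workload can match several rules at once, which is why the function accumulates priorities instead of returning a single label.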

Pick the operating model

There are three common operating models. You can call a hosted API, run an open model on managed inference infrastructure, or operate your own serving fleet. Each one moves different work onto your team.

Hosted APIs are fast to integrate and remove most infrastructure work. Managed inference gives more model and deployment control while still hiding some operations. Self-hosting gives the most control over weights, data boundary, and optimization, but it also gives you capacity planning, upgrades, incident response, and idle hardware risk.

Engineering reality

Self-hosting is not automatically cheaper. It only wins when utilization, traffic shape, model quality, operations skill, and hardware pricing line up. If GPUs sit idle or the team spends weeks debugging serving issues, the spreadsheet was incomplete.
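A quick calculation shows how much utilization matters. The sketch below covers hardware cost only, and every number in it (GPU price, peak throughput, hosted price) is a made-up placeholder. Engineering time, redundancy, and on-call load are deliberately left out, which is exactly the kind of gap that makes a spreadsheet incomplete.

```python
def self_hosted_cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_second,
                                        utilization):
    """Hardware-only cost per million tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_hourly_usd, peak_tokens_per_second,
                           hosted_usd_per_million):
    """Utilization at which self-hosted hardware cost equals the hosted price.

    A result above 1.0 means hardware alone never beats the hosted price.
    """
    at_full_load = self_hosted_cost_per_million_tokens(
        gpu_hourly_usd, peak_tokens_per_second, 1.0)
    return at_full_load / hosted_usd_per_million


# Placeholder numbers: $2.00/hour GPU, 1,000 tokens/s peak, $2.00 per million hosted.
for u in (1.0, 0.5, 0.1):
    cost = self_hosted_cost_per_million_tokens(2.0, 1000, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
print(f"break-even utilization: {break_even_utilization(2.0, 1000, 2.0):.0%}")
```

With these placeholder inputs the same GPU is ten times more expensive per token at 10% utilization than at full load, so the comparison with a hosted price depends on how busy the fleet really is.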

Make compatibility boring

Before tuning performance, make sure the model can actually run where you want it. Check architecture support, tokenizer compatibility, weight format, quantization support, maximum context length, license, and any custom code requirements.

Compatibility issues often show up late because the first demo uses one happy-path prompt. Run a small acceptance suite before committing: short prompt, long prompt, structured output, tool-call style output, refusal case, Unicode text, and max context behavior.
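A minimal sketch of such an acceptance suite follows. It assumes you can wrap your candidate stack in a `generate(prompt) -> str` function; that function, the prompts, and the pass criteria are all hypothetical. Tool-call and refusal cases are omitted because their checks depend on the model's output format and on your policy.

```python
import json
from typing import Callable


def is_json_object(text: str) -> bool:
    """True if the text parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def non_empty(text: str) -> bool:
    return len(text.strip()) > 0


def run_acceptance_suite(generate: Callable[[str], str],
                         max_context_prompt: str) -> dict:
    """Run each case through `generate` and record pass/fail by case name."""
    cases = {
        "short prompt": ("Say hello.", non_empty),
        "long prompt": ("Summarize: " + "lorem ipsum " * 500, non_empty),
        "structured output": ('Return a JSON object with the key "ok".', is_json_object),
        "unicode text": ("Repeat exactly: naïve café 東京", lambda out: "東京" in out),
        "max context": (max_context_prompt, non_empty),
    }
    results = {}
    for name, (prompt, check) in cases.items():
        try:
            results[name] = bool(check(generate(prompt)))
        except Exception:
            # A crash or timeout in the stack counts as a failed case.
            results[name] = False
    return results


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs on its own."""
    if "JSON" in prompt:
        return '{"ok": true}'
    return prompt


results = run_acceptance_suite(fake_generate, max_context_prompt="x " * 1000)
print(results)
```

Replace `fake_generate` with a call into the real runtime and build `max_context_prompt` from the model's actual token limit; a character count is only a placeholder here.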

Choose for rollback

The stack should let you roll back quickly. That means versioned models, stable API contracts, request tracing, prompt versioning, and a way to drain or shift traffic. LLM serving changes can fail in subtle ways: quality drops, latency spikes, token counts grow, or a parser starts failing on one customer slice.

A good stack makes those changes visible and reversible. A weak stack makes every model upgrade feel like a migration.
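As a minimal sketch of "a way to drain or shift traffic", the router below sends a small fraction of requests to a new model version and can return everything to the stable version in one call. The class name, method names, and version strings are hypothetical; a real deployment would usually do this in a gateway or load balancer and record the chosen version in each request trace.

```python
import random


class ModelRouter:
    """Hypothetical weighted router between a stable and a canary model version."""

    def __init__(self, stable_version: str):
        self.stable = stable_version
        self.weights = {stable_version: 1.0}

    def start_canary(self, version: str, fraction: float) -> None:
        # Send `fraction` of traffic to the new version, the rest to stable.
        if not 0 < fraction < 1:
            raise ValueError("fraction must be between 0 and 1")
        self.weights = {self.stable: 1.0 - fraction, version: fraction}

    def rollback(self) -> None:
        # Shift all traffic back to the stable version.
        self.weights = {self.stable: 1.0}

    def pick(self, rng=random) -> str:
        # Choose a version for one request according to the current weights.
        versions = list(self.weights)
        return rng.choices(versions, weights=[self.weights[v] for v in versions], k=1)[0]


router = ModelRouter("model-v1")
router.start_canary("model-v2", fraction=0.1)
trace = {"request_id": "req-001", "model_version": router.pick()}
print(trace)

router.rollback()
print(router.pick())  # always the stable version after rollback
```

Tagging every trace with the model version is what lets you tie a quality drop or latency spike to a specific release before deciding to roll back.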

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What are the main layers of an LLM serving stack?
Why should workload shape come before runtime choice?
What work does self-hosting move onto your team?
Why does rollback matter when serving an LLM, and what makes a change reversible?

Quick check

Prefill cost and input token count
Only output tokens per second
Temperature and top-p settings

Because it always has the lowest latency
Because it removes infrastructure work while the product is still uncertain
Because it removes the need for observability