Lesson 02

vLLM vs TGI vs llama.cpp

Open LLM serving runtimes optimize for different jobs. The right choice depends on whether you need high-throughput GPU serving, low-friction deployment of Hugging Face models, or small local inference.

The one idea

Do not choose a runtime by popularity. Choose it by the workload path it makes easy: GPU batching, Hugging Face deployment, or local and edge inference.

The five mental models

vLLM is usually the first name people reach for when they want high-throughput serving for open LLMs on GPUs. Its core strengths are efficient memory management for the KV cache and scheduling that keeps many concurrent requests moving.

SGLang is a newer GPU serving runtime built around RadixAttention for prefix caching across requests, strong structured-generation support, and competitive throughput on long shared prefixes. Pick it when many requests reuse the same system prompt, tool schemas, or RAG context blocks.

Text Generation Inference, usually called TGI, is Hugging Face's production inference server. Its strengths are the model ecosystem, deployment ergonomics, streaming APIs, batching, and integration with the rest of the Hugging Face tooling.

TensorRT-LLM is NVIDIA's optimized inference stack. Its strength is kernel-level performance on NVIDIA hardware, especially H100 with FP8, and tight integration with TensorRT. Pick it when you need maximum throughput on a fixed NVIDIA fleet and can invest in the build pipeline.

llama.cpp is the local inference workhorse. Its strength is running quantized models on CPUs, Apple Silicon, consumer GPUs, and machines that are not a classic datacenter GPU fleet.

Compare the default fit

This table is intentionally simple. Real deployments have details, but the default fit gives you a good starting point.

Runtime	Good fit	Watch out for
vLLM	GPU serving with many concurrent requests, throughput pressure, and OpenAI-compatible APIs.	Runtime support can vary by model architecture, quantization path, and deployment target.
SGLang	Shared-prefix workloads (RAG, agents, tool schemas), structured output, and traffic where radix-style prefix caching pays off.	Younger ecosystem than vLLM; verify your model family and ops tooling before committing.
TGI	Teams using Hugging Face models who want a production server with familiar deployment hooks.	Make sure its batching, model support, and hardware path fit your exact workload.
TensorRT-LLM	NVIDIA datacenter GPUs, H100 FP8, latency-sensitive production with a build-and-benchmark culture.	Engine build step per model revision; less flexible for rapid model swaps than Python-first servers.
llama.cpp	Local apps, edge devices, CPU inference, Apple Silicon, and GGUF quantized models.	It is not the default answer for large multi-user GPU serving.

Feature matrix

Check the features your product actually needs before picking a winner on a generic benchmark.

Feature	vLLM	SGLang	TGI	TensorRT-LLM	llama.cpp
Multi-user GPU batching	Strong	Strong	Strong	Strong	Weak
Prefix / prompt caching	Yes (PagedAttention + prefix cache)	Strong (RadixAttention)	Yes	Yes	Limited
Structured output	Good	Strong	Good	Good	Varies
FP8 on H100	Yes	Yes	Limited	Strong	No
Local / edge	No	No	No	No	Strong
Ops flexibility	High	High	High	Medium (engine builds)	High for local

SGLang vs vLLM

Both are strong GPU servers with continuous batching and OpenAI-style APIs. The fork is workload shape, not religion.

Pick SGLang when a large fraction of input tokens are identical across requests: shared system prompts, repeated tool definitions, RAG chunks that many users hit, or agent harnesses with stable prefixes. RadixAttention is built for that reuse.
Pick vLLM when prefixes are diverse, you need the broadest model and quantization coverage today, or your team already has vLLM deployment playbooks.
Benchmark both on your real prompt distribution. Prefix-cache wins show up only when reuse is high enough to amortize cache management.
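
Before benchmarking, it can help to estimate how much prefix reuse your traffic has. The sketch below is one rough way to do that: it walks a sample of prompts through a trie and counts the tokens that repeat a prefix seen earlier. The function name `prefix_reuse_ratio` and the whitespace tokenization are assumptions for illustration. A real measurement would use the model's own tokenizer, and the result is an upper bound because it ignores cache eviction and block granularity.

```python
from typing import Iterable, Sequence


def prefix_reuse_ratio(prompts: Iterable[Sequence[str]]) -> float:
    """Fraction of prompt tokens that repeat a prefix of an earlier prompt.

    A trie of nested dicts records every prefix seen so far. Tokens that
    follow an existing path from the root count as reusable.
    """
    trie: dict = {}
    reused = 0
    total = 0
    for tokens in prompts:
        node = trie
        on_cached_path = True
        for tok in tokens:
            total += 1
            if on_cached_path and tok in node:
                reused += 1
            else:
                # First divergence: everything after this point is new work.
                on_cached_path = False
                node.setdefault(tok, {})
            node = node[tok]
    return reused / total if total else 0.0


# Hypothetical sample: two prompts share a system prompt, one does not.
system = "You are a helpful assistant ."
sample = [
    (system + " Summarize this ticket").split(),
    (system + " Translate this ticket").split(),
    "Write a haiku".split(),
]
print(f"{prefix_reuse_ratio(sample):.2f}")  # prints 0.29 (6 of 21 tokens)
```

A low ratio on real traffic suggests the prefix cache will not be the deciding factor. A high ratio is a reason to include a prefix-cache-oriented runtime in the benchmark.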

For structured generation (JSON, regex, grammar constraints), SGLang has been a common pick when constraint decoding is on the critical path. vLLM has been catching up fast; treat the matrix as a hypothesis to test, not a verdict.

Match runtime to bottleneck

If your bottleneck is multi-user GPU throughput, start with a runtime built around scheduling and KV cache efficiency. If your bottleneck is getting an open model deployed with fewer sharp edges, a server with strong model ecosystem support can save time. If your bottleneck is shipping a private local assistant, a small quantized model with llama.cpp may beat a server fleet you do not need.

There is no universal winner because the workloads are not the same. A voice agent, coding assistant, mobile summarizer, and offline desktop tool can all be "LLM serving" while needing very different tradeoffs.

Use the runtime that matches the shape of the deployment, not the one that won a benchmark on another workload.

Benchmark the same model

Comparisons get messy when each runtime is tested with a different model, quantization, prompt length, output length, batch size, or hardware target. Keep the model and traffic shape fixed before judging the runtime.
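
One way to enforce this is to describe the workload as data and check that every runtime was run against the same description. This is a minimal sketch, not a benchmarking tool. The `Workload` fields mirror the variables listed above, and the model and hardware names are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Workload:
    model: str
    quantization: str
    prompt_tokens: int
    output_tokens: int
    concurrency: int
    hardware: str


def differing_fields(runs: dict[str, Workload]) -> set[str]:
    """Names of workload fields that are not identical across all runs."""
    rows = [asdict(w) for w in runs.values()]
    return {key for row in rows for key in row if row[key] != rows[0][key]}


# Hypothetical runs: the second one quietly used a different quantization.
base = Workload("example-8b-instruct", "fp16", 1024, 256, 32, "1x example-gpu")
runs = {
    "vllm": base,
    "tgi": replace(base, quantization="int4"),
}
print(differing_fields(runs))  # prints {'quantization'}
```

An empty set means the comparison is at least apples to apples on the dimensions you recorded.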

At minimum, record time to first token, output tokens per second, total tokens per second across all users, memory use, p95 latency, error rate, and startup time. Also record whether the runtime supports your needed features: streaming, structured output constraints, adapters, tensor parallelism, quantization, metrics, and health checks.
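
Several of these numbers fall out of three timestamps per request. The sketch below assumes you have already collected them with your own load generator. `RequestTiming` and `summarize` are hypothetical names, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float        # seconds, when the request was sent
    first_token: float  # seconds, when the first output token arrived
    end: float          # seconds, when the last output token arrived
    output_tokens: int
    ok: bool = True


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    ok = [t for t in timings if t.ok]
    if not ok:
        raise ValueError("no successful requests to summarize")
    # Wall-clock window covering every request, successful or not.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "ttft_p50_s": percentile([t.first_token - t.start for t in ok], 50),
        "latency_p95_s": percentile([t.end - t.start for t in ok], 95),
        "total_output_tok_per_s": sum(t.output_tokens for t in ok) / wall,
        "error_rate": 1 - len(ok) / len(timings),
    }
```

Memory use and startup time come from the host and the process, not from request timings, so they need separate collection.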

Prefer boring APIs

The runtime is an implementation detail your product should be able to change. Put a stable internal API in front of it. Normalize request fields, streaming events, errors, model names, timeout behavior, and token accounting.

This lets you test vLLM against TGI, move a local feature to llama.cpp, or fall back to a hosted API without rewriting product code. You still need to test behavior, but the blast radius is smaller.
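
As a rough illustration, the internal API can be as small as one request type, one stream event type, and one interface that each runtime adapter implements. Every name below (`GenerateRequest`, `StreamEvent`, `Backend`, `EchoBackend`, `complete`) is hypothetical. `EchoBackend` is a stand-in for tests, where a real adapter would translate these types to and from the runtime's own HTTP API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class GenerateRequest:
    model: str              # internal alias, not a runtime-specific model name
    prompt: str
    max_tokens: int = 256
    timeout_s: float = 30.0


@dataclass(frozen=True)
class StreamEvent:
    text: str
    done: bool = False
    output_tokens: int = 0  # token accounting, set on the final event


class Backend(Protocol):
    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]: ...


class EchoBackend:
    """Test stand-in: echoes prompt words back as if they were tokens."""

    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]:
        words = request.prompt.split()[: request.max_tokens]
        for word in words:
            yield StreamEvent(text=word + " ")
        yield StreamEvent(text="", done=True, output_tokens=len(words))


def complete(backend: Backend, request: GenerateRequest) -> tuple[str, int]:
    """Product code depends only on Backend, never on a specific runtime."""
    text, tokens = "", 0
    for event in backend.stream(request):
        text += event.text
        if event.done:
            tokens = event.output_tokens
    return text.strip(), tokens


request = GenerateRequest(model="assistant-small", prompt="hello runtime swap", max_tokens=2)
print(complete(EchoBackend(), request))  # prints ('hello runtime', 2)
```

Swapping runtimes then means writing another class with a `stream` method, while `complete` and everything above it stay unchanged.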

Engineering reality

The hidden cost of a runtime is the adapter code around it: auth, request shaping, retries, tracing, metrics, prompt templates, and deployment manifests. Include that glue in your comparison.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is vLLM usually optimized for?
When should you pick SGLang over vLLM?
What is TensorRT-LLM's default hardware advantage?
What makes TGI attractive for many Hugging Face model deployments?
Why is llama.cpp important for local and edge inference?
Why should runtime benchmarks use the same model and traffic shape?
