vLLM vs TGI vs llama.cpp
Open LLM serving runtimes optimize for different jobs. The right choice depends on whether you need high-throughput GPU serving, easy deployment of Hugging Face models, or small-footprint local inference.
Do not choose a runtime by popularity. Choose it by the workload path it makes easy: GPU batching, Hugging Face deployment, or local and edge inference.
The five mental models
vLLM is usually the first name people reach for when they want high-throughput serving for open LLMs on GPUs. Its core strength is PagedAttention, which manages KV cache memory in blocks so that many concurrent requests fit on the same GPU, combined with continuous batching.
SGLang is a newer GPU serving runtime built around RadixAttention for prefix caching across requests, strong structured-generation support, and competitive throughput on long shared prefixes. Pick it when many requests reuse the same system prompt, tool schemas, or RAG context blocks.
Text Generation Inference, usually called TGI, is Hugging Face's production inference server. Its strength is the model ecosystem, deployment ergonomics, streaming APIs, batching, and integration with the Hugging Face world.
TensorRT-LLM is NVIDIA's optimized inference stack. Its strength is kernel-level performance on NVIDIA hardware, especially H100 with FP8, and tight integration with TensorRT. Pick it when you need maximum throughput on a fixed NVIDIA fleet and can invest in the build pipeline.
llama.cpp is the local inference workhorse. Its strength is running quantized models on CPUs, Apple Silicon, consumer GPUs, and machines that are not a classic datacenter GPU fleet.
Compare the default fit
This table is intentionally simple. Real deployments add complications, but the default fit is a good starting point.
| Runtime | Good fit | Watch out for |
|---|---|---|
| vLLM | GPU serving with many concurrent requests, throughput pressure, and OpenAI-compatible APIs. | Runtime support can vary by model architecture, quantization path, and deployment target. |
| SGLang | Shared-prefix workloads (RAG, agents, tool schemas), structured output, radix-style prefix cache wins. | Younger ecosystem than vLLM; verify your model family and ops tooling before committing. |
| TGI | Teams using Hugging Face models who want a production server with familiar deployment hooks. | Make sure its batching, model support, and hardware path fit your exact workload. |
| TensorRT-LLM | NVIDIA datacenter GPUs, H100 FP8, latency-sensitive production with a build-and-benchmark culture. | Engine build step per model revision; less flexible for rapid model swaps than Python-first servers. |
| llama.cpp | Local apps, edge devices, CPU inference, Apple Silicon, and GGUF quantized models. | It is not the default answer for large multi-user GPU serving. |
Feature matrix
Check the features your product actually needs before picking a winner on a generic benchmark.
| Feature | vLLM | SGLang | TGI | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|---|
| Multi-user GPU batching | Strong | Strong | Strong | Strong | Weak |
| Prefix / prompt caching | Yes (PagedAttention + prefix cache) | Strong (RadixAttention) | Yes | Yes | Limited |
| Structured output | Good | Strong | Good | Good | Varies |
| FP8 on H100 | Yes | Yes | Limited | Strong | No |
| Local / edge | No | No | No | No | Strong |
| Ops flexibility | High | High | High | Medium (engine builds) | High for local |
SGLang vs vLLM
Both are strong GPU servers with continuous batching and OpenAI-style APIs. The fork is workload shape, not religion.
- Pick SGLang when a large fraction of input tokens are identical across requests: shared system prompts, repeated tool definitions, RAG chunks that many users hit, or agent harnesses with stable prefixes. RadixAttention is built for that reuse.
- Pick vLLM when prefixes are diverse, you need the broadest model and quantization coverage today, or your team already has vLLM deployment playbooks.
- Benchmark both on your real prompt distribution. Prefix-cache wins show up only when reuse is high enough to amortize cache management.
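Before running a full benchmark, you can estimate how much prefix reuse your traffic actually has from logged prompts. The sketch below is an illustration only, not how RadixAttention or vLLM's prefix cache is implemented. It assumes prompts are already tokenized (whitespace splitting stands in for a real tokenizer) and that the cache is unbounded and never evicts, so the result is an optimistic upper bound. The function name `prefix_reuse_ratio` is hypothetical.

```python
from typing import Iterable


def prefix_reuse_ratio(prompts: Iterable[list[str]]) -> float:
    """Fraction of prompt tokens an ideal, unbounded prefix cache could reuse.

    Prompts are processed in arrival order. A token counts as reused when an
    earlier prompt shared the same prefix up to and including that token.
    """
    root: dict = {}  # token-level trie of every prefix seen so far
    total = 0
    reused = 0
    for tokens in prompts:
        node = root
        for tok in tokens:
            total += 1
            if tok in node:
                reused += 1
            else:
                # First time this prefix is seen. The new node is empty, so
                # nothing after the divergence point can match in this prompt.
                node[tok] = {}
            node = node[tok]
    return reused / total if total else 0.0


if __name__ == "__main__":
    system = "You are a helpful assistant .".split()
    logged = [
        system + "What is vLLM ?".split(),
        system + "What is TGI ?".split(),
    ]
    print(f"reusable prompt tokens: {prefix_reuse_ratio(logged):.0%}")
```

A low ratio on real logs suggests the prefix cache will not be the deciding factor. A high ratio is a reason to benchmark both servers with prefix caching enabled.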
For structured generation (JSON, regex, grammar constraints), SGLang has been a common pick when constraint decoding is on the critical path. vLLM has been catching up fast; treat the matrix as a hypothesis to test, not a verdict.
Match runtime to bottleneck
If your bottleneck is multi-user GPU throughput, start with a runtime built around scheduling and KV cache efficiency. If your bottleneck is getting an open model deployed with fewer sharp edges, a server with strong model ecosystem support can save time. If your bottleneck is shipping a private local assistant, a small quantized model with llama.cpp may beat a server fleet you do not need.
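The mapping from bottleneck to starting point can be written down as a small lookup, which is handy as a checklist in a design doc or a planning script. This is only a restatement of the default fits in this lesson, not a benchmark result. The names `Bottleneck` and `first_candidates` are hypothetical.

```python
from enum import Enum


class Bottleneck(Enum):
    GPU_THROUGHPUT = "multi-user GPU throughput"
    SHARED_PREFIXES = "many requests reuse the same long prefix"
    DEPLOYMENT_EASE = "getting a Hugging Face model deployed"
    FIXED_NVIDIA_FLEET = "maximum throughput on a fixed NVIDIA fleet"
    LOCAL_OR_EDGE = "private local or edge inference"


# Starting points taken from this lesson's default-fit table.
# They are candidates to benchmark first, not final answers.
FIRST_CANDIDATES: dict[Bottleneck, list[str]] = {
    Bottleneck.GPU_THROUGHPUT: ["vLLM", "SGLang"],
    Bottleneck.SHARED_PREFIXES: ["SGLang", "vLLM"],
    Bottleneck.DEPLOYMENT_EASE: ["TGI"],
    Bottleneck.FIXED_NVIDIA_FLEET: ["TensorRT-LLM"],
    Bottleneck.LOCAL_OR_EDGE: ["llama.cpp"],
}


def first_candidates(bottleneck: Bottleneck) -> list[str]:
    """Return the runtimes to try first for a given bottleneck."""
    return list(FIRST_CANDIDATES[bottleneck])


if __name__ == "__main__":
    for b in Bottleneck:
        print(f"{b.value}: try {', '.join(first_candidates(b))}")
```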
There is no universal winner because the workloads are not the same. A voice agent, coding assistant, mobile summarizer, and offline desktop tool can all be "LLM serving" while needing very different tradeoffs.
Benchmark the same model
Comparisons get messy when each runtime is tested with a different model, quantization, prompt length, output length, batch size, or hardware target. Keep the model and traffic shape fixed before judging the runtime.
At minimum, record time to first token, output tokens per second, total tokens per second across all users, memory use, p95 latency, error rate, and startup time. Also record whether the runtime supports your needed features: streaming, structured output constraints, adapters, tensor parallelism, quantization, metrics, and health checks.
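Several of these numbers can be derived from per-request timestamps, which keeps the comparison identical across runtimes. The sketch below assumes a hypothetical load-test client that records a `RequestSample` for every request. The field names, the nearest-rank percentile, and the choice to compute latency over successful requests only are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One finished request, as recorded by a hypothetical load-test client."""
    start_s: float         # time the request was sent
    first_token_s: float   # time the first output token arrived
    end_s: float           # time the last output token arrived
    output_tokens: int
    ok: bool = True


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[RequestSample]) -> dict[str, float]:
    """Compute TTFT, p95 latency, throughput, and error rate for one run."""
    ok = [s for s in samples if s.ok]
    if not ok:
        raise ValueError("no successful requests to summarize")
    # Wall-clock window covering all successful requests.
    window_s = max(s.end_s for s in ok) - min(s.start_s for s in ok)
    total_tokens = sum(s.output_tokens for s in ok)
    # Per-request decode speed, skipping requests with no measurable decode time.
    decode_rates = [
        s.output_tokens / (s.end_s - s.first_token_s)
        for s in ok
        if s.end_s > s.first_token_s
    ]
    return {
        "ttft_p50_s": percentile([s.first_token_s - s.start_s for s in ok], 50),
        "latency_p95_s": percentile([s.end_s - s.start_s for s in ok], 95),
        "per_request_tok_s_p50": percentile(decode_rates, 50) if decode_rates else 0.0,
        "aggregate_tok_s": total_tokens / window_s if window_s > 0 else 0.0,
        "error_rate": 1 - len(ok) / len(samples),
    }
```

Run the same `summarize` over the samples from each runtime, with the model, quantization, prompt set, and concurrency held fixed, so that differences in the report come from the runtime and not from the measurement.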
Prefer boring APIs
The runtime is an implementation detail your product should be able to change. Put a stable internal API in front of it. Normalize request fields, streaming events, errors, model names, timeout behavior, and token accounting.
This lets you test vLLM against TGI, move a local feature to llama.cpp, or fall back to a hosted API without rewriting product code. You still need to test behavior, but the blast radius is smaller.
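A minimal sketch of such an internal API follows, assuming a streaming-first design. Every name here (`GenerateRequest`, `StreamEvent`, `Backend`, `EchoBackend`, `generate_text`) is hypothetical and none of it comes from vLLM, TGI, or llama.cpp. `EchoBackend` is a stand-in that needs no server, so the product-side code can be exercised in tests. A real adapter would translate the same request into the runtime's own HTTP or library call.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class GenerateRequest:
    model: str                    # internal alias, not a runtime-specific name
    prompt: str
    max_output_tokens: int = 256
    timeout_s: float = 30.0       # enforced by real adapters; unused by the stub


@dataclass(frozen=True)
class StreamEvent:
    text: str                     # newly generated text carried by this event
    done: bool = False
    output_tokens: int = 0        # running count, for token accounting


class Backend(Protocol):
    """What product code depends on; each runtime gets one adapter."""

    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]: ...


class EchoBackend:
    """Test stand-in that echoes the prompt back one word per event."""

    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]:
        words = request.prompt.split()[: request.max_output_tokens]
        for count, word in enumerate(words, start=1):
            yield StreamEvent(text=word + " ", output_tokens=count)
        yield StreamEvent(text="", done=True, output_tokens=len(words))


def generate_text(backend: Backend, request: GenerateRequest) -> tuple[str, int]:
    """Collect a stream into the final text and its output-token count."""
    parts: list[str] = []
    tokens = 0
    for event in backend.stream(request):
        parts.append(event.text)
        tokens = event.output_tokens
    return "".join(parts), tokens
```

Product code calls `generate_text(backend, request)` and never imports a runtime client directly, so swapping runtimes means writing one new adapter class.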
The hidden cost of a runtime is the adapter code around it: auth, request shaping, retries, tracing, metrics, prompt templates, and deployment manifests. Include that glue in your comparison.
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What is vLLM usually optimized for?
- When should you pick SGLang over vLLM?
- What is TensorRT-LLM's default hardware advantage?
- What makes TGI attractive for many Hugging Face model deployments?
- Why is llama.cpp important for local and edge inference?
- Why should runtime benchmarks use the same model and traffic shape?