vLLM vs TGI vs llama.cpp
Open LLM serving runtimes optimize for different jobs. The right choice depends on whether you need high-throughput GPU serving, smooth deployment of Hugging Face models, or lightweight local inference.
Do not choose a runtime by popularity. Choose it by the workload path it makes easy: GPU batching, Hugging Face deployment, or local and edge inference.
The three mental models
vLLM is usually the first name people reach for when they want high-throughput serving for open LLMs on GPUs. Its core strength is efficient KV cache memory management (PagedAttention) combined with continuous batching, which lets it serve many concurrent requests on the same hardware.
Text Generation Inference, usually called TGI, is Hugging Face's production inference server. Its strengths are the model ecosystem, deployment ergonomics, streaming APIs, batching, and integration with the rest of the Hugging Face tooling.
llama.cpp is the local inference workhorse. Its strength is running quantized models on CPUs, Apple Silicon, consumer GPUs, and other machines outside a classic datacenter GPU fleet.
Compare the default fit
This table is intentionally simple. Real deployments have more nuance, but the default fit gives you a good starting point.
| Runtime | Good fit | Watch out for |
|---|---|---|
| vLLM | GPU serving with many concurrent requests, throughput pressure, and OpenAI-compatible APIs. | Runtime support can vary by model architecture, quantization path, and deployment target. |
| TGI | Teams using Hugging Face models who want a production server with familiar deployment hooks. | Make sure its batching, model support, and hardware path fit your exact workload. |
| llama.cpp | Local apps, edge devices, CPU inference, Apple Silicon, and GGUF quantized models. | It is not the default answer for large multi-user GPU serving. |
Match runtime to bottleneck
If your bottleneck is multi-user GPU throughput, start with a runtime built around scheduling and KV cache efficiency. If your bottleneck is getting an open model deployed with fewer sharp edges, a server with strong model ecosystem support can save time. If your bottleneck is shipping a private local assistant, a small quantized model with llama.cpp may beat a server fleet you do not need.
There is no universal winner because the workloads are not the same. A voice agent, coding assistant, mobile summarizer, and offline desktop tool can all be "LLM serving" while needing very different tradeoffs.
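The bottleneck-first rule above can be written down as a small lookup. This is only a sketch of the default starting points from the table, not a ranking; the `Bottleneck` enum and `first_runtime_to_try` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Bottleneck(Enum):
    GPU_THROUGHPUT = "multi-user GPU throughput"
    DEPLOYMENT_ERGONOMICS = "getting an open model deployed"
    LOCAL_PRIVATE = "private local or edge inference"


# Default starting points taken from the comparison table.
# They say where to begin testing, not which runtime wins.
DEFAULT_START = {
    Bottleneck.GPU_THROUGHPUT: "vLLM",
    Bottleneck.DEPLOYMENT_ERGONOMICS: "TGI",
    Bottleneck.LOCAL_PRIVATE: "llama.cpp",
}


def first_runtime_to_try(bottleneck: Bottleneck) -> str:
    """Return the runtime to benchmark first for a given bottleneck."""
    return DEFAULT_START[bottleneck]
```

The value of the exercise is naming the bottleneck explicitly before any runtime is installed.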
Benchmark the same model
Comparisons get messy when each runtime is tested with a different model, quantization, prompt length, output length, batch size, or hardware target. Keep the model and traffic shape fixed before judging the runtime.
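One way to keep the model and traffic shape fixed is to pin them in a single immutable record and refuse to compare runs whose records differ. The sketch below assumes a hypothetical `WorkloadSpec` type; the field names and example values are illustrative and do not come from any runtime's configuration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Everything that must stay identical across runtimes under test."""
    model: str
    quantization: str
    prompt_tokens: int
    max_output_tokens: int
    batch_size: int
    hardware: str

    def fingerprint(self) -> str:
        # Stable short hash to store next to each benchmark result.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(a: WorkloadSpec, b: WorkloadSpec) -> bool:
    """Two runs are comparable only if every workload field matches."""
    return a == b


# Hypothetical example values.
spec = WorkloadSpec(
    model="example-8b-instruct",
    quantization="fp16",
    prompt_tokens=512,
    max_output_tokens=128,
    batch_size=16,
    hardware="1x-datacenter-gpu",
)
```

Storing the fingerprint with every result makes it easy to spot a comparison that quietly mixed two workloads.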
At minimum, record time to first token, output tokens per second, total tokens per second across all users, memory use, p95 latency, error rate, and startup time. Also record whether the runtime supports your needed features: streaming, structured output constraints, adapters, tensor parallelism, quantization, metrics, and health checks.
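Several of these metrics can be derived from three timestamps per request. The following is a minimal sketch, assuming you already log send time, first-token time, and completion time yourself; `RequestSample` and `summarize` are hypothetical names, the percentile uses the nearest-rank method, and the aggregate figure counts output tokens only.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    start_s: float        # when the request was sent
    first_token_s: float  # when the first token arrived
    end_s: float          # when the last token arrived
    output_tokens: int
    ok: bool = True


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    if not samples:
        raise ValueError("no samples recorded")
    ok = [s for s in samples if s.ok]
    if not ok:
        raise ValueError("every request failed")
    # Wall-clock span of the whole run, used for aggregate throughput.
    wall_s = max(s.end_s for s in samples) - min(s.start_s for s in samples)
    # Per-request decode speed, skipping requests with no decode time.
    decode_rates = [
        s.output_tokens / (s.end_s - s.first_token_s)
        for s in ok
        if s.end_s > s.first_token_s
    ]
    return {
        "ttft_p50_s": percentile([s.first_token_s - s.start_s for s in ok], 50),
        "latency_p95_s": percentile([s.end_s - s.start_s for s in ok], 95),
        "mean_output_tokens_per_s": (
            sum(decode_rates) / len(decode_rates) if decode_rates else 0.0
        ),
        "aggregate_output_tokens_per_s": (
            sum(s.output_tokens for s in ok) / wall_s if wall_s > 0 else 0.0
        ),
        "error_rate": 1 - len(ok) / len(samples),
    }
```

Memory use and startup time need separate instrumentation, and the feature checklist is better kept as a plain yes/no table per runtime.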
Prefer boring APIs
The runtime is an implementation detail your product should be able to change. Put a stable internal API in front of it. Normalize request fields, streaming events, errors, model names, timeout behavior, and token accounting.
This lets you test vLLM against TGI, move a local feature to llama.cpp, or fall back to a hosted API without rewriting product code. You still need to test behavior, but the blast radius is smaller.
The hidden cost of a runtime is the adapter code around it: auth, request shaping, retries, tracing, metrics, prompt templates, and deployment manifests. Include that glue in your comparison.
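A stable internal API can be as small as one request type, one streaming event type, and one interface that every backend adapter implements. The sketch below is one possible shape, not a prescribed design: `GenerateRequest`, `StreamEvent`, `RuntimeAdapter`, and `EchoAdapter` are all hypothetical names, and the echo backend stands in for real adapter code that would call vLLM, TGI, or llama.cpp.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class GenerateRequest:
    """Normalized request fields that product code is allowed to use."""
    model: str
    prompt: str
    max_output_tokens: int = 256
    timeout_s: float = 30.0


@dataclass(frozen=True)
class StreamEvent:
    """Normalized streaming event with running token accounting."""
    text: str
    done: bool = False
    output_tokens: int = 0


class RuntimeAdapter(Protocol):
    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]: ...


class EchoAdapter:
    """Test stand-in that echoes the prompt one word at a time."""

    def stream(self, request: GenerateRequest) -> Iterator[StreamEvent]:
        words = request.prompt.split()[: request.max_output_tokens]
        for count, word in enumerate(words, start=1):
            yield StreamEvent(text=word + " ", output_tokens=count)
        yield StreamEvent(text="", done=True, output_tokens=len(words))


def generate_text(adapter: RuntimeAdapter, request: GenerateRequest) -> tuple[str, int]:
    """Product-facing helper: collect a stream into text and a token count."""
    parts = []
    tokens = 0
    for event in adapter.stream(request):
        parts.append(event.text)
        tokens = event.output_tokens
    return "".join(parts).strip(), tokens
```

Product code depends only on `generate_text` and the two dataclasses, so swapping the backend means writing a new adapter class rather than touching callers. Error mapping, retries, and tracing would live inside each adapter, which is the glue cost the comparison should account for.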
Checkpoint
You're ready for the next lesson if you can answer these from memory:
- What is vLLM usually optimized for?
- What makes TGI attractive for many Hugging Face model deployments?
- Why is llama.cpp important for local and edge inference?
- Why should runtime benchmarks use the same model and traffic shape?