Lesson 02

vLLM vs TGI vs llama.cpp

Open LLM serving runtimes optimize for different jobs. The right choice depends on whether you need high-throughput GPU serving, smooth deployment of Hugging Face models, or small local inference.

The one idea

Do not choose a runtime by popularity. Choose it by the workload path it makes easy: GPU batching, Hugging Face deployment, or local and edge inference.

The three mental models

vLLM is usually the first name people reach for when they want high-throughput serving for open LLMs on GPUs. Its core strengths are efficient KV cache memory management (the PagedAttention design) and continuous batching, which together let it serve many concurrent requests on the same hardware.

Text Generation Inference, usually called TGI, is Hugging Face's production inference server. Its strengths are the model ecosystem, deployment ergonomics, streaming APIs, batching, and integration with the rest of the Hugging Face tooling.

llama.cpp is the local inference workhorse. Its strength is running quantized models on CPUs, Apple Silicon, consumer GPUs, and machines that are not a classic datacenter GPU fleet.
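To see why KV cache management matters so much for GPU serving, it helps to do the arithmetic. The sketch below uses the standard estimate of two tensors (keys and values) per layer per token. The model shape is a hypothetical example, not taken from any specific model card, and real runtimes add their own overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The factor of 2 covers keys and values. bytes_per_value=2 assumes
    16-bit storage (fp16 or bf16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(num_tokens=1, **SHAPE)

# 32 concurrent sequences, each holding 4096 tokens of context.
fleet = kv_cache_bytes(num_tokens=32 * 4096, **SHAPE)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"32 x 4096 tokens: {fleet / 1024**3:.0f} GiB")
```

Under these assumptions the cache alone is on the order of the model weights, which is why a runtime that wastes less cache memory can fit more concurrent requests on the same GPU.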

Compare the default fit

This table is intentionally simple. Real deployments involve more detail than three rows can hold, but the default fit gives you a good starting point.

Runtime	Good fit	Watch out for
vLLM	GPU serving with many concurrent requests, throughput pressure, and OpenAI-compatible APIs.	Runtime support can vary by model architecture, quantization path, and deployment target.
TGI	Teams using Hugging Face models who want a production server with familiar deployment hooks.	Make sure its batching, model support, and hardware path fit your exact workload.
llama.cpp	Local apps, edge devices, CPU inference, Apple Silicon, and GGUF quantized models.	It is not the default answer for large multi-user GPU serving.

Match runtime to bottleneck

If your bottleneck is multi-user GPU throughput, start with a runtime built around scheduling and KV cache efficiency. If your bottleneck is getting an open model deployed with fewer sharp edges, a server with strong model ecosystem support can save time. If your bottleneck is shipping a private local assistant, a small quantized model with llama.cpp may beat a server fleet you do not need.

There is no universal winner because the workloads are not the same. A voice agent, coding assistant, mobile summarizer, and offline desktop tool can all be "LLM serving" while needing very different tradeoffs.

Use the runtime that matches the shape of the deployment, not the one that won a benchmark on another workload.
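The matching rule in this section can be written down as a small lookup, which is sometimes useful as a shared starting point in design reviews. This is only a sketch: the `Bottleneck` names and the `starting_runtime` helper are hypothetical, and the mapping is a first guess to benchmark, not a verdict.

```python
from enum import Enum


class Bottleneck(Enum):
    GPU_THROUGHPUT = "multi-user GPU throughput"
    DEPLOYMENT_ERGONOMICS = "getting an open model deployed with fewer sharp edges"
    LOCAL_PRIVATE = "shipping a private local assistant"


# Default starting points from the comparison table, not final answers.
STARTING_POINT = {
    Bottleneck.GPU_THROUGHPUT: "vLLM",
    Bottleneck.DEPLOYMENT_ERGONOMICS: "TGI",
    Bottleneck.LOCAL_PRIVATE: "llama.cpp",
}


def starting_runtime(bottleneck: Bottleneck) -> str:
    """Return the runtime to evaluate first for a given bottleneck."""
    return STARTING_POINT[bottleneck]


for b in Bottleneck:
    print(f"{b.value}: start with {starting_runtime(b)}")
```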

Benchmark the same model

Comparisons get messy when each runtime is tested with a different model, quantization, prompt length, output length, batch size, or hardware target. Keep the model and traffic shape fixed before judging the runtime.

At minimum, record time to first token, output tokens per second, total tokens per second across all users, memory use, p95 latency, error rate, and startup time. Also record whether the runtime supports your needed features: streaming, structured output constraints, adapters, tensor parallelism, quantization, metrics, and health checks.
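Several of these metrics fall out of three timestamps per request. The sketch below assumes your load generator records when each request was sent, when its first token arrived, and when it finished. The `RequestTiming` record and `summarize` helper are hypothetical names, and the percentile uses the simple nearest-rank method. Memory use and startup time have to be measured separately.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float        # seconds: request sent
    first_token: float  # seconds: first output token received
    end: float          # seconds: last output token received (or failure)
    output_tokens: int
    ok: bool = True


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings):
    """Summarize one benchmark run for one runtime."""
    ok = [t for t in timings if t.ok]
    if not ok:
        raise ValueError("no successful requests to summarize")
    # Wall-clock span of the whole run, including failed requests.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "ttft_p50_s": percentile([t.first_token - t.start for t in ok], 50),
        "latency_p95_s": percentile([t.end - t.start for t in ok], 95),
        # Throughput across all users: successful output tokens over wall time.
        "aggregate_tokens_per_s": sum(t.output_tokens for t in ok) / wall,
        "error_rate": 1 - len(ok) / len(timings),
    }


run = [
    RequestTiming(start=0.0, first_token=0.5, end=2.0, output_tokens=100),
    RequestTiming(start=0.0, first_token=1.0, end=4.0, output_tokens=100),
    RequestTiming(start=1.0, first_token=1.5, end=3.0, output_tokens=0, ok=False),
]
print(summarize(run))
```

Run the same `summarize` over each runtime's results with the model, prompts, and concurrency held fixed, so the numbers differ only because the runtime does.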

Prefer boring APIs

The runtime is an implementation detail your product should be able to change. Put a stable internal API in front of it. Normalize request fields, streaming events, errors, model names, timeout behavior, and token accounting.

This lets you test vLLM against TGI, move a local feature to llama.cpp, or fall back to a hosted API without rewriting product code. You still need to test behavior, but the blast radius is smaller.
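One way to picture the stable internal API is a single request type that product code uses, plus one small translation function per backend. The sketch below only builds request bodies and makes no network calls. `GenerateRequest` and `build_payload` are hypothetical names, and the backend field names (`max_tokens` for an OpenAI-style completions endpoint, `inputs` and `max_new_tokens` for TGI's generate endpoint, `n_predict` for the llama.cpp server) reflect my understanding of those APIs, so verify them against the version you deploy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class GenerateRequest:
    """The only request shape product code should know about."""
    model: str
    prompt: str
    max_output_tokens: int = 256
    temperature: float = 0.7


def to_openai_style(req: GenerateRequest) -> Dict[str, Any]:
    # Assumed shape for an OpenAI-compatible completions endpoint (e.g. vLLM).
    return {
        "model": req.model,
        "prompt": req.prompt,
        "max_tokens": req.max_output_tokens,
        "temperature": req.temperature,
    }


def to_tgi_style(req: GenerateRequest) -> Dict[str, Any]:
    # Assumed shape for TGI's generate endpoint.
    return {
        "inputs": req.prompt,
        "parameters": {
            "max_new_tokens": req.max_output_tokens,
            "temperature": req.temperature,
        },
    }


def to_llamacpp_style(req: GenerateRequest) -> Dict[str, Any]:
    # Assumed shape for the llama.cpp server's completion endpoint.
    return {
        "prompt": req.prompt,
        "n_predict": req.max_output_tokens,
        "temperature": req.temperature,
    }


PAYLOAD_BUILDERS: Dict[str, Callable[[GenerateRequest], Dict[str, Any]]] = {
    "vllm": to_openai_style,
    "tgi": to_tgi_style,
    "llama.cpp": to_llamacpp_style,
}


def build_payload(backend: str, req: GenerateRequest) -> Dict[str, Any]:
    """Translate the internal request into a backend-specific body."""
    try:
        builder = PAYLOAD_BUILDERS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
    return builder(req)


req = GenerateRequest(model="my-model", prompt="Hello", max_output_tokens=32)
for name in PAYLOAD_BUILDERS:
    print(name, build_payload(name, req))
```

The same pattern extends to the response side: map each backend's streaming events, errors, and token counts back into one internal shape so that swapping runtimes touches only the adapter.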

Engineering reality

The hidden cost of a runtime is the adapter code around it: auth, request shaping, retries, tracing, metrics, prompt templates, and deployment manifests. Include that glue in your comparison.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What is vLLM usually optimized for?
What makes TGI attractive for many Hugging Face model deployments?
Why is llama.cpp important for local and edge inference?
Why should runtime benchmarks use the same model and traffic shape?
