Track 4 · Inference & serving
Latency & Throughput
A single GPU can serve one user quickly or many users cheaply, but not both for free. This course covers the techniques that bend that tradeoff: batching, continuous batching, speculative decoding, streaming, and the scheduler that ties them together.
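To make the tradeoff concrete before the techniques arrive, here is a minimal sketch of a toy cost model for one decode step. Everything in it is an assumption made for illustration: the function names `decode_step_ms` and `tradeoff`, the linear cost shape, and the numbers `fixed_ms` and `per_seq_ms` are hypothetical, not measurements from any real GPU or serving stack.

```python
# Toy model (all numbers hypothetical): one decode step costs a fixed overhead
# plus a small extra cost for every sequence in the batch.

def decode_step_ms(batch_size: int, fixed_ms: float = 20.0, per_seq_ms: float = 0.5) -> float:
    """Assumed time in milliseconds for one decode step over the whole batch."""
    return fixed_ms + per_seq_ms * batch_size


def tradeoff(batch_size: int) -> tuple[float, float]:
    """Return (per-token latency in ms, aggregate throughput in tokens/s)."""
    step_ms = decode_step_ms(batch_size)
    # Each user receives one token per step, so the step time is their per-token latency.
    latency_ms = step_ms
    # The batch as a whole produces batch_size tokens per step.
    tokens_per_s = batch_size / step_ms * 1000.0
    return latency_ms, tokens_per_s


if __name__ == "__main__":
    for b in (1, 8, 64):
        latency_ms, tokens_per_s = tradeoff(b)
        print(f"batch={b:3d}  latency={latency_ms:6.1f} ms/token  throughput={tokens_per_s:8.1f} tok/s")
```

Under these assumed costs, larger batches raise aggregate throughput while every individual user waits longer per token, which is the tension the rest of the course works on.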