Track 4 · Inference & serving

Latency & Throughput

A single GPU can serve one user quickly or many users cheaply, but not both for free. This course covers the techniques that bend that tradeoff: batching, continuous batching, speculative decoding, streaming, and the scheduler that ties them together.

6 lessons · Intermediate · After How Inference Works