Latency & Throughput

Author: Rahul Kashyap

A single GPU can serve one user quickly or many users cheaply, but not both for free. This course covers the techniques that bend that tradeoff: batching, continuous batching, speculative decoding, streaming, and the scheduler that ties them together.

6 lessons · Intermediate · After How Inference Works