Track 4 · Inference & serving
Serving & Economics
Serving is where model choice becomes a bill, a latency graph, and an incident plan. This course teaches how to choose a runtime, price the workload, compare API and self-hosting, and keep the system healthy after launch.
01 · Choose an LLM serving stack
The layers of a serving system: model artifact, runtime, scheduler, hardware, API, observability, and release policy.
02 · vLLM vs TGI vs llama.cpp
How three common runtimes differ, and which workload each one is usually built for.
03 · GPU cost per token
How to turn hourly hardware cost, utilization, throughput, and token mix into a usable serving cost model.
04 · API vs self-hosting break-even
When hosted APIs are cheaper, when self-hosting can win, and why volume alone is not enough.
05 · Edge and on-device LLM serving
What changes when inference runs on phones, laptops, browsers, or regional edge nodes.
06 · Operate LLM serving in production
Capacity planning, autoscaling, rollout safety, budget alerts, incident response, and the dashboards that matter.