Track 4 · Inference & serving

Serving & Economics

Serving is where model choice becomes a bill, a latency graph, and an incident plan. This course teaches how to choose a runtime, price the workload, compare API and self-hosting, and keep the system healthy after launch.

6 lessons · Intermediate · After How Inference Works

Choose an LLM serving stack

The layers of a serving system: model artifact, runtime, scheduler, hardware, API, observability, and release policy.

vLLM vs TGI vs llama.cpp

How three common runtimes differ, and which workload each one is usually built for.

GPU cost per token

How to turn hourly hardware cost, utilization, throughput, and token mix into a usable serving cost model.
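As a rough illustration of the kind of model this lesson builds, the sketch below turns an hourly GPU price, a utilization fraction, and measured throughput into a cost per million tokens and a cost per request. The function names, the parameters, and the split into separate prefill and decode throughput are assumptions made for this example, not the course's own formulas. It also ignores batching effects, idle replicas, and non-GPU overhead.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """USD per one million tokens for a single GPU (hypothetical helper).

    utilization is the fraction of each paid hour spent serving traffic (0-1].
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    # Tokens actually served per paid hour, after discounting idle time.
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / effective_tokens_per_hour * 1_000_000


def cost_per_request(gpu_hourly_usd: float,
                     utilization: float,
                     input_tokens: int,
                     output_tokens: int,
                     prefill_tps: float,
                     decode_tps: float) -> float:
    """USD for one request, given its token mix (hypothetical helper).

    Assumes input tokens are processed at prefill_tps and output tokens are
    generated at decode_tps, both measured as aggregate tokens per second.
    """
    if prefill_tps <= 0 or decode_tps <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughputs must be positive and utilization in (0, 1]")
    # GPU time consumed by the request, split by phase.
    gpu_seconds = input_tokens / prefill_tps + output_tokens / decode_tps
    # Idle time is paid for too, so divide by utilization.
    return gpu_hourly_usd / 3600 * gpu_seconds / utilization


if __name__ == "__main__":
    print(round(cost_per_million_tokens(2.0, 1000, 0.5), 2))
    print(cost_per_request(3.6, 0.5, 1000, 200, 10_000, 1000))
```

With these illustrative inputs, a GPU at 2.00 USD per hour serving 1,000 tokens per second at 50% utilization comes out at about 1.11 USD per million tokens. Halving utilization doubles the cost, which is why utilization appears in the model alongside raw throughput.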

API vs self-hosting break-even

When hosted APIs are cheaper, when self-hosting can win, and why volume alone is not enough.
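A minimal sketch of the break-even arithmetic follows, assuming a simple model in which self-hosting has a fixed monthly cost (reserved hardware plus engineering and operations time) and a per-token marginal cost, while the API charges a flat per-token price. The function name and all parameters are hypothetical. Real comparisons also need to account for utilization, quality differences, and traffic shape.

```python
from typing import Optional


def break_even_tokens_per_month(api_usd_per_million: float,
                                selfhost_fixed_monthly_usd: float,
                                selfhost_usd_per_million: float) -> Optional[float]:
    """Monthly token volume at which self-hosting matches the API bill.

    Returns None when self-hosting never breaks even, i.e. its marginal
    cost per token is not below the API price.
    """
    # Savings per token once the fixed costs are already paid.
    saving_per_token = (api_usd_per_million - selfhost_usd_per_million) / 1_000_000
    if saving_per_token <= 0:
        return None
    return selfhost_fixed_monthly_usd / saving_per_token


if __name__ == "__main__":
    print(break_even_tokens_per_month(10.0, 5000.0, 2.0))
    print(break_even_tokens_per_month(1.0, 5000.0, 2.0))
```

With an API price of 10 USD per million tokens, 5,000 USD of fixed monthly cost, and a self-hosted marginal cost of 2 USD per million tokens, break-even lands at roughly 625 million tokens per month. The second call returns None: if the self-hosted marginal cost is at or above the API price, no amount of volume closes the gap.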

Edge and on-device LLM serving

What changes when inference runs on phones, laptops, browsers, or regional edge nodes.

Operate LLM serving in production

Capacity planning, autoscaling, rollout safety, budget alerts, incident response, and the dashboards that matter.