Lesson 05

Edge and on-device LLM serving

Moving inference closer to the user can improve privacy and responsiveness, but it also changes model size, update strategy, quality expectations, and the kinds of tasks the model should own.

The one idea

Edge and on-device serving work best when the task is narrow, latency or privacy is valuable, and the product can tolerate smaller-model behavior.

There are two different ideas

Edge serving usually means inference runs in regional infrastructure closer to the user. You still operate servers, but network distance is lower and data may stay inside a region.

On-device serving means inference runs on the user's phone, laptop, browser, or embedded device. The product gains local privacy and offline behavior, but the hardware is less predictable and updates are harder to control.

Do not blur them together. The operational models are different: one is still a server fleet you control, the other is hardware you do not.

Why move inference closer?

There are four strong reasons: lower network latency, better privacy boundaries, offline support, and lower central infrastructure load. These are product reasons, not just infrastructure preferences.

For example, an editor autocomplete feature may feel better if it can produce small suggestions locally. A medical note feature may need data to stay on a device or inside a region. A field app may need to work with weak connectivity. A consumer app may use a local small model for cheap drafts and call a larger model only when needed.

The model has to fit

On-device inference usually means smaller models, quantized weights, lower memory use, and careful output limits. A model that works on a datacenter GPU may be unrealistic on a phone. The task should be designed around that constraint.

Good local tasks are narrow: autocomplete, classification, short rewrites, entity extraction, command parsing, local search help, draft summaries, and privacy-sensitive pre-processing. Broad reasoning tasks often still need a larger remote model.

Local inference is strongest when paired with routing, not asked to solve every task.

Model size vs device RAM

Quantized weights plus runtime overhead must fit in available memory. Leave headroom for the OS, the app, and the KV cache, which grows with context length if you run multi-turn on device. The figures below are ballpark GGUF file sizes; KV cache adds more at runtime.

Model	Q4_K_M (approx)	Q8 (approx)	Typical device fit
3B	≈ 2.0 GB	≈ 3.5 GB	Modern phones (8 GB+ RAM), laptops
7–8B	≈ 4.5 GB	≈ 8 GB	Laptops (16 GB RAM), high-end phones with care
13–14B	≈ 8 GB	≈ 14 GB	16 GB+ laptops; tight on mobile
32B	≈ 18 GB	≈ 34 GB	32 GB+ MacBook / workstation; not phone-class
70B	≈ 40 GB	≈ 70 GB	Unified-memory Mac Studio / dual-GPU desktop

If the model does not fit, drop quant width (Q4 → Q3), shorten context, or route hard tasks to a remote model. The Quantization lesson covers quality tradeoffs by bit width.
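The fit check above is mostly arithmetic, so a minimal sketch may help. The bits-per-weight figures are rough assumptions for GGUF quant types, the 0.5 GB runtime overhead is a placeholder, and pick_config is a hypothetical helper, not part of any library. The KV cache term assumes a standard transformer that stores one key and one value vector per layer, KV head, and token.

```python
from typing import Optional, Tuple

GB = 1e9  # decimal gigabytes, matching the table above

# Rough effective bits per weight (assumed ballpark figures; real files
# vary by architecture and tensor mix). Ordered widest quant first.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9}


def weight_bytes(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values for every layer, KV head, and token (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def pick_config(n_params: float, budget_bytes: float, *, n_layers: int,
                n_kv_heads: int, head_dim: int,
                contexts: Tuple[int, ...] = (8192, 4096, 2048),
                overhead_bytes: float = 0.5 * GB) -> Optional[Tuple[str, int]]:
    """Return the first (quant, context) that fits, or None to route remotely."""
    for quant in BITS_PER_WEIGHT:      # try wider quants first
        for ctx in contexts:           # then shorten context
            total = (weight_bytes(n_params, quant)
                     + kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx)
                     + overhead_bytes)
            if total <= budget_bytes:
                return quant, ctx
    return None


# Assumed shape for an 8B-class model with grouped-query attention.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
print(pick_config(8e9, 6 * GB, **shape))   # ('Q4_K_M', 4096)
print(pick_config(8e9, 2 * GB, **shape))   # None: route to a remote model
```

With these assumed numbers, an 8B model in a 6 GB budget falls back from Q8 to Q4_K_M and from an 8K to a 4K context before it fits. Measure real memory use on the device before trusting an estimate like this.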

Apple MLX and mobile NPUs

On Apple Silicon, MLX is a common path beside llama.cpp for local inference. Because the CPU and GPU share unified memory, a 32 GB Mac can load larger quantized models than a PC with 32 GB of system RAM whose discrete GPU has much less VRAM. MLX fits when you are shipping a macOS/iOS product and want tight Metal integration.

On phones and tablets, vendor NPUs (Apple Neural Engine, Qualcomm Hexagon, etc.) can run small int8/int4 models for classification and short generation, but memory and thermal budgets cap context length and output size. Treat NPUs as accelerators for narrow tasks, not as a drop-in replacement for a datacenter GPU on 70B models.

Test on the oldest device you still support. A model that flies on a developer laptop can be unusable on a three-year-old phone after OS overhead and background apps take their share.
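One way to make that test repeatable is a small timing harness that runs on each supported device. This is a sketch under assumptions: generate is a hypothetical callable that streams tokens for a prompt, so you would wrap whatever streaming interface your runtime actually exposes.

```python
import time
from typing import Callable, Dict, Iterable


def benchmark(generate: Callable[[str], Iterable[str]], prompt: str) -> Dict[str, float]:
    """Time a streaming generation: time to first token and decode speed."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()

    if first_token_at is None:  # the model produced nothing
        return {"tokens": 0, "ttft_s": end - start, "decode_tok_per_s": 0.0}

    decode_time = end - first_token_at
    # Decode speed covers the tokens after the first one.
    speed = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else 0.0
    return {"tokens": n_tokens, "ttft_s": first_token_at - start,
            "decode_tok_per_s": speed}


def fake_generate(prompt: str) -> Iterable[str]:
    """Stand-in for a real on-device runtime; yields a fixed token stream."""
    for token in ["local", " models", " need", " real", " devices"]:
        yield token


result = benchmark(fake_generate, "hello")
print(result["tokens"])
```

Run it after the device has been in use for a while, not only on a cold start, since thermal throttling and background apps change the numbers.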

Updates are a product problem

Central servers can roll back quickly. Devices are messier. Users may be offline, storage is limited, downloads cost bandwidth, and old versions can linger. A local model release needs versioning, compatibility checks, progressive rollout, and a fallback plan.
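A compatibility check plus a progressive rollout gate can be sketched in a few lines. Everything here is illustrative: the ModelRelease and Device fields, the thresholds, and the hash-bucket scheme are assumptions, not a specific vendor's update mechanism.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelRelease:
    version: str
    size_bytes: int
    min_ram_bytes: int
    min_runtime_version: Tuple[int, int]
    rollout_percent: int  # 0 to 100


@dataclass(frozen=True)
class Device:
    device_id: str
    ram_bytes: int
    free_storage_bytes: int
    runtime_version: Tuple[int, int]


def rollout_bucket(device_id: str, version: str) -> int:
    """Stable bucket in 0..99, so a device keeps its place as the rollout widens."""
    digest = hashlib.sha256(f"{version}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_install(device: Device, release: ModelRelease) -> bool:
    """Install only on compatible devices that fall inside the rollout window."""
    compatible = (
        device.ram_bytes >= release.min_ram_bytes
        and device.free_storage_bytes >= release.size_bytes
        and device.runtime_version >= release.min_runtime_version
    )
    return compatible and rollout_bucket(device.device_id, release.version) < release.rollout_percent


phone = Device("device-123", ram_bytes=8 * 10**9,
               free_storage_bytes=5 * 10**9, runtime_version=(2, 1))
release = ModelRelease("2024.06", size_bytes=2 * 10**9, min_ram_bytes=6 * 10**9,
                       min_runtime_version=(2, 0), rollout_percent=100)
print(should_install(phone, release))  # True: compatible and rollout is at 100%
```

Devices that fail the check keep their current model, which is the fallback plan in its simplest form. Hashing on the version as well as the device ID means a different set of devices goes first on each release.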

Also decide what telemetry is allowed. The whole point may be privacy, so do not recreate the privacy problem with logs. Aggregate counters and local eval events may be enough.
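Aggregate counters can be enforced in code by making it impossible to record free text. The sketch below assumes a fixed allowlist of event names. The names and the AggregateTelemetry class are hypothetical.

```python
from collections import Counter
from typing import Dict

# Only these event names can be counted; prompts and outputs never enter telemetry.
ALLOWED_EVENTS = frozenset({
    "suggestion_shown",
    "suggestion_accepted",
    "fallback_to_remote",
    "local_eval_passed",
    "local_eval_failed",
})


class AggregateTelemetry:
    """Counts allowlisted events on device and uploads totals only."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, event: str) -> None:
        if event not in ALLOWED_EVENTS:
            raise ValueError(f"event not allowlisted: {event!r}")
        self._counts[event] += 1

    def flush(self) -> Dict[str, int]:
        """Return the totals for upload and reset the local counters."""
        snapshot = dict(self._counts)
        self._counts.clear()
        return snapshot


telemetry = AggregateTelemetry()
telemetry.record("suggestion_shown")
telemetry.record("suggestion_shown")
telemetry.record("suggestion_accepted")
print(telemetry.flush())  # {'suggestion_shown': 2, 'suggestion_accepted': 1}
```

The allowlist is the design choice that matters: a new signal requires a code change and a review, rather than a log line that quietly captures user text.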

Design around trust

On-device models can be cheaper and more private, but users still experience the output. Decide at the product level what the local model is allowed to do. Let it suggest, draft, classify, or pre-fill. Be careful when it makes final decisions, touches money, applies policy, or handles high-risk advice.

A common pattern is local-first, remote-when-needed. The local model handles common fast work. The remote model handles hard tasks, account-level actions, or cases where quality matters more than offline behavior.
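The local-first, remote-when-needed pattern can be sketched as a small routing function. The task names, the 1024-token limit, and the rule that high-risk tasks never run locally are illustrative assumptions. A real product would set these from its own evals and policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Target(Enum):
    LOCAL = "local"
    REMOTE = "remote"


# Hypothetical task labels for illustration.
LOCAL_TASKS = frozenset({"autocomplete", "classify", "short_rewrite",
                         "extract_entities", "parse_command"})
HIGH_RISK_TASKS = frozenset({"payment", "policy_decision", "medical_advice"})


@dataclass(frozen=True)
class Request:
    task: str
    prompt_tokens: int
    online: bool


def route(req: Request, max_local_prompt_tokens: int = 1024) -> Optional[Target]:
    """Pick where a request runs; None means decline rather than guess locally."""
    if req.task in HIGH_RISK_TASKS:
        # High-risk work never runs on the small local model.
        return Target.REMOTE if req.online else None
    if req.task in LOCAL_TASKS and req.prompt_tokens <= max_local_prompt_tokens:
        return Target.LOCAL
    # Anything broad or long needs the larger remote model.
    return Target.REMOTE if req.online else None


print(route(Request("autocomplete", 50, online=False)))   # Target.LOCAL
print(route(Request("payment", 50, online=True)))         # Target.REMOTE
print(route(Request("payment", 50, online=False)))        # None
```

Returning None when a task cannot be served safely gives the product a clear hook for a message such as "try again when online", instead of silently handing a hard task to a model that was not built for it.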

Engineering reality

Edge work moves cost from your cloud bill into product complexity: model downloads, device compatibility, battery impact, version drift, and limited observability. That can still be a great trade, but it is not free.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

How is regional edge serving different from on-device serving?
What product reasons justify local inference?
How much RAM does a 7B Q4 model roughly need?
When is MLX a better fit than llama.cpp on Mac?
What kinds of tasks fit small on-device models best?
Why are updates and telemetry harder on device?
