Quantization: int8, int4, and GGUF

Quantization is one of the most common ways to make a model fit on cheaper hardware. The idea is simple: use fewer bits per number. The hard part is knowing what that does to quality and speed.

The one idea

Quantization stores model weights, and sometimes activations, with lower precision. This cuts memory and bandwidth, but it also rounds away information the model may need.

Why lower precision helps

A model is mostly numbers. In many LLM serving setups, the weights are stored in 16-bit floating point. A 7B model at 16 bits (2 bytes per weight) needs roughly 14 GB just for weights, before runtime overhead. Move the weights to 8 bits and the weight memory roughly halves. Move them to 4 bits and it roughly quarters.
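The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate only: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and real quantized formats also store scales and metadata, so actual files come out somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at three bit widths.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```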

That memory cut matters twice. First, the model may fit on a smaller GPU, a laptop, or a phone. Second, inference often waits on memory bandwidth. If the hardware has to move less data per token, it may generate faster.
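A back-of-the-envelope ceiling makes the bandwidth argument concrete. The sketch below assumes the simplest case: batch size 1, and every generated token requires reading all weights from memory once. The function name and the 500 GB/s bandwidth figure are hypothetical, and the result is an upper bound, not a prediction.

```python
def max_decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if each token must read all weights once (batch size 1)."""
    return bandwidth_bytes_per_s / weight_bytes

BANDWIDTH = 500e9  # hypothetical device with 500 GB/s memory bandwidth

# Weight sizes for a 7B model at 16, 8, and 4 bits.
for label, gigabytes in (("fp16", 14.0), ("int8", 7.0), ("int4", 3.5)):
    ceiling = max_decode_tokens_per_s(gigabytes * 1e9, BANDWIDTH)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Halving the bytes doubles the ceiling. Whether a real runtime gets anywhere near that ceiling is a separate question.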

The catch is that real speed depends on kernels, hardware support, batch size, context length, and whether the runtime can compute efficiently in that format. A tiny file is not automatically a fast server.

Bit width is a memory knob first. Speed and quality need separate measurement.

Post-training quantization

Post-training quantization takes an already-trained model and converts weights to a lower precision format. It is popular because it is fast and does not require a full retraining run. You start with a checkpoint, run a quantization tool, then test the result.

Simple quantization maps a range of floating-point values onto a smaller set of integers. For example, many nearby weight values may collapse into the same int4 bucket. Good methods pick scales per tensor, per channel, per group, or per block, so each scale fits the local range of values and a few large weights do not force coarse rounding on everything else.
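To see why the granularity of the scale matters, here is a minimal sketch of symmetric int4 quantization with one scale for the whole tensor versus one scale per group. The weights are made-up numbers and the function names are illustrative. Real tools use more refined schemes.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Turn the stored integers back into approximate floats."""
    return [x * scale for x in q]

def quantize_grouped(values, group_size, bits=4):
    """Quantize each run of `group_size` values with its own scale, then dequantize."""
    out = []
    for i in range(0, len(values), group_size):
        q, scale = quantize_symmetric(values[i:i + group_size], bits)
        out.extend(dequantize(q, scale))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy weights: four small values, then four much larger ones.
weights = [0.01, -0.02, 0.03, -0.015, 2.0, -1.5, 1.0, 0.5]

q_tensor, scale_tensor = quantize_symmetric(weights)
per_tensor = dequantize(q_tensor, scale_tensor)
per_group = quantize_grouped(weights, group_size=4)

print("per-tensor ints: ", q_tensor)
print("per-tensor error:", mean_abs_error(weights, per_tensor))
print("per-group error: ", mean_abs_error(weights, per_group))
```

With one scale for the whole tensor, the large weights set the step size and the four small weights all round to zero. With a scale per group of four, the small weights keep their own finer grid.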

Some approaches also use a calibration dataset. The tool runs representative inputs through the model to learn which values matter most. Calibration does not need to be huge, but it should look like the prompts the model will serve.
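One simple signal a calibration pass can collect is how large the activations entering a layer tend to be, per input channel. The sketch below is an illustration under that assumption, not the procedure of any particular tool. The activation values are invented and `channel_importance` is a hypothetical helper.

```python
def channel_importance(calibration_activations):
    """Average absolute activation per input channel across calibration samples."""
    n_channels = len(calibration_activations[0])
    totals = [0.0] * n_channels
    for sample in calibration_activations:
        for channel, value in enumerate(sample):
            totals[channel] += abs(value)
    return [t / len(calibration_activations) for t in totals]

# Hypothetical activations entering one linear layer: 3 samples x 4 channels.
calibration_activations = [
    [0.1, 5.0, 0.2, 0.0],
    [0.2, 4.0, 0.1, 0.1],
    [0.0, 6.0, 0.3, 0.1],
]

importance = channel_importance(calibration_activations)
ranked = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
print("channels ranked by average magnitude:", ranked)
# channels ranked by average magnitude: [1, 2, 0, 3]
```

Rounding error on weights that multiply channel 1 gets amplified far more than error on the other channels, so a method can spend its precision there. If the calibration prompts look nothing like production traffic, this ranking can point at the wrong channels.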

int8, int4, and where quality drops

Int8 is often a conservative first step. Many models survive it well, especially when the runtime and hardware support it cleanly. Int4 is more aggressive. It can be the difference between "does not fit" and "runs locally," but it is more likely to hurt long-context behavior, reasoning-heavy tasks, rare tokens, multilingual tasks, and precise formatting.

Do not judge quality from one chat. Compression failures are often slice-specific. A quantized model may answer easy questions fine but fail on tool-call JSON, math, rare labels, or safety boundaries. That is why the eval set from the fine-tuning course matters here too.
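A per-slice comparison is easy to sketch. The results below are invented for illustration, and the slice names and the 2-point threshold are assumptions. The point is only the shape of the check: score the baseline and the quantized model on the same eval set, then compare slice by slice instead of in aggregate.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs -> {slice_name: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for slice_name, ok in results:
        total[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / total[name] for name in total}

def overall_accuracy(results):
    return sum(int(ok) for _, ok in results) / len(results)

def regressions(baseline, quantized, max_drop=0.02):
    """Slices where the quantized model loses more than `max_drop` accuracy."""
    drops = {name: baseline[name] - quantized.get(name, 0.0) for name in baseline}
    return {name: round(drop, 4) for name, drop in drops.items() if drop > max_drop}

# Made-up results: a large easy slice and a small tool-call slice.
baseline_results = (
    [("easy_qa", True)] * 950 + [("easy_qa", False)] * 50
    + [("tool_json", True)] * 18 + [("tool_json", False)] * 2
)
quantized_results = (
    [("easy_qa", True)] * 940 + [("easy_qa", False)] * 60
    + [("tool_json", True)] * 12 + [("tool_json", False)] * 8
)

overall_drop = overall_accuracy(baseline_results) - overall_accuracy(quantized_results)
slice_drops = regressions(accuracy_by_slice(baseline_results),
                          accuracy_by_slice(quantized_results))
print(f"overall drop: {overall_drop:.3f}")
print("slice regressions:", slice_drops)
```

In this toy data the overall accuracy drops by under two points while the `tool_json` slice loses thirty.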

Common mistake

A small perplexity hit does not always mean your product is safe. A small average drop can hide a large failure on the exact slice your application depends on.

What GGUF is for

GGUF is a model file format used heavily in the llama.cpp ecosystem for local and CPU-friendly inference. A GGUF file stores quantized weights plus metadata the runtime needs to load the model. You will see names like `Q4_K_M`, `Q5_K_M`, or `Q8_0`, which describe quantization variants.
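As a sanity check before loading a file, the fixed-size header can be read directly. The sketch below assumes the published GGUF layout for version 2 and later in little-endian form: a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata key/value counts. It does not parse the metadata that follows, which is where details such as the quantization type live. The demo writes a synthetic header with made-up counts so it runs without a real model file.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed GGUF header fields (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }

# Demo on a synthetic 24-byte header (the counts are arbitrary).
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(struct.pack("<4sIQQ", b"GGUF", 3, 291, 24))
    demo_path = tmp.name

demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

For anything beyond a quick check, the loaders and tools that ship with the runtime are the safer route.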

For an engineer, the point is not to memorize every suffix. It is that the format and quantization type are part of the deployment target. A model that is great as an fp16 Hugging Face checkpoint may behave differently after conversion to a GGUF quant.

If your target is local inference, edge inference, or CPU fallback, test the exact GGUF file you plan to ship. File format, runtime, prompt template, and sampling settings all affect behavior.

Quantization-aware training

Post-training quantization is the quick path. Quantization-aware training is the heavier path: during training or fine-tuning, the model is exposed to the effects of lower precision. The model learns to be robust to the rounding noise.
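A common way to expose a model to rounding during training is "fake quantization": the forward pass uses weights rounded to the low-precision grid, while the optimizer keeps updating a full-precision copy as if the rounding were not there (a straight-through estimate of the gradient). The toy below is a sketch of that mechanic on a single weight, with made-up data and a fixed scale. It is not a recipe for a real model.

```python
def fake_quant(w, scale=0.1, bits=4):
    """Round-trip a weight through the integer grid: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale

def train_qat(data, steps=200, lr=0.05):
    """Fit y = w * x where the forward pass only ever sees the rounded weight."""
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        wq = fake_quant(w)  # forward pass uses the quantized weight
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in data) / len(data)
        # Straight-through estimate: apply that gradient to w as if d(wq)/dw were 1.
        w -= lr * grad
    return w, fake_quant(w)

# Toy data generated from a "true" weight of 0.37, which is not on the 0.1 grid.
data = [(x, 0.37 * x) for x in (1.0, 2.0, 3.0)]
latent_w, quantized_w = train_qat(data)
print("latent weight:   ", latent_w)
print("quantized weight:", quantized_w)
```

In this toy the latent weight settles near a rounding boundary, so the quantized weight ends on one of the two grid points next to 0.37. In a real network, the other weights can also shift to compensate for each other's rounding error, which is where the quality recovery comes from.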

This can recover quality, especially for aggressive formats, but it costs more. It also gives you another training recipe to own. For many product teams, the practical order is simple: try post-training quantization, evaluate hard, then reach for quantization-aware training only if the cheaper method fails and the deployment target is worth it.

Engineering reality

Quantization is a deployment experiment, not just a conversion command. Record the source checkpoint, quantization method, calibration data, runtime, hardware, prompt template, and eval result together.
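One lightweight way to keep those fields together is a single record per run, written out next to the artifact. The sketch below uses a dataclass serialized to JSON. The class name, field names, and every value are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class QuantizationRun:
    """Everything needed to reproduce and compare one quantization experiment."""
    source_checkpoint: str
    quantization_method: str
    calibration_data: str
    runtime: str
    hardware: str
    prompt_template: str
    eval_results: dict

# Placeholder values for illustration only.
run = QuantizationRun(
    source_checkpoint="example-7b-finetune-v3",
    quantization_method="post-training, 4-bit, group size 128",
    calibration_data="512 sampled production-like prompts",
    runtime="example-runtime 1.0",
    hardware="example laptop CPU, 16 GB RAM",
    prompt_template="chat-template-v2",
    eval_results={"easy_qa": 0.94, "tool_json": 0.60},
)

record = json.dumps(asdict(run), indent=2)
print(record)
```

Storing the eval result in the same record as the method and runtime makes it harder to compare two runs that quietly differ in something other than the quantization.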

Checkpoint

You're ready for the next lesson if you can answer these from memory:

Why does lowering weight precision reduce memory?
Why can int4 hurt quality more than int8?
What does a calibration dataset help with?
Why should you test the exact GGUF file you plan to ship?
