Lesson 03

Full fine-tuning vs LoRA, PEFT, and QLoRA

There is more than one way to adapt a model. The main question is simple: do you update the whole model, or train a small set of extra weights that steer it?

The one idea

Full fine-tuning updates the base model's weights. Parameter-efficient fine-tuning freezes most of the base model and trains a small adapter. LoRA and QLoRA are adapter methods that make useful tuning possible on far less hardware.

Full fine-tuning

In full fine-tuning, every weight in the model can move. A 7B-parameter model has billions of knobs, and training must keep a gradient and optimizer state for each one (with Adam, two extra values per weight). That gives the training run maximum flexibility, but it is expensive in GPU memory, storage, and operational complexity.
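To make the memory cost concrete, here is a rough back-of-the-envelope sketch. The byte counts are assumptions (a common mixed-precision Adam setup), not figures from this lesson, and activation memory is ignored, so treat the result as a lower bound on order of magnitude rather than a measurement.

```python
# Rough training-memory estimate for full fine-tuning.
# Every byte count below is an illustrative assumption, and activations are ignored.

def full_finetune_bytes(num_params: int,
                        weight_bytes: int = 2,           # assumed 16-bit weights
                        grad_bytes: int = 2,             # assumed 16-bit gradients
                        master_weight_bytes: int = 4,    # assumed fp32 master copy
                        optimizer_state_bytes: int = 8   # assumed Adam: two fp32 values per weight
                        ) -> int:
    per_param = weight_bytes + grad_bytes + master_weight_bytes + optimizer_state_bytes
    return num_params * per_param

params_7b = 7_000_000_000
weights_only_gb = params_7b * 2 / 1e9                 # just holding the model in 16-bit
training_gb = full_finetune_bytes(params_7b) / 1e9    # weights + gradients + optimizer state

print(f"weights only: {weights_only_gb:.0f} GB, training state: {training_gb:.0f} GB")
```

Under these assumptions the training state is several times larger than the weights alone, which is why full fine-tuning of a 7B model rarely fits on one GPU.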

Full fine-tuning makes sense when you have a large, high-quality dataset, a stable target domain, enough compute, and a reason to deeply reshape the model. It can be useful for serious domain adaptation or post-training a base model into a product model. It is usually not the first move for a team trying to improve one application workflow.

PEFT: change less, get most of the gain

PEFT means parameter-efficient fine-tuning. Instead of updating all weights, you freeze the base model and train a much smaller set of parameters. The base model remains intact. The adapter learns a task-specific nudge.

The benefit is practical: lower memory, faster experiments, smaller artifacts, and easier rollback. You can keep one base model and many adapters for different customers, languages, or tasks. Serving can either load the adapter at runtime or merge it into the base weights when that makes deployment simpler.
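The "one base, many adapters" idea can be sketched with a toy model. Everything here is hypothetical: the base is a dict of plain lists, each adapter is a dict of same-shaped deltas, and the names `BASE`, `ADAPTERS`, and `effective_weights` are made up for illustration. Real adapters are structured differently, but the bookkeeping is similar.

```python
# Toy stand-ins: a "base model" is a dict of named weight lists,
# and an adapter is a dict of same-shaped deltas.

BASE = {"layer0": [0.5, -1.0, 2.0]}  # frozen; never mutated below

ADAPTERS = {  # hypothetical per-customer adapters sharing one base
    "customer_a": {"layer0": [0.1, 0.0, -0.2]},
    "customer_b": {"layer0": [-0.3, 0.2, 0.0]},
}

def effective_weights(base, adapter=None):
    """Return base + adapter as a new dict; the base is left untouched."""
    if adapter is None:
        return {name: list(w) for name, w in base.items()}
    return {
        name: [b + d for b, d in zip(w, adapter.get(name, [0.0] * len(w)))]
        for name, w in base.items()
    }

# Runtime loading: choose an adapter per request.
weights_a = effective_weights(BASE, ADAPTERS["customer_a"])

# Merging: the same arithmetic done once and stored as standalone weights.
merged_b = effective_weights(BASE, ADAPTERS["customer_b"])

# Rollback: drop the adapter and the original weights are back.
rolled_back = effective_weights(BASE)
```

Because the base is never modified, rollback is just a matter of not applying the adapter, and the only artifact to store per task is the small delta.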

LoRA: small matrices that steer big ones

LoRA stands for low-rank adaptation. A transformer layer is full of large weight matrices. LoRA freezes those matrices and learns a small low-rank update beside each selected one, stored as the product of two thin matrices. During inference, the model uses the original matrix plus the learned update.

The low-rank part matters because it assumes the useful change lives in a smaller subspace than the full matrix. That sounds like a math trick, but the engineering result is direct: you train far fewer parameters while still steering the model's behavior.
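A quick count shows how much smaller the trainable set becomes. This sketch assumes the usual LoRA shape, where a weight matrix of size d_out x d_in gets an update built from two factors of sizes d_out x r and r x d_in. The 4096 x 4096 matrix and rank 8 are example values, not numbers from this lesson.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Trainable parameters for one matrix: full update vs. low-rank factors."""
    full = d_out * d_in            # every entry of the matrix is trainable
    lora = rank * (d_in + d_out)   # two thin factors: (d_out x r) and (r x d_in)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```

For this example matrix, the adapter trains well under one percent of the parameters a full update would touch.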

LoRA does not rewrite the whole model. It learns a compact update that is added to selected weight matrices.
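The "original matrix plus learned update" step can be sketched in plain Python. This follows the commonly used LoRA formulation, y = W x + (alpha / r) * B (A x), with B initialized to zero so the adapter starts as a no-op. The tiny dimensions, the `alpha` value, and the random "trained" values for B are assumptions chosen only to make the example run.

```python
import random

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen."""
    scale = alpha / rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge(W, A, B, alpha, rank):
    """Fold the adapter into the base: W' = W + (alpha / rank) * B A."""
    scale = alpha / rank
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

random.seed(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]   # r x d_in
B_init = [[0.0] * rank for _ in range(d_out)]                            # zero init: no-op adapter
B_trained = [[random.gauss(0, 0.1) for _ in range(rank)]                 # stand-in for trained values
             for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_start = lora_forward(W, A, B_init, x, alpha, rank)        # matches y_base
y_adapter = lora_forward(W, A, B_trained, x, alpha, rank)   # steered output
y_merged = matvec(merge(W, A, B_trained, alpha, rank), x)   # matches y_adapter
```

The last two lines illustrate why merging is optional: applying the adapter beside the base and folding it into the base give the same output, up to floating-point rounding.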

QLoRA: quantize the base, train the adapter

QLoRA keeps the base model in a quantized form, typically 4-bit, while training LoRA adapters. Quantization shrinks the memory footprint of the frozen model. The quantized weights are dequantized on the fly to a higher-precision type for each forward and backward pass, and gradients update only the adapters, so the giant base takes much less room at rest.
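To show what "quantize the base" means mechanically, here is a deliberately simplified sketch. It uses plain symmetric absmax rounding to 4-bit integer codes on one small block, which is an assumption for illustration and not the specialized 4-bit format that QLoRA implementations actually use. The weight values and the 7B parameter count are likewise made-up examples.

```python
def quantize_block(block, levels=7):
    """Symmetric absmax quantization to integer codes in [-7, 7], which fit in 4 bits."""
    absmax = max(abs(v) for v in block)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [max(-levels, min(levels, round(v / scale))) for v in block]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate weights for use in higher-precision compute."""
    return [c * scale for c in codes]

weights = [0.12, -0.70, 0.33, 0.05, -0.21, 0.48, -0.02, 0.61]  # example frozen weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))  # at most about scale / 2

# Storage for a frozen base, ignoring the small per-block scales.
params = 7_000_000_000
gb_16bit = params * 2 / 1e9    # 2 bytes per weight
gb_4bit = params * 0.5 / 1e9   # half a byte per weight
print(f"16-bit base: {gb_16bit} GB, 4-bit base: {gb_4bit} GB")
```

The restored weights are close to the originals but not identical, which is acceptable here because the base is frozen and the adapter learns on top of it.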

The tradeoff is complexity and sometimes speed: the extra dequantization work can make each training step slower than plain LoRA. Quantized training stacks can also be more sensitive to library versions, kernels, and GPU type. But QLoRA is often the difference between "needs a large multi-GPU machine" and "fits on a single accessible GPU" for experimentation.

How to choose

Start with LoRA or QLoRA unless you have a concrete reason not to. Adapters make the experiment loop cheaper, and cheap loops matter because you will run more than one tuning job. If the adapter plateaus and the dataset is strong, then consider full fine-tuning.

Use full fine-tuning when the target requires broad model changes and you can afford the cost and risk. Use LoRA when you want strong adaptation with simpler serving and you have enough memory to hold the unquantized base model. Use QLoRA when memory is tight and the extra stack complexity is worth it.
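The guidance above can be written down as a small decision helper. This is only one possible encoding of the rule of thumb. The function name and its three yes/no inputs are hypothetical, and a real decision would weigh more factors than these.

```python
def suggest_method(needs_broad_change: bool,
                   can_afford_full_run: bool,
                   base_fits_unquantized: bool) -> str:
    """Encode the rule of thumb: adapters by default, full fine-tuning only with cause."""
    if needs_broad_change and can_afford_full_run:
        return "full fine-tuning"
    if base_fits_unquantized:
        return "LoRA"
    return "QLoRA"  # memory is tight, so quantize the frozen base

# A team with one modest GPU and a narrow task:
print(suggest_method(needs_broad_change=False,
                     can_afford_full_run=False,
                     base_fits_unquantized=False))  # QLoRA
```

Note that the helper never returns full fine-tuning unless both conditions hold, mirroring the advice to start with adapters.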

Engineering reality

The first fine-tuning win usually comes from better data, not a more exotic method. A clean LoRA run on a sharp dataset will usually beat a full fine-tune on messy examples.

Checkpoint

You're ready for the next lesson if you can answer these from memory:

What changes during full fine-tuning?
Why are adapter methods easier to experiment with?
What is LoRA adding beside the frozen base weights?
Why does QLoRA reduce memory needs?

Quick check

Why are adapter methods easier to experiment with?

They adapt fewer parameters, making experiments cheaper and simpler
They make overfitting impossible
They replace the need for retrieval

QLoRA reduces memory needs by:

Skipping gradient descent entirely
Keeping the frozen base model quantized while training adapters
Removing attention layers from the model