Why quantization?

As models grow larger, GPU memory has become a precious resource that bottlenecks both training and inference. While reducing the memory footprint of a model is useful in any setting, it is especially helpful in resource-constrained settings such as doing LLM research on a budget or deploying LLMs to the edge.

Since an ML model is basically a set of weights, its memory footprint depends directly on the number of parameters and the size of each parameter in memory. It follows that to reduce the memory footprint, we need to reduce either the number of params or the size of each param. This maps onto three popular methods for reducing model size.
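
Before getting to those methods, a quick back-of-the-envelope calculation makes the stakes concrete (the 7B parameter count below is just an illustrative choice):

```python
# Rough memory footprint of a hypothetical 7B-parameter model.
num_params = 7_000_000_000

bytes_fp32 = num_params * 4      # 32 bits = 4 bytes per param
bytes_4bit = num_params * 0.5    # 4 bits  = 0.5 bytes per param

print(f"fp32 : {bytes_fp32 / 1e9:.1f} GB")   # 28.0 GB
print(f"4-bit: {bytes_4bit / 1e9:.1f} GB")   # 3.5 GB
```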

Pruning

This involves reducing the number of params in a model. Now, one cannot remove weights from a network willy-nilly. In general, the idea is to remove weights that have minimal impact on the prediction accuracy of the network. While the devil is in the details, a general characteristic of pruning-based model size reduction is the need to retrain the model after pruning. This is feasible in some cases, but the viability of doing it for really large networks is questionable.
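
As a concrete illustration, here is a minimal sketch of magnitude-based pruning on a single weight matrix, assuming NumPy; real pruning methods are more careful about which weights to drop and, as noted above, typically retrain afterwards:

```python
import numpy as np

# Magnitude pruning sketch: zero out the smallest 50% of weights (by absolute
# value) in a toy weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)

sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)   # magnitude cutoff
mask = np.abs(W) >= threshold                  # keep only large-magnitude weights
W_pruned = W * mask

print(f"fraction of weights kept: {mask.mean():.2f}")
```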

Distillation

Another popular technique for reducing model size is distillation. This too reduces the number of model parameters. To do so, it employs a teacher-student training scheme where the student (smaller) model is trained on a combination of the original training data and the outputs of the teacher (larger) model. This technique has proved to be very effective in improving the performance of smaller LLMs. However, distillation does require non-trivial effort, and doing it well is an active area of research.
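
To make the teacher-student idea concrete, here is a minimal sketch of a classic distillation loss in PyTorch; the function name, the temperature `T`, and the mixing weight `alpha` are illustrative choices, not from any particular library:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard loss: usual cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence pulling the student's temperature-softened
    # output distribution towards the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy usage with random logits for a batch of 8 examples and 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```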

Quantization

Quantization takes a different approach: it preserves the number of parameters while reducing the size of each parameter in memory. This is done by using lower-precision data types to represent each parameter instead of the standard 32 bits. For instance, using 4 bits instead of 32 directly reduces the memory footprint by 8x. An advantage of this technique is its simplicity and generality: it can be applied to any neural network (with some caveats for specific scenarios), and the transformation itself is quite straightforward (as opposed to the complicated retraining required by the two methods above).
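
As a small preview of what this looks like in practice, here is a minimal sketch (assuming NumPy) of storing a float32 matrix as int8 values plus a single scale factor; the simple symmetric scheme below is just for illustration:

```python
import numpy as np

# Store a float32 matrix as int8 plus one scale factor, then reconstruct it.
rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

scale = np.abs(w).max() / 127.0           # map the observed range onto int8
w_q = np.round(w / scale).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale    # approximate reconstruction

print(f"fp32 size: {w.nbytes / 1e6:.1f} MB, int8 size: {w_q.nbytes / 1e6:.1f} MB")
print(f"max abs error: {np.abs(w - w_deq).max():.4f}")
```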

The rest of this post focuses on quantization as a method to reduce model memory footprint during inference and training. In particular, we will look at range-based quantization methods in detail.

A note on Data types

(Figure: floating-point precision formats. Img src: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/)
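
For reference, a quick way to inspect the bit width and numeric range of some of these formats, assuming PyTorch is available (TF32 is a tensor-core compute mode rather than a storage dtype, so it does not appear here):

```python
import torch

# Bit width, largest representable value, and machine epsilon of common
# floating-point dtypes.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} bits={info.bits:2d} max={info.max:.2e} eps={info.eps:.2e}")

# Integer dtypes are inspected with iinfo instead.
int8_info = torch.iinfo(torch.int8)
print(f"{str(torch.int8):15} bits={int8_info.bits:2d} max={int8_info.max}")
```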