Quantization in PyTorch

Overview

Computational resources on the server and the device must be utilized efficiently when developing deep neural models for production systems.

Various techniques are available to support more efficient deployment on servers and edge devices by working around the bottlenecks faced when deploying large deep neural nets in production. This article talks about a technique called quantization that aims to speed up model inference time and is very effective in reducing the memory and storage footprint of the model, thus making it easier to deploy and maintain.

Introduction

Large deep-learning models must perform many matrix and vector calculations to produce predictions for the given input data. When models are deployed in practice, this heavy computation greatly hinders their performance in both research and production settings owing to latency and throughput concerns.

To successfully run deep learning systems in production, we need a faster way of producing predictions from the models. Many techniques are available to speed up deep learning model inference time. Among those, quantization is one very popular technique that has been shown to provide impressive speed-ups. For instance, the Roblox engineering team describes in their article how they scaled BERT to serve 1+ billion daily requests on CPUs.

To this end, this article is a walkthrough of quantization as a concept and how PyTorch quantization can be used to quantize large deep neural networks in PyTorch, one of the most popular deep learning libraries.

What is Quantization?

Quantization is a model optimization technique that aims to reduce the model's size and speed up the inference process from the models by simplifying the mathematical operations the model performs to reach an output value using an input value.

Most deep learning libraries rely on the float32 data type, i.e., the floating-point type requiring 32 bits, for their vector and matrix calculations.

Quantization targets this aspect and enables faster computations as it simplifies the math by reducing the number of bits required to represent a number. It does so by mapping tensors with data type float32 to tensors with data type int8.

int8 is the integer data type with 8 bits - this means that it can represent $2^8 = 256$ total values in the range $-2^7 = -128$ through $2^7 - 1 = 127$.

float32 takes up four times as much space as int8. In other words, by mapping numbers in the float32 format to the int8 format, we require only 25% of the space that was required initially to represent the same numbers. This translates to the following gains -

  • 4 times reduction in the size of the model
  • 2-4 times reduction in the required memory bandwidth
  • 2-4 times faster inference due to savings in memory bandwidth and faster computing with int8 arithmetic (although the exact speed up varies depending on the hardware, the runtime, and the model at hand).

Quantization maps the float32 data type into int8 by creating bins (or intervals) that determine which integer a given floating-point number will be mapped to.

Hence, using lower-precision numerical formats reduces the storage space (memory access) required without changing the model architecture by compromising the precision of the numbers. The overall result is a faster running model (reduced latency) with a lower memory footprint and reduced computations that we can scale easily.
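To make this mapping concrete, here is a minimal sketch using torch.quantize_per_tensor; the scale and zero-point values below are hand-picked for illustration, not values a real calibration procedure would produce.

```python
import torch

# Toy illustration of the affine mapping q = round(x / scale) + zero_point,
# clamped to the int8 range [-128, 127]. Scale and zero_point are illustrative.
x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
q = torch.quantize_per_tensor(x, scale=1 / 128, zero_point=0, dtype=torch.qint8)

print(q.int_repr())    # underlying int8 values, e.g. [-128, 0, 64, 127]
print(q.dequantize())  # approximate float32 values recovered from int8
```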

Benefits of Quantization

  • Quantization yields improved speed while dealing with very large neural networks, as the conversion from float32 to int8 leads to faster calculations and enables us to run inference on very large models in comparatively less time.
  • The speed vs. accuracy trade-off - While we are at it, it is important to note that by lowering the precision of our numbers, quantization compromises accuracy: mapping float32 to int8 introduces approximations into the network, which leads to slightly less accurate results. This point also forms the basis for one of the caveats of quantization that we are going to discuss next.

Caveats of Quantization

  • Quantization is an inference-only technique - during the training of deep neural networks, errors are backpropagated and gradients are calculated, which are then used to update the model parameters. It is crucial to note that the int8 data type is not numerically accurate enough to represent these gradients and hence to backpropagate errors during training. While we can afford to operate with some level of approximation during the model inference stage, it is not viable to work with the int8 data type during training: performing backpropagation in int8 arithmetic would almost certainly cause the model weights to diverge rather than converge to an optimal solution.

  • One-shot quantization - Some models might perform differently after being subjected to one-shot quantization - the accuracy might decrease so much that the deployed model becomes useless. For such cases, we can choose not to apply quantization to the entire model and instead run only certain parts of the network in int8 format while leaving the other parts to operate in float32.

PyTorch Quantization

Let us get an overview of the support provided by PyTorch for the quantization of models -

  • PyTorch has defined data types corresponding to quantized tensors, and these data types share many of the features of tensors.
  • To customize our implementation, quantized tensors can be used to write kernels similar to how we write kernels for floating point tensors.
  • With the torch.nn.quantized and torch.nn.quantized.dynamic APIs, PyTorch supports quantized modules for common operations. However, we also need to note that not all layers have quantized implementations. For example, some layers, like accumulation layers, accumulate errors too quickly, making quantization a curse rather than a boon. For a list of supported operators, see the PyTorch quantization documentation.
  • Quantized models are also compatible with PyTorch's traceable and scriptable operations.
  • Currently, quantized operations are supported only for the following CPU backends - x86 and ARM.

Types of PyTorch Quantization

PyTorch supports three types of quantization that primarily differ in where the bins for the conversion of fp32 to int8 are defined.

All three PyTorch quantization techniques differ in how and where they fine-tune the quantization algorithm and determine the bins to map float32 vectors into int8, owing to which each of them has certain advantages and disadvantages over the others.

Let us now discuss the three types of PyTorch Quantization and understand the API PyTorch provides.

Dynamic Quantization

As its name suggests, Dynamic Quantization fine-tunes its algorithm to map numbers from the float32 format into int8 "dynamically at runtime".

During the runtime, to map the fp32 vectors into int8 ones, the fp32 vectors are multiplied by a scaling factor, and the resulting numbers are rounded off to the nearest integer to get an int8 value in the range (-128 to 127). While the weights of the model parameters are already known to us and hence are quantized before getting used, the activations are quantized as we go (hence dynamically during the runtime) just before computations succeeding the activation layer are performed.

The scaling factor is subjected to small adjustments during the runtime based on the observed input values in the data - the adjustments are carried out until an optimal conversion algorithm is obtained.

This essentially makes Dynamic Quantization an on-the-fly approach without many configuration choices, which is also demonstrated by how it can be performed in PyTorch with a single line of code, like so -
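Below is a minimal sketch of that one-liner using torch.quantization.quantize_dynamic; the model variable is assumed to be any trained torch.nn.Module.

```python
import torch

# Dynamically quantize all Linear layers of a trained model (assumed to exist)
# so that their weights are stored as int8 and their matmuls run in int8.
quantized_model = torch.quantization.quantize_dynamic(
    model,                # the trained float32 model
    {torch.nn.Linear},    # set of layer types to quantize dynamically
    dtype=torch.qint8,    # target dtype for the quantized weights
)
```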

Here, model is the model we want to quantize dynamically, {torch.nn.Linear} determines the set of layer types to dynamically quantize, and dtype=torch.qint8 determines the target dtype for the quantized weights.

This way, the calculations are performed using efficient int8 matrix multiplication and convolution implementations, resulting in faster operations.

Even though the calculations with activations are performed using int8 operations for faster calculations, the activations are read and written to memory in the floating-point format only.

Dynamic Quantization can provide a decently performing model when applied to large models in natural language processing (transformers, recurrent neural networks, etc.), making it possible to deploy these models successfully in production. Deploying them is otherwise challenging due to their huge size, where the memory required to perform such a large number of precise calculations is a major potential bottleneck.

However, due to the simplicity with which it determines the conversion algorithm dynamically, Dynamic Quantization often proves to be the least performant quantization technique. It can serve as a check or baseline to see how much the model's performance degrades upon quantization.

Post-Training Static Quantization

Again, as the name suggests, Post-Training Static Quantization fine-tunes its conversion algorithm after the model training process is completed.

Post-Training Static Quantization further accelerates the inference process by enabling the networks to use both integer arithmetic and int8 memory accesses.

In other words, Static Quantization works to provide further speed-ups by allowing us to pass quantized int8 values between the calculations instead of having to convert the values to floats - and then back to ints - between every operation (essentially eliminating the storing and reading of activations in fp32 format).

Post-Training Static Quantization consists of a fine-tuning step between the completion of model training and the inference step. This fine-tuning step is separate from model training and is used to fine-tune the quantization algorithm that converts the fp32 number format into int8. This is done by feeding batches of data to the trained model to compute the resulting distributions of the different activations (we will see shortly how this is done by inserting "observer" modules provided by PyTorch at different points that are meant to record these distributions).

The information from how the activations are distributed determines the bins for quantizing data during inference.

Importantly, to implement Post-Training Static Quantization, we need to insert two layers, torch.quantization.QuantStub and torch.quantization.DeQuantStub, in the module initialization code, like so -
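Here is a minimal sketch of such a module; the layer names (quant_layer, conv, relu, dequant_layer) and the shapes are illustrative placeholders.

```python
import torch
import torch.nn as nn

class QuantizedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant_layer = torch.quantization.QuantStub()      # fp32 -> int8 at the input
        self.conv = nn.Conv2d(1, 16, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant_layer = torch.quantization.DeQuantStub()  # int8 -> fp32 at the output

    def forward(self, x):
        x = self.quant_layer(x)
        x = self.relu(self.conv(x))
        return self.dequant_layer(x)
```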

The quant_layer converts the numbers in fp32 to int8 so that conv and relu will run in int8 format and then the dequant_layer will perform the int8 to fp32 conversion.

Operator Fusion in Post-Training Static Quantization

PyTorch provides an API called torch.quantization.fuse_modules that can be used to fuse multiple operations (layers) into a single operation (a single layer). This provides twofold benefits - it saves on memory access (thus providing further speed-up) by pushing the combined operations into the low-level library where they can be computed in one go, and it also improves the operation's numerical accuracy.

We can pass to torch.quantization.fuse_modules the named model layers to perform fusion on, like so -
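A minimal sketch, assuming the toy QuantizedCNN module defined above with layers named conv and relu:

```python
# Fuse the conv + relu pair into a single fused module; fusion for
# post-training static quantization is done with the model in eval mode.
model_fp32 = QuantizedCNN().eval()
model_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
```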

Observer Modules

As we discussed earlier, the distributions of the activations are recorded to determine the conversion algorithm. To record these statistics before the quantization, PyTorch supports observer modules via its torch.quantization.prepare API, which can be used like so -
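A minimal sketch, continuing from the fused model above; the 'fbgemm' qconfig targets x86 CPUs and is an assumption about the deployment backend.

```python
# Attach a quantization configuration and insert observer modules that will
# record the distributions of activations during calibration.
model_fused.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model_fused)
```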

Taking it all together - we will now put all of these components together and implement a simple end-to-end example of Post-Training Static Quantization in PyTorch (the code is inspired by the PyTorch docs), like so -
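The following is a minimal sketch; the QuantizedCNN module, the 'fbgemm' backend choice, and the random calibration data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy model with quantization stubs (repeated here so the example is self-contained).
class QuantizedCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant_layer = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 16, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant_layer = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant_layer(x)
        x = self.relu(self.conv(x))
        return self.dequant_layer(x)

# 1. Put the trained model in eval mode and fuse conv + relu.
model_fp32 = QuantizedCNN().eval()
model_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# 2. Attach a qconfig ('fbgemm' targets x86 CPUs) and insert observers.
model_fused.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model_fused)

# 3. Calibrate with representative data so the observers record activation ranges.
calibration_batch = torch.randn(8, 1, 28, 28)  # placeholder calibration data
model_prepared(calibration_batch)

# 4. Convert the calibrated model to a real int8 model and run inference.
model_int8 = torch.quantization.convert(model_prepared)
output = model_int8(torch.randn(1, 1, 28, 28))
```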

Unlike dynamic quantization, where we can specify what layers to perform quantization on, static quantization is performed on all model layers by default.

To turn it off for a particular layer, we can set its qconfig field to None, like so - model.conv_1_4.qconfig = None.

In practice, static quantization is the right technique for medium-to-large-sized models that use convolutions.

Quantization Aware Training

Quantization Aware Training, going by its name, makes the training process aware that the model will be quantized for inference.

During the training passes (both the forward and the backward pass), all weights and activations that are in fp32 format are made to mimic (or simulate) the int8 format; in a way, they are "fake quantized" so that the optimizer becomes aware of the quantization process during training itself.

It is important to keep in mind that the training still takes place in floating-point precision; as we have already discussed, the 8-bit integer format is not adequate to represent gradients and hence cannot support the training process.

PyTorch supports Quantization Aware Training by injecting FakeQuantize modules into the model, and this method usually leads to the best performance compared to the other two methods we studied above.

Naturally, this comes at the cost of increased training time, with proper care required to ensure that the model continues to converge.

It is again fairly simple to implement Quantization Aware Training (QAT) in PyTorch, like so -
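A minimal sketch, reusing the toy QuantizedCNN module from the static quantization example; the 'fbgemm' backend and the omitted training loop are assumptions.

```python
# QAT is set up with the model in training mode.
model_fp32 = QuantizedCNN()
model_fp32.train()

# Attach a QAT configuration and insert FakeQuantize modules.
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model_fp32)

# ... run the usual training loop on model_prepared here (fp32 arithmetic,
# but with fake-quantized weights and activations) ...

# After training, switch to eval mode and convert to a real int8 model.
model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)
```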

As specified above, quantized inference in PyTorch is currently supported only on CPU backends, but Quantization Aware Training itself can be run on both CPU and GPU.

It is crucial to note that, unlike post-training static quantization, where the model is put in evaluation mode, in Quantization Aware Training we put the model in training mode, as the quantization is simulated during the training process itself. In contrast, in the former, calibration is an additional step performed after the model's training is completed.

Choosing a PyTorch Quantization Approach

With three PyTorch quantization techniques available to us, it helps to have a general guideline for choosing the appropriate one according to our use case.

A general guideline (as suggested by the PyTorch team itself) for choosing the right quantization strategy according to our model type is as follows:

  • For large models used in Natural Language Processing, like Long Short Term Memory networks (LSTMs), Recurrent Neural Networks (RNNs), and BERT or other transformer-based models, dynamic quantization is the suitable technique to use, as their performance during production (in terms of latency and throughput) is primarily limited by the compute and memory bandwidth required for their weights.

  • For convolutional neural networks (CNNs), post-training static quantization can be used, because for such networks the throughput is limited by the memory bandwidth for the activations. In case static quantization decreases performance to the extent that it isn't useful, Quantization Aware Training can be used for such networks.

Conclusion

Quantization is a clever technique useful for deploying large deep neural models in production systems, on edge/mobile devices, etc. This article walked us through the concept of quantization in deep neural networks. Let us review the points we studied in this article -

  • We began by getting an introduction to why techniques like quantization are required to deploy large neural models, after which we understood the concept of quantization.
  • Then we walked through some basic trade-offs and caveats of the quantization technique itself, after which we looked at the support offered in PyTorch for implementing quantization.
  • After this, we understood three different types of quantization that can be used to run faster inference from neural networks - Dynamic Quantization, Post-Training Static Quantization, and Quantization Aware Training - while also implementing each of them in PyTorch.
  • Finally, we walked through a general recipe that could be used to choose the appropriate quantization strategy according to the model at hand.