Distillation for Quantization

  1. Introduction

  2. Distillation for Quantization Support Matrix

  3. Get Started with Distillation for Quantization API

  4. Examples

Introduction

Distillation and quantization are both promising methods to reduce the computational and memory footprint of large transformer-based networks. Quantization reduces the bit precision of both activations and weights. Distillation transfers knowledge from a heavy teacher model to a light student model, and it can be used as a performance booster in lower-bit quantization. Quantization-aware training recovers the accuracy lost to representation error during the retraining process and typically provides better performance than post-training quantization.
Intel provides a quantization-aware training (QAT) method that incorporates a novel layer-by-layer knowledge distillation step for INT8 quantization pipelines.
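
To make the distillation side concrete, the sketch below shows the classic soft-label knowledge distillation loss: a KL divergence between temperature-scaled teacher and student logits, mixed with the student's task loss. This is a generic illustration, not Neural Compressor's internal implementation; the temperature value and the equal 0.5/0.5 weighting are illustrative assumptions.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_task_loss, temperature=2.0):
    # Soft targets from the teacher, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between softened distributions, rescaled by T^2 (Hinton et al.).
    kd_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # Equal weighting of task loss and distillation loss is an illustrative choice.
    return 0.5 * student_task_loss + 0.5 * kd_loss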

Distillation for Quantization Support Matrix

Algorithm                     | PyTorch | TensorFlow
------------------------------|---------|-----------
Distillation for Quantization | &#10004; | &#10006;

Get Started with Distillation for Quantization API

Users can pass customized training/evaluation functions to distillation-for-quantization tasks. In this case, the distillation process is driven by pre-defined hooks in Neural Compressor, which users place inside the quantization training function.

Neural Compressor defines several hooks for users to call in the pass-in training function:

on_train_begin(): Hook executed before training begins.
on_after_compute_loss(input, student_output, student_loss): Hook executed after each batch inference of the student model; it returns the student loss combined with the distillation loss.
on_epoch_end(): Hook executed at the end of each epoch.

The following example illustrates how to use these hooks in a user pass-in training function:

def training_func_for_nc(model):
    compression_manager.on_train_begin()
    for epoch in range(epochs):
        compression_manager.on_epoch_begin(epoch)
        for i, batch in enumerate(dataloader):
            compression_manager.on_step_begin(i)
            # ... user-defined batch preparation (e.g., move tensors to device) ...
            optimizer.zero_grad()
            output = model(batch)
            loss = output.loss  # assumes the model's forward returns an object with a .loss attribute
            # Combine the student's task loss with the distillation loss.
            loss = compression_manager.on_after_compute_loss(batch, output, loss)
            loss.backward()
            compression_manager.on_before_optimizer_step()
            optimizer.step()
            compression_manager.on_step_end()
        compression_manager.on_epoch_end()
    compression_manager.on_train_end()
    return model
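
The training function above assumes that compression_manager, epochs, dataloader, and optimizer are already defined in the enclosing scope. A minimal, hypothetical setup is sketched below; the dataset, batch size, and learning rate are placeholders, not values prescribed by Neural Compressor:

import torch
from torch.utils.data import DataLoader, TensorDataset

epochs = 3  # placeholder hyperparameter
# Placeholder dataset; in practice this is the user's own task dataset.
train_dataset = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# `model` here is the compression_manager.model created in the launcher below.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)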

With this training function and setup, the launcher code looks like the following:

from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

# Combine a distillation config and a QAT config into one compression pipeline.
combs = []
distillation_criterion = KnowledgeDistillationLossConfig()
# teacher_model is the user's pre-trained full-precision teacher.
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=distillation_criterion)
combs.append(d_conf)
q_conf = QuantizationAwareTrainingConfig()
combs.append(q_conf)
compression_manager = prepare_compression(model, combs)
model = compression_manager.model

model = training_func_for_nc(model)
eval_func(model)  # user-provided evaluation function
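
The distillation criterion can also be tuned rather than used with its defaults. A sketch with illustrative values, assuming the config's temperature, loss_types, and loss_weights parameters:

from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig

# Illustrative values: soften logits with temperature 2.0 and weight the
# hard-label cross-entropy loss and the soft-label KL loss equally.
distillation_criterion = KnowledgeDistillationLossConfig(
    temperature=2.0,
    loss_types=["CE", "KL"],
    loss_weights=[0.5, 0.5],
)
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=distillation_criterion)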

Examples

For examples of distillation for quantization, please refer to the distillation-for-quantization examples in the Neural Compressor repository.