Distillation for Quantization
============

1. [Introduction](#introduction)
2. [Distillation for Quantization Support Matrix](#distillation-for-quantization-support-matrix)
3. [Get Started with Distillation for Quantization API](#get-started-with-distillation-for-quantization-api)
4. [Examples](#examples)

### Introduction

Distillation and quantization are both promising methods to reduce the computational and memory footprint of large transformer-based networks. Quantization reduces the bit precision of both activations and weights. Distillation transfers knowledge from a heavy teacher model to a light one (the student), and it can be used as a performance booster in low-bit quantization. Quantization-aware training recovers the accuracy degradation caused by the reduced-precision representation during retraining and typically performs better than post-training quantization. Intel provides a quantization-aware training (QAT) method that incorporates a novel layer-by-layer knowledge distillation step for INT8 quantization pipelines.

### Distillation for Quantization Support Matrix

|**Algorithm**                    |**PyTorch**|**TensorFlow**|
|---------------------------------|:---------:|:------------:|
|Distillation for Quantization    |✔          |✖             |

### Get Started with Distillation for Quantization API

Users can pass customized training/evaluation functions to `Distillation` for quantization tasks. In this case, the distillation process is driven by pre-defined hooks in Neural Compressor; users place these hooks inside their quantization training function.

Neural Compressor defines the following hooks for users to call (the first three are the core distillation hooks; the others, used in the example below, mark epoch and step boundaries):

```
on_train_begin() : Hook executed before training begins
on_after_compute_loss(input, student_output, student_loss) : Hook executed after each batch inference of student model
on_epoch_end() : Hook executed at each epoch end
on_epoch_begin(epoch) : Hook executed at the beginning of each epoch
on_step_begin(batch) : Hook executed at the beginning of each step
on_before_optimizer_step() : Hook executed before each optimizer step
on_step_end() : Hook executed at the end of each step
on_train_end() : Hook executed after training ends
```

The following section illustrates how to use these hooks in a user-defined training function:

```python
def training_func_for_nc(model):
    compression_manager.on_train_begin()
    for epoch in range(epochs):
        compression_manager.on_epoch_begin(epoch)
        for i, batch in enumerate(dataloader):
            compression_manager.on_step_begin(i)
            ...
            output = model(batch)
            loss = output.loss
            loss = compression_manager.on_after_compute_loss(batch, output, loss)
            loss.backward()
            compression_manager.on_before_optimizer_step()
            optimizer.step()
            compression_manager.on_step_end()
        compression_manager.on_epoch_end()
    compression_manager.on_train_end()
    ...
    return model
```

In this case, the launcher code looks like the following:

```python
from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

combs = []
distillation_criterion = KnowledgeDistillationLossConfig()
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=distillation_criterion)
combs.append(d_conf)
q_conf = QuantizationAwareTrainingConfig()
combs.append(q_conf)
compression_manager = prepare_compression(model, combs)
model = compression_manager.model
model = training_func_for_nc(model)
eval_func(model)
```

### Examples

For examples of distillation for quantization, please refer to the [distillation-for-quantization examples](../../examples/pytorch/nlp/huggingface_models/text-classification/optimization_pipeline/distillation_for_quantization/fx/README.md).
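
In addition to the full examples linked above, the default-constructed `KnowledgeDistillationLossConfig` in the launcher code can be customized. Below is a minimal sketch; the particular values for `temperature`, `loss_types`, and `loss_weights` are illustrative choices for this sketch, not tuned recommendations:

```python
from neural_compressor.config import KnowledgeDistillationLossConfig

# Illustrative settings: soften the teacher logits with a higher temperature,
# and weight the ground-truth cross-entropy term and the teacher-student
# KL-divergence term equally.
distillation_criterion = KnowledgeDistillationLossConfig(
    temperature=2.0,
    loss_types=["CE", "KL"],
    loss_weights=[0.5, 0.5],
)
```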
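
Likewise, the `eval_func` invoked at the end of the launcher code is user-defined and not shown above. A minimal sketch for a classification model, assuming a hypothetical `val_dataloader` yielding `(inputs, labels)` batches, could look like:

```python
import torch

def eval_func(model):
    """Hypothetical evaluation function returning top-1 accuracy.

    Assumes a user-provided `val_dataloader` of (inputs, labels) batches;
    adapt the batch unpacking and forward call to your own data pipeline.
    """
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in val_dataloader:
            logits = model(inputs)
            preds = logits.argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```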