FP8 Quantization
Introduction
Floating point 8 (FP8) is a promising data type for low-precision quantization. Its data distribution is completely different from that of INT8, as shown below.
Intel Gaudi2, also known as HPU, supports this data type for low-precision quantization in two formats: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). For more information about these two data types, please refer to link.
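To make the two formats concrete, the following sketch decodes an 8-bit pattern into the real value it represents. It assumes the bit layout and exponent biases of the OCP 8-bit floating point specification (bias 7 for E4M3, bias 15 for E5M2) and ignores special encodings such as NaN and infinity. The helper name `decode_fp8` is illustrative and is not part of Intel Neural Compressor; the exact format implemented by a given accelerator may differ.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FP8 bit pattern (sign | exponent | mantissa) to a float.

    Assumption: IEEE-style encoding with bias 2**(exp_bits - 1) - 1.
    Special encodings (NaN/inf) are NOT handled in this sketch.
    """
    assert 0 <= byte <= 0xFF and exp_bits + man_bits == 7
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


# Largest finite values under the assumed encoding
e4m3_max = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3)
e5m2_max = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2)
print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

Under these assumptions E4M3 spends its bits on precision (finer steps, smaller range), while E5M2 spends them on range. In both formats the representable values are denser near zero, unlike INT8, whose values are evenly spaced.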
Intel Neural Compressor provides general quantization APIs to leverage the HPU FP8 capability. With a simple API call, users can obtain an 8-bit model with lower memory usage and lower compute cost.
Supported Parameters
Attribute | Description | Values |
---|---|---|
fp8_config | The target data type of FP8 quantization. | E4M3 (default) - As Fig. 2.<br>E5M2 - As Fig. 1. |
hp_dtype | The high-precision data type of non-FP8 operators. | bf16 (default) - torch.bfloat16.<br>fp16 - torch.float16.<br>fp32 - torch.float32. |
observer | The observer to measure the statistics. | maxabs (default), saves all tensors to files. |
allowlist | List of nn.Module names or types to quantize. When setting an empty list, all the supported modules will be quantized by default. See Supported Modules. Not setting the list at all is not recommended as it will set the allowlist to these modules only: torch.nn.Linear, torch.nn.Conv2d, and BMM. | Default = {'names': [], 'types': FP8_WHITE_LIST} |
blocklist | List of nn.Module names or types not to quantize. Defaults to empty list, so you may omit it from the config file. | Default = {'names': [], 'types': ()} |
mode | The mode, measure or quantize, to run HQT with. | MEASURE - Measure statistics of all modules and emit the results to dump_stats_path.<br>QUANTIZE - Quantize and run the model according to the provided measurements.<br>AUTO (default) - Select from [MEASURE, QUANTIZE] automatically. |
dump_stats_path | The path used to save and load the measurements. Directories are created up to the last "/"; the string after the last "/" is used as the prefix for all measurement files that are created. | Default = "./hqt_output/measure" |
scale_method | The method for calculating the scale from the measurement. | unit_scale - Always use a scale of 1.<br>hw_aligned_single_scale - Always use a scale that is aligned to the corresponding HW-accelerated scale.<br>maxabs_hw (default) - The scale is calculated to stretch/compress the maxabs measurement to the full scale of FP8 and then aligned to the corresponding HW-accelerated scale.<br>maxabs_pow2 - The scale is calculated to stretch/compress the maxabs measurement to the full scale of FP8 and then rounded to a power of 2.<br>maxabs_hw_opt_weight - The scale of model params (weights) is chosen, from all possible HW-accelerated scales, as the one that gives the minimal mean-square error between quantized and non-quantized weights. The scale of activations is calculated the same as maxabs_hw.<br>act_maxabs_pow2_weights_pcs_opt_pow2 - The scale of model params (weights) is calculated per channel of the params tensor, in the same way as maxabs_hw_opt_weight. The scale of activations is calculated the same as maxabs_pow2.<br>act_maxabs_hw_weights_pcs_maxabs_pow2 - The scale of model params (weights) is calculated per channel of the params tensor, in the same way as maxabs_pow2. The scale of activations is calculated the same as maxabs_hw. |
measure_exclude | The tensors to exclude from measurement. If this attribute is not defined, the default is OUTPUT. Since most models do not require measuring output tensors, you can exclude them to speed up the measurement process. | NONE - All tensors are measured.<br>OUTPUT (default) - Excludes measurement of output tensors. |
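The maxabs-based scale methods in the table can be illustrated with a little arithmetic. The sketch below is not the library's implementation: the function names are hypothetical, the full-scale value of 448 assumes the OCP E4M3 format (the value used by a given accelerator may differ), and rounding the power-of-2 scale upward is an assumption chosen so that scaled values never exceed the full scale. The alignment to HW-accelerated scales is not modeled.

```python
import math

# Assumption: 448 is the largest finite E4M3 value in the OCP FP8 format;
# the full-scale value used by a given accelerator may differ.
E4M3_FULL_SCALE = 448.0


def maxabs_scale(maxabs: float, full_scale: float = E4M3_FULL_SCALE) -> float:
    """Scale that maps the measured maxabs onto the FP8 full-scale value."""
    return maxabs / full_scale


def maxabs_pow2_scale(maxabs: float, full_scale: float = E4M3_FULL_SCALE) -> float:
    """Same scale, rounded to a power of 2 (rounding up is an assumption)."""
    return 2.0 ** math.ceil(math.log2(maxabs / full_scale))


measured = 100.0  # hypothetical maxabs statistic from a calibration run
print(maxabs_scale(measured))       # 100 / 448
print(maxabs_pow2_scale(measured))  # 0.25
```

Dividing a tensor by the first scale stretches its largest magnitude exactly to the full scale. The power-of-2 variant gives up a little of that range in exchange for a scale that is cheap to apply.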
Get Started with FP8 Quantization
Demo Usage
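from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
)
import torchvision.models as models

model = models.resnet18()
# Select the E4M3 FP8 format as the quantization target
qconfig = FP8Config(fp8_config="E4M3")
# Prepare the model for calibration (measurement)
model = prepare(model, qconfig)
# User-defined calibration: calib_func is not provided by the library;
# it should run representative data through the prepared model
calib_func(model)
# Convert the calibrated model to an FP8 model
model = convert(model)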
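The demo leaves `calib_func` to the user. One possible shape for it is sketched below, assuming that calibration only needs forward passes over a handful of representative batches. The factory name `make_calib_func` and the `max_batches` parameter are hypothetical and not part of Intel Neural Compressor; the stand-in "model" is a plain callable so that the sketch runs without any framework installed.

```python
from typing import Any, Callable, Iterable


def make_calib_func(
    batches: Iterable[Any], max_batches: int = 8
) -> Callable[[Callable[[Any], Any]], None]:
    """Return a calib_func(model) that feeds up to max_batches to the model."""

    def calib_func(model: Callable[[Any], Any]) -> None:
        for index, batch in enumerate(batches):
            if index >= max_batches:
                break
            model(batch)  # forward pass only; the output is discarded

    return calib_func


# Stand-in for a prepared model: records every batch it receives
seen = []
calib_func = make_calib_func(range(100), max_batches=3)
calib_func(seen.append)
print(seen)  # [0, 1, 2]
```

With a real PyTorch model, the batches would be input tensors placed on the same device as the model, and the loop would typically run under `torch.no_grad()`.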
Examples
Task | Example |
---|---|
Computer Vision (CV) | Link |
Large Language Model (LLM) | Link |
Note: For LLMs, Optimum-habana provides higher performance based on modified modeling files, so the LLM link above points to Optimum-habana, which uses Intel Neural Compressor internally for FP8 quantization.