PyTorch AutoRound
Overview
AutoRound is a weight-rounding quantization algorithm for low-bit LLM inference, integrated into Intel Neural Compressor (INC). It tunes how weights are rounded using signed gradient descent [1]. This costs additional tuning time compared with plain rounding, but it is designed to preserve accuracy at low bit widths. Within INC, AutoRound works across a wide range of models and underpins the quantization features listed below.
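To give a feel for why tuning takes extra time, the following toy sketch searches for per-weight rounding offsets with signed gradient descent on a single weight vector. It is an illustration only, not the AutoRound or Neural Compressor implementation: the helper names (`fake_quantize`, `output_errors`, `tune_rounding`), the step size, and the fixed scale are all assumptions made for this example, and the real algorithm also tunes other parameters and works on whole model blocks.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fake_quantize(weights, offsets, scale, qmin=-8, qmax=7):
    """Quantize then dequantize, adding a rounding offset v to each weight."""
    return [
        scale * max(qmin, min(qmax, round(w / scale + v)))
        for w, v in zip(weights, offsets)
    ]


def output_errors(weights, q_weights, inputs):
    """Per-sample difference between quantized and full-precision outputs."""
    return [
        sum(x_j * (q - w) for x_j, q, w in zip(x, q_weights, weights))
        for x in inputs
    ]


def tune_rounding(weights, inputs, scale, steps=200):
    """Toy signed-gradient-descent search over rounding offsets in [-0.5, 0.5]."""
    offsets = [0.0] * len(weights)  # all-zero offsets reproduce round-to-nearest
    errors = output_errors(weights, fake_quantize(weights, offsets, scale), inputs)
    rtn_loss = sum(e * e for e in errors)
    best_loss, best_offsets = rtn_loss, list(offsets)
    lr = 1.0 / steps  # assumed step size for this sketch
    for _ in range(steps):
        # Straight-through estimator: treat d(dequantized weight)/d(offset) as `scale`.
        grads = [
            sum(2.0 * e * x[j] * scale for e, x in zip(errors, inputs))
            for j in range(len(weights))
        ]
        # Only the sign of the gradient is used for the update.
        offsets = [
            max(-0.5, min(0.5, v - lr * sign(g))) for v, g in zip(offsets, grads)
        ]
        errors = output_errors(weights, fake_quantize(weights, offsets, scale), inputs)
        loss = sum(e * e for e in errors)
        if loss < best_loss:  # keep the best offsets seen so far
            best_loss, best_offsets = loss, list(offsets)
    return rtn_loss, best_loss, best_offsets


# Hypothetical weights and calibration inputs, chosen only for illustration.
weights = [0.26, -0.74, 0.49, 0.12]
inputs = [[1.0, 1.0, 1.0, 1.0], [0.5, -1.0, 2.0, 0.0]]
rtn_loss, tuned_loss, offsets = tune_rounding(weights, inputs, scale=0.5)
print(rtn_loss, tuned_loss)  # tuned_loss is never larger than rtn_loss
```

Because the search starts from zero offsets and keeps the best result, the tuned error can only match or improve on plain rounding; the price is one forward and gradient computation per step.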
Supported Features
Weight-Only Quantization (WoQ) - Quantize model weights while keeping activations in full precision. See Weight-Only Quantization for details.
Microscaling (MX) Quantization - Neural Compressor applies MX data types (for example MXFP4 and MXFP8) in post-training quantization and provides recipes for quantizing LLMs with minimal accuracy loss. See MX Quantization for details.
NVFP4 Quantization - NVFP4 is a 4-bit floating-point (FP4) format developed by NVIDIA for deep learning workloads [2]. See NVFP4 Quantization for details.
Quantization-Aware Training (QAT) - Fine-tune models during quantization to achieve better accuracy. See Quantization-Aware Training for details.
FP8 KV Cache and Attention Static Quantization (Experimental) - Quantize the key-value cache and attention computations to FP8 to improve inference performance.
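As a rough illustration of what a microscaling format does, the sketch below quantizes one block of values to an MXFP4-like representation: a shared power-of-two scale per block plus 4-bit E2M1 elements. It follows the OCP Microscaling format as I understand it (32-element blocks, element magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and is not Neural Compressor's implementation; the function names are hypothetical, and ties are broken toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 value: 6 = 1.5 * 2**2


def mx_quantize_block(block):
    """Return (shared_exponent, elements) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, exponent = math.frexp(amax)
    shared_exp = (exponent - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for x in block:
        # Scale down, saturate at the largest FP4 magnitude, snap to the grid.
        magnitude = min(abs(x) / scale, FP4_E2M1_VALUES[-1])
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - magnitude))
        elements.append(math.copysign(nearest, x))
    return shared_exp, elements


def mx_dequantize_block(shared_exp, elements):
    """Rebuild approximate floats from the shared exponent and FP4 elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


# One 32-element block; only the first six values are non-trivial.
block = [1.0, -0.5, 0.25, 0.1, 0.0, -0.75] + [0.0] * 26
shared_exp, elements = mx_quantize_block(block)
print(shared_exp)                                       # -2
print(elements[:6])                                     # [4.0, -2.0, 1.0, 0.5, 0.0, -3.0]
print(mx_dequantize_block(shared_exp, elements)[:6])    # [1.0, -0.5, 0.25, 0.125, 0.0, -0.75]
```

Only the value 0.1 changes in this example (it becomes 0.125), because the other entries already sit on the FP4 grid once divided by the shared scale of 0.25.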
Getting Started
Basic Usage
from neural_compressor.torch.quantization import AutoRoundConfig, convert, prepare

# `model` and `tokenizer` are assumed to be loaded already,
# for example with transformers' AutoModelForCausalLM and AutoTokenizer.
quant_config = AutoRoundConfig(tokenizer=tokenizer)  # tokenizer used for calibration
model = prepare(model, quant_config)
model = convert(model)
# For more detailed usage, refer to the documentation linked under Supported Features.
FP8 KV Cache and FP8 Attention Support
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.quantization import (
    AutoRoundConfig,
    convert,
    prepare,
)

# Load the full-precision model and its tokenizer.
fp32_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", trust_remote_code=True)
output_dir = "./saved_inc"

quant_config = AutoRoundConfig(
    tokenizer=tokenizer,
    scheme="MXFP4",  # options: MXFP4, MXFP8, NVFP4
    iters=0,  # 0 iterations selects round-to-nearest (RTN) mode
    seqlen=2,
    static_kv_dtype="fp8",  # options: None, fp8, float16
    static_attention_dtype=None,  # options: None, fp8
    export_format="auto_round",
    output_dir=output_dir,
)

model = prepare(model=fp32_model, quant_config=quant_config)
model = convert(model)
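The `iters=0` setting above selects round-to-nearest (RTN) mode. For readers unfamiliar with RTN, here is a minimal sketch of symmetric, group-wise round-to-nearest quantization in plain Python. It is a generic illustration under assumed settings (signed 4-bit integers, one scale per group of four weights, hypothetical function names), not the code path Neural Compressor runs for the MXFP4 scheme.

```python
def rtn_quantize(weights, bits=4, group_size=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto qmax; avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def rtn_dequantize(codes, scales, group_size=4):
    """Rebuild approximate weights from integer codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weights, chosen only for illustration.
weights = [0.7, -0.35, 0.1, 0.0, 1.4, -2.8, 0.2, 0.9]
codes, scales = rtn_quantize(weights)
restored = rtn_dequantize(codes, scales)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scales, worst)
```

RTN needs no calibration passes or gradient steps, which is why it is fast; each restored weight differs from the original by at most half of its group's scale.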
Reference
[1]. Cheng, Wenhua, et al. “Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs.” arXiv preprint arXiv:2309.05516 (2023).
[2]. NVIDIA. “Introducing NVFP4 for efficient and accurate low-precision inference.” NVIDIA Developer Blog, Jun. 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
[3]. Intel AutoRound, https://github.com/intel/auto-round