# PyTorch AutoRound

## Overview

AutoRound is an advanced model quantization algorithm integrated into Neural Compressor for low-bit LLM inference. As a key algorithm component of INC, AutoRound enables efficient quantization across a wide range of models and features while consistently achieving superior accuracy. Although it requires additional tuning time, it provides a robust foundation for INC's comprehensive quantization capabilities.

## Supported Features

- **Weight-Only Quantization (WoQ)** - Quantizes model weights while keeping activations in full precision. See [Weight-Only Quantization](./PT_WeightOnlyQuant.html) for details.
- **Microscaling (MX) Quantization** - Applies the MX data type to post-training quantization, with carefully crafted recipes that let users quantize LLMs without sacrificing accuracy. See [MX Quantization](./PT_MXQuant.html).
- **NVFP4 Quantization** - NVFP4 is a specialized 4-bit floating-point (FP4) format developed by NVIDIA for deep learning workloads. See [NVFP4 Quantization](./PT_NVFP4Quant.html).
- **Quantization-Aware Training (QAT)** - Fine-tunes models during quantization to achieve better accuracy. See [Quantization-Aware Training](./PT_QAT.html) for details.
- **FP8 KV Cache and Attention Static Quantization (Experimental)** - Quantizes the key-value cache and attention computations to the FP8 data type, improving inference performance.

## Getting Started

### Basic Usage

```python
from neural_compressor.torch.quantization import prepare, convert, AutoRoundConfig

# `model` and `tokenizer` are a Hugging Face model and its matching tokenizer.
quant_config = AutoRoundConfig(tokenizer=tokenizer)  # the tokenizer is used for calibration
model = prepare(model, quant_config)
model = convert(model)
```

For more detailed usage, please refer to the documentation of the features listed in [Supported Features](#supported-features).

### FP8 KV Cache and FP8 Attention Support

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from neural_compressor.torch.quantization import (
    AutoRoundConfig,
    convert,
    prepare,
)

fp32_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", trust_remote_code=True)
output_dir = "./saved_inc"

quant_config = AutoRoundConfig(
    tokenizer=tokenizer,
    scheme="MXFP4",  # MXFP4, MXFP8, or NVFP4
    iters=0,  # 0 tuning iterations = RTN (round-to-nearest) mode
    seqlen=2,
    static_kv_dtype="fp8",  # None, "fp8", or "float16"
    static_attention_dtype=None,  # None or "fp8"
    export_format="auto_round",
    output_dir=output_dir,
)
model = prepare(model=fp32_model, quant_config=quant_config)
model = convert(model)
```

## Reference

[1]. Cheng, Wenhua, et al. "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs." arXiv preprint arXiv:2309.05516 (2023).

[2]. NVIDIA. "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference." NVIDIA Developer Blog, Jun. 2025. [Online]. Available: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

[3]. Intel AutoRound, https://github.com/intel/auto-round
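
## Appendix: Reloading a Saved Checkpoint

The FP8 example above saves the quantized model to `output_dir` with `export_format="auto_round"` but does not show how to use it afterwards. Below is a minimal sketch, not taken from the INC documentation, assuming that a checkpoint in the auto-round format can be reloaded through the standard `transformers` entry points once the `auto-round` package is installed; the prompt and generation settings are illustrative, and whether a given scheme (e.g., MXFP4) is loadable depends on the installed inference backends.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: importing AutoRoundConfig from auto_round registers the
# auto-round quantization format with transformers on some versions.
from auto_round import AutoRoundConfig  # noqa: F401

output_dir = "./saved_inc"  # directory used in the FP8 example above

model = AutoModelForCausalLM.from_pretrained(output_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Quick generation smoke test on the reloaded quantized model.
inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```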