neural_compressor.torch.utils.block_wise
This utility is for block-wise calibration of LLMs.
Functions
- get_block_prefix(model) – Get the prefix and number of attention blocks of a transformer model.
- replace_forward(model) – Replace the forward method to capture the input args and kwargs of the first block for the AWQ algorithm.
- recover_forward(model) – Recover the model and block forward methods after AWQ capture.
- block_wise_calibration(model, ...) – Calibrate the model on HPU block by block to reduce device memory usage.
Module Contents
- neural_compressor.torch.utils.block_wise.get_block_prefix(model)[source]
Get the prefix and number of attention blocks of a transformer model.
- Parameters:
model (torch.nn.Module) – input model
- Returns:
block_prefix (str) – name of the block list in the model
block_num (int) – number of blocks in the block list
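As a rough illustration of what such a lookup does, the sketch below scans a toy model's attributes for the list that holds its transformer blocks and returns the attribute name plus the block count. All names here (ToyBlock, ToyModel, find_block_prefix) are hypothetical stand-ins, not the library's implementation.

```python
class ToyBlock:
    """Stands in for one transformer decoder block."""

class ToyModel:
    def __init__(self, num_blocks):
        # mimics a block list such as model.layers in an LLM
        self.layers = [ToyBlock() for _ in range(num_blocks)]

def find_block_prefix(model):
    # return the attribute name that holds the block list, and the block count
    for name, value in vars(model).items():
        if isinstance(value, list) and value and isinstance(value[0], ToyBlock):
            return name, len(value)
    raise ValueError("no block list found")

prefix, num = find_block_prefix(ToyModel(4))
print(prefix, num)  # layers 4
```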
- neural_compressor.torch.utils.block_wise.replace_forward(model)[source]
Replace the forward method to capture the input args and kwargs of the first block for the AWQ algorithm.
- Parameters:
model (torch.nn.Module) – input model.
- Raises:
ValueError – raised deliberately after the first block's inputs are captured, to skip inference of the remaining parts of the model.
- Returns:
model with replaced forward.
- Return type:
torch.nn.Module
- neural_compressor.torch.utils.block_wise.recover_forward(model)[source]
Recover the model and block forward methods after AWQ capture.
- Parameters:
model (torch.nn.Module) – input model.
- Returns:
model with recovered forward.
- Return type:
torch.nn.Module
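The capture-and-abort pattern that replace_forward/recover_forward describe can be sketched with plain Python objects. This is an assumed illustration of the mechanism, not the library code: the first block's forward is swapped for a hook that records its inputs and raises ValueError so the rest of the model never runs, and recovery restores the original method.

```python
class Block:
    def forward(self, x, scale=1):
        return x * scale

class Model:
    def __init__(self):
        self.blocks = [Block(), Block()]
    def forward(self, x):
        for b in self.blocks:
            x = b.forward(x)
        return x

captured = {}

def replace_forward(model):
    first = model.blocks[0]
    model._orig_forward = first.forward  # keep original for recovery
    def capture(*args, **kwargs):
        captured["args"], captured["kwargs"] = args, kwargs
        raise ValueError("stop after first block")  # skip the rest of the model
    first.forward = capture
    return model

def recover_forward(model):
    model.blocks[0].forward = model._orig_forward
    return model

m = replace_forward(Model())
try:
    m.forward(3)                 # aborts after capturing the first block's input
except ValueError:
    pass
recover_forward(m)
print(captured["args"], m.forward(3))  # (3,) 3
```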
- neural_compressor.torch.utils.block_wise.block_wise_calibration(model, dataloader=None, data=None, inference_dtype=torch.bfloat16)[source]
Calibrate the model on HPU block by block to reduce device memory usage.
- Parameters:
model (torch.nn.Module) – prepared model.
dataloader (obj) – dataloader providing calibration samples.
data (obj) – a single calibration sample.
inference_dtype (torch.dtype) – dtype used during calibration; defaults to torch.bfloat16.
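A minimal sketch of the block-by-block idea, under assumed names rather than the library's implementation: the captured input is fed to the first block, each block's output becomes the next block's input, and each block records its activations as it runs, so only one block needs to be resident on the device at a time.

```python
class Block:
    def __init__(self, w):
        self.w = w
        self.observed = []          # stands in for calibration statistics
    def forward(self, x):
        self.observed.append(x)     # record the activation seen during calibration
        return x + self.w

def block_wise_calibration(blocks, first_input):
    x = first_input
    for block in blocks:            # on HPU: move one block to device, run, move off
        x = block.forward(x)        # output feeds the next block
    return x

blocks = [Block(1), Block(2), Block(3)]
out = block_wise_calibration(blocks, 10)
print(out, [b.observed for b in blocks])  # 16 [[10], [11], [13]]
```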