neural_compressor.torch.utils.block_wise

This utility is for block-wise calibration of LLMs.

Functions

get_block_prefix(model)

Get prefix and number of attention blocks of transformer models.

replace_forward(model)

Replace the forward function to capture the input args and kwargs of the first block for the AWQ algorithm.

recover_forward(model)

Recover the model and block forward functions for the AWQ algorithm.

block_wise_calibration(model[, dataloader, data, ...])

Calibrate the model on HPU block by block to reduce device memory usage.

Module Contents

neural_compressor.torch.utils.block_wise.get_block_prefix(model)[source]

Get prefix and number of attention blocks of transformer models.

Parameters:

model (torch.nn.Module) – input model

Returns:

block_prefix (str) – name of the block list in the model; block_num (int) – number of blocks in the block list.

Return type:

tuple of (block_prefix, block_num)
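
A minimal usage sketch, assuming a Hugging Face transformers causal LM and that the two return values are unpacked as a tuple; the checkpoint name is purely illustrative.

    from transformers import AutoModelForCausalLM
    from neural_compressor.torch.utils.block_wise import get_block_prefix

    # "facebook/opt-125m" is only an illustrative checkpoint.
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

    # Locate the attention-block list and how many blocks it contains
    # (the exact prefix string depends on the model architecture).
    block_prefix, block_num = get_block_prefix(model)
    print(block_prefix, block_num)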

neural_compressor.torch.utils.block_wise.replace_forward(model)[source]

Replace the forward function to capture the input args and kwargs of the first block for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Raises:

ValueError – raised intentionally to stop inference of the remaining parts of the model once the first block's inputs have been captured.

Returns:

model with replaced forward.

Return type:

torch.nn.Module
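
A minimal sketch of the capture pattern, assuming a Hugging Face transformers checkpoint and a single tokenized calibration sample; where the captured first-block args/kwargs are stored is an internal detail of the utility and is not shown here.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.utils.block_wise import replace_forward

    name = "facebook/opt-125m"  # illustrative checkpoint
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    sample = tokenizer("a short calibration sentence", return_tensors="pt")

    model = replace_forward(model)      # patch the forward pass
    try:
        with torch.no_grad():
            model(**sample)             # stops once the first block's inputs are captured
    except ValueError:
        pass                            # raised on purpose; the remaining blocks never run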

neural_compressor.torch.utils.block_wise.recover_forward(model)[source]

Recover the model and block forward functions for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Returns:

model with recovered forward.

Return type:

torch.nn.Module
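
Continuing the sketch above, recover_forward restores the original forward so the model can run end-to-end again after the capture step.

    from neural_compressor.torch.utils.block_wise import recover_forward

    model = recover_forward(model)      # restore the original forward
    with torch.no_grad():
        out = model(**sample)           # full inference works again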

neural_compressor.torch.utils.block_wise.block_wise_calibration(model, dataloader=None, data=None, inference_dtype=torch.bfloat16)[source]

Calibrate the model on HPU block by block to reduce device memory usage.

Parameters:
  • model (torch.nn.Module) – prepared model.

  • dataloader (obj) – calibration dataloader.

  • data (obj) – a single calibration data sample.

  • inference_dtype (torch.dtype) – dtype used for model inference during calibration; defaults to torch.bfloat16.
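
A minimal sketch, assuming prepared_model comes out of a preceding Neural Compressor prepare step and calib_dataloader yields tokenized batches; both names are placeholders, not part of this module.

    import torch
    from neural_compressor.torch.utils.block_wise import block_wise_calibration

    # Run calibration block by block on HPU; either a dataloader or a single
    # sample (via data=) may be supplied.
    block_wise_calibration(
        prepared_model,                   # placeholder: a model prepared for quantization
        dataloader=calib_dataloader,      # placeholder: calibration dataloader
        inference_dtype=torch.bfloat16,   # dtype used for inference during calibration
    )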