neural_compressor.torch.utils.block_wise

This utility is for block-wise calibration of LLMs.

Functions

get_block_prefix(model)

Get prefix and number of attention blocks of transformer models.

replace_forward(model)

Replace the forward function to capture the input args and kwargs of the first block for the AWQ algorithm.

recover_forward(model)

Recover the model and block forward functions for the AWQ algorithm.

block_wise_calibration(model[, dataloader, data, ...])

Calibrate the model on HPU block by block to reduce device memory usage.

Module Contents

neural_compressor.torch.utils.block_wise.get_block_prefix(model)[source]

Get prefix and number of attention blocks of transformer models.

Parameters:

model (torch.nn.Module) – input model

Returns:

block_prefix (str) – name of the block list in the model; block_num (int) – number of blocks in the block list.

Return type:

tuple of (block_prefix, block_num)
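
A minimal usage sketch, assuming a Hugging Face transformers causal LM and that the two return values are unpacked as a tuple; the checkpoint name is purely illustrative.

    from transformers import AutoModelForCausalLM
    from neural_compressor.torch.utils.block_wise import get_block_prefix

    # "facebook/opt-125m" is only an illustrative checkpoint.
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

    # Locate the attention-block list and how many blocks it contains
    # (the exact prefix string depends on the model architecture).
    block_prefix, block_num = get_block_prefix(model)
    print(block_prefix, block_num)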

neural_compressor.torch.utils.block_wise.replace_forward(model)[source]

Replace the forward function to capture the input args and kwargs of the first block for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Raises:

ValueError – raised intentionally to stop inference of the remaining parts of the model once the first block's inputs have been captured.

Returns:

model with replaced forward.

Return type:

torch.nn.Module
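
A minimal sketch of the capture pattern, assuming a Hugging Face transformers checkpoint and a single tokenized calibration sample; where the captured first-block args/kwargs are stored is an internal detail of the utility and is not shown here.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.utils.block_wise import replace_forward

    name = "facebook/opt-125m"  # illustrative checkpoint
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    sample = tokenizer("a short calibration sentence", return_tensors="pt")

    model = replace_forward(model)      # patch the forward pass
    try:
        with torch.no_grad():
            model(**sample)             # stops once the first block's inputs are captured
    except ValueError:
        pass                            # raised on purpose; the remaining blocks never run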

neural_compressor.torch.utils.block_wise.recover_forward(model)[source]

Recover the model and block forward functions for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Returns:

model with recovered forward.

Return type:

torch.nn.Module
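
Continuing the sketch above, recover_forward restores the original forward so the model can run end-to-end again after the capture step.

    from neural_compressor.torch.utils.block_wise import recover_forward

    model = recover_forward(model)      # restore the original forward
    with torch.no_grad():
        out = model(**sample)           # full inference works again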

neural_compressor.torch.utils.block_wise.block_wise_calibration(model, dataloader=None, data=None, inference_dtype=torch.bfloat16)[source]

Calibrate the model on HPU block by block to reduce device memory usage.

Parameters:
  • model (torch.nn.Module) – prepared model.

  • dataloader (obj) – calibration dataloader.

  • data (obj) – a single calibration data sample.

  • inference_dtype (torch.dtype) – dtype used for model inference during calibration; defaults to torch.bfloat16.
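
A minimal sketch, assuming prepared_model comes out of a preceding Neural Compressor prepare step and calib_dataloader yields tokenized batches; both names are placeholders, not part of this module.

    import torch
    from neural_compressor.torch.utils.block_wise import block_wise_calibration

    # Run calibration block by block on HPU; either a dataloader or a single
    # sample (via data=) may be supplied.
    block_wise_calibration(
        prepared_model,                   # placeholder: a model prepared for quantization
        dataloader=calib_dataloader,      # placeholder: calibration dataloader
        inference_dtype=torch.bfloat16,   # dtype used for inference during calibration
    )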