neural_compressor.torch.algorithms.weight_only.gptq

Module Contents

Classes

RAWGPTQuantizer

Main API for the GPTQ algorithm.

GPTQ

Please refer to GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (https://arxiv.org/abs/2210.17323).

GPTQuantizer

The base quantizer for all algorithm quantizers.

Functions

is_leaf(module)

Check whether a module has no child modules.

trace_gptq_target_blocks(module[, module_types])

Search for stacked transformer structures, which is critical for LLMs and GPTQ execution.

find_layers(module[, layers, name])

Get all layers with target types.

find_layers_name(module[, layers, name])

Get all layers with target types.

log_quantizable_layers_per_transformer(transformer_blocks)

Print all layers that will be quantized by the GPTQ algorithm.

quantize(x, scale, zero, maxq)

Quantize a tensor given its scale, zero point, and maximum integer value (maxq).

neural_compressor.torch.algorithms.weight_only.gptq.is_leaf(module)[source]

Check whether a module has no child modules.

Parameters:

module – torch.nn.Module

Returns:

True if the module has no child modules, False otherwise.

Return type:

bool
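A minimal sketch of the check this helper performs (the exact implementation may differ):

    import torch

    def is_leaf_sketch(module: torch.nn.Module) -> bool:
        """Return True if the module has no child modules."""
        return len(list(module.children())) == 0

    # nn.Linear has no children; an nn.Sequential wrapping it does.
    assert is_leaf_sketch(torch.nn.Linear(4, 4))
    assert not is_leaf_sketch(torch.nn.Sequential(torch.nn.Linear(4, 4)))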

neural_compressor.torch.algorithms.weight_only.gptq.trace_gptq_target_blocks(module, module_types=[torch.nn.ModuleList, torch.nn.Sequential])[source]

Search for stacked transformer structures, which is critical for LLMs and GPTQ execution.

Parameters:
  • module – torch.nn.Module

  • module_types – List of torch.nn.Module.

Returns:

gptq_related_blocks = {
    "embeddings": {},          # dict of embedding layers that precede the transformer stack
    "transformers_pre": {},    # TODO
    "transformers_name": "",   # str: name of the LLM's transformer stack module
    "transformers": [],        # torch.nn.ModuleList: the LLM's transformer stack module
}
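For illustration, a hedged sketch of calling this function on a toy model; the toy class and the expected outputs are assumptions, not part of the API:

    from torch import nn
    from neural_compressor.torch.algorithms.weight_only.gptq import trace_gptq_target_blocks

    class ToyLM(nn.Module):
        """Toy decoder-style model: an embedding followed by an nn.ModuleList of blocks."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(100, 16)
            self.layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])

    blocks_info = trace_gptq_target_blocks(ToyLM())
    print(blocks_info["transformers_name"])  # likely "layers"
    print(len(blocks_info["transformers"]))  # likely 2 (one entry per stacked block)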

neural_compressor.torch.algorithms.weight_only.gptq.find_layers(module, layers=[nn.Conv2d, nn.Conv1d, nn.Linear, transformers.Conv1D], name='')[source]

Get all layers with target types.

neural_compressor.torch.algorithms.weight_only.gptq.find_layers_name(module, layers=[nn.Conv2d, nn.Conv1d, nn.Linear, transformers.Conv1D], name='')[source]

Get all layers with target types.
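A hedged usage sketch for both helpers; the toy block and the exact return formats shown in the comments are illustrative assumptions:

    from torch import nn
    from neural_compressor.torch.algorithms.weight_only.gptq import find_layers, find_layers_name

    block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

    # find_layers is expected to map layer names to the matching sub-modules,
    # while find_layers_name is expected to return only the names.
    layer_map = find_layers(block)          # e.g. {"0": Linear(...), "2": Linear(...)}
    layer_names = find_layers_name(block)   # e.g. ["0", "2"]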

neural_compressor.torch.algorithms.weight_only.gptq.log_quantizable_layers_per_transformer(transformer_blocks, layers=[nn.Conv2d, nn.Conv1d, nn.Linear, transformers.Conv1D])[source]

Print all layers that will be quantized by the GPTQ algorithm.
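A hedged sketch of chaining this logging helper with trace_gptq_target_blocks; passing the full block dictionary (rather than only the ModuleList) is an assumption here, as is the pre-loaded `model`:

    from neural_compressor.torch.algorithms.weight_only.gptq import (
        trace_gptq_target_blocks,
        log_quantizable_layers_per_transformer,
    )

    # `model` is assumed to be an already-loaded transformer model (not shown).
    blocks_info = trace_gptq_target_blocks(model)
    log_quantizable_layers_per_transformer(blocks_info)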

neural_compressor.torch.algorithms.weight_only.gptq.quantize(x, scale, zero, maxq)[source]

Quantize a tensor given its scale, zero point, and maximum integer value (maxq).
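For illustration, the sketch below shows the standard uniform (fake) quantization step used by GPTQ-style implementations; it illustrates the formula and is not necessarily the verbatim body of this function:

    import torch

    def quantize_sketch(x, scale, zero, maxq):
        # Map x onto the integer grid [0, maxq], then back to floats:
        #   q     = clamp(round(x / scale) + zero, 0, maxq)
        #   x_hat = scale * (q - zero)
        q = torch.clamp(torch.round(x / scale) + zero, 0, maxq)
        return scale * (q - zero)

    # Example: 4-bit quantization (maxq = 15) of a small tensor.
    x = torch.tensor([0.03, -0.5, 1.2])
    scale, zero, maxq = torch.tensor(0.1), torch.tensor(8.0), 15
    print(quantize_sketch(x, scale, zero, maxq))  # tensor([ 0.0000, -0.5000,  0.7000])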

class neural_compressor.torch.algorithms.weight_only.gptq.RAWGPTQuantizer(model, weight_config={}, nsamples=128, use_max_length=True, max_seq_length=2048, device=None, export_compressed_model=False, use_layer_wise=False, model_path='', dataloader=None, *args, **kwargs)[source]

Main API for the GPTQ algorithm.

Please refer to GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (https://arxiv.org/abs/2210.17323).
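A hedged construction sketch based on the signature above; `model`, `calib_dataloader`, and the contents of weight_config are assumed to be prepared elsewhere and are not defined by this snippet:

    from neural_compressor.torch.algorithms.weight_only.gptq import RAWGPTQuantizer

    quantizer = RAWGPTQuantizer(
        model,                        # float model to be quantized (assumed to exist)
        weight_config={},             # per-layer GPTQ options; contents omitted here
        nsamples=128,                 # number of calibration samples
        use_max_length=True,
        max_seq_length=2048,
        dataloader=calib_dataloader,  # calibration dataloader (assumed to exist)
    )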

class neural_compressor.torch.algorithms.weight_only.gptq.GPTQ(layer, W, device='cpu')[source]

Please refer to GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (https://arxiv.org/abs/2210.17323).

class neural_compressor.torch.algorithms.weight_only.gptq.GPTQuantizer(quant_config={})[source]

The base quantizer for all algorithm quantizers.

The Quantizer unifies the interfaces across various quantization algorithms, including GPTQ, RTN, etc. Given a float model, the Quantizer applies the quantization algorithm to the model according to the quant_config.

To implement a new quantization algorithm, inherit from Quantizer and implement the following methods:
  • prepare: prepare a given model for conversion.

  • convert: convert a prepared model to a quantized model.

Note: quantize and execute are optional for new quantization algorithms.
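A hedged sketch of the prepare/convert flow described above; `quant_config`, `float_model`, and the calibration step are assumptions for illustration:

    from neural_compressor.torch.algorithms.weight_only.gptq import GPTQuantizer

    quantizer = GPTQuantizer(quant_config=quant_config)

    prepared_model = quantizer.prepare(float_model)       # set the model up for conversion
    # ... run calibration data through prepared_model here (assumed step) ...
    quantized_model = quantizer.convert(prepared_model)   # produce the quantized model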