:py:mod:`neural_compressor.compression.pruner.pruners.mha`
===========================================================

.. py:module:: neural_compressor.compression.pruner.pruners.mha

.. autoapi-nested-parse::

   MHA pruner.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   neural_compressor.compression.pruner.pruners.mha.PythonMultiheadAttentionPruner


.. py:class:: PythonMultiheadAttentionPruner(config, mha_modules)

   Multi-head attention pruner.

   This pruner applies pruning to multi-head attention modules. Multi-head
   attention pruning removes parts of the QKV layers and the corresponding
   parts of their subsequent feed-forward layers simultaneously.

   :param config: A config dict object that contains the pruner information.
   :param mha_modules: A list that stores the MHA modules to be pruned. Each
       entry is a dict of the following form, where 'mha_name' and
       'mha_module' are kept unchanged::

           {
               'qkv_name': ['query_layer_name', 'key_layer_name', 'value_layer_name'],
               'ffn_name': ['attention_ffn_name'],
               'mha_name': ['mha_name'],
               'qkv_module': [torch.nn.Linear, torch.nn.Linear, torch.nn.Linear],
               'ffn_module': [torch.nn.Linear],
               'mha_module': [torch.nn.Module],
           }

   .. attribute:: mha_compressions

      A dict. {key: MHA module name; value: MHACompression object in .model_slim.weight_slim.}
      The main objects that hook critical attributes for MHA pruning and modify them.

   .. attribute:: linear_layers

      A dict. {key: linear layer name; value: torch.nn.Linear object.}
      An independent look-up table of linear layers, used by the criterion
      object. Its length should be 4x that of mha_compressions, because each
      MHACompression hooks four linear layers: query, key, value, and the
      subsequent feed-forward layer.

   .. attribute:: head_masks

      A dict. {key: MHA module name; value: torch.Tensor(1, mha_head_size).}
      Similar to the built-in head_mask attribute in Hugging Face Transformers.

   .. attribute:: mha_scores

      A dict. {key: MHA module name; value: torch.Tensor(1, mha_head_size).}
      Stores scores for the different heads.
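
   The following is a minimal, self-contained sketch of the slimming step that
   the description above implies: given a head mask of shape
   ``(1, num_heads)``, drop the pruned heads' output rows from the Q/K/V
   projections and the matching input columns of the subsequent feed-forward
   layer. All layer sizes, names, and mask values are illustrative
   assumptions; in this module the actual slimming is delegated to the
   MHACompression objects, not performed by code like this.

   .. code-block:: python

      import torch

      num_heads, head_size = 12, 64
      hidden = num_heads * head_size  # 768

      # Stand-in layers for one attention block (illustrative sizes only).
      query = torch.nn.Linear(hidden, hidden)
      key = torch.nn.Linear(hidden, hidden)
      value = torch.nn.Linear(hidden, hidden)
      ffn = torch.nn.Linear(hidden, hidden)  # attention output projection

      # Head mask in the same (1, num_heads) layout as ``head_masks`` above;
      # heads 0 and 5 are marked for removal.
      head_mask = torch.ones(1, num_heads)
      head_mask[0, 0] = 0
      head_mask[0, 5] = 0

      # Expand the per-head mask to per-channel indices of the hidden dim.
      keep = head_mask[0].bool().repeat_interleave(head_size)

      def slim_qkv(linear: torch.nn.Linear) -> torch.nn.Linear:
          """Drop pruned heads' output rows from a Q/K/V projection."""
          new = torch.nn.Linear(linear.in_features, int(keep.sum()))
          new.weight.data = linear.weight.data[keep, :]
          new.bias.data = linear.bias.data[keep]
          return new

      def slim_ffn(linear: torch.nn.Linear) -> torch.nn.Linear:
          """Drop the matching input columns from the feed-forward layer."""
          new = torch.nn.Linear(int(keep.sum()), linear.out_features)
          new.weight.data = linear.weight.data[:, keep]
          new.bias.data = linear.bias.data.clone()
          return new

      query, key, value = slim_qkv(query), slim_qkv(key), slim_qkv(value)
      ffn = slim_ffn(ffn)
      print(query.weight.shape)  # torch.Size([640, 768])
      print(ffn.weight.shape)    # torch.Size([768, 640])

   Because the four layers are slimmed with the same ``keep`` index, the
   attention block's hidden dimension stays consistent end to end, which is
   why the QKV layers and their feed-forward layer must be pruned
   simultaneously.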