intel_npu_acceleration_library package#

Subpackages#

Submodules#

intel_npu_acceleration_library.bindings module#

intel_npu_acceleration_library.compiler module#

class intel_npu_acceleration_library.compiler.CompilerConfig(use_to: bool = False, dtype: dtype | NPUDtype = torch.float16, training: bool = False)#

Bases: object

Configuration class to store the compilation configuration of a model for the NPU.

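For example, a configuration for plain float16 inference can be built straight from the signature above (a minimal sketch that only sets values the constructor already defaults to):

    import torch

    from intel_npu_acceleration_library.compiler import CompilerConfig

    # float16 weights, inference only (both are the documented defaults)
    config = CompilerConfig(dtype=torch.float16, training=False)
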
intel_npu_acceleration_library.compiler.apply_general_optimizations(model: Module)#

Apply general optimizations to a torch.nn.Module.

Parameters:

model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

intel_npu_acceleration_library.compiler.apply_horizontal_fusion(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.compile(model: Module, config: CompilerConfig) Module#

Compile a model for the NPU.

Parameters:
  • model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

  • config (CompilerConfig) – the compiler configuration

Raises:

RuntimeError – invalid datatypes

Returns:

compiled NPU nn.Module

Return type:

torch.nn.Module

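As a sketch of the intended flow (assuming an NPU device and the library runtime are available), a small model can be compiled and run like this:

    import torch

    from intel_npu_acceleration_library.compiler import CompilerConfig, compile

    # Toy model: two linear layers with an activation in between.
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 256),
    )

    config = CompilerConfig(dtype=torch.float16)
    npu_model = compile(model, config)

    with torch.no_grad():
        out = npu_model(torch.rand(1, 256))
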
intel_npu_acceleration_library.compiler.create_npu_kernels(model: Module)#

Create NPU kernels.

Parameters:

model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

intel_npu_acceleration_library.compiler.forward(self, input)#

Override the forward method of the WeightOnlyLinear class.

Parameters:

input – The input tensor.

Returns:

The output tensor.

Return type:

torch.Tensor

intel_npu_acceleration_library.compiler.lower_linear(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.module_optimization(func: Callable) Module#

Recursively optimize a torch.nn.Module with a specific function.

The function func is called recursively on every module in the network.

Parameters:

func (Callable) – optimization function

Returns:

optimized module

Return type:

torch.nn.Module

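This decorator is what turns a per-layer rewrite into the recursive passes listed above (apply_horizontal_fusion, lower_linear, and so on). A hypothetical pass might look as follows; the callback signature shown (name plus layer) is an assumption for illustration, not part of the documented API:

    import torch

    from intel_npu_acceleration_library.compiler import module_optimization

    @module_optimization
    def replace_gelu(name: str, layer: torch.nn.Module):
        """Hypothetical pass: swap GELU activations for ReLU.

        Returning a new layer replaces the submodule; returning
        None leaves it unchanged.
        """
        if isinstance(layer, torch.nn.GELU):
            return torch.nn.ReLU()
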
intel_npu_acceleration_library.compiler.npu(gm: Module | GraphModule, example_inputs: List[Tensor]) Module | GraphModule#

Implement the custom torch 2.0 compile backend for the NPU.

Parameters:
  • gm (Union[torch.nn.Module, torch.fx.GraphModule]) – The torch fx Module

  • example_inputs (List[torch.Tensor]) – A list of example inputs

Returns:

The compiled model

Return type:

Union[torch.nn.Module, torch.fx.GraphModule]

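Since this function is a torch.compile backend, it is typically reached through the standard PyTorch 2.0 entry point rather than called directly. A sketch, assuming the library registers the backend under the name "npu" on import:

    import torch

    import intel_npu_acceleration_library  # assumed to register the "npu" backend

    model = torch.nn.Linear(128, 128)
    compiled = torch.compile(model, backend="npu")

    with torch.no_grad():
        out = compiled(torch.rand(1, 128))
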
intel_npu_acceleration_library.compiler.optimize_llama_attention(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.weights_quantization(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.optimizations module#

intel_npu_acceleration_library.optimizations.delattr_recursively(module: Module, target: str)#

Recursively delete an attribute by name in a torch.nn.Module.

Parameters:
  • module (nn.Module) – the nn.Module

  • target (str) – the attribute you want to delete

intel_npu_acceleration_library.optimizations.fuse_linear_layers(model: Module, modules: Dict[str, Linear], targets: List[str], fused_layer_name: str) None#

Fuse the target linear layers into a single layer and append it to the nn.Module.

Parameters:
  • model (nn.Module) – Original nn.Module object

  • modules (Dict[str, nn.Linear]) – a dictionary mapping node names to linear layers

  • targets (List[str]) – list of layer node names

  • fused_layer_name (str) – fused layer name

Raises:

ValueError – All linear layers must be of type nn.Linear and must have the same input dimension

intel_npu_acceleration_library.optimizations.horizontal_fusion_linear(model: Module) Module#

Horizontally fuse two or more linear layers that share the same input. This increases NPU hardware utilization.

Parameters:

model (torch.nn.Module) – The original nn.Module

Returns:

optimized nn.Module where parallel linear operations have been fused into a single larger one

Return type:

torch.nn.Module

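The typical target is a set of projections that consume the same tensor, such as the Q/K/V projections of an attention block. A minimal sketch of the pattern this pass looks for (whether a given pair is actually fused depends on the graph analysis):

    import torch

    from intel_npu_acceleration_library.optimizations import horizontal_fusion_linear

    class Projections(torch.nn.Module):
        """Two linear layers fed by the same input tensor."""

        def __init__(self):
            super().__init__()
            self.q = torch.nn.Linear(64, 64)
            self.k = torch.nn.Linear(64, 64)

        def forward(self, x):
            return self.q(x), self.k(x)

    fused = horizontal_fusion_linear(Projections())
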
intel_npu_acceleration_library.quantization module#

intel_npu_acceleration_library.quantization.compress_to_i4(weights: Tensor) Tensor#

Compress a given tensor to a 4-bit representation.

Parameters:

weights (torch.Tensor) – The input tensor to be compressed.

Returns:

The compressed tensor with 4-bit representation.

Return type:

torch.Tensor

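A sketch of the expected usage; the assumption here is that the input already holds values in the signed 4-bit range (for instance the output of quantize_tensor with min_max_range=(-8, 7)), so that pairs of values can be packed into single bytes:

    import torch

    from intel_npu_acceleration_library.quantization import compress_to_i4

    # int8 storage, but every value already fits in a signed 4-bit range
    w_q = torch.randint(-8, 8, (128, 128), dtype=torch.int8)
    packed = compress_to_i4(w_q)
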
intel_npu_acceleration_library.quantization.quantize_fit(model: Module, weights_dtype: str, algorithm: str = 'RTN') Module#

Quantize a model with a given configuration.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • weights_dtype (str) – The datatype for the weights

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Raises:

RuntimeError – Quantization error: unsupported datatype

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_i4_model(model: Module, algorithm: str = 'RTN') Module#

Quantize a model to 4-bit representation.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_i8_model(model: Module, algorithm: str = 'RTN') Module#

Quantize a model to 8-bit representation.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Returns:

The quantized model

Return type:

torch.nn.Module

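A minimal sketch (this path relies on the library's quantization backend being installed alongside it):

    import torch

    from intel_npu_acceleration_library.quantization import quantize_i8_model

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    quantized = quantize_i8_model(model, algorithm="RTN")
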
intel_npu_acceleration_library.quantization.quantize_model(model: Module, dtype: NPUDtype) Module#

Quantize a model.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • dtype (NPUDtype) – The desired datatype

Raises:

RuntimeError – Quantization error: unsupported datatype

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_tensor(weight: Tensor, min_max_range: Tuple[int, int] = (-128, 127)) Tuple[Tensor, Tensor]#

Quantize an fp16 tensor symmetrically.

Produces a quantized tensor (same shape, dtype == torch.int8) and a scale tensor (dtype == torch.float16). The quantization follows W = S * W_q.

Parameters:
  • weight (torch.Tensor) – The tensor to quantize

  • min_max_range (Tuple[int, int]) – The min and max range for the quantized tensor. Defaults to (-128, 127).

Raises:

RuntimeError – Error in the quantization step

Returns:

Quantized tensor and scale

Return type:

Tuple[torch.Tensor, torch.Tensor]

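Because the scheme is symmetric, the original weights can be approximately reconstructed from the two returned tensors. A sketch (the exact shape of the scale tensor, per-tensor vs. per-channel, is an implementation detail):

    import torch

    from intel_npu_acceleration_library.quantization import quantize_tensor

    weight = torch.rand(128, 256, dtype=torch.float16)
    w_q, scale = quantize_tensor(weight)  # int8 values plus fp16 scale

    # Dequantize per the documented equation W = S * W_q, broadcasting a
    # per-output-channel scale over the last dimension if needed.
    if scale.dim() == 1:
        scale = scale.unsqueeze(-1)
    w_rec = scale * w_q.to(torch.float16)
    max_err = (weight - w_rec).abs().max()
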
Module contents#

class intel_npu_acceleration_library.NPUAutoModel#

Bases: object

NPU wrapper for AutoModel.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModel, export=True, **kwargs: Any) Module#

class intel_npu_acceleration_library.NPUModel#

Bases: object

Base NPU model class.

static from_pretrained(model_name_or_path: str, config: CompilerConfig, transformers_class: Type | None = None, export=True, *args: Any, **kwargs: Any) Module#

Template for the from_pretrained static method.

Parameters:
  • model_name_or_path (str) – model name or path

  • config (CompilerConfig) – compiler configuration

  • transformers_class (Optional[Type], optional) – base class to use. Must have a from_pretrained method. Defaults to None.

  • export (bool, optional) – enable the caching of the model. Defaults to True.

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

Raises:
  • RuntimeError – Invalid class

  • AttributeError – Cannot export model with trust_remote_code=True

Returns:

compiled model

Return type:

torch.nn.Module

class intel_npu_acceleration_library.NPUModelForCausalLM#

Bases: object

NPU wrapper for AutoModelForCausalLM.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModelForCausalLM, export=True, **kwargs: Any) Module#
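
A sketch of the intended usage; the model id below is purely illustrative:

    import torch

    from intel_npu_acceleration_library import NPUModelForCausalLM
    from intel_npu_acceleration_library.compiler import CompilerConfig

    config = CompilerConfig(dtype=torch.float16)
    model = NPUModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
        config=config,
    )
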
class intel_npu_acceleration_library.NPUModelForSeq2SeqLM#

Bases: object

NPU wrapper for AutoModelForSeq2SeqLM.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModelForSeq2SeqLM, export=True, **kwargs: Any) Module#

intel_npu_acceleration_library.compile(model: Module, config: CompilerConfig) Module#

Compile a model for the NPU.

Parameters:
  • model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

  • config (CompilerConfig) – the compiler configuration

Raises:

RuntimeError – invalid datatypes

Returns:

compiled NPU nn.Module

Return type:

torch.nn.Module