intel_npu_acceleration_library package#

Subpackages#

Submodules#

intel_npu_acceleration_library.bindings module#

intel_npu_acceleration_library.compiler module#

class intel_npu_acceleration_library.compiler.CompilerConfig(use_to: bool = False, dtype: dtype | NPUDtype = torch.float16, training: bool = False)#

Bases: object

Configuration class to store the compilation configuration of a model for the NPU.

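For example, a configuration for plain float16 inference can be built straight from the signature above (a minimal sketch that only sets values the constructor already defaults to):

    import torch

    from intel_npu_acceleration_library.compiler import CompilerConfig

    # float16 weights, inference only (both are the documented defaults)
    config = CompilerConfig(dtype=torch.float16, training=False)
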
intel_npu_acceleration_library.compiler.apply_general_optimizations(model: Module)#

Apply general optimizations to a torch.nn.Module.

Parameters:

model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

intel_npu_acceleration_library.compiler.apply_horizontal_fusion(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.compile(model: Module, config: CompilerConfig) Module#

Compile a model for the NPU.

Parameters:
  • model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

  • config (CompilerConfig) – the compiler configuration

Raises:

RuntimeError – invalid datatypes

Returns:

compiled NPU nn.Module

Return type:

torch.nn.Module

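As a sketch of the intended flow (assuming an NPU device and the library runtime are available), a small model can be compiled and run like this:

    import torch

    from intel_npu_acceleration_library.compiler import CompilerConfig, compile

    # Toy model: two linear layers with an activation in between.
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 256),
    )

    config = CompilerConfig(dtype=torch.float16)
    npu_model = compile(model, config)

    with torch.no_grad():
        out = npu_model(torch.rand(1, 256))
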
intel_npu_acceleration_library.compiler.create_npu_kernels(model: Module)#

Create NPU kernels.

Parameters:

model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

intel_npu_acceleration_library.compiler.forward(self, input)#

Override the forward method of the WeightOnlyLinear class.

Parameters:

input – The input tensor.

Returns:

The output tensor.

Return type:

torch.Tensor

intel_npu_acceleration_library.compiler.lower_linear(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.module_optimization(func: Callable) Module#

Recursively optimize a torch.nn.Module with a specific function.

The function func is called recursively on every module in the network.

Parameters:

func (Callable) – optimization function

Returns:

optimized module

Return type:

torch.nn.Module

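This decorator is what turns a per-layer rewrite into the recursive passes listed above (apply_horizontal_fusion, lower_linear, and so on). A hypothetical pass might look as follows; the callback signature shown (name plus layer) is an assumption for illustration, not part of the documented API:

    import torch

    from intel_npu_acceleration_library.compiler import module_optimization

    @module_optimization
    def replace_gelu(name: str, layer: torch.nn.Module):
        """Hypothetical pass: swap GELU activations for ReLU.

        Returning a new layer replaces the submodule; returning
        None leaves it unchanged.
        """
        if isinstance(layer, torch.nn.GELU):
            return torch.nn.ReLU()
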
intel_npu_acceleration_library.compiler.npu(gm: Module | GraphModule, example_inputs: List[Tensor]) Module | GraphModule#

Implement the custom torch 2.0 compile backend for the NPU.

Parameters:
  • gm (Union[torch.nn.Module, torch.fx.GraphModule]) – The torch fx Module

  • example_inputs (List[torch.Tensor]) – A list of example inputs

Returns:

The compiled model

Return type:

Union[torch.nn.Module, torch.fx.GraphModule]

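Since this function is a torch.compile backend, it is typically reached through the standard PyTorch 2.0 entry point rather than called directly. A sketch, assuming the library registers the backend under the name "npu" on import:

    import torch

    import intel_npu_acceleration_library  # assumed to register the "npu" backend

    model = torch.nn.Linear(128, 128)
    compiled = torch.compile(model, backend="npu")

    with torch.no_grad():
        out = compiled(torch.rand(1, 128))
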
intel_npu_acceleration_library.compiler.optimize_llama_attention(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.compiler.weights_quantization(model: Module, *args: Any, **kwargs: Any)#

Recursively apply the optimization function.

Parameters:
  • model (torch.nn.Module) – original module

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

intel_npu_acceleration_library.optimizations module#

intel_npu_acceleration_library.optimizations.delattr_recursively(module: Module, target: str)#

Recursively delete an attribute by name in a torch.nn.Module.

Parameters:
  • module (nn.Module) – the nn.Module

  • target (str) – the attribute you want to delete

intel_npu_acceleration_library.optimizations.fuse_linear_layers(model: Module, modules: Dict[str, Linear], targets: List[str], fused_layer_name: str) None#

Fuse the target linear layers into a single layer and append it to the nn.Module.

Parameters:
  • model (nn.Module) – Original nn.Module object

  • modules (Dict[str, nn.Linear]) – a dictionary mapping node names to linear layers

  • targets (List[str]) – list of layer node names

  • fused_layer_name (str) – fused layer name

Raises:

ValueError – All linear layers must be of type nn.Linear and must have the same input dimension

intel_npu_acceleration_library.optimizations.horizontal_fusion_linear(model: Module) Module#

Horizontally fuse two or more linear layers that share the same input. This increases NPU hardware utilization.

Parameters:

model (torch.nn.Module) – The original nn.Module

Returns:

optimized nn.Module where parallel linear operations have been fused into a single larger one

Return type:

torch.nn.Module

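The typical target is a set of projections that consume the same tensor, such as the Q/K/V projections of an attention block. A minimal sketch of the pattern this pass looks for (whether a given pair is actually fused depends on the graph analysis):

    import torch

    from intel_npu_acceleration_library.optimizations import horizontal_fusion_linear

    class Projections(torch.nn.Module):
        """Two linear layers fed by the same input tensor."""

        def __init__(self):
            super().__init__()
            self.q = torch.nn.Linear(64, 64)
            self.k = torch.nn.Linear(64, 64)

        def forward(self, x):
            return self.q(x), self.k(x)

    fused = horizontal_fusion_linear(Projections())
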
intel_npu_acceleration_library.quantization module#

intel_npu_acceleration_library.quantization.compress_to_i4(weights: Tensor) Tensor#

Compress a given tensor to a 4-bit representation.

Parameters:

weights (torch.Tensor) – The input tensor to be compressed.

Returns:

The compressed tensor with 4-bit representation.

Return type:

torch.Tensor

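A sketch of the expected usage; the assumption here is that the input already holds values in the signed 4-bit range (for instance the output of quantize_tensor with min_max_range=(-8, 7)), so that pairs of values can be packed into single bytes:

    import torch

    from intel_npu_acceleration_library.quantization import compress_to_i4

    # int8 storage, but every value already fits in a signed 4-bit range
    w_q = torch.randint(-8, 8, (128, 128), dtype=torch.int8)
    packed = compress_to_i4(w_q)
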
intel_npu_acceleration_library.quantization.quantize_fit(model: Module, weights_dtype: str, algorithm: str = 'RTN') Module#

Quantize a model with a given configuration.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • weights_dtype (str) – The datatype for the weights

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Raises:

RuntimeError – Quantization error: unsupported datatype

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_i4_model(model: Module, algorithm: str = 'RTN') Module#

Quantize a model to 4-bit representation.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_i8_model(model: Module, algorithm: str = 'RTN') Module#

Quantize a model to 8-bit representation.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • algorithm (str, optional) – The quantization algorithm. Defaults to “RTN”.

Returns:

The quantized model

Return type:

torch.nn.Module

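A minimal sketch (this path relies on the library's quantization backend being installed alongside it):

    import torch

    from intel_npu_acceleration_library.quantization import quantize_i8_model

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    quantized = quantize_i8_model(model, algorithm="RTN")
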
intel_npu_acceleration_library.quantization.quantize_model(model: Module, dtype: NPUDtype) Module#

Quantize a model.

Parameters:
  • model (torch.nn.Module) – The model to quantize

  • dtype (NPUDtype) – The desired datatype

Raises:

RuntimeError – Quantization error: unsupported datatype

Returns:

The quantized model

Return type:

torch.nn.Module

intel_npu_acceleration_library.quantization.quantize_tensor(weight: Tensor, min_max_range: Tuple[int, int] = (-128, 127)) Tuple[Tensor, Tensor]#

Quantize an fp16 tensor symmetrically.

Produces a quantized tensor (same shape, dtype == torch.int8) and a scale tensor (dtype == torch.float16). The quantization follows W = S * W_q.

Parameters:
  • weight (torch.Tensor) – The tensor to quantize

  • min_max_range (Tuple[int, int]) – The min and max range for the quantized tensor. Defaults to (-128, 127).

Raises:

RuntimeError – Error in the quantization step

Returns:

Quantized tensor and scale

Return type:

Tuple[torch.Tensor, torch.Tensor]

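Because the scheme is symmetric, the original weights can be approximately reconstructed from the two returned tensors. A sketch (the exact shape of the scale tensor, per-tensor vs. per-channel, is an implementation detail):

    import torch

    from intel_npu_acceleration_library.quantization import quantize_tensor

    weight = torch.rand(128, 256, dtype=torch.float16)
    w_q, scale = quantize_tensor(weight)  # int8 values plus fp16 scale

    # Dequantize per the documented equation W = S * W_q, broadcasting a
    # per-output-channel scale over the last dimension if needed.
    if scale.dim() == 1:
        scale = scale.unsqueeze(-1)
    w_rec = scale * w_q.to(torch.float16)
    max_err = (weight - w_rec).abs().max()
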
Module contents#

class intel_npu_acceleration_library.NPUAutoModel#

Bases: object

NPU wrapper for AutoModel.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModel, export=True, **kwargs: Any) Module#

class intel_npu_acceleration_library.NPUModel#

Bases: object

Base NPU model class.

static from_pretrained(model_name_or_path: str, config: CompilerConfig, transformers_class: Type | None = None, export=True, *args: Any, **kwargs: Any) Module#

Template for the from_pretrained static method.

Parameters:
  • model_name_or_path (str) – model name or path

  • config (CompilerConfig) – compiler configuration

  • transformers_class (Optional[Type], optional) – base class to use. Must have a from_pretrained method. Defaults to None.

  • export (bool, optional) – enable the caching of the model. Defaults to True.

  • args (Any) – positional arguments

  • kwargs (Any) – keyword arguments

Raises:
  • RuntimeError – Invalid class

  • AttributeError – Cannot export model with trust_remote_code=True

Returns:

compiled model

Return type:

torch.nn.Module

class intel_npu_acceleration_library.NPUModelForCausalLM#

Bases: object

NPU wrapper for AutoModelForCausalLM.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModelForCausalLM, export=True, **kwargs: Any) Module#
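
A sketch of the intended usage; the model id below is purely illustrative:

    import torch

    from intel_npu_acceleration_library import NPUModelForCausalLM
    from intel_npu_acceleration_library.compiler import CompilerConfig

    config = CompilerConfig(dtype=torch.float16)
    model = NPUModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
        config=config,
    )
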
class intel_npu_acceleration_library.NPUModelForSeq2SeqLM#

Bases: object

NPU wrapper for AutoModelForSeq2SeqLM.

Attrs:

from_pretrained: Load a pretrained model

from_pretrained(config: CompilerConfig, *, transformers_class: Type | None = AutoModelForSeq2SeqLM, export=True, **kwargs: Any) Module#

intel_npu_acceleration_library.compile(model: Module, config: CompilerConfig) Module#

Compile a model for the NPU.

Parameters:
  • model (torch.nn.Module) – a PyTorch nn.Module to compile and optimize for the NPU

  • config (CompilerConfig) – the compiler configuration

Raises:

RuntimeError – invalid datatypes

Returns:

compiled NPU nn.Module

Return type:

torch.nn.Module