intel_npu_acceleration_library.nn package#

Submodules#

intel_npu_acceleration_library.nn.autograd module#

class intel_npu_acceleration_library.nn.autograd.AutogradMatMul(*args, **kwargs)#

Bases: Function

Autograd function for the Linear operation.

static backward(ctx, grad_output: Tensor) Iterable[Tensor | None]#

Run a linear backward pass.

Shapes:
  • grad_output: [batch, output_channels]

  • x: [batch, input_channels]

  • w: [output_channels, input_channels]

Expected gradients:
  • dl_dx: [batch, input_channels]

  • dl_dw: [output_channels, input_channels]

Equivalent PyTorch code:
  dl_dx = grad_output @ w.to(torch.float32)
  dl_dw = (x.T @ grad_output).T

Parameters:
  • ctx (Any) – the autograd context

  • grad_output (torch.Tensor) – output gradient

Returns:

Input and parameters gradients

Return type:

Iterable[Union[torch.Tensor, None]]

static forward(ctx, x: Tensor, w: Tensor, scale: Tensor | None = None) Tensor#

Run a linear forward pass. Depending on the weight datatype, it runs either a floating-point or a quantized operation.

Equivalent PyTorch code: result = x @ w.T

Parameters:
  • ctx (Any) – the autograd context

  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • w (torch.Tensor) – Weight tensor. Its dtype must be torch.float16

  • scale (Optional[torch.Tensor], optional) – Quantization scale. It must be set when w.dtype == torch.int8. Defaults to None.

Returns:

result

Return type:

torch.Tensor
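
The snippet below is a minimal sketch of how a custom torch.autograd.Function like this one is typically invoked through .apply() (running the NPU path requires the library and supported hardware; shapes and dtypes follow the docstrings above):

    import torch
    from intel_npu_acceleration_library.nn.autograd import AutogradMatMul

    # x: [batch, input_channels], w: [output_channels, input_channels], both float16
    x = torch.randn(8, 128, dtype=torch.float16, requires_grad=True)
    w = torch.randn(256, 128, dtype=torch.float16, requires_grad=True)

    out = AutogradMatMul.apply(x, w)   # equivalent to x @ w.T
    out.sum().backward()               # backward() populates x.grad and w.grad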

intel_npu_acceleration_library.nn.linear module#

class intel_npu_acceleration_library.nn.linear.Linear(weight: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • weight (torch.Tensor) – the original weight tensor

  • bias (Optional[torch.Tensor]) – the original bias tensor

  • dtype (torch.dtype) – the desired datatype

Raises:

RuntimeError – dtype not supported

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]
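
As an illustration, a torch.nn.Linear layer can be converted with fromTorch; whether a Linear or a QuantizedLinear comes back depends on the requested dtype. A sketch, assuming torch.float16 is a supported dtype:

    import torch
    from intel_npu_acceleration_library.nn.linear import Linear

    torch_layer = torch.nn.Linear(128, 256)                         # plain torch layer
    npu_layer = Linear.fromTorch(torch_layer, dtype=torch.float16)  # NPU-backed Linear
    y = npu_layer(torch.randn(4, 128, dtype=torch.float16))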

class intel_npu_acceleration_library.nn.linear.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Quantized Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Raises:

RuntimeError – Training is not supported for the QuantizedLinear layer. Call .eval() to run inference only.

Returns:

result

Return type:

torch.Tensor
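
A quantized layer is inference-only; a usage sketch (assuming torch.int8 is accepted by Linear.fromTorch and the quantization scale is handled internally):

    import torch
    from intel_npu_acceleration_library.nn.linear import Linear

    npu_layer = Linear.fromTorch(torch.nn.Linear(128, 256), dtype=torch.int8)
    npu_layer.eval()                    # QuantizedLinear raises at training time
    with torch.no_grad():
        y = npu_layer(torch.randn(4, 128, dtype=torch.float16))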

intel_npu_acceleration_library.nn.llm module#

class intel_npu_acceleration_library.nn.llm.FusedLlamaMLP(parameters: List[Tensor])#

Bases: Module

LLAMA MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) FusedLlamaMLP#

Generate a NPU LlamaMLP layer from a transformer LlamaMLP one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaMLP layer to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaMLP layer

Return type:

FusedLlamaMLP
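
One possible way to swap a decoder layer's MLP for the fused NPU version (a sketch only; the checkpoint name is a placeholder and the model.model.layers[i].mlp layout assumes a transformers Llama model):

    import torch
    from transformers import AutoModelForCausalLM
    from intel_npu_acceleration_library.nn.llm import FusedLlamaMLP

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
    )
    layer = model.model.layers[0]
    layer.mlp = FusedLlamaMLP.fromTorch(layer.mlp, dtype=torch.float16)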

class intel_npu_acceleration_library.nn.llm.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#

Bases: Module

LlamaAttention operation NPU backend.

forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None, position_embeddings: Tuple[Tensor, Tensor] | None = None)#

Torch module forward method.

Parameters:
  • hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.

  • position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.

  • past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.

  • output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.

  • use_cache (Optional[bool], optional) – If set to True, past_key_values key value states are returned. Defaults to False.

  • cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.

  • position_embeddings (Optional[Tuple[torch.Tensor, torch.Tensor]], optional) – If set to a tuple, it means the sin and cos are uniformly calculated by the outer LlamaModel and passed in. Defaults to None.

Returns:

result

Return type:

_type_

static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention#

Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaAttention layer to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaAttention layer

Return type:

LlamaAttention

class intel_npu_acceleration_library.nn.llm.PhiMLP(parameters: List[Tensor])#

Bases: Module

Phi-2 MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) PhiMLP#

Generate a NPU PhiMLP layer from a transformer one.

Parameters:
  • layer (torch.nn.Module) – the original PhiMLP layer to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU PhiMLP layer

Return type:

PhiMLP

intel_npu_acceleration_library.nn.llm.generate_with_static_shape(model: Module, input_ids: Tensor, max_length: int, attention_mask: Tensor | None = None, use_past: bool | None = True, pad_token_id: int | None = None, **kwargs) Generator[int, None, None]#

Run the LLM generation routine with static shapes.

Parameters:
  • model (torch.nn.Module) – LLM model

  • input_ids (torch.Tensor) – model input_ids

  • max_length (int) – maximum generation length.

  • attention_mask (Optional[torch.Tensor], optional) – input attention mask. Defaults to None.

  • use_past (Optional[bool], optional) – Enable/disable KV caching. Defaults to True.

  • pad_token_id (Optional[int], optional) – Padding token. Defaults to None.

  • kwargs – Additional arguments

Raises:

RuntimeError – pad_token_id is not set but is required for static shape generation

Yields:

Generator[int, None, None] – Return a generator of new tokens
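
A sketch of the generation loop (the checkpoint name is a placeholder and the model is assumed to already be compiled for the NPU):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import generate_with_static_shape

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)

    input_ids = tokenizer("The NPU is", return_tensors="pt").input_ids
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    tokens = []
    for token_id in generate_with_static_shape(
        model, input_ids=input_ids, max_length=128, pad_token_id=pad_id
    ):
        tokens.append(token_id)
    print(tokenizer.decode(tokens))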

intel_npu_acceleration_library.nn.llm.lshift_insert(tensor: Tensor, value: float) Tensor#

Compute shift left and insert a value into a tensor.

Parameters:
  • tensor (torch.Tensor) – input tensor

  • value (float) – value to insert

Returns:

output tensor

Return type:

torch.Tensor
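
Conceptually this behaves like shifting the tensor left along its last dimension and writing the value into the freed slot; an illustrative equivalent (not necessarily the library's implementation):

    import torch

    def lshift_insert_reference(tensor: torch.Tensor, value: float) -> torch.Tensor:
        # Drop the first element along the last dim and append the new value.
        tail = torch.full_like(tensor[..., :1], value)
        return torch.cat([tensor[..., 1:], tail], dim=-1)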

intel_npu_acceleration_library.nn.llm.warm_up_decoder_model(tokenizer: AutoTokenizer, model: Module, model_seq_length: int, use_past: bool | None = True)#

Warm up the model on the NPU.

This function JIT-compiles all the layers offloaded to the NPU, then loads and warms them up on the NPU. This is particularly useful for LLM decoders.

Parameters:
  • tokenizer (AutoTokenizer) – a tokenizer

  • model (torch.nn.Module) – a torch Module representing a language model decoder

  • model_seq_length (int) – Maximum sequence length for the tokenizer padding

  • use_past (Optional[bool], optional) – Enable or disable KV caching. Defaults to True.
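
A usage sketch (the checkpoint name and sequence length are placeholders; compile() is assumed to be the library's top-level model conversion entry point):

    import torch
    import intel_npu_acceleration_library
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import warm_up_decoder_model

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)
    model = intel_npu_acceleration_library.compile(model, dtype=torch.float16)

    # JIT-compile and warm up the NPU-offloaded layers before serving requests
    warm_up_decoder_model(tokenizer, model, model_seq_length=512, use_past=True)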

Module contents#

class intel_npu_acceleration_library.nn.Conv2d(weights: Tensor, bias: Tensor | None = None, strides: int | Sequence[int] = 1, padding: int | Sequence[int] = 0, dilation: int | Sequence[int] = 1, groups: int = 1)#

Bases: Module

2D convolutional layer implementation.

Attrs:
  • weight (torch.Tensor) – The weight tensor of the layer.

  • bias (torch.Tensor) – The bias tensor of the layer.

property bias: Tensor#

Get the bias tensor of the layer.

Returns:

The bias tensor.

Return type:

torch.Tensor

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer, dtype: dtype = torch.float16) Conv2d#

Create a Conv2d layer from a torch.nn.Conv2d layer.

Parameters:
  • layer (torch.nn.Conv2d) – The torch Conv2d layer.

  • dtype (torch.dtype, optional) – Data type of the layer.

Returns:

The converted Conv2d layer.

Return type:

Conv2d
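
A conversion sketch (float16 is assumed to be a supported dtype; running the layer requires an NPU or the library's fallback device):

    import torch
    from intel_npu_acceleration_library.nn import Conv2d

    torch_conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
    npu_conv = Conv2d.fromTorch(torch_conv, dtype=torch.float16)
    y = npu_conv(torch.randn(1, 3, 32, 32, dtype=torch.float16))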

property weight: Tensor#

Get the weight tensor of the layer.

Returns:

The weight tensor.

Return type:

torch.Tensor

class intel_npu_acceleration_library.nn.Linear(weight: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • weight (torch.Tensor) – the original weight tensor

  • bias (Optional[torch.Tensor]) – the original bias tensor

  • dtype (torch.dtype) – the desired datatype

Raises:

RuntimeError – dtype not supported

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

class intel_npu_acceleration_library.nn.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#

Bases: Module

LlamaAttention operation NPU backend.

forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None, position_embeddings: Tuple[Tensor, Tensor] | None = None)#

Torch module forward method.

Parameters:
  • hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.

  • position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.

  • past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.

  • output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.

  • use_cache (Optional[bool], optional) – If set to True, past_key_values key value states are returned. Defaults to False.

  • cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.

  • position_embeddings (Optional[Tuple[torch.Tensor, torch.Tensor]], optional) – If set to a tuple, it means the sin and cos are uniformly calculated by the outer LlamaModel and passed in. Defaults to None.

Returns:

result

Return type:

_type_

static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention#

Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaAttention layer to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaAttention layer

Return type:

LlamaAttention

class intel_npu_acceleration_library.nn.Module(profile: bool = False)#

Bases: Module

A PyTorch module that runs on the NPU.

create_model(args: Sequence[Any], kwargs: MutableMapping[str, Any]) NNFactory#

Create a model from the module.

Parameters:
  • args (Sequence[Any]) – positional arguments

  • kwargs (MutableMapping[str, Any]) – keyword arguments

Returns:

The model.

Return type:

NNFactory

extract_tensors_from_arguments(args: Sequence[Any]) Sequence[Tensor]#

Extract the tensors from the arguments.

Parameters:

args (Sequence[Any]) – The positional arguments.

Returns:

The tensors.

Return type:

Sequence[torch.Tensor]

factory_forward(*args: Any, **kwargs: Any)#

Run the model using the factory.

Parameters:
  • args (Any) – The positional arguments.

  • kwargs (Any) – The keyword arguments.

Returns:

The output tensor.

Return type:

torch.Tensor

forward(*args, **kwargs) Tensor#

Run the forward pass of the module.

Parameters:
  • args (Any) – The positional arguments.

  • kwargs (Any) – The keyword arguments.

Raises:

NotImplementedError – If the forward method is not implemented.

Returns:

The output tensor.

Return type:

torch.Tensor

to(*args, **kwargs)#

Move the module to a device or to a different dtype.

Parameters:
  • args (Any) – The positional arguments.

  • kwargs (Any) – The keyword arguments.

Returns:

The module itself, moved and/or cast as requested.

Return type:

torch.nn.Module
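
The NotImplementedError above implies that user modules are defined by subclassing and providing a forward(); a minimal sketch (whether arbitrary torch ops in forward() are traced and offloaded automatically is an assumption here):

    import torch
    from intel_npu_acceleration_library.nn import Module

    class ScaledAdd(Module):
        def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
            return x * 0.5 + y

    module = ScaledAdd()
    out = module(torch.rand(1, 128), torch.rand(1, 128))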

class intel_npu_acceleration_library.nn.PhiMLP(parameters: List[Tensor])#

Bases: Module

Phi-2 MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) PhiMLP#

Generate a NPU PhiMLP layer from a transformer one.

Parameters:
  • layer (torch.nn.Module) – the original PhiMLP layer to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU PhiMLP layer

Return type:

PhiMLP

class intel_npu_acceleration_library.nn.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Quantized Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Raises:

RuntimeError – Training is not supported for the QuantizedLinear layer. Call .eval() to run inference only.

Returns:

result

Return type:

torch.Tensor