intel_npu_acceleration_library.nn package#

Submodules#

intel_npu_acceleration_library.nn.autograd module#

class intel_npu_acceleration_library.nn.autograd.AutogradMatMul(*args, **kwargs)#

Bases: Function

Autograd module for Linear operation.

static backward(ctx, grad_output: Tensor) Iterable[Tensor | None]#

Run a linear backward pass.

grad_output shape: [batch, output_channels]
x shape: [batch, input_channels]
w shape: [output_channels, input_channels]

Expected gradients:
dl_dx shape: [batch, input_channels]
dl_dw shape: [output_channels, input_channels]

Equivalent pytorch code:
dl_dx = grad_output @ w.to(torch.float32)
dl_dw = (x.T @ grad_output).T

Parameters:
  • ctx (Any) – the autograd context

  • grad_output (torch.Tensor) – output gradient

Returns:

Input and parameters gradients

Return type:

Iterable[Union[torch.Tensor, None]]
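
For reference, the equivalent PyTorch formulation quoted above can be written as a small standalone function (a sketch only; the actual backward pass is executed by the NPU backend):

    import torch

    def linear_backward_reference(grad_output: torch.Tensor, x: torch.Tensor, w: torch.Tensor):
        # Shapes, as documented above:
        #   grad_output: [batch, output_channels]
        #   x:           [batch, input_channels]
        #   w:           [output_channels, input_channels]
        dl_dx = grad_output @ w.to(torch.float32)  # [batch, input_channels]
        dl_dw = (x.T @ grad_output).T              # [output_channels, input_channels]
        return dl_dx, dl_dw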

static forward(ctx, x: Tensor, w: Tensor, scale: Tensor | None = None) Tensor#

Run a linear forward pass. Depending on the datatype of the weights it runs a float or quantized operation.

Equivalent pytorch code: result = x @ w.T

Parameters:
  • ctx (Any) – the autograd context

  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • w (torch.Tensor) – Weight tensor. Its dtype must be torch.float16

  • scale (Optional[torch.Tensor], optional) – Quantization scale. Required when w.dtype == torch.int8. Defaults to None.

Returns:

result

Return type:

torch.Tensor
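
A minimal invocation sketch using the standard torch.autograd.Function entry point; the tensor shapes are illustrative and an NPU-enabled environment is assumed:

    import torch
    from intel_npu_acceleration_library.nn.autograd import AutogradMatMul

    x = torch.rand(4, 128, dtype=torch.float16, requires_grad=True)    # [batch, input_channels]
    w = torch.rand(256, 128, dtype=torch.float16, requires_grad=True)  # [output_channels, input_channels]

    # Equivalent to result = x @ w.T, executed by the NPU backend
    result = AutogradMatMul.apply(x, w)
    result.sum().backward()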

intel_npu_acceleration_library.nn.linear module#

class intel_npu_acceleration_library.nn.linear.Linear(weight: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • weight (torch.Tensor) – the original weight tensor

  • bias (Optional[torch.Tensor]) – the original bias tensor

  • dtype (torch.dtype) – the desired datatype

Raises:

RuntimeError – Quantized Linear requires input_channel to be a multiple of 8

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]
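
A minimal conversion sketch; the layer sizes are illustrative and an NPU-enabled environment is assumed. With dtype=torch.float16 a Linear layer is returned, while a quantized dtype is expected to yield a QuantizedLinear:

    import torch
    from intel_npu_acceleration_library.nn import Linear

    torch_layer = torch.nn.Linear(in_features=256, out_features=512, bias=True)
    npu_layer = Linear.fromTorch(torch_layer, dtype=torch.float16)

    x = torch.rand(8, 256, dtype=torch.float16)
    y = npu_layer(x)  # shape: [8, 512]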

class intel_npu_acceleration_library.nn.linear.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Quantized Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Raises:

RuntimeError – Training is not supported for QuantizedLinear layer. Use .eval() to do inference only

Returns:

result

Return type:

torch.Tensor
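
Because training is not supported, the layer must be put in inference mode before calling it. A sketch, assuming (not confirmed by this page) that Linear.fromTorch with an int8 dtype returns a QuantizedLinear:

    import torch
    from intel_npu_acceleration_library.nn import Linear

    torch_layer = torch.nn.Linear(in_features=256, out_features=512)  # 256 is a multiple of 8
    npu_layer = Linear.fromTorch(torch_layer, dtype=torch.int8)       # assumed to return a QuantizedLinear

    npu_layer.eval()  # required: a training-mode forward raises RuntimeError
    with torch.no_grad():
        y = npu_layer(torch.rand(8, 256, dtype=torch.float16))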

intel_npu_acceleration_library.nn.llm module#

class intel_npu_acceleration_library.nn.llm.FusedLlamaMLP(parameters: List[Tensor])#

Bases: Module

LLAMA MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) FusedLlamaMLP#

Generate a NPU LlamaMLP layer from a transformer LlamaMLP one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaMLP model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaMLP layer

Return type:

FusedLlamaMLP
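
A conversion sketch, assuming a Hugging Face transformers installation and an NPU-enabled environment; the configuration values are illustrative only:

    import torch
    from transformers import LlamaConfig
    from transformers.models.llama.modeling_llama import LlamaMLP
    from intel_npu_acceleration_library.nn.llm import FusedLlamaMLP

    config = LlamaConfig(hidden_size=512, intermediate_size=1024)
    mlp = LlamaMLP(config)

    npu_mlp = FusedLlamaMLP.fromTorch(mlp, dtype=torch.float16)
    y = npu_mlp(torch.rand(1, 16, 512, dtype=torch.float16))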

class intel_npu_acceleration_library.nn.llm.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#

Bases: Module

LlamaAttention operation NPU backend.

forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None)#

Torch module forward method.

Parameters:
  • hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.

  • position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.

  • past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.

  • output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.

  • use_cache (Optional[bool], optional) – If set to True, the past_key_values key/value states are returned. Defaults to False.

  • cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.

Returns:

result

Return type:

_type_

static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention#

Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaAttention model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaAttention layer

Return type:

LlamaAttention

class intel_npu_acceleration_library.nn.llm.PhiMLP(parameters: List[Tensor])#

Bases: Module

Phi-2 MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) PhiMLP#

Generate a NPU PhiMLP layer from a transformer one.

Parameters:
  • layer (torch.nn.Module) – the original PhiMLP model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU PhiMLP layer

Return type:

PhiMLP

intel_npu_acceleration_library.nn.llm.generate_with_static_shape(model: Module, input_ids: Tensor, max_length: int, attention_mask: Tensor | None = None, use_past: bool | None = True, pad_token_id: int | None = None, **kwargs) Generator[int, None, None]#

Run the LLM generation routine with static shapes.

Parameters:
  • model (torch.nn.Module) – LLM model

  • input_ids (torch.Tensor) – model input_ids

  • max_length (int) – model max length.

  • attention_mask (Optional[torch.Tensor], optional) – input attention mask. Defaults to None.

  • use_past (Optional[bool], optional) – Enable/disable KV caching. Defaults to True.

  • pad_token_id (Optional[int], optional) – Padding token. Defaults to None.

  • kwargs – Additional arguments

Raises:

RuntimeError – pad_token_id is not set but is required for static shape generation

Yields:

Generator[int, None, None] – Return a generator of new tokens
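
A usage sketch; the checkpoint name is illustrative, and the model is assumed to have already been prepared for and offloaded to the NPU:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import generate_with_static_shape

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    input_ids = tokenizer("The NPU is", return_tensors="pt").input_ids
    generated = []
    for token in generate_with_static_shape(
        model,
        input_ids=input_ids,
        max_length=128,
        use_past=True,
        pad_token_id=tokenizer.eos_token_id,  # must be set for static-shape generation
    ):
        generated.append(token)

    print(tokenizer.decode(generated))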

intel_npu_acceleration_library.nn.llm.lshift_insert(tensor: Tensor, value: float) Tensor#

Compute shift left and insert a value into a tensor.

Parameters:
  • tensor (torch.Tensor) – input tensor

  • value (float) – value to add

Returns:

output tensor

Return type:

torch.Tensor
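
A reference sketch of the assumed semantics (elements are shifted one position to the left along the last dimension and the value fills the freed final slot); the actual implementation may differ:

    import torch

    def lshift_insert_reference(tensor: torch.Tensor, value: float) -> torch.Tensor:
        # Assumed behaviour: roll left by one and overwrite the last element.
        out = torch.roll(tensor, shifts=-1, dims=-1)
        out[..., -1] = value
        return out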

intel_npu_acceleration_library.nn.llm.warm_up_decoder_model(tokenizer: AutoTokenizer, model: Module, model_seq_length: int, use_past: bool | None = True)#

Warm up the model on the NPU.

This function JIT-compiles all the layers offloaded to the NPU and loads and warms them on the NPU. This is particularly useful for LLM decoders.

Parameters:
  • tokenizer (AutoTokenizer) – a tokenizer

  • model (torch.nn.Module) – a torch Module representing a language model decoder

  • model_seq_length (int) – Max sequence length for the tokenizer padding

  • use_past (Optional[bool], optional) – Enable or disable KV caching. Defaults to True.
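
A usage sketch; the checkpoint name is illustrative and the decoder is assumed to already have its layers offloaded to the NPU:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import warm_up_decoder_model

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    warm_up_decoder_model(tokenizer, model, model_seq_length=2048, use_past=True)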

Module contents#

class intel_npu_acceleration_library.nn.Conv2d(matmul, in_channels, out_channels, kernel_size, stride=(1, 1), padding=(0, 0), dilation=(1, 1))#

Bases: Module

2D convolutional layer implementation.

Attrs:

weight (torch.Tensor): The weight tensor of the layer.
bias (torch.Tensor): The bias tensor of the layer.

Parameters:
  • matmul (torch.nn.Module) – The matrix multiplication module.

  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (Union[int, Tuple[int, int]]) – Size of the convolutional kernel.

  • stride (Union[int, Tuple[int, int]], optional) – Stride of the convolution. Defaults to (1, 1).

  • padding (Union[int, Tuple[int, int]], optional) – Padding added to the input. Defaults to (0, 0).

  • dilation (Union[int, Tuple[int, int]], optional) – Dilation rate of the convolution. Defaults to (1, 1).

property bias: Tensor#

Get the bias tensor of the layer.

Returns:

The bias tensor.

Return type:

torch.Tensor

compute_output_dim(dim, idx) int#

Compute the output dimension for a given input dimension.

Parameters:
  • dim (int) – Input dimension.

  • idx (int) – Index of the dimension.

Returns:

Output dimension.

Return type:

int

forward(x) Tensor#

Forward pass of the convolutional layer.

Parameters:

x (torch.Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

torch.Tensor

static fromTorch(layer, dtype) Conv2d#

Create a Conv2d layer from a torch.nn.Conv2d layer.

Parameters:
  • layer (torch.nn.Conv2d) – The torch Conv2d layer.

  • dtype (torch.dtype) – Data type of the layer.

Returns:

The converted Conv2d layer.

Return type:

Conv2d

property weight: Tensor#

Get the weight tensor of the layer.

Returns:

The weight tensor.

Return type:

torch.Tensor
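
A conversion sketch; shapes and dtype are illustrative, and an NPU-enabled environment is assumed:

    import torch
    from intel_npu_acceleration_library.nn import Conv2d

    torch_conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
    npu_conv = Conv2d.fromTorch(torch_conv, dtype=torch.float16)

    x = torch.rand(1, 3, 224, 224, dtype=torch.float16)
    y = npu_conv(x)  # shape: [1, 16, 224, 224]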

class intel_npu_acceleration_library.nn.Linear(weight: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • weight (torch.Tensor) – the original weight tensor

  • bias (Optional[torch.Tensor]) – the original bias tensor

  • dtype (torch.dtype) – the desired datatype

Raises:

RuntimeError – Quantized Linear requires input_channel to be a multiple of 8

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear#

Generate a NPU Linear layer from a torch one.

Parameters:
  • layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU linear layer

Return type:

Union[Linear, QuantizedLinear]

class intel_npu_acceleration_library.nn.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#

Bases: Module

LlamaAttention operation NPU backend.

forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None)#

Torch module forward method.

Parameters:
  • hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.

  • position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.

  • past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.

  • output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.

  • use_cache (Optional[bool], optional) – If set to True, the past_key_values key/value states are returned. Defaults to False.

  • cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.

Returns:

result

Return type:

_type_

static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention#

Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.

Parameters:
  • layer (torch.nn.Module) – the original LlamaAttention model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU LlamaAttention layer

Return type:

LlamaAttention

class intel_npu_acceleration_library.nn.PhiMLP(parameters: List[Tensor])#

Bases: Module

Phi-2 MLP operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

result

Return type:

torch.Tensor

static fromTorch(layer: Module, dtype: dtype = torch.float16) PhiMLP#

Generate a NPU PhiMLP layer from a transformer one.

Parameters:
  • layer (torch.nn.Module) – the original PhiMLP model to run on the NPU

  • dtype (torch.dtype) – the desired datatype

Returns:

A NPU PhiMLP layer

Return type:

PhiMLP

class intel_npu_acceleration_library.nn.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#

Bases: Module

Torch Quantized Linear operation NPU backend.

forward(x: Tensor) Tensor#

Torch module forward method.

Parameters:

x (torch.Tensor) – Input tensor

Raises:

RuntimeError – Training is not supported for QuantizedLinear layer. Use .eval() to do inference only

Returns:

result

Return type:

torch.Tensor