intel_npu_acceleration_library.nn package#
Submodules#
intel_npu_acceleration_library.nn.autograd module#
- class intel_npu_acceleration_library.nn.autograd.AutogradMatMul(*args, **kwargs)#
Bases:
Function
Autograd module for Linear operation.
- static backward(ctx, grad_output: Tensor) Iterable[Tensor | None] #
Run a linear backward pass.
Shapes:
- grad_output: [batch, output_channels]
- x: [batch, input_channels]
- w: [output_channels, input_channels]
Expected gradients:
- dl_dx: [batch, input_channels]
- dl_dw: [output_channels, input_channels]
Equivalent PyTorch code:
    dl_dx = grad_output @ w.to(torch.float32)
    dl_dw = (x.T @ grad_output).T
- Parameters:
ctx (Any) – the autograd context
grad_output (torch.Tensor) – output gradient
- Returns:
Input and parameters gradients
- Return type:
Iterable[Union[torch.Tensor, None]]
- static forward(ctx, x: Tensor, w: Tensor, scale: Tensor | None = None) Tensor #
Run a linear forward pass. Depending on the weights' datatype, it runs either a float or a quantized operation.
Equivalent PyTorch code:
    result = x @ w.T
- Parameters:
ctx (Any) – the autograd context
x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16
w (torch.Tensor) – Weight tensor. Its dtype must be torch.float16
scale (Optional[torch.Tensor], optional) – Quantization scale. If weights.dtype == torch.int8 then it must be set. Defaults to None.
- Returns:
result
- Return type:
torch.Tensor
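For reference, the forward and backward math documented above can be reproduced with plain PyTorch. A minimal sketch (tensor sizes are illustrative only; the actual NPU kernels expect float16 activations):

    import torch

    # Shapes follow the docstrings above; concrete sizes are illustrative.
    batch, input_channels, output_channels = 4, 64, 128
    x = torch.randn(batch, input_channels)
    w = torch.randn(output_channels, input_channels)
    grad_output = torch.randn(batch, output_channels)

    # Forward: result = x @ w.T -> [batch, output_channels]
    result = x @ w.T

    # Backward, as documented above:
    dl_dx = grad_output @ w.to(torch.float32)  # [batch, input_channels]
    dl_dw = (x.T @ grad_output).T              # [output_channels, input_channels]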
intel_npu_acceleration_library.nn.linear module#
- class intel_npu_acceleration_library.nn.linear.Linear(weight: Tensor, bias: Tensor | None = None)#
Bases:
Module
Torch Linear operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear #
Generate a NPU Linear layer from a torch one.
- Parameters:
weight (torch.Tensor) – the original weight tensor
bias (Optional[torch.Tensor]) – the original bias tensor
dtype (torch.dtype) – the desired datatype
- Raises:
RuntimeError – dtype not supported
- Returns:
A NPU linear layer
- Return type:
Union[Linear, QuantizedLinear]
- static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear #
Generate a NPU Linear layer from a torch one.
- Parameters:
layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU
dtype (torch.dtype) – the desired datatype
- Returns:
A NPU linear layer
- Return type:
Union[Linear, QuantizedLinear]
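A minimal conversion sketch, assuming an NPU (or the library's emulation backend) is available; layer sizes are illustrative:

    import torch
    from intel_npu_acceleration_library.nn.linear import Linear

    # A plain torch layer to offload (sizes are illustrative).
    torch_layer = torch.nn.Linear(in_features=256, out_features=512, bias=True)

    # float16 is the default target dtype; an unsupported dtype raises RuntimeError.
    npu_layer = Linear.fromTorch(torch_layer, dtype=torch.float16)

    x = torch.randn(1, 256, dtype=torch.float16)
    y = npu_layer(x)  # [1, 512]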
- class intel_npu_acceleration_library.nn.linear.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#
Bases:
Module
Torch Quantized Linear operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Raises:
RuntimeError – Training is not supported for the QuantizedLinear layer. Use .eval() to run inference only
- Returns:
result
- Return type:
torch.Tensor
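Conceptually, the quantized layer dequantizes the int8 weights with the quantization scale before the matrix multiply. A rough float emulation of that math (the per-output-channel scale layout is an assumption here; the NPU backend performs the fused equivalent):

    import torch

    batch, input_channels, output_channels = 2, 64, 128
    x = torch.randn(batch, input_channels, dtype=torch.float16)
    w_int8 = torch.randint(-128, 127, (output_channels, input_channels), dtype=torch.int8)
    scale = torch.rand(output_channels, 1, dtype=torch.float16)  # assumed per-output-channel scale

    # Emulated dequantize-then-matmul; shapes match the Linear convention above.
    y = x @ (w_int8.to(torch.float16) * scale).T  # [batch, output_channels]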
intel_npu_acceleration_library.nn.llm module#
- class intel_npu_acceleration_library.nn.llm.FusedLlamaMLP(parameters: List[Tensor])#
Bases:
Module
LLAMA MLP operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- static fromTorch(layer: Module, dtype: dtype = torch.float16) FusedLlamaMLP #
Generate a NPU LlamaMLP layer from a transformer LlamaMLP one.
- Parameters:
layer (torch.nn.Module) – the original LlamaMLP layer to run on the NPU
dtype (torch.dtype) – the desired datatype
- Returns:
A NPU LlamaMLP layer
- Return type:
FusedLlamaMLP
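A minimal conversion sketch, assuming a Hugging Face transformers LlamaMLP block (the LlamaMLP constructor varies across transformers versions, so this is illustrative only):

    import torch
    from transformers.models.llama.configuration_llama import LlamaConfig
    from transformers.models.llama.modeling_llama import LlamaMLP
    from intel_npu_acceleration_library.nn.llm import FusedLlamaMLP

    config = LlamaConfig(hidden_size=256, intermediate_size=512)
    mlp = LlamaMLP(config)  # constructor signature depends on the transformers version

    # Offload the fused MLP to the NPU in float16.
    npu_mlp = FusedLlamaMLP.fromTorch(mlp, dtype=torch.float16)
    x = torch.randn(1, 8, 256, dtype=torch.float16)
    y = npu_mlp(x)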
- class intel_npu_acceleration_library.nn.llm.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#
Bases:
Module
LlamaAttention operation NPU backend.
- forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None, position_embeddings: Tuple[Tensor, Tensor] | None = None)#
Torch module forward method.
- Parameters:
hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)
attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.
position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.
past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.
output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.
use_cache (Optional[bool], optional) – If set to True, past_key_values key value states are returned. Defaults to False.
cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.
position_embeddings (Optional[Tuple[torch.Tensor, torch.Tensor]], optional) – If set to a tuple, it means the sin and cos are uniformly calculated by the outer LlamaModel and passed in. Defaults to None.
- Returns:
result
- Return type:
_type_
- static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention #
Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.
- Parameters:
layer (torch.nn.Module) – the original LlamaAttention layer to run on the NPU
dtype (torch.dtype) – the desired datatype
- Returns:
A NPU LlamaAttention layer
- Return type:
LlamaAttention
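A typical use is to swap the attention blocks of a loaded Llama-family model for their NPU counterparts. A hedged sketch (the checkpoint name and module attribute paths are illustrative and depend on the transformers model layout):

    import torch
    from transformers import AutoModelForCausalLM
    from intel_npu_acceleration_library.nn.llm import LlamaAttention

    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
    )

    # Replace each self-attention block with the NPU implementation.
    for layer in model.model.layers:
        layer.self_attn = LlamaAttention.fromTorch(layer.self_attn, dtype=torch.float16)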
- class intel_npu_acceleration_library.nn.llm.PhiMLP(parameters: List[Tensor])#
Bases:
Module
Phi-2 MLP operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- intel_npu_acceleration_library.nn.llm.generate_with_static_shape(model: Module, input_ids: Tensor, max_length: int, attention_mask: Tensor | None = None, use_past: bool | None = True, pad_token_id: int | None = None, **kwargs) Generator[int, None, None] #
Run the LLM generation routine with static shapes.
- Parameters:
model (torch.nn.Module) – LLM model
input_ids (torch.Tensor) – model input_ids
max_length (int) – model max length.
attention_mask (Optional[torch.Tensor], optional) – input attention mask. Defaults to None.
use_past (Optional[bool], optional) – Enable/disable KV caching. Defaults to True.
pad_token_id (Optional[int], optional) – Padding token. Defaults to None.
kwargs – Additional arguments
- Raises:
RuntimeError – pad_token_id is not set; it is required for static-shape generation
- Yields:
Generator[int, None, None] – Return a generator of new tokens
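A minimal decoding-loop sketch (model and tokenizer names are illustrative; the generator yields one new token id at a time):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import generate_with_static_shape

    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    input_ids = tokenizer("The NPU is", return_tensors="pt").input_ids

    # pad_token_id must be provided, otherwise a RuntimeError is raised (see above).
    new_tokens = []
    for token_id in generate_with_static_shape(
        model,
        input_ids=input_ids,
        max_length=64,
        use_past=True,
        pad_token_id=tokenizer.eos_token_id,
    ):
        new_tokens.append(token_id)

    print(tokenizer.decode(new_tokens))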
- intel_npu_acceleration_library.nn.llm.lshift_insert(tensor: Tensor, value: float) Tensor #
Shift a tensor left and insert a value at the end.
- Parameters:
tensor (torch.Tensor) – input tensor
value (float) – value to insert
- Returns:
output tensor
- Return type:
torch.Tensor
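Assumed semantics, expressed as a pure-PyTorch reference for a 1-D tensor: the elements are shifted left by one position and the new value takes the last slot (useful, for example, to roll a fixed-size buffer during static-shape generation). This is a hypothetical reference helper, not part of the library:

    import torch

    def lshift_insert_reference(tensor: torch.Tensor, value: float) -> torch.Tensor:
        # Drop the first element and append the new value at the end.
        return torch.cat([tensor[1:], tensor.new_tensor([value])])

    t = torch.tensor([1.0, 2.0, 3.0, 4.0])
    print(lshift_insert_reference(t, 5.0))  # tensor([2., 3., 4., 5.])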
- intel_npu_acceleration_library.nn.llm.warm_up_decoder_model(tokenizer: AutoTokenizer, model: Module, model_seq_length: int, use_past: bool | None = True)#
Warm up the model on the NPU.
This function JIT-compiles all the layers offloaded to the NPU, loads them, and warms them up on the device. This is particularly useful for LLM decoders.
- Parameters:
tokenizer (AutoTokenizer) – a tokenizer
model (torch.nn.Module) – a torch Module representing a language model decoder
model_seq_length (int) – Max sequence length for the tokenizer padding
use_past (Optional[bool], optional) – Enable or Disable KV-caching. Defaults to True.
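A minimal usage sketch (names are illustrative; the model is assumed to already have its layers offloaded to the NPU):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from intel_npu_acceleration_library.nn.llm import warm_up_decoder_model

    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # JIT-compile and warm up the NPU-offloaded layers before the first real request.
    warm_up_decoder_model(tokenizer, model, model_seq_length=128, use_past=True)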
Module contents#
- class intel_npu_acceleration_library.nn.Conv2d(weights: Tensor, bias: Tensor | None = None, strides: int | Sequence[int] = 1, padding: int | Sequence[int] = 0, dilation: int | Sequence[int] = 1, groups: int = 1)#
Bases:
Module
2D convolutional layer implementation.
- Attrs:
weight (torch.Tensor): The weight tensor of the layer.
bias (torch.Tensor): The bias tensor of the layer.
- property bias: Tensor#
Get the bias tensor of the layer.
- Returns:
The bias tensor.
- Return type:
torch.Tensor
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- static fromTorch(layer, dtype: dtype = torch.float16) Conv2d #
Create a Conv2d layer from a torch.nn.Conv2d layer.
- Parameters:
layer (torch.nn.Conv2d) – The torch Conv2d layer.
dtype (torch.dtype, optional) – Data type of the layer.
- Returns:
The converted Conv2d layer.
- Return type:
Conv2d
- property weight: Tensor#
Get the weight tensor of the layer.
- Returns:
The weight tensor.
- Return type:
torch.Tensor
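A minimal conversion sketch for a torch convolution (sizes are illustrative; float16 is assumed as the target dtype):

    import torch
    from intel_npu_acceleration_library.nn import Conv2d

    torch_conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
    npu_conv = Conv2d.fromTorch(torch_conv, dtype=torch.float16)

    x = torch.randn(1, 3, 32, 32, dtype=torch.float16)
    y = npu_conv(x)  # [1, 16, 32, 32]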
- class intel_npu_acceleration_library.nn.Linear(weight: Tensor, bias: Tensor | None = None)#
Bases:
Module
Torch Linear operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- static fromTensor(weight: Tensor, bias: Tensor | None, dtype: dtype = torch.float16) Linear | QuantizedLinear #
Generate a NPU Linear layer from a torch one.
- Parameters:
weight (torch.Tensor) – the original weight tensor
bias (Optional[torch.Tensor]) – the original bias tensor
dtype (torch.dtype) – the desired datatype
- Raises:
RuntimeError – dtype not supported
- Returns:
A NPU linear layer
- Return type:
Union[Linear, QuantizedLinear]
- static fromTorch(layer: Linear, dtype: dtype = torch.float16) Linear | QuantizedLinear #
Generate a NPU Linear layer from a torch one.
- Parameters:
layer (torch.nn.Linear) – the original torch.nn.Linear model to run on the NPU
dtype (torch.dtype) – the desired datatype
- Returns:
A NPU linear layer
- Return type:
Union[Linear, QuantizedLinear]
- class intel_npu_acceleration_library.nn.LlamaAttention(config: LlamaConfig, q_weights: Tensor, kv_weights: Tensor, o_proj: Tensor, rotary_emb: Module, dtype: dtype = torch.float16, layer_idx: int | None = None)#
Bases:
Module
LlamaAttention operation NPU backend.
- forward(hidden_states: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_value: Cache | None = None, output_attentions: bool | None = False, use_cache: bool | None = False, cache_position: LongTensor | None = None, position_embeddings: Tuple[Tensor, Tensor] | None = None)#
Torch module forward method.
- Parameters:
hidden_states (torch.Tensor) – input to the layer of shape (batch, seq_len, embed_dim)
attention_mask (Optional[torch.Tensor], optional) – attention mask of shape (batch_size, sequence_length). Defaults to None.
position_ids (Optional[torch.Tensor], optional) – position_ids of shape (batch_size, sequence_length). Defaults to None.
past_key_value (Optional[Cache], optional) – Pre-computed hidden-states (key and values in the self-attention blocks). Defaults to None.
output_attentions (Optional[bool], optional) – Whether or not to return the attention tensors of all attention layers. Defaults to False.
use_cache (Optional[bool], optional) – If set to True, past_key_values key value states are returned. Defaults to False.
cache_position (Optional[torch.LongTensor], optional) – Cache position, useful for static cache applications. Defaults to None.
position_embeddings (Optional[Tuple[torch.Tensor, torch.Tensor]], optional) – If set to a tuple, it means the sin and cos are uniformly calculated by the outer LlamaModel and passed in. Defaults to None.
- Returns:
result
- Return type:
_type_
- static fromTorch(layer: Module, dtype: dtype = torch.float16) LlamaAttention #
Generate a NPU LlamaAttention layer from a transformer LlamaAttention one.
- Parameters:
layer (torch.nn.Module) – the original LlamaAttention layer to run on the NPU
dtype (torch.dtype) – the desired datatype
- Returns:
A NPU LlamaAttention layer
- Return type:
LlamaAttention
- class intel_npu_acceleration_library.nn.Module(profile: bool = False)#
Bases:
Module
A PyTorch module that runs on the NPU.
- create_model(args: Sequence[Any], kwargs: MutableMapping[str, Any]) NNFactory #
Create a model from the module.
- Parameters:
args (Sequence[Any]) – positional arguments
kwargs (MutableMapping[str, Any]) – keyword arguments
- Returns:
The model.
- Return type:
NNFactory
- extract_tensors_from_arguments(args: Sequence[Any]) Sequence[Tensor] #
Extract the tensors from the arguments.
- Parameters:
args (Sequence[Any]) – The positional arguments.
- Returns:
The tensors.
- Return type:
Sequence[torch.Tensor]
- factory_forward(*args: Any, **kwargs: Any)#
Run the model using the factory.
- Parameters:
args (Any) – The positional arguments.
kwargs (Any) – The keyword arguments.
- Returns:
The output tensor.
- Return type:
torch.Tensor
- forward(*args, **kwargs) Tensor #
Run the forward pass of the module.
- Parameters:
args (Any) – The positional arguments.
kwargs (Any) – The keyword arguments.
- Raises:
NotImplementedError – If the forward method is not implemented.
- Returns:
The output tensor.
- Return type:
torch.Tensor
- to(*args, **kwargs)#
Move the module to a device or to a different dtype.
- Parameters:
args (Any) – The positional arguments.
kwargs (Any) – The keyword arguments.
- Returns:
The output tensor.
- Return type:
torch.Tensor
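A minimal subclassing sketch, assuming the base class traces the user-defined forward into an NPU graph via create_model/factory_forward (the class name and arithmetic are illustrative):

    import torch
    from intel_npu_acceleration_library.nn import Module

    class ScaledAdd(Module):
        """Toy module: forward is plain torch code; the base class handles NPU offload."""

        def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
            return x + 2.0 * y

    m = ScaledAdd()
    out = m(torch.randn(1, 16), torch.randn(1, 16))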
- class intel_npu_acceleration_library.nn.PhiMLP(parameters: List[Tensor])#
Bases:
Module
Phi-2 MLP operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
result
- Return type:
torch.Tensor
- class intel_npu_acceleration_library.nn.QuantizedLinear(weight: Tensor, scale: Tensor, bias: Tensor | None = None)#
Bases:
Module
Torch Quantized Linear operation NPU backend.
- forward(x: Tensor) Tensor #
Torch module forward method.
- Parameters:
x (torch.Tensor) – Input tensor
- Raises:
RuntimeError – Training is not supported for the QuantizedLinear layer. Use .eval() to run inference only
- Returns:
result
- Return type:
torch.Tensor