intel_npu_acceleration_library.backend package#

Submodules#

intel_npu_acceleration_library.backend.base module#

class intel_npu_acceleration_library.backend.base.BaseNPUBackend(profile: bool | None = False)#

Bases: object

A base class that represents an abstract matrix-matrix operation on the NPU.

save(path: str)#

Save the OpenVINO model.

Parameters:

path (str) – the model save path

saveCompiledModel(path: str)#

Save the compiled model.

Parameters:

path (str) – the compiled model save path
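
A minimal sketch of the save workflow, assuming a concrete backend such as MatMul (documented below) that builds and compiles its model at construction; the file paths are illustrative::

    from intel_npu_acceleration_library.backend import MatMul

    # MatMul is a concrete BaseNPUBackend subclass (see below).
    mm = MatMul(inC=512, outC=256, batch=16)

    # Persist the OpenVINO model and, separately, the compiled model.
    mm.save("matmul_model.xml")
    mm.saveCompiledModel("matmul_compiled.blob")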

class intel_npu_acceleration_library.backend.base.BaseNPUBackendWithPrefetch(profile: bool)#

Bases: BaseNPUBackend

A base class that represents an abstract matrix-matrix operation on the NPU.

Linear-type classes employ an algorithm to optimize weight prefetching.

add_to_map(wt_hash: str, weights: Iterable[ndarray | Tuple[ndarray, ...]])#

Add an operation's parameters to the operation hash:parameter map.

Parameters:
  • wt_hash (str) – operation hash

  • weights (Iterable[Union[np.ndarray, Tuple[np.ndarray, ...]]]) – Operation parameters

create_parameters(weights: Iterable[ndarray | Tuple[ndarray, ...]]) _Pointer#

Create an operation parameter from a list of weights.

Parameters:

weights (Iterable[Union[np.ndarray, Tuple[np.ndarray, ...]]]) – Operation parameters

Raises:
  • RuntimeError – Quantized weights need to be in int8 format

  • ValueError – Invalid dtype for scale

Returns:

a pointer to the Parameters object

Return type:

ctypes._Pointer

load_wt_fn(module, parameters)#

Asynchronously load the parameters into the NPU.

Parameters:
  • module – the NPU backend module

  • parameters – the weights parameter class

prefetchWeights()#

Prefetch the next operation's weights.

setWeights(wt_hash: str | None, *args: ndarray | Tuple[ndarray, ...]) bool#

Set the operation weights in the NPU.

Parameters:
  • wt_hash (str) – operation hash. If set to None, forces loading of the weights

  • args (Union[np.ndarray, Tuple[np.ndarray, ...]]) – Variable-length weights list. Each entry can be an np.ndarray, or a (weight, scale) tuple in the case of quantized tensors

Returns:

True if the op parameters are already in the op map

Return type:

bool
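
A sketch of the hash-based weight cache implied above, assuming (per the return description) that the first call with a new hash returns False and subsequent calls with the same hash return True::

    import numpy as np
    from intel_npu_acceleration_library.backend import MatMul

    mm = MatMul(inC=128, outC=64, batch=8)
    W = np.random.rand(64, 128).astype(np.float16)  # assumed (outC, inC) layout

    # New hash: parameters are created and loaded onto the NPU.
    mm.setWeights("layer0", W)   # expected False
    # Same hash again: parameters are already in the op map.
    mm.setWeights("layer0", W)   # expected True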

intel_npu_acceleration_library.backend.base.adapt_weight(w: ndarray) ndarray#

Adapt the weights to run on the NPU.

Parameters:

w (np.ndarray) – weights array

Raises:

RuntimeError – Unsupported shape

Returns:

The adapted array

Return type:

np.ndarray

intel_npu_acceleration_library.backend.factory module#

class intel_npu_acceleration_library.backend.factory.NNFactory(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: BaseNPUBackendWithPrefetch

Network factory class, building models that compute a matrix-matrix multiplication with weight prefetching.

compile(output_node: _Pointer)#

Finalize and compile a model.

Parameters:

output_node (ctypes._Pointer) – Model output node

linear(input_node: _Pointer, output_channels: int, input_channels: int, bias: bool | None = False, quantize: bool = False) _Pointer#

Generate a linear layer.

Parameters:
  • input_node (ctypes._Pointer) – layer input node

  • output_channels (int) – number of output channels

  • input_channels (int) – number of input channels

  • bias (bool, optional) – enable/disable bias. Defaults to False.

  • quantize (bool, optional) – quantize linear model. Defaults to False.

Returns:

a pointer to the linear layer output node

Return type:

ctypes._Pointer

parameter(shape: Tuple[int, int], dtype: DTypeLike = numpy.float16) _Pointer#

Generate a model input parameter.

Parameters:
  • shape (Tuple[int, int]) – Parameter shape (only 2D tensors are supported at the moment)

  • dtype (np.dtype, optional) – parameter dtype; np.int8 and np.float16 are supported. Defaults to np.float16.

Raises:

RuntimeError – Unsupported shape

Returns:

a pointer to a parameter object

Return type:

ctypes._Pointer

run(X: ndarray, *weights: ndarray | Tuple[ndarray, ndarray], **kwargs: Any) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • weights (Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]) – rhs operators

  • kwargs (Any) – additional arguments

Raises:

RuntimeError – Input tensor shape mismatch

Returns:

result

Return type:

np.ndarray
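
Putting the NNFactory pieces together, a minimal sketch that declares a parameter, builds a linear node, compiles, and runs. The (batch, inC) input shape and (outC, inC) weight layout are assumptions derived from the X * W^T convention above::

    import numpy as np
    from intel_npu_acceleration_library.backend import NNFactory

    inC, outC, batch = 128, 64, 8
    factory = NNFactory(inC, outC, batch)

    # Declare the model input and stack a linear layer on top of it.
    input_node = factory.parameter((batch, inC), dtype=np.float16)
    output_node = factory.linear(input_node, outC, inC, bias=False)

    # Finalize the graph and compile it for the NPU.
    factory.compile(output_node)

    # Run X * W^T; the weights are passed at run time so they can be prefetched.
    X = np.random.rand(batch, inC).astype(np.float16)
    W = np.random.rand(outC, inC).astype(np.float16)
    out = factory.run(X, W)   # expected shape: (batch, outC)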

intel_npu_acceleration_library.backend.linear module#

class intel_npu_acceleration_library.backend.linear.Linear(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Linear class, computing a matrix-matrix multiplication with weight prefetching.

run(X: ndarray, W: ndarray, op_id: str) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • W (np.ndarray) – rhs operator

  • op_id (str) – operation id

Raises:

RuntimeError – Input or weight tensor shape mismatch

Returns:

result

Return type:

np.ndarray
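
A minimal usage sketch; the (outC, inC) weight layout follows from the X * W^T convention, and the op_id string is an arbitrary label used for weight caching::

    import numpy as np
    from intel_npu_acceleration_library.backend import Linear

    fc = Linear(inC=256, outC=128, batch=16)

    X = np.random.rand(16, 256).astype(np.float16)
    W = np.random.rand(128, 256).astype(np.float16)  # (outC, inC)

    out = fc.run(X, W, op_id="fc1")   # X * W^T -> shape (16, 128)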

intel_npu_acceleration_library.backend.matmul module#

class intel_npu_acceleration_library.backend.matmul.MatMul(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

MatMul class, computing a matrix-matrix multiplication.

run(X: ndarray, W: ndarray) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • W (np.ndarray) – rhs operator

Raises:

RuntimeError – Input or weight tensor shape mismatch

Returns:

result

Return type:

np.ndarray
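
The same pattern as Linear but without an operation id; a sketch under the same layout assumptions::

    import numpy as np
    from intel_npu_acceleration_library.backend import MatMul

    mm = MatMul(inC=256, outC=128, batch=16)
    X = np.random.rand(16, 256).astype(np.float16)
    W = np.random.rand(128, 256).astype(np.float16)  # (outC, inC)
    out = mm.run(X, W)                               # X * W^T -> (16, 128)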

intel_npu_acceleration_library.backend.mlp module#

class intel_npu_acceleration_library.backend.mlp.MLP(hidden_size: int, intermediate_size: int, batch: int, activation: str = 'swiglu', bias: bool | None = False, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

MLP class, computing a multi-layer perceptron with weight prefetching.
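
MLP documents only its constructor here and inherits run from NNFactory; a construction-only sketch (the number and order of projection weights passed to run depend on the chosen activation and are not documented above)::

    from intel_npu_acceleration_library.backend import MLP

    # A swiglu MLP: hidden -> intermediate (gated) -> hidden.
    mlp = MLP(hidden_size=1024, intermediate_size=4096, batch=8,
              activation="swiglu", bias=False)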

intel_npu_acceleration_library.backend.qlinear module#

class intel_npu_acceleration_library.backend.qlinear.QLinear(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Quantized Linear class, computing a matrix-matrix multiplication with weight prefetching.

run(X: ndarray, W: ndarray, scale: ndarray, op_id: str) ndarray#

Run the layer: X * (W * S)^T.

Parameters:
  • X (np.ndarray) – activation

  • W (np.ndarray) – quantized weights

  • scale (np.ndarray) – quantization scale

  • op_id (str) – operation id

Raises:

RuntimeError – Input, weights or scale shape mismatch

Returns:

result

Return type:

np.ndarray
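
A sketch of the quantized path. The per-output-channel symmetric quantization below is an illustrative recipe, not necessarily the library's own; the signature only requires int8 weights plus a scale::

    import numpy as np
    from intel_npu_acceleration_library.backend import QLinear

    qfc = QLinear(inC=256, outC=128, batch=16)

    X = np.random.rand(16, 256).astype(np.float16)
    W_fp = np.random.rand(128, 256).astype(np.float32) - 0.5

    # Illustrative symmetric int8 quantization, one scale per output channel.
    scale = (np.abs(W_fp).max(axis=-1, keepdims=True) / 127.0).astype(np.float16)
    W_int8 = np.round(W_fp / scale).astype(np.int8)

    out = qfc.run(X, W_int8, scale, op_id="qfc1")   # X * (W * S)^T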

intel_npu_acceleration_library.backend.qmatmul module#

class intel_npu_acceleration_library.backend.qmatmul.QMatMul(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Quantized MatMul class, computing a matrix-matrix multiplication.

run(X: ndarray, W: ndarray, scale: ndarray) ndarray#

Run the layer: X * (W * S)^T.

Parameters:
  • X (np.ndarray) – activation

  • W (np.ndarray) – quantized weights

  • scale (np.ndarray) – quantization scale

Raises:

RuntimeError – Input, weights or scale shape mismatch

Returns:

result

Return type:

np.ndarray
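
QMatMul follows the same quantized convention but takes no op_id; a minimal self-contained sketch with an illustrative scale::

    import numpy as np
    from intel_npu_acceleration_library.backend import QMatMul

    qmm = QMatMul(inC=256, outC=128, batch=16)
    X = np.random.rand(16, 256).astype(np.float16)
    W_int8 = np.random.randint(-128, 127, (128, 256), dtype=np.int8)
    scale = np.full((128, 1), 0.01, dtype=np.float16)  # illustrative scale
    out = qmm.run(X, W_int8, scale)                    # X * (W * S)^T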

intel_npu_acceleration_library.backend.runtime module#

intel_npu_acceleration_library.backend.runtime.clear_cache()#

Clear the cache of models.

intel_npu_acceleration_library.backend.runtime.run_factory(x: Tensor, weights: List[Tensor], backend_cls: Any, op_id: str | None = None) Tensor#

Run a factory operation. Depending on the datatype of the weights, it runs a float or quantized operation.

Parameters:
  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • weights (List[torch.Tensor]) – Weights tensors. Their dtype can be torch.float16 or torch.int8

  • backend_cls (Any) – Backend class to run

  • op_id (Optional[str], optional) – Operation ID. Defaults to None.

Returns:

result

Return type:

torch.Tensor
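
A dispatch sketch using Linear as the backend class; passing the weights as a single-element list matches the List[torch.Tensor] signature, and the (outC, inC) layout is the same assumption as above::

    import torch
    from intel_npu_acceleration_library.backend import Linear, run_factory

    x = torch.rand(16, 256, dtype=torch.float16)
    w = torch.rand(128, 256, dtype=torch.float16)

    # Expected to build (and cache) a Linear backend matching these shapes.
    out = run_factory(x, [w], backend_cls=Linear, op_id="fc1")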

intel_npu_acceleration_library.backend.runtime.run_matmul(x: Tensor, weights: Tensor, scale: Tensor | None = None, op_id: str | None = None) Tensor#

Run a matmul operation. Depending on the datatype of the weights, it runs a float or quantized operation.

Parameters:
  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • weights (torch.Tensor) – Weights tensor. Its dtype can be torch.float16 or torch.int8

  • scale (Optional[torch.Tensor], optional) – Quantization scale. If weights.dtype == torch.int8 then it must be set. Defaults to None.

  • op_id (Optional[str], optional) – Operation ID. Defaults to None.

Raises:

RuntimeError – Unsupported weights datatype. Supported types: [torch.float16, torch.int8]

Returns:

result

Return type:

torch.Tensor
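
A sketch of both dispatch paths; the (outC, 1) scale shape is an assumption consistent with the quantized classes above::

    import torch
    from intel_npu_acceleration_library.backend import run_matmul

    x = torch.rand(16, 256, dtype=torch.float16)

    # Float path: float16 weights, no scale needed.
    w = torch.rand(128, 256, dtype=torch.float16)
    out = run_matmul(x, w)

    # Quantized path: int8 weights require a scale.
    w_q = torch.randint(-128, 127, (128, 256), dtype=torch.int8)
    s = torch.full((128, 1), 0.01, dtype=torch.float16)  # illustrative scale
    out_q = run_matmul(x, w_q, scale=s, op_id="qmm0")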

intel_npu_acceleration_library.backend.runtime.set_contiguous(tensor: Tensor) Tensor#

Set tensor to be contiguous in memory.

Parameters:

tensor (torch.Tensor) – input tensor

Returns:

the contiguous output tensor

Return type:

torch.Tensor

Module contents#

class intel_npu_acceleration_library.backend.Linear(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Linear class, computing a matrix-matrix multiplication with weight prefetching.

run(X: ndarray, W: ndarray, op_id: str) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • W (np.ndarray) – rhs operator

  • op_id (str) – operation id

Raises:

RuntimeError – Input or weight tensor shape mismatch

Returns:

result

Return type:

np.ndarray

class intel_npu_acceleration_library.backend.MLP(hidden_size: int, intermediate_size: int, batch: int, activation: str = 'swiglu', bias: bool | None = False, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

MLP class, computing a multi-layer perceptron with weight prefetching.

class intel_npu_acceleration_library.backend.MatMul(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

MatMul class, computing a matrix-matrix multiplication.

run(X: ndarray, W: ndarray) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • W (np.ndarray) – rhs operator

Raises:

RuntimeError – Input or weight tensor shape mismatch

Returns:

result

Return type:

np.ndarray

class intel_npu_acceleration_library.backend.NNFactory(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: BaseNPUBackendWithPrefetch

Network factory class, building models that compute a matrix-matrix multiplication with weight prefetching.

compile(output_node: _Pointer)#

Finalize and compile a model.

Parameters:

output_node (ctypes._Pointer) – Model output node

linear(input_node: _Pointer, output_channels: int, input_channels: int, bias: bool | None = False, quantize: bool = False) _Pointer#

Generate a linear layer.

Parameters:
  • input_node (ctypes._Pointer) – layer input node

  • output_channels (int) – number of output channels

  • input_channels (int) – number of input channels

  • bias (bool, optional) – enable/disable bias. Defaults to False.

  • quantize (bool, optional) – quantize linear model. Defaults to False.

Returns:

a pointer to the linear layer output node

Return type:

ctypes._Pointer

parameter(shape: Tuple[int, int], dtype: DTypeLike = numpy.float16) _Pointer#

Generate a model input parameter.

Parameters:
  • shape (Tuple[int, int]) – Parameter shape (only 2D tensors are supported at the moment)

  • dtype (np.dtype, optional) – parameter dtype; np.int8 and np.float16 are supported. Defaults to np.float16.

Raises:

RuntimeError – Unsupported shape

Returns:

a pointer to a parameter object

Return type:

ctypes._Pointer

run(X: ndarray, *weights: ndarray | Tuple[ndarray, ndarray], **kwargs: Any) ndarray#

Run the layer: X * W^T.

Parameters:
  • X (np.ndarray) – lhs operator

  • weights (Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]) – rhs operators

  • kwargs (Any) – additional arguments

Raises:

RuntimeError – Input tensor shape mismatch

Returns:

result

Return type:

np.ndarray

class intel_npu_acceleration_library.backend.QLinear(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Quantized Linear class, computing a matrix-matrix multiplication with weight prefetching.

run(X: ndarray, W: ndarray, scale: ndarray, op_id: str) ndarray#

Run the layer: X * (W * S)^T.

Parameters:
  • X (np.ndarray) – activation

  • W (np.ndarray) – quantized weights

  • scale (np.ndarray) – quantization scale

  • op_id (str) – operation id

Raises:

RuntimeError – Input, weights or scale shape mismatch

Returns:

result

Return type:

np.ndarray

class intel_npu_acceleration_library.backend.QMatMul(inC: int, outC: int, batch: int, profile: bool = False, device: str = 'NPU')#

Bases: NNFactory

Quantized MatMul class, computing a matrix-matrix multiplication.

run(X: ndarray, W: ndarray, scale: ndarray) ndarray#

Run the layer: X * (W * S)^T.

Parameters:
  • X (np.ndarray) – activation

  • W (np.ndarray) – quantized weights

  • scale (np.ndarray) – quantization scale

Raises:

RuntimeError – Input, weights or scale shape mismatch

Returns:

result

Return type:

np.ndarray

intel_npu_acceleration_library.backend.clear_cache()#

Clear the cache of models.

intel_npu_acceleration_library.backend.get_driver_version() str#

Get the driver version for the Intel® NPU Acceleration Library.

Raises:

RuntimeError – raised if the platform is not supported. Currently supported platforms are Windows and Linux

Returns:

the driver version string

Return type:

str

intel_npu_acceleration_library.backend.npu_available() bool#

Return whether the NPU is available.

Returns:

True if the NPU is available in the system

Return type:

bool
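
A small capability-check sketch combining npu_available and get_driver_version; falling back to another device via the backends' device argument is an assumption::

    from intel_npu_acceleration_library.backend import (
        get_driver_version,
        npu_available,
    )

    if npu_available():
        print(f"NPU driver version: {get_driver_version()}")
    else:
        print("No NPU detected; consider passing a different device string")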

intel_npu_acceleration_library.backend.run_factory(x: Tensor, weights: List[Tensor], backend_cls: Any, op_id: str | None = None) Tensor#

Run a factory operation. Depending on the datatype of the weights, it runs a float or quantized operation.

Parameters:
  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • weights (List[torch.Tensor]) – Weights tensors. Their dtype can be torch.float16 or torch.int8

  • backend_cls (Any) – Backend class to run

  • op_id (Optional[str], optional) – Operation ID. Defaults to None.

Returns:

result

Return type:

torch.Tensor

intel_npu_acceleration_library.backend.run_matmul(x: Tensor, weights: Tensor, scale: Tensor | None = None, op_id: str | None = None) Tensor#

Run a matmul operation. Depending on the datatype of the weights, it runs a float or quantized operation.

Parameters:
  • x (torch.Tensor) – Activation tensor. Its dtype must be torch.float16

  • weights (torch.Tensor) – Weights tensor. Its dtype can be torch.float16 or torch.int8

  • scale (Optional[torch.Tensor], optional) – Quantization scale. If weights.dtype == torch.int8 then it must be set. Defaults to None.

  • op_id (Optional[str], optional) – Operation ID. Defaults to None.

Raises:

RuntimeError – Unsupported weights datatype. Supported types: [torch.float16, torch.int8]

Returns:

result

Return type:

torch.Tensor