Welcome to Intel® NPU Acceleration Library’s documentation!#

The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.

Installation#

Check that your system has an available NPU (how-to).
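
If you want to verify this programmatically, one possible check is to ask OpenVINO (which the library builds on) whether an NPU device is exposed. This is a minimal sketch, assuming the openvino package installed as a dependency of the library:

import openvino as ov

# List the devices OpenVINO can see; an available NPU shows up as "NPU".
devices = ov.Core().available_devices
print("NPU available:", "NPU" in devices)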

You can install the package on your machine with

pip install intel-npu-acceleration-library

Run a LLaMA model on the NPU#

To run LLMs you need to install the transformers library

pip install transformers

You are now up and running! You can create a simple script like the following one to run an LLM on the NPU

from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

Note that you only need to call intel_npu_acceleration_library.compile to offload the heavy computation to the NPU.
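
The same call also works for generic PyTorch models, not just LLMs. Here is a minimal sketch; the layer sizes and the float16 dtype below are illustrative choices, not requirements of the library:

import torch
import intel_npu_acceleration_library

# Any torch.nn.Module can be compiled the same way; this tiny MLP is just an example.
net = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Compile the model so its heavy computations are offloaded to the NPU.
npu_net = intel_npu_acceleration_library.compile(net, dtype=torch.float16)

with torch.no_grad():
    out = npu_net(torch.rand(1, 128))
print(out.shape)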

Feel free to check the Usage and LLM pages and the examples folder for additional use cases and examples.

Site map#

Developer guide:

Indices and tables#