Welcome to Intel® NPU Acceleration Library’s documentation!
The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.
Installation
Check that your system has an available NPU (how-to).
You can install the package on your machine with:
pip install intel-npu-acceleration-library
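Once installed, you can quickly verify that the library sees your NPU. This is a minimal sketch that assumes the npu_available helper exposed by the library's backend module:

from intel_npu_acceleration_library.backend import npu_available

# Prints True if an Intel NPU device is detected on this system (assumed helper)
print("NPU available:", npu_available())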
Run a LLaMA model on the NPU
To run LLMs you also need to install the transformers library:
pip install transformers
You are now up and running! You can create a simple script like the following one to run an LLM on the NPU:
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the model quantized to int8 and ready to run on the NPU
model = NPUModelForCausalLM.from_pretrained(model_id, use_cache=True, dtype=torch.int8).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)
Take note that for a generic PyTorch model you only need to call intel_npu_acceleration_library.compile to offload the heavy computation to the NPU; helper classes such as NPUModelForCausalLM, used above, take care of this for you at load time.
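As a minimal sketch of that workflow (the tiny model and dtype below are illustrative only):

import torch
import intel_npu_acceleration_library

# Any PyTorch model works; a small MLP is used here purely for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Compile the model so its heavy computation is offloaded to the NPU
npu_model = intel_npu_acceleration_library.compile(model, dtype=torch.float16)

with torch.no_grad():
    output = npu_model(torch.randn(1, 128))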
Feel free to check the Usage and LLM sections and the examples folder for additional use cases and examples.