Intel® Extension for PyTorch* Large Language Model (LLM) Feature Get Started

Intel® Extension for PyTorch* extends optimizations to large language models (LLM). These optimizations are currently in a development and experimental phase. You are welcome to try them out on 4th Gen Intel® Xeon® Scalable processors.

System Requirements

Hardware

4th Gen Intel® Xeon® Scalable processors

OS

CentOS/RHEL 8

Linux Kernel

Intel® 4th Gen Xeon® Platinum: 5.15.0; Intel® 4th Gen Xeon® Max: 5.19.0

Python

3.9; conda is required.

Compiler

Preset in the compilation script below, if compiling from source

Installation

Prebuilt wheel files are available for Python 3.9. Alternatively, a script is provided to compile from source.

Install From Prebuilt Wheel Files

python -m pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchvision-0.16.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchaudio-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl
python -m pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.1.0.dev0%2Bcpu.llm-cp39-cp39-linux_x86_64.whl
conda install -y libstdcxx-ng=12 -c conda-forge
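
As a quick sanity check (not part of the official steps), you can verify that the installed wheels import correctly and report the expected versions:

# Verify the installation; the printed versions should match the wheels installed above
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"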

Compile From Source

Note

Make sure you are using a Python 3.9 conda environment.
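
If you do not have such an environment yet, the following is a minimal sketch for creating one; the environment name llm is an arbitrary example:

# Create and activate a Python 3.9 conda environment ("llm" is an arbitrary example name)
conda create -n llm python=3.9 -y
conda activate llm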

wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/scripts/compile_bundle.sh
sed -i "18 i conda update -y sysroot_linux-64" compile_bundle.sh
sed -i "49s|.*|python -m pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchvision-0.16.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchaudio-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl|" compile_bundle.sh
bash compile_bundle.sh

Launch Examples

Supported Models

The following 3 models are supported. When running the example scripts, replace the placeholder <MODEL_ID> in the example launch commands with the value listed for each model below; a substitution example is shown after the list:

GPT-J

“EleutherAI/gpt-j-6b”

GPT-Neox

“EleutherAI/gpt-neox-20b”

Llama 2

The model directory path output by the transformers conversion tool.* Verified with meta-llama/Llama-2-7b-chat and meta-llama/Llama-2-13b-chat.

* Llama 2 model conversion steps:

  1. Follow instructions to download model files for conversion.

  2. Decompress the downloaded model file.

  3. Follow instructions to convert the model.

  4. Launch the example scripts with the placeholder <MODEL_ID> substituted by the --output_dir argument value of the conversion script.
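
For illustration, substituting <MODEL_ID> for GPT-J turns the bfloat16 benchmark command (shown in full, with numactl settings, in the Run Examples section) into:

# Example substitution only; see Run Examples below for the complete launch command
python run_generation.py --benchmark -m EleutherAI/gpt-j-6b --dtype bfloat16 --ipex --jit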

Install Dependencies

conda install -y gperftools -c conda-forge
conda install -y intel-openmp
python -m pip install transformers==4.28.1 cpuid accelerate datasets sentencepiece protobuf==3.20.3

# [Optional] install neural-compressor for GPT-J INT8 only
python -m pip install neural-compressor==2.2

# [Optional] The following is only needed for the DeepSpeed case
git clone https://github.com/delock/DeepSpeedSYCLSupport
cd DeepSpeedSYCLSupport
git checkout gma/run-opt-branch
python -m pip install -r requirements/requirements.txt
python setup.py install
cd ../
git clone https://github.com/oneapi-src/oneCCL.git
cd oneCCL
mkdir build
cd build
cmake ..
make -j install
source _install/env/setvars.sh
cd ../..
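
# [Optional] Sanity check (not part of the official steps): confirm the DeepSpeed build can be imported
python -c "import deepspeed; print(deepspeed.__version__)"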

Note

If an error complaining that ninja is not found occurs while compiling DeepSpeed, use the conda and pip commands to uninstall all ninja packages, then reinstall ninja with pip.
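
A sketch of that clean-up, assuming the packages live in the active conda environment (skip an uninstall command if the corresponding package is not present):

# Remove conda- and pip-installed ninja packages, then reinstall ninja with pip
conda uninstall -y ninja
python -m pip uninstall -y ninja
python -m pip install ninja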

Run Examples

The following 5 Python scripts are provided in the GitHub repo example directory to launch inference workloads with the supported models.

  • run_generation.py

  • run_generation_with_deepspeed.py

  • run_gpt-j_int8.py

  • run_gpt-neox_int8.py

  • run_llama_int8.py

Preparations

A separate prompt.json file is required to run the performance benchmarks. You can use the command below to download a sample file. For simple testing, the scripts also provide a --prompt argument that takes a text string for processing.
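
For example, a quick functional test with an inline prompt might look like the following; the prompt text is arbitrary and the flag combination mirrors the bfloat16 benchmark command shown later:

# Quick functional test with an inline prompt instead of prompt.json (prompt text is arbitrary)
python run_generation.py -m <MODEL_ID> --dtype bfloat16 --ipex --jit --prompt "What is the meaning of life?"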

To get these Python scripts, you can either clone the entire GitHub repository with the git command or use the following wget commands to download individual scripts.

# Get the example scripts with git command
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.1.0.dev+cpu.llm
cd examples/cpu/inference/python/llm

# Alternatively, get individual example scripts
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_generation.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_generation_with_deepspeed.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_gpt-j_int8.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_gpt-neox_int8.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_llama_int8.py

# Get the sample prompt.json
# Make sure the downloaded prompt.json file is in the same directory as the Python scripts mentioned above.
wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/prompt.json

The following environment variables are required to achieve good performance on 4th Gen Intel® Xeon® Scalable processors.

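# Preload libstdc++ from the conda environment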
export LD_PRELOAD=${CONDA_PREFIX}/lib/libstdc++.so.6

# Setup environment variables for performance on Xeon
export KMP_BLOCKTIME=INF
export KMP_TPAUSE=0
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_FORJOIN_BARRIER_PATTERN=dist,dist
export KMP_PLAIN_BARRIER_PATTERN=dist,dist
export KMP_REDUCTION_BARRIER_PATTERN=dist,dist
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so # Intel OpenMP

# Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

Single Instance Performance

# Move the prompt.json file into the directory of the scripts
mv PATH/TO/prompt.json WORK_DIR

# bfloat16 benchmark
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_generation.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex --jit

# int8 benchmark
## (1) Do quantization to get the quantized model
mkdir saved_results

## GPT-J quantization
python run_gpt-j_int8.py --ipex-smooth-quant --lambada --output-dir "saved_results" --jit --int8-bf16-mixed -m <GPTJ MODEL_ID>
## Llama 2 quantization
python run_llama_int8.py --ipex-smooth-quant --lambada --output-dir "saved_results" --jit --int8-bf16-mixed -m <LLAMA MODEL_ID>
## GPT-NEOX quantization
python run_gpt-neox_int8.py --ipex-weight-only-quantization --lambada --output-dir "saved_results" --jit --int8 -m <GPT-NEOX MODEL_ID>

## (2) Run int8 performance test (note that GPT-NEOX uses --int8 instead of --int8-bf16-mixed)
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_<MODEL>_int8.py -m <MODEL_ID> --quantized-model-path "./saved_results/best_model.pt" --benchmark --jit --int8-bf16-mixed
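
As a concrete illustration, assuming a socket with 56 physical cores on NUMA node 0 (adjust the numbers to your machine, for example by checking lscpu), the bfloat16 benchmark command becomes:

# Illustrative values only: 56 physical cores on node 0 is an assumption, not a requirement
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_generation.py --benchmark -m EleutherAI/gpt-j-6b --dtype bfloat16 --ipex --jit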

Single Instance Accuracy

# bfloat16
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_generation.py --accuracy-only -m <MODEL_ID> --dtype bfloat16 --ipex --jit --lambada

# int8 accuracy (reuses the quantized model from the performance section)
# (1) Do the quantization to get the quantized model, as described in the performance section above
# (2) Run the int8 accuracy test (note that GPT-NEOX uses --int8 instead of --int8-bf16-mixed)
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_<MODEL>_int8.py -m <MODEL ID> --quantized-model-path "./saved_results/best_model.pt" --accuracy-only --jit --int8-bf16-mixed --lambada

Distributed Performance with DeepSpeed (autoTP)

export DS_SHM_ALLREDUCE=1
unset KMP_AFFINITY

# Move the prompt.json file into the directory of the scripts
mv PATH/TO/prompt.json WORK_DIR

# Run GPT-J/Llama 2 with bfloat16 DeepSpeed
deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex --jit

# Run GPT-NeoX with IPEX weight-only quantization
deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m EleutherAI/gpt-neox-20b --dtype float32 --ipex --jit --ipex-weight-only-quantization