Intel® Extension for PyTorch* Large Language Model (LLM) Feature Get Started
Intel® Extension for PyTorch* extends its optimizations to large language models (LLMs). These optimizations are currently in a development and experimental phase. You are welcome to try them out on 4th Gen Intel® Xeon® Scalable processors.
System Requirements
| Hardware | 4th Gen Intel® Xeon® Scalable processors |
|---|---|
| OS | CentOS/RHEL 8 |
| Linux Kernel | Intel® 4th Gen Xeon® Platinum: 5.15.0; Intel® 4th Gen Xeon® Max: 5.19.0 |
| Python | 3.9; conda is required |
| Compiler | Preset in the compilation script below, if compiling from source |
Installation
Prebuilt wheel files are available for Python 3.9. Alternatively, a script is provided to compile from source.
Install From Prebuilt Wheel Files
python -m pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchvision-0.16.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchaudio-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl
python -m pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.1.0.dev0%2Bcpu.llm-cp39-cp39-linux_x86_64.whl
conda install -y libstdcxx-ng=12 -c conda-forge
Compile From Source
Note
Make sure you are using a Python 3.9 conda environment.
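If you do not have one yet, the following is a minimal sketch for creating and activating such an environment (the environment name llm is just an example):
# Create and activate a Python 3.9 conda environment (name is arbitrary)
conda create -y -n llm python=3.9
conda activate llm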
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/scripts/compile_bundle.sh
sed -i "18 i conda update -y sysroot_linux-64" compile_bundle.sh
sed -i "49s|.*|python -m pip install https://download.pytorch.org/whl/nightly/cpu/torch-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchvision-0.16.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl https://download.pytorch.org/whl/nightly/cpu/torchaudio-2.1.0.dev20230711%2Bcpu-cp39-cp39-linux_x86_64.whl|" compile_bundle.sh
bash compile_bundle.sh
Launch Examples
Supported Models
The following 3 models are supported. When running the example scripts, replace the placeholder <MODEL_ID> in the launch commands with the corresponding value below:
| Model | <MODEL_ID> |
|---|---|
| GPT-J | "EleutherAI/gpt-j-6b" |
| GPT-NeoX | "EleutherAI/gpt-neox-20b" |
| Llama 2 | Model directory path output from the transformers conversion tool.* Verified with meta-llama/Llama-2-7b-chat and meta-llama/Llama-2-13b-chat. |
* Llama 2 model conversion steps:
1. Follow the instructions to download the model files for conversion.
2. Decompress the downloaded model file.
3. Follow the instructions to convert the model.
4. Launch the example scripts with the placeholder <MODEL_ID> substituted by the --output_dir argument value of the conversion script (an illustrative conversion command is sketched below).
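For reference only, the conversion is typically done with the convert_llama_weights_to_hf.py script that ships with the transformers package; the paths and the --model_size value below are placeholders and must match the weights you downloaded:
# Sketch: convert downloaded Llama 2 weights to the Hugging Face format
# (paths and --model_size are placeholders)
python -m transformers.models.llama.convert_llama_weights_to_hf \
    --input_dir /PATH/TO/DOWNLOADED/llama-2 \
    --model_size 7B \
    --output_dir /PATH/TO/CONVERTED/llama-2-7b-chat
# Pass the --output_dir value as <MODEL_ID> to the launch commands below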
Install Dependencies
conda install -y gperftools -c conda-forge
conda install -y intel-openmp
python -m pip install transformers==4.28.1 cpuid accelerate datasets sentencepiece protobuf==3.20.3
# [Optional] install neural-compressor for GPT-J INT8 only
python -m pip install neural-compressor==2.2
# [Optional] The following steps are needed only for the DeepSpeed case
git clone https://github.com/delock/DeepSpeedSYCLSupport
cd DeepSpeedSYCLSupport
git checkout gma/run-opt-branch
python -m pip install -r requirements/requirements.txt
python setup.py install
cd ../
git clone https://github.com/oneapi-src/oneCCL.git
cd oneCCL
mkdir build
cd build
cmake ..
make -j install
source _install/env/setvars.sh
cd ../..
Note
If an error complaining that ninja is not found appears when compiling DeepSpeed, use conda and pip to uninstall all ninja packages, then reinstall ninja with pip, as sketched below.
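For example, the reinstallation could look like the following (assuming ninja was previously installed through both conda and pip):
# Remove existing ninja packages, then reinstall ninja with pip
conda remove -y ninja
python -m pip uninstall -y ninja
python -m pip install ninja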
Run Examples
The following 5 Python scripts are provided in the GitHub repository's example directory to launch inference workloads with the supported models.
run_generation.py
run_generation_with_deepspeed.py
run_gpt-j_int8.py
run_gpt-neox_int8.py
run_llama_int8.py
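These are standard command-line programs; assuming they expose their arguments via argparse (as the flag-style options used below suggest), the full list of options supported by each script can be printed with --help, for example:
# List the command-line options of an example script (illustrative)
python run_generation.py --help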
Preparations
A separate prompt.json file is required to run the performance benchmarks. You can use the command below to download a sample file. For simple testing, the scripts also provide a --prompt argument that takes a text string directly (a sketch is given at the end of this Preparations subsection).
To get these Python scripts, you can either clone the entire GitHub repository with git, or use the following wget commands to download individual scripts.
# Get the example scripts with git command
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.1.0.dev+cpu.llm
cd examples/cpu/inference/python/llm
# Alternatively, get individual example scripts
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_generation.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_generation_with_deepspeed.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_gpt-j_int8.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_gpt-neox_int8.py
wget https://github.com/intel/intel-extension-for-pytorch/raw/v2.1.0.dev%2Bcpu.llm/examples/cpu/inference/python/llm/run_llama_int8.py
# Get the sample prompt.json
# Make sure the downloaded prompt.json file is in the same directory as the Python scripts mentioned above.
wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/prompt.json
The following environment variables are required to achieve good performance on 4th Gen Intel® Xeon® Scalable processors.
export LD_PRELOAD=${CONDA_PREFIX}/lib/libstdc++.so.6
# Setup environment variables for performance on Xeon
export KMP_BLOCKTIME=INF
export KMP_TPAUSE=0
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_FORJOIN_BARRIER_PATTERN=dist,dist
export KMP_PLAIN_BARRIER_PATTERN=dist,dist
export KMP_REDUCTION_BARRIER_PATTERN=dist,dist
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so # Intel OpenMP
# Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
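As mentioned above, a quick functional test can skip prompt.json and pass a text directly via --prompt. The sketch below reuses the bfloat16 launch options from the next section; the prompt text is a placeholder:
# Quick smoke test with an inline prompt instead of prompt.json
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_generation.py -m <MODEL_ID> --dtype bfloat16 --ipex --jit --prompt "What is AI?"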
Single Instance Performance
# Move the prompt file to the directory of the scripts
mv PATH/TO/prompt.json WORK_DIR
# bfloat16 benchmark
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_generation.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex --jit
# int8 benchmark
## (1) Do quantization to get the quantized model
mkdir saved_results
## GPT-J quantization
python run_gpt-j_int8.py --ipex-smooth-quant --lambada --output-dir "saved_results" --jit --int8-bf16-mixed -m <GPTJ MODEL_ID>
## Llama 2 quantization
python run_llama_int8.py --ipex-smooth-quant --lambada --output-dir "saved_results" --jit --int8-bf16-mixed -m <LLAMA MODEL_ID>
## GPT-NEOX quantization
python run_gpt-neox_int8.py --ipex-weight-only-quantization --lambada --output-dir "saved_results" --jit --int8 -m <GPT-NEOX MODEL_ID>
## (2) Run int8 performance test (note that GPT-NEOX uses --int8 instead of --int8-bf16-mixed)
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_<MODEL>_int8.py -m <MODEL_ID> --quantized-model-path "./saved_results/best_model.pt" --benchmark --jit --int8-bf16-mixed
Single Instance Accuracy
# bfloat16
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_generation.py --accuracy-only -m <MODEL_ID> --dtype bfloat16 --ipex --jit --lambada
# int8 accuracy (quantization is the same as in the performance part above)
# (1) Do quantization to get the quantized model, as described above
# (2) Run the int8 accuracy test (note that GPT-NEOX uses --int8 instead of --int8-bf16-mixed)
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_<MODEL>_int8.py -m <MODEL ID> --quantized-model-path "./saved_results/best_model.pt" --accuracy-only --jit --int8-bf16-mixed --lambada
Distributed Performance with DeepSpeed (autoTP)
export DS_SHM_ALLREDUCE=1
unset KMP_AFFINITY
# Move the prompt file to the directory of the scripts
mv PATH/TO/prompt.json WORK_DIR
# Run GPTJ/LLAMA with bfloat16 DeepSpeed
deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex --jit
# Run GPT-NeoX with ipex weight only quantization
deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m EleutherAI/gpt-neox-20b --dtype float32 --ipex --jit --ipex-weight-only-quantization