Large Language Models (LLM) Optimizations Overview
In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs follow a GPT-like architecture that consists of multiple decoder layers, and the MultiHeadAttention and FeedForward layers are the two key components of every decoder layer. The generation task is memory bound because decoding is iterative, and the kv_cache needs special management to reduce memory overhead. Intel® Extension for PyTorch* provides many optimizations specific to these LLMs. At the operator level, it provides highly efficient GEMM kernels to speed up the Linear layer and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant and weight-only quantization, are also enabled. In addition, tensor parallelism can be adopted to get lower latency for LLMs.
These LLM-specific optimizations can be applied automatically with a single frontend API function in the Python interface, ipex.optimize_transformers(). Check optimize_transformers for more details.
Supported Models
| Model Family | LLAMA | GPT-J | GPT-NEOX | FALCON* | OPT |
|---|---|---|---|---|---|
| Verified < MODEL ID > (Huggingface hub) | “meta-llama/Llama-2-7b-hf”, “meta-llama/Llama-2-13b-hf”, “meta-llama/Llama-2-70b-hf” | “EleutherAI/gpt-j-6b” | “EleutherAI/gpt-neox-20b” | “tiiuae/falcon-40b” | “facebook/opt-30b”, “facebook/opt-1.3b” |
| FP32/BF16 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Weight only quantization INT8 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Weight only quantization INT4 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Static quantization INT8 | ✅ | ✅ | ❎** | ❎** | ❎** |
* For Falcon models from the remote hub, config.json must be modified to use the modeling_falcon.py in transformers. Therefore, the following scripts need an extra configuration file passed in, like “--config-file=model_config/tiiuae_falcon-40b_config.json”. This is optional for FP32/BF16 but required for quantization.
** For GPT-NEOX/FALCON/OPT models, the accuracy recipes for static quantization INT8 are not ready, so these configurations are skipped in our coverage.
Note: The verified models above (and other models in the same model family, like “codellama/CodeLlama-7b-hf” from the LLAMA family) are well supported with all optimizations, such as indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM model families, work is in progress to cover those optimizations, which will expand the model list above.
Check LLM best known practice for instructions to install/set up the environment and for example scripts.
Demos
Intel® Extension for PyTorch* LLM optimizations can be integrated into a typical LLM Q&A web service.
The following figures show demos of the Llama 2 and GPT-J models running single-instance inference and distributed inference with DeepSpeed, using lower-precision data types.
Figure Legends:
Llama 2 model with BF16
Llama 2 model with INT8 Quantization with SmoothQuant technique
Llama 2 model with INT8 Weight Only Quantization
GPT-J model with INT4 Weight Only Quantization
Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature on BF16
Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature with Weight Only Quantization INT8
Optimization Methodologies
These optimizations are briefly introduced as follows:
Linear Operator Optimization
The linear operator is the most obvious hotspot in LLM inference. Intel® Extension for PyTorch* has three backends to speed up linear GEMM kernels: oneDNN, Tensor Processing Primitives (TPP), which is also used by the Fast BERT feature, and customized linear kernels for weight-only quantization. All of them use a specific block format to utilize hardware resources efficiently.
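The idea of a block format can be sketched in plain Python. This is only an illustration under assumptions: the tile sizes, the tile ordering, and the function names `pack_weight` and `linear_packed` are hypothetical and do not reflect the layouts that oneDNN, TPP, or the weight-only quantization kernels actually use. The weight is reordered once into small tiles so that each output block reads its data contiguously.

```python
def pack_weight(weight, block_k, block_n):
    """Reorder a K x N weight (list of rows) into tiles of block_k x block_n.

    Result layout: packed[n_block][k_block][row_in_tile][col_in_tile].
    """
    k, n = len(weight), len(weight[0])
    assert k % block_k == 0 and n % block_n == 0
    packed = []
    for nb in range(0, n, block_n):
        column_of_tiles = []
        for kb in range(0, k, block_k):
            tile = [weight[kb + i][nb:nb + block_n] for i in range(block_k)]
            column_of_tiles.append(tile)
        packed.append(column_of_tiles)
    return packed


def linear_packed(x, packed, block_k, block_n):
    """Compute y = x @ W for one input row x, walking the packed tiles."""
    y = []
    for column_of_tiles in packed:
        acc = [0.0] * block_n  # accumulator for one block of outputs
        for t, tile in enumerate(column_of_tiles):
            for i in range(block_k):
                xi = x[t * block_k + i]
                row = tile[i]
                for j in range(block_n):
                    acc[j] += xi * row[j]
        y.extend(acc)
    return y


def linear_reference(x, weight):
    """Plain row-major matrix-vector product, used to check the packed version."""
    return [sum(x[i] * weight[i][j] for i in range(len(weight)))
            for j in range(len(weight[0]))]


weight = [[float(i * 4 + j) for j in range(4)] for i in range(4)]  # 4 x 4
x = [1.0, 2.0, 3.0, 4.0]
packed = pack_weight(weight, block_k=2, block_n=2)
assert linear_packed(x, packed, 2, 2) == linear_reference(x, weight)
```

Packing happens once, ahead of inference, so its cost is amortized over every later call that reuses the packed weight.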
Low Precision Data Types
While Generative AI (GenAI) workloads and models are getting more and more popular, the large language models (LLMs) used in these workloads have more and more parameters. The increasing size of LLMs improves workload accuracy; however, it also leads to significantly heavier computation and places higher requirements on the underlying hardware. Given that, quantization becomes an increasingly important methodology for inference workloads.
Quantization to shorter data types improves memory IO throughput and computation throughput on the CPU. Moreover, shorter data types make it possible to keep more data in the CPU cache, thus reducing the number of memory accesses, which are much more time-consuming than cache accesses. From the computation perspective, the AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with 4th Generation Intel® Xeon® Scalable Processors, provide instruction-level acceleration for INT8 computations.
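As a rough worked example of why shorter data types relieve memory pressure, the snippet below computes the storage needed for the weights alone at each precision. The parameter count of 7 billion is an assumed round number, and the calculation ignores quantization scales, zero points, activations, and the KV cache.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "INT8": 8, "INT4": 4}


def weight_footprint_gb(num_params, bits):
    """Weight storage in gigabytes (10**9 bytes), weights only."""
    return num_params * bits / 8 / 1e9


num_params = 7_000_000_000  # assumed round number, not a measured model size
footprint = {name: weight_footprint_gb(num_params, bits)
             for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in footprint.items():
    print(f"{name}: {gb:.1f} GB")
# FP32: 28.0 GB
# BF16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

Because each decoding step reads the weights again, a smaller footprint translates directly into less memory traffic per generated token.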
Besides mixed precision and native INT8 quantization solutions, e.g., post-training static quantization and dynamic quantization in PyTorch, Intel® Extension for PyTorch* also enables SmoothQuant and weight-only quantization (both INT8 and INT4 weights are supported) to get better accuracy and performance than the native solutions.
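The core idea of SmoothQuant can be sketched as follows. This is a minimal illustration of the smoothing step described in the SmoothQuant paper, not the recipe used by Intel® Extension for PyTorch*: the function names, the sample values, and `alpha=0.5` are assumptions. Each input channel of the activations is divided by a scale and the matching weight row is multiplied by it, so the product is unchanged while activation outliers shrink and become easier to quantize.

```python
def smoothing_scales(acts, weight, alpha=0.5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    Assumes every channel has at least one nonzero activation and weight.
    """
    scales = []
    for j in range(len(weight)):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(v) for v in weight[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


acts = [[0.5, 40.0], [-1.0, -60.0]]  # tokens x C_in; channel 1 holds outliers
weight = [[0.2, -0.4], [0.1, 0.3]]   # C_in x C_out

s = smoothing_scales(acts, weight)
smoothed_acts = [[v / s[j] for j, v in enumerate(row)] for row in acts]
smoothed_weight = [[v * s[j] for v in weight[j]] for j in range(len(weight))]

reference = matmul(acts, weight)
smoothed = matmul(smoothed_acts, smoothed_weight)

# The smoothed pair gives the same output up to floating-point rounding.
assert all(abs(a - b) < 1e-9
           for ra, rb in zip(reference, smoothed) for a, b in zip(ra, rb))
```

After smoothing, both the activations and the weights are quantized to INT8; that step is omitted here.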
Intel® Extension for PyTorch* speeds up INT8 computations by leveraging oneDNN and oneDNN graph as the backend. Static quantization in Intel® Extension for PyTorch* provides a default recipe that automatically decides which operators to quantize. Its backend, oneDNN graph, brings matrix-multiplication-based fusions for commonly seen operator patterns and other common fusions such as quantization + data type casting. These fusions improve cache locality and computation efficiency, and thus significantly reduce INT8 quantization overhead.
Intel® Extension for PyTorch* also delivers INT4 optimizations via 4-bit weight-only quantization (WOQ). As the name indicates, WOQ quantizes only the weights to 4-bit integers, which further improves computation efficiency by reducing memory bandwidth usage. This technique reduces text generation latency, especially from the second token onward. AMX INT8 instructions and fusions are also applied to keep these computations performant.
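A minimal sketch of 4-bit weight-only quantization, assuming a simple group-wise scheme with a scale and an offset per group. The group size, the rounding scheme, the nibble order, and all function names are illustrative assumptions, not the library's actual kernels or storage format. Two 4-bit codes are packed into each byte, and the weights are dequantized before use.

```python
def quantize_int4(weights, group_size):
    """Quantize each group of weights to integer codes in 0..15."""
    codes, scales, offsets = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 if hi > lo else 1.0  # 16 levels per group
        codes.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
        scales.append(scale)
        offsets.append(lo)
    return codes, scales, offsets


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count drops any padding."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]


def dequantize_int4(codes, scales, offsets, group_size):
    """Map codes back to approximate floating-point weights."""
    return [codes[i] * scales[i // group_size] + offsets[i // group_size]
            for i in range(len(codes))]


weights = [0.12, -0.53, 0.98, 0.07, -1.20, 0.44, 0.31, -0.09]
codes, scales, offsets = quantize_int4(weights, group_size=4)
packed = pack_nibbles(codes)  # 8 weights stored in 4 bytes
restored = dequantize_int4(unpack_nibbles(packed, len(weights)),
                           scales, offsets, group_size=4)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme the rounding error per weight is bounded by half of its group's scale, so smaller groups trade extra scale storage for better accuracy.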
Indirect Access KV Cache
The kv_cache reduces computation in the decoder layers, but it also brings memory overhead. For example, with beam search, the kv_cache must be reordered according to the latest beam indices, and the current key/value must be concatenated with the kv_cache in the attention layer to obtain the entire context for scaled dot-product attention. When the sequence is very long, the memory overhead caused by reorder_cache and concat becomes a performance bottleneck. Indirect Access KV_cache (IAKV) is provided to reduce this overhead. First, IAKV pre-allocates buffers (key and value use separate buffers) to store all key/value hidden states and beam index information. The data format is shown in the following left figure (beam_width=4 in this case), and the key (value) token state at every time step is stored in this pre-allocated buffer. Second, the beam index history, shown in the following right figure, determines which beam a given time step should use; this information generates an offset for accessing the kv_cache buffer, which eliminates the reorder_cache and concat overhead.
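The indirect access scheme can be sketched in plain Python. This is a simplified model under assumptions: the class `IndirectKVCache`, its method names, and the use of short strings in place of key vectors are hypothetical, and the real buffer layout and attention kernel are more involved. Keys are written once into a pre-allocated buffer, and a beam's history is recovered by following the stored parent-beam indices backward instead of reordering and concatenating the cache at every step.

```python
class IndirectKVCache:
    """Pre-allocated key buffer plus beam index history (keys only, for brevity)."""

    def __init__(self, max_steps, beam_width):
        self.keys = [[None] * beam_width for _ in range(max_steps)]
        self.parents = [[0] * beam_width for _ in range(max_steps)]
        self.step = 0

    def append(self, new_keys, parent_beams):
        # Write into the pre-allocated slots; earlier steps are never moved.
        self.keys[self.step][:] = new_keys
        self.parents[self.step][:] = parent_beams
        self.step += 1

    def history(self, beam):
        """Follow parent indices backward to collect this beam's keys."""
        out = [None] * self.step
        slot = beam
        for t in range(self.step - 1, -1, -1):
            out[t] = self.keys[t][slot]
            slot = self.parents[t][slot]
        return out


def reorder_and_concat(cache, new_keys, parent_beams):
    """Conventional approach: copy each parent's history, then append the new key."""
    return [cache[p] + [k] for p, k in zip(parent_beams, new_keys)]


# Three decoding steps with beam_width=3; strings stand in for key vectors.
steps = [
    (["a0", "b0", "c0"], [0, 1, 2]),
    (["a1", "b1", "c1"], [0, 0, 2]),
    (["a2", "b2", "c2"], [1, 2, 2]),
]

indirect = IndirectKVCache(max_steps=8, beam_width=3)
conventional = [[], [], []]
for new_keys, parent_beams in steps:
    indirect.append(new_keys, parent_beams)
    conventional = reorder_and_concat(conventional, new_keys, parent_beams)

# Both approaches see the same context for every beam.
assert [indirect.history(b) for b in range(3)] == conventional
```

The conventional path copies the whole history on every step, while the indirect path stores each key once and pays only for index lookups when the context is read.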
Graph Optimization
Operator fusion is generally used to fuse sub-graphs and reduce the memory footprint. In addition to linear post-op fusion, e.g., linear + activation function, Intel® Extension for PyTorch* provides many customized operators for further performance improvement, for example, Rotary Position Embedding (ROPE) and Root Mean Square Layer Normalization (RMSNorm).
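For reference, the math behind the two operators named above can be written as a short sketch. It shows only what the operators compute, not how the fused kernels are implemented. The function names are hypothetical, and the rotary embedding here uses the interleaved-pair convention with a base of 10000, which is an assumption; some model families pair the first and second halves of the vector instead.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """y_i = x_i / sqrt(mean(x**2) + eps) * gain_i"""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]


def rotary_embedding(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by angle position * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos, sin = math.cos(angle), math.sin(angle)
        out.append(x[i] * cos - x[i + 1] * sin)
        out.append(x[i] * sin + x[i + 1] * cos)
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Attention scores depend only on the relative distance between positions.
score_a = dot(rotary_embedding(q, 5), rotary_embedding(k, 3))
score_b = dot(rotary_embedding(q, 2), rotary_embedding(k, 0))
assert abs(score_a - score_b) < 1e-9
```

Written as separate elementwise steps, each of these operators would read and write intermediate tensors several times; a fused operator aims to produce the result in a single pass.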
Distributed Inference
All the optimizations above already help you get very good performance with a single instance. To further reduce inference latency and improve throughput, tensor parallelism is also enabled in our solution. You can first use DeepSpeed to automatically shard the model and then apply the optimizations above with the frontend API function provided by Intel® Extension for PyTorch*.
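The principle behind tensor parallelism can be illustrated with a two-layer feed-forward block split across two simulated ranks. This is a conceptual sketch only: it does not call DeepSpeed, the helper names are hypothetical, ReLU stands in for the model's activation, and the column-then-row split is a common sharding pattern assumed here rather than a description of what AutoTP does for any particular model.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(w, parts):
    """Shard a weight by output columns, one shard per rank."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def split_rows(w, parts):
    """Shard a weight by input rows, one shard per rank."""
    height = len(w) // parts
    return [w[p * height:(p + 1) * height] for p in range(parts)]


def all_reduce_sum(partials):
    """Elementwise sum of per-rank results, standing in for a collective."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, -2.0]]                                            # 1 x 2 input
w1 = [[1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 1.0, -1.0]]          # 2 x 4
w2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]        # 4 x 2

reference = matmul(relu(matmul(x, w1)), w2)

ranks = 2
w1_shards = split_columns(w1, ranks)
w2_shards = split_rows(w2, ranks)
# Each rank works on its own shards; only the final sum needs communication.
partials = [matmul(relu(matmul(x, w1_shards[r])), w2_shards[r])
            for r in range(ranks)]
sharded = all_reduce_sum(partials)

assert sharded == reference  # both equal [[29.0, 34.0]]
```

Each rank holds and reads only half of the weights, which is why sharding can lower per-token latency for memory-bound decoding, at the cost of one reduction per block.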