Large Language Models (LLM) Optimizations Overview
==================================================

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs have a GPT-like architecture that consists of multiple decoder layers, and the MultiHeadAttention and FeedForward layers are the two key components of every decoder layer. The generation task is memory bound because of iterative decoding, and the kv_cache needs special management to reduce memory overhead.

Intel® Extension for PyTorch* provides many optimizations specific to these LLMs. At the operator level, it provides highly efficient GEMM kernels to speed up Linear layers and customized operators to reduce the memory footprint. To better balance performance and accuracy, several low-precision solutions, e.g., SmoothQuant and weight-only quantization, are also enabled. In addition, tensor parallelism can be adopted to get lower latency for LLMs.

These LLM-specific optimizations can be applied automatically with a single frontend API function in the Python interface, ``ipex.optimize_transformers()``. Check `optimize_transformers <./llm/llm_optimize_transformers.md>`_ for more details.

.. toctree::
   :hidden:
   :maxdepth: 1

   llm/llm_optimize_transformers

Supported Models
----------------

.. list-table::
   :widths: auto
   :header-rows: 1

   * - Model Family
     - LLAMA
     - GPT-J
     - GPT-NEOX
     - FALCON*
     - OPT
   * - Verified model IDs (Hugging Face Hub)
     - "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf"
     - "EleutherAI/gpt-j-6b"
     - "EleutherAI/gpt-neox-20b"
     - "tiiuae/falcon-40b"
     - "facebook/opt-30b", "facebook/opt-1.3b"
   * - FP32/BF16
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - Weight only quantization INT8
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - Weight only quantization INT4
     - ✅
     - ✅
     - ✅
     - ✅
     - ✅
   * - Static quantization INT8
     - ✅
     - ✅
     - ❎\*\*
     - ❎\*\*
     - ❎\*\*

\* For Falcon models from the remote hub, the config.json must be modified to use the modeling_falcon.py in transformers. Therefore, the following scripts need an extra configuration file such as "--config-file=model_config/tiiuae_falcon-40b_config.json". This is optional for FP32/BF16 but required for quantization.

\*\* For GPT-NEOX/FALCON/OPT models, the accuracy recipes for static quantization INT8 are not ready, so they are skipped in our coverage.

*Note*: The verified models above (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations, such as indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). For other LLM model families, work is in progress to cover those optimizations, which will expand the model list above.

Check `LLM best known practice `_ for instructions to install and set up the environment and for example scripts.

Demos
-----

Intel® Extension for PyTorch* LLM optimizations can be integrated into a typical LLM Q&A web service.

.. list-table::

   * - .. image:: ../../images/llm/GenAI-bf16.gif
          :width: 500
          :alt: UI with BF16

     - .. image:: ../../images/llm/GenAI-int8.gif
          :width: 500
          :alt: UI with INT8

The following figures show demos of the Llama 2 and GPT-J models running single-instance inference and distributed inference with DeepSpeed, using lower-precision data types.

.. list-table::

   * - .. figure:: ../../images/llm/bf16_llama.gif
          :width: 300
          :alt: Llama 2 with BF16

          a

     - .. figure:: ../../images/llm/smoothquant_int8_llama.gif
          :width: 300
          :alt: Llama 2 with INT8 Quantization with SmoothQuant

          b

     - .. figure:: ../../images/llm/woq_int8_llama.gif
          :width: 300
          :alt: Weight Only Quantization with INT8 for Llama 2

          c

   * - .. figure:: ../../images/llm/woq_int4_gptj.gif
          :width: 300
          :alt: Weight Only Quantization with INT4 for GPT-J

          d

     - .. figure:: ../../images/llm/autotp_bf16_llama.gif
          :width: 300
          :alt: Distributed Inference with DeepSpeed with BF16 on Llama 2 with AutoTP feature

          e

     - .. figure:: ../../images/llm/autotp_woq_int8_llama.gif
          :width: 300
          :alt: Distributed Inference with DeepSpeed with Weight Only Quantization INT8 on Llama 2 with AutoTP feature

          f

Figure Legends:

a. Llama 2 model with BF16
b. Llama 2 model with INT8 Quantization with SmoothQuant technique
c. Llama 2 model with INT8 Weight Only Quantization
d. GPT-J model with INT4 Weight Only Quantization
e. Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature on BF16
f. Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature with Weight Only Quantization INT8

Optimization Methodologies
--------------------------

A brief introduction to these optimizations follows.

Linear Operator Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Linear operator is the most obvious hotspot in LLM inference. Intel® Extension for PyTorch* has three backends to speed up linear GEMM kernels: oneDNN, Tensor Processing Primitives (TPP), which is also used by the `Fast BERT feature <./fast_bert.md>`_, and customized linear kernels for weight-only quantization. All of them use a specific block format to utilize hardware resources efficiently.

Low Precision Data Types
~~~~~~~~~~~~~~~~~~~~~~~~

As Generative AI (GenAI) workloads and models become more popular, the large language models (LLMs) used in these workloads keep growing in parameter count.
The increasing size of LLMs improves workload accuracy; however, it also leads to significantly heavier computation and places higher demands on the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.

Quantization to shorter data types reduces memory I/O traffic and the cost of computation on the CPU. Moreover, shorter data types make it possible to keep more data in the CPU cache, which reduces the number of memory accesses. Compared with cache access, memory access is much more time-consuming. From the computation perspective specifically, the AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction-level acceleration for INT8 computations.

In addition to mixed precision and the native INT8 quantization solutions in PyTorch, e.g., post-training static quantization and dynamic quantization, `SmoothQuant `_ and weight-only quantization (both INT8 and INT4 weights are supported) are also enabled in Intel® Extension for PyTorch* to get better accuracy and performance than the native solutions.

Intel® Extension for PyTorch* speeds up INT8 computations by leveraging oneDNN and oneDNN graph as the backend. Static quantization in Intel® Extension for PyTorch* provides a default recipe that automatically decides which operators to quantize. Its backend, oneDNN graph, brings matrix-multiplication-based fusions for commonly seen operator patterns and other common fusions such as quantization + data type casting. These fusions help achieve the best computation cache locality and efficiency, and thus significantly reduce INT8 quantization overhead.

Intel® Extension for PyTorch* also delivers INT4 optimizations via 4-bit weight-only quantization (WOQ).
As the name indicates, WOQ quantizes only the weights to 4-bit integers, which further improves computation efficiency by saving memory bandwidth. This technique reduces text generation latency, especially from the second token onward. AMX INT8 instructions and fusions are also applied to these computations.

Indirect Access KV Cache
~~~~~~~~~~~~~~~~~~~~~~~~

kv_cache is used to reduce computation in the decoder layers, but it also brings memory overhead. For example, with beam search, the kv_cache must be reordered according to the latest beam index, and the current key/value must be concatenated with the kv_cache in the attention layer to obtain the entire context for the scaled dot product. When the sequence is very long, the memory overhead caused by reorder_cache and concat becomes a performance bottleneck.

Indirect Access KV Cache (IAKV) is provided to reduce this overhead. First, IAKV pre-allocates buffers (key and value use separate buffers) to store all key/value hidden states and beam index information. The data format is shown in the left figure below (beam_width=4 in this case), and the key (value) token state at every timestep is stored in this pre-allocated buffer. Second, the beam index history, shown in the right figure below, determines which beam a timestep should use. This information yields an offset for accessing the kv_cache buffer, which eliminates the reorder_cache and concat overhead.

.. image:: ../../images/llm/llm_iakv_1.png
  :width: 400
  :alt: The key/value cache data format

.. image:: ../../images/llm/llm_iakv_2.png
  :width: 400
  :alt: The beam idx trace for every step

Graph Optimization
~~~~~~~~~~~~~~~~~~

Operator fusion is generally used to enable sub-graph fusion and reduce the memory footprint.
Besides linear post-op fusion, e.g., linear + activation function, many customized operators are also provided in Intel® Extension for PyTorch* for further performance improvement, for example, Rotary Position Embedding (ROPE) and Root Mean Square Layer Normalization (RMSNorm).

Distributed Inference
~~~~~~~~~~~~~~~~~~~~~

All the optimizations above already help you get very good performance with a single instance. To further reduce inference latency and improve throughput, tensor parallelism is also enabled in our solution. You can first use DeepSpeed to automatically shard the model and then apply the optimizations above with the frontend API function provided by Intel® Extension for PyTorch*.
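
As a rough illustration of the tensor parallelism idea mentioned above, the following pure-Python sketch shards the weight of a linear layer by columns across a number of simulated ranks, lets each rank compute its partial output, and concatenates the partial outputs. It is only a conceptual sketch and does not reflect how DeepSpeed or Intel® Extension for PyTorch* shard a model. The function names ``matmul``, ``shard_columns``, and ``tensor_parallel_linear`` are hypothetical and exist only for this example.

```python
# Conceptual sketch of column-wise tensor parallelism for one linear layer.
# All names here are hypothetical; no real DeepSpeed or IPEX API is used.


def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def shard_columns(w, world_size):
    """Split the columns of w into world_size contiguous shards, one per rank."""
    cols = len(w[0])
    assert cols % world_size == 0, "this sketch assumes an even column split"
    step = cols // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def tensor_parallel_linear(x, w, world_size):
    """Each simulated rank computes x @ its shard; joining columns mimics an all-gather."""
    partials = [matmul(x, shard) for shard in shard_columns(w, world_size)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]              # two input rows with two features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weight with four output features

full = matmul(x, w)                                   # single-instance result
sharded = tensor_parallel_linear(x, w, world_size=2)  # two simulated ranks
print(sharded == full)
```

Each simulated rank only touches its own slice of the weight, so the per-rank memory traffic and compute shrink with the number of ranks, at the cost of a communication step to assemble the output.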