ISA Dynamic Dispatching
=======================

This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.

## Overview

IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.

PyTorch & IPEX CPU ISA support statement:

| | DEFAULT | AVX2 | AVX2_VNNI | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX | AVX512_FP16 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| PyTorch | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ |
| IPEX-1.11 | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ |
| IPEX-1.12 | ✘ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
| IPEX-1.13 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ |
| IPEX-2.1 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| IPEX-2.2 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| IPEX-2.3 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |

\* The current IPEX DEFAULT level is implemented the same as the AVX2 level.

### CPU ISA build compiler requirement

| ISA Level | GCC requirement |
| ---- | ---- |
| AVX2 | Any |
| AVX512 | GCC 9.2+ |
| AVX512_VNNI | GCC 9.2+ |
| AVX512_BF16 | GCC 10.3+ |
| AVX2_VNNI | GCC 11.2+ |
| AMX | GCC 11.2+ |
| AVX512_FP16 | GCC 12.1+ |

\* Check `cmake/Modules/FindAVX.cmake` for the detailed compiler checks.

## Dynamic Dispatch Design

Dynamic dispatch copies the kernel implementation source files into multiple folders, one for each ISA level, and then builds each copy with its ISA-specific compiler parameters. Each generated object file contains its own function body (**Kernel Implementation**). The Kernel Implementation uses an anonymous namespace so that the different CPU versions won't conflict.

A **Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels. At runtime, the **Dispatch Stub implementation** checks CPUIDs and OS status to determine which ISA level pointer best matches the function body.
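Conceptually, the runtime selection works like the following simplified sketch. This is an illustration only, not the actual `DispatchStub` code; the helper names (`choose_impl`, `cpu_has_*`) are hypothetical:

```c++
// Simplified sketch of runtime ISA dispatch (illustrative only; the real
// logic lives in csrc/cpu/dyndisp/DispatchStub.h and DispatchStub.cpp).
#include <cstdio>

using fn_type = void (*)(const float* src, float* dst, int len);

// In IPEX, one pointer per ISA level is filled in by the per-ISA object
// files via IPEX_REGISTER_DISPATCH; here they are stand-ins.
void kernel_default(const float*, float*, int) { std::puts("DEFAULT"); }
void kernel_avx2(const float*, float*, int) { std::puts("AVX2"); }
void kernel_avx512(const float*, float*, int) { std::puts("AVX512"); }

// Hypothetical CPU checks; the real code queries CPUID plus OS (XCR0) state.
bool cpu_has_avx512() { return true; }
bool cpu_has_avx2() { return true; }

fn_type choose_impl() {
  if (cpu_has_avx512()) return kernel_avx512; // highest usable level wins
  if (cpu_has_avx2()) return kernel_avx2;
  return kernel_default; // in IPEX, DEFAULT is built with AVX2 flags
}

int main() {
  fn_type fn = choose_impl(); // resolved once at runtime
  fn(nullptr, nullptr, 0);
}
```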
### Code Folder Structure

>#### **Kernel implementation:** `csrc/cpu/aten/kernels/xyzKrnl.cpp`
>#### **Kernel Stub:** `csrc/cpu/aten/xyz.cpp` and `csrc/cpu/aten/xyz.h`
>#### **Dispatch Stub implementation:** `csrc/cpu/dyndisp/DispatchStub.cpp` and `csrc/cpu/dyndisp/DispatchStub.h`

### CodeGen Process

The IPEX build system generates code for each ISA level with its specific compiler parameters. The CodeGen script is located at `cmake/cpu/IsaCodegen.cmake`.

The CodeGen copies each `.cpp` file from **Kernel implementation**, and then adds the ISA level as a new file suffix.

> **Sample:**
>
> ----
>
> **Origin file:**
>
> `csrc/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp`
>
> **Generated files:**
>
> DEFAULT: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.DEFAULT.cpp -O3 -D__AVX__ -DCPU_CAPABILITY_AVX2 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=DEFAULT -DCPU_CAPABILITY_DEFAULT`
>
> AVX2: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX2.cpp -O3 -D__AVX__ -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2`
>
> AVX512: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512.cpp -O3 -D__AVX512F__ -mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512`
>
> AVX512_VNNI: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_VNNI.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mfma -DCPU_CAPABILITY=AVX512_VNNI -DCPU_CAPABILITY_AVX512_VNNI`
>
> AVX512_BF16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16`
>
> AMX: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vbmi -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX`
>
> AVX512_FP16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_FP16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vbmi -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -mavx512fp16 -DCPU_CAPABILITY_AMX -DCPU_CAPABILITY=AVX512_FP16 -DCPU_CAPABILITY_AVX512_FP16`

---

>**Note:**
>
>1. DEFAULT level kernels are not fully implemented in IPEX. To align with PyTorch, the DEFAULT level is built with AVX2 parameters instead. Therefore, the minimum requirement for a machine executing IPEX is AVX2 support.
>2. `-D__AVX__` and `-D__AVX512F__` are defined for the dependency library [sleef](https://sleef.org/).
>3. `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX2` must be defined for **PyTorch:** `aten/src/ATen/cpu/vec`; they determine the vec register width.
>4. `-DCPU_CAPABILITY=[ISA_NAME]` must be defined for **PyTorch:** `aten/src/ATen/cpu/vec`; it is used as the inline namespace name (see the sketch after this list).
>5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX512_VNNI`. However, AVX512 does not contain the AVX2 definitions, because they have different vec register widths.
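To make notes 3 and 4 concrete, here is a simplified sketch of how the `CPU_CAPABILITY` macros are consumed. This is an illustration under assumptions, not the verbatim PyTorch source; `kVecWidthBytes` is a hypothetical name:

```c++
// Illustrative sketch only -- the real code is PyTorch's
// aten/src/ATen/cpu/vec headers. Compile e.g. with:
//   -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2
namespace at {
namespace vec {
// CPU_CAPABILITY expands to DEFAULT, AVX2, AVX512, ... per compilation unit,
// so each ISA build of the same kernel lands in its own inline namespace and
// the symbols do not collide at link time.
inline namespace CPU_CAPABILITY {

#if defined(CPU_CAPABILITY_AVX512)
constexpr int kVecWidthBytes = 64; // 512-bit registers (hypothetical constant)
#else
constexpr int kVecWidthBytes = 32; // 256-bit registers; IPEX DEFAULT also
                                   // builds with AVX2 flags
#endif

} // inline namespace CPU_CAPABILITY
} // namespace vec
} // namespace at
```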
## Add Custom Kernel

If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:

1. Add the CPU ISA related kernel implementation to the folder: `csrc/cpu/aten/kernels/NewKernelKrnl.cpp`
2. Add the kernel stub to the folder: `csrc/cpu/aten/NewKernel.cpp`
3. Include the header file `csrc/cpu/dyndisp/DispatchStub.h`, and refer to the comment in the header file:

```c++
// Implements instruction set specific function dispatch.
//
// Kernels that may make use of specialized instruction sets (e.g. AVX2) are
// compiled multiple times with different compiler flags (e.g. -mavx2). A
// DispatchStub contains a table of function pointers for a kernel. At runtime,
// the fastest available kernel is chosen based on the features reported by
// cpuinfo.
//
// Example:
//
// In csrc/cpu/aten/MyKernel.h:
//   using fn_type = void(*)(const Tensor& x);
//   IPEX_DECLARE_DISPATCH(fn_type, stub);
//
// In csrc/cpu/aten/MyKernel.cpp
//   IPEX_DEFINE_DISPATCH(stub);
//
// In csrc/cpu/aten/kernels/MyKernel.cpp:
//   namespace {
//     // use anonymous namespace so that different cpu versions won't conflict
//     void kernel(const Tensor& x) { ... }
//   }
//   IPEX_REGISTER_DISPATCH(stub, &kernel);
//
// To call:
//   stub(kCPU, tensor);
```

4. Write the kernel following the guide. This involves declaring the function type, registering the stub, calling the stub, etc.

>**Note:**
>
>1. Some kernels only call a **oneDNN**, **iDeep**, or other backend implementation; these do not need to add kernel implementations. (Refer to `BatchNorm.cpp`.)
>2. Vec related header files must be included in kernel implementation files, but cannot be included in the kernel stub. The kernel stub is common code for all ISA levels and is not built with ISA related compiler parameters.
>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

### ISA intrinsics specific kernel example:

This is an FP32 to BF16 conversion function example, implemented for the `AVX512_BF16`, `AVX512` and `DEFAULT` ISA levels.

```c++
//csrc/cpu/aten/CvtFp32ToBf16.h
#pragma once

#include "csrc/cpu/dyndisp/DispatchStub.h"

namespace torch_ipex {
namespace cpu {

void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len);

namespace {

void cvt_fp32_to_bf16_kernel_impl(at::BFloat16* dst, const float* src, int len);

}

using cvt_fp32_to_bf16_kernel_fn = void (*)(at::BFloat16*, const float*, int);
IPEX_DECLARE_DISPATCH(cvt_fp32_to_bf16_kernel_fn, cvt_fp32_to_bf16_kernel_stub);

} // namespace cpu
} // namespace torch_ipex
```

```c++
//csrc/cpu/aten/CvtFp32ToBf16.cpp
#include "CvtFp32ToBf16.h"

namespace torch_ipex {
namespace cpu {

IPEX_DEFINE_DISPATCH(cvt_fp32_to_bf16_kernel_stub);

void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
  return cvt_fp32_to_bf16_kernel_stub(kCPU, dst, src, len);
}

} // namespace cpu
} // namespace torch_ipex
```

The macros `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler checks; they mean that the current compiler is capable of generating code for the given ISA level. Because `AVX512_BF16` is a higher level than `AVX512` and is compatible with it, the `CPU_CAPABILITY_AVX512_BF16` region can be contained inside the `CPU_CAPABILITY_AVX512` region.

```c++
//csrc/cpu/aten/kernels/CvtFp32ToBf16Krnl.cpp
#include <ATen/cpu/vec/vec.h>

#include "csrc/cpu/aten/CvtFp32ToBf16.h"

namespace torch_ipex {
namespace cpu {

namespace {

#if defined(CPU_CAPABILITY_AVX512)
#include <ATen/cpu/vec/vec512/vec512.h>
#else
#include <ATen/cpu/vec/vec256/vec256.h>
#endif
using namespace at::vec;

#if defined(CPU_CAPABILITY_AVX512)
#include <immintrin.h>

inline __m256i _cvt_fp32_to_bf16(const __m512 src) {
#if (defined CPU_CAPABILITY_AVX512_BF16)
  // AVX512_BF16 ISA implementation.
  return reinterpret_cast<__m256i>(_mm512_cvtneps_pbh(src));
#else
  // AVX512 ISA implementation.
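  // Plain AVX512 has no native FP32->BF16 convert, so emulate
  // round-to-nearest-even in integer registers: take bit 16 of each FP32
  // value as the lsb, add the rounding bias 0x7fff + lsb, shift right by
  // 16, and force NaN inputs to the 0xffff payload via the ordered-compare
  // mask computed below.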
  __m512i value = _mm512_castps_si512(src);
  __m512i nan = _mm512_set1_epi32(0xffff);
  auto mask_value = _mm512_cmp_ps_mask(src, src, _CMP_ORD_Q);
  __m512i ones = _mm512_set1_epi32(0x1);
  __m512i vec_bias = _mm512_set1_epi32(0x7fff);
  // uint32_t lsb = (input >> 16) & 1;
  auto t_value = _mm512_and_si512(_mm512_srli_epi32(value, 16), ones);
  // uint32_t rounding_bias = 0x7fff + lsb;
  t_value = _mm512_add_epi32(t_value, vec_bias);
  // input += rounding_bias;
  t_value = _mm512_add_epi32(t_value, value);
  // input = input >> 16;
  t_value = _mm512_srli_epi32(t_value, 16);
  // Check NaN before converting back to bf16
  t_value = _mm512_mask_blend_epi32(mask_value, nan, t_value);
  return _mm512_cvtusepi32_epi16(t_value);
#endif
}

void cvt_fp32_to_bf16_kernel_impl(
    at::BFloat16* dst,
    const float* src,
    int len) {
  int i = 0;
  for (; i < len - 15; i += 16) {
    auto f32 = _mm512_loadu_ps(src + i);
    _mm256_storeu_si256((__m256i*)(dst + i), _cvt_fp32_to_bf16(f32));
  }
  if (i < len) {
    auto mask = (1 << (len - i)) - 1;
    auto f32 = _mm512_maskz_loadu_ps(mask, src + i);
    _mm256_mask_storeu_epi16(dst + i, mask, _cvt_fp32_to_bf16(f32));
  }
}

#else // DEFAULT ISA implementation.

void cvt_fp32_to_bf16_kernel_impl(
    at::BFloat16* dst,
    const float* src,
    int len) {
  for (int j = 0; j < len; j++) {
    *(dst + j) = *(src + j);
  }
}

#endif

} // anonymous namespace

IPEX_REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);

} // namespace cpu
} // namespace torch_ipex
```

### Vec specific kernel example:

This example shows how to get the data type size and its Vec size. In different ISAs, Vec has a different register width and a different Vec size.

```c++
//csrc/cpu/aten/GetVecLength.h
#pragma once

#include "csrc/cpu/dyndisp/DispatchStub.h"

namespace torch_ipex {
namespace cpu {

std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype);

namespace {

std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
    at::ScalarType dtype);
}

using get_cpp_typesize_and_vecsize_kernel_fn =
    std::tuple<int, int> (*)(at::ScalarType);
IPEX_DECLARE_DISPATCH(
    get_cpp_typesize_and_vecsize_kernel_fn,
    get_cpp_typesize_and_vecsize_kernel_stub);

} // namespace cpu
} // namespace torch_ipex
```

```c++
//csrc/cpu/aten/GetVecLength.cpp
#include "GetVecLength.h"

namespace torch_ipex {
namespace cpu {

IPEX_DEFINE_DISPATCH(get_cpp_typesize_and_vecsize_kernel_stub);

// get cpp typesize and vectorsize by at::ScalarType
std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype) {
  return get_cpp_typesize_and_vecsize_kernel_stub(kCPU, dtype);
}

} // namespace cpu
} // namespace torch_ipex
```

```c++
//csrc/cpu/aten/kernels/GetVecLengthKrnl.cpp
#include <ATen/cpu/vec/vec.h>

#include "csrc/cpu/aten/GetVecLength.h"

namespace torch_ipex {
namespace cpu {

namespace {

std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
    at::ScalarType dtype) {
  switch (dtype) {
    case at::ScalarType::Double:
      return std::make_tuple(
          sizeof(double), at::vec::Vectorized<double>::size());
    case at::ScalarType::Float:
      return std::make_tuple(
          sizeof(float), at::vec::Vectorized<float>::size());
    case at::ScalarType::ComplexDouble:
      return std::make_tuple(
          sizeof(c10::complex<double>),
          at::vec::Vectorized<c10::complex<double>>::size());
    case at::ScalarType::ComplexFloat:
      return std::make_tuple(
          sizeof(c10::complex<float>),
          at::vec::Vectorized<c10::complex<float>>::size());
    case at::ScalarType::BFloat16:
      return std::make_tuple(
          sizeof(decltype(
              c10::impl::ScalarTypeToCPPType<at::ScalarType::BFloat16>::t)),
          at::vec::Vectorized<decltype(
              c10::impl::ScalarTypeToCPPType<
                  at::ScalarType::BFloat16>::t)>::size());
    case at::ScalarType::Half:
      return std::make_tuple(
          sizeof(decltype(
              c10::impl::ScalarTypeToCPPType<at::ScalarType::Half>::t)),
          at::vec::Vectorized<decltype(
              c10::impl::ScalarTypeToCPPType<
                  at::ScalarType::Half>::t)>::size());
    default:
      TORCH_CHECK(
          false,
          "Currently only floating and complex ScalarType are supported.");
  }
}

} // anonymous namespace

IPEX_REGISTER_DISPATCH(
    get_cpp_typesize_and_vecsize_kernel_stub,
    &get_cpp_typesize_and_vecsize_kernel_impl);

} // namespace cpu
} // namespace torch_ipex
```
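A call site in common (non-ISA-specific) code could then look like the following. This is a hypothetical usage snippet, not a file from the IPEX tree:

```c++
// Hypothetical call site for the stub defined above.
#include <iostream>

#include "csrc/cpu/aten/GetVecLength.h"

void print_float_vec_info() {
  // Dispatches at runtime to the kernel built for the current ISA level.
  auto [typesize, vecsize] =
      torch_ipex::cpu::get_cpp_typesize_and_vecsize(at::ScalarType::Float);
  std::cout << typesize << " " << vecsize << std::endl;
  // Prints "4 16" with an AVX512 binary, "4 8" with AVX2.
}
```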
supported."); } } } // anonymous namespace IPEX_REGISTER_DISPATCH( get_cpp_typesize_and_vecsize_kernel_stub, &get_cpp_typesize_and_vecsize_kernel_impl); } // namespace cpu } // namespace torch_ipex ``` ## Private Debug APIs Here are three ISA-related private APIs that can help debugging:: 1. Query current ISA level. 2. Query max CPU supported ISA level. 3. Query max binary supported ISA level. >**Note:** > >1. Max CPU supported ISA level only depends on CPU features. >2. Max binary supported ISA level only depends on built complier version. >3. Current ISA level, it is the smaller of `max CPU ISA level` and `max binary ISA level`. ### Example: ```bash python Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. >>> import intel_extension_for_pytorch._C as core >>> core._get_current_isa_level() 'AMX' >>> core._get_highest_cpu_support_isa_level() 'AMX' >>> core._get_highest_binary_support_isa_level() 'AMX' >>> quit() ``` ## Select ISA level manually. By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`, `avx512_fp16`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware. ### Example: ```bash $ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' AMX $ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' AVX2 ``` >**Note:** > >`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purpose only and subject to change. ## CPU feature check An addtional CPU feature check tool in the subfolder: `tests/cpu/isa` ```bash $ cmake . 
-- The C compiler identification is GNU 11.2.1
-- The CXX compiler identification is GNU 11.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: tests/cpu/isa

$ make
[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o
[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o
[100%] Linking CXX executable cpu_features
[100%] Built target cpu_features

$ ./cpu_features
XCR0: 00000000000602e7
os --> avx: true
os --> avx2: true
os --> avx512: true
os --> amx: true
mmx: true
sse: true
sse2: true
sse3: true
ssse3: true
sse4_1: true
sse4_2: true
aes_ni: true
sha: true
xsave: true
fma: true
f16c: true
avx: true
avx2: true
avx_vnni: true
avx512_f: true
avx512_cd: true
avx512_pf: false
avx512_er: false
avx512_vl: true
avx512_bw: true
avx512_dq: true
avx512_ifma: true
avx512_vbmi: true
avx512_vpopcntdq: true
avx512_4fmaps: false
avx512_4vnniw: false
avx512_vbmi2: true
avx512_vpclmul: true
avx512_vnni: true
avx512_bitalg: true
avx512_fp16: true
avx512_bf16: true
avx512_vp2intersect: true
amx_bf16: true
amx_tile: true
amx_int8: true
prefetchw: true
prefetchwt1: false
```
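The `XCR0` value and the `os -->` lines reflect whether the OS saves and restores the corresponding register state. A minimal standalone sketch of such a check is shown below; it is an illustration, not the actual `cpu_feature.cpp` source (build with GCC and `-mxsave`):

```c++
// Illustrative OS-level AVX512 support check via CPUID/XGETBV.
// Not the actual tests/cpu/isa implementation.
#include <immintrin.h>
#include <cpuid.h>
#include <cstdint>
#include <cstdio>

// XCR0 bits the OS must enable for AVX512 state:
// bit 1 (SSE/XMM), bit 2 (AVX/YMM), bits 5-7 (opmask, ZMM_Hi256, Hi16_ZMM).
constexpr uint64_t kAvx512Xcr0Mask = 0xE6;

static bool os_supports_avx512() {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;
  if (!(ecx & (1u << 27))) // OSXSAVE: XGETBV usable from user space
    return false;
  uint64_t xcr0 = _xgetbv(0); // read XCR0 (requires -mxsave)
  return (xcr0 & kAvx512Xcr0Mask) == kAvx512Xcr0Mask;
}

int main() {
  std::printf(
      "cpu avx512f: %s, os --> avx512: %s\n",
      __builtin_cpu_supports("avx512f") ? "true" : "false",
      os_supports_avx512() ? "true" : "false");
  return 0;
}
```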