XeTLA v0.3.6
Intel® Xe Templates for Linear Algebra - API Definition Document
 
gpu::xetla::group::group_row_reduce_store_t< dtype_acc, dtype_out, row_size, wg_size_x, wg_size_y, max_simd_len, arch_ > Struct Template Reference

#include <reduction_api.hpp>

Detailed Description

template<typename dtype_acc, typename dtype_out, uint32_t row_size, uint32_t wg_size_x, uint32_t wg_size_y, uint32_t max_simd_len = 32, gpu_arch arch_ = gpu_arch::Xe>
struct gpu::xetla::group::group_row_reduce_store_t< dtype_acc, dtype_out, row_size, wg_size_x, wg_size_y, max_simd_len, arch_ >

This is the group row reduction (reduce_sum) + cooperative write-out.

Uses SLM to exchange the data. At the start, each of the wg_size_y threads holds one row of data; together, these rows compose a wg_size_y * row_size 2D block in SLM. Each thread then loads a small wg_size_y * block_size sub-block, performs the local reduction, and writes the result to global memory.
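
The data flow can be illustrated with a minimal CPU sketch. This is an assumption-laden illustration, not the library's implementation: it assumes each thread owns block_size = row_size / wg_size_y output columns (a partitioning not stated explicitly here) and stands in for SLM with a plain array, whereas the real code runs per GPU thread and uses scattered SLM loads.

// Minimal CPU sketch of the reduce_sum data flow described above.
// Assumption: block_size = row_size / wg_size_y; "slm" is a plain
// array standing in for shared local memory.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr uint32_t wg_size_y = 4; // threads cooperating in one reduction
    constexpr uint32_t row_size = 8;  // vector size per row
    constexpr uint32_t block_size = row_size / wg_size_y;

    // Step 1: each thread keeps one row (here, row tid is filled with tid + 1).
    std::vector<float> slm(wg_size_y * row_size);
    for (uint32_t tid = 0; tid < wg_size_y; ++tid)
        for (uint32_t col = 0; col < row_size; ++col)
            slm[tid * row_size + col] = float(tid + 1);

    // Step 2: the rows now form a wg_size_y x row_size 2D block in "SLM".
    // Step 3: each thread loads a wg_size_y x block_size sub-block, reduces
    // it locally along y, and writes its slice of the output row.
    std::vector<float> out(row_size);
    for (uint32_t tid = 0; tid < wg_size_y; ++tid)
        for (uint32_t col = tid * block_size; col < (tid + 1) * block_size; ++col) {
            float acc = 0.0f;
            for (uint32_t y = 0; y < wg_size_y; ++y)
                acc += slm[y * row_size + col];
            out[col] = acc; // cooperative write-out: each thread owns block_size columns
        }

    for (float v : out)
        std::cout << v << ' '; // prints "10" for every column (1 + 2 + 3 + 4)
    std::cout << '\n';
}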

Template Parameters
dtype_acc      Is the data type used to perform the reduction.
dtype_out      Is the data type of the written-out result.
row_size       Is the vector size per row.
wg_size_x      Is the workgroup size in the x direction, i.e. the number of parallel reductions in the workgroup.
wg_size_y      Is the workgroup size in the y direction, i.e. the number of threads that participate in each reduction.
max_simd_len   Is the maximum SIMD length for the scattered load; the limitation comes from the scattered load from local memory.
arch_          Is the HW generation.
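
As a sketch of how this template might be instantiated (all parameter values below are illustrative assumptions, not requirements of the API):

#include <reduction_api.hpp>

using namespace gpu::xetla;

// Reduce in fp32 and write out fp32; the sizes must match the
// configuration of the kernel that launches this reduction.
using row_reduce_store = group::group_row_reduce_store_t<
        float,         // dtype_acc: data type used for the reduction
        float,         // dtype_out: data type written out
        128,           // row_size: vector size per row
        4,             // wg_size_x: parallel reductions in the workgroup
        8,             // wg_size_y: threads cooperating in each reduction
        32,            // max_simd_len: default max SIMD for the scattered load
        gpu_arch::Xe>; // arch_: default HW generation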