XeTLA v0.3.6
Intel® Xe Templates for Linear Algebra - API Definition Document
 
gpu::xetla::group::group_row_reduce_store_t< dtype_acc, dtype_out, row_size, wg_size_x, wg_size_y, max_simd_len, arch_ > Struct Template Reference

#include <reduction_api.hpp>

Detailed Description

template<typename dtype_acc, typename dtype_out, uint32_t row_size, uint32_t wg_size_x, uint32_t wg_size_y, uint32_t max_simd_len = 32, gpu_arch arch_ = gpu_arch::Xe>
struct gpu::xetla::group::group_row_reduce_store_t< dtype_acc, dtype_out, row_size, wg_size_x, wg_size_y, max_simd_len, arch_ >

This is the group row reduction (reduce_sum) + cooperative write-out.

Uses SLM to exchange the data. At the start, each of the wg_size_y threads holds one row of data; together, these rows compose a wg_size_y * row_size 2D block in SLM. Each thread then loads a small wg_size_y * block_size sub-block, performs the local reduction, and writes the result to global memory.
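
The data flow can be illustrated with a minimal CPU sketch. This is an assumption-laden illustration, not the library's implementation: it assumes each thread owns block_size = row_size / wg_size_y output columns (a partitioning not stated explicitly here) and stands in for SLM with a plain array, whereas the real code runs per GPU thread and uses scattered SLM loads.

// Minimal CPU sketch of the reduce_sum data flow described above.
// Assumption: block_size = row_size / wg_size_y; "slm" is a plain
// array standing in for shared local memory.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr uint32_t wg_size_y = 4; // threads cooperating in one reduction
    constexpr uint32_t row_size = 8;  // vector size per row
    constexpr uint32_t block_size = row_size / wg_size_y;

    // Step 1: each thread keeps one row (here, row tid is filled with tid + 1).
    std::vector<float> slm(wg_size_y * row_size);
    for (uint32_t tid = 0; tid < wg_size_y; ++tid)
        for (uint32_t col = 0; col < row_size; ++col)
            slm[tid * row_size + col] = float(tid + 1);

    // Step 2: the rows now form a wg_size_y x row_size 2D block in "SLM".
    // Step 3: each thread loads a wg_size_y x block_size sub-block, reduces
    // it locally along y, and writes its slice of the output row.
    std::vector<float> out(row_size);
    for (uint32_t tid = 0; tid < wg_size_y; ++tid)
        for (uint32_t col = tid * block_size; col < (tid + 1) * block_size; ++col) {
            float acc = 0.0f;
            for (uint32_t y = 0; y < wg_size_y; ++y)
                acc += slm[y * row_size + col];
            out[col] = acc; // cooperative write-out: each thread owns block_size columns
        }

    for (float v : out)
        std::cout << v << ' '; // prints "10" for every column (1 + 2 + 3 + 4)
    std::cout << '\n';
}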

Template Parameters
dtype_acc      Is the data type used to perform the reduction.
dtype_out      Is the data type of the written-out result.
row_size       Is the vector size per row.
wg_size_x      Is the workgroup size in the x direction, i.e. the number of parallel reductions in the workgroup.
wg_size_y      Is the workgroup size in the y direction, i.e. the number of threads that participate in each reduction.
max_simd_len   Is the maximum SIMD length for the scattered load; the limitation comes from the scattered load from local memory.
arch_          Is the HW generation.
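
As a sketch of how this template might be instantiated (all parameter values below are illustrative assumptions, not requirements of the API):

#include <reduction_api.hpp>

using namespace gpu::xetla;

// Reduce in fp32 and write out fp32; the sizes must match the
// configuration of the kernel that launches this reduction.
using row_reduce_store = group::group_row_reduce_store_t<
        float,         // dtype_acc: data type used for the reduction
        float,         // dtype_out: data type written out
        128,           // row_size: vector size per row
        4,             // wg_size_x: parallel reductions in the workgroup
        8,             // wg_size_y: threads cooperating in each reduction
        32,            // max_simd_len: default max SIMD for the scattered load
        gpu_arch::Xe>; // arch_: default HW generation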