This is the group row reduction (reduce_sum) followed by a cooperative write-out.
#include <reduction_api.hpp>
Shared local memory (SLM) is used to exchange the data. At the start, each of the wg_size_y threads holds one row of data. Together they compose a wg_size_y * row_size 2D block in SLM. Each thread then loads a smaller wg_size_y * block_size block, performs the local reduction, and writes the result to global memory.
| dtype_acc | Is the data type used to do the reduction. |
| dtype_out | Is the data type of the written output. |
| row_size | Is the vector size per row |
| wg_size_x | Is the workgroup size in the x direction, i.e. the number of parallel reductions in the workgroup. |
| wg_size_y | Is the workgroup size in the y direction, i.e. the number of threads that participate in this reduction. |
| max_simd_len | Is the maximum SIMD length for a scattered load. The limit comes from scattered loads from local memory. |
| arch_ | Is the HW generation. |
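
The SLM exchange described above can be illustrated with a small host-side model. This is only a sketch and not the library's implementation: the function name `group_row_reduce_sum_model` is hypothetical, a plain `std::vector` stands in for SLM, loops stand in for the threads, and `block_size` is assumed to be `row_size / wg_size_y` so that the columns split evenly across threads. The real kernel may choose the block size differently (for example, bounded by `max_simd_len`). The model covers a single reduction, that is, one of the `wg_size_x` parallel reductions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the SLM-based row reduction.
// Not part of reduction_api.hpp; names and block_size choice are assumptions.
template <typename dtype_acc, typename dtype_out, std::size_t row_size,
        std::size_t wg_size_y>
std::vector<dtype_out> group_row_reduce_sum_model(
        const std::vector<std::vector<dtype_acc>> &rows) {
    static_assert(row_size % wg_size_y == 0,
            "this sketch assumes the columns split evenly across threads");
    // Assumption: each thread reduces an equal share of the columns.
    constexpr std::size_t block_size = row_size / wg_size_y;
    assert(rows.size() == wg_size_y);

    // Steps 1 and 2: every "thread" stores its own row, so that together
    // they compose a wg_size_y * row_size 2D block in "SLM".
    std::vector<dtype_acc> slm(wg_size_y * row_size);
    for (std::size_t tid = 0; tid < wg_size_y; ++tid) {
        assert(rows[tid].size() == row_size);
        for (std::size_t col = 0; col < row_size; ++col) {
            slm[tid * row_size + col] = rows[tid][col];
        }
    }
    // On a device, a workgroup barrier would be required at this point.

    // Step 3: every "thread" loads a wg_size_y * block_size block,
    // sums it along the row dimension, and writes its block_size results.
    std::vector<dtype_out> out(row_size);
    for (std::size_t tid = 0; tid < wg_size_y; ++tid) {
        const std::size_t col_begin = tid * block_size;
        for (std::size_t c = 0; c < block_size; ++c) {
            dtype_acc acc = dtype_acc(0);
            for (std::size_t r = 0; r < wg_size_y; ++r) {
                acc += slm[r * row_size + col_begin + c];
            }
            out[col_begin + c] = static_cast<dtype_out>(acc);
        }
    }
    return out;
}
```

For example, calling `group_row_reduce_sum_model<float, float, 4, 2>` with the rows `{1, 2, 3, 4}` and `{10, 20, 30, 40}` returns `{11, 22, 33, 44}`: the first modeled thread produces columns 0 and 1, and the second produces columns 2 and 3.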