CUDA crosslane vs OpenCL sub-groups¶

Sub-group function mapping¶

This document describes the mapping of the SYCL subgroup operations (based on the proposal SYCL subgroup proposal) to CUDA (queries responses and PTX instruction mapping)

Sub-group device Queries¶

Query	CUDA backend result
`info::device::max_num_sub_groups`	sm 3.0 to 7.0: 64; sm 7.5 32 (see HW_spec)
`info::device::sub_group_independent_forward_progress`	`true`
`info::device::sub_group_sizes`	{32}

Sub-group function mapping¶

Sub-group function	PTX mapping	LLVM Intrinsic	Min version	Note
`get_local_id()`	`%laneid`	`@llvm.nvvm.read.ptx.sreg.laneid`
`get_local_range()`	`WARP_SZ`	`@llvm.nvvm.read.ptx.sreg.warpsize`
`get_max_local_range`	`WARP_SZ`	`@llvm.nvvm.read.ptx.sreg.warpsize`
`get_group_id`	`%warpid`	`@llvm.nvvm.read.ptx.sreg.warpid`
`get_group_range`	`%nwarpid`	`@llvm.nvvm.read.ptx.sreg.nwarpid`
`get_uniform_group_range`	`%nwarpid`	`@llvm.nvvm.read.ptx.sreg.nwarpid`
`barrier`	`bar.warp.sync`	`@llvm.nvvm.bar.warp.sync`
`any(bool)`	`vote{.sync}.any.pred`	`llvm.nvvm.vote.any{.sync}`
`all(bool)`	`vote{.sync}.all.pred`	`llvm.nvvm.vote.all{.sync}`
`broadcast`	`shfl.sync.idx.b32`	`llvm.shfl.sync.idx.{f32,i32}`	`sm_30`	Only implemented for float and int32 in LLVM but should extendable
`reduce`	None	None		cuda_reduce
`exclusive_scan`	None	None		cuda_scan_example/ptx_scan_example
`inclusive_scan`	None	None		cuda_scan_example/ptx_scan_example
`shuffle`	`shfl.sync.idx.b32`	`llvm.shfl.sync.idx.{f32,i32}`	`sm_30`	Insn only for 32 bits. Requires emulation for non 32-bits.
`shuffle_down`	`shfl.sync.down.b32`	`llvm.shfl.sync.down.{f32,i32}`	`sm_30`	Insn only for 32 bits. Requires emulation for non 32-bits.
`shuffle_up`	`shfl.sync.up.b32`	`llvm.shfl.sync.up.{f32,i32}`	`sm_30`	Insn only for 32 bits. Requires emulation for non 32-bits.
`shuffle_xor`	`shfl.sync.bfly.b32`	`llvm.shfl.sync.bfly.{f32,i32}`	`sm_30`	Insn only for 32 bits. Requires emulation for non 32-bits.
`shuffle` (2 inputs)	None	None		Can be implemented using CUDA shuffle function (non in-place modification + predication)
`shuffle_down` (2 inputs)	None	None		Can be implemented using CUDA shuffle function (non in-place modification + predication)
`shuffle_up` (2 inputs)	None	None		Can be implemented using CUDA shuffle function (non in-place modification + predication)
`load` (scalar)	None	None		Maps to normal load, guarantees coalesced access
`load` (vector)	None	None		Maps to normal load, guarantees coalesced access
`store` (scalar)	None	None		Maps to normal store, guarantees coalesced access
`store` (vector)	None	None		Maps to normal store, guarantees coalesced access