CUDA crosslane vs OpenCL sub-groups
Sub-group function mapping
This document describes how the SYCL sub-group operations (as defined in the SYCL sub-group proposal) map to CUDA: the results returned by the device queries and the PTX instructions used for each function.
Sub-group device queries
| Query | CUDA backend result |
|---|---|
| | sm 3.0 to 7.0: 64; sm 7.5: 32 (see HW_spec) |
| | |
| | {32} |
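As a rough illustration of where the values in the table above can come from, the following CUDA sketch reads the warp size and the per-multiprocessor thread limit from the runtime and derives a sub-group count from them. The struct `SubGroupLimits` and the function `query_sub_group_limits` are hypothetical names. Treating `maxThreadsPerMultiProcessor / warpSize` as the maximum number of sub-groups is an assumption made for illustration, not a statement of how the CUDA backend computes its answer.

```cuda
#include <cuda_runtime.h>

// Hypothetical container for the values discussed above.
struct SubGroupLimits {
  bool ok;             // false if the CUDA runtime reported an error
  int sub_group_size;  // warp size of the device
  int max_sub_groups;  // resident warps per multiprocessor (assumed mapping)
};

// Reads the limits for one device.
// Assumption: a sub-group corresponds to a CUDA warp.
SubGroupLimits query_sub_group_limits(int device) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
    return {false, 0, 0};
  }
  return {true, prop.warpSize,
          prop.maxThreadsPerMultiProcessor / prop.warpSize};
}
```

On hardware that allows 2048 resident threads per multiprocessor this yields 2048 / 32 = 64, and on hardware limited to 1024 resident threads it yields 1024 / 32 = 32, which is consistent with the values listed in the table.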
Sub-group function mapping
| Sub-group function | PTX mapping | LLVM intrinsic | Min version | Note |
|---|---|---|---|---|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | Only implemented for float and int32 in LLVM, but should be extendable. |
| | None | None | | |
| | None | None | | |
| | None | None | | |
| | | | | Instruction handles 32-bit values only; other widths require emulation. |
| | | | | Instruction handles 32-bit values only; other widths require emulation. |
| | | | | Instruction handles 32-bit values only; other widths require emulation. |
| | | | | Instruction handles 32-bit values only; other widths require emulation. |
| | None | None | | Can be implemented using the CUDA shuffle functions (non-in-place modification plus predication). |
| | None | None | | Can be implemented using the CUDA shuffle functions (non-in-place modification plus predication). |
| | None | None | | Can be implemented using the CUDA shuffle functions (non-in-place modification plus predication). |
| | None | None | | Maps to a normal load; guarantees coalesced access. |
| | None | None | | Maps to a normal load; guarantees coalesced access. |
| | None | None | | Maps to a normal store; guarantees coalesced access. |
| | None | None | | Maps to a normal store; guarantees coalesced access. |
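Two of the notes in the table can be made concrete with a short CUDA sketch: emulating a shuffle on a non-32-bit type with 32-bit shuffles, and building a two-input shuffle from ordinary shuffles plus a predicate. All function names here (`shuffle_down_u64`, `shuffle_down2`, `demo_kernel`, `run_demo`) are hypothetical. The semantics assumed for the two-input variant, where each lane reads at offset `delta` from the concatenation of the two inputs across the warp, are an assumption for illustration and are not taken from this document.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

constexpr unsigned kFullMask = 0xffffffffu;  // all 32 lanes participate

// 64-bit shuffle-down emulated with two 32-bit shuffles (low and high halves).
__device__ uint64_t shuffle_down_u64(uint64_t v, unsigned delta) {
  unsigned lo = static_cast<unsigned>(v);
  unsigned hi = static_cast<unsigned>(v >> 32);
  lo = __shfl_down_sync(kFullMask, lo, delta);
  hi = __shfl_down_sync(kFullMask, hi, delta);
  return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Two-input shuffle-down (assumed semantics): lane i reads element i + delta
// of the sequence [cur of lanes 0..31 | next of lanes 0..31].
// Requires delta < warpSize and a 1-D thread block.
__device__ int shuffle_down2(int cur, int next, unsigned delta) {
  const unsigned width = static_cast<unsigned>(warpSize);
  const unsigned lane = threadIdx.x % width;
  const unsigned target = lane + delta;
  const int src = static_cast<int>(target % width);
  // Both inputs are shuffled into temporaries (no in-place update) ...
  const int from_cur = __shfl_sync(kFullMask, cur, src);
  const int from_next = __shfl_sync(kFullMask, next, src);
  // ... and a per-lane predicate selects which one to keep.
  return target < width ? from_cur : from_next;
}

__global__ void demo_kernel(uint64_t* out64, int* out2, unsigned delta) {
  const unsigned lane = threadIdx.x;
  const uint64_t v = (uint64_t{1} << 40) + lane;  // value wider than 32 bits
  out64[lane] = shuffle_down_u64(v, delta);
  out2[lane] = shuffle_down2(static_cast<int>(lane),
                             static_cast<int>(100 + lane), delta);
}

// Host helper: runs one warp and copies both result arrays back.
// Returns false if any CUDA call fails.
bool run_demo(unsigned delta, std::vector<uint64_t>& out64,
              std::vector<int>& out2) {
  constexpr int kLanes = 32;
  uint64_t* d64 = nullptr;
  int* d2 = nullptr;
  if (cudaMalloc(&d64, kLanes * sizeof(uint64_t)) != cudaSuccess) return false;
  if (cudaMalloc(&d2, kLanes * sizeof(int)) != cudaSuccess) {
    cudaFree(d64);
    return false;
  }
  demo_kernel<<<1, kLanes>>>(d64, d2, delta);
  out64.resize(kLanes);
  out2.resize(kLanes);
  const bool ok =
      cudaMemcpy(out64.data(), d64, kLanes * sizeof(uint64_t),
                 cudaMemcpyDeviceToHost) == cudaSuccess &&
      cudaMemcpy(out2.data(), d2, kLanes * sizeof(int),
                 cudaMemcpyDeviceToHost) == cudaSuccess;
  cudaFree(d64);
  cudaFree(d2);
  return ok;
}
```

With `delta = 1`, lane 0 of the 64-bit shuffle receives lane 1's value, while lane 31 keeps its own value because `__shfl_down_sync` does not wrap around. In the two-input version lane 31 instead receives lane 0's `next` value, which is the case the predicate handles.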