ESIMD “stateless” accessors support design

This document describes design of automatic conversion of “stateful” memory accesses via SYCL accessors to “stateless” accesses within ESIMD kernels by the SYCL compiler.

Overview of Intel GPU memory access types

Intel GPU hardware has two main modes of accessing memory - stateless and stateful, with specific memory access data port messages corresponding to each of the modes (send instructions with dataport fixed function target).

  • In a stateless access base memory location(s) is represented with a single virtual memory address, which can be a USM pointer.

  • In stateful - with a <surface index, 32-bit offset> pair of values, where the surface index is an index into a “binding table” which contains surface descriptors available to the kernel. Surface is a contigous memory area accessible through its descriptor by stateful accesses. Each descriptor contains various information about the surface - for example, its size and format.

Pointers used in statless accesses are usually coming from USM or C++ memory allocation routines and are passed directly by the runtime as kernel arguments.

The stateful access style has a number of drawbacks which makes it undesirable to use in HPC application. The biggest one is 4Gb limitation on the surface size. Another one is problems with creating data structures with nested pointer fields or double indirection on host and use them on the device.

Accessor and USM pointer kernel argument passing details

ESIMD compiler when compiling a kernel records information about each memory argument and stores it together with the kernel’s SPIRV. Basically, for each kernel argument, there is information whether it is a memory argument, and, if yes, whether it is surface-based or pointer-based.

When JITting the kernel, the scalar GPU compiler back-end can convert memory arguments and memory accesses between the two modes depending on optimization or other settings, and record final type of memory argument with the generated kernel executable. GPU runtime uses that information to wrap/not wrap incoming memory pointer with a surface before passing it onto the harware into the actual kernel argument.

The vector back-end can’t do this in many cases, as memory accesses in SPIRV are represented by hardware-specific intrinsics rather then standard generic memory access SPIRV instructions. This design basically enables the vector BE to redirect code generation for stateful memory access APIs to stateless or stateful intrinsics, and also generate correct annotations. Since it uses the same runtime, which relies on parameter annotation when making the wrap/no-wrap decision, the runtime part does not need much changes.

Problem definition

Currently, ESIMD compiler always maps buffer/accessor-based memory accesses to stateful accesses, thus imposing the 4Gb datum size limitation on user programs with accessors.

Proposed solution

Short/mid-term

API header changes

The general idea is to introduce C++ preprocessor macro ESIMD_FORCE_STATELESS_MEM_ACCESS which will control code generation for the stateful memory access APIs - such as:

template <typename Tx, int N, typename AccessorTy,
          typename Flags = vector_aligned_tag,
          typename = std::enable_if_t<is_simd_flag_type_v<Flags>>,
          class T = detail::__raw_t<Tx>>
__ESIMD_API simd<Tx, N> block_load(AccessorTy acc, uint32_t offset,
                                   Flags = {});

template <typename T, int N, typename AccessorTy>
__ESIMD_API std::enable_if_t<(sizeof(T) <= 4) &&
                                 (N == 1 || N == 8 || N == 16 || N == 32) &&
                                 !std::is_pointer<AccessorTy>::value,
                             simd<T, N>>
gather(AccessorTy acc, simd<uint32_t, N> offsets, uint32_t glob_offset = 0,
       simd_mask<N> mask = 1);

Implementation of the APIs would follow this pattern:

// this API should verify that accessor is to global address space, this is needed both for
// the case with conversion to stateless and the case w/o.
T stateful_memory_api(accessor acc, uint32 offset, args...) {
#ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS
  accessor_elelemt_type *ptr = acc.get_pointer() + offset;
  return stateless_memory_api(ptr, args...);
#else
  <original implementation>
#endif
}

The new macro is supposed to be set by users directly or via some logic based on other macros set by users.

Compiler changes

The API part of the implementation is as simple as above, the compiler one is slightly more complicated. Compiler needs to make sure that in presence of ESIMD_FORCE_STATELESS_MEM_ACCESS macro, the actual memory parameter annotation described above is correct and tells that memory is a pointer, not a surface index. Parameter annotations are generated by the front-end - these are kernel_arg_accessor_ptr and kernel_arg_type metadata nodes, which are then translated to buffer_t (for surface) or svmptr_t (for pointer) metadata annotations consumed by the back-end.

Variant 1

This is the recommended variant. A new driver option is added - -fsycl-esimd-force-stateless-mem-access. Under this option:

  • SYCL C++ device compiler FE defines the ESIMD_FORCE_STATELESS_MEM_ACCESS macro

  • sycl-post-link tool is run with a new option -esimd-force-stateless-mem-access. Under this option, the tool configures the LowerESIMD.cpp pass to ignore the kernel_arg_accessor_ptr and always generate svmptr_t annotation for memory arguments.

Variant 2

Clang C++ FE is changed to generate desired kernel_arg_accessor_ptr metadata depending on ESIMD_FORCE_STATELESS_MEM_ACCESS macro setting. If set, it will mark all memory arguments as pointers in kernel_arg_accessor_ptr and kernel_arg_type MD nodes.

Variant 3 (no go)

Definition of SYCL_ESIMD_KERNEL is changed depending on presence of ESIMD_FORCE_STATELESS_MEM_ACCESS:

#ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS
#define SYCL_ESIMD_KERNEL __attribute__((sycl_explicit_simd)) __attribute__((sycl_explicit_simd_force_stateless))
#else
#define SYCL_ESIMD_KERNEL __attribute__((sycl_explicit_simd))
#endif

Then LowerESIMD lowers parameter annotation depending on sycl_explicit_simd_force_stateless attribute presence. The drawback is that is allowed to use [[intel::sycl_explicit_simd]] w/o SYCL_ESIMD_KERNEL

ESIMD Verifier changes

All the compiler variants require that accessor::get_pointer() can be used in the device code. ESIMDVerifier.cpp needs to additionally allow the following regexps:

    "^cl::sycl::accessor<.+>::getPointerAdjusted",
    "^cl::sycl::accessor<.+>::getQualifiedPtr",
    "^cl::sycl::accessor<.+>::get_pointer",
    "^cl::sycl::multi_ptr<.+>::.+"

But only if it is run in “force-stateless” mode.

Long-term

Long term solution would be replacing the

  #ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS

with

if_device_has(platform_requires_stateless_access)

and removing all the changes in other components. Plus VC BE need to be taught to generate correct pointer parameter annotation not relying on the middle-end providing it.