SYCLcompat¶

SYCLcompat is a header-only library that intends to help developers familiar with other heterogeneous programming models (such as OpenMP, CUDA or HIP) to familiarize themselves with the SYCL programming API while porting their existing codes. Compatibility tools can also benefit from the reduced API size when converting legacy codebases.

SYCLcompat provides:

A high-level API that provides closer semantics to other programming models, simplifying line by line conversions.
Alternative submission APIs that encapsulate SYCL-specific “queue” and “event” APIs for easier reference.
Ability to gradually introduce other SYCL concepts as the user familiarizes themselves with the core SYCL API.
Clear distinction between core SYCL API and the compatibility interface via separate namespaces.

Notice¶

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos.

Support¶

SYCLcompat depends on specific oneAPI DPC++ compiler extensions that may not be available to all the SYCL 2020 specification implementations.

Specifically, this library depends on the following SYCL extensions:

If available, the following extensions extend SYCLcompat functionality:

sycl_ext_intel_device_info [Optional]
sycl_ext_oneapi_bfloat16_math_functions [Optional]
sycl_ext_oneapi_max_work_group_query [Optional]

Hardware Requirements¶

Some of the functionalities provided by SYCLcompat rely on Unified Shared Memory (aspect::usm_device_allocations), though most of the USM-like memory APIs (malloc*, memcpy*, memset*) support hardware with only buffer/accessor support. See section Buffer Support below.

Usage¶

All functionality is available under the syclcompat:: namespace, imported through the main header, syclcompat.hpp. Note that syclcompat.hpp does not import the <sycl/sycl.hpp> header.

#include <syclcompat.hpp>

This document presents the public API under the Features section, and provides a working Sample code using this library. Refer to those to learn to use the library.

Versioning¶

SYCLcompat adopts semantic versioning (major.minor.patch) in a manner which aligns with oneAPI releases. Each oneAPI product release has an associated SYCLcompat release. Between oneAPI releases, there will be at most one major or minor bump. In other words, if a given oneAPI release has SYCLcompat version 1.0.0, the next release will have either 1.1.0 or, if breaking changes have been made, 2.0.0. This guarantee has implications for code merged to the sycl branch, described below.

Between release cycles, ongoing updates to SYCLcompat (including possibly breaking changes) are merged into DPC++ via PRs to the sycl branch. If a PR introduces the first breaking changes since the last release, that PR must bump to the next major version. Otherwise, if the PR introduces new functionality and neither the major nor minor have been bumped since the last release, it must bump to the next minor release. If a PR introduces important bugfixes to existing functionality, patch should be bumped, and there are no limits to how many patch bumps can occur between release cycles.

Release Process¶

Once all changes planned for a release have been merged, the release process is defined as:

Check the major.minor version associated with the previous release.
Confirm the version bump process outlined above has been followed.
If no version bump has occurred since previous release, bump to next minor.
oneAPI release is delivered.
Tag the SYCLcompat release on DPC++ repo: SYCLcompat-major.minor.0.

Deprecation Process/Breaking Changes¶

As outlined above, SYCLcompat may sometimes make API breaking changes, indicated with a major version bump. Advanced notice (at least one major oneAPI release) will be provided via a deprecation warning on the relevant APIs, indicating to the user which alternative API should be used instead.

Note that SYCLcompat is currently in pre-release, and until version 1.0.0 we do not consider our API to be stable, and may change it with shorter notice.

Changelog¶

Since SYCLcompat releases are aligned with oneAPI product releases, the changelog for SYCLcompat is incorporated into SYCL’s Release Notes.

Experimental Namespace¶

SYCLcompat provides some new experimental features in the syclcompat::experimental namespace. This serves as a testing ground for new features which are expected to migrate to syclcompat:: in time, but the developers do not guarantee either API stability or continued existence of these features; they may be modified or removed without notice. When features are migrated from syclcompat::experimental to syclcompat::, this will be treated as a minor version bump.

Features¶

dim3¶

SYCLcompat provides a dim3 class akin to that of CUDA or HIP programming models. dim3 encapsulates other languages iteration spaces that are represented with coordinate letters (x, y, z). In SYCL, the fastest-moving dimension is the one with the highest index, e.g. in a SYCL 2D range iteration space, there are two dimensions, 0 and 1, and 1 will be the one that “moves faster”. For CUDA/HIP, the convention is reversed: x is the dimension which moves fastest. syclcompat::dim3 follows this convention, so that syclcompat::dim3(32, 4) is equivalent to sycl::range<2>(4, 32), and syclcompat::dim3(32, 4, 2) is equivalent to sycl::range<3>(2, 4, 32).

namespace syclcompat {

class dim3 {
public:
  unsigned int x, y, z;
  dim3(const sycl::range<3> &r);
  dim3(const sycl::range<2> &r);
  dim3(const sycl::range<1> &r);
  constexpr dim3(unsigned int x = 1, unsigned int y = 1, unsigned int z = 1);

  constexpr size_t size();

  operator sycl::range<3>();
  operator sycl::range<2>();
  operator sycl::range<1>();
};

// Element-wise operators
inline dim3 operator*(const dim3 &a, const dim3 &b);
inline dim3 operator+(const dim3 &a, const dim3 &b);
inline dim3 operator-(const dim3 &a, const dim3 &b);

} // syclcompat

The compatibility headers for SYCL offer a number of convenience functions that help the mapping between xyz-based coordinates to SYCL iteration spaces in the different scopes available. In addition to the global range, the following helper functions are also provided:

namespace syclcompat {

namespace local_id {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace local_id

namespace local_range {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace local_range

namespace work_group_id {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace work_group_id

namespace work_group_range {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace work_group_range

namespace global_range {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace global_range

namespace global_id {
inline size_t x();
inline size_t y();
inline size_t z();
} // namespace global_id

} // syclcompat

These translate any kernel dimensions from one convention to the other. An example of an equivalent SYCL call for a 3D kernel using compat is syclcompat::global_id::x() == get_global_id(2).

launch¶

SYCLcompat provides a kernel launch interface which accepts a function that executes on the device (a.k.a “kernel”) instead of a lambda/functor. It can be called either by using a pair of “teams”/”blocks” and “threads”, from OpenMP/CUDA terminology, or using a sycl::nd_range. The interface accepts a device function with the use of an auto F template parameter, and a variadic Args for the function’s arguments.

Various overloads for launch<function> exist to permit the user to launch on a specific queue, or to describe the range as either nd_range or dim3, dim3.

namespace syclcompat {

template <auto F, typename... Args>
sycl::event launch(const dim3 &grid, const dim3 &threads, Args... args);

template <auto F, int Dim, typename... Args>
sycl::event launch(const sycl::nd_range<Dim> &range, Args... args);

template <auto F, int Dim, typename... Args>
sycl::event launch(const sycl::nd_range<Dim> &range,
                   sycl::queue q, Args... args);

template <auto F, typename... Args>
sycl::event launch(const dim3 &grid, const dim3 &threads,
                   sycl::queue q, Args... args);

} // syclcompat

For example, if the user had an existing function named vectorAdd to execute on a device such as follows:

void vectorAdd(const float *A, const float *B, float *C, int n);

using SYCLcompat, the user can call it as follows:

syclcompat::launch<vectorAdd>(blocksPerGrid, threadsPerBlock, d_A, d_B, d_C, n);

which would be equivalent to the following call using a sycl::nd_range:

auto range = sycl::nd_range<3>{blocksPerGrid * threadsPerBlock,
                               threadsPerBlock};
syclcompat::launch<vectorAdd>(range, d_A, d_B, d_C, n);

Note that since syclcompat::launch accepts a device function, the kernel lambda is constructed by SYCLcompat internally. This means that, for example, sycl::local_accessors cannot be declared. Instead, users wishing to use local memory should launch with a launch_policy object as described below.

launch_policy¶

In addition to the simple syclcompat::launch interface described above, SYCLcompat provides a more flexible (experimental) interface to launch a kernel with a given launch_policy. By constructing and passing a launch_policy, users can pass sycl::ext::oneapi::experimental::properties associated with the kernel or launch, as well as request local memory for the kernel.

In order to disambiguate the variadic constructor of launch_policy, the following wrapper structs are defined. The kernel_properties and launch_properties wrappers can be constructed either with a variadc set of properties, or with an existing sycl_exp::properties object.

namespace syclcompat::experimental {
namespace sycl_exp = sycl::ext::oneapi::experimental;

// Wrapper for kernel sycl_exp::properties
template <typename Properties> struct kernel_properties {
  using Props = Properties;
  template <typename... Props>
  kernel_properties(Props... properties);
  template <typename... Props>
  kernel_properties(sycl_exp::properties<Props...> properties)
  Properties props;
};

// Wrapper for launch sycl_exp::properties
template <typename Properties> struct launch_properties {
  using Props = Properties;
  template <typename... Props>
  launch_properties(Props... properties);
  template <typename... Props>
  launch_properties(sycl_exp::properties<Props...> properties)
  Properties props;
};

// Wrapper for local memory size
struct local_mem_size {
  local_mem_size(size_t size = 0);
  size_t size;
};

} //namespace syclcompat::experimental

The constructors of launch_policy are variadic, accepting any form of range (nd_range, range, dim3, dim3, dim3), followed by zero or more of local_memory_size, kernel_properties, and launch_properties:

namespace syclcompat::experimental {
namespace sycl_exp = sycl::ext::oneapi::experimental;

// launch_policy is constructed by the user & passed to `compat_exp::launch`
template <typename Range, typename KProps, typename LProps, bool LocalMem>
class launch_policy {
public:
  using KPropsT = KProps;
  using LPropsT = LProps;
  using RangeT = Range;
  static constexpr bool HasLocalMem = LocalMem;

  template <typename... Ts>
  launch_policy(Range range, Ts... ts);

  template <typename... Ts>
  launch_policy(dim3 global_range, Ts... ts);

  template <typename... Ts>
  launch_policy(dim3 global_range, dim3 local_range, Ts... ts);

  KProps get_kernel_properties();
  LProps get_launch_properties();
  size_t get_local_mem_size();
  Range get_range();
};
} //namespace syclcompat::experimental

The launch overloads accepting a launch_policy are:

namespace syclcompat::experimental {

template <auto F, typename LaunchPolicy, typename... Args>
sycl::event launch(LaunchPolicy launch_policy, sycl::queue q, Args... args);

template <auto F, typename LaunchPolicy, typename... Args>
sycl::event launch(LaunchPolicy launch_policy, Args... args);
} //namespace syclcompat::experimental

For local memory, launch<function> injects a char * pointer to the beginning of a local accessor of the requested local_mem_size as the last argument of the kernel function. This char * can then be reinterpreted as the datatype required by the user within the kernel function.

For example, the previous function named vectorAdd can be modified with the following signature, which adds a char * pointer to access local memory inside the kernel:

void vectorAdd(const float *A, const float *B, float *C, int n,
               char *local_mem);

Then, the new vectorAdd can be launched like this:

using syclcompat::experimental;
launch_policy policy{blocksPerGrid, threadsPerBlock,
                      local_mem_size(nbytes)};
launch<vectorAdd>(policy, d_A, d_B, d_C, n);

To request a different cache/local memory split on supported hardware:

using syclcompat::experimental;
namespace sycl_intel_exp = sycl::ext::intel::experimental;

sycl_intel_exp::cache_config cache_config{
    sycl_intel_exp::large_slm};
kernel_properties kernel_props{cache_config};
launch_policy policy{blocksPerGrid, threadsPerBlock,
                      local_mem_size(nbytes), kernel_props};

launch<vectorAdd>(policy, d_A, d_B, d_C, n);

To request a certain cluster dimension on supported hardware:

using syclcompat::experimental;
namespace sycl_exp = sycl::ext::oneapi::experimental;

sycl_exp::cuda::cluster_size cluster_dims(cluster_range);
launch_policy policy{blocksPerGrid, threadsPerBlock,
                                  local_mem_size(nbytes), 
                                  launch_properties{cluster_dims}};

launch<vectorAdd>(policy, d_A, d_B, d_C, n);

Utilities¶

SYCLcompat introduces a set of utility functions designed to streamline the usage of the library and its launch<function> mechanism.

The first utility function is syclcompat::wg_barrier(), which provides a concise work-group barrier. syclcompat::wg_barrier() uses the SYCL_INTEL_free_function_queries extension to provide this functionality.

The second utility function, syclcompat::compute_nd_range, ensures that the provided global size and work group sizes are appropriate for a given dimensionality, and that global size is rounded up to a multiple of the work group size in each dimension.

namespace syclcompat {

inline void wg_barrier();

template <int Dim>
inline sycl::nd_range<Dim> compute_nd_range(sycl::range<Dim> global_size_in,
                                            sycl::range<Dim> work_group_size);
inline sycl::nd_range<1> compute_nd_range(int global_size_in, 
                                          int work_group_size);

} // syclcompat

Queues¶

The design for this library assumes in-order queues (sycl::property::queue::in_order()).

Many of the APIs accept an optional queue parameter, and this can be an out-of-order queue, either created manually or retrieved via a call to syclcompat::create_queue(), specifying false for the in_order parameter.

namespace syclcompat {

inline sycl::queue create_queue(bool print_on_async_exceptions = false,
                                bool in_order = true);

} // syclcompat

However, SYCLcompat does not implement any mechanisms to deal with this case. The rationale for this is that a user wanting the full power of SYCL’s dependency management shouldn’t be using the this library. As such, support for out-of-order queues is very limited. The only way to safely use an out-of-order queue at present is to explicitly q.wait() or e.wait() where e is the sycl::event returned through a syclcompat::async API.

To facilitate machine translation from other heterogeneous programming models to SYCL, SYCLcompat provides the following pointer aliases for sycl::event and sycl::queue, and the function destroy_event which destroys an event_ptr allocated on the heap.

namespace syclcompat {

using event_ptr = sycl::event *;

using queue_ptr = sycl::queue *;

static void destroy_event(event_ptr event);

} // syclcompat

Memory Operations¶

This library provides interfaces to allocate memory to be accessed within kernel functions and on the host. The syclcompat::malloc function allocates device USM memory, the syclcompat::malloc_host function allocates host USM memory, and the syclcompat::malloc_shared function allocates shared USM memory.

In each case we provide a template and non-templated interface for allocating memory, taking the number of elements or number of bytes respectively.

The interface includes both synchronous and asynchronous malloc, memcpy, memset, fill, and free operations.

There is a helper class pointer_attributes to query allocation type for memory pointers using SYCLcompat, through sycl::usm::alloc and sycl::get_pointer_device.

namespace syclcompat {

// Expects number of elements
template <typename T>
T *malloc(size_t count, sycl::queue q = get_default_queue());
template <typename T>
T *malloc_host(size_t count, sycl::queue q = get_default_queue());
template <typename T>
T *malloc_shared(size_t count, sycl::queue q = get_default_queue());

// Expects size of the memory in bytes
void *malloc(size_t num_bytes, sycl::queue q = get_default_queue());
void *malloc_host(size_t num_bytes, sycl::queue q = get_default_queue());
void *malloc_shared(size_t num_bytes, sycl::queue q = get_default_queue());

// 2D, 3D memory allocation wrappers
void *malloc(size_t &pitch, size_t x, size_t y,
             sycl::queue q = get_default_queue())
pitched_data malloc(sycl::range<3> size, sycl::queue q = get_default_queue());

// Blocking memcpy
void memcpy(void *to_ptr, const void *from_ptr, size_t size,
            sycl::queue q = get_default_queue());
void memcpy(T *to_ptr, const T *from_ptr, size_t count,
            sycl::queue q = get_default_queue());
void memcpy(void *to_ptr, size_t to_pitch, const void *from_ptr,
            size_t from_pitch, size_t x, size_t y,
            sycl::queue q = get_default_queue()); // 2D matrix
void memcpy(pitched_data to, sycl::id<3> to_pos,
            pitched_data from, sycl::id<3> from_pos,
            sycl::range<3> size,
            sycl::queue q = get_default_queue()); // 3D matrix

// Non-blocking memcpy
sycl::event memcpy_async(void *to_ptr, const void *from_ptr, size_t size,
                         sycl::queue q = get_default_queue());
template <typename T>
sycl::event memcpy_async(T *to_ptr, T void *from_ptr, size_t count,
                         sycl::queue q = get_default_queue());
sycl::event memcpy_async(void *to_ptr, size_t to_pitch,
                         const void *from_ptr, size_t from_pitch,
                         size_t x, size_t y,
                         sycl::queue q = get_default_queue()); // 2D matrix
sycl::event memcpy_async(pitched_data to, sycl::id<3> to_pos,
                         pitched_data from, sycl::id<3> from_pos,
                         sycl::range<3> size,
                         sycl::queue q = get_default_queue()); // 3D matrix

// Fill
template <class T>
void fill(void *dev_ptr, const T &pattern, size_t count,
          sycl::queue q = get_default_queue());
template <typename T>
sycl::event fill_async(void *dev_ptr, const T &pattern,
                       size_t count, sycl::queue q = get_default_queue());

// Memset
void memset(void *dev_ptr, int value, size_t size,
                   sycl::queue q = get_default_queue());
void memset(void *ptr, size_t pitch, int val, size_t x, size_t y,
            sycl::queue q = get_default_queue()); // 2D matrix
void memset(pitched_data pitch, int val, sycl::range<3> size,
                          sycl::queue q = get_default_queue()); // 3D matrix
sycl::event memset_async(void *dev_ptr, int value, size_t size,
                         sycl::queue q = get_default_queue());
sycl::event memset_async(void *ptr, size_t pitch, int val,
                         size_t x, size_t y,
                         sycl::queue q = get_default_queue()); // 2D matrix
sycl::event memset_async(pitched_data pitch, int val,
                         sycl::range<3> size,
                         sycl::queue q = get_default_queue()); // 3D matrix

// Free
void wait_and_free(void *ptr, sycl::queue q = get_default_queue());
void free(void *ptr, sycl::queue q = get_default_queue());
sycl::event enqueue_free(const std::vector<void *> &pointers,
                         const std::vector<sycl::event> &events,
                         sycl::queue q = get_default_queue());

// Queries pointer allocation type
class pointer_attributes {
public:
  void init(const void *ptr, sycl::queue q = get_default_queue());
  sycl::usm::alloc get_memory_type();
  const void *get_device_pointer();
  const void *get_host_pointer();
  bool is_memory_shared();
  unsigned int get_device_id();
};

} // syclcompat

The syclcompat::experimental namespace contains currently unsupported memcpy overloads which take a syclcompat::experimental::memcpy_parameter argument. These are included for forwards compatibility and currently throw a std::runtime_error.

namespace syclcompat {
namespace experimental {
// Forward declarations for types relating to unsupported memcpy_parameter API:

#ifdef SYCL_EXT_ONEAPI_BINDLESS_IMAGES
class image_mem_wrapper;
#endif
class image_matrix;

/// Memory copy parameters for 2D/3D memory data.
struct memcpy_parameter {
  struct data_wrapper {
    pitched_data pitched{};
    sycl::id<3> pos{};
#ifdef SYCL_EXT_ONEAPI_BINDLESS_IMAGES
    experimental::image_mem_wrapper *image_bindless{nullptr};
#endif
    image_matrix *image{nullptr};
  };
  data_wrapper from{};
  data_wrapper to{};
  sycl::range<3> size{};
};

/// [UNSUPPORTED] Synchronously copies 2D/3D memory data specified by \p param .
/// The function will return after the copy is completed.
///
/// \param param Memory copy parameters.
/// \param q Queue to execute the copy task.
/// \returns no return value.
static inline void memcpy(const memcpy_parameter &param,
                          sycl::queue q = get_default_queue());

/// [UNSUPPORTED] Asynchronously copies 2D/3D memory data specified by \p param
/// . The return of the function does NOT guarantee the copy is completed.
///
/// \param param Memory copy parameters.
/// \param q Queue to execute the copy task.
/// \returns no return value.
static inline void memcpy_async(const memcpy_parameter &param,
                                sycl::queue q = get_default_queue());

} // namespace experimental
} // namespace syclcompat

Finally, the class pitched_data, which manages memory allocation for 3D spaces, padded to avoid uncoalesced memory accesses.

namespace syclcompat {

class pitched_data {
public:
  pitched_data();
  pitched_data(void *data, size_t pitch, size_t x, size_t y);

  void *get_data_ptr();
  size_t get_pitch();
  size_t get_x();
  size_t get_y();

  void set_data_ptr(void *data);
  void set_pitch(size_t pitch);
  void set_x(size_t x);
  void set_y(size_t y);
};

} // syclcompat

There are various helper classes and aliases defined within SYCLcompat to encapsulate and define memory operations and objects. These classes and aliases are primarily designed to assist with machine translation from other heterogeneous programming models.

The wrapper class device_memory provides a unified representation for device memory in various regions. The class provides methods to allocate memory for the object (init()) and access the underlying memory in various ways (get_ptr(), get_access(), operator[]). Aliases for global and USM shared specializations are provided.

The memory_traits class is provided as a traits helper for device_memory. The accessor class template provides a 2D or 3D sycl::accessor-like wrapper around raw pointers.

namespace syclcompat {

enum class memory_region {
  global = 0, // device global memory
  constant,   // device read-only memory
  local,      // device local memory
  usm_shared, // memory which can be accessed by host and device
};

using byte_t = uint8_t;

template <memory_region Memory, class T = byte_t> class memory_traits {
public:
  static constexpr sycl::access::address_space asp =
      (Memory == memory_region::local)
          ? sycl::access::address_space::local_space
          : sycl::access::address_space::global_space;
  static constexpr sycl::target target =
      (Memory == memory_region::local)
          ? sycl::target::local
          : sycl::target::device;
  static constexpr sycl::access_mode mode =
      (Memory == memory_region::constant)
          ? sycl::access_mode::read
          : sycl::access_mode::read_write;
  static constexpr size_t type_size = sizeof(T);
  using element_t =
      typename std::conditional_t<Memory == constant, const T, T>;
  using value_t = typename std::remove_cv_t<T>;
  template <size_t Dimension = 1>
  using accessor_t = typename std::conditional_t<
      target == sycl::target::local,
      sycl::local_accessor<T, Dimension>,
      sycl::accessor<T, Dimension, mode>>;
  using pointer_t = T *;
};

template <class T, memory_region Memory, size_t Dimension> class device_memory {
public:
  using accessor_t =
      typename memory_traits<Memory, T>::template accessor_t<Dimension>;
  using value_t = typename memory_traits<Memory, T>::value_t;
  using syclcompat_accessor_t =
      syclcompat::accessor<T, Memory, Dimension>;

  device_memory();

  device_memory(const sycl::range<Dimension> &in_range,
                std::initializer_list<value_t> &&init_list);

  template <size_t D = Dimension>
  device_memory(
      const typename std::enable_if<D == 2, sycl::range<2>>::type &in_range,
      std::initializer_list<std::initializer_list<value_t>> &&init_list);

  device_memory(const sycl::range<Dimension> &range_in);

  // Variadic constructor taking 1, 2 or 3 integers to be interpreted as a
  // sycl::range<Dim>.
  template <class... Args>
  device_memory(Args... Arguments);

  ~device_memory();

  // Allocate memory with default queue, and init memory if has initial value.
  void init();
  // Allocate memory with specified queue, and init memory if has initial
  // value.
  void init(sycl::queue q);

  // The variable is assigned to a device pointer.
  void assign(value_t *src, size_t size);

  // Get memory pointer of the memory object, which is virtual pointer when
  // usm is not used, and device pointer when usm is used.
  value_t *get_ptr();
  // Get memory pointer of the memory object, which is virtual pointer when
  // usm is not used, and device pointer when usm is used.
  value_t *get_ptr(sycl::queue q);

  // Get the device memory object size in bytes.
  size_t get_size();

  template <size_t D = Dimension>
  typename std::enable_if<D == 1, T>::type &operator[](size_t index);

  // Get accessor with dimension info for the device memory object
  // when usm is used and dimension is greater than 1.
  template <size_t D = Dimension>
  typename std::enable_if<D != 1, syclcompat_accessor_t>::type
  get_access(sycl::handler &cgh);
};


template <class T, memory_region Memory>
class device_memory<T, Memory, 0> : public device_memory<T, Memory, 1> {
public:
  using base = device_memory<T, Memory, 1>;
  using value_t = typename base::value_t;
  using accessor_t =
      typename memory_traits<Memory, T>::template accessor_t<0>;
  device_memory(const value_t &val);
  device_memory();
};

template <class T, size_t Dimension>
using global_memory = device_memory<T, memory_region::global, Dimension>;
template <class T, size_t Dimension>
using constant_memory = detail::device_memory<T, constant, Dimension>;
template <class T, size_t Dimension>
using shared_memory = device_memory<T, memory_region::usm_shared, Dimension>;


template <class T, memory_region Memory, size_t Dimension> class accessor;

template <class T, memory_region Memory> class accessor<T, Memory, 3> {
public:
  using memory_t = memory_traits<Memory, T>;
  using element_t = typename memory_t::element_t;
  using pointer_t = typename memory_t::pointer_t;
  using accessor_t = typename memory_t::template accessor_t<3>;

  accessor(pointer_t data, const sycl::range<3> &in_range);
  template <memory_region M = Memory>
  accessor(typename std::enable_if<M != memory_region::local,
                                   const accessor_t>::type &acc);
  accessor(const accessor_t &acc, const sycl::range<3> &in_range);

  accessor<T, Memory, 2> operator[](size_t index) const;

  pointer_t get_ptr() const;

};

template <class T, memory_region Memory> class accessor<T, Memory, 2> {
public:
  using memory_t = memory_traits<Memory, T>;
  using element_t = typename memory_t::element_t;
  using pointer_t = typename memory_t::pointer_t;
  using accessor_t = typename memory_t::template accessor_t<2>;

  accessor(pointer_t data, const sycl::range<2> &in_range);
  template <memory_region M = Memory>
  accessor(typename std::enable_if<M != memory_region::local,
                                   const accessor_t>::type &acc);
  accessor(const accessor_t &acc, const sycl::range<2> &in_range);

  pointer_t operator[](size_t index);

  pointer_t get_ptr() const;
};

} // syclcompat

Buffer Support¶

Although SYCLcompat is primarily designed around the Unified Shared Memory model, there is (limited) support for the buffer/accessor model. This can be enabled by setting the compiler define SYCLCOMPAT_USM_LEVEL_NONE. This macro instructs SYCLcompat to effectively provide emulated USM pointers via a Memory Manager singleton.

Note that in SYCLCOMPAT_USM_LEVEL_NONE mode, the pointers returned by e.g. syclcompat::malloc, and passed to syclcompat::memcpy can only interact with syclcompat APIs. It is legal to perform pointer arithmetic on these virtual pointers, but attempting to dereference them, passing them to sycl APIs, or passing them into kernels will result in an error.

The SYCLcompat tests with the suffix _usmnone.cpp provide examples of how to use SYCLCOMPAT_USM_LEVEL_NONE.

ptr_to_int¶

The following cuda backend specific function is introduced in order to translate from local memory pointers to uint32_t or size_t variables that contain a byte address to the local (local refers to.shared in nvptx) memory state space.

namespace syclcompat {
template <typename T>
__syclcompat_inline__
    std::enable_if_t<std::is_same_v<T, uint32_t> || std::is_same_v<T, size_t>,
                     T>
    ptr_to_int(void *ptr)
} // namespace syclcompat

These variables can be used in inline PTX instructions that take address operands. Such inline PTX instructions are commonly used in optimized libraries. A simplified example usage of the above functions is as follows:

  half *data = syclcompat::local_mem<half[NUM_ELEMENTS]>();
  // ...
  // ...
  T addr =
      syclcompat::ptr_to_int<T>(reinterpret_cast<char *>(data) + (id % 8) * 16);
  uint32_t fragment;
#if defined(__NVPTX__)
  asm volatile("ldmatrix.sync.aligned.m8n8.x1.shared.b16 {%0}, [%1];\n"
               : "=r"(fragment)
               : "r"(addr));
#endif

Device Information¶

sycl::device properties are encapsulated using the device_info helper class. The class is meant to be constructed and used through the extended device implemented in SYCLcompat.

This is the synopsis of device_info:

class device_info {
public:
  const char *get_name();
  char *get_name();
  template <typename WorkItemSizesTy = sycl::range<3>,
            std::enable_if_t<std::is_same_v<WorkItemSizesTy, sycl::id<3>> ||
                                 std::is_same_v<WorkItemSizesTy, int *>,
                             int> = 0>
  auto get_max_work_item_sizes() const;

  template <typename WorkItemSizesTy = sycl::range<3>,
          std::enable_if_t<std::is_same_v<WorkItemSizesTy, sycl::id<3>> ||
                                std::is_same_v<WorkItemSizesTy, int *>,
                            int> = 0>
  auto get_max_work_item_sizes() const;
  bool get_host_unified_memory() const;
  int get_major_version() const;
  int get_minor_version() const;
  int get_integrated() const;
  int get_max_clock_frequency() const;
  int get_max_compute_units() const;
  int get_max_work_group_size() const;
  int get_max_sub_group_size() const;
  int get_max_work_items_per_compute_unit() const;
  int get_max_register_size_per_work_group() const;
  template <typename NDRangeSizeTy = size_t *,
            std::enable_if_t<std::is_same_v<NDRangeSizeTy, size_t *> ||
                                 std::is_same_v<NDRangeSizeTy, int *>,
                             int> = 0>
  auto get_max_nd_range_size() const;
  template <typename NDRangeSizeTy = size_t *,
            std::enable_if_t<std::is_same_v<NDRangeSizeTy, size_t *> ||
                                 std::is_same_v<NDRangeSizeTy, int *>,
                             int> = 0>
  auto get_max_nd_range_size();
  size_t get_global_mem_size() const;
  size_t get_local_mem_size() const;

  unsigned int get_memory_clock_rate() const;
  unsigned int get_memory_bus_width() const;
  uint32_t get_device_id() const;
  std::array<unsigned char, 16> get_uuid() const;
  unsigned int get_global_mem_cache_size() const;
  int get_image1d_max() const;
  auto get_image2d_max() const;
  auto get_image2d_max();
  auto get_image3d_max() const;
  auto get_image3d_max();

  void set_name(const char *name);
  void set_max_work_item_sizes(const sycl::range<3> max_work_item_sizes);
  [[deprecated]] void
  set_max_work_item_sizes(const sycl::id<3> max_work_item_sizes);
  void set_host_unified_memory(bool host_unified_memory);
  void set_major_version(int major);
  void set_minor_version(int minor);
  void set_integrated(int integrated);
  void set_max_clock_frequency(int frequency);
  void set_max_compute_units(int max_compute_units);
  void set_global_mem_size(size_t global_mem_size);
  void set_local_mem_size(size_t local_mem_size);
  void set_max_work_group_size(int max_work_group_size);
  void set_max_sub_group_size(int max_sub_group_size);
  void
  set_max_work_items_per_compute_unit(int max_work_items_per_compute_unit);
  void set_max_nd_range_size(int max_nd_range_size[]);
  void set_max_nd_range_size(sycl::id<3> max_nd_range_size);
  void set_memory_clock_rate(unsigned int memory_clock_rate);
  void set_memory_bus_width(unsigned int memory_bus_width);
  void 
  set_max_register_size_per_work_group(int max_register_size_per_work_group);
  void set_device_id(uint32_t device_id);
  void set_uuid(std::array<unsigned char, 16> uuid);
  void set_global_mem_cache_size(unsigned int global_mem_cache_size);
  void set_image1d_max(size_t image_max_buffer_size);
  void set_image2d_max(size_t image_max_width_buffer_size,
                       size_t image_max_height_buffer_size);
  void set_image3d_max(size_t image_max_width_buffer_size,
                       size_t image_max_height_buffer_size,
                       size_t image_max_depth_buffer_size);
};

Device Management¶

Multiple SYCL functionalities are exposed through utility functions to manage the current sycl::device, sycl::queue, and sycl::context, exposed as follows:

namespace syclcompat {

// Util function to create a new queue for the current device
static inline sycl::queue create_queue(bool print_on_async_exceptions = false,
                                       bool in_order = true);

// Util function to get the default queue of current device in
// device manager.
static inline sycl::queue get_default_queue();

// Util function to set the default queue of the current device in the
// device manager.
// If the device extension saved queue is the default queue, 
// the previous saved queue will be overwritten as well.
// This function will be blocking if there are submitted kernels in the
// previous default queue.
static inline void set_default_queue(const sycl::queue &q);

// Util function to wait for the queued kernels.
static inline void wait(sycl::queue q = get_default_queue());

// Util function to wait for the queued kernels and throw unhandled errors.
static inline void wait_and_throw(sycl::queue q = get_default_queue());

// Util function to get the id of current device in
// device manager.
static inline unsigned int get_current_device_id();

// Util function to get the current device.
static inline device_ext &get_current_device();

// Util function to get a device by id.
static inline device_ext &get_device(unsigned int id);

// Util function to get the context of the default queue of current
// device in device manager.
static inline sycl::context get_default_context();

// Util function to get a CPU device.
static inline device_ext &cpu_device();

/// Filter out devices; only keep the device whose name contains one of the
/// subname in \p dev_subnames.
/// May break device id mapping and change current device. It's better to be
/// called before other SYCLcompat or SYCL APIs.
static inline void filter_device(const std::vector<std::string> &dev_subnames);

/// Print all the devices (and their IDs) in the dev_mgr
static inline void list_devices();

// Util function to select a device by its id
static inline unsigned int select_device(unsigned int id);

// Util function to get the device id from a device
static inline unsigned int get_device_id(const sycl::device &dev);

// Util function to get the number of available devices
static inline unsigned int device_count();

// Util function to check whether a device supports some kinds of sycl::aspect.
static inline void
has_capability_or_fail(const sycl::device &dev,
                       const std::initializer_list<sycl::aspect> &props);
} // syclcompat

The exposed functionalities include creation and destruction of queues, through syclcompat::create_queue and syclcompat::destroy_queue, and providing the ability to wait for submitted kernels using syclcompat::wait or syclcompat::wait_and_throw. Any async errors will be output to stderr if print_on_async_exceptions, and will have the default behavior otherwise, which calls std:terminate. Synchronous exceptions have to be managed by users independently of what is set in this parameter.

Devices are managed through a helper class, device_ext. The device_ext class associates a vector of sycl::queues with its sycl::device. The device_ext destructor waits on a set of sycl::event which can be added to via add_event. This is used, for example, to implement syclcompat::enqueue_free to schedule release of memory after a kernel or mempcy. SYCL device properties can be queried through device_ext as well. device_ext also provides the has_capability_or_fail member function, which throws a sycl::exception if the device does not have the specified list of sycl::aspect.

Devices can be listed and filtered using syclcompat::list_devices() and syclcompat::filter_device(). If SYCLCOMPAT_VERBOSE is defined at compile time, the available SYCL devices are printed to the standard output both at initialization time, and when the device list is filtered using syclcompat::filter_device.

Users can manage queues through the syclcompat::set_default_queue(sycl::queue q) free function, and the device_ext set_saved_queue, set_default_queue, and get_saved_queue member functions. set_default_queue is blocking, and overwrites the previous default queue with a user defined one, waiting for any submitted kernels to finish. The device_ext automatically sets the saved queue to the default queue. Therefore, it’s important to note that if the previous default queue was the device’s saved queue, setting a new default queue will update the reference of the saved queue to the new default one to keep the state of the class consistent.

The class is exposed as follows:

namespace syclcompat {

class device_ext : public sycl::device {
  device_ext();
  device_ext(const sycl::device &base, bool print_on_async_exceptions = false,
             bool in_order = true);
  ~device_ext();

  bool is_native_host_atomic_supported();
  int get_major_version() const;
  int get_minor_version() const;
  int get_max_compute_units() const;
  int get_max_clock_frequency() const;
  int get_integrated() const;
  int get_max_sub_group_size() const;
  int get_max_register_size_per_work_group() const;
  int get_max_work_group_size() const;
  int get_mem_base_addr_align() const;
  size_t get_global_mem_size() const;
  size_t get_local_mem_size() const;
  void get_memory_info(size_t &free_memory, size_t &total_memory) const;

  void get_device_info(device_info &out) const;
  device_info get_device_info() const;
  void reset(bool print_on_async_exceptions = false, bool in_order = true);

  sycl::queue *default_queue();
  void set_default_queue(const sycl::queue &q);
  void queues_wait_and_throw();
  sycl::queue *create_queue(bool print_on_async_exceptions = false,
                            bool in_order = true);
  void destroy_queue(sycl::queue *&queue);
  void set_saved_queue(sycl::queue *q);
  sycl::queue *get_saved_queue();
  sycl::context get_context();

  void
  has_capability_or_fail(const std::initializer_list<sycl::aspect> &props) const;
};

} // syclcompat

Free functions are provided for querying major and minor version directly from a sycl::device, equivalent to the methods of device_ext described above:

static int get_major_version(const sycl::device &dev);
static int get_minor_version(const sycl::device &dev);

Multiple devices¶

SYCLcompat allows you to manage multiple devices through syclcompat::select_device and syclcompat::create_queue. The library uses the default SYCL device (i.e. the device returned by sycl::default_selector_v) as the default device, and exposes all other devices available on the system through the syclcompat::select_device(unsigned int id) member function.

The interface uses the syclcompat::device_ext::get_current_device_id() to get the current CPU thread, and returns the associated device stored internally as a map with that thread. The map is constructed using calls to syclcompat::select_device(unsigned int id). Any thread which hasn’t used this member function to select a device will be given the default device. Note that this implies multiple threads on a single device by default.

Be aware that targetting multiple devices may lead to unintended behavior caused by developers, as SYCLcompat does not implement a mechanism to warn when the wrong queue is used as an argument in any of the member functions of the syclcompat namespace.

Atomic Operations¶

SYCLcompat provides an interface for common atomic operations (add, sub, and, or, xor, min, max, inc, dec, exchange, compare_exchange). While SYCL exposes atomic operations through member functions of sycl::atomic_ref, this library provides access via functions taking a standard pointer argument. Template arguments control the sycl::memory_scope, sycl::memory_order and sycl::access::address_space of these atomic operations. SYCLcompat also exposes overloads for these atomic functions which take a runtime memoryScope argument. Every atomic operation is implemented via an API function taking a raw pointer as the target. Additional overloads for syclcompat::compare_exchange_strong are provided which take a sycl::multi_ptr instead of a raw pointer. The type of the operand for most atomic operations is defined as syclcompat::type_identity_t<T> to avoid template deduction issues when an operand of a different type (e.g. double literal) is supplied. Atomic addition and subtraction free functions make use of syclcompat::arith_t<T> to differentiate between numeric and pointer arithmetic.

The available operations are exposed as follows:

namespace syclcompat {

template <class T> struct type_identity {
  using type = T;
};
template <class T> using type_identity_t = typename type_identity<T>::type;

template <typename T> struct arith {
  using type = std::conditional_t<std::is_pointer_v<T>, std::ptrdiff_t, T>;
};
template <typename T> using arith_t = typename arith<T>::type;

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_add(T *addr, arith_t<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_sub(T *addr, arith_t<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_and(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_or(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_xor(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_min(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_fetch_max(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device>
unsigned int atomic_fetch_compare_inc(unsigned int *addr,
                                      unsigned int operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device>
unsigned int atomic_fetch_compare_dec(unsigned int *addr,
                                      unsigned int operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_exchange(T *addr, type_identity<T> operand);

template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_compare_exchange_strong(
    sycl::multi_ptr<T, addressSpace> addr, type_identity_t<T> expected,
    type_identity_t<T> desired,
    sycl::memory_order success = sycl::memory_order::relaxed,
    sycl::memory_order fail = sycl::memory_order::relaxed);
template <sycl::access::address_space addressSpace =
              sycl::access::address_space::generic_space,
          sycl::memory_order memoryOrder = sycl::memory_order::relaxed,
          sycl::memory_scope memoryScope = sycl::memory_scope::device,
          typename T>
T atomic_compare_exchange_strong(
    T *addr, T expected, T desired,
    sycl::memory_order success = sycl::memory_order::relaxed,
    sycl::memory_order fail = sycl::memory_order::relaxed);

} // namespace syclcompat

SYCLcompat also provides an atomic class with the store, load, exchange, compare_exchange_weak, fetch_add, and fetch_sub operations. The atomic class wrapper supports int, unsigned int, long, unsigned long, long long, unsigned long long, float, double and pointer datatypes.

namespace syclcompat {

template <typename T,
          sycl::memory_scope DefaultScope = sycl::memory_scope::system,
          sycl::memory_order DefaultOrder = sycl::memory_order::seq_cst,
          sycl::access::address_space Space =
              sycl::access::address_space::generic_space>
class atomic {
  static constexpr sycl::memory_order default_read_order =
      sycl::atomic_ref<T, DefaultOrder, DefaultScope,
                       Space>::default_read_order;
  static constexpr sycl::memory_order default_write_order =
      sycl::atomic_ref<T, DefaultOrder, DefaultScope,
                       Space>::default_write_order;
  static constexpr sycl::memory_scope default_scope = DefaultScope;
  static constexpr sycl::memory_order default_read_modify_write_order =
      DefaultOrder;

  constexpr atomic() noexcept = default;

  constexpr atomic(T d) noexcept;

  void store(T operand, sycl::memory_order memoryOrder = default_write_order,
             sycl::memory_scope memoryScope = default_scope) noexcept;

  T load(sycl::memory_order memoryOrder = default_read_order,
         sycl::memory_scope memoryScope = default_scope) const noexcept;

  T exchange(T operand,
             sycl::memory_order memoryOrder = default_read_modify_write_order,
             sycl::memory_scope memoryScope = default_scope) noexcept;

  bool compare_exchange_weak(
      T &expected, T desired, sycl::memory_order success,
      sycl::memory_order failure,
      sycl::memory_scope memoryScope = default_scope) noexcept;

  bool compare_exchange_weak(
      T &expected, T desired,
      sycl::memory_order memoryOrder = default_read_modify_write_order,
      sycl::memory_scope memoryScope = default_scope) noexcept;

  bool compare_exchange_strong(
      T &expected, T desired, sycl::memory_order success,
      sycl::memory_order failure,
      sycl::memory_scope memoryScope = default_scope) noexcept;

  bool compare_exchange_strong(
      T &expected, T desired,
      sycl::memory_order memoryOrder = default_read_modify_write_order,
      sycl::memory_scope memoryScope = default_scope) noexcept;

  T fetch_add(arith_t<T> operand,
              sycl::memory_order memoryOrder = default_read_modify_write_order,
              sycl::memory_scope memoryScope = default_scope) noexcept;

  T fetch_sub(arith_t<T> operand,
              sycl::memory_order memoryOrder = default_read_modify_write_order,
              sycl::memory_scope memoryScope = default_scope) noexcept;
};

} // namespace syclcompat

Compatibility Utilities¶

This library provides a number of small compatibility utilities which exist to facilitate machine translation of code from other programming models to SYCL. These functions are part of the public API, but they are not expected to be useful to developers writing their own code.

Functionality is provided to represent a pair of integers as a double. cast_ints_to_double(int, int) returns a double containing the given integers in the high & low 32-bits respectively. cast_double_to_int casts the high or low 32-bits back into an integer.

reverse_bits reverses the bits of a 32-bit unsigned integer, ffs returns the position of the first least significant set bit in an integer. byte_level_permute returns a byte-permutation of two input unsigned integers, with bytes selected according to a third unsigned integer argument. match_all_over_sub_group and match_any_over_sub_group allows comparison of values across work-items within a sub-group.

The function ternary_logic_opperforms bitwise logical operations on three input values of a, b and c based on the specified 8-bit truth table lut and return the result.

The functions select_from_sub_group, shift_sub_group_left, shift_sub_group_right and permute_sub_group_by_xor provide equivalent functionality to sycl::select_from_group, sycl::shift_group_left, sycl::shift_group_right and sycl::permute_group_by_xor, respectively. However, they provide an optional argument to represent the logical_group size (default 32).

int_as_queue_ptr helps with translation of code by reinterpret casting an address to sycl::queue *, or returning a pointer to SYCLcompat’s default queue if the address is <= 2. args_selector is a helper class for extracting arguments from an array of pointers to arguments or buffer of arguments to pass to a kernel function. The class allows users to exclude parameters such as sycl::nd_item. Experimental support for masked versions of select_from_sub_group, shift_sub_group_left, shift_sub_group_right and permute_sub_group_by_xor is provided only for SPIRV or CUDA devices.

As part of the compatibility utilities to facilitate machine translation to SYCL, two aliases for errors are provided, err0 and err1.

namespace syclcompat {

inline int cast_double_to_int(double d, bool use_high32 = true);

inline double cast_ints_to_double(int high32, int low32);

inline unsigned int byte_level_permute(unsigned int a, unsigned int b,
                                       unsigned int s);

inline uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)

template <typename ValueT> inline int ffs(ValueT a);

template <typename T>
unsigned int match_any_over_sub_group(sycl::sub_group g, unsigned member_mask,
                                      T value);

template <typename T>
unsigned int match_all_over_sub_group(sycl::sub_group g, unsigned member_mask,
                                      T value, int *pred);

template <typename ValueT>
ValueT select_from_sub_group(sycl::sub_group g, ValueT x, int remote_local_id,
                        int logical_sub_group_size = 32);

template <typename ValueT>
ValueT shift_sub_group_left(sycl::sub_group g, ValueT x, unsigned int delta,
                       int logical_sub_group_size = 32);

template <typename ValueT>
ValueT shift_sub_group_right(sycl::sub_group g, ValueT x, unsigned int delta,
                        int logical_sub_group_size = 32);

template <typename ValueT>
ValueT permute_sub_group_by_xor(sycl::sub_group g, ValueT x, unsigned int mask,
                           int logical_sub_group_size = 32);

namespace experimental {

template <typename ValueT>
ValueT select_from_sub_group(unsigned int member_mask, sycl::sub_group g, ValueT x,
                             int remote_local_id, int logical_sub_group_size = 32);

template <typename ValueT>
ValueT shift_sub_group_left(unsigned int member_mask, sycl::sub_group g, ValueT x,
                            unsigned int delta, int logical_sub_group_size = 32);

template <typename ValueT>
ValueT shift_sub_group_right(unsigned int member_mask, sycl::sub_group g, ValueT x,
                             unsigned int delta, int logical_sub_group_size = 32);

template <typename ValueT>
ValueT permute_sub_group_by_xor(unsigned int member_mask, sycql::sub_group g, ValueT x,
                                unsigned int mask, int logical_sub_group_size = 32);

} // namespace experimental

inline sycl::queue *int_as_queue_ptr(uintptr_t x);

using err0 = detail::generic_error_type<struct err0_tag, int>;
using err1 = detail::generic_error_type<struct err1_tag, int>;

template <int n_nondefault_params, int n_default_params, typename T>
class args_selector;

template <int n_nondefault_params, int n_default_params, typename R,
          typename... Ts>
class args_selector<n_nondefault_params, n_default_params, R(Ts...)> {
public:
  // Get the type of the ith argument of R(Ts...)
  template <int i>
  using arg_type =
      std::tuple_element_t<account_for_default_params<i>(), std::tuple<Ts...>>;

  // If kernel_params is nonnull, then args_selector will
  // extract arguments from kernel_params. Otherwise, it
  // will extract them from extra.
  args_selector(void **kernel_params, void **extra)
      : kernel_params(kernel_params), args_buffer(get_args_buffer(extra)) {}

  // Get a reference to the i-th argument extracted from kernel_params
  // or extra.
  template <int i> arg_type<i> &get();
};

} // namespace syclcompat

The function experimental::nd_range_barrier synchronizes work items from all work groups within a SYCL kernel. This is not officially supported by the SYCL spec, and so should be used with caution. experimental::calculate_max_active_wg_per_xecore and experimental::calculate_max_potential_wg are used for occupancy calculation. There is also an experimental::logical_group class which allows sycl::sub_groups to be further subdivided into ‘logical’ groups to perform sub-group level operations. This class provides methods to get the local & group id and range. experimental::group_type, experimental::group and experimental::group_base are helper classes to manage the supported group types.

namespace syclcompat {
namespace experimental {

#if defined(__AMDGPU__) || defined(__NVPTX__)
// seq_cst currently not working for AMD nor Nvidia
constexpr sycl::memory_order barrier_memory_order = sycl::memory_order::acq_rel;
#else
constexpr sycl::memory_order barrier_memory_order = sycl::memory_order::seq_cst;
#endif

template <int dimensions = 3>
inline void nd_range_barrier(
    sycl::nd_item<dimensions> item,
    sycl::atomic_ref<unsigned int, barrier_memory_order,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> &counter);

template <>
inline void nd_range_barrier(
    sycl::nd_item<1> item,
    sycl::atomic_ref<unsigned int, barrier_memory_order,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> &counter);

template <int dimensions = 3> class logical_group {
public:
  logical_group(sycl::nd_item<dimensions> item, sycl::group<dimensions> parent_group,
                uint32_t size);
  uint32_t get_local_linear_id() const;
  uint32_t get_group_linear_id() const;
  uint32_t get_local_linear_range() const;
  uint32_t get_group_linear_range() const;
};

inline int calculate_max_active_wg_per_xecore(int *num_wg, int wg_size,
                                              int slm_size = 0,
                                              int sg_size = 32,
                                              bool used_barrier = false,
                                              bool used_large_grf = false);

inline int calculate_max_potential_wg(int *num_wg, int *wg_size,
                                      int max_wg_size_for_device_code,
                                      int slm_size = 0, int sg_size = 32,
                                      bool used_barrier = false,
                                      bool used_large_grf = false);
// Supported group types
enum class group_type { work_group, sub_group, logical_group, root_group };

// The group_base will dispatch the function call to the specific interface
// based on the group type.
template <int dimensions = 3> class group_base {
public:
  group_base(sycl::nd_item<dimensions> item);

  // Returns the number of work-items in the group.
  size_t get_local_linear_range();
  // Returns the index of the work-item within the group.
  size_t get_local_linear_id();

  // Wait for all the elements within the group to complete their execution
  // before proceeding.
  void barrier();
};

// Container type that can store supported group_types.
template <typename GroupT, int dimensions = 3>
class group : public group_base<dimensions> {
public:
  group(GroupT g, sycl::nd_item<dimensions> item);
};

} // namespace experimental
} // namespace syclcompat

SYCLcompat provides a wrapper API max_active_work_groups_per_cu providing ‘work-groups per compute unit’ semantics. It is templated on the kernel functor, and takes a work-group size represented by either sycl::range<Dim> or syclcompat::dim3, the local memory size in bytes, and an optional queue. The function returns the maximum number of work-groups which can be executed per compute unit. May return zero even when below resource limits (i.e. returning 0 does not imply the kernel cannot execute).

namespace syclcompat{
template <class KernelName>
size_t max_active_work_groups_per_cu(
    syclcompat::dim3 wg_dim3, size_t local_mem_size,
    sycl::queue queue = syclcompat::get_default_queue());

template <class KernelName, int RangeDim>
size_t max_active_work_groups_per_cu(
    sycl::range<RangeDim> wg_range, size_t local_mem_size,
    sycl::queue queue = syclcompat::get_default_queue());
}

To assist machine translation, helper aliases are provided for inlining and alignment attributes. The class template declarations sycl_compat_kernel_name and sycl_compat_kernel_scalar are used to assist automatic generation of kernel names during machine translation.

get_sycl_language_version returns an integer representing the version of the SYCL spec supported by the current SYCL compiler.

The SYCLCOMPAT_CHECK_ERROR macro encapsulates an error-handling mechanism for expressions that might throw sycl::exception and std::runtime_error. If no exceptions are thrown, it returns syclcompat::error_code::success. If a sycl::exception is caught, it returns syclcompat::error_code::backend_error. If a std::runtime_error exception is caught, syclcompat::error_code::default_error is returned instead. For both cases, it prints the error message to the standard error stream.

get_error_string_dummy is a dummy function introduced to assist auto migration. The SYCLomatic user should replace it with a real error-handling function. SYCL reports errors using exceptions and does not use error codes.

namespace syclcompat {

template <class... Args> class syclcompat_kernel_name;
template <int Arg> class syclcompat_kernel_scalar;

#if defined(_MSC_VER)
#define __syclcompat_align__(n) __declspec(align(n))
#define __syclcompat_inline__ __forceinline
#else
#define __syclcompat_align__(n) __attribute__((aligned(n)))
#define __syclcompat_inline__ __inline__ __attribute__((always_inline))
#endif

#if defined(_MSC_VER)
#define __syclcompat_noinline__ __declspec(noinline)
#else
#define __syclcompat_noinline__ __attribute__((noinline))
#endif

#define SYCLCOMPAT_COMPATIBILITY_TEMP (900)

#ifdef _WIN32
#define SYCLCOMPAT_EXPORT __declspec(dllexport)
#else
#define SYCLCOMPAT_EXPORT
#endif


namespace syclcompat {
enum error_code { success = 0, backend_error = 1, default_error = 999 };
inline const char *get_error_string_dummy(int ec);
}

#define SYCLCOMPAT_CHECK_ERROR(expr)

int get_sycl_language_version();

} // namespace syclcompat

Kernel Helper Functions¶

Kernel helper functions provide a structure kernel_function_info to keep SYCL kernel information, and provide a utility function get_kernel_function_info() to get the kernel information. Overloads are provided to allow either returning a kernel_function_info object, or to return by pointer argument. In the current version, kernel_function_info describes only maximum work-group size.

SYCLcompat also provides the kernel_library and kernel_function classes. kernel_library facilitates the loading and unloading of kernel libraries. kernel_function represents a specific kernel function within a loaded library and can be invoked with specified arguments. load_kernel_library, load_kernel_library_mem, and unload_kernel_library are free functions to handle the loading and unloading of kernel_library objects. get_kernel_function, and invoke_kernel_function offer a similar functionality for kernel_function objects.

namespace syclcompat {

struct kernel_function_info {
  int max_work_group_size = 0;
};

static void get_kernel_function_info(kernel_function_info *kernel_info,
                                     const void *function);
static kernel_function_info get_kernel_function_info(const void *function);

class kernel_library {
  constexpr kernel_library();
  constexpr kernel_library(void *ptr);
  operator void *() const;
};

static kernel_library load_kernel_library(const std::string &name);
static kernel_library load_kernel_library_mem(char const *const image);
static void unload_kernel_library(const kernel_library &library);

class kernel_function {
    constexpr kernel_function();
    constexpr kernel_function(kernel_functor ptr);
    operator void *() const;
    void operator()(sycl::queue &q, const sycl::nd_range<3> &range,
                    unsigned int local_mem_size, void **args, void **extra);
};

static kernel_function get_kernel_function(kernel_library &library,
                                           const std::string &name);
static void invoke_kernel_function(kernel_function &function,
                                   sycl::queue &queue,
                                   sycl::range<3> group_range,
                                   sycl::range<3> local_range,
                                   unsigned int local_mem_size,
                                   void **kernel_params, void **extra);

} // namespace syclcompat

Math Functions¶

The funnelshift_* APIs perform a concatenate-shift operation on two 32-bit values, and return a 32-bit result. The two unsigned integer arguments (low and high) are concatenated to a 64-bit value which is then shifted left or right by shift bits. The functions then return either the least- or most-significant 32 bits. The _l* variants shift left and return the most significant 32 bits, while the _r* variants shift right and return the least significant 32 bits. The _l/_r APIs differ from the _lc/_rc APIs in how they clamp the shift argument: funnelshift_l and funnelshift_r shift the result by shift & 31 bits, whereas funnelshift_lc and funnelshift_rc shift the result by min(shift, 32) bits.

syclcompat::fast_length provides a wrapper to SYCL’s fast_length(sycl::vec<float,N>) that accepts arguments for a C++ array and a length. syclcompat::length provides a templated version that wraps over sycl::length. There are wrappers for clamp, isnan, cbrt, min, max, fmax_nan, fmin_nan, and pow, as well as an implementation of relu saturation is also provided.

compare, unordered_compare, compare_both, unordered_compare_both, compare_mask, and unordered_compare_mask, handle both ordered and unordered comparisons.

vectorized_max and vectorized_min are binary operations returning the max/min of two arguments, where each argument is treated as a sycl::vec type. vectorized_isgreater performs elementwise isgreater, treating each argument as a vector of elements, and returning 0 for vector components for which isgreater is false, and -1 when true. vectorized_sum_abs_diff calculates the absolute difference for two values without modulo overflow for vector types.

The functions cmul,cdiv,cabs, cmul_add, and conj define complex math operations which accept sycl::vec<T,2> arguments representing complex values.

The dp4a function returns the 4-way 8-bit dot product accumulate for unsigned and signed 32-bit integer values. The dp2a_lo and dp2a_hi functions return the two-way 16-bit to 8-bit dot product using the second and first 16 bits of the second operand, respectively. These three APIs return a single 32-bit value with the accumulated result, which is unsigned if both operands are uint32_t and signed otherwise.

Various maths functions are defined operate on any floating point types. syclcompat::is_floating_point_v extends the standard library’s std::is_floating_point_v to include sycl::half and, where available, sycl::ext::oneapi::bfloat16. The current version of SYCLcompat also provides a specialization of std::common_type_t for sycl::ext::oneapi::bfloat16, though this will be moved to the sycl_ext_oneapi_bfloat16 extension in future.

namespace std {
template <> struct common_type<sycl::ext::oneapi::bfloat16> {
  using type = sycl::ext::oneapi::bfloat16;
};

template <>
struct common_type<sycl::ext::oneapi::bfloat16, sycl::ext::oneapi::bfloat16> {
  using type = sycl::ext::oneapi::bfloat16;
};

template <typename T> struct common_type<sycl::ext::oneapi::bfloat16, T> {
  using type = sycl::ext::oneapi::bfloat16;
};

template <typename T> struct common_type<T, sycl::ext::oneapi::bfloat16> {
  using type = sycl::ext::oneapi::bfloat16;
};
} // namespace std

namespace syclcompat{

// Trait for extended floating point definition
template <typename T>
struct is_floating_point : std::is_floating_point<T>{};

template <> struct is_floating_point<sycl::half> : std::true_type {};

#ifdef SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS
template <> struct is_floating_point<sycl::ext::oneapi::bfloat16> : std::true_type {};
#endif
template <typename T>

inline constexpr bool is_floating_point_v = is_floating_point<T>::value;

inline unsigned int funnelshift_l(unsigned int low, unsigned int high,
                                  unsigned int shift); 

inline unsigned int funnelshift_lc(unsigned int low, unsigned int high,
                                   unsigned int shift); 

inline unsigned int funnelshift_r(unsigned int low, unsigned int high,
                                  unsigned int shift);

inline unsigned int funnelshift_rc(unsigned int low, unsigned int high,
                                   unsigned int shift);

inline float fast_length(const float *a, int len);

template <typename ValueT>
inline ValueT length(const ValueT *a, const int len);

inline ValueT clamp(ValueT val, ValueT min_val, ValueT max_val);

// Determine whether 2 element value is NaN.
template <typename ValueT>
inline std::enable_if_t<ValueT::size() == 2, ValueT> isnan(const ValueT a);

// cbrt function wrapper.
template <typename ValueT>
inline std::enable_if_t<std::is_floating_point_v<ValueT> ||
                            std::is_same_v<ValueT, sycl::half>,
                        ValueT>
cbrt(ValueT val);

// For floating-point types, `float` or `double` arguments are acceptable.
// For integer types, `std::uint32_t`, `std::int32_t`, `std::uint64_t` or
// `std::int64_t` type arguments are acceptable.
// sycl::half supported as well.
template <typename T1, typename T2>
std::enable_if_t<std::is_integral_v<T1> && std::is_integral_v<T2>,
                 std::common_type_t<T1, T2>>
min(T1 a, T2 b);
template <typename T1, typename T2>
std::enable_if_t<std::is_floating_point_v<T1> && std::is_floating_point_v<T2>,
                 std::common_type_t<T1, T2>>
min(T1 a, T2 b);

sycl::half min(sycl::half a, sycl::half b);

template <typename T1, typename T2>
std::enable_if_t<std::is_integral_v<T1> && std::is_integral_v<T2>,
                 std::common_type_t<T1, T2>>
max(T1 a, T2 b);
template <typename T1, typename T2>
std::enable_if_t<std::is_floating_point_v<T1> && std::is_floating_point_v<T2>,
                 std::common_type_t<T1, T2>>
max(T1 a, T2 b);

sycl::half max(sycl::half a, sycl::half b);

// Performs 2 elements comparison and returns the bigger one. If either of
// inputs is NaN, then return NaN.
template <typename ValueT, typename ValueU>
inline std::common_type_t<ValueT, ValueU> fmax_nan(const ValueT a,
                                                   const ValueU b);

template <typename ValueT, typename ValueU>
inline sycl::vec<std::common_type_t<ValueT, ValueU>, 2>
fmax_nan(const sycl::vec<ValueT, 2> a, const sycl::vec<ValueU, 2> b);

template <typename ValueT, typename ValueU>
inline sycl::marray<std::common_type_t<ValueT, ValueU>, 2>
fmax_nan(const sycl::marray<ValueT, 2> a, const sycl::marray<ValueU, 2> b);

// Performs 2 elements comparison and returns the smaller one. If either of
// inputs is NaN, then return NaN.
template <typename ValueT, typename ValueU>
inline std::common_type_t<ValueT, ValueU> fmin_nan(const ValueT a,
                                                   const ValueU b);
template <typename ValueT, typename ValueU>
inline sycl::vec<std::common_type_t<ValueT, ValueU>, 2>
fmin_nan(const sycl::vec<ValueT, 2> a, const sycl::vec<ValueU, 2> b);

template <typename ValueT, typename ValueU>
inline sycl::marray<std::common_type_t<ValueT, ValueU>, 2>
fmin_nan(const sycl::marray<ValueT, 2> a, const sycl::marray<ValueU, 2> b);

inline float pow(const float a, const int b) { return sycl::pown(a, b); }
inline double pow(const double a, const int b) { return sycl::pown(a, b); }

template <typename ValueT, typename ValueU>
inline typename std::enable_if_t<std::is_floating_point_v<ValueT>, ValueT>
pow(const ValueT a, const ValueU b);

// Requires aspect::fp64, as it casts to double internally.
template <typename ValueT, typename ValueU>
inline typename std::enable_if_t<!std::is_floating_point_v<ValueT>, double>
pow(const ValueT a, const ValueU b);

template <typename ValueT> inline ValueT relu(const ValueT a);

template <class ValueT, int NumElements>
inline sycl::vec<ValueT, NumElements>
relu(const sycl::vec<ValueT, NumElements> a);

template <class ValueT>
inline std::enable_if_t<std::is_floating_point_v<ValueT> ||
                            std::is_same_v<sycl::half, ValueT>,
                        sycl::marray<ValueT, 2>>
relu(const sycl::marray<ValueT, 2> a);

// The following definition is enabled when BinaryOperation(ValueT, ValueT) returns bool
// std::enable_if_t<std::is_same_v<std::invoke_result_t<BinaryOperation, ValueT, ValueT>, bool>, bool>
template <typename ValueT, class BinaryOperation>
inline bool 
compare(const ValueT a, const ValueT b, const BinaryOperation binary_op);
template <typename ValueT, class BinaryOperation>
inline std::enable_if_t<ValueT::size() == 2, ValueT>
compare(const ValueT a, const ValueT b, const BinaryOperation binary_op);

// The following definition is enabled when BinaryOperation(ValueT, ValueT) returns bool
// std::enable_if_t<std::is_same_v<std::invoke_result_t<BinaryOperation, ValueT, ValueT>, bool>, bool>
template <typename ValueT, class BinaryOperation>
inline bool
unordered_compare(const ValueT a, const ValueT b,
                  const BinaryOperation binary_op);
template <typename ValueT, class BinaryOperation>
inline std::enable_if_t<ValueT::size() == 2, ValueT>
unordered_compare(const ValueT a, const ValueT b,
                  const BinaryOperation binary_op);

template <typename ValueT, class BinaryOperation>
inline std::enable_if_t<ValueT::size() == 2, bool>
compare_both(const ValueT a, const ValueT b, const BinaryOperation binary_op);
template <typename ValueT, class BinaryOperation>

inline std::enable_if_t<ValueT::size() == 2, bool>
unordered_compare_both(const ValueT a, const ValueT b,
                       const BinaryOperation binary_op);

template <typename ValueT, class BinaryOperation>
inline std::enable_if_t<ValueT::size() == 2, unsigned>
compare_mask(const ValueT a, const ValueT b, const BinaryOperation binary_op);

template <typename ValueT, class BinaryOperation>
inline std::enable_if_t<ValueT::size() == 2, unsigned>
unordered_compare_mask(const ValueT a, const ValueT b,
                       const BinaryOperation binary_op);

template <typename S, typename T> inline T vectorized_max(T a, T b);

template <typename S, typename T> inline T vectorized_min(T a, T b);

template <typename S, typename T> inline T vectorized_isgreater(T a, T b);

template <>
inline unsigned vectorized_isgreater<sycl::ushort2, unsigned>(unsigned a,
                                                              unsigned b);

template <typename VecT>
inline unsigned vectorized_sum_abs_diff(unsigned a, unsigned b);

template <typename T>
sycl::vec<T, 2> cmul(sycl::vec<T, 2> x, sycl::vec<T, 2> y);

template <typename T>
sycl::vec<T, 2> cdiv(sycl::vec<T, 2> x, sycl::vec<T, 2> y);

template <typename T> T cabs(sycl::vec<T, 2> x);

template <typename ValueT>
inline sycl::vec<ValueT, 2> cmul_add(const sycl::vec<ValueT, 2> a,
                                     const sycl::vec<ValueT, 2> b,
                                     const sycl::vec<ValueT, 2> c);

template <typename ValueT>
inline sycl::marray<ValueT, 2> cmul_add(const sycl::marray<ValueT, 2> a,
                                        const sycl::marray<ValueT, 2> b,
                                        const sycl::marray<ValueT, 2> c);

template <typename T> sycl::vec<T, 2> conj(sycl::vec<T, 2> x);

template <typename ValueT> inline ValueT reverse_bits(ValueT a);


template <typename T1, typename T2>
using dot_product_acc_t =
    std::conditional_t<std::is_unsigned_v<T1> && std::is_unsigned_v<T2>,
                       uint32_t, int32_t>;

template <typename T1, typename T2>
inline dot_product_acc_t<T1, T2> dp2a_lo(T1 a, T2 b,
                                         dot_product_acc_t<T1, T2> c);

template <typename T1, typename T2>
inline dot_product_acc_t<T1, T2> dp2a_hi(T1 a, T2 b,
                                         dot_product_acc_t<T1, T2> c);

template <typename T1, typename T2>
inline dot_product_acc_t<T1, T2> dp4a(T1 a, T2 b,
                                      dot_product_acc_t<T1, T2> c);
} // namespace syclcompat

vectorized_binary computes the BinaryOperation for two operands, with each value treated as a vector type. vectorized_unary offers the same interface for operations with a single operand. vectorized_ternary offers the interface for three operands with two BinaryOperation. The implemented BinaryOperations are abs_diff, add_sat, rhadd, hadd, maximum, minimum, and sub_sat. And the vectorized_with_pred offers the BinaryOperation for two operands, meanwihle provides the pred of high/low halfword operation.

namespace syclcompat {
  
template <typename VecT, class UnaryOperation>
inline unsigned vectorized_unary(unsigned a, const UnaryOperation unary_op);

// A sycl::abs wrapper functor.
struct abs {
  template <typename ValueT> auto operator()(const ValueT x) const;
};

template <typename VecT, class BinaryOperation>
inline unsigned vectorized_binary(unsigned a, unsigned b,
                                  const BinaryOperation binary_op,
                                  bool need_relu = false);

template <typename VecT, typename BinaryOperation1, typename BinaryOperation2>
inline unsigned vectorized_ternary(unsigned a, unsigned b, unsigned c,
                                   const BinaryOperation1 binary_op1,
                                   const BinaryOperation2 binary_op2,
                                   bool need_relu = false);

template <typename ValueT, typename BinaryOperation>
inline unsigned vectorized_with_pred(unsigned a, unsigned b,
                                     const BinaryOperation binary_op,
                                     bool *pred_hi, bool *pred_lo);

// A sycl::abs_diff wrapper functor.
struct abs_diff {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
};
// A sycl::add_sat wrapper functor.
struct add_sat {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
};
// A sycl::rhadd wrapper functor.
struct rhadd {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
};
// A sycl::hadd wrapper functor.
struct hadd {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
};
// A sycl::max wrapper functor.
struct maximum {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y, bool *pred) const;
};
// A sycl::min wrapper functor.
struct minimum {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y, bool *pred) const;
};
// A sycl::sub_sat wrapper functor.
struct sub_sat {
  template <typename ValueT>
  auto operator()(const ValueT x, const ValueT y) const;
};

} // namespace syclcompat

vectorized_binary also supports comparison operators from the standard library (std::equal_to, std::not_equal_to, etc) and the semantics can be modified by changing the comparison operator template instantiation. For example:

unsigned int Input1;
unsigned int Input2;
// initialize inputs...

// Performs comparison on sycl::ushort2, following sycl::vec semantics
// Returns unsigned int containing, per vector element, 0xFFFF if true, and 0x0000 if false
syclcompat::vectorized_binary<sycl::ushort2>(
      Input1, Input2, std::equal_to<>());

// Performs element-wise comparison on unsigned short
// Returns unsigned int containing, per vector element, 1 if true, and 0 if false
syclcompat::vectorized_binary<sycl::ushort2>(
      Input1, Input2, std::equal_to<unsigned short>());

The math header provides a set of functions to extend 32-bit operations to 33 bit, and handle sign extension internally. There is support for add, sub, absdiff, min and max operations. Each operation provides overloads to include a second, separate, BinaryOperation after the first, and include the _sat variation, determines if the returning value is saturated or not.

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_add(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_add(AT a, BT b, CT c, BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_add_sat(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_add_sat(AT a, BT b, CT c,
                                     BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_sub(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_sub(AT a, BT b, CT c, BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_sub_sat(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_sub_sat(AT a, BT b, CT c,
                                     BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_absdiff(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_absdiff(AT a, BT b, CT c,
                                     BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_absdiff_sat(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_absdiff_sat(AT a, BT b, CT c,
                                         BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_min(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_min(AT a, BT b, CT c, BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_min_sat(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_min_sat(AT a, BT b, CT c,
                                     BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_max(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_max(AT a, BT b, CT c, BinaryOperation second_op);

template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_max_sat(AT a, BT b);

template <typename RetT, typename AT, typename BT, typename CT,
          typename BinaryOperation>
inline constexpr RetT extend_max_sat(AT a, BT b, CT c,
                                     BinaryOperation second_op);

Another set of vectorized extend 32-bit operations is provided in the math header.These APIs treat each of the 32-bit operands as 2-elements vector (16-bits each) while handling sign extension to 17-bits internally. There is support for add, sub, absdiff, min, max and avg binary operations. Each operation provides has a _sat variat which determines if the returning value is saturated or not, and a _add variant that computes the binary sum of the the initial operation outputs and a third operand.

/// Compute vectorized addition of \p a and \p b, with each value treated as a
/// 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized addition of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd2(AT a, BT b, RetT c);

/// Compute vectorized addition of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized addition of the two
/// values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd2_add(AT a, BT b, RetT c);

/// Compute vectorized addition of \p a and \p b with saturation, with each
/// value treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized addition of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd2_sat(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b, with each value treated as
/// a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized subtraction of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub2(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b, with each value treated as
/// a 2 elements vector type and extend each element to 17 bit. Then add each
/// half of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized subtraction of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub2_add(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b with saturation, with each
/// value treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized subtraction of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub2_sat(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized abs_diff of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff2(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized abs_diff of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff2_add(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b with saturation, with each
/// value treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized abs_diff of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff2_sat(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized minimum of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin2(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized minimum of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin2_add(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b with saturation, with each value
/// treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized minimum of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin2_sat(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized maximum of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax2(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized maximum of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax2_add(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b with saturation, with each value
/// treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized maximum of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax2_sat(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized average of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg2(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b, with each value treated as a 2
/// elements vector type and extend each element to 17 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend average maximum of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg2_add(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b with saturation, with each value
/// treated as a 2 elements vector type and extend each element to 17 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized average of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg2_sat(AT a, BT b, RetT c);

Similarly, a set of vectorized extend 32-bit operations is provided in the math header treating each of the 32-bit operands as 4-elements vector (8-bits each) while handling sign extension to 9-bits internally. There is support for add, sub, absdiff, min, max and avg binary operations. Each operation provides has a _sat variat which determines if the returning value is saturated or not, and a _add variant that computes the binary sum of the the initial operation outputs and a third operand.

/// Compute vectorized addition of \p a and \p b, with each value treated as a
/// 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized addition of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd4(AT a, BT b, RetT c);

/// Compute vectorized addition of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized addition of the two
/// values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd4_add(AT a, BT b, RetT c);

/// Compute vectorized addition of \p a and \p b with saturation, with each
/// value treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized addition of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vadd4_sat(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b, with each value treated as
/// a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized subtraction of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub4(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b, with each value treated as
/// a 4 elements vector type and extend each element to 9 bit. Then add each
/// half of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized subtraction of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub4_add(AT a, BT b, RetT c);

/// Compute vectorized subtraction of \p a and \p b with saturation, with each
/// value treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized subtraction of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vsub4_sat(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized abs_diff of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff4(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized abs_diff of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff4_add(AT a, BT b, RetT c);

/// Compute vectorized abs_diff of \p a and \p b with saturation, with each
/// value treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized abs_diff of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vabsdiff4_sat(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized minimum of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin4(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized minimum of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin4_add(AT a, BT b, RetT c);

/// Compute vectorized minimum of \p a and \p b with saturation, with each value
/// treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized minimum of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmin4_sat(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized maximum of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax4(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized maximum of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax4_add(AT a, BT b, RetT c);

/// Compute vectorized maximum of \p a and \p b with saturation, with each value
/// treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized maximum of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vmax4_sat(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized average of the two values
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg4(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b, with each value treated as a 4
/// elements vector type and extend each element to 9 bit. Then add each half
/// of the result and add with \p c.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The addition of each half of extend vectorized average of the
/// two values and the third value
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg4_add(AT a, BT b, RetT c);

/// Compute vectorized average of \p a and \p b with saturation, with each value
/// treated as a 4 elements vector type and extend each element to 9 bit.
/// \tparam [in] RetT The type of the return value, can only be 32 bit integer
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \returns The extend vectorized average of the two values with saturation
template <typename RetT, typename AT, typename BT>
inline constexpr RetT extend_vavrg4_sat(AT a, BT b, RetT c);

Vectorized comparison APIs also provided in the math header behave similarly and support a std comparison operator parameter which can be greater, less, greater_equal, less_equal, equal_to or not_equal_to. These APIs cover both the 2-elements (16-bits each) and 4-elements (8-bits each) variants, as well as an additional _add variant that computes the sum of the 2/4 output elements.

/// Extend \p a and \p b to 33 bit and vectorized compare input values using
/// specified comparison \p cmp .
///
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \tparam [in] BinaryOperation The type of the compare operation
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] cmp The comparsion operator
/// \returns The comparison result of the two extended values.
template <typename AT, typename BT, typename BinaryOperation>
inline constexpr unsigned extend_vcompare2(AT a, BT b, BinaryOperation cmp);

/// Extend Inputs to 33 bit, and vectorized compare input values using specified
/// comparison \p cmp , then add the result with \p c .
///
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \tparam [in] BinaryOperation The type of the compare operation
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \param [in] cmp The comparsion operator
/// \returns The comparison result of the two extended values, and add the
/// result with \p c .
template <typename AT, typename BT, typename BinaryOperation>
inline constexpr unsigned extend_vcompare2_add(AT a, BT b, unsigned c,
                                               BinaryOperation cmp);

/// Extend \p a and \p b to 33 bit and vectorized compare input values using
/// specified comparison \p cmp .
///
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \tparam [in] BinaryOperation The type of the compare operation
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] cmp The comparsion operator
/// \returns The comparison result of the two extended values.
template <typename AT, typename BT, typename BinaryOperation>
inline constexpr unsigned extend_vcompare4(AT a, BT b, BinaryOperation cmp);

/// Extend Inputs to 33 bit, and vectorized compare input values using specified
/// comparison \p cmp , then add the result with \p c .
///
/// \tparam [in] AT The type of the first value, can only be 32 bit integer
/// \tparam [in] BT The type of the second value, can only be 32 bit integer
/// \tparam [in] BinaryOperation The type of the compare operation
/// \param [in] a The first value
/// \param [in] b The second value
/// \param [in] c The third value
/// \param [in] cmp The comparsion operator
/// \returns The comparison result of the two extended values, and add the
/// result with \p c .
template <typename AT, typename BT, typename BinaryOperation>
inline constexpr unsigned extend_vcompare4_add(AT a, BT b, unsigned c,
                                               BinaryOperation cmp);

The math header file provides APIs for bit-field insertion (bfi_safe) and bit-field extraction (bfe_safe). These are bounds-checked variants of underlying detail APIs (detail::bfi, detail::bfe) which, in future releases, will be exposed to the user.

/// Bitfield-insert with boundary checking.
///
/// Align and insert a bit field from \param x into \param y . Source \param
/// bit_start gives the starting bit position for the insertion, and source
/// \param num_bits gives the bit field length in bits.
///
/// \tparam T The type of \param x and \param y , must be an unsigned integer.
/// \param x The source of the bitfield.
/// \param y The source where bitfield is inserted.
/// \param bit_start The position to start insertion.
/// \param num_bits The number of bits to insertion.
template <typename T>
inline T bfi_safe(const T x, const T y, const uint32_t bit_start,
                  const uint32_t num_bits);

/// Bitfield-extract with boundary checking.
///
/// Extract bit field from \param source and return the zero or sign-extended
/// result. Source \param bit_start gives the bit field starting bit position,
/// and source \param num_bits gives the bit field length in bits.
///
/// The result is padded with the sign bit of the extracted field. If `num_bits`
/// is zero, the  result is zero. If the start position is beyond the msb of the
/// input, the result is filled with the replicated sign bit of the extracted
/// field.
///
/// \tparam T The type of \param source value, must be an integer.
/// \param source The source value to extracting.
/// \param bit_start The position to start extracting.
/// \param num_bits The number of bits to extracting.
template <typename T>
inline T bfe_safe(const T source, const uint32_t bit_start,
                  const uint32_t num_bits);

Group Utilities¶

Group utility functions and classes optimize data movement, processing, and communication within work-groups. The exchange class facilitates rearranging data between blocked and striped layouts, improving memory access patterns. The group_radix_sort class implements an efficient radix sort for distributed data, supporting both ascending and descending order. The group_load and group_store classes manage structured data movement between linear memory and work-group arrangements, supporting blocked and striped formats with optional range-guarding. The group_shuffle class enables efficient inter-work-item communication through selective data exchanges, shifting operations, and group-wide data movement. These utilities collectively enhance parallel performance by improving memory efficiency, load balancing, and computational throughput.

namespace syclcompat {
/// Rearranging data partitioned across a work-group.
///
/// \tparam T The type of the data elements.
/// \tparam ElementsPerWorkItem The number of data elements assigned to a
/// work-item.
template <typename T, size_t ElementsPerWorkItem> class exchange {
public:
  exchange(uint8_t *local_memory);

  static size_t get_local_memory_size(size_t group_threads);

  /// Inplace rearrange elements from blocked order to striped order.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// blocked \p input across the work-group is:
  ///
  ///   {[0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511]}.
  ///
  /// The striped order output is:
  ///
  ///   {[0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511]}.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  blocked_to_striped(Item item, T (&input)[ElementsPerWorkItem]);

  /// Inplace rearrange elements from striped order to blocked order.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// striped \p input across the work-group is:
  ///
  ///   { [0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511] }.
  ///
  /// The blocked order output is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  striped_to_blocked(Item item, T (&input)[ElementsPerWorkItem]);

  /// Rearrange elements from blocked order to striped order.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// blocked \p input across the work-group is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// The striped order output is:
  ///
  ///   { [0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param output The corresponding output data of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  blocked_to_striped(Item item, T (&input)[ElementsPerWorkItem],
                     T (&output)[ElementsPerWorkItem]);

  /// Rearrange elements from striped order to blocked order.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// striped \p input across the work-group is:
  ///
  ///   { [0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511] }.
  ///
  /// The blocked order output is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param output The corresponding output data of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  striped_to_blocked(Item item, T (&input)[ElementsPerWorkItem],
                     T (&output)[ElementsPerWorkItem]);

  /// Inplace exchanges data items annotated by rank into blocked arrangement.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// striped \p input across the work-group is:
  ///
  ///   { [0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511] }.
  ///
  /// The rank across the work-group is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// The blocked order output is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param ranks The corresponding rank annotation of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  scatter_to_blocked(Item item, T (&input)[ElementsPerWorkItem],
                     int (&ranks)[ElementsPerWorkItem]);

  /// Inplace exchanges data items annotated by rank into striped arrangement.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// blocked \p input across the work-group is:
  ///
  ///   { [0, 1, 2, 3], [4, 5, 6, 7], ..., [508, 509, 510, 511] }.
  ///
  /// The rank across the work-group is:
  ///
  ///   { [16, 20, 24, 28], [32, 36, 40, 44], ..., [499, 503, 507, 511] }.
  ///
  /// The striped order output of each work-item will be:
  ///
  ///   { [0, 128, 256, 384], [1, 129, 257, 385], ..., [127, 255, 383, 511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param ranks The corresponding rank annotation of each work-item.
  template <typename Item>
  __syclcompat_inline__ void
  scatter_to_striped(Item item, T (&input)[ElementsPerWorkItem],
                     int (&ranks)[ElementsPerWorkItem]);
};

/// The work-group wide radix sort to sort integer data elements
/// assigned to all work-items in the work-group.
///
/// \tparam T The type of the data elements.
/// \tparam ElementsPerWorkItem The number of data elements assigned to
/// a work-item.
/// \tparam RADIX_BITS The number of radix bits per digit place.
template <typename T, int ElementsPerWorkItem, int RADIX_BITS = 4>
class group_radix_sort {
public:
  group_radix_sort(uint8_t *local_memory);

  static size_t get_local_memory_size(size_t group_threads);

  /// Performs an ascending work-group wide radix sort over a blocked
  /// arrangement of input elements.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }.
  ///
  /// The ascending order output is:
  ///
  ///   { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param begin_bit The beginning (least-significant) bit index needed for
  /// key comparison.
  /// \param end_bit The past-the-end (most-significant) bit
  /// index needed for key comparison.
  template <typename Item>
  __syclcompat_inline__ void
  sort(const Item &item, T (&input)[ElementsPerWorkItem], int begin_bit = 0,
       int end_bit = 8 * sizeof(T));

  /// Performs an descending work-group wide radix sort over a blocked
  /// arrangement of input elements.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }.
  ///
  /// The descending order output is:
  ///
  ///   { [511,510,509,508], [11,10,9,8], [7,6,5,4], ..., [3,2,1,0] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param begin_bit The beginning (least-significant) bit index needed for
  /// key comparison.
  /// \param end_bit The past-the-end (most-significant) bit
  /// index needed for key comparison.
  template <typename Item>
  __syclcompat_inline__ void
  sort_descending(const Item &item, T (&input)[ElementsPerWorkItem],
                  int begin_bit = 0, int end_bit = 8 * sizeof(T));

  /// Performs an ascending radix sort across a blocked arrangement of input
  /// elements, leaving them in a striped arrangement.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }.
  ///
  /// The corresponding output of each work-item will be:
  ///
  ///   { [0,128,256,384], [1,129,257,385], [2,130,258,386], ...,
  ///   [127,255,383,511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param begin_bit The beginning (least-significant) bit index needed for
  /// key comparison.
  /// \param end_bit The past-the-end (most-significant) bit
  /// index needed for key comparison.
  template <typename Item>
  __syclcompat_inline__ void
  sort_blocked_to_striped(const Item &item, T (&input)[ElementsPerWorkItem],
                          int begin_bit = 0, int end_bit = 8 * sizeof(T));

  /// Performs an descending radix sort across a blocked arrangement of input
  /// elements, leaving them in a striped arrangement.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }.
  ///
  /// The descending striped order output is:
  ///
  ///   { [0,128,256,384], [1,129,257,385], [2,130,258,386], ...,
  ///   [127,255,383,511] }.
  ///
  /// \tparam Item The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param begin_bit The beginning (least-significant) bit index needed for
  /// key comparison.
  /// \param end_bit The past-the-end (most-significant) bit
  /// index needed for key comparison.
  template <typename Item>
  __syclcompat_inline__ void sort_descending_blocked_to_striped(
      const Item &item, T (&input)[ElementsPerWorkItem], int begin_bit = 0,
      int end_bit = 8 * sizeof(T));
};

/// Load linear segment items into block format across threads
/// Helper for Block Load
enum load_algorithm {
  BLOCK_LOAD_DIRECT,
  BLOCK_LOAD_STRIPED,
};

/// Load a linear segment of elements into a blocked arrangement across the
/// work-group.
///
/// \tparam T The data type to load.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam InputIteratorT  The random-access iterator type for input \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param input_iter The work-group's base input iterator for loading from.
/// \param data Data to load.
template <typename T, size_t ElementsPerWorkItem, typename InputIteratorT,
          typename ItemT>
__syclcompat_inline__ void load_direct_blocked(const ItemT &item,
                                               InputIteratorT input_iter,
                                               T (&data)[ElementsPerWorkItem]);

/// Load a linear segment of elements into a striped arrangement across the
/// work-group.
///
/// \tparam T The data type to load.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam InputIteratorT  The random-access iterator type for input \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param input_iter The work-group's base input iterator for loading from.
/// \param data Data to load.
template <typename T, int ElementsPerWorkItem, typename InputIteratorT,
          typename ItemT>
__syclcompat_inline__ void load_direct_striped(const ItemT &item,
                                               InputIteratorT input_iter,
                                               T (&data)[ElementsPerWorkItem]);

/// Load a linear segment of elements into a blocked arrangement across the
/// work-group, guarded by range.
///
/// \tparam T The data type to load.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam InputIteratorT  The random-access iterator type for input \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param input_iter The work-group's base input iterator for loading from.
/// \param data Data to load.
/// \param valid_items Number of valid items to load
template <typename T, size_t ElementsPerWorkItem, typename InputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
load_direct_blocked(const ItemT &item, InputIteratorT input_iter,
                    T (&data)[ElementsPerWorkItem], int valid_items);

/// Load a linear segment of elements into a striped arrangement across the
/// work-group, guarded by range.
///
/// \tparam T The data type to load.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam InputIteratorT  The random-access iterator type for input \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param input_iter The work-group's base input iterator for loading from.
/// \param data Data to load.
/// \param valid_items Number of valid items to load
template <typename T, int ElementsPerWorkItem, typename InputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
load_direct_striped(const ItemT &item, InputIteratorT input_iter,
                    T (&data)[ElementsPerWorkItem], int valid_items);

/// Store a blocked arrangement of items across a work-group into a linear
/// segment of items.
///
/// \tparam T The data type to store.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam OutputIteratorT  The random-access iterator type for output.
/// \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param output_iter The work-group's base output iterator for writing.
/// \param data Data to store.
template <typename T, size_t ElementsPerWorkItem, typename OutputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
store_direct_blocked(const ItemT &item, OutputIteratorT output_iter,
                     T (&data)[ElementsPerWorkItem]);

/// Store a striped arrangement of items across a work-group into a linear
/// segment of items.
///
/// \tparam T The data type to store.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam OutputIteratorT  The random-access iterator type for output.
/// \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param output_iter The work-group's base output iterator for writing.
/// \param items Data to store.
template <typename T, size_t ElementsPerWorkItem, typename OutputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
store_direct_striped(const ItemT &item, OutputIteratorT output_iter,
                     T (&data)[ElementsPerWorkItem]);

/// Store a blocked arrangement of items across a work-group into a linear
/// segment of items, guarded by range.
///
/// \tparam T The data type to store.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam OutputIteratorT  The random-access iterator type for output.
/// \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param output_iter The work-group's base output iterator for writing.
/// \param data Data to store.
/// \param valid_items Number of valid items to load
template <typename T, size_t ElementsPerWorkItem, typename OutputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
store_direct_blocked(const ItemT &item, OutputIteratorT output_iter,
                     T (&data)[ElementsPerWorkItem], size_t valid_items);

/// Store a striped arrangement of items across a work-group into a linear
/// segment of items, guarded by range.
///
/// \tparam T The data type to store.
/// \tparam ElementsPerWorkItem The number of consecutive elements partitioned
/// onto each work-item.
/// \tparam OutputIteratorT  The random-access iterator type for output.
/// \iterator.
/// \tparam ItemT The sycl::nd_item index space class.
/// \param item The calling work-item.
/// \param output_iter The work-group's base output iterator for writing.
/// \param items Data to store.
/// \param valid_items Number of valid items to load
template <typename T, size_t ElementsPerWorkItem, typename OutputIteratorT,
          typename ItemT>
__syclcompat_inline__ void
store_direct_striped(const ItemT &item, OutputIteratorT output_iter,
                     T (&data)[ElementsPerWorkItem], size_t valid_items);

/// Enumerates alternative algorithms for syclcompat::group::group_load to read
/// a linear segment of data from memory into a blocked arrangement across a
/// work-group.
enum class group_load_algorithm {
  /// A blocked arrangement of data is read directly from memory.
  blocked,

  /// A striped arrangement of data is read directly from memory.
  striped
};

/// Provide methods for loading a linear segment of items from memory into a
/// blocked arrangement across a work-group.
///
/// \tparam T The input data type.
/// \tparam ElementsPerWorkItem The number of data elements assigned to a
/// work-item.
/// \tparam LoadAlgorithm The data movement strategy, default is blocked.
template <typename T, size_t ElementsPerWorkItem,
          group_load_algorithm LoadAlgorithm = group_load_algorithm::blocked>
class group_load {
public:
  group_load(uint8_t *);

  static size_t get_local_memory_size([[maybe_unused]] size_t work_group_size);

  /// Load a linear segment of items from memory.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   1, 2, 3, 4, 5, 6, 7, ..., 508, 509, 510, 511.
  ///
  /// The blocked order \p data of each work-item will be:
  ///
  ///   {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
  ///
  /// The striped order \p output of each work-item will be:
  ///
  ///   {[0,128,256,384], [1,129,257,385], ..., [127,255,383,511]}.
  ///
  /// \tparam ItemT The sycl::nd_item index space class.
  /// \tparam InputIteratorT The random-access iterator type for input
  /// \iterator.
  /// \param item The work-item identifier.
  /// \param input_iter The work-group's base input iterator for loading from.
  /// \param data The data to load.
  template <typename ItemT, typename InputIteratorT>
  __syclcompat_inline__ void load(const ItemT &item, InputIteratorT input_iter,
                                  T (&data)[ElementsPerWorkItem]);

  /// Load a linear segment of items from memory, guarded by range.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and
  /// valid_items is 5, the \p input across the work-group is:
  ///
  ///   0, 1, 2, 3, 4, 5, 6, 7, ..., 508, 509, 510, 511.
  ///
  /// The blocked order \p data of each work-item will be:
  ///
  ///   {[0,1,2,3], [4,?,?,?], ..., [?,?,?,?]}.
  ///
  /// The striped order \p output of each work-item will be:
  ///
  ///   {[0,?,?,?], [1,?,?,?], [2,?,?,?], [3,?,?,?] ..., [?,?,?,?]}.
  ///
  /// \tparam ItemT The sycl::nd_item index space class.
  /// \tparam InputIteratorT The random-access iterator type for input
  /// \iterator.
  /// \param item The work-item identifier.
  /// \param input_iter The work-group's base input iterator for loading from.
  /// \param data The data to load.
  /// \param valid_items Number of valid items to load
  template <typename ItemT, typename InputIteratorT>
  __syclcompat_inline__ void load(const ItemT &item, InputIteratorT input_iter,
                                  T (&data)[ElementsPerWorkItem],
                                  int valid_items);
};

/// Enumerates alternative algorithms for syclcompat::group::group_load to write
/// a blocked arrangement of items across a work-group to a linear segment of
/// memory.
enum class group_store_algorithm {
  /// A blocked arrangement of data is written directly to memory.
  blocked,

  /// A striped arrangement of data is written directly to memory.
  striped,
};

/// Provide methods for writing a blocked arrangement of elements partitioned
/// across a work-group to a linear segment of memory.
///
/// \tparam T The output data type.
/// \tparam ElementsPerWorkItem The number of data elements assigned to a
/// work-item.
/// \tparam StoreAlgorithm The data movement strategy, default is blocked.
template <typename T, size_t ElementsPerWorkItem,
          group_store_algorithm StoreAlgorithm = group_store_algorithm::blocked>
class group_store {
public:
  group_store(uint8_t *);

  static size_t get_local_memory_size([[maybe_unused]] size_t work_group_size);

  /// Store items into a linear segment of memory.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and the
  /// \p input across the work-group is:
  ///
  ///   {[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}.
  ///
  /// The blocked order \p output will be:
  ///
  ///   1, 2, 3, 4, 5, 6, 7, ..., 508, 509, 510, 511.
  ///
  /// The striped order \p output will be:
  ///
  ///   0, 128, 256, 384, 1, 129, 257, 385, ..., 127, 255, 383, 511.
  ///
  /// \tparam ItemT The sycl::nd_item index space class.
  /// \tparam OutputIteratorT The random-access iterator type for \p output
  /// iterator.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param data The data to store.
  template <typename ItemT, typename OutputIteratorT>
  __syclcompat_inline__ void store(const ItemT &item,
                                   OutputIteratorT output_iter,
                                   T (&data)[ElementsPerWorkItem]);

  /// Store items into a linear segment of memory, guarded by range.
  ///
  /// Suppose 512 integer data elements partitioned across 128 work-items, where
  /// each work-item owns 4 ( \p ElementsPerWorkItem ) data elements and
  /// \p valid_items is 5, the \p output across the work-group is:
  ///
  ///   {[0,0,0,0], [0,0,0,0], ..., [0,0,0,0]}.
  ///
  /// The blocked order \p output will be:
  ///
  ///   0, 1, 2, 3, 4, 5, 0, 0, ..., 0, 0, 0, 0.
  ///
  /// The striped order \p output will be:
  ///
  ///   0, 4, 8, 12, 16, 0, 0, 0, ..., 0, 0, 0, 0.
  ///
  /// \tparam ItemT The sycl::nd_item index space class.
  /// \tparam OutputIteratorT The random-access iterator type for \p output
  /// iterator.
  /// \param item The work-item identifier.
  /// \param input The input data of each work-item.
  /// \param data The data to store.
  /// \param valid_items Number of valid items to load
  template <typename ItemT, typename OutputIteratorT>
  __syclcompat_inline__ void
  store(const ItemT &item, OutputIteratorT output_iter,
        T (&data)[ElementsPerWorkItem], size_t valid_items);
};

/// The work-group wide shuffle operations that allow work-items to exchange
/// data elements with other work-items within the same work-group.
///
/// \tparam T The type of the data elements.
/// \tparam group_dim_0 The first dimension size of the work-group.
/// \tparam group_dim_1 The second dimension size of the work-group.
/// \tparam group_dim_2 The third dimension size of the work-group.
template <typename T, int group_dim_0, int group_dim_1 = 1, int group_dim_2 = 1>
class group_shuffle {
public:
  group_shuffle(uint8_t *local_memory);

  static constexpr size_t get_local_memory_size(size_t work_group_size);

  /// Selects a value from a work-item at a given distance in the work-group
  /// and stores the value in the output.
  ///
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input from the calling work-item.
  /// \param output The output where the selected data will be stored.
  /// \param distance The distance of work-items to look ahead or behind in the
  /// work-group.
  template <typename ItemT>
  __syclcompat_inline__ void select(const ItemT &item, T input, T &output,
                                    int distance = 1);
  /// Selects a value from a work-item at a given distance in the work-group
  /// and stores the value in the output, using a wrapped index to handle
  /// overflow.
  ///
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data to be selected.
  /// \param output The output where the selected data will be stored.
  /// \param distance The number of work-items to look ahead in the
  /// work-group.
  template <typename ItemT>
  __syclcompat_inline__ void select2(const ItemT &item, T input, T &output,
                                     unsigned int distance = 1);
  /// Performs a shuffle operation to move data to the right across the
  /// work-items, shifting elements in a work-item array by one position to the
  /// right.
  ///
  /// \tparam ElementsPerWorkItem The number of data elements per work-item.
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data to be shuffled.
  /// \param output The array that will store the shuffle result.
  template <int ElementsPerWorkItem, typename ItemT>
  __syclcompat_inline__ void shuffle_right(const ItemT &item,
                                           T (&input)[ElementsPerWorkItem],
                                           T (&output)[ElementsPerWorkItem]);

  /// Performs a shuffle operation to move data to the right across the
  /// work-items, storing the suffix of the group after the shuffle operation.
  ///
  /// \tparam ElementsPerWorkItem The number of data elements per work-item.
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data to be shuffled.
  /// \param output The array that will store the shuffle result.
  /// \param group_suffix The suffix of the group after the shuffle.
  template <int ElementsPerWorkItem, typename ItemT>
  __syclcompat_inline__ void
  shuffle_right(const ItemT &item, T (&input)[ElementsPerWorkItem],
                T (&output)[ElementsPerWorkItem], T &group_suffix);

  /// Performs a shuffle operation to move data to the left across the
  /// work-items, shifting elements in a work-item array by one position to the
  /// left.
  ///
  /// \tparam ElementsPerWorkItem The number of data elements per work-item.
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data to be shuffled.
  /// \param output The array that will store the shuffle result.
  template <int ElementsPerWorkItem, typename ItemT>
  __syclcompat_inline__ void shuffle_left(const ItemT &item,
                                          T (&input)[ElementsPerWorkItem],
                                          T (&output)[ElementsPerWorkItem]);

  /// Performs a shuffle operation to move data to the left across the
  /// work-items, storing the prefix of the group before the shuffle operation.
  ///
  /// \tparam ElementsPerWorkItem The number of data elements per work-item.
  /// \tparam ItemT The work-item identifier type.
  /// \param item The work-item identifier.
  /// \param input The input data to be shuffled.
  /// \param output The array that will store the shuffle result.
  /// \param group_prefix The prefix of the group before the shuffle.
  template <int ElementsPerWorkItem, typename ItemT>
  __syclcompat_inline__ void
  shuffle_left(const ItemT &item, T (&input)[ElementsPerWorkItem],
               T (&output)[ElementsPerWorkItem], T &group_prefix);
};
} // namespace syclcompat

Sample Code¶

Below is a simple linear algebra sample, which computes y = mx + b implemented using this library:

#include <cassert>
#include <iostream>

#include <syclcompat.hpp>
#include <sycl/sycl.hpp>

/**
 * Slope intercept form of a straight line equation: Y = m * X + b
 */
template <int BLOCK_SIZE>
void slope_intercept(float *Y, float *X, float m, float b, size_t n) {

  // Block index
  size_t bx = syclcompat::work_group_id::x();
  // Thread index
  size_t tx = syclcompat::local_id::x();

  size_t i = bx * BLOCK_SIZE + tx;
  // or  i = syclcompat::global_id::x();
  if (i < n)
    Y[i] = m * X[i] + b;
}

void check_memory(void *ptr, std::string msg) {
  if (ptr == nullptr) {
    std::cerr << "Failed to allocate memory: " << msg << std::endl;
    exit(EXIT_FAILURE);
  }
}

/**
 * Program main
 */
int main(int argc, char **argv) {
  std::cout << "Simple Kernel example" << std::endl;

  constexpr size_t n_points = 32;
  constexpr float m = 1.5f;
  constexpr float b = 0.5f;

  int block_size = 32;
  if (block_size > syclcompat::get_current_device()
                       .get_info<sycl::info::device::max_work_group_size>())
    block_size = 16;

  std::cout << "block_size = " << block_size << ", n_points = " << n_points
            << std::endl;

  // Allocate host memory for vectors X and Y
  size_t mem_size = n_points * sizeof(float);
  float *h_X = (float *)syclcompat::malloc_host(mem_size);
  float *h_Y = (float *)syclcompat::malloc_host(mem_size);
  check_memory(h_X, "h_X allocation failed.");
  check_memory(h_Y, "h_Y allocation failed.");

  // Alternative templated allocation for the expected output
  float *h_expected = syclcompat::malloc_host<float>(n_points);
  check_memory(h_expected, "Not enough for h_expected.");

  // Initialize host memory & expected output
  for (size_t i = 0; i < n_points; i++) {
    h_X[i] = i + 1;
    h_expected[i] = m * h_X[i] + b;
  }

  // Allocate device memory
  float *d_X = (float *)syclcompat::malloc(mem_size);
  float *d_Y = (float *)syclcompat::malloc(mem_size);
  check_memory(d_X, "d_X allocation failed.");
  check_memory(d_Y, "d_Y allocation failed.");

  // copy host memory to device
  syclcompat::memcpy(d_X, h_X, mem_size);

  size_t threads = block_size;
  size_t grid = n_points / block_size;

  std::cout << "Computing result using SYCL Kernel... ";
  if (block_size == 16) {
    syclcompat::launch<slope_intercept<16>>(grid, threads, d_Y, d_X, m, b,
                                        n_points);
  } else {
    syclcompat::launch<slope_intercept<32>>(grid, threads, d_Y, d_X, m, b,
                                        n_points);
  }
  syclcompat::wait();
  std::cout << "DONE" << std::endl;

  // Async copy result from device to host
  syclcompat::memcpy_async(h_Y, d_Y, mem_size).wait();

  // Check output
  for (size_t i = 0; i < n_points; i++) {
    assert(h_Y[i] - h_expected[i] < 1e-6);
  }

  // Clean up memory
  syclcompat::free(h_X);
  syclcompat::free(h_Y);
  syclcompat::free(h_expected);
  syclcompat::free(d_X);
  syclcompat::free(d_Y);

  return 0;
}

Maintainers¶

To report problems with this library, please open a new issue with the [COMPAT] tag at:

https://github.com/intel/llvm/issues

Contributors¶

Alberto Cabrera, Codeplay
Gordon Brown, Codeplay
Joe Todd, Codeplay
Pietro Ghiglio, Codeplay
Ruyman Reyes, Codeplay/Intel

Contributions¶

This library is licensed under the Apache 2.0 license. If you have an idea for a new sample, different build system integration or even a fix for something that is broken, please get in contact.