Behavior for optional kernel features

This design document describes the changes that are needed in DPC++ in order to conform to the SYCL 2020 specification regarding the behavior of applications that use optional kernel features. An optional kernel feature is any feature that is supported by some devices and not by others. For example, not all devices support 16-bit floating point operations, so the sycl::half data type is an optional kernel feature. Some DPC++ extensions like AMX are also optional kernel features.

The requirements for this design come mostly from the SYCL 2020 specification section 5.7 “Optional kernel features” but they also encompass the C++ attribute [[sycl::device_has()]] that is described in section 5.8.1 “Kernel attributes” and section 5.8.2 “Device function attributes”.

At the time this document was written, the published version of the SYCL 2020 specification does not include the [[sycl::device_has()]] attribute. The published spec refers to this attribute as [[sycl::requires()]], but the attribute was renamed to [[sycl::device_has()]] in this pull request to the SYCL 2020 specification.

Definition of terms

A kernel’s static call graph

The term “static call graph” of a kernel means the set of all functions that a kernel may call, including functions that are called transitively from other functions. In standard SYCL, device code is not allowed to contain function pointers, virtual functions, or indirect function calls. It is therefore easy to compute the static call graph of a kernel. By starting at the kernel function itself (e.g. the function passed to parallel_for), the compiler can identify all functions called by that function, then it can find all functions called by those functions, etc. Depending on the tool which does the analysis, the “static call graph” could include only those functions that reside in the same translation unit as the kernel, or it could include all functions that reside in the same executable image (or shared library) as the kernel. In the sections below, we try to make the distinction clear whenever we refer to a kernel’s static call graph.

We are contemplating a DPC++ extension that would allow some limited use of function pointers in device code. This feature is not yet fully defined or supported. We expect that the semantics of this feature will include some way for the compiler to deduce a limited set of possible targets for each indirect function call. Therefore, it is still possible for the compiler to construct a “static call graph” for each kernel, the only difference is that each call site now adds a set of possible target functions to a kernel’s static call graph. The details about how this will work are expected to be included in the DPC++ extension specification that enables indirect function calls.

An exported device function

The term “exported device function” means a device function that is exported from a shared library as defined by Device Code Dynamic Linking.

The FE compiler

The term “FE compiler” refers to the entire DPC++ compiler chain that runs when the user executes the clang++ command. This includes the clang front-end itself, all passes over LLVM IR, the post-link tool, and any AOT compilation phases (when the user compiles in AOT mode). The FE compiler does not include the JIT compiler which translates SPIR-V (or another IL format) into native code when the application executes.

Requirements

There are several categories of requirements covered by this design. Each of these is described in more detail in the sections that follow:

  • The FE compiler must issue a diagnostic in some cases when a kernel or device function uses an optional feature. However, the FE compiler must not generate a diagnostic in other cases.

  • The runtime must raise an exception when a kernel using optional features is submitted to a device that does not support those features. This exception must be raised synchronously from the kernel invocation command (e.g. parallel_for()).

  • The runtime must not raise an exception (or otherwise fail) merely due to speculative compilation of a kernel for a device, when the application does not specifically submit the kernel to that device.

Diagnostics from the FE compiler

In general, the FE compiler does not know which kernels the application will submit to which devices. Therefore, the FE compiler does not generally know which optional features a kernel can legally use. Thus, in general, the FE compiler must not issue any diagnostic simply because a kernel uses an optional feature.

The only exception to this rule occurs when the application uses the C++ attribute [[sycl::device_has()]]. When the application decorates a kernel or device function with this attribute, it is an assertion that the kernel or device function is allowed to use only those optional features which are listed by the attribute. Therefore, the FE compiler must issue a diagnostic if the kernel or device function uses any other optional kernel features.

The SYCL 2020 specification only mandates this diagnostic when a kernel or device function that is decorated with [[sycl::device_has()]] uses an optional kernel feature (not listed in the attribute), and when that use is in the kernel’s static call graph as computed for the translation unit that contains the kernel function. Thus, the compiler is not required to issue a diagnostic if the use is in a SYCL_EXTERNAL function that is defined in another translation unit.

Note that this behavior does not change when the compiler runs in AOT mode. Even if the user specifies a target device via “-fsycl-targets”, that does not necessarily mean that the user expects all the code in the application to be runnable on that device. Consider an application that uses some middleware library, where the library’s header contains kernels optimized for different devices. An application should be able to compile in AOT mode with this library without getting errors. Therefore the AOT compiler must not fail simply because the middleware header contains device code for devices that are not being compiled for.

Runtime exception if device doesn’t support feature

When the application submits a kernel to a device via one of the kernel invocation commands (e.g. parallel_for()), the runtime must check if the kernel uses optional features that are not supported on that device. If the kernel uses an unsupported feature, the runtime must throw a synchronous errc::kernel_not_supported exception.

When doing these checks, the runtime must consider all uses of optional features in the kernel’s static call graph, regardless of whether those uses are in the same translation unit as the kernel and regardless of whether those uses come from device code in a shared library.

This exception, however, is only required for features that are exposed via a C++ type or function. Examples of this include sycl::half or instantiating sycl::atomic_ref for a 64-bit type. If the kernel relies on optional features that are more “notional” such as sub-group independent forward progress (info::device::sub_group_independent_forward_progress), no exception is required.

To further clarify, this exception must be thrown in the following circumstances:

  • For a kernel that is not decorated with [[sycl::device_has()]], the exception must be thrown if the kernel uses a feature that the device does not support.

  • For a kernel that is decorated with [[sycl::device_has()]], the exception must be thrown if the device does not have the aspects listed in that attribute. Note that the exception must be thrown even if the kernel does not actually use a feature corresponding to the aspect, and it must be thrown even if the aspect does not correspond to any optional feature.

  • For a kernel that is decorated with [[sycl::device_has()]], the FE compiler will mostly check (at compile time) whether the kernel uses any features that are not listed in the attribute. However, this check only results in a warning, so the runtime is still responsible for throwing the exception if any of the functions called by the kernel uses an optional feature that the device does not support.

  • For a kernel that is decorated with the [[sycl::reqd_work_group_size(W)]] or [[sycl::reqd_sub_group_size(S)]] attribute, the exception must be thrown if the device does not support the work group size W or the sub-group size S.

Note that the exception must be thrown synchronously, not delayed and thrown on the queue’s asynchronous handler.

No runtime exception for speculative compilation

It is currently common for the runtime to speculatively compile some kernels. For example, DPC++ may bundle all kernels from the same translation unit together into a single device image. When the application submits one kernel K to a device D, the runtime actually compiles all kernels in K’s device image for device D. Let’s assume in this example that the kernel K uses only features that are supported by D. It would be illegal for the runtime to throw an exception in such a case just because some other kernel in the same device image uses a feature that is not supported by device D.

Design

Changes to DPC++ headers

With the exception of the [[sycl::reqd_work_group_size()]] and [[sycl::reqd_sub_group_size()]] attributes, all kernel optional features are associated with some device aspect. For example, the sycl::half type is an optional feature which is only supported on devices that have the aspect::fp16 aspect. We can therefore use device aspects as a way to describe the set of optional features that a kernel uses (with the exception of the required work-group or sub-group size).

As we will see later, it is helpful to decorate all APIs in DPC++ headers that correspond to optional kernel features with a C++ attribute that identifies the associated aspect. We cannot use the [[sycl::device_has()]] attribute for this purpose, though, because that attribute is allowed only for functions. Instead, we invent a new internal attribute [[sycl_detail::uses_aspects()]] that can be used to decorate both functions and types. This attribute is not documented for user code; instead it is an internal implementation detail of DPC++.

Like all use of C++ attributes in the DPC++ headers, the headers should spell the attribute using initial and trailing double underscores ([[__sycl_detail__::__uses_aspects__()]]). We show that form in the code samples below, but this design document uses the form without the underbars elsewhere. Both forms refer to the same attribute.

To illustrate, the type sycl::half is an optional feature whose associated aspect is aspect::fp16. We therefore decorate the declaration like this:

using half [[__sycl_detail__::__uses_aspects__(aspect::fp16)]] =
  cl::sycl::detail::half_impl::half;

If an optional feature is expressed as a class type, it can be similarly decorated (here illustrating a hypothetical AMX type):

class [[__sycl_detail__::__uses_aspects__(aspect::ext_intel_amx)]] amx_type {
  /* ... */
};

This attribute is also used to decorate function declarations that correspond to optional features. Again, illustrating a hypothetical AMX extension:

[[__sycl_detail__::__uses_aspects__(aspect::ext_intel_amx)]]
void amx_multiply();

This attribute can also be used to decorate class templates where only certain instantiations correspond to optional features. See “Appendix: Adding an attribute to 8-byte atomic_ref for an illustration of how this attribute can be used in conjunction with partial specialization to mark only certain instantiations of sycl::atomic_ref as an optional feature.

Although the examples above show only a single aspect parameter to the [[sycl_detail::uses_aspects()]] attribute, this attribute should support a list of aspects, similar to the [[sycl::device_has()]] attribute. This will allow us to support future features that depend on a conjunction of aspects (e.g. a feature that does atomic operations on 64-bit floating point values might be decorated with [[sycl_detail::uses_aspects(aspect::fp64, aspect::atomic64)]]).

Unfortunately, the fundamental type double is also an optional kernel feature. Since there is no type alias for double, there is no convenient place to add an attribute. Instead, the FE device compiler must behave as though there was an implicit [[sycl_detail::uses_aspects(aspect::fp64)]] attribute for any device code that uses the double type.

New LLVM IR metadata

In order to communicate the information from [[sycl::device_has()]] and [[sycl_detail::uses_aspects()]] attributes to the DPC++ post-link tool, we introduce several new LLVM IR metadata.

The named metadata !intel_types_that_use_aspects conveys information about types that are decorated with [[sycl_detail::uses_aspects()]]. This metadata is not referenced by any instruction in the module, so it must be looked up by name. The format looks like this:

!intel_types_that_use_aspects = !{!0, !1, !2}
!0 = !{!"class.cl::sycl::detail::half_impl::half", i32 8}
!1 = !{!"class.cl::sycl::amx_type", i32 9}
!2 = !{!"class.cl::sycl::other_type", i32 8, i32 9}

The value of the !intel_types_that_use_aspects metadata is a list of unnamed metadata nodes, each of which describes one type that is decorated with [[sycl_detail::uses_aspects()]]. The value of each unnamed metadata node starts with a string giving the name of the type which is followed by a list of i32 constants where each constant is a value from enum class aspect telling the numerical value of an aspect from the type’s [[sycl_detail::uses_aspects()]] attribute. In the example above, the type cl::sycl::detail::half_impl::half uses an aspect whose numerical value is 8 and the type cl::sycl::other_type uses two aspects 8 and 9.

NOTE: The reason we choose this representation is because LLVM IR does not allow metadata to be attached directly to types. This representation works around that limitation by creating global named metadata that references the type’s name.

We also introduce two metadata that can be attached to a function definition similar to the existing !intel_reqd_sub_group_size. The !intel_declared_aspects metadata is used for functions that are decorated with [[sycl::device_has()]], and the !intel_used_aspects metadata is used to store the propagated information about all aspects used by a kernel or exported device function.

In each case, the metadata’s parameter is an unnamed metadata node, and the value of the metadata node is a list of i32 constants, where each constant is a value from enum class aspect.

For example, the following illustrates the IR that corresponds to a function foo that is decorated with [[sycl::device_has()]] where the required aspects have the numerical values 8 and 9. In addition, the function uses an optional feature that corresponds to an aspect with numerical value 8.

define void @foo() !intel_declared_aspects !1 !intel_used_aspects !2 {}
!1 = !{i32 8, i32 9}
!2 = !{i32 8}

Changes to the DPC++ front-end

The front-end of the device compiler is responsible for parsing the [[sycl::device_has()]] and [[sycl_detail::uses_aspects()]] attributes and transferring the information to the LLVM IR metadata described above according to the following rules:

  • If the translation unit contains any type definitions that are decorated with [[sycl_detail::uses_aspects()]], the front-end creates an !intel_types_that_use_aspects metadata describing the aspects used by all such types.

  • If a function is decorated with [[sycl_detail::uses_aspects()]], the front-end adds an !intel_used_aspects metadata to the function’s definition listing the aspects from that attribute.

  • If a function is decorated with [[sycl::device_has()]], the front-end adds an !intel_declared_aspects metadata to the function’s definition listing the aspects from that attribute.

New LLVM IR pass to propagate aspect usage

We add a new IR phase to the device compiler which does the following:

  • Creates (or augments) a function’s !intel_used_aspects metadata with aspects that come from references to types in the intel_types_that_use_aspects list.

  • Propagates each function’s !intel_used_aspects metadata up the static call graph so that each function lists the aspects used by that function and by any functions it calls.

  • Diagnoses a warning if any function that has !intel_declared_aspects uses an aspect not listed in that declared set.

It is important that this IR phase runs before any other optimization phase that might eliminate a reference to a type or inline a function call because such optimizations will cause us to miss information about aspects that are used. Therefore, it is recommended that this new phase run first, before all other IR phases.

Implementing the first bullet point is straightforward. The implementation can scan the IR for each function looking for instructions that reference a type. It can then see if that type is in the !intel_types_that_use_aspects set; if so it adds the type’s aspects to the function’s !intel_used_aspects set. While doing this, the implementation must have a special case for the double type because the front-end does not include that type in the !intel_types_that_use_aspects set. If a function references the double type, the implementation implicitly assumes that the function uses aspect::fp64 and adds that aspect to the function’s !intel_used_aspects set.

NOTE: This scan of the IR will require comparing the type referenced by each IR instruction with the names of the types in the !intel_types_that_use_aspects metadata. It would be very inefficient if we did a string comparison each time. As an optimization, the implementation can first lookup up each type name in the !intel_types_that_use_aspects metadata set, finding the “type pointer” that corresponds to each type name. Then the pass over the IR can compare the type pointer in each IR instruction with the type pointers from the !intel_types_that_use_aspects metadata set.

The second bullet point requires building the static call graph, but the implementation need not scan the instructions in each function. Instead, it need only look at the !intel_used_aspects metadata for each function, propagating the aspects used by each function up to it callers and augmenting the caller’s !intel_used_aspects set.

Diagnosing warnings is then straightforward. The implement looks for functions that have !intel_declared_aspects and compares that set with the !intel_used_aspects set (if any). If a function uses an aspect that is not in the declared set, the implementation issues a warning.

One weakness of this design is that the warning message will only be able to contain the source location of the problem if the compiler was invoked with -g because this is the only time when the front-end propagates source location information into the IR. To compensate, the warning message displays the static call chain that leads to the problem. For example:

warning: function 'foo' uses aspect 'fp64' not listed in 'sycl::device_has'
use is from this call chain:
  foo()
  bar()
  boo()
compile with '-g' to get source location

Including the call chain in the warning message will require maintaining some additional information during the traversal of the static call graph described above.

When the compiler is invoked with -g, the implementation uses the !DILocation metadata to improve the warning message with source file, line, and column information like so:

hw.cpp:27:4: warning: function 'foo' uses aspect 'fp64' not listed in 'sycl::device_has'
use is from this call chain:
  foo()
  bar() hw.cpp:15:3
  boo() hw.cpp:25:5

In the example above, the location hw.cpp:27:4 gives the source location of the code that uses the fp64 aspect, in this case somewhere in the boo() function. The location hw.cpp:15:3 tells the location in foo() of the call to bar(), etc.

NOTE: Issuing this warning message from an IR pass is a compromise. We would get better source location if the front-end diagnosed this warning. However, we feel that the analysis required to diagnose this warning would be too expensive in the front-end because it requires an additional pass over the AST. By contrast, we can diagnose the warning more efficiently in an IR pass because traversal of the IR is much more efficient than traversal of the AST. The downside, though, is that the warning message is less informative.

Assumptions on other phases of clang

The post-link tool (described below) uses the !intel_used_aspects and !intel_declared_aspects metadata, so this metadata must be retained by any other clang passes. However, post-link only uses this metadata when it decorates the definition of a kernel function or the definition of an exported device function, so it does not matter if intervening clang passes discard the metadata on other device functions.

We think this is a safe assumption for two reasons. First, the existing design must already preserve the !reqd_work_group_size metadata that decorates kernel functions. Second, the kernel functions and exported device functions always have external linkage, so there is no possibility that a clang phase will optimize them away.

NOTE: Ideally, we would change the llvm-link tool to somehow preserve the !intel_declared_aspects and !intel_used_aspects metadata for functions marked SYCL_EXTERNAL so that we could compare the declared aspects (in the module that imports the function) with the used aspects (in the module the exports the function). This would allow us to diagnose errors where the importing translation unit’s declared aspects do not match the aspects actually used by the function.

We do not propose this change as part of this design, though. We expect that this will not be a common error because applications can avoid this problem by declaring the SYCL_EXTERNAL function in a common header that is included by both the importing and the exporting translation unit. If the declaration (in the header) is decorated with [[sycl::device_has()]], the shared declaration will ensure that the definition stays in sync with the declaration.

Changes to the post-link tool

As noted in the requirements section above, DPC++ currently bundles kernels together regardless of the optional features they use, and this can lead to problems resulting from speculative compilation. To illustrate, consider kernel K1 that uses no optional features and kernel K2 that uses a feature corresponding to aspect A, and consider the case when K1 and K2 are bundled together in the same device image. Now consider an application that submits K1 to a device that does not have aspect A. The application should expect this to work, but DPC++ currently fails because JIT-compiling K1 causes the entire bundle to be compiled, and this fails when trying to compile K2 for a device that does not have aspect A.

We solve this problem by changing the post-link tool to bundle kernels and exported device functions according to the aspects that they use.

Changes to the device code split algorithm

The algorithm for splitting device functions into images must be changed to account for the aspects used by each kernel or exported device function. The goal is to ensure that two kernels or exported device functions are only bundled together into the same device image if they use exactly the same set of aspects.

For the purposes of this analysis, the set of Used aspects is computed by taking the union of the aspects listed in the kernel’s (or device function’s) !intel_used_aspects and !intel_declared_aspects sets. This is consistent with the SYCL specification, which says that a kernel decorated with [[sycl::device_has()]] may only be submitted to a device that provides the listed aspects, regardless of whether the kernel actually uses those aspects.

We must also split two kernels into different device images if they have different [[sycl::reqd_sub_group_size()]] or different [[sycl::reqd_work_group_size()]] values. The reasoning is similar as the aspect case. The JIT compiler currently raises an error if it tries to compile a kernel that has a required sub-group size if the size isn’t supported by the target device. The behavior for required work-group size is less clear. The Intel implementation does not raise a JIT compilation error when compiling a kernel that uses an unsupported work-group size, but other backends might. Therefore, it seems safest to split device code based required work-group size also.

Therefore, two kernels or exported device functions are only bundled together into the same device image if all of the following are true:

  • They share the same set of Used aspects,

  • They either both have no required work-group size or both have the same required work-group size, and

  • They either both have the same numeric value for their required sub-group size or neither has a numeric value for a required sub-group size. (Note that this implies that kernels decorated with [[intel::named_sub_group_size(automatic)]] can be bundled together with kernels that are decorated with [[intel::named_sub_group_size(primary)]] and that either of these kernels could be bundled with a kernel that has no required sub-group size.)

These criteria are an additional filter applied to the device code split algorithm after taking into account the -fsycl-device-code-split command line option. If the user requests per_kernel device code split, then each kernel is already in its own device image, so no further splitting is required. If the user requests any other option, device code is first split according to that option, and then another split is performed to ensure that each device image contains only kernels or exported device functions that meet the criteria listed above.

Create the “SYCL/device-requirements” property set

The DPC++ runtime needs some way to know about the Used aspects, required sub-group size, and required work-group size of an image. Therefore, the post-link tool provides this information in a new property set named “SYCL/device-requirements”.

The following table lists the properties that this set may contain and their types:

Property Name

Property Type

“aspect”

PI_PROPERTY_TYPE_BYTE_ARRAY

“reqd_sub_group_size”

PI_PROPERTY_TYPE_BYTE_ARRAY

“reqd_work_group_size”

PI_PROPERTY_TYPE_BYTE_ARRAY

There is an “aspect” property if the image’s Used set is not empty. The value of the property is an array of uint32 values, where each uint32 value is the numerical value of an aspect from enum class aspect. The size of the property (which is always divisible by 4) tells the number of aspects in the array.

There is a “reqd_sub_group_size” property if the image contains any kernels with a numeric required sub-group size. (I.e. this excludes kernels where the required sub-group size is a named value like automatic or primary.) The value of the property is a uint32 value that tells the required size.

There is a “reqd_work_group_size” property if the image contains any kernels with a required work-group size. The value of the property is a BYTE_ARRAY with the following layout:

<dim_count (uint32)> <dim0 (uint32)> ...

Where dim_count is the number of work group dimensions (i.e. 1, 2, or 3), and dim0 ... are the values of the dimensions from the [[reqd_work_group_size()]] attribute, in the same order as they appear in the attribute.

NOTE: One may wonder why the type of the “reqd_sub_group_size” property is not PI_PROPERTY_TYPE_UINT32 since its value is always 32-bits. The reason is that we may want to expand this property in the future to contain a list of required sub-group sizes. Likewise, the “reqd_work_group_size” property may be expanded in the future to contain a list of required work-group sizes.

Changes specific to AOT mode

In AOT mode, for each AOT target specified by the -fsycl-targets command line option, DPC++ normally invokes the AOT compiler for each device IR module resulting from the sycl-post-link tool. For example, this is the ocloc command for Intel Graphics AOT target and the opencl-aot command for the x86 AOT target with SPIR-V as the input, or other specific tools for the PTX target with LLVM IR bitcode input. This causes a problem, though, for IR modules that use optional features because these commands could fail if they attempt to compile IR using an optional feature that is not supported by the target device. We therefore need some way to avoid calling these commands in these cases.

The overall design is as follows. The DPC++ installation includes a configuration file that has one entry for each device that it supports. Each entry contains the set of aspects that the device supports and the set of sub-group sizes that it supports. DPC++ then consults this configuration file to decide whether to invoke a particular AOT compiler on each device IR module, using the information from the module’s “SYCL/device-requirements” property set.

Device configuration file

The configuration file uses a simple YAML format where each top-level key is a name of a device architecture. There are sub-keys under each device for the supported aspects and sub-group sizes. For example:

intel_gpu_11_1:
  aspects: [1, 2, 3]
  sub-group-sizes: [8, 16]
intel_gpu_icl:
  aspects: [2, 3]
  sub-group-sizes: [8, 16]
x86_64_avx512:
  aspects: [1, 2, 3, 9, 11]
  sub-group-sizes: [8, 32]

The values of the aspects in this configuration file can be the numerical values from the enum class aspect enumeration or the enum identifier itself.

One advantage to encoding this information in a textual configuration file is that customers can update the file if necessary. This could be useful, for example, if a new device is released before there is a new DPC++ release. In fact, the DPC++ driver supports a command line option which allows the user to select an alternate configuration file.

TODO:

  • Define location of the default device configuration file.

New features in clang compilation driver and tools

NOTE: the term device binary image used to refer to a device code form consumable by the DPC++ runtime library. Earlier device code forms are referred to as device code module or device IR module. In case of AOT, device binary image is a natively compiled binary, and IR module - either a SPIR-V or LLVM IR bitcode module.

Overview

After the sycl-post-link performs necessary aspect usage analysis and splits the incoming monolithic device code module into pieces - smaller device code modules - it outputs a file table as a result. Each row in the table corresponds to an individual output module, and each element of a row is a name of a file containing necessary information about the module, such as the code itself, and its properties.

At the action graph building stage for each requested AOT compilation target - SPIR-V-based (such as Intel Graphics targets) and/or non-SPIR-V-based (such as PTX) - the driver adds an aspect-filter action which filters out input file table rows with device code modules using features unsupported on current target. Then the output table goes as input into the AOT stage, and the prior filtering guarantees that the AOT compiler will not encounter device code it can’t compile. In the extreme case when all device code modules use unsupported aspects, the input file table will be empty. The picture below illustrates the action graph built by the clang driver along with file lists and tables generated and consumed by various nodes of the graph. The example set of targets used for the illustration is 4 targets

  • spir64 (runtime JITted SPIR-V)

  • AOT targets

    • non-SPIR-V based

      • ptx64 (PTX)

    • SPIR-V based

      • intel_gpu_12 (Intel Graphics)

      • x86_64_avx512 (AVX512)

Device SPIRV translation and AOT compilation

Aspect filter tool

This tool transforms an input file table by removing rows with device code files that use features unsupported for the target architecture given as tool’s argument.

Name:

  • sycl-aspect-filter, located next to other tools like file-table-tform

Input:

  • file table, normally coming out of sycl-post-link or file-table-tform tools

Command line arguments:

  • -target=<target> target device architecture to filter for

  • -device-config-file=<path> path to the device configuration file

Output:

  • the input file table filtered as needed

In more details, the tool performs the following actions:

  1. Checks if the input file table contains “Properties” column. If not, copies the input file table to output and exits without error.

  2. Reads in the device configuration file and finds some entry E corresponding to the architecture given on the command line. If there is no such entry - reports an error and exits.

  3. For each row in the input file table:

    • loads the properties file from the “Properties” column

    • checks if there is the SYCL/device-requirements property

    • if no, copies current row to the output file table and goes to the next

    • if yes, checks if all the requirements listed in the property are supported by the target architecture as specified by entry E in the device configuration file

      • if yes, copies current row to the output file table and goes to the next

      • otherwise skips this row

Configuration file location and driver option

A default device configuration file is present in DPC++ build. Users may override it using the -fsycl-device-config-file=<path> compiler command line option. Exact location of the file and final name of the compiler option is TBD.

AOT target identification

There are several user-visible places in the SDK where SYCL device target architectures need to be identified:

  • -fsycl-targets option

  • a device configuration file entry

  • -target option of the sycl-aspec-filter tool

  • a SYCL aspect enum identifier (we expect to add a new SYCL aspect for each device target architecture)

In all such places architecture naming should be the same. In some cases aliases are allowed. Below is a list of target architectures supported by DPC++:

target/alias(es)

description

intel_gpu

Generic Intel graphics architecture

intel_gpu_tgl, intel_gpu_12_0

Intel Tiger Lake (11th generation Core) integrated graphics architecture

ptx64

Generic 64-bit PTX target architecture

spir64

Generic 64-bit SPIR-V target

x86_64

Generic 64-bit x86 architecture

TODO: Provide full list of AOT targets supported by the identification mechanism.

Example of clang compilation invocation with 2 AOT targets and generic SPIR-V:

clang++ -fsycl -fsycl-targets=spir64,intel_gpu_12_0,ptx64 ...

Changes to the DPC++ runtime

The DPC++ runtime must be changed to check if a kernel uses any optional features that the device does not support. If this happens, the runtime must raise a synchronous errc::kernel_not_supported exception.

When the application submits a kernel to a device, the runtime identifies all the other device images that export device functions which are needed by the kernel as described in Device Code Dynamic Linking. Before the runtime actually links these images together, it compares each image’s “SYCL/device-requirements” against the features provided by the target device. If any of the following checks fail, the runtime throws errc::kernel_not_supported:

  • The “aspect” property contains an aspect that is not provided by the device, or

  • The “reqd_sub_group_size” property contains a sub-group size that the device does not support.

There is no way currently for the runtime to query the work-group sizes that a device supports, so the “reqd_work_group_size” property is not checked. We include this property in the set nonetheless for possible future use.

If the runtime throws an exception, it happens even before the runtime tries to access the contents of the device image.

Appendix: Adding an attribute to 8-byte atomic_ref

As described above under “Changes to DPC++ headers”, we need to decorate any SYCL type representing an optional device feature with the [[sycl_detail::uses_aspects()]] attribute. This is somewhat tricky for atomic_ref, though, because it is only an optional feature when specialized for a 8-byte type. However, we can accomplish this by using partial specialization techniques. The following code snippet demonstrates (best read from bottom to top):

namespace sycl {
namespace detail {

template<typename T>
class atomic_ref_impl_base {
 public:
  atomic_ref_impl_base(T x) : x(x) {}

  // All the member functions for atomic_ref go here

 private:
  T x;
};

// Template class which can be specialized based on the size of the underlying
// type.
template<typename T, size_t S>
class atomic_ref_impl : public atomic_ref_impl_base<T> {
 public:
  using atomic_ref_impl_base<T>::atomic_ref_impl_base;
};

// Explicit specialization for 8-byte types.  Only this specialization has the
// attribute.
template<typename T>
class [[__sycl_detail__::__uses_aspects__(aspect::atomic64)]]
    atomic_ref_impl<T, 8> : public atomic_ref_impl_base<T> {
 public:
  using atomic_ref_impl_base<T>::atomic_ref_impl_base;
};

} // namespace detail

// Publicly visible atomic_ref class.
template<typename T>
class atomic_ref : public detail::atomic_ref_impl<T, sizeof(T)> {
 public:
  atomic_ref(T x) : detail::atomic_ref_impl<T, sizeof(T)>(x) {}
};

} // namespace sycl