SYCL Runtime Compilation

SYCL-RTC means using the kernel_compiler extension to wrap a SYCL source string, consisting of kernel definitions in the free-function syntax, into a kernel_bundle in the ext_oneapi_source state, which is then compiled into the executable state by the extension’s build(...) function. The feature is backed by an implementation inside the sycl-jit library, which exposes the modular, LLVM-based compiler technology behind DPC++ to be called by the SYCL runtime. This document gives an overview of the design.

#include <sycl/sycl.hpp>
namespace syclexp = sycl::ext::oneapi::experimental;

// ...

std::string sycl_source = R"""(
  #include <sycl/sycl.hpp>
  
  extern "C" SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((
    sycl::ext::oneapi::experimental::nd_range_kernel<1>))
  void vec_add(float* in1, float* in2, float* out){
    size_t id = sycl::ext::oneapi::this_work_item::get_nd_item<1>()
                .get_global_linear_id();
    out[id] = in1[id] + in2[id];
  }
)""";

sycl::queue q;

auto source_bundle = syclexp::create_kernel_bundle_from_source(
  q.get_context(), syclexp::source_language::sycl, sycl_source);

// This design document explains what happens on the next line.
auto exec_bundle = syclexp::build(source_bundle);
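
For illustration, here is how such a runtime-compiled kernel is then obtained from the executable bundle and launched; this is a minimal sketch in which the USM allocations, the problem size and the work-group size of 32 are arbitrary example values:

constexpr size_t N = 1024;
float *in1 = sycl::malloc_shared<float>(N, q);
float *in2 = sycl::malloc_shared<float>(N, q);
float *out = sycl::malloc_shared<float>(N, q);
// ... initialize in1 and in2 ...

// Look up the runtime-compiled kernel by its (extern "C") name.
sycl::kernel vec_add = exec_bundle.ext_oneapi_get_kernel("vec_add");
q.submit([&](sycl::handler &cgh) {
  // Bind the arguments in the order of the kernel’s parameter list.
  cgh.set_args(in1, in2, out);
  cgh.parallel_for(sycl::nd_range<1>{{N}, {32}}, vec_add);
}).wait();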

File-and-process-based prototype

The first implementation of the build(...) function wrote the source string into a temporary file, invoked DPC++ on it with the -fsycl-dump-device-code flag to dump the device code in SPIR-V format to another file, and finally loaded that file back into the runtime, where it was then executed.

The rationale for an in-memory compilation pipeline

Invoking the DPC++ executable as outlined in the previous section worked reasonably well to implement the basic kernel_compiler extension, but we observed several shortcomings:

  • Functional completeness: Emitting a single SPIR-V file is sufficient for simple kernels, but more advanced device code may result in multiple device images consisting of SPIR-V binaries and accompanying metadata (runtime properties) that needs to be communicated to the runtime.

  • Robustness: Reading multiple dependent files from a temporary directory can be fragile.

  • Performance: Multiple processes are launched by the compiler driver, and file I/O operations have a non-negligible overhead. The -fsycl-dump-device-code approach also required a dummy main() to be added to the source string, and caused an unnecessary host compilation to be performed.

  • Security: Reading executable code from disk is a security concern, and users of an RTC-enabled application may be unaware that a compilation writing intermediate files is happening in the background.

These challenges ultimately motivated the design of the in-memory compilation pipeline based on the sycl-jit library, which has been the default approach in DPC++ and the oneAPI product distribution since the 2025.2 release. This new approach leverages modular compiler technology to produce a faster, more feature-rich, more robust and safer implementation of the kernel_compiler extension.

The individual steps in the pipeline (frontend, device library linking, sycl-post-link and target format translation) are now invoked programmatically via an API inside the same process, and intermediate results are passed along as objects in memory. The code can be found in the compileSYCL(...) function.
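
Sketched in code, the orchestration inside compileSYCL(...) looks roughly as follows. The jit_compiler functions are the real pipeline entry points discussed below; the types RTCResult, InMemoryFile and View as well as the translateToTargetFormat step are simplified placeholders:

// Simplified sketch of compileSYCL(...); types and signatures are illustrative.
RTCResult compileSYCL(InMemoryFile Source, View<InMemoryFile> IncludeFiles,
                      View<const char *> UserArgs) {
  // Frontend (or persistent-cache hit): source string -> llvm::Module.
  auto Module = compileDeviceCode(Source, IncludeFiles, UserArgs);
  // Link the required device libraries into the module, in memory.
  linkDeviceLibraries(*Module, UserArgs);
  // Split the module and compute device-image properties (sycl-post-link logic).
  auto Images = performPostLink(std::move(Module), UserArgs);
  // Translate each split module into SPIR-V, PTX or AMDGCN.
  return translateToTargetFormat(Images);
}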

Using the LibTooling API to compile the source string to an llvm::Module

LibTooling is a high-level API to write standalone tools based on Clang, such as linters, refactoring tools or static analysers. To use it, one defines a tool action to run on a set of files in a virtual filesystem overlay, which the frontend then processes according to a compilation command database.

For SYCL-RTC, the filesystem overlay is populated with files containing the source string and any virtual include files (defined via the include_files property). The compilation command is static and puts the frontend into -fsycl-device-only mode. Any user-given options (from the build_options property) are appended. Lastly, the implementation defines a custom tool action which runs the frontend only up to LLVM codegen, and then obtains ownership of the LLVM module.

This might be a slightly unusual way to use LibTooling, but we found that it works well for SYCL-RTC. The next sections explain the jit_compiler::compileDeviceCode(...) function in more detail.

Step 1: Determine the path of the compiler installation

To set up a working frontend invocation, we need to know where to find supplemental files such as the SYCL headers. Normally, these paths are determined relative to the compiler executable; in our case, however, the executable is the RTC-enabled application, which can reside in an arbitrary location. Instead, we use OS-specific logic inside getDPCPPRoot() to determine the location of the shared library sycl-jit.so (or .dll on Windows) which contains the SYCL-RTC implementation. From its location, we can derive the compiler installation’s root directory.
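
A minimal sketch of this technique on Linux follows; the actual getDPCPPRoot() implementation differs in its details and additionally handles Windows. dladdr (a glibc extension, so _GNU_SOURCE must be defined and libdl linked) reports the path of the shared object containing a given symbol:

#include <dlfcn.h>
#include <llvm/ADT/SmallString.h>
#include <llvm/Support/Path.h>
#include <string>

// Sketch: derive the installation root from the on-disk location of a
// symbol residing in sycl-jit.so.
static std::string getInstallationRootSketch() {
  Dl_info Info;
  if (dladdr(reinterpret_cast<void *>(&getInstallationRootSketch), &Info) &&
      Info.dli_fname) {
    llvm::SmallString<256> Path{Info.dli_fname}; // e.g. <root>/lib/sycl-jit.so
    llvm::sys::path::remove_filename(Path);      // <root>/lib
    llvm::sys::path::append(Path, "..");         // <root>/lib/.. == <root>
    return std::string{Path};
  }
  return {};
}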

Step 2: Collect command-line arguments

The next step is to collect the command-line arguments for the frontend invocation. The adjustArgs(...) function relies on Clang’s option handling infrastructure to set the required options to enter the device compilation mode (-fsycl-device-only), set up the compiler environment, and select the target. Finally, any user-specified arguments passed via the build_options property are appended to the list of command-line arguments.
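
In effect, the resulting argument list resembles the following sketch, in which UserArgs stands for the contents of the build_options property; the real adjustArgs(...) builds the list via Clang’s option table rather than raw strings:

// Illustrative only; adjustArgs(...) uses clang::driver::options instead.
std::vector<std::string> Args;
Args.push_back("-fsycl-device-only");  // enter device compilation mode
// ... options setting up the compiler environment and selecting the target ...
for (const char *UserArg : UserArgs)   // build_options come last
  Args.emplace_back(UserArg);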

Step 3: Configure the ClangTool

Once we know the required command-line arguments, we can set up the compilation command database and an instance of the ClangTool class, which provides the entry point to the LibTooling interface. As we’ll be translating only a single file containing the source string, we construct a FixedCompilationDatabase relative to the current working directory.
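
The corresponding LibTooling setup is small. In this sketch, Args holds the arguments assembled in step 2 and SourceFileName is the (virtual) name of the file containing the source string:

#include <clang/Tooling/CompilationDatabase.h>
#include <clang/Tooling/Tooling.h>

// A single, fixed compilation command relative to the current directory.
clang::tooling::FixedCompilationDatabase Compilations{".", Args};
clang::tooling::ClangTool Tool{Compilations, {SourceFileName}};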

To implement the kernel_compiler extension cleanly, we need to capture all output (e.g. warnings and errors) from the frontend. The ClangDiagnosticsWrapper class configures a TextDiagnosticPrinter to append all messages produced during the runtime compilation to a string maintained by our implementation.
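
A condensed version of the idea; the exact TextDiagnosticPrinter and DiagnosticOptions plumbing varies slightly between Clang versions:

#include <clang/Basic/DiagnosticOptions.h>
#include <clang/Frontend/TextDiagnosticPrinter.h>
#include <llvm/Support/raw_ostream.h>

// Route all frontend diagnostics into a string that build(...) can report.
std::string BuildLog;
llvm::raw_string_ostream LogStream{BuildLog};
auto DiagOpts = llvm::makeIntrusiveRefCnt<clang::DiagnosticOptions>();
clang::TextDiagnosticPrinter DiagPrinter{LogStream, DiagOpts.get()};
Tool.setDiagnosticConsumer(&DiagPrinter);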

The configuration of the ClangTool instance continues in the setupTool function. First, we redirect all output to our diagnostics wrapper. Then, we set up the overlay filesystem with a file named rtc_<n>.cpp (n is incremented for each use of the kernel_compiler extension’s build(...) function) in the current directory with the contents of the source string. Each of the virtual header files that the application defined via the include_files property also becomes a file in the overlay filesystem, using the path specified in the property.
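
Mapping the virtual files uses ClangTool’s built-in overlay support; IncludeFile.Path and IncludeFile.Contents below stand in for however the property’s data is represented:

// The source string becomes rtc_<n>.cpp in the overlay filesystem ...
Tool.mapVirtualFile(SourceFileName, SourceContents);
// ... and each include_files entry becomes a virtual header at its path.
for (const auto &IncludeFile : IncludeFiles)
  Tool.mapVirtualFile(IncludeFile.Path, IncludeFile.Contents);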

The ClangTool class exposes so-called argument adjusters, which are intended to modify the command-line arguments coming from the compilation command database. We have to clear the default adjusters defined by the class, because one of them injects the -fsyntax-only flag, which would conflict with the -fsycl-device-only flag we need for SYCL-RTC. Finally, we add an argument adjuster of our own to overwrite the name of the executable in the invocation. Again, this helps the correct detection of the environment by making the invocation as similar as possible to a normal use of DPC++.
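
In LibTooling terms, this boils down to the following sketch, with DPCPPRoot obtained in step 1:

// Drop the default adjusters; one of them would inject -fsyntax-only.
Tool.clearArgumentsAdjusters();
// Make argv[0] look like a regular DPC++ invocation.
Tool.appendArgumentsAdjuster(
    [&](const clang::tooling::CommandLineArguments &Args,
        llvm::StringRef /*Filename*/) {
      clang::tooling::CommandLineArguments AdjustedArgs = Args;
      AdjustedArgs[0] = DPCPPRoot + "/bin/clang++";
      return AdjustedArgs;
    });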

Step 4: Run an action

The last step is to define a ToolAction to be executed on the source files. Clang conveniently provides the EmitLLVMAction, which runs the frontend up to the LLVM IR code generation; this is exactly what we need. However, LibTooling does not provide a helper to wrap it in a ToolAction, so we need to define and run our own GetLLVMModuleAction.

We extracted common boilerplate code to configure a CompilerInstance into the RTCActionBase class. Inside the GetLLVMModuleAction, we instantiate and execute the aforementioned EmitLLVMAction and, in case the translation was successful, obtain ownership of the constructed llvm::Module from it.

Finally, the call to Action.takeModule() transfers ownership again to the caller of compileDeviceCode. Note that this simple mechanism works because we know that there is only a single compilation happening for every instance of the ClangTool and hence our GetLLVMModuleAction class.
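
Put together, a stripped-down version of the action could look as follows. This is a sketch: the CompilerInstance boilerplate is condensed, and its exact API (e.g. createDiagnostics) differs between Clang versions:

#include <clang/CodeGen/CodeGenAction.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Tooling/Tooling.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>

// Sketch of GetLLVMModuleAction: run EmitLLVMAction, keep the llvm::Module.
class GetLLVMModuleActionSketch : public clang::tooling::ToolAction {
  llvm::LLVMContext Context;
  std::unique_ptr<llvm::Module> Module;

public:
  bool runInvocation(std::shared_ptr<clang::CompilerInvocation> Invocation,
                     clang::FileManager *Files,
                     std::shared_ptr<clang::PCHContainerOperations> PCHOps,
                     clang::DiagnosticConsumer *DiagConsumer) override {
    // Boilerplate as in RTCActionBase: wire up invocation, files, diagnostics.
    clang::CompilerInstance Compiler{std::move(PCHOps)};
    Compiler.setInvocation(std::move(Invocation));
    Compiler.setFileManager(Files);
    Compiler.createDiagnostics(DiagConsumer, /*ShouldOwnClient=*/false);

    clang::EmitLLVMAction Action{&Context}; // frontend up to LLVM IR codegen
    if (!Compiler.ExecuteAction(Action))
      return false;
    Module = Action.takeModule(); // take ownership of the result
    return true;
  }

  std::unique_ptr<llvm::Module> takeModule() { return std::move(Module); }
};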

Caching

The implementation optionally uses the runtime’s persistent cache to elide recurring invocations of the frontend, which we observed to be the most expensive (in terms of runtime overhead) phase of our compilation pipeline.

Overall design

We cache only the frontend invocation, meaning that after a successful translation, we store the LLVM IR module obtained via LibTooling on disk in the Bitcode format using built-in utilities. In case of a cache hit in a later runtime compilation, we load the module from disk and feed it into the device linking phase. The rationale for this design was that there were no utilities to save and restore the linked and post-processed device images to disk at the time (the SYCLBIN infrastructure was added later), and caching these steps would have resulted only in marginal further runtime savings.
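
The save and restore steps map directly onto LLVM’s bitcode utilities, as in the following sketch, where Module is the llvm::Module obtained via LibTooling and the persistent-cache I/O itself is left to the SYCL runtime:

#include <llvm/Bitcode/BitcodeReader.h>
#include <llvm/Bitcode/BitcodeWriter.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/Support/MemoryBuffer.h>
#include <llvm/Support/raw_ostream.h>

// Serialize the frontend result for the persistent cache ...
std::string Blob;
llvm::raw_string_ostream BlobStream{Blob};
llvm::WriteBitcodeToFile(*Module, BlobStream);
BlobStream.flush();

// ... and reconstitute it on a cache hit in a later runtime compilation.
llvm::LLVMContext Context;
auto Restored = llvm::parseBitcodeFile(
    llvm::MemoryBufferRef{Blob, "cached-rtc-module"}, Context);
if (!Restored)
  llvm::consumeError(Restored.takeError()); // fall back to a full compilation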

Cache key considerations

The main challenge is to define a robust cache key. Because code compiled via SYCL-RTC can #include header files defined via the include_files property as well as from the filesystem, e.g. sycl.hpp from the DPC++ installation or user libraries, it is not sufficient to look only at the source string. In order to make the cache as conservative as possible (cache collisions are unlikely but mathematically possible), we decided to compute a hash value of the preprocessed source string, i.e. with all #include directives resolved. We additionally compute a hash value of the rendered command-line arguments, and append it to the hash of the preprocessed source to obtain the final cache key.

Implementation notes

The cache key computation is implemented in the jit_compiler::calculateHash(...) function. We are again relying on LibTooling to invoke the preprocessor; handily, Clang provides a PreprocessorFrontendAction that we extend and tailor to our use case. We chose BLAKE3 as the hash algorithm because it is proven in similar contexts (most notably, ccache) and available as a utility in the LLVM ecosystem. As the output is a byte array, we apply Base64 encoding to obtain a character string for use with the persistent cache.
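
The essence of the computation maps onto LLVM utilities as in this sketch, in which PreprocessedSource and RenderedCommandLine are placeholders for the two hashed inputs:

#include <llvm/Support/BLAKE3.h>
#include <llvm/Support/Base64.h>
#include <string>

// Hash an input string with BLAKE3 and render the digest as Base64.
static std::string hashString(llvm::StringRef Input) {
  llvm::BLAKE3 Hasher;
  Hasher.update(Input);
  auto Digest = Hasher.final(); // 32-byte digest
  return llvm::encodeBase64(Digest);
}

// Final key: hash of the preprocessed source, then hash of the arguments.
std::string CacheKey =
    hashString(PreprocessedSource) + hashString(RenderedCommandLine);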

Device library linking and SYCL-specific transformations

With an LLVM IR module in hand, obtained either from the frontend or the cache, the next steps in the compilation pipeline are simple.

The device library linking is done by the jit_compiler::linkDeviceLibraries(...) function. These libraries provide primitives for a variety of extra functionality, such as an extended set of math functions and support for bfloat16 arithmetic, and are available as Bitcode files inside the DPC++ installation or the vendor toolchain, so we just use LLVM utilities to load them into memory and link them to the module representing the runtime-compiled kernels. The main challenge here is that the logic to select the device libraries is currently not reusable from its implementation in the driver, so our implementation is a simplified copy of the SYCL::getDeviceLibraries(...) method, which needs to be kept in sync with the driver code.
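
Loading and linking a single device library then boils down to standard LLVM calls; in this sketch, the selection logic is elided and the LinkOnlyNeeded flag is illustrative:

#include <llvm/Bitcode/BitcodeReader.h>
#include <llvm/Linker/Linker.h>
#include <llvm/Support/MemoryBuffer.h>

// Link one bitcode device library into the module holding the RTC kernels.
static bool linkDeviceLibrarySketch(llvm::Module &KernelsModule,
                                    llvm::StringRef LibPath) {
  auto Buffer = llvm::MemoryBuffer::getFile(LibPath);
  if (!Buffer)
    return false;
  auto LibModule = llvm::parseBitcodeFile((*Buffer)->getMemBufferRef(),
                                          KernelsModule.getContext());
  if (!LibModule) {
    llvm::consumeError(LibModule.takeError());
    return false;
  }
  // linkModules returns true on error; LinkOnlyNeeded imports only used symbols.
  return !llvm::Linker::linkModules(KernelsModule, std::move(*LibModule),
                                    llvm::Linker::Flags::LinkOnlyNeeded);
}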

For the SYCL-specific post-processing, implemented in jit_compiler::performPostLink(...), we can reuse modular analysis and transformation passes in the SYCLLowerIR component. The main tasks for the post-processing passes are to split the device code module into smaller units (either as requested by the user, or as required by the ESIMD mode), and to compute the properties that need to be passed to the SYCL runtime when the device images are loaded. The logic to orchestrate the SYCLLowerIR passes is adapted from the sycl-post-link tool’s processInputModule(...) function. This duplicated code should be removed as well once a suitable reusable implementation becomes available.
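
The orchestration follows the standard new-pass-manager pattern, shown generically below; the concrete SYCLLowerIR passes to add are DPC++ internals adapted from processInputModule(...):

#include <llvm/IR/PassManager.h>
#include <llvm/Passes/PassBuilder.h>

// Standard new-pass-manager boilerplate around the SYCLLowerIR passes.
llvm::LoopAnalysisManager LAM;
llvm::FunctionAnalysisManager FAM;
llvm::CGSCCAnalysisManager CGAM;
llvm::ModuleAnalysisManager MAM;
llvm::PassBuilder PB;
PB.registerModuleAnalyses(MAM);
PB.registerCGSCCAnalyses(CGAM);
PB.registerFunctionAnalyses(FAM);
PB.registerLoopAnalyses(LAM);
PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

llvm::ModulePassManager MPM;
// MPM.addPass(...); // splitting and property-computation passes go here
MPM.run(*Module, MAM);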

Translation to the target format

The final phase in the pipeline is to translate the LLVM IR modules resulting from the previous phase into a device-specific target format that can be handled by the runtime. For Intel CPUs and GPUs, that’s binary SPIR-V. For AMD and NVIDIA GPUs, we emit AMDGCN and PTX assembly, respectively. Over time, we created our own set of utilities to facilitate the translation. Internally, we dispatch the task to either the SPIR-V translator (a copy of which is maintained inside the DPC++ repository), or use vendor-specific backends that are part of LLVM to generate the third-party GPU code.
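
For the LLVM-backend path, this is ordinary in-process code generation. A sketch for the NVPTX case follows; LLVM API details such as the createTargetMachine signature vary between versions, and error handling is elided:

#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/LegacyPassManager.h>
#include <llvm/IR/Module.h>
#include <llvm/MC/TargetRegistry.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/Support/raw_ostream.h>
#include <llvm/Target/TargetMachine.h>

// Emit PTX assembly for an NVIDIA GPU with the in-process NVPTX backend.
static std::string emitPTXSketch(llvm::Module &Mod, llvm::StringRef CPU) {
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  std::string Error;
  const char *TripleStr = "nvptx64-nvidia-cuda";
  const llvm::Target *Target =
      llvm::TargetRegistry::lookupTarget(TripleStr, Error);
  std::unique_ptr<llvm::TargetMachine> TM{Target->createTargetMachine(
      TripleStr, CPU, /*Features=*/"", llvm::TargetOptions{}, std::nullopt)};

  llvm::SmallVector<char> PTX;
  llvm::raw_svector_ostream PTXStream{PTX};
  llvm::legacy::PassManager CodeGenPasses;
  TM->addPassesToEmitFile(CodeGenPasses, PTXStream, nullptr,
                          llvm::CodeGenFileType::AssemblyFile);
  CodeGenPasses.run(Mod);
  return std::string{PTX.begin(), PTX.end()};
}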

Third-party hardware support

SYCL-RTC works for AMD and NVIDIA GPUs, too. The usage of the kernel_compiler extension remains the same for SYCL devices representing such a third-party GPU. The concrete GPU architecture is selected via the environment variable SYCL_JIT_AMDGCN_PTX_TARGET_CPU when executing the RTC-enabled application. For AMD GPUs, setting it is mandatory. For NVIDIA GPUs, it is highly recommended to change it from the conservative default architecture (sm_50).

$ clang++ -fsycl myapp.cpp -o myapp
$ SYCL_JIT_AMDGCN_PTX_TARGET_CPU=sm_90 ./myapp
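
For an AMD GPU, the same variable selects the architecture; gfx90a below is merely an example value:

$ SYCL_JIT_AMDGCN_PTX_TARGET_CPU=gfx90a ./myapp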

A list of values that can be set as the target CPU can be found in the documentation of the -fsycl-targets= option (leave out the amd_gpu_ and nvidia_gpu_ prefixes).

Further reading