SYCL Runtime Compilation¶
SYCL-RTC means using the
kernel_compiler
extension to wrap a SYCL source string comprised of kernel definitions in the
free-function syntax
into a kernel_bundle
in the ext_oneapi_source
state, which is then compiled
into exectuable
state by the extension’s build(...)
function. The feature is
backed by an implementation inside the sycl-jit
library, which exposes the
modular, LLVM-based compiler tech behind DPC++ to be called by the SYCL runtime.
This document gives an overview of the design.
#include <sycl/sycl.hpp>
namespace syclexp = sycl::ext::oneapi::experimental;
// ...
std::string sycl_source = R"""(
#include <sycl/sycl.hpp>
extern "C" SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((
sycl::ext::oneapi::experimental::nd_range_kernel<1>))
void vec_add(float* in1, float* in2, float* out){
size_t id = sycl::ext::oneapi::this_work_item::get_nd_item<1>()
.get_global_linear_id();
out[id] = in1[id] + in2[id];
}
)""";
sycl::queue q;
auto source_bundle = syclexp::create_kernel_bundle_from_source(
q.get_context(), syclexp::source_language::sycl, sycl_source);
// This design document explains what happens on the next line.
auto exec_bundle = syclexp::build(source_bundle);
File-and-process-based prototype¶
The
first implementation
of the build(...)
function wrote the source string into a temporary file,
invoked DPC++ on it with the -fsycl-dump-device-code
flag to dump the device
code to another file in SPIR-V format, and finally loaded that file back into
the runtime, from where it was executed.
The rationale for an in-memory compilation pipeline¶
Invoking the DPC++ executable as outlined in the previous section worked
reasonably well to implement the basic kernel_compiler
extension, but we
observed several shortcomings:
Functional completeness: Emitting a single SPIR-V file is sufficient for simple kernels, but more advanced device code may result in multiple device images comprised of SPIR-V binaries and accompanying metadata (runtime properties) that needs to be communicated to the runtime.
Robustness: Reading multiple dependent files from a temporary directory can be be fragile.
Performance: Multiple processes are launched by the compiler driver, and file I/O operations have a non-negligible overhead. The
-fsycl-dump-device-code
required the presence of a dummymain()
to be added to the source string, and caused an unnecessary host compilation to be performed.Security: Reading executable code from disk is a security concern, and users of an RTC-enabled application may be unaware that a compilation writing intermediate files is happening in the background.
These challenges ultimately motivated the design of the in-memory compilation
pipeline based on the sycl-jit
library which is now the default approach in
DPC++ and the oneAPI product distribution since the 2025.2 release. This new
approach leverages modular compiler technology to produce a faster, more
feature-rich, more robust and safer implementation of the kernel_compiler
extension.
The individual steps in the pipeline (frontend, device library linking,
sycl-post-link
and target format translation) are now invoked programmatically
via an API inside the same process, and intermediate results are passed along as
objects in memory. The code can be found in the
compileSYCL(...)
function.
Using the LibTooling API to compile the source string to an llvm::Module
¶
LibTooling is a high-level API to write standalone tools based on Clang, such as linters, refactoring tools or static analysers. To use it, one defines a tool action to run on a set of files in a virtual filesystem overlay, which the frontend then processes according to a compilation command database.
For SYCL-RTC, the filesystem overlay is populated with files containing the
source string and any virtual include_files
(defined via the homonymous
property). The compilation command is static and puts the frontend into
-fsycl-device-only
mode. Any user-given options (from the build_options
property) are appended. Lastly, the implementation defines a custom tool action
which runs the frontend until LLVM codegen, and then obtains ownership of the
LLVM module.
This might be a slightly unusual way to use of LibTooling, but we found it works
great for SYCL-RTC. The next sections explain the
jit_compiler::compileDeviceCode(...)
function in more detail.
Step 1: Determine the path of the compiler installation¶
To set up up working frontend invocation, we need to know where to find
supplemental files such as the SYCL headers. Normally, these paths are
determined relative to the compiler executable, however in our case, the
executable is actually the RTC-enabled application, which can reside in an
arbitrary location. Instead, we use OS-specific logic inside
getDPCPPRoot()
to determine the location of the shared library sycl-jit.so
(or .dll
on
Windows) which contains the SYCL-RTC implementation. From its location, we can
derive the compiler installation’s root directory.
Step 2: Collect command-line arguments¶
The next step is to collect the command-line arguments for the frontend
invocation. The
adjustArgs(...)
function relies on Clang’s option handling infrastructure to set the required
options to enter the device compilation mode (-fsycl-device-only
), set up the
compiler environment, and select the target. Finally, any user-specified
arguments passed via the
build_options
property are appended to the list of command-line arguments.
Step 3: Configure the ClangTool
¶
Once we know the required command-line arguments, we can set up the compilation
command database and an
instance
of the
ClangTool
class, which provides the entry point to the LibTooling interface. As we’ll be
translating only a single file containing the source string, we construct a
FixedCompilationDatabase
relative to the current working directory.
To implement the kernel_compiler
extension cleanly, we need to capture all
output (e.g. warnings and errors) from the frontend. The
ClangDiagnosticsWrapper
class configures a
TextDiagnosticsPrinter
to append all messages to a string maintained by our implementation to collect
all output produced during the runtime compilation.
The configuration of the ClangTool
instance continues in the
setupTool
function. First, we redirect all output to our diagnostics wrapper. Then, we
set up the overlay
filesystem
with a file named rtc_<n>.cpp
(n is incremented for each use of the
kernel_compiler
extension’s build(...)
function) in the current directory
with the contents of the source string. Each of the virtual header files that
the application defined via the
include_files
property becomes also a file in the overlay filesystem, using the path specified
in the property.
The ClangTool
class exposes so-called argument adjusters, which are intended
to modify the command-line arguments coming from the compilation command
database. We have to
clear
the default adjusters defined by the class, because one of them injects the
-fsyntax-only
flag, which would conflict with the -fsycl-device-only
flag we
need for SYCL-RTC. Finally, we
add
an argument adjuster ourselves to overwrite the name of executable in the
invocation. Again, this is to help the correct detection of the environment, by
making the invocation as similar as possible to a normal use of DPC++.
Step 4: Run an action¶
The last step is to define a
ToolAction
to be executed on the source files. Clang conveniently provides the
EmitLLVMAction
,
which runs the frontend up until the LLVM IR code generation, which is exactly
what we need. However, LibTooling does not provides a helper to wrap it in a
ToolAction
, so we need to define and run our own
GetLLVMModuleAction
.
We extracted common boilerplate code to configure a
CompilerInstance
in the
RTCActionBase
class. Inside the GetLLVMModuleAction
, we instantiate and execute the
aforementioned EmitLLVMAction
, and, in case the translation was successful,
obtains ownership
of the constructed llvm::Module
from it.
Finally, the call to
Action.takeModule()
transfers ownership again to the caller of compileDeviceCode
. Note that this
simple mechanism works because we know that there is only a single compilation
happening for every instance of the ClangTool
and hence our
GetLLVMModuleAction
class.
Caching¶
The implementation optionally uses the runtime’s persistent cache to elide recurring invocations of the frontend, which we observed to be the most expensive (in terms of runtime overhead) phase of our compilation pipeline.
Overall design¶
We cache only the frontend invocation, meaning that after a successful translation, we store the LLVM IR module obtained via LibTooling on disk in the Bitcode format using built-in utilities. In case of a cache hit in a later runtime compilation, we load the module from disk and feed it into the device linking phase. The rationale for this design was that were no utilities to save and restore the linked and post-processed device images to disk at the time (the SYCLBIN infrastructure was added later), and caching these steps would have resulted only in marginal further runtime savings.
Cache key considerations¶
The main challenge is to define a robust cache key. Because code compiled via
SYCL-RTC can #include
header files defined via the include_files
property as
well as from the filesystem, e.g. sycl.hpp
from the DPC++ installation or user
libraries, it is not sufficient to look only at the source string. In order to
make the cache as conservative as possible (cache collisions are unlikely but
mathematically possible), we decided to compute a hash value of the
preprocessed source string, i.e. with all #include
directives resolved. We
additionally compute a hash value of the rendered command-line arguments, and
append it to the hash of the preprocessed source to obtain the final cache key.
Implementation notes¶
The cache key computation is implemented in the
jit_compiler::calculateHash(...)
function. We are again relying on LibTooling to invoke the preprocessor -
handily, Clang provides a
PreprocessorFrontendAction
that we extend to tailor to our use-case. We choose
BLAKE3 as the hash
algorithm because its proven in similar contexts (most notably,
ccache) and available as a utility in the LLVM ecosystem.
As the output is a byte array, we apply Base64 encoding to obtain a character
string for use with the persistent cache.
Device library linking and SYCL-specific transformations¶
With an LLVM IR module in hand, obtained either from the frontend or the cache, the next steps in the compilation pipeline are simple.
The device library linking is done by the
jit_compiler::linkDeviceLibraries(...)
function. These libraries provide primitives for a variety of extra
functionality, such as an extended set of math functions and support for
bfloat16
arithmetic, and are available as Bitcode files inside the DPC++
installation or the vendor toolchain, so we just use LLVM utilities to load them
into memory and link them to the module representing the runtime-compiled
kernels. The main challenge here is that the logic to select the device
libraries is currently not reusable from its implementation in the driver, so
our implementation is a simplified copy of the
SYCL::getDeviceLibraries(...)
method, which needs to be kept in sync with the driver code.
For the SYCL-specific post-processing, implemented in
jit_compiler::performPostLink(...)
,
we can reuse modular analysis and transformation passes in the
SYCLLowerIR
component. The main tasks for the post-processing passes is to split the device
code module into smaller units (either as requested by the user, or required by
the ESIMD mode), and to compute the properties that need to be passed to the
SYCL runtime when the device images are loaded. The logic to orchestrate the
SYCLLowerIR
passes is adapted from the sycl-post-link
tool’s
processInputModule(...)
function. This duplicated code should be removed as well once a suitable
reusable implementation becomes available.
Translation to the target format¶
The final phase in the pipeline is to translate the LLVM IR modules resulting from the previous phase into a device-specific target format that can be handled by the runtime. For Intel CPUs and GPUs, that’s binary SPIR-V. For AMD and NVIDIA GPUs, we emit AMDGCN and PTX assembly, respectively. Over time, we created our own set of utilities to facilitate the translation. Internally, we dispatch the task to either the SPIR-V translator (a copy of which is maintained inside the DPC++ repository), or use vendor-specific backends that are part of LLVM to generate the third-party GPU code.
Third-party hardware support¶
SYCL-RTC works for AMD and NVIDIA GPUs, too. The usage of the kernel_compiler
extension remains the same for SYCL devices representing such a third-party GPU.
The concrete GPU architecture is queried via the environment variable
SYCL_JIT_AMDGCN_PTX_TARGET_CPU
when executing the RTC-enabled application. For
AMD GPUs, it is mandatory to set it. For NVIDIA GPUs, it is highly
recommended to change it from the conservative default architecture (sm_50
).
$ clang++ -fsycl myapp.cpp -o myapp
$ SYCL_JIT_AMDGCN_PTX_TARGET_CPU=sm_90 ./myapp
A list of values that can be set as the target CPU can be found in the
documentation of the -fsycl-targets=
option (leave
out the amd_gpu_
and nvidia_gpu_
prefixes).
Further reading¶
Technical presentation at IWOCL 2025: Fast In-Memory Runtime Compilation of SYCL Code: Slides Video Recording
Blog post: SYCL Runtime Compilation: A New Way to Specialise Kernels Using C++ Metaprogramming