A brief overview of kernel and program caching mechanism

Rationale behind caching

During SYCL program execution SYCL runtime will create internal objects representing kernels and programs, it may also invoke JIT compiler to bring kernels in a program to executable state. Those runtime operations are quite expensive, and in some cases caching approach can be employed to eliminate redundant kernel or program object re-creation and online recompilation. Few examples below illustrate scenarios where such optimization is possible.

Use-case #1. Submission of the same kernel in a loop:

  using namespace sycl;

  queue Q;
  std::vector<buffer> Bufs;

  ...
  // initialize Bufs with some number of buffers
  ...

  for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
    Q.submit([&](handler &CGH) {
      auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

      CGH.parallel_for<class TheKernel>(
          range<2>{N, M}, [=](item<2> Item) { ... });
    });
  }

Use-case #2. Submission of multiple kernels within a single program1:

  using namespace sycl;

  queue Q;

  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_1>(
        range<2>{N_1, M_1}, [=](item<2> Item) { ... });
  });
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_2>(
        range<2>{N_2, M_2}, [=](item<2> Item) { ... });
  });
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_3>(
        range<2>{N_3, M_3}, [=](item<2> Item) { ... });
  });
  ...
  Q.submit([&](handler &CGH) {
    ...

    CGH.parallel_for<class TheKernel_K>(
        range<2>{N_K, M_K}, [=](item<2> Item) { ... });
  });

In both cases SYCL runtime will need to build the program and kernels multiple times, which may involve JIT compilation and take quite a lot of time.

In order to eliminate this waste of run-time we introduce a kernel and program caching. The cache is per-context and it caches underlying objects of non interop kernels and programs.

Use-case #3. Rebuild of all programs on SYCL application restart: JIT compilation for cases when an application contains huge amount of device code (big kernels or multiple kernels) may take significant time. The kernels and programs are rebuilt on every program restart. AOT compilation can be used to avoid that but it ties application to specific backend runtime version(s) and predefined HW configuration(s). As a general solution it is reasonable to have program persistent cache which works between application restarts (e.g. cache on disk for device code built for specific HW/SW configuration).

1: Here “program” means an internal SYCL runtime object corresponding to a device code module or native binary defining a set of SYCL kernels and/or device functions.

Data structure of cache

The cache is split into two levels:

  • in-memory cache which is used during application runtime for device code which has been already loaded and built for target device.

  • persistent (on-disk) cache which is used to store device binaries between application executions.

In-memory cache

The cache stores underlying PI objects behind sycl::program and sycl::kernel user-level objects in a per-context data storage. The storage consists of two maps: one is for programs and the other is for kernels.

The programs map’s key consists of four components:

  • kernel set id1,

  • specialization constants values,

  • the device this program is built for,

  • build options id 2.

The kernels map’s key consists of two components:

  • the program the kernel belongs to,

  • kernel name3.

1: Kernel set id is an ordinal number of the device binary image the kernel is contained in.

2: The concatenation of build options (both compile and link options) set in application or environment variables. There are three sources of build options that the cache is aware of:

  • from device image (pi_device_binary_struct::CompileOptions, pi_device_binary_struct::LinkOptions);

  • environment variables (SYCL_PROGRAM_COMPILE_OPTIONS, SYCL_PROGRAM_LINK_OPTIONS);

  • options passed through SYCL API.

Note: Backend runtimes used by SYCL can have extra environment or configurations values (e.g. IGC has igc_flags.def which affect JIT process). Changing such configuration will invalidate cache and manual cache cleanup should be done.

3: Kernel name is a kernel ID mangled class’ name which is provided to methods of sycl::handler (e.g. parallel_for or single_task).

Persistent cache

The cache works behind in-memory cache and stores the same underlying PI object behind sycl::program user-level objects in a per-context data storage. The storage is organized as a map for storing device code image. It uses different keys to address difference in SYCL objects ids between applications runs as well as the fact that the same kernel name can be used in different SYCL applications.

The programs map’s key consists of four components:

  • device image id1,

  • specialization constants values,

  • device id2 this program is built for,

  • build options id3.

Hashes are used for fast built device image identification and shorten filenames on disk. Once cache directory on disc is identified (see Persistent cache storage structure for detailed directory structure) full key values are compared with the ones stored on disk (in every .src file located in the cache item directory):

  • if they match the built image is loaded from corresponding .bin file;

  • otherwise image build is done and new cache item is created on disk containing 2 files: <max_n+1>.src for key values and <max_n+1>.bin for built image.

1: Hash out of the device code image used as input for the build.

2: Hash out of the string which is concatenation of values for info::platform::name, info::device::name, info::device::version, info::device::driver_version parameters to differentiate different HW and SW installed on the same host as well as SW/HW upgrades.

3: Hash for the concatenation of build options (both compile and link options) set in application or environment variables. There are three sources of build options:

  • from device image (pi_device_binary_struct::CompileOptions, pi_device_binary_struct::LinkOptions);

  • environment variables (SYCL_PROGRAM_COMPILE_OPTIONS, SYCL_PROGRAM_LINK_OPTIONS);

  • options passed through SYCL API.

Cache configuration

The environment variables which affect cache behavior are described in EnvironmentVariables.md.

Implementation details

The caches are represented with instance of KernelProgramCache class. The runtime creates one instance of the class per distinct SYCL context (A context object which is a result of copying another context object isn’t “distinct”, as it corresponds to the same underlying internal object representing a context).

The KernelProgramCache is essentially a pair of maps as described above.

When does the cache come at work?

The cache is used when one submits a kernel for execution or builds program with SYCL API. That means that the cache works when either user explicitly calls program::build_with_kernel_type<>()/program::get_kernel<>() methods or SYCL RT builds a program or gets the required kernel as needed during application execution. Cacheability of an object can be tested with program_impl::is_cacheable() method. SYCL RT will only try to insert cacheable programs or kernels into the cache. This is done as a part of ProgramManager::getOrCreateKernel() method.

NOTE: a kernel is only cacheable if and only if the program it belongs to is cacheable. On the other hand if the program is cacheable, then each and every kernel of this program will be cached also.

All requests to build a program or to create a kernel - whether they originate from explicit user API calls or from internal SYCL runtime execution logic - end up with calling the function getOrBuild() with number of lambda functions passed as arguments:

  • Acquire function;

  • GetCache function;

  • Build function.

Acquire function returns a locked version of cache. Locking is employed for thread safety. The threads are blocked only for insert-or-acquire attempt, i.e. when calling to map::insert in getOrBuild function. The rest of operation is done with the help of atomics and condition variables (plus a mutex for proper work of condition variable).

GetCache function returns a reference to mapping key->value out of locked instance of cache. We will see rationale behind it a bit later.

Build function actually builds the kernel or program.

Thread-safety

Why do we need thread safety here? It is quite possible to have a use-case when the sycl::context is shared across multiple threads (e.g. via sharing a queue). Possibility of enqueueing multiple cacheable kernels simultaneously from multiple threads requires us to provide thread-safety for the caching mechanisms.

It is worth of noting that we don’t cache the PI resource (kernel or program) by itself. Instead we augment the resource with the status of build process. Hence, what is cached is a wrapper structure BuildResult which contains three information fields - pointer to built resource, build error (if applicable) and current build status (either of “in progress”, “succeeded”, “failed”).

One can find definition of BuildResult template in KernelProgramCache.

The built resource access synchronization approach aims at minimizing the time any thread holds the global lock guarding the maps to improve performance. To achieve that, the global lock is acquired only for the duration of the global map access. Actual build of the program happens outside of the lock, so other threads can request or build other programs in the meantime. A thread requesting a BuildResult instance via getOrBuild can go one of three ways: A) Build result is not available, it is the first thread to request it. Current thread will then execute the build letting others wait for the result using the per-build result condition variable kept in BuildResult’s MBuildCV field. B) Build result is not available, another thread is already building the result. Current thread will then wait for the result using the MBuildCV condition variable. C) Build result is available. The thread simply takes it from the Ptr field w/o using any mutexes or condition variables.

As noted before, access to BuildResult instance fields may occur from different threads simultaneously, but the global lock is no longer held. So, to make it safe and to make sure only one thread builds the requested program, the following is done:

  • program build state is reflected in the State field, threads use compare-and-swap technique to compete who will do the build and become thread A. Threads C will find ‘DONE’ in this field and immediately return the with built result at hand.

  • thread A and thread(s) B use the MBuildCV conditional variable field and MBuildResultMutex mutex field guarding that variable to implement the “single producer-multiple consumers scheme”.

  • the build result itself appears in the ‘Ptr’ field when available.

All fields are atomic because they can be accessed from multiple threads.

A specialization of helper class Locked for reference of proper mapping is returned by Acquire function. The use of this class implements RAII to make code look cleaner a bit. Now, GetCache function will return the mapping to be employed that includes the 3 components: kernel name, device as well as any specialization constants values. These get added to BuildResult and are cached. The BuildResult structure is specialized with either PiKernel or PiProgram1.

Hash function

STL hash function specialized for std::string is going to be used: template<>  struct hash<std::string>

Core of caching mechanism

Now, let us see how ‘getOrBuild’ function works. First, we fetch the cache with sequential calls to Acquire and GetCache functions. Then, we check if this is the first attempt to build this kernel or program. This is achieved with an attempt to insert another key-value pair into the map. At this point we try to insert BuildResult stub instance with status equal to “in progress” which will allow other threads to know that someone is (i.e. we’re) building the object (i.e. kernel or program) now. If insertion fails, we will wait for building thread to finish with call to waitUntilBuilt function. This function will throw stored exception2 upon build failure. This allows waiting threads to see the same result as the building thread. Special case of the failure is when build result doesn’t contain the error (i.e. the error wasn’t of sycl::exception type) and the pointer to object in BuildResult instance is nil. In this case, the building thread has finished the build process and has returned an error to the user. But this error may be sporadic in nature and may be spurious. Hence, the waiting thread will try to build the same object once more.

BuildResult structure also contains synchronization objects: mutex and condition variable. We employ them to signal waiting threads that the build process for this kernel/program is finished (either successfully or with a failure).

1: The use of std::remove_pointer was omitted for the sake of simplicity here.

2: Actually, we store contents of the exception: its message and error code.

Persistent cache storage structure

The device code image are stored on file system using structure below:

<cache_root>/
  <device_hash>/
    <device_image_hash>/
      <spec_constants_values_hash>/
        <build_options_hash>/
          <n>.src
          <n>.bin
  • <cache_root> - root directory storing cache files, that depends on environment variables (see SYCL_CACHE_DIR description in the EnvironmentVariables.md);

  • <device_hash> - hash out of device information used to identify target device;

  • <device_image_hash> - hash made out of device image used as input for the JIT compilation;

  • <spec_constants_values_hash> - hash for specialization constants values;

  • <build_options_hash> - hash for all build options;

  • <n> - sequential number of hash collisions. When hashes matches for the specific build but full values don’t, new cache item is added with incremented value (enumeration started from 0).

Two files per cache item are stored on disk:

  • <n>.src contains full values for build parameters (device information, specialization constant values, build options, device image) which is used to resolve hash collisions and analysis of cached items.

  • <n>.bin contains built device code.

Inter-process safety

For on-disk cache there might be access collisions for accessing the same file from different instances of SYCL applications:

  • write collision happens when 2 instances of the same application are started to write to the same cache file/directory;

  • read collision may happen if one application is writing to the file and the other instance of the application is trying to read from it while write operation is not finished.

To avoid collisions the file system entries are locked for read-write access until write operation is finished. e.g if new file or directory should be created/deleted parent directory is locked, file is created in locked state, then the directory and the file are unlocked.

To address cases with high lock rate (multiple copies of the SYCL applications are run in parallel and use the same cache directory) nested directories representing cache keys are used to minimize locks down to applications which build the same device with the same parameters. Directory is locked for minimum time because it can be unlocked once subdirectory is created. If file is created in a directory, the directory should be locked until file creation is done.

Advisory locking 1 is used to ensure that the user/OS tools are able to manage files.

1. Advisory locks work only when a process explicitly acquires and releases locks, and are ignored if a process is not aware of locks.

Cache eviction mechanism

Cache eviction mechanism is required to avoid resources overflow both for memory and disk. The general idea is to delete items following cache size or LRU (least recently used) strategy both for in-memory and persistent cache.

In-memory cache eviction

It is initiated on program/kernel maps access/add item operation. When cache size exceeds storage threshold the items which are least recently used are deleted. TODO: add detailed description of in-memory cache eviction mechanism.

Persistent cache eviction

Persistent cache eviction is going to be applied based on file last access (read/write) date (access time). On SYCL application shutdown phase cache eviction process is initiated which walks through cache directories as follows:

  • if the file is locked, go to the next file;

  • otherwise check file access time:

    • if file access time is above threshold, delete the file and remove parent directory while they are unlocked and empty;

    • otherwise do nothing.

Cache limitations

The caching isn’t done when:

  • when program is built out of source with program::build_with_source() or program::compile_with_source() method;

  • when program is a result of linking multiple programs;

  • program is built using interoperability methods.

Points of improvement (things to do)

  • Employ the same built object for multiple devices of the same ISA, capabilities and so on. NOTE: It’s not really known if it’s possible to check if two distinct devices are exactly the same. Probably this should be an improvement request for plugins. By now it is assumed that two devices with the same device id 2 are the same.

  • Improve testing: cover real use-cases. See currently covered cases here.

  • Implement tool for exploring cache items (initially it is possible using OS utilities).