DPC++ Runtime
Runtime libraries for oneAPI DPC++
DPC++ Execution Graph

SYCL, unlike OpenCL, provides a programming model in which the user doesn't need to manage dependencies between kernels and memory explicitly. The DPC++ Runtime must ensure correct execution with respect to the order in which commands are submitted.


class  sycl::_V1::detail::Command
 The Command class represents some action that needs to be performed on one or more memory objects.
struct  sycl::_V1::detail::MemObjRecord
 Memory Object Record.
class  sycl::_V1::detail::Scheduler::GraphBuilder
 Graph builder class.
class  sycl::_V1::detail::Scheduler::GraphProcessor
 Graph Processor provides interfaces for enqueueing commands and their dependencies to the underlying runtime.
class  sycl::_V1::detail::Scheduler
 DPC++ graph scheduler class.

Detailed Description

SYCL, unlike OpenCL, provides a programming model in which the user doesn't need to manage dependencies between kernels and memory explicitly. The DPC++ Runtime must ensure correct execution with respect to the order in which commands are submitted.

This document describes the part of the DPC++ Runtime that is responsible for building and processing the dependency graph.

A few words about the DPC++ and SYCL execution and memory models

The SYCL framework defines a command group (CG) as an entity that represents a minimal execution block. A command group is submitted to a SYCL queue and consists of a kernel or an explicit memory operation, plus their memory requirements. The SYCL queue defines the device and context on which the kernel should be executed.

The commands that contain explicit memory operations include copy, fill, update_host and other operations. How these operations are carried out is up to the implementation.

The relative order in which command groups are submitted defines the order in which kernels must be executed if their memory requirements intersect. For example, if command group A writes to a buffer X and command group B reads from X, then the scheduled execution order of A and B matches their dynamic submission order (which is the program order if both are submitted from the same host thread).

Memory requirements are requests to SYCL memory objects, such as buffers and images. SYCL memory objects are not bound to any specific context or device; it is the SYCL runtime's responsibility to allocate memory in, and/or copy memory to, the target context to achieve correct execution.

Refer to sections 3.4 and 3.5 of the SYCL 1.2.1 specification for more information about the SYCL execution and memory models.

Example of DPC++ application

// Creating SYCL CPU and GPU queues
sycl::queue CPU_Queue = ...;
sycl::queue GPU_Queue = ...;
// Creating 3 SYCL buffers
auto BufferA = ...; // Buffer is initialized with host memory.
auto BufferB = ...;
auto BufferC = ...;
// "Copy command group" section
// Request processing of an explicit copy operation on the CPU.
// The copy operation reads from BufferA and writes to BufferB.
CPU_Queue.submit([&](handler &CGH) {
  auto A = BufferA.get_access<access::mode::read>(CGH);
  auto B = BufferB.get_access<access::mode::write>(CGH);
  CGH.copy(A, B);
});
// "Multi command group" section
// Request processing of the "multi" kernel on the GPU.
// The kernel reads from BufferB, multiplies by 4 and writes the result to
// BufferC.
GPU_Queue.submit([&](handler &CGH) {
  auto B = BufferB.get_access<access::mode::read>(CGH);
  auto C = BufferC.get_access<access::mode::write>(CGH);
  CGH.parallel_for<class multi>(range<1>{N}, [=](id<1> Index) {
    C[Index] = B[Index] * 4;
  });
});
// "Host accessor creation" section
// Request the latest data of BufferC.
// This is a synchronization point: the DPC++ RT blocks on creation of the
// accessor until the requested data is available.
auto C = BufferC.get_access<access::mode::read>();

In the example above, the DPC++ RT does the following:

  1. Copy command group: the DPC++ RT allocates memory for BufferA and BufferB on the CPU, then executes the explicit copy operation on the CPU.
  2. Multi command group: the DPC++ RT allocates memory for BufferB and BufferC on the GPU, copies the content of BufferB from the CPU to the GPU, then executes the "multi" kernel on the GPU.
  3. Host accessor creation: the DPC++ RT allocates host-accessible memory for BufferC (possibly reusing an existing allocation), then copies the content of BufferC from the GPU to this memory.

So, the example above maps to the following OpenCL pseudo code:

// Initialization (not related to the Scheduler)
Platform = clGetPlatforms(...);
DeviceCPU = clGetDevices(CL_DEVICE_TYPE_CPU, ...);
DeviceGPU = clGetDevices(CL_DEVICE_TYPE_GPU, ...);
ContextCPU = clCreateContext(DeviceCPU, ...);
ContextGPU = clCreateContext(DeviceGPU, ...);
QueueCPU = clCreateCommandQueue(ContextCPU, DeviceCPU, ...);
QueueGPU = clCreateCommandQueue(ContextGPU, DeviceGPU, ...);
// Copy command group:
BufferACPU = clCreateBuffer(ContextCPU, CL_MEM_USE_HOST_PTR, ...);
BufferBCPU = clCreateBuffer(ContextCPU, ...);
CopyEvent = clEnqueueCopyBuffer(QueueCPU, BufferACPU, BufferBCPU, ...);
// Multi command group:
ReadBufferEvent = clEnqueueReadBuffer(QueueCPU, BufferBCPU, HostPtr, CopyEvent, ...);
BufferBGPU = clCreateBuffer(ContextGPU, ...);
// ReadBufferEvent belongs to the CPU context, so a user event in the GPU
// context is completed from a callback once the read finishes.
UserEvent = clCreateUserEvent(ContextGPU);
clSetEventCallback(ReadBufferEvent, event_completion_callback, UserEvent);
WriteBufferEvent = clEnqueueWriteBuffer(QueueGPU, BufferBGPU, HostPtr, UserEvent, ...);
BufferCGPU = clCreateBuffer(ContextGPU, ...);
ProgramGPU = clCreateProgramWithIL(ContextGPU, ...);
clBuildProgram(ProgramGPU);
MultiKernel = clCreateKernel("multi");
clSetKernelArg(MultiKernel, BufferBGPU, ...);
clSetKernelArg(MultiKernel, BufferCGPU, ...);
MultiEvent = clEnqueueNDRangeKernel(QueueGPU, MultiKernel, WriteBufferEvent, ...);
// Host accessor creation:
clEnqueueMapBuffer(QueueGPU, BufferCGPU, BLOCKING_MAP, MultiEvent, ...);
// Releasing mem objects during SYCL buffers destruction.
// Release (not related to the Scheduler)