DPC++ Runtime
Runtime libraries for oneAPI DPC++
DPC++ Execution Graph

Classes

class  cl::sycl::detail::Command
 The Command class represents some action that needs to be performed on one or more memory objects.

struct  cl::sycl::detail::MemObjRecord
 Memory Object Record.

class  cl::sycl::detail::Scheduler::GraphBuilder
 Graph builder class.

class  cl::sycl::detail::Scheduler::GraphProcessor
 Graph Processor provides interfaces for enqueueing commands and their dependencies to the underlying runtime.

class  cl::sycl::detail::Scheduler
 DPC++ graph scheduler class.

Detailed Description

SYCL, unlike OpenCL, provides a programming model in which the user doesn't need to manage dependencies between kernels and memory explicitly. The DPC++ Runtime must ensure correct execution with respect to the order in which commands are submitted.

This document describes the part of the DPC++ Runtime that is responsible for building and processing the dependency graph.

A couple of words about DPC++ and SYCL execution and memory model

The SYCL framework defines a command group (CG) as an entity that represents a minimal execution block. A command group is submitted to a SYCL queue and consists of a kernel or an explicit memory operation, together with its requirements. The SYCL queue defines the device and context in which the kernel is executed.

Commands that contain explicit memory operations include copy, fill, update_host, and other operations. How these operations are defined is left to the implementation.

The relative order in which command groups are submitted defines the order in which kernels must be executed if their memory requirements intersect. For example, if command group A writes to buffer X and command group B reads from X, then A and B are scheduled for execution in their dynamic submission order (which matches program order if both are submitted from the same host thread).
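The ordering rule can be illustrated with a small, self-contained sketch. This is a toy model, not the DPC++ GraphBuilder: the names `ToyGraph`, `Requirement`, and `Access` are hypothetical, and only plain read and write access are modeled. It assumes that a reader waits for the last writer of a buffer, and that a writer waits for the last writer and for every reader submitted since then.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical access modes; real SYCL defines more than these two.
enum class Access { Read, Write };

// One memory requirement of a command group: which buffer, and how it is used.
struct Requirement {
  std::string Buffer;
  Access Mode;
};

// Toy dependency tracker (an assumption for illustration only). Command groups
// are numbered in submission order; each one stores the ids of the earlier
// command groups it must wait for.
class ToyGraph {
public:
  // Registers a command group and returns its id.
  std::size_t submit(const std::vector<Requirement> &Reqs) {
    const std::size_t Id = Deps.size();
    Deps.emplace_back();
    for (const Requirement &R : Reqs) {
      BufferState &S = State[R.Buffer];
      // Any access must wait for the last writer of the buffer.
      if (S.HasWriter)
        Deps[Id].insert(S.LastWriter);
      if (R.Mode == Access::Write) {
        // A writer must also wait for readers submitted since the last write.
        Deps[Id].insert(S.Readers.begin(), S.Readers.end());
        S.Readers.clear();
        S.LastWriter = Id;
        S.HasWriter = true;
      } else {
        S.Readers.push_back(Id);
      }
    }
    Deps[Id].erase(Id); // a command group never depends on itself
    return Id;
  }

  // True if command group `Later` has a direct edge to `Earlier`.
  bool dependsOn(std::size_t Later, std::size_t Earlier) const {
    return Deps.at(Later).count(Earlier) != 0;
  }

private:
  struct BufferState {
    bool HasWriter = false;
    std::size_t LastWriter = 0;
    std::vector<std::size_t> Readers;
  };
  std::map<std::string, BufferState> State;
  std::vector<std::set<std::size_t>> Deps;
};
```

Submitting `{{"X", Access::Write}}` as A and then `{{"X", Access::Read}}` as B makes `dependsOn(B, A)` true, while a command group that only touches an unrelated buffer gets no edge to either of them and may be scheduled independently.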

Memory requirements are requests to access SYCL memory objects, such as buffers and images. SYCL memory objects are not bound to any specific context or device; it is the responsibility of the SYCL runtime to allocate memory in the target context and/or copy data to it to achieve correct execution.

Refer to SYCL Specification 1.2.1 sections 3.4 and 3.5 for more information about the SYCL execution and memory models.

Example of DPC++ application

{
  // Creating SYCL CPU and GPU queues
  cl::sycl::queue CPU_Queue = ...;
  cl::sycl::queue GPU_Queue = ...;
  // Creating 3 SYCL buffers
  auto BufferA = ...; // Buffer is initialized with host memory.
  auto BufferB = ...;
  auto BufferC = ...;

  // "Copy command group" section
  // Request processing explicit copy operation on CPU
  // The copy operation reads from BufferA and writes to BufferB
  CPU_Queue.submit([&](handler &CGH) {
    auto A = BufferA.get_access<read>(CGH);
    auto B = BufferB.get_access<write>(CGH);
    CGH.copy(A, B);
  });

  // "Multi command group" section
  // Request processing multi kernel on GPU
  // The kernel reads from BufferB, multiplies by 4 and writes result to
  // BufferC
  GPU_Queue.submit([&](handler &CGH) {
    auto B = BufferB.get_access<read>(CGH);
    auto C = BufferC.get_access<write>(CGH);
    CGH.parallel_for<class multi>(range<1>{N}, [=](id<1> Index) {
      C[Index] = B[Index] * 4;
    });
  });

  // "Host accessor creation" section
  // Request the latest data of BufferC for the moment
  // This is a synchronization point, which means that the DPC++ RT blocks
  // on creation of the accessor until requested data is available.
  auto C = BufferC.get_access<read>();
}

In the example above, the DPC++ RT does the following:

  1. Copy command group. The DPC++ RT allocates memory for BufferA and BufferB on the CPU, then executes an explicit copy operation on the CPU.
  2. Multi command group. The DPC++ RT allocates memory for BufferC and BufferB on the GPU, copies the content of BufferB from the CPU to the GPU, then executes the "multi" kernel on the GPU.
  3. Host accessor creation. The DPC++ RT allocates memory that is accessible to the user for BufferC (already allocated memory may be reused), then copies the content of BufferC from the GPU to this memory.
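The three steps can be replayed with a toy model that records which contexts hold an allocation of each buffer and which context holds the latest data. This is only a sketch under simplifying assumptions: `ToyRecord`, `ToyRuntime`, and `replayExample` are hypothetical names, contexts are plain strings, a write-only access is assumed never to need a copy, and a buffer with no recorded writer (such as the host-initialized BufferA) is treated as valid wherever it is first allocated.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy record for one memory object (not the real MemObjRecord): which contexts
// hold an allocation, and which context holds the latest data.
struct ToyRecord {
  std::set<std::string> Allocated;
  std::string LatestIn; // empty until some command has written the buffer
};

// Toy runtime that logs the actions needed before a command can run.
class ToyRuntime {
public:
  // Declares that the next command accesses `Buffer` in `Context`.
  void require(const std::string &Buffer, const std::string &Context,
               bool Writes) {
    ToyRecord &R = Records[Buffer];
    // Allocate on first use in this context.
    if (R.Allocated.insert(Context).second)
      Log.push_back("alloc " + Buffer + " on " + Context);
    // A reader needs the latest data if it lives in another context.
    if (!Writes && !R.LatestIn.empty() && R.LatestIn != Context)
      Log.push_back("copy " + Buffer + " " + R.LatestIn + " -> " + Context);
    // After a write or a copy, this context holds the latest data.
    if (Writes || !R.LatestIn.empty())
      R.LatestIn = Context;
  }

  // Records the execution of a kernel or explicit memory operation.
  void run(const std::string &Name, const std::string &Context) {
    Log.push_back("run " + Name + " on " + Context);
  }

  std::vector<std::string> Log;

private:
  std::map<std::string, ToyRecord> Records;
};

// Replays the three sections of the example application.
std::vector<std::string> replayExample() {
  ToyRuntime RT;
  // 1. Copy command group on the CPU.
  RT.require("BufferA", "CPU", /*Writes=*/false);
  RT.require("BufferB", "CPU", /*Writes=*/true);
  RT.run("copy", "CPU");
  // 2. Multi command group on the GPU.
  RT.require("BufferB", "GPU", /*Writes=*/false);
  RT.require("BufferC", "GPU", /*Writes=*/true);
  RT.run("multi", "GPU");
  // 3. Host accessor creation.
  RT.require("BufferC", "host", /*Writes=*/false);
  return RT.Log;
}
```

The log returned by `replayExample()` contains nine entries, including `copy BufferB CPU -> GPU` before the kernel runs and `copy BufferC GPU -> host` as the final action, which mirrors the order of operations in the list above.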

The DPC++ application above is therefore converted to the following OpenCL pseudo code:

// Initialization (not related to the Scheduler)
Platform = clGetPlatforms(...);
DeviceCPU = clGetDevices(CL_DEVICE_TYPE_CPU, ...);
DeviceGPU = clGetDevices(CL_DEVICE_TYPE_GPU, ...);
ContextCPU = clCreateContext(DeviceCPU, ...);
ContextGPU = clCreateContext(DeviceGPU, ...);
QueueCPU = clCreateCommandQueue(ContextCPU, DeviceCPU, ...);
QueueGPU = clCreateCommandQueue(ContextGPU, DeviceGPU, ...);
// Copy command group:
BufferACPU = clCreateBuffer(ContextCPU, CL_MEM_USE_HOST_PTR, ...);
BufferBCPU = clCreateBuffer(ContextCPU, ...);
CopyEvent = clEnqueueCopyBuffer(QueueCPU, BufferACPU, BufferBCPU, ...);
// Multi command group:
ReadBufferEvent =
    clEnqueueReadBuffer(QueueCPU, BufferBCPU, HostPtr, CopyEvent, ...);
BufferBGPU = clCreateBuffer(ContextGPU, ...);
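// The user event is created in the GPU context because QueueGPU can only wait
// on events that belong to its own context.
UserEvent = clCreateUserEvent(ContextGPU);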
clSetEventCallback(ReadBufferEvent, event_completion_callback,
                   /*data=*/UserEvent);
WriteBufferEvent =
    clEnqueueWriteBuffer(QueueGPU, BufferBGPU, HostPtr, UserEvent, ...);
BufferCGPU = clCreateBuffer(ContextGPU, ...);
ProgramGPU = clCreateProgramWithIL(ContextGPU, ...);
clBuildProgram(ProgramGPU);
MultiKernel = clCreateKernel("multi");
clSetKernelArg(MultiKernel, BufferBGPU, ...);
clSetKernelArg(MultiKernel, BufferCGPU, ...);
MultiEvent =
    clEnqueueNDRangeKernel(QueueGPU, MultiKernel, WriteBufferEvent, ...);
// Host accessor creation:
clEnqueueMapBuffer(QueueGPU, BufferCGPU, BLOCKING_MAP, MultiEvent, ...);
// Releasing mem objects during SYCL buffers destruction.
clReleaseBuffer(BufferACPU);
clReleaseBuffer(BufferBCPU);
clReleaseBuffer(BufferBGPU);
clReleaseBuffer(BufferCGPU);
// Release (not related to the Scheduler)
clReleaseKernel(MultiKernel);
clReleaseProgram(ProgramGPU);
clReleaseContext(ContextGPU);
clReleaseContext(ContextCPU);
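
The pseudo code bridges the CPU and GPU contexts with a user event that is completed from a callback on ReadBufferEvent, since an OpenCL command queue cannot wait directly on an event from another context. The pattern can be mimicked in plain C++ as a rough sketch. `ToyEvent` and `chain` are hypothetical stand-ins and not OpenCL or DPC++ APIs; the sketch is single-threaded and makes no attempt at thread safety.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy event (a stand-in for cl_event): tracks completion and fires callbacks.
class ToyEvent {
public:
  // Runs `Callback` once the event completes, or immediately if it already has.
  void onComplete(std::function<void()> Callback) {
    if (Done)
      Callback();
    else
      Callbacks.push_back(std::move(Callback));
  }

  // Marks the event as complete and fires every pending callback once.
  void complete() {
    if (Done)
      return;
    Done = true;
    for (auto &Callback : Callbacks)
      Callback();
    Callbacks.clear();
  }

  bool isComplete() const { return Done; }

private:
  bool Done = false;
  std::vector<std::function<void()>> Callbacks;
};

// Links two events so that completing `Source` also completes `Bridge`.
// `Bridge` plays the role of UserEvent and must outlive the callback.
void chain(ToyEvent &Source, ToyEvent &Bridge) {
  Source.onComplete([&Bridge] { Bridge.complete(); });
}
```

With `chain(ReadDone, User)` in place, `User` stays incomplete until `ReadDone.complete()` is called, which is the moment a command waiting on the bridge event (the write to BufferBGPU in the pseudo code) would become ready to run.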