SYCL Instrumentation¶
Any language or programming paradigm must provide mechanisms to correlate a developer’s use of the language to the debug and performance traces for that use. A lightweight tracing framework (XPTI) was developed to enable this for SYCL and is the primary mechanism that is employed to enable debug and performance traces.
NOTE: For additional information on the XPTI framework, please refer to the Framework Documentation for API use and framework performance data.
This document outlines the use of this framework API in the SYCL runtime library. The primary concept enable by this framework is the generation of a unique 64-bit ID, referred to as the Universal ID (UID), for every public language entry point into the library. This allows tools and other helps in the software stack to correlate debug and performance data by tagging it with the 64-bit UID. The framework also provides the ability to propagate this UID all the way to the driver layers for the target device so data from lower layers and hardware can be correlated easily.
The XPTI concepts in use here are:
Tracepoints - define all the points in a software layer we want to instrument or trace. The trace point is used to generate the UID.
Notification - allows the software layer to communicate the trace information to listeners/subscribers
Callback - implemented by subscribers to specific events to capture the trace information
The SYCL runtime layer defines the tracepoints and notifies the information about any given tracepoint to a registered subscriber. These tracepoints are enabled in meaningful locations of the runtime to provide semantic information about the developer’s use of the language. This would include information such as relationships that form asynchronous task graphs or other constructs such as barriers that are introduced while waiting on events.
Instrumentation Trace Points¶
This section will document all the places in the SYCL runtime that have been instrumented to capture the asynchronous task graphs created by the runtime. The task graphs are captured as graph, nodes and edges:
The graph encapsulates all of the disjoint task graphs generated by the application.
The nodes capture operations that are performed, such as kernel executions or memory transfers
The edges represent dependence relationships, the representation of which mimics control flow as opposed to a dependence graph. The source node in an edge must complete before the target node can begin execution.
All code changes to enable this have been guarded by
XPTI_ENABLE_INSTRUMENTATION
macro and the CMake files have been updated to
have this as an option which is enabled by default and this change is under
llvm/sycl/CMakeLists.txt
.
...
# Create a soft option for enabling or disabling the instrumentation
# of the SYCL runtime
option(SYCL_ENABLE_XPTI_TRACING "Enable tracing of SYCL constructs" ON)
The Graph¶
Any SYCL application can submit command groups to any active queue during the lifetime of the application. Each submission is handled by the runtime and the asynchronous task graphs are updated to reflect the new submission. This may be as simple as adding a new node to the task-graph or adding multiple nodes to the graph, where one of the nodes represents the computation and the others dependent memory transfers.
To model this, we create a global graph for every application instantiation
and all kernel executions in the applications are added as nodes in this
global graph. In the SYCL runtime, there is no obvious location where the
creation of the global graph can be inserted as many objects are
instantiated statically. Currently, we embed the graph creation in the
plugin interface (PI) layer initialize()
call. In this call, we will
perform two operations:
Initialize all listeners and create a trace event to represent the graph. This is done in
sycl/include/sycl/detail/pi.cpp
.Send a
graph_create
event to all subscribers. This notification will only be sent once.
The Nodes¶
The command group lambdas are captured and encapsulated in a Command
object. This object is evaluated for dependencies on data/memory or external
OpenCL events and an asynchronous task graph is built by mapping all these
dependencies, before it is enqueued on the device. In order to capture the
command groups (nodes) and the dependencies (edges), the base class
Command
and any derived classes that are of interest are instrumented.
In this section, we discuss the instrumentation of the Command object in two parts: (1) The changes made to capture end-user source code details for language constructs (2) The instrumentation that handles capturing the relevant metadata.
In order to capture end-user source code information, we have implemented
sycl::detail::code_location
class that uses the builtin functions in the compiler. However, equivalent implementations are unavailable on Windows and separate cross-platform implementation might be used in the future. To mitigate this, the Windows implementation will always reportunknown_file
,unknown_func
and a line number of 0 for source file, function name and line number. We handle this case while processing this information.The source information of a language construct, such as source file, function name, line number and column number allow us to determine if a Command that was previously created for a construct is being created again. In such cases, we will not emit a
node_create
event, but we will bump up the instance count recording the number of instances created. Secondly, the source information allows us to associate a unique ID with the source location and propagate it all the way to the driver, if possible. This will allow us to associate a Kernel event with a source location at all times. All instrumentation that identifies a command object of a given type and emits thenode_create
event is located in theemitInstrumentationData()
and must be implemented by all derived classes.To enable this source location information, we start with enabling the public methods in the queue class, such as
queue.submit()
,queue.parallel_for()
,queue.wait()
, etc to include a default argument that captures the source location information. The location of the line in the caller that makes the call toqueue.submit()
,queue.parallel_for()
, etc is represented in this default argument. These changes are present inqueue.hpp
andordered_queue.hpp
. The default arguments for all public functions are guarded by#ifdef SYCL_INSTRUMENTATION_METADATA
that is currently enabled by default.The location information, when captured, is propagated all the way to the
CommandGroup
object. So, for everyCommandGroup
object, we will have the corresponding source location in end-user code where the command group is submitted to the queue. This metadata is propagated by the instrumentation to the subscribers of the stream.The base
Command class
and all derived classes are instrumented to capture the relevant information for each command object and anode_create
event is generated.
The Node instance¶
Once a command object is created, it is enqueued on the device for
execution. To capture the execution of this node instance, we instrument the
enqueue()
method to determine the cost of this computation or memory
related kernel. As the commands are enqueued, the enqueue method emits a
pair of events indicating the task_begin
and task_end
events that
capture the duration of the enqueued command. For commands that are
asynchronous, the pair of events capture just the kernel submission and the
actual execution of the command on the device is tracked through the
cl_event
returned by the enqueue operation. In the case of host kernel
execution or commands that are synchronous, the cost is measured directly.
In the case of the command being submitted to an OpenCL device, we capture the event of the submitted kernel and propagate it to the subscriber tool. It is up to the tool to register a callback for this event completion and close the task opened for the command object.
The Edges¶
As discussed in the previous section, the command groups submitted to the
device queues form nodes in the asynchronous tasks graphs created by
the SYCL runtime. In addition to these nodes, based on the memory references
(through accessors or USM pointers), additional nodes to allocate
,
copy
and release
are created and they are necessary for the
computation kernels to run. The computation kernel has dependencies on the
memory objects and these dependencies are recorded as event
s and in
our model we represent them as edges between the dependent nodes.
Tools monitoring the event stream then can start capturing the asynchronous
task graph as it is being built. As dependencies are added to a command
object, the instrumentation emits these dependencies as edge_create
events. Each of these edge_create
events encapsulate the two command
objects that have a dependency through this edge. The source object of this
edge event must complete execution first before the target object of the
edge can begin execution.
To instrument this part of the code, the Command::addDep
methods of
the Command object are instrumented to create the trace points and notify
all subscribers.
The Release
command, as implemented in the SYCL runtime, has a
reference to the memory object, but no explicit dependencies are created. To
model the edges correctly, we instrument the waitForRecordToFinish
method in
the Scheduler
where the release operation waits on all the
dependent operations to complete to capture the edges.
This concludes all the changes that were made to the SYCL runtime to support tracing. The next section talks about the XPTI framework that allows applications and runtimes to efficiently capture, record and emit trace notifications for important events during the run.
Documentation of SYCL tracepoints¶
XPTI Stream Domain¶
Traces belong to a named stream and this constitutes a domain of data. The XPTI framework allows the instrumentation logic to define a stream and associate the traces to the stream. A stream also defines the protocol to be observed to decipher the data at the receiving end. The XPTI API defines the notion of a trace point that includes an event, a trace point type and a notification.
The event consists a payload that describes the event (
source file
,function name
,line number
and/or acode pointer
), aunique_id
that is used to identify the event, aglobal user data field
and a mechanism to recordmetadata
associated with the event. Theunique_id
is generated from the payload, so if the trace point is visited multiple times, it represents the sameunique_id
and this allows us to determine the number of instances of a trace point.The trace point type defines the type of notification that is being emitted for the trace point. There are many commonly occurring trace point types that are predefined by the framework, but a stream can extend this set by the extension APIs provided. A subscriber must explicitly register a callback for each trace point type that is of interest to the subscriber. If no subscribers are registered for a stream or a trace point type, then traces will not be emitted. A given trace point event may be used to emit multiple traces to different trace point types.
The notification emits the trace to all subscribers of the stream domain that have a callback registered to the given trace point type. The stream can attached a per-instance user data during this notification call that must be guaranteed to be valid for the duration of the notification call.
This document will outline the protocol for the streams of data being generated by the SYCL runtime.
SYCL Stream "ur.call"
Notification Signatures¶
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_begin that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. const char * |
None |
|
xpti::trace_point_type_t::function_end that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_begin const char * |
None |
SYCL Stream "ur.call.debug"
Notification Signatures¶
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_with_args_begin that marks the beginning of a functionur.call.debug layer.nullptr if code location is not available or event ID with code location data.function_with_args_begin event with the function_with_args_end event. function_with_args_t object, that includes function ID, name, and arguments. |
None |
|
xpti::trace_point_type_t::function_with_args_end that marks the beginning of a functionur.call.debug layer.nullptr if code location is not available or event ID with code location data.function_with_args_begin event with the function_with_args_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_with_args_begin function_with_args_t object, that includes function ID, name, arguments, and return value. |
None |
SYCL Stream "sycl"
Notification Signatures¶
All trace point types in bold provide semantic information about the graph, nodes and edges and the topology of the asynchronous task graphs created by the runtime.
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::graph_create that marks the creation of an asynchronous graph.nullptr nullptr |
None |
|
xpti::trace_point_type_t::node_create that marks the creation of a node in the graph, which could be a computational kernel or memory operation.graph_create event.command_group_node , memory_transfer_node , memory_allocation_node , sub_buffer_creation_node , memory_deallocation_node , host_acc_create_buffer_lock_node , host_acc_destroy_buffer_release_node combined with the address of the command group object and represented as a string [const char * ] |
sycl_device , sycl_device_type , sycl_device_name , kernel_name , from_source , sym_function_name , sym_source_file_name , sym_line_no . The per-queue unique ID can be obtained by using xptiGetStashedTuple API call. See queue_create documentation for usage information. memory_object , offset , access_range , allocation_type , copy_from , copy_to ,device_id , device_name , memory_size , src_memory_ptr , dest_memory_ptr , memory_ptr , value_set . The per-queue unique ID can be obtained by using xptiGetSTashedTuple API call. See queue_create documentation for usage information. |
|
xpti::trace_point_type_t::graph_create that marks the creation of an asynchronous graph.graph_create event.nullptr |
|
|
xpti::trace_point_type_t::task_begin that marks the beginning of a task belonging to one of the nodes in the graph. When the trace event is for a kernel executing on a device other than the the CPU, this task_begin and corresponding task_end mark the submit call. To track the execution of the kernel on the device, the trace_signal event must be monitored to get the kernel event handle from which the execution statistics can be gathered. graph_create event.task_end trace event. nullptr |
Same metadata defined for the node the trace task belongs to. |
|
xpti::trace_point_type_t::task_end that marks the end of a task belonging to one of the nodes in the graph. The specific task instance can be tacked through the instance ID parameter which helps correlate the task_end with the corresponding task_begin . graph_create event.task_begin trace event. nullptr |
Same metadata defined for the node the trace task belongs to. |
|
xpti::trace_point_type_t::signal that marks the an event that contains the event handle of an executing kernel on a device. graph_create event. |
Same metadata defined for the node the trace task belongs to. |
|
xpti::trace_point_type_t::wait_begin that marks the beginning of the wait on an event nullptr wait_begin event with the wait_end event. queue.wait and the address of the event sent in as const char * queue.wait() or queue.wait_and_throw() will capture the waiting on the action represented by the event object, which could be the execution of a kernel, completion of a memory operation, etc that is embedded in the command group handler. All wait events contain metadata that indicates the SYCL device on which the corresponding operation has been submitted. If the event is from a command group handler, then the source location information is available as well. |
|
|
xpti::trace_point_type_t::wait_end that marks the beginning of the wait on an event nullptr wait_begin event with the wait_end event. queue.wait and the address of the event as const char * |
|
|
xpti::trace_point_type_t::barrier_begin that marks the beginning of a barrier while enqueuing a command group objectgraph_create event.barrier_begin event with the barrier_end event. enqueue.barrier and the reason for the barrier as a const char * Buffer locked by host accessor , Blocked by host task or Unknown reason . |
sycl_device , sycl_device_type , sycl_device_name , kernel_name , from_source , sym_function_name , sym_source_file_name , sym_line_no , sym_column_no memory_object , offset , access_range_start , access_range_end , allocation_type , copy_from , copy_to |
|
xpti::trace_point_type_t::barrier_end that marks the end of the barrier that is encountered during enqueue.graph_create event.barrier_begin event with the barrier_end event. enqueue.barrier and the reason for the barrier as a const char * Buffer locked by host accessor , Blocked by host task or Unknown reason . |
sycl_device , sycl_device_type , sycl_device_name , kernel_name , from_source , sym_function_name , sym_source_file_name , sym_line_no , sym_column_no memory_object , offset , access_range_start , access_range_end , allocation_type , copy_from , copy_to |
|
xpti::trace_point_type_t::diagnostics that represents general purpose notifications. For example, it is emitted when an exception is thrown in SYCL runtime. const char * |
|
|
xpti::trace_point_type_t::queue_create that marks the creation of a queue, which could be a device or host queue. |
queue_id field has been deprecated and replaced with the instance information and supporting XPTI API calls (xptiGetStashedTuple ). Using the instance information is the recommended approach. char *key = 0; uint64_t value; if (xptiGetStashedTuple(&key, value) ==xpti::result_t::XPTI_RESULT_SUCCESS) { // key will contain "queue_id" // value will contain the per-queue unique ID }
|
|
xpti::trace_point_type_t::queue_destroy that marks the destruction of a queue, which could be a device or host queue. |
queue_id field has been deprecated and replaced with the instance information and supporting XPTI API calls (xptiGetStashedTuple ). Using the instance information is the recommended approach. queue_handle is absent for host queue since no backend object is used. |
Metadata description¶
Metadata |
Type |
Description |
---|---|---|
|
|
Value of |
|
|
Start of accessor range |
|
|
End of accessor range |
|
C-style string |
Allocation type |
|
|
ID of source device |
|
|
ID of target device |
|
|
Unique identifier of event |
|
|
|
|
C-style string |
Kernel name |
|
|
Unique identifier of memory object |
|
|
Accessor offset size in bytes |
|
|
Unique identifier of SYCL device |
|
C-style string |
|
|
C-style string |
Result of |
|
C-style string |
Function name |
|
C-style string |
Source file name |
|
|
File line number |
|
|
File column number |
|
|
Includes kernel execution parameters (global size, local size, offset) and number of kernel arguments |
|
|
Description for the Nth kernel argument. It includes argument kind (sycl::detail::kernel_param_kind_t), pointer to the value, size and index in the argument list. |
Buffer management stream "sycl.experimental.buffer"
Notification Signatures¶
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::offload_memory_object_data_t that marks offload buffer creation pointoneapi.experimental.buffer layer.xpti::trace_event_data_t - contains information about source location.nullptr since no begin-end event alignment is needed. offload_memory_object_data_t object, that includes buffer object ID, host pointer used to create/initialize buffer, buffer element information (type name, size), number of buffer dimensions and buffer size for each dimension. |
None |
|
xpti::trace_point_type_t::offload_association_data_t that provides association between user level buffer object and platform specific memory objectoneapi.experimental.buffer layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed.offload_association_data_t object, that includes user object ID and platform-specific representation for offload buffer. |
None |
|
xpti::trace_point_type_t::offload_memory_object_data_t that marks offload buffer destruction pointoneapi.experimental.buffer layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed. offload_memory_object_data_t object, that includes buffer object ID. |
None |
|
xpti::trace_point_type_t::offload_memory_object_release_data_t that provides information about release of platform specific memory objectnullptr - since the stream of data just captures functions being called.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed.offload_association_data_t object, that includes user object ID and platform-specific representation for offload buffer. |
None |
|
xpti::trace_point_type_t::offload_accessor_data_t that marks offload accessor creation pointoneapi.experimental.buffer layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed. offload_accessor_data_t object, that includes buffer object ID, accessor handle created from specific buffer, accessor information (access target and mode). |
None |
Image management stream "sycl.experimental.image"
Notification Signatures¶
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::offload_image_data_t that marks offload image creation pointoneapi.experimental.image layer.xpti::trace_event_data_t - contains information about source location.nullptr since no begin-end event alignment is needed. offload_image_data_t object, that includes image object ID, host pointer used to create/initialize image, number of image dimensions, the image format and sampler information (addressing mode, coordinate normalization mode, filtering mode). |
None |
|
xpti::trace_point_type_t::offload_association_data_t that provides association between user level image object and platform specific memory objectoneapi.experimental.image layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed.offload_association_data_t object, that includes user object ID and platform-specific representation for offload image. |
None |
|
xpti::trace_point_type_t::offload_image_data_t that marks offload image destruction pointoneapi.experimental.image layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed. offload_image_data_t object, that includes image object ID. |
None |
|
xpti::trace_point_type_t::offload_association_data_t that provides information about release of platform specific memory objectnullptr - since the stream of data just captures functions being called.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed.offload_association_data_t object, that includes user object ID and platform-specific representation for offload image. |
None |
|
xpti::trace_point_type_t::offload_image_accessor_data_t that marks offload image accessor creation pointoneapi.experimental.image layer.nullptr - since the stream of data just captures functions being called.nullptr since no begin-end event alignment is needed. offload_image_accessor_data_t object, that includes image object ID, accessor handle created from specific image, access target (if the accessor is not a host accessor), access mode (if the accessor is to an unsampled image) and element information (type name, size). |
None |
SYCL Memory Allocations Stream "sycl.experimental.mem_alloc"
Notification Signatures¶
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::mem_alloc_begin that marks the beginning of memory allocation processoneapi.level_zero.experimental.mem_alloc layer.nullptr - since the stream of data just captures functions being called.mem_alloc_begin event with the mem_alloc_end event. mem_alloc_data_t object, that includes memory object ID (if any), allocation size, and guard zone size (if any). |
None |
|
xpti::trace_point_type_t::mem_alloc_end that marks the end of memory allocation processoneapi.level_zero.experimental.mem_alloc layer.nullptr - since the stream of data just captures functions being called.mem_alloc_begin event with the mem_alloc_end event. This value is guaranteed to be the same value received by the trace event for the corresponding mem_alloc_begin .mem_alloc_data_t object, that includes memory object ID (if any), allocated pointer, allocation size, and guard zone size (if any). |
None |
|
xpti::trace_point_type_t::mem_release_begin that marks the beginning of memory allocation processoneapi.level_zero.experimental.mem_alloc layer.nullptr - since the stream of data just captures functions being called.mem_release_begin event with the mem_release_end event. mem_alloc_data_t object, that includes memory object ID (if any) and released pointer. |
None |
|
xpti::trace_point_type_t::mem_release_end that marks the end of memory allocation processoneapi.level_zero.experimental.mem_alloc layer.nullptr - since the stream of data just captures functions being called.mem_release_begin event with the mem_release_end event. This value is guaranteed to be the same value received by the trace event for the corresponding mem_release_begin .mem_alloc_data_t object, that includes memory object ID (if any) and released pointer. |
None |
SYCL Stream "sycl.experimental.level_zero.call"
Notification Signatures¶
This stream transfers events about Level Zero API calls made by SYCL application.
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_begin that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. const char * |
None |
|
xpti::trace_point_type_t::function_end that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_begin const char * |
None |
SYCL Stream "sycl.experimental.level_zero.debug"
Notification Signatures¶
This stream transfers events about Level Zero API calls and their function arguments made by SYCL application.
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_with_args_begin that marks the beginning of a functionur.call.debug layer.nullptr - since the stream of data just captures functions being called.function_with_args_begin event with the function_with_args_end event. function_with_args_t object, that includes function ID, name, and arguments. |
None |
|
xpti::trace_point_type_t::function_with_args_end that marks the beginning of a functionur.call.debug layer.nullptr - since the stream of data just captures functions being called.function_with_args_begin event with the function_with_args_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_with_args_begin function_with_args_t object, that includes function ID, name, arguments, and return value. |
None |
SYCL Stream "sycl.experimental.cuda.call"
Notification Signatures¶
This stream transfers events about CUDA Driver API calls made by SYCL application.
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_begin that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. const char * |
None |
|
xpti::trace_point_type_t::function_end that marks the beginning of a functionur.call layer.nullptr - since the stream of data just captures functions being called.function_begin event with the function_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_begin const char * |
None |
SYCL Stream "sycl.experimental.cuda.debug"
Notification Signatures¶
This stream transfers events about CUDA Driver API calls and their function arguments made by SYCL application.
Trace Point Type |
Parameter Description |
Metadata |
---|---|---|
|
xpti::trace_point_type_t::function_with_args_begin that marks the beginning of a functionur.call.debug layer.nullptr - since the stream of data just captures functions being called.function_with_args_begin event with the function_with_args_end event. function_with_args_t object, that includes function ID, name, and arguments. |
None |
|
xpti::trace_point_type_t::function_with_args_end that marks the beginning of a functionur.call.debug layer.nullptr - since the stream of data just captures functions being called.function_with_args_begin event with the function_with_args_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_with_args_begin function_with_args_t object, that includes function ID, name, arguments, and return value. |
None |