SYCL Instrumentation

Any language or programming paradigm must provide mechanisms to correlate a developer's use of the language with the debug and performance traces generated for that use. A lightweight tracing framework (XPTI) was developed to enable this for SYCL, and it is the primary mechanism employed to enable debug and performance traces.

NOTE: For additional information on the XPTI framework, please refer to the Framework Documentation for API use and framework performance data.

This document outlines the use of this framework API in the SYCL runtime library. The primary concept enabled by this framework is the generation of a unique 64-bit ID, referred to as the Universal ID (UID), for every public language entry point into the library. This allows tools and other layers in the software stack to correlate debug and performance data by tagging it with the 64-bit UID. The framework also provides the ability to propagate this UID all the way to the driver layers for the target device, so data from lower layers and hardware can be correlated easily.
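The UID concept can be illustrated with a small sketch. The combining function below is invented for illustration and is not XPTI's actual UID algorithm; the point is only that the same payload (source file, function name, line number) always yields the same 64-bit value, so repeated visits to one tracepoint share one UID.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Illustrative only: XPTI's real UID generation differs. The property
// demonstrated is that identical payloads map to identical 64-bit IDs,
// so every visit to the same tracepoint reuses one UID.
uint64_t make_uid(const std::string &file, const std::string &func,
                  uint32_t line) {
  uint64_t h = std::hash<std::string>{}(file);
  h ^= std::hash<std::string>{}(func) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
  h ^= std::hash<uint32_t>{}(line) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
  return h;
}
```

A tool that receives traces tagged with such an ID can aggregate all instances of the same source construct without string comparisons.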

The XPTI concepts in use here are:

  1. Tracepoints - define all the points in a software layer that we want to instrument or trace. The trace point is used to generate the UID.

  2. Notification - allows the software layer to communicate the trace information to listeners/subscribers.

  3. Callback - implemented by subscribers to specific events to capture the trace information.

The SYCL runtime layer defines the tracepoints and notifies registered subscribers with the information about any given tracepoint. These tracepoints are enabled in meaningful locations of the runtime to provide semantic information about the developer's use of the language. This includes information such as the relationships that form asynchronous task graphs, or other constructs, such as barriers, that are introduced while waiting on events.
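The tracepoint/notification/callback flow can be modeled with a toy dispatcher. All names below are invented for illustration and are not the XPTI API (see the framework documentation for the real interfaces); the sketch only shows the fan-out pattern: subscribers register callbacks per trace-point type, and a notification reaches every registered callback, while unsubscribed types emit nothing.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Toy model of the XPTI flow: the runtime defines tracepoints, subscribers
// register callbacks per trace-point type, and notifications fan out to
// every registered callback. Names are illustrative, not the XPTI API.
using Callback = std::function<void(uint16_t /*trace_type*/,
                                    uint64_t /*uid*/,
                                    const void * /*user_data*/)>;

class ToyStream {
  std::map<uint16_t, std::vector<Callback>> callbacks_;

public:
  void register_callback(uint16_t trace_type, Callback cb) {
    callbacks_[trace_type].push_back(std::move(cb));
  }
  // Notification: deliver the trace to all subscribers of this type.
  // If no callback is registered, no trace is emitted at all.
  void notify(uint16_t trace_type, uint64_t uid, const void *user_data) {
    auto it = callbacks_.find(trace_type);
    if (it == callbacks_.end())
      return;
    for (auto &cb : it->second)
      cb(trace_type, uid, user_data);
  }
};
```

This mirrors the cost model described above: when no subscriber has registered for a stream or type, the notification path is a cheap lookup and early return.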

Instrumentation Trace Points

This section documents all the places in the SYCL runtime that have been instrumented to capture the asynchronous task graphs created by the runtime. The task graphs are captured as a graph, nodes and edges:

  • The graph encapsulates all of the disjoint task graphs generated by the application.

  • The nodes capture operations that are performed, such as kernel executions or memory transfers.

  • The edges represent dependence relationships; the representation mimics control flow as opposed to a dependence graph. The source node in an edge must complete before the target node can begin execution.
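The graph/node/edge model above can be sketched as a minimal dependence structure (illustrative only; the runtime's real Command graph carries far more state). An edge (src, dst) means src must complete before dst may start, so any topological order of the nodes is a legal execution order.

```cpp
#include <map>
#include <vector>

// Minimal sketch of the model described above: nodes are operations
// (kernels, memory transfers), and an edge (src, dst) means src must
// complete before dst may begin. A topological sort yields one legal
// execution order; this is not the runtime's actual scheduler.
struct TaskGraph {
  std::map<int, std::vector<int>> succ; // src -> targets
  std::map<int, int> indeg;             // node -> incoming edge count

  void add_node(int id) {
    succ.emplace(id, std::vector<int>{});
    indeg.emplace(id, 0);
  }
  void add_edge(int src, int dst) {
    succ[src].push_back(dst);
    ++indeg[dst];
  }
  std::vector<int> execution_order() const {
    std::map<int, int> deg = indeg;
    std::vector<int> ready, order;
    for (auto &p : deg)
      if (p.second == 0)
        ready.push_back(p.first);
    while (!ready.empty()) {
      int n = ready.back();
      ready.pop_back();
      order.push_back(n);
      auto it = succ.find(n);
      if (it != succ.end())
        for (int t : it->second)
          if (--deg[t] == 0)
            ready.push_back(t);
    }
    return order; // shorter than the node count if the graph had a cycle
  }
};
```

For example, an allocation node and a copy node both feeding a kernel node guarantees the kernel is ordered last.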

All code changes to enable this are guarded by the XPTI_ENABLE_INSTRUMENTATION macro. The CMake files have been updated to expose this as an option, which is enabled by default; this change is under llvm/sycl/CMakeLists.txt.

...
# Create a soft option for enabling or disabling the instrumentation
# of the SYCL runtime
option(SYCL_ENABLE_XPTI_TRACING "Enable tracing of SYCL constructs" ON)

The Graph

Any SYCL application can submit command groups to any active queue during the lifetime of the application. Each submission is handled by the runtime, and the asynchronous task graphs are updated to reflect the new submission. This may be as simple as adding a new node to the task graph, or it may add multiple nodes to the graph, where one of the nodes represents the computation and the others represent dependent memory transfers.

To model this, we create a global graph for every application instantiation, and all kernel executions in the application are added as nodes in this global graph. In the SYCL runtime, there is no obvious location where the creation of the global graph can be inserted, as many objects are instantiated statically. Currently, we embed the graph creation in the plugin interface (PI) layer initialize() call. In this call, we perform two operations:

  1. Initialize all listeners and create a trace event to represent the graph. This is done in sycl/include/CL/sycl/detail/pi.cpp.

  2. Send a graph_create event to all subscribers. This notification will only be sent once.
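The once-only guarantee of step 2 can be sketched with std::call_once. The function and counter names below are invented for the demo and are not the real PI layer code; the point is that however many times initialization runs, the graph_create notification fires exactly once.

```cpp
#include <atomic>
#include <mutex>

// Sketch of the "send graph_create exactly once" guarantee: however many
// times initialization is invoked, the notification is emitted a single
// time. Names are illustrative, not the actual PI implementation.
std::atomic<int> graph_create_notifications{0};
std::once_flag graph_once;

void initialize_plugin() {
  std::call_once(graph_once, [] {
    // Create the global graph trace event, then notify all subscribers.
    graph_create_notifications.fetch_add(1);
  });
}
```

This matches the stated protocol: every later node_create and edge_create event uses the single global graph event as its parent.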

The Nodes

The command group lambdas are captured and encapsulated in a Command object. This object is evaluated for dependencies on data/memory or external OpenCL events and an asynchronous task graph is built by mapping all these dependencies, before it is enqueued on the device. In order to capture the command groups (nodes) and the dependencies (edges), the base class Command and any derived classes that are of interest are instrumented.

In this section, we discuss the instrumentation of the Command object in two parts: (1) the changes made to capture end-user source code details for language constructs, and (2) the instrumentation that captures the relevant metadata.

  1. In order to capture end-user source code information, we have implemented the cl::sycl::detail::code_location class, which uses the builtin functions in the compiler. However, equivalent implementations are unavailable on Windows, and a separate cross-platform implementation might be used in the future. To mitigate this, the Windows implementation will always report unknown_file, unknown_func and a line number of 0 for the source file, function name and line number. We handle this case while processing this information.

    The source information of a language construct, such as source file, function name, line number and column number, allows us to determine if a Command that was previously created for a construct is being created again. In such cases, we do not emit a node_create event; instead, we bump up the instance count recording the number of instances created. Secondly, the source information allows us to associate a unique ID with the source location and propagate it all the way to the driver, if possible. This allows us to associate a kernel event with a source location at all times. All instrumentation that identifies a command object of a given type and emits the node_create event is located in the emitInstrumentationData() method and must be implemented by all derived classes.

    To enable this source location information, we extend the public methods in the queue class, such as queue.submit(), queue.parallel_for(), queue.wait(), etc., with a default argument that captures the source location information. This default argument records the location of the line in the caller that invokes queue.submit(), queue.parallel_for(), etc. These changes are present in queue.hpp and ordered_queue.hpp. The default arguments for all public functions are guarded by #ifdef SYCL_INSTRUMENTATION_METADATA, which is currently enabled by default.

    The location information, when captured, is propagated all the way to the CommandGroup object. So, for every CommandGroup object, we will have the corresponding source location in end-user code where the command group is submitted to the queue. This metadata is propagated by the instrumentation to the subscribers of the stream.

  2. The base Command class and all derived classes are instrumented to capture the relevant information for each command object and a node_create event is generated.
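The default-argument capture described in item 1 can be sketched with the compiler builtins it relies on. This is a simplified stand-in for the real cl::sycl::detail::code_location class: on GCC/Clang, __builtin_FILE/__builtin_FUNCTION/__builtin_LINE used as default arguments evaluate at the call site, and the fallback branch mirrors the unknown_file/unknown_func/0 behavior described above for toolchains without the builtins.

```cpp
#include <cstdint>

// Simplified sketch of the code_location mechanism; the real class lives
// in the SYCL headers. The key property: builtins used as *default
// arguments* evaluate at the caller's location, not the callee's.
struct code_location {
#if defined(__GNUC__) || defined(__clang__)
  static code_location current(const char *file = __builtin_FILE(),
                               const char *func = __builtin_FUNCTION(),
                               uint32_t line = __builtin_LINE()) {
    return {file, func, line};
  }
#else
  // Fallback mirroring the documented Windows behavior.
  static code_location current(const char *file = "unknown_file",
                               const char *func = "unknown_func",
                               uint32_t line = 0) {
    return {file, func, line};
  }
#endif
  const char *file;
  const char *func;
  uint32_t line;
};

// A queue-like entry point: the default argument captures where the
// caller invoked it, without the caller passing anything explicitly.
uint32_t submit_line(const code_location &loc = code_location::current()) {
  return loc.line;
}
```

C++20's std::source_location standardizes this same pattern.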

The Node Instance

Once a command object is created, it is enqueued on the device for execution. To capture the execution of this node instance, we instrument the enqueue() method to determine the cost of the computation or memory-related kernel. As the commands are enqueued, the enqueue method emits a pair of task_begin and task_end events that capture the duration of the enqueued command. For commands that are asynchronous, the pair of events captures just the kernel submission; the actual execution of the command on the device is tracked through the cl_event returned by the enqueue operation. In the case of host kernel execution, or commands that are synchronous, the cost is measured directly.
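The task_begin/task_end pairing can be sketched as follows. The record/notify plumbing here is invented for the demo (the runtime notifies subscribers rather than appending to a vector); what matters is that both events of a pair carry the same instance ID so a tool can match them and compute the duration.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch of the task_begin/task_end pairing: both notifications for one
// enqueue carry the same instance ID, letting a tool correlate them.
// The TraceRecord vector stands in for real subscriber notifications.
struct TraceRecord {
  const char *type;
  uint64_t instance;
};
std::vector<TraceRecord> emitted;
std::atomic<uint64_t> next_instance{1};

template <typename Work>
void enqueue_with_trace(Work &&work) {
  const uint64_t instance = next_instance.fetch_add(1);
  emitted.push_back({"task_begin", instance});
  work(); // the enqueue; async device work is tracked via its cl_event
  emitted.push_back({"task_end", instance});
}
```

For synchronous commands the bracketed region is the full cost; for asynchronous ones it covers only the submission, as described above.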

In the case of the command being submitted to an OpenCL device, we capture the event of the submitted kernel and propagate it to the subscriber tool. It is up to the tool to register a callback for this event completion and close the task opened for the command object.

The Edges

As discussed in the previous section, the command groups submitted to the device queues form nodes in the asynchronous task graphs created by the SYCL runtime. In addition to these nodes, based on the memory references (through accessors or USM pointers), additional nodes to allocate, copy and release memory are created; these are necessary for the computation kernels to run. The computation kernel has dependencies on the memory objects, and these dependencies are recorded as events. In our model, we represent them as edges between the dependent nodes.

Tools monitoring the event stream can then start capturing the asynchronous task graph as it is being built. As dependencies are added to a command object, the instrumentation emits these dependencies as edge_create events. Each of these edge_create events encapsulates the two command objects that have a dependency through this edge. The source object of this edge event must complete execution before the target object of the edge can begin execution.

To instrument this part of the code, the Command::addDep methods are instrumented to create the trace points and notify all subscribers.

The Release command, as implemented in the SYCL runtime, has a reference to the memory object, but no explicit dependencies are created for it. To model the edges correctly, we instrument the waitForRecordToFinish method in the Scheduler, where the release operation waits on all the dependent operations to complete, and capture the edges there.

This concludes all the changes that were made to the SYCL runtime to support tracing. The next section talks about the XPTI framework that allows applications and runtimes to efficiently capture, record and emit trace notifications for important events during the run.

Documentation of SYCL tracepoints

XPTI Stream Domain

Traces belong to a named stream, and each stream constitutes a domain of data. The XPTI framework allows the instrumentation logic to define a stream and associate traces with it. A stream also defines the protocol to be observed to decipher the data at the receiving end. The XPTI API defines the notion of a trace point, which includes an event, a trace point type and a notification.

  • The event consists of a payload that describes the event (source file, function name, line number and/or a code pointer), a unique_id that identifies the event, a global user data field and a mechanism to record metadata associated with the event. The unique_id is generated from the payload, so if the trace point is visited multiple times, it yields the same unique_id; this allows us to determine the number of instances of a trace point.

  • The trace point type defines the type of notification that is being emitted for the trace point. There are many commonly occurring trace point types that are predefined by the framework, but a stream can extend this set by the extension APIs provided. A subscriber must explicitly register a callback for each trace point type that is of interest to the subscriber. If no subscribers are registered for a stream or a trace point type, then traces will not be emitted. A given trace point event may be used to emit multiple traces to different trace point types.

  • The notification emits the trace to all subscribers of the stream domain that have a callback registered for the given trace point type. The stream can attach per-instance user data during this notification call, which must be guaranteed to be valid for the duration of the notification call.

This document will outline the protocol for the streams of data being generated by the SYCL runtime.

SYCL Stream "sycl.pi" Notification Signatures

Each trace point below lists the parameters sent with the notification, followed by the metadata attached to the trace event.

function_begin

  • trace_type: xpti::trace_point_type_t::function_begin that marks the beginning of a function
  • parent: Event ID created for all functions in the sycl.pi layer.
  • event: nullptr - since the stream of data just captures functions being called.
  • instance: Unique ID to allow the correlation of the function_begin event with the function_end event.
  • user_data: Name of the function being called, sent in as const char *
  • Metadata: None

function_end

  • trace_type: xpti::trace_point_type_t::function_end that marks the end of a function
  • parent: Event ID created for all functions in the sycl.pi layer.
  • event: nullptr - since the stream of data just captures functions being called.
  • instance: Unique ID to allow the correlation of the function_begin event with the function_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_begin.
  • user_data: Name of the function being called, sent in as const char *
  • Metadata: None
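Because function_begin and function_end for one call share an instance ID, a subscriber can pair them even when calls nest or interleave. The collector below is a subscriber-side sketch (the event plumbing and PI function names are illustrative, not the real callback signatures):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Subscriber-side sketch: match function_begin/function_end events via
// their shared instance ID, as described in the table above. The event
// delivery mechanism here is invented for the demo.
struct Collector {
  std::map<uint64_t, std::string> open_; // instance -> function name
  std::vector<std::string> completed_;   // names of matched calls

  void on_begin(uint64_t instance, const char *fn) { open_[instance] = fn; }
  void on_end(uint64_t instance) {
    auto it = open_.find(instance);
    if (it == open_.end())
      return; // unmatched end: ignore defensively
    completed_.push_back(it->second);
    open_.erase(it);
  }
};
```

A real tool would also record timestamps at begin/end to report per-call durations for the PI layer.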

SYCL Stream "sycl.pi.debug" Notification Signatures

function_with_args_begin

  • trace_type: xpti::trace_point_type_t::function_with_args_begin that marks the beginning of a function
  • parent: Event ID created for all functions in the sycl.pi.debug layer.
  • event: nullptr - since the stream of data just captures functions being called.
  • instance: Unique ID to allow the correlation of the function_with_args_begin event with the function_with_args_end event.
  • user_data: A pointer to a function_with_args_t object that includes the function ID, name, and arguments.
  • Metadata: None

function_with_args_end

  • trace_type: xpti::trace_point_type_t::function_with_args_end that marks the end of a function
  • parent: Event ID created for all functions in the sycl.pi.debug layer.
  • event: nullptr - since the stream of data just captures functions being called.
  • instance: Unique ID to allow the correlation of the function_with_args_begin event with the function_with_args_end event. This value is guaranteed to be the same value received by the trace event for the corresponding function_with_args_begin.
  • user_data: A pointer to a function_with_args_t object that includes the function ID, name, arguments, and return value.
  • Metadata: None

SYCL Stream "sycl" Notification Signatures

All trace point types in this stream provide semantic information about the graph, nodes and edges and the topology of the asynchronous task graphs created by the runtime.

graph_create

  • trace_type: xpti::trace_point_type_t::graph_create that marks the creation of an asynchronous graph.
  • parent: nullptr
  • event: The global asynchronous graph object ID. All other graph-related events, such as node and edge creation, will always use this ID as the parent ID.
  • instance: Unique ID related to the event, but not a correlation ID as there are other events to correlate to.
  • user_data: nullptr
  • Metadata: None

  The SYCL runtime will always have one instance of the graph object, with many disjoint subgraphs that get created during the execution of an application.

node_create

  • trace_type: xpti::trace_point_type_t::node_create that marks the creation of a node in the graph, which could be a computational kernel or memory operation.
  • parent: The global graph event that is created during the graph_create event.
  • event: The unique ID that identifies the data parallel compute operation or memory operation.
  • instance: Unique ID related to the event, but not a correlation ID as there are other events to correlate to.
  • user_data: Command type that has been submitted through the command group handler, which could be one of: command_group_node, memory_transfer_node, memory_allocation_node, sub_buffer_creation_node, memory_deallocation_node, host_acc_create_buffer_lock_node, host_acc_destroy_buffer_release_node, combined with the address of the command group object and represented as a string [const char *]
  • Metadata:
      - Computational kernels: sycl_device, kernel_name, from_source, sym_function_name, sym_source_file_name, sym_line_no
      - Memory operations: memory_object, offset, access_range, allocation_type, copy_from, copy_to

edge_create

  • trace_type: xpti::trace_point_type_t::edge_create that marks the creation of an edge between two nodes in the graph.
  • parent: The global graph event that is created during the graph_create event.
  • event: The unique ID that identifies the dependence relationship between two operations.
  • instance: Unique ID related to the event, but not a correlation ID as there are other events to correlate to.
  • user_data: nullptr
  • Metadata: access_mode, memory_object, event

  Edges capture dependence relationships between computations, or between computations and memory operations.

task_begin

  • trace_type: xpti::trace_point_type_t::task_begin that marks the beginning of a task belonging to one of the nodes in the graph. When the trace event is for a kernel executing on a device other than the CPU, this task_begin and the corresponding task_end mark the submit call. To track the execution of the kernel on the device, the signal trace event must be monitored to get the kernel event handle, from which the execution statistics can be gathered.
  • parent: The global graph event that is created during the graph_create event.
  • event: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph.
  • instance: Instance ID for the task that can be used to correlate it with the corresponding task_end trace event.
  • user_data: nullptr
  • Metadata: Same metadata defined for the node the trace task belongs to.

task_end

  • trace_type: xpti::trace_point_type_t::task_end that marks the end of a task belonging to one of the nodes in the graph. The specific task instance can be tracked through the instance ID parameter, which helps correlate the task_end with the corresponding task_begin.
  • parent: The global graph event that is created during the graph_create event.
  • event: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph.
  • instance: Instance ID for the task that can be used to correlate it with the corresponding task_begin trace event.
  • user_data: nullptr
  • Metadata: Same metadata defined for the node the trace task belongs to.

signal

  • trace_type: xpti::trace_point_type_t::signal that marks an event containing the event handle of an executing kernel on a device.
  • parent: The global graph event that is created during the graph_create event.
  • event: The event ID will reflect the ID of the computation or memory operation kernel, which would be one of the nodes in the graph.
  • instance: Instance ID for the task for which the signal has been generated.
  • user_data: Address of the kernel event that is returned by the device so the progress of the execution can be tracked.
  • Metadata: Same metadata defined for the node the trace task belongs to.

wait_begin (event.wait)

  • trace_type: xpti::trace_point_type_t::wait_begin that marks the beginning of a wait on an event
  • parent: nullptr
  • event: The event ID will reflect the ID of the command group object submission that created this event, or the graph event if the event is an external event.
  • instance: Unique ID to allow the correlation of the wait_begin event with the wait_end event.
  • user_data: String indicating event.wait and the address of the event, sent in as const char *
  • Metadata: None

  Tracing event.wait() or event.wait_and_throw() will capture the waiting on the action represented by the event object, which could be the execution of a kernel, the completion of a memory operation, etc.

wait_end (event.wait)

  • trace_type: xpti::trace_point_type_t::wait_end that marks the end of a wait on an event
  • parent: nullptr
  • event: The event ID will reflect the ID of the command group object submission that created this event, or the graph event if the event is an external event.
  • instance: Unique ID to allow the correlation of the wait_begin event with the wait_end event.
  • user_data: String indicating event.wait and the address of the event, sent in as const char *
  • Metadata: None

wait_begin (queue.wait)

  • trace_type: xpti::trace_point_type_t::wait_begin that marks the beginning of a wait on an event
  • parent: nullptr
  • event: The event ID will reflect the ID of the command group object submission that created this event, or a new event based on the combination of the string "queue.wait" and the address of the event.
  • instance: Unique ID to allow the correlation of the wait_begin event with the wait_end event.
  • user_data: String indicating queue.wait and the address of the event, sent in as const char *
  • Metadata: sycl_device, sym_function_name, sym_source_file_name, sym_line_no

  Tracing queue.wait() or queue.wait_and_throw() will capture the waiting on the action represented by the event object embedded in the command group handler, which could be the execution of a kernel, the completion of a memory operation, etc. All wait events contain metadata that indicates the SYCL device on which the corresponding operation has been submitted. If the event is from a command group handler, then the source location information is available as well.

wait_end (queue.wait)

  • trace_type: xpti::trace_point_type_t::wait_end that marks the end of a wait on an event
  • parent: nullptr
  • event: The event ID will reflect the ID of the command group object submission that created this event, or a new event based on the combination of the string "queue.wait" and the address of the event.
  • instance: Unique ID to allow the correlation of the wait_begin event with the wait_end event.
  • user_data: String indicating queue.wait and the address of the event, sent in as const char *
  • Metadata: sycl_device, sym_function_name, sym_source_file_name, sym_line_no

barrier_begin

  • trace_type: xpti::trace_point_type_t::barrier_begin that marks the beginning of a barrier while enqueuing a command group object
  • parent: The global graph event that is created during the graph_create event.
  • event: The event ID will reflect the ID of the command group object that has encountered a barrier during the enqueue operation.
  • instance: Unique ID to allow the correlation of the barrier_begin event with the barrier_end event.
  • user_data: String indicating enqueue.barrier and the reason for the barrier as a const char *. The reason for the barrier could be one of Buffer locked by host accessor, Blocked by host task or Unknown reason.
  • Metadata:
      - Computational kernels: sycl_device, kernel_name, from_source, sym_function_name, sym_source_file_name, sym_line_no
      - Memory operations: memory_object, offset, access_range, allocation_type, copy_from, copy_to

barrier_end

  • trace_type: xpti::trace_point_type_t::barrier_end that marks the end of the barrier that is encountered during enqueue.
  • parent: The global graph event that is created during the graph_create event.
  • event: The event ID will reflect the ID of the command group object that has encountered a barrier during the enqueue operation.
  • instance: Unique ID to allow the correlation of the barrier_begin event with the barrier_end event.
  • user_data: String indicating enqueue.barrier and the reason for the barrier as a const char *. The reason for the barrier could be one of Buffer locked by host accessor, Blocked by host task or Unknown reason.
  • Metadata:
      - Computational kernels: sycl_device, kernel_name, from_source, sym_function_name, sym_source_file_name, sym_line_no
      - Memory operations: memory_object, offset, access_range, allocation_type, copy_from, copy_to