Implementation design for “device_global”¶
This document describes the implementation design for the DPC++ extension sycl_ext_oneapi_device_global, which allows applications to declare global variables in device code.
Requirements¶
The extension specification document referenced above contains the full set of requirements for this feature, but some requirements that are particularly relevant to the design are called out here.
The first issue relates to the mechanism for integrating host and device code. Like specialization constants, device global variables are referenced in both host and device code, so they require some mechanism to correlate the variable instance in device code with the variable instance in host code. The API for referencing a device global variable, however, is different from the API for specialization constants. Whereas specialization constants are referenced through a templated member function:
sycl::specialization_id<int> spec_var;
void func(sycl::queue q) {
q.submit([&](sycl::handler &cgh) {
cgh.set_specialization_constant<spec_var>(42);
cgh.single_task([=](sycl::kernel_handler kh) {
int i = kh.get_specialization_constant<spec_var>();
});
});
}
Device global variables, by contrast, are referenced by their address:
sycl::ext::oneapi::experimental::device_global<int> dev_var;
void func(sycl::queue q) {
int val = 42;
q.copy(&val, dev_var).wait(); // The 'dev_var' parameter is by reference
q.submit([&](sycl::handler &cgh) {
cgh.single_task([=] {
int i = dev_var;
});
});
}
This is a key difference because the compiler does not statically know which device global variable is being referenced; we only know the address at runtime. As we will see later, this has a ramification on the integration headers and on the mechanism that connects instances of device global variables in host code with their corresponding instances in device code.
Another issue relates to the device_image_scope
property which can be applied
to a device global variable declaration. The intent of this property is to
allow a device global variable to be implemented directly on top of a SPIR-V
module scope global variable. When this property is not present, an
instance of a device global variable is shared across all device images that
are loaded onto a particular device. By contrast, when this property is
present, each device image has its own instance of the device global variable.
However, since multiple variable instances have confusing semantics, the API
requires the user to ensure that each such variable exists in exactly one
device image. The extension specification has more details on this property.
The important impact on the design, though, is that device global variables
declared with the device_image_scope
property have an implementation that is
quite different from device global variables that are not declared with this
property. The sections below describe both implementations.
Design¶
Changes to DPC++ headers¶
Class specializations based on device_image_scope
¶
The headers, of course, include the declaration of the new device_global
class, which is described in the extension specification. The declaration
of this class uses partial specialization to define the class differently
depending on whether it has the device_image_scope
property. When the
property is not present, the class has a member variable which is a pointer to
the underlying type. Member functions which return a reference to the value
(e.g. get
) return the value of this pointer:
template<typename T, typename PropertyListT>
class device_global {
T *usmptr;
public:
T& get() noexcept { return *usmptr; }
/* other member functions */
};
However, when the property is present, it has a member variable which is the type itself, and member functions return a reference to this value.
template<typename T, typename PropertyListT>
class device_global {
T val{};
public:
T& get() noexcept { return val; }
/* other member functions */
};
Note that the val
member has a default initializer that causes it to be
“value initialized”. Since the type T
is limited to types that are trivially
constructible, this means that val
will be zero initialized.
In both cases the member variable (either usmptr
or val
) must be the first
member variable in the class. As we will see later, the runtime assumes that
the address of the device_global
variable itself is the same as the address
of this member variable.
Attributes attached to the class¶
The device_global
class declaration contains three C++ attributes which
convey information to the front-end. These attributes are only needed for the
device compiler, and the #ifdef __SYCL_DEVICE_ONLY__
allows the customer to
use another host compiler, even if it does not recognize these attributes.
Also note that these attributes are all in the __sycl_detail__
namespace, so
they are considered implementation details of DPC++. We do not intend to
support them as general attributes that customer code can use.
template <typename T, typename PropertyListT = property_list<>>
class device_global {/*...*/};
// Partial specialization to make PropertyListT visible as a parameter pack
// of properties.
template <typename T, typename ...Props>
class
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::add_ir_global_variable_attributes(
"sycl-device-global-size",
Props::meta_name...,
sizeof(T),
Props::meta_value...
)]]
[[__sycl_detail__::global_variable_allowed]]
[[__sycl_detail__::device_global]]
#endif
device_global<T, property_list<Props...>> {/*...*/};
The [[__sycl_detail__::add_ir_global_variable_attributes()]]
attribute is
used to convey information about the compile-time properties to the front-end,
and it is described more fully by the compile-time properties design
document. This attribute is also used for other classes that have properties,
so it is not specific to the device_global
class.
Note that the parameter list to
[[__sycl_detail__::add_ir_global_variable_attributes()]]
contains one
additional property named "sycl-device-global-size"
. The sycl-post-link
tool uses this property to distinguish device global variables from other
module scope variables, and the property tells the size of the underlying data
type of the device global variable.
The [[__sycl_detail__::global_variable_allowed]]
attribute informs the
front-end that global variables of this type are allowed to be referenced in
device code. By default, the front-end diagnoses an error if device code
references a global variable unless the variable is constexpr
or const
and
constant initialized. However, the presence of this attribute informs the
front-end that variables of this type are an exception to this rule, so the
front-end does not diagnose an error when device code references a
device_global
variable. This attribute could also be used by other types,
so it is also not specific to the device_global
class.
NOTE: The implementation of device-side asserts recently introduced a
new C++ attribute sycl_global_var
for a similar purpose. The design for
device global variables cannot use that attribute because sycl_global_var
is
intended to be specified on the variable definition (not the type declaration),
and we do not want to force users to add an attribute to each definition of a
device_global
variable. However, the implementation of device-side asserts
could be changed to use [[__sycl_detail__::global_variable_allowed]]
. We
could then remove the support for sycl_global_var
.
The last attribute [[__sycl_detail__::device_global]]
controls error
reporting for variables declared of this type. The device global extension
specification places restrictions on where a device_global
variable can be
declared. Rather than have the front-end recognize the name of the
device_global
type, the front-end uses this attribute to know which
restrictions to enforce for this type.
Declaration of member functions to copy device global variables¶
The headers are also updated to add the new copy()
and memcpy()
member
functions to handler
and queue
which copy data to or from a device global
variable. These declarations use SFINAE such that they are conditionally
available depending on the host_access
property.
Changes to the DPC++ front-end¶
There are several changes to the device compiler front-end:
The front-end adds a new LLVM IR attribute
sycl-unique-id
to the definition of eachdevice_global
variable, which provides a unique string identifier for each device global variable. The rules for creating this string are the same as__builtin_sycl_unique_stable_id
, so the front-end can use the same algorithm when generating the string.The front-end checks for restrictions on variable declarations using the
device_global
type. As described above, the front-end uses the[[__sycl_detail__::device_global]]
attribute (rather than the class name) to know which set of restrictions to check. The restrictions specific to device global variables are documented in the extension specification.The front-end avoids diagnosing an error when variables of type
device_global
are referenced in device code because the type is decorated with the[[__sycl_detail__::global_variable_allowed]]
attribute.The front-end generates new content in both the integration header and the integration footer, which is described in more detail below.
Handling shadowed device global variables¶
The example above shows a simple case where the user’s device global variables
can all be uniquely referenced via fully qualified lookup (e.g.
::inner::Bar
). However, it is possible for users to construct applications
where this is not the case, for example:
sycl::device_global<int> FuBar;
namespace {
sycl::device_global<int> FuBar;
}
In this example, the FuBar
variable in the global namespace shadows a
variable with the same name in the unnamed namespace. The integration footer
can reference the variable in the global namespace as ::FuBar
, but there is
no way to reference the variable in the unnamed namespace using fully qualified
lookup.
Such programs are still legal, though. The integration footer can support cases like this by defining a shim function that returns a reference to the shadowed device global:
namespace {
namespace __sycl_detail {
static constexpr decltype(FuBar) &__shim_1() {
return FuBar; // References 'FuBar' in the unnamed namespace
}
} // namespace __sycl_detail
} // namespace (unnamed)
namespace sycl::detail {
__sycl_device_global_registration::__sycl_device_global_registration() noexcept {
device_global_map::add(&::FuBar,
/* same string returned from __builtin_sycl_unique_stable_id(::FuBar) */);
device_global_map::add(&::__sycl_detail::__shim_1(),
/* same string returned from __builtin_sycl_unique_stable_id(::(unnamed)::FuBar) */);
}
} // namespace sycl::detail
The __shim_1()
function is defined in the same namespace as the second
FuBar
device global, so it can reference the variable through unqualified
name lookup. Furthermore, the name of the shim function is globally unique, so
it is guaranteed not to be shadowed by any other name in the translation unit.
This problem with variable shadowing is also a problem for the integration
footer we use for specialization constants. See the specialization constant
design document for more details on this topic.
Changes to the DPC++ driver¶
A new command line argument, --device-globals
must be passed to the
sycl-post-link
tool to enable processing device global variables.
Changes to the sycl-post-link
tool¶
The sycl-post-link
tool performs its normal algorithm to identify the set of
kernels and device functions that are bundled together into each module. Once
it identifies the functions in each module, it scans those functions looking
for references to global variables that are decorated with the LLVM IR
attribute "sycl-device-global-size"
(these are the variables of type
device_global
). If any device global variable decorated with the LLVM IR
attribute corresponding to the device_image_scope
property appears in more
than one module, the sycl-post-link
tool issues an error diagnostic:
error: device_global variable <name> with property "device_image_scope"
is used in more than one device image.
Assuming that no error diagnostic is issued, the sycl-post-link
tool includes
the IR definition of each device_global
variable in the modules that
reference that variable.
As described in the design for compile-time properties, the
sycl-post-link
tool is responsible for generating idiomatic LLVM IR for any
compile-time properties that need to be generated in SPIR-V.
The HostAccessINTEL decoration is required for all device global variables
because it provides the name that the DPC++ runtime uses to access the
variable. Therefore, the sycl-post-link
tool always generates idiomatic LLVM
IR for this decoration. The first SPIR-V operand is set according to the
host_access
property (or set to Read/Write if the device global doesn’t
have that property). The second SPIR-V operation is set to the value of the
device global’s sycl-unique-id
.
The sycl-post-link
tool also generates idiomatic LLVM IR for the
InitModeINTEL decoration (if the device global has the init_mode
property) and for the ImplementInCSRINTEL decoration (if the device global
has the implement_in_csr
property). See the
SPV_INTEL_global_variable_decorations SPIR-V extension for details
about all of these decorations.
The sycl-post-link
tool also create a “SYCL/device globals” property set for
each device code module that contains at least one device global variable.
New “SYCL/device globals” property set¶
Each device code module that references one or more device global variables
has an associated “SYCL/device globals” property set. The name of each
property in this set is the sycl-unique-id
string of a device_global
variable that is contained by the module. The value of each property has
property type PI_PROPERTY_TYPE_BYTE_ARRAY
and contains a structure with the
following fields:
struct {
uint32_t size;
uint8_t device_image_scope;
};
The size
field contains the size (in bytes) of the underlying type T
of the
device global variable. The sycl-post-link
tool gets this value from the
LLVM IR attribute "sycl-device-global-size"
.
The device_image_scope
field is either 1
(true) or 0
(false), telling
whether the device global variable was declared with the device_image_scope
property.
Changes to the DPC++ runtime¶
Several changes are needed to the DPC++ runtime
As noted in the requirements section, an instance of a device global variable that does not have the
device_image_scope
property is shared by all device images on a device. To satisfy this requirement, the device global variable contains a pointer to a buffer allocated from USM device memory, and the content of the variable is stored in this buffer. All device images on a particular device point to the same buffer, so the variable’s state is shared. The runtime, therefore, must allocate this USM buffer for each such device global variable.As we noted above, the front-end generates new content in the integration footer which calls the function
sycl::detail::device_global_map::add()
. The runtime defines this function and maintains information about all the device global variables in the application. This information includes:The host address of the variable.
The string which uniquely identifies the variable.
The size (in bytes) of the underlying
T
type for the variable.A boolean telling whether the variable is decorated with the
device_image_scope
property.The associated per-device USM buffer pointer, if this variable does not have the
device_image_scope
property.
We refer to this information as the “device global database” below.
The runtime also implements the new
copy
andmemcpy
functions in thequeue
andhandler
classes which copy to or fromdevice_global
variables.
Initializing the device global variables in device code¶
When a DPC++ application submits a kernel, the runtime constructs a
pi_program
containing this kernel that is compiled for the target device, if
such a pi_program
does not yet exist. If the kernel resides in a device code
module that calls into a shared library, the runtime identifies a set of device
code modules that need to be online-linked together in order to construct the
pi_program
.
After creating a pi_program
and before invoking any kernel it contains, the
runtime does the following:
Scan the entries of the “SYCL/device globals” property sets of each device code module that contributes to the the
pi_program
to get information about each device global variable that is used by thepi_program
. This information is added to device global database.For each device global variable that is not decorated with the
device_image_scope
property:Check if a USM buffer has already been created for the variable on this target device. If not, the runtime allocates the buffer from USM device memory using the size from the database and zero-initializes the content of the buffer. The pointer to this buffer is saved in the database for future reuse.
Regardless of whether the USM buffer has already been created for the variable, the runtime initializes the
usmptr
member in the device instance of the variable by using a new PI interface which copies data from the host to a global variable in api_program
. It is a simple matter to use this interface to overwrite theusmptr
member with the address of the USM buffer.
Note that the runtime does not need to initialize the val
member variable of
device global variables that are decorated with device_image_scope
because
the val
default initializer already guarantees that this member variable is
zero initialized.
Implementing the copy
and memcpy
functions in queue
and handler
¶
Each of these functions accepts a (host) pointer to a device global variable as one of its parameters, and the runtime uses this pointer to find the associated information for this variable in the device global database. The code in the integration footer ensures that the database will at least contain the address of the variable and its unique string, even if no kernel referencing this variable has been submitted yet.
Each of these functions is templated on the variable’s underlying type T
, so
it knows the size of this type. Each function is also templated on the
variable’s property list, so it knows whether the variable has the
device_image_scope
property.
The remaining behavior depends on whether the variable is decorated with the
device_image_scope
property.
If the variable is not decorated with this property, the runtime uses the
database to determine if a USM buffer has been allocated yet for this variable
on this device. If not, the runtime allocates the buffer using sizeof(T)
and zero-initializes the buffer. Regardless, the runtime implements the copy
/ memcpy
function by copying to or from this USM buffer, using the normal
mechanism for copying to / from a USM pointer.
The runtime avoids the future cost of looking up the variable in the database
by caching the USM pointer in the host instance of the variable’s usmptr
member.
If the variable is decorated with the device_image_scope
property, the
variable’s value exists directly in the device code module, not in a USM
buffer. The runtime first uses the variable’s unique string identifier to see
if there is a pi_program
that contains the variable. If there is more than
one such pi_program
, the runtime diagnoses an error by throwing
errc::invalid
. If there is no such pi_program
, the runtime scans all
“SYCL/device globals” property sets to find the device code module that
contains this variable and uses its normal mechanism for creating a
pi_program
from this device code module. (The algorithm for creating device
code modules in the sycl-post-link
tool ensures that there will be no more
than one module that contains the variable.) Finally, the runtime uses the
new PI interface to copy to or from the contents of the variable in this
pi_program
.
It is possible that a device global variable with device_image_scope
is not
referenced by any kernel, in which case the variable’s unique string will not
exist in any property set. In this case, the runtime simply uses the host
instance of the device_global
variable to hold the value and copies to or
from the val
member.
In all cases, the runtime uses sizeof(T)
to determine if the copy operation
will read or write beyond the end of the device global variable’s storage. If
so, the runtime diagnoses an error by throwing errc::invalid
.
New PI interface to copy to or from a module scope variable¶
As noted above, we need new PI interfaces that can copy data to or from an
instance of a device global variable in a pi_program
. This functionality is
exposed as two new PI interfaces:
pi_result piextEnqueueDeviceGlobalVariableRead(
pi_queue Queue, pi_program Program, const char *Name, pi_bool BlockingRead,
size_t Count, size_t Offset, void *Dst, pi_uint32 NumEventsInWaitList,
const pi_event *EventsWaitList, pi_event *Event);
pi_result piextEnqueueDeviceGlobalVariableWrite(
pi_queue Queue, pi_program Program, const char *Name, pi_bool BlockingWrite,
size_t Count, size_t Offset, const void *Src, pi_uint32 NumEventsInWaitList,
const pi_event *EventsWaitList, pi_event *Event);
The piextEnqueueDeviceGlobalVariableRead
function reads Count
bytes at
byte-offset Offset
from a device global variable in Program
identified by
the name Name
. The read data is stored in Dst
. Likewise, the
piextEnqueueDeviceGlobalVariableWrite
function reads Count
bytes from Dst
and stores them at byte-offset Offset
in the device global variable in
Program
identified by the name Name
.
Both functions will enqueue the associated memory command on Queue
where it
will first wait for NumEventsInWaitList
events in EventsWaitList
to finish.
Event
will be populated with the event associated with resulting enqueued
command. If either BlockingRead
or BlockingWrite
is true
the call will
block on the host until the enqueued command finishes execution.
For device_global
variables the Name
parameter in calls to these functions
is the same as the associated sycl-unique-id
string.
The Level Zero backend has existing APIs that can implement these PI
interfaces. The plugin first calls zeModuleGetGlobalPointer()
to get a
device pointer for the variable and then calls
zeCommandListAppendMemoryCopy()
to copy to or from that pointer.
However, the documentation (and implementation) of zeModuleGetGlobalPointer()
needs to be extended slightly. The description currently says:
The application may query global pointer from any module that either exports or imports it.
The application must dynamically link a module that imports a global before the global pointer can be queried from it.
This must be changed to say something along these lines:
The interpretation of
pGlobalName
depends on how the module was created. If the module was created from SPIR-V that declares the GlobalVariableDecorationsINTEL capability, the implementation looks first for an OpVariable that is decorated with HostAccessINTEL where the Name operand is the same aspGlobalName
. If no such variable is found, the implementation then looks for an OpVariable that is decorated with LinkageAttributes where the Name operand is the same aspGlobalName
. (The implementation considers both exported and imported variables as candidates.)If the module was created from native code that came from a previous call to
zeModuleGetNativeBinary
and that other module was created from SPIR-V, then the interpretation ofpGlobalName
is the same as the SPIR-V case.If
pGlobalName
identifies an imported SPIR-V variable, the module must be dynamically linked before the variable’s pointer may be queried.
The OpenCL backend has a proposed extension
cl_intel_global_variable_access
that defines functions
clEnqueueReadGlobalVariableINTEL()
and clEnqueueWriteGlobalVariableINTEL()
which can be easily used to implement these PI interfaces. This DPC++ design
depends upon implementation of that OpenCL extension.
The CUDA backend has existing APIs cuModuleGetGlobal()
and cuMemcpyAsync()
which can be used to implement these PI interfaces.
Design choices¶
This section captures some of the discussions about aspects of the design.
Should the value be zero-initialized¶
There was some debate about whether the value in the device_global
should
always be zero-initialized. We decided to require this in order to be
consistent with C++ rules for global variables. We want device_global
to
model the normal rules for global variables. Since C++ guarantees that a
global variable with a trivial constructor is zero-initialized, we want that
behavior too.
The downside is that some applications may allocate a very large storage for
the underlying type T
of a device global variable, and they may not want to
pay the cost of zero initializing it. We agree that this is a theoretical
problem, but we aren’t sure if this will be an issue for real applications. If
it turns out to be a real problem, we propose adding a new property that
prevents initialization of the device global value. For example, we could add
a new parameter to the init_mode
property called none
.
Why not include both val
and usmptr
member variables¶
Rather than using partial specialization to define device_global
differently
based on the device_image_scope
property, we could instead define both member
variables regardless of the properties. This would make the header file
implementation easier, but it would lead to wasted space in the case when the
device_image_scope
property was not specified since the val
member is
unused in this case. Wasting space on the host may not be such a big problem,
but the space would also be wasted on every device that reference the device
global variable, and this seems like a bigger problem. We decided that the
extra header file complexity of partial specialization is worth avoiding this
wasted memory.