Proposed design for offloading model¶
Introduction¶
This document covers the proposed design for using the new offloading model for the DPC++ Compiler. It leverages the existing community offloading design (OffloadingDesign), which covers the Clang driver and code generation steps for creating offloading applications.
The current offloading model is completely encapsulated within the Clang compiler driver, requiring the driver to perform all of the additional steps for generating the host and device compilation passes. The driver is also responsible for initiating any link-time processing that occurs for each device target.
The updated offloading model removes much of the functionality that is performed during the link phase of the offloading compilation from the driver and moves it to a clang-linker-wrapper tool.
Below is a general representation of the overall offloading flow that is performed during a full compilation from source to final executable. The compiler driver is responsible for creating the fat object, and the clang-linker-wrapper tool is responsible for the general functionality that is performed during the link.
Diagram 1: General Offload Flow
Fat Binary Generation¶
The generation of the fat binary will be controlled by the driver. In the community model, the fat binary is generated as a secondary process when creating the host object: the host compilation takes an additional argument which points to the device binary to be embedded in the final object. In the proposed model, generation will be separated from the host compilation step. This is being done to enable proper support for using an external host compiler as well as to take advantage of potential parallelism during compilation of both the host and target device binaries.
Diagram 2: Fat Object Generation
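For illustration, a fat object under the new model might be produced with a single driver invocation such as the following; this is a sketch using a hypothetical source file, not a definitive command line:
$ clang++ -fsycl --offload-new-driver -c foo.cpp -o foo.o // fat object containing host code and the embedded device binary
Example: Fat object generation with the new driver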
Packager¶
When dealing with multiple device binaries, an additional step is performed to package the multiple device binaries before being added to the host object. This additional step is performed with the clang-offload-packager, taking image inputs containing information relating to the target triple, architecture setting and offloading kind.
The clang-offload-packager is run during 'fat object' generation regardless of the number of device binaries being added to the conglomerate fat object.
The device binaries are contained in what is designated as an 'Offload Binary'. These binaries can reside in a variety of binary formats including Bitcode files, ELF objects, executables and shared objects, COFF objects, archives, or simply stored as an offload binary.
We should have the ability to package SPIR-V based device binaries in the offload section of any given binary. These device binaries will be packaged as normal with the packager and placed within the given section.
Example usage of the external clang-offload-packager call:
clang-offload-packager --image=file=<name>,triple=<triple>,kind=<kind>
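When multiple device binaries are packaged together, multiple --image entries can be passed in a single call. The following is a hedged sketch; the file names, triples and output option are illustrative:
clang-offload-packager -o packaged.out --image=file=a.bc,triple=spir64-unknown-unknown,kind=sycl --image=file=b.bc,triple=spir64_gen-unknown-unknown,kind=sycl
Example: packaging multiple device binaries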
Clang Linker Wrapper¶
The clang-linker-wrapper provides the interface to perform the needed link steps when consuming fat binaries. The linker wrapper performs a majority of the work involved during the link step of an offload compilation, significantly reducing the amount of work that is occurring in the compiler driver. From the compilation perspective, the linker wrapper replaces the typical call to the host link. This allows the responsibility of the compiler driver to be nearly identical when performing a regular compilation vs an offloading compilation.
From a high level, using the clang-linker-wrapper provides a few benefits:
- Moves all of the device linking responsibility out of the compiler driver.
- Allows for a more direct ability to perform linking for offloading without requiring the use of the driver, using more linker-like calls.
- Provides additional flexibility with the ability to dynamically modify the toolchain execution.
Example usage of the external clang-linker-wrapper call:
clang-linker-wrapper <wrapper opts> -- <linker opts>
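As a hedged illustration of how the wrapper stands in for the host link call; the wrapper option and file names below are purely illustrative:
clang-linker-wrapper --gpu-tool-arg="-device skl" -- main_fat.o -lsycl -o app.out
Example: clang-linker-wrapper replacing the host link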
Device Extraction¶
During the compilation step, the device binaries are embedded in a section of the host binary. When performing the link, this section is extracted from the object and mapped according to the device kind. The clang-linker-wrapper is responsible for examining all of the input binaries, grabbing the embedded device binaries and determining any additional device linking paths that need to be taken.
For example, when an embedded device binary is of the OFK_SYCL kind and of the spir64_gen architecture triple, the resulting extracted binary is linked, post-link processed and converted to SPIR-V before being passed to ocloc to generate the final device binary. Options passed via --gpu-tool-arg= will be applied to the ocloc step as well.
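A hedged sketch of that sequence for a single spir64_gen image follows; the file names are hypothetical and the individual tool options are illustrative rather than exact:
llvm-link device_extracted.bc -o device_linked.bc // device link
sycl-post-link -split=auto -o post_link.table device_linked.bc // post-link processing
llvm-spirv post_link_0.bc -o device.spv // LLVM-IR to SPIR-V translation
ocloc compile -file device.spv -spirv_input -device skl -output device_final // AOT step, with --gpu-tool-arg options applied
Example: device handling sequence for spir64_gen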
Binaries generated during the offload compilation will be 'bundled' together to create a conglomerate fat binary. Depending on the type of binary, the device binary is embedded as follows:
Objects¶
Object types considered are COFF objects, ELF executables, ELF relocatable objects and ELF shared objects. The device code resides in a section marked by .llvm.offloading for COFF objects. For ELF files, there is a section marked with the LLVM_OFFLOADING type.
Bitcode¶
The section which contains the offloading data is the .llvm.offloading section within the llvm.embedded.object metadata.
Offload Binary¶
The binary itself can be represented by just an offload binary, without needing to reside in a section of another binary. This representation is used for any kind of device-only binary that is created. The device-only binaries are represented by the file generated from the Packager.
Archives¶
Each item in the archive will be extracted and individually checked for its file type, with extraction performed according to the file types already listed above.
Support of existing fat object format¶
The current model generates fat objects using the clang-offload-bundler tool. This tool creates a binary that contains individual sections labeled with __CLANG_OFFLOAD_BUNDLE__kind-triple. This format does not match the new format as described above. To ease the transition from the old model to the new model, support will be added for identifying and extracting device binaries that use the old model.
It is expected that all new binaries generated with the updated offloading model will use the embedded fat object format, moving away from the clang-offload-bundler usage. We will not support a mixing of fat object formats, such as objects created with the bundler combined with embedded objects in a single archive, or objects created with an embedded offload section and additional sections generated with the bundler.
Device Linking¶
During this phase, all of the individual device binaries that are extracted and associated with a given target are processed according to the type of binary being worked with. The default device code is typically represented in LLVM-IR, which requires an additional link step of the device code before being wrapped and integrated into the final executable. As mentioned in Packager, device representation in SPIR-V should be considered, with the ability to link native SPIR-V modules. The device link against LLVM-IR is performed by llvm-link.
To reduce the potential size of the linked device binary, an additional host link step is performed to gather dependency information when static device libraries are being compiled. This information is sent through the clang-offload-deps tool to generate a dependency IR file which is used during the device link step.
The use of the clang-linker-wrapper introduces support of LTO for device code. We can leverage this to move away from the dependency gathering step with clang-offload-deps and use ThinLTO for device code.
There are multiple device linking steps that occur. The first step links together all of the objects and the required device libraries. The second is performed including all of the optional device libraries, the static device libraries and the dependency information that was gathered above. This link step is performed with --only-needed to streamline the final device binary.
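For illustration only (the file names are hypothetical), the second link might resemble:
llvm-link --only-needed device_linked.bc optional_devicelib.bc deps.bc -o device_final.bc
Example: device link using --only-needed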
A list of device libraries that need to be linked in with user code is provided by the driver. The driver is also responsible for letting the clang-linker-wrapper know the location of the device libraries.
| Option | Expected Behavior |
|---|---|
|  | A comma separated list of device libraries that are linked during the device link |
|  | The location in which the device libraries reside |

Table: Options to pass device libraries to the clang-linker-wrapper
The device libraries are controlled via the -fno-sycl-device-lib=<arg> option, where the driver determines based on this option which libraries to tell the linker wrapper to pull in.
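A hedged example of excluding a device library through the driver option; the library name follows the existing -fno-sycl-device-lib values and is assumed here:
$ clang++ -fsycl --offload-new-driver -fno-sycl-device-lib=libm-fp64 foo.cpp // exclude the fp64 math device library from the device link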
Post-link and SPIR-V translation¶
After the device binaries are linked together, two additional steps are performed to prepare the device binary for consumption by an offline compilation tool for AOT or to be wrapped for JIT processing.
The sycl-post-link tool is used after the device link is performed, applying any changes such as optimizations and code splitting before passing off to the llvm-spirv tool, which translates the LLVM-IR to SPIR-V.
| Option | Expected Behavior |
|---|---|
| --sycl-post-link-options=<arg> | Options that will control the sycl-post-link step |
| --llvm-spirv-options=<arg> | Options that will control the llvm-spirv step |

Table: Options to pass sycl-post-link and llvm-spirv options to the clang-linker-wrapper
Options that will be used by clang-linker-wrapper when invoking the sycl-post-link tool are provided by the driver via the --sycl-post-link-options=<arg> option. Options that will be used by clang-linker-wrapper when invoking the llvm-spirv tool are provided by the driver via the --llvm-spirv-options=<arg> option.
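A hedged example of a wrapper invocation carrying both option sets; the option values shown are illustrative only:
clang-linker-wrapper --sycl-post-link-options="-split=auto -symbols" --llvm-spirv-options="-spirv-max-version=1.4" -- <linker opts>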
Ahead Of Time Compilation¶
The updated offloading model will integrate the Ahead of Time (AOT) compilation behaviors into the clang-linker-wrapper. The actions will typically take place after the device link, post link, and LLVM-IR to SPIR-V translation steps.
Regardless of the AOT target, the flow is similar, only modifying the offline compiler that is used to create the target device image. It is expected that the offline compiler will also use unique command lines specific to the tool to create the image.
To support the option passing triggered by use of the -Xsycl-target-backend option, as well as the implied options based on the optional device behaviors for GPU AOT compilations, new command line interfaces are needed to pass along this information.
| Target | Triple | Offline Tool | Option for Additional Args |
|---|---|---|---|
| CPU | spir64_x86_64 | opencl-aot | --cpu-tool-arg=<arg> |
| GPU | spir64_gen | ocloc | --gpu-tool-arg=<arg> |
| FPGA | spir64_fpga | aoc/opencl-aot | --fpga-tool-arg=<arg> |

Table: Ahead of Time Info
To complete the support needed for the various targets using the clang-linker-wrapper as the main interface, a few additional options will be needed to communicate from the driver to the tool. Further details of usage are given below.
| Option Name | Purpose |
|---|---|
| --fpga-link=<type> | Tells the link step to perform 'early' or 'image' processing to create archives for FPGA |
|  | Provide the number of parallel jobs that will be used when processing split jobs |

Table: Additional Options for clang-linker-wrapper
The clang-linker-wrapper provides an existing option named -wrapper-jobs that may be useful for our usage.
spir64_gen support¶
Compilation behaviors involving AOT for GPU involve an additional call to the OpenCL Offline Compiler (OCLOC). This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step which is done by llvm-spirv. Additional options passed by the user via the -Xsycl-target-backend=spir64_gen <opts> command, as well as the implied options set via target options such as -fsycl-targets=intel_gpu_skl, will be processed by a new option to the wrapper, --gpu-tool-arg=<arg>.
To support multiple target specifications, for instance -fsycl-targets=intel_gpu_skl,intel_gpu_pvc, multiple --gpu-tool-arg options can be passed on the command line. Each instance will be considered a separate OCLOC call, passing along the <args> as options to the OCLOC call. The compiler driver will be responsible for putting together the full option list to be passed along.
-fsycl -fsycl-targets=spir64_gen,intel_gpu_skl -Xsycl-target-backend=spir64_gen "-device pvc -options -extraopt_pvc" -Xsycl-target-backend=intel_gpu_skl "-options -extraopt_skl"
Example: spir64_gen enabling options
--gpu-tool-arg="-device pvc -options -extraopt_pvc" --gpu-tool-arg="-options -extraopt_skl"
Example: clang-linker-wrapper options
Each OCLOC call will be represented as a separate device binary that is individually wrapped and linked into the final executable.
Additionally, the syntax can be expanded to enable the ability to pass specific options to a specific GPU device target for spir64_gen. The syntax will resemble --gpu-tool-arg=<arch> <arg>. This corresponds to the existing option syntax of -fsycl-targets=intel_gpu_arch, where arch can be a fixed set of targets.
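A hedged example of this expanded per-architecture syntax, mirroring the command line above:
--gpu-tool-arg=pvc "-options -extraopt_pvc" --gpu-tool-arg=skl "-options -extraopt_skl"
Example: per-architecture clang-linker-wrapper options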
--offload-arch¶
For SYCL offloading to Intel GPUs, Intel CPUs, NVidia GPUs and AMD GPUs, specify the device architecture using the --offload-arch option. For instance, use --offload-arch=sm_80 to target an NVidia Tesla A100, --offload-arch=gfx90a to target an AMD Instinct MI250X, or --offload-arch=sm_80,gfx90a to target both.
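A hedged example targeting both devices with the new driver; this assumes the corresponding CUDA and ROCm toolchains are available:
$ clang++ -fsycl --offload-new-driver --offload-arch=sm_80,gfx90a -c foo.cpp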
For the Intel Graphics AOT target, valid values for --offload-arch are mapped to valid device names accepted by OCLOC (the Intel GPU AOT compiler) via the -device option.
SYCL offloading with --offload-arch for Intel CPUs and Intel GPUs is currently enabled only with the --offload-new-driver option.
Example:
$ clang++ -fsycl --offload-arch=bdw --offload-new-driver -c foo.cpp // SYCL AOT for Intel GPU.
$ clang++ -fsycl --offload-arch=broadwell --offload-new-driver -c foo.cpp // SYCL AOT for Intel CPU.
The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel GPUs and the corresponding -device value passed to OCLOC.
| Intel GPU device | --offload-arch value | OCLOC -device value |
|---|---|---|
| Intel(R) microarchitecture code name Broadwell Intel graphics architecture | bdw | bdw |
| Intel(R) microarchitecture code name Skylake Intel graphics architecture | skl | skl |
| Kaby Lake Intel graphics architecture | kbl | kbl |
| Coffee Lake Intel graphics architecture | cfl | cfl |
| Apollo Lake Intel graphics architecture | apl | apl |
| Broxton Intel graphics architecture | bxt | apl |
| Gemini Lake Intel graphics architecture | glk | glk |
| Whiskey Lake Intel graphics architecture | whl | whl |
| Amber Lake Intel graphics architecture | aml | aml |
| Comet Lake Intel graphics architecture | cml | cml |
| Ice Lake Intel graphics architecture | icl, icllp | icllp |
| Elkhart Lake Intel graphics architecture | ehl | ehl |
| Jasper Lake Intel graphics architecture | jsl | jsl |
| Tiger Lake Intel graphics architecture | tgl, tgllp | tgllp |
| Rocket Lake Intel graphics architecture | rkl | rkl |
| Alder Lake S Intel graphics architecture | adl_s | adl_s |
| Raptor Lake Intel graphics architecture | rpl_s | adl_s |
| Alder Lake P Intel graphics architecture | adl_p | adl_p |
| Alder Lake N Intel graphics architecture | adl_n | adl_n |
| DG1 Intel graphics architecture | dg1 | dg1 |
| Alchemist G10 Intel graphics architecture | acm_g10, dg2_g10 | acm_g10 |
| Alchemist G11 Intel graphics architecture | acm_g11, dg2_g11 | acm_g11 |
| Alchemist G12 Intel graphics architecture | acm_g12, dg2_g12 | acm_g12 |
| Ponte Vecchio Intel graphics architecture | pvc | pvc |
| Ponte Vecchio VG Intel graphics architecture | pvc_vg | pvc_vg |
| Meteor Lake U/S or Arrow Lake U/S Intel graphics architecture | mtl_u, mtl_s, arl_u | mtl_s |
| Meteor Lake H Intel graphics architecture | mtl_h | mtl_h |
| Arrow Lake H Intel graphics architecture | arl_h | arl_h |
| Battlemage G21 Intel graphics architecture | bmg_g21 | bmg_g21 |
| Lunar Lake Intel graphics architecture | lnl_m | lnl_m |
nvptx64-nvidia-cuda support¶
For SYCL offloading to NVidia GPUs using the --offload-arch option, the following table lists the accepted values.
| NVidia GPU device name | --offload-arch value |
|---|---|
| NVIDIA Maxwell architecture (compute capability 5.0) | sm_50 |
| NVIDIA Maxwell architecture (compute capability 5.2) | sm_52 |
| NVIDIA Maxwell architecture (compute capability 5.3) | sm_53 |
| NVIDIA Pascal architecture (compute capability 6.0) | sm_60 |
| NVIDIA Pascal architecture (compute capability 6.1) | sm_61 |
| NVIDIA Pascal architecture (compute capability 6.2) | sm_62 |
| NVIDIA Volta architecture (compute capability 7.0) | sm_70 |
| NVIDIA Volta architecture (compute capability 7.2) | sm_72 |
| NVIDIA Turing architecture (compute capability 7.5) | sm_75 |
| NVIDIA Ampere architecture (compute capability 8.0) | sm_80 |
| NVIDIA Ampere architecture (compute capability 8.6) | sm_86 |
| NVIDIA Jetson/Drive AGX Orin architecture | sm_87 |
| NVIDIA Ada Lovelace architecture | sm_89 |
| NVIDIA Hopper architecture | sm_90 |
| NVIDIA Hopper architecture (with wgmma and setmaxnreg instructions) | sm_90a |
amdgcn-amd-amdhsa support¶
For SYCL offloading to AMD GPUs using the --offload-arch option, the following table lists the accepted values.
| AMD GPU device name | --offload-arch value |
|---|---|
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx700 |
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx701 |
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx702 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx801 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx802 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx803 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx805 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx810 |
| AMD GCN GFX9 (Vega) architecture | gfx900 |
| AMD GCN GFX9 (Vega) architecture | gfx902 |
| AMD GCN GFX9 (Vega) architecture | gfx904 |
| AMD GCN GFX9 (Vega) architecture | gfx906 |
| AMD GCN GFX9 (Vega) architecture | gfx908 |
| AMD GCN GFX9 (Vega) architecture | gfx909 |
| AMD GCN GFX9 (Vega) architecture | gfx90a |
| AMD GCN GFX9 (Vega) architecture | gfx90c |
| AMD GCN GFX9 (Vega) architecture | gfx940 |
| AMD GCN GFX9 (Vega) architecture | gfx941 |
| AMD GCN GFX9 (Vega) architecture | gfx942 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1010 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1011 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1012 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1013 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1030 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1031 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1032 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1033 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1034 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1035 |
| GCN GFX10.3 (RDNA 2) architecture | gfx1036 |
| GCN GFX11 (RDNA 3) architecture | gfx1100 |
| GCN GFX11 (RDNA 3) architecture | gfx1101 |
| GCN GFX11 (RDNA 3) architecture | gfx1102 |
| GCN GFX11 (RDNA 3) architecture | gfx1103 |
| GCN GFX11 (RDNA 3) architecture | gfx1150 |
| GCN GFX11 (RDNA 3) architecture | gfx1151 |
| GCN GFX12 (RDNA 4) architecture | gfx1200 |
| GCN GFX12 (RDNA 4) architecture | gfx1201 |
spir64_fpga support¶
Compilation behaviors involving AOT for FPGA involve an additional call to either aoc (for Hardware/Simulation) or opencl-aot (for Emulation). This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step performed by llvm-spirv. Additional options passed by the user via the -Xsycl-target-backend=spir64_fpga <opts> command will be processed by a new option to the wrapper, --fpga-tool-arg=<arg>.
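A hedged illustration of how a user option might map to the wrapper option; the value shown is illustrative:
-Xsycl-target-backend=spir64_fpga "-hardware" // user option on the driver command line
--fpga-tool-arg="-hardware" // corresponding option passed to clang-linker-wrapper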
The FPGA target also has support for additional generated binaries that contain intermediate files specific to FPGA. These binaries (aoco, aocr and aocx) can reside in archives and are treated differently than traditional device binaries.
Generation of the AOCR and AOCX type binary is triggered by the command line option -fsycl-link, where -fsycl-link=image creates AOCX archives and -fsycl-link=early generates AOCR archives. The files generated by these options are handled in a specific manner when encountered.
Any archive with an AOCR type device binary will have the AOCR binary extracted and passed to aoc to produce an AOCX final image. This final image is wrapped and added to the final binary during the host link. The use of -fsycl-link=image with an AOCR binary will create an AOCX based archive instead of completing the host link. Any archive with an AOCX type device binary skips the aoc step and is wrapped and added to the final binary during the host link. Archives with any AOCO device binaries are extracted and passed through to aoc -library-list=<listfile>.
As the clang-linker-wrapper is responsible for understanding the archives that are added on the command line, it will need to know when to look for these unique device binaries based on the expected compilation targets. The behavior of creating the AOCX/AOCR type archive will be triggered via an additional command line option specified by the driver when -fsycl-link options are used. The --fpga-link=<type> option will tell the wrapper when these handlings need to occur.
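As a hedged illustration of the mapping (the source file name is hypothetical):
$ clang++ -fsycl -fintelfpga -fsycl-link=early foo.cpp // driver invocation requesting an AOCR archive
--fpga-link=early // option the driver passes to clang-linker-wrapper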
When using the -fintelfpga option to enable AOT for FPGA, there are additional expectations during the compilation. Use of the option will enable debug information generation and also generate dependency information. The generated dependency information should be packaged along with the device binary for use during the link phase. It is expected that the full fat object, containing the host binary, the device binary and the dependency file, is generated before being passed to the link phase. The dependency information is only used when compiling for hardware.
The clang-linker-wrapper tool will be responsible for determining which FPGA tool is being used during the AOT device compilation phase. The use of -simulation or -hardware as passed in by --fpga-tool-arg signifies which tool is used.
spir64_x86_64 support¶
Compilation behaviors involving AOT for CPU involve an additional call to opencl-aot. This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step performed by llvm-spirv. Additional options passed by the user via the -Xsycl-target-backend=spir64_x86_64 <opts> command will be processed by a new option to the wrapper, --cpu-tool-arg=<arg>.
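A hedged example of the mapping; the -march value is illustrative and matches the opencl-aot values listed in the table below:
-Xsycl-target-backend=spir64_x86_64 "-march=avx2" // user option on the driver command line
--cpu-tool-arg="-march=avx2" // corresponding option passed to clang-linker-wrapper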
Similar to SYCL offloading to Intel GPUs using --offload-arch, SYCL AOT for Intel CPUs will also leverage the --offload-arch option. The valid CPU device names accepted for --offload-arch are CPU names from clang -march. These names are more verbose and do not overlap with the Intel GPU names. These user input CPU names are mapped to the corresponding opencl-aot -march option.
The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel CPUs and the corresponding -march value passed to opencl-aot.
| Intel CPU device | --offload-arch value | opencl-aot -march value |
|---|---|---|
| Intel(R) Advanced Vector Extensions 512 | skylake-avx512 | avx512 |
| Intel(R) Advanced Vector Extensions 2 | core-avx2 | avx2 |
| Intel(R) Advanced Vector Extensions | corei7-avx | avx |
| Intel(R) Streaming SIMD Extensions 4.2 | corei7 | sse4.2 |
| Intel(R) microarchitecture code name Westmere | westmere | wsm |
| Intel(R) microarchitecture code name Sandy Bridge | sandybridge | snb |
| Intel(R) microarchitecture code name Ivy Bridge | ivybridge | ivyb |
| Intel(R) microarchitecture code name Broadwell | broadwell | bdw |
| Intel(R) microarchitecture code name Coffee Lake | coffeelake | cfl |
| Intel(R) microarchitecture code name Alder Lake | alderlake | adl |
| Intel(R) microarchitecture code name Skylake (client) | skylake | skylake |
| Intel(R) microarchitecture code name Skylake (server) | skx | skx |
| Intel(R) microarchitecture code name Cascade Lake | cascadelake | clk |
| Intel(R) microarchitecture code name Ice Lake (client) | icelake-client | icl |
| Intel(R) microarchitecture code name Ice Lake (server) | icelake-server | icx |
| Intel(R) microarchitecture code name Sapphire Rapids | sapphirerapids | spr |
| Intel(R) microarchitecture code name Granite Rapids | graniterapids | gnr |
Wrapping of device image¶
Once the device binary is pulled out of the fat binary, it must be wrapped and provided the needed entry points to be used during execution. This is performed during the link phase and controlled by the clang-linker-wrapper. It is expected that the wrapping information generated around the device binary will match the wrapping information used in the existing offload model, which performs the wrapping with the clang-offload-wrapper tool.
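For reference, a hedged sketch of an old-model wrapping invocation; the options and file names here are illustrative rather than an exact interface:
clang-offload-wrapper -kind=sycl -target=spir64 -o wrapped.bc device.spv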
Integration of llvm-foreach¶
llvm-foreach is used frequently during the offloading process. The functionality was introduced due to the toolchain building limitations of the driver and the need to perform multiple compilations during a specific step of the compilation. For example, the support for device code splitting performed by the sycl-post-link tool requires consumers of the split code to work on each item individually. The driver toolchain does not know that this split occurs, or how many splits were performed; llvm-foreach hides this detail from the driver.
With the new model, the compilation sequence is built dynamically within the clang-linker-wrapper, allowing the use of llvm-foreach to become obsolete. The clang-linker-wrapper is then responsible for performing the needed calls for each individual work item, which can be done using a parallel for.
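For context, the following is a hedged sketch of how llvm-foreach is used in the current model to run a tool over each split output; the option names reflect the existing tool but are given here as an assumption, and the file names are hypothetical:
llvm-foreach --in-file-list=split_modules.txt --in-replace=INPUT --out-ext=spv --out-file-list=spv_modules.txt --out-replace=OUTPUT -- llvm-spirv INPUT -o OUTPUT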
Beyond llvm-foreach and similar job hiding tools¶
Tools like llvm-foreach, file-table-tform and spirv-to-ir-wrapper were all introduced to provide a way to manipulate behaviors that could only be determined at runtime of the compiler toolchain. These were needed to work around the fact that the toolchain commands constructed by the driver are a fixed set of commands.
Moving the functionality into the clang-linker-wrapper presents the opportunity to step away from the static command construction and create the call chain on the fly, based on real-time output from the corresponding tools being called.
Host Link¶
The final host link is also performed by the linker wrapper. This link is built upon the full link command line as constructed by the compiler driver, including all libraries and the linked/wrapped device binaries to complete the compilation process.
The provided command line for the final host link step contains the full list of libraries and objects to be linked against. The expectation is for this list to be complete. With the old model, the host objects are directly passed to the host link step and the device objects are processed separately. As we are passing the full command line to the link step, the objects provided will need to be full fat objects. This differs from the old model and will require an additional step before the link to create the full fat object that is properly represented on the host link command line. This additional step is necessary because we are creating the fat objects during a separate step as opposed to integrating the offload binaries during the host object generation. See Fat Binary Generation.
Transitioning from old model to new model¶
The binary representation of the fat objects is not equivalent between the old and the new model. Behavior of the new model will be guarded by the --offload-new-driver compiler switch. This will allow for implementation of the model without disturbing the existing behavior. When we are ready to make the switch over, it is simply a matter of making the new model the default mode.
Initially, the fat objects will be represented by the existing format created by the clang-offload-bundler. This will allow for older binaries to continue to be consumed with the updated offloading model. We will transition to the packager representation at a future date.