Proposed design for offloading model

Introduction

This document covers the proposed design for using the new offloading model for the DPC++ Compiler. It leverages the existing community offloading design (OffloadingDesign), which covers the Clang driver and code-generation steps for creating offloading applications.

The current offloading model is encapsulated entirely within the Clang compiler driver, requiring the driver to perform all of the additional steps for generating the host and device compilation passes. The driver is also responsible for initiating any link-time processing that occurs for each device target.

The updated offloading model removes much of the link-phase functionality of an offloading compilation from the driver and moves it into a clang-linker-wrapper tool.

Below is a general representation of the overall offloading flow performed during a full compilation from source to final executable. The compiler driver is responsible for creating the fat object, and the clang-linker-wrapper tool is responsible for the general functionality performed during the link.

High level view of the offloading flow

Diagram 1: General Offload Flow

Fat Binary Generation

The generation of the fat binary will be controlled by the driver. In the community model, the fat object is generated as a secondary step of the host compilation: the host compile takes an additional argument that points to the device binary, which is embedded in the final object.

In the proposed design, fat object generation will be separated from the host compilation step. This is being done to enable proper support for using an external host compiler, as well as to take advantage of potential parallelism when compiling the host and target device binaries.
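As a hedged illustration of the separated steps (file names and exact flags are illustrative; the embedding option is the cc1 flag described in the community design):

$ clang++ -fsycl-device-only foo.cpp -o foo.bc // device compile
$ clang-offload-packager -o foo.pkg --image=file=foo.bc,triple=spir64,kind=sycl // package device binary
$ clang++ -c foo.cpp -Xclang -fembed-offload-object=foo.pkg -o foo.o // host compile, embed device binary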

Creating the fat object

Diagram 2: Fat Object Generation

Packager

When dealing with multiple device binaries, an additional step packages the device binaries before they are added to the host object. This step is performed with clang-offload-packager, which takes image inputs containing information such as the target triple, architecture setting, and offloading kind.

The clang-offload-packager is run during 'fat object' generation regardless of the number of device binaries being added to the conglomerate fat object. The device binaries are contained in what is designated an 'Offload Binary'. These binaries can reside in a variety of binary formats, including bitcode files, ELF objects, executables and shared objects, COFF objects, and archives, or they can simply be stored as a standalone offload binary.

We should have the ability to package SPIR-V-based device binaries in the offload section of any given binary. These device binaries will be packaged as normal with the packager and placed within the given section.

Example usage of the external clang-offload-packager call:

clang-offload-packager --image=file=<name>,triple=<triple>,kind=<kind>
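When several device binaries are packaged together, each is described by its own --image argument. A hedged example (file names are illustrative):

clang-offload-packager -o out.pkg --image=file=jit.bc,triple=spir64,kind=sycl --image=file=skl.bin,triple=spir64_gen,arch=skl,kind=sycl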

Clang Linker Wrapper

The clang-linker-wrapper provides the interface to perform the needed link steps when consuming fat binaries. The linker wrapper performs the majority of the work involved in the link step of an offload compilation, significantly reducing the amount of work occurring in the compiler driver. From the compilation perspective, the linker wrapper replaces the typical call to the host linker. This allows the responsibility of the compiler driver to be nearly identical when performing a regular compilation versus an offloading compilation.

From a high level, using the clang-linker-wrapper provides a few benefits:

  • Moves all of the device linking responsibility out of the compiler driver.

  • Allows linking for offloading to be performed more directly, using more linker-like calls, without requiring the use of the driver.

  • Provides additional flexibility with the ability to dynamically modify the toolchain execution.

Example usage of the external clang-linker-wrapper call:

clang-linker-wrapper <wrapper opts> -- <linker opts>
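As a concrete hedged instance (paths illustrative; --linker-path is the wrapper option used to locate the underlying host linker):

clang-linker-wrapper --linker-path=/usr/bin/ld -- foo.o -o a.out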

Device Extraction

During the compilation step, the device binaries are embedded in a section of the host binary. When performing the link, this section is extracted from each object and mapped according to the device kind. The clang-linker-wrapper is responsible for examining all of the input binaries, extracting the embedded device binaries, and determining any additional device linking paths that need to be taken.

For example, when an embedded device binary is of the OFK_SYCL kind and of the spir64_gen architecture triple, the resulting extracted binary is linked, post-link processed and converted to SPIR-V before being passed to ocloc to generate the final device binary. Options passed via --gpu-tool-arg= will be applied to the ocloc step as well.
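For inspection, the offload entries embedded in a fat object can be dumped with llvm-objdump and its --offloading option; a hedged example (file name illustrative):

$ llvm-objdump --offloading foo.o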

Binaries generated during the offload compilation will be 'bundled' together to create a conglomerate fat binary. Depending on the type of binary, the device binary is embedded as follows:

Objects

Object types considered are COFF objects, ELF executables, ELF relocatable objects, and ELF shared objects. For COFF objects, the device binaries reside in a section marked .llvm.offloading. For ELF files, there is a section marked with the LLVM_OFFLOADING type.

Bitcode

The offloading data is contained in the .llvm.offloading section, referenced by the llvm.embedded.object metadata.

Offload Binary

The binary itself can be represented by just an offload binary; it is not required to reside in a section of another binary. This representation is used for any kind of device-only binary that is created. Device-only binaries are represented by the file generated by the Packager.

Archives

Each item in the archive will be extracted and individually checked for its file type, with extraction performed according to the file types listed above.

Support of existing fat object format

The current model generates fat objects using the clang-offload-bundler tool. This tool creates a binary that contains individual sections labeled __CLANG_OFFLOAD_BUNDLE__kind-triple. This format does not match the new format described above. To ease the transition from the old model to the new model, support will be added for identifying and extracting device binaries that use the old model.

It is expected that all new binaries generated with the updated offloading model will use the embedded fat object format, moving away from clang-offload-bundler usage. We will not support mixing fat object formats, such as objects created with the bundler combined with embedded objects in a single archive, or objects created with an embedded offload section alongside additional sections generated with the bundler.

Device Linking

During this phase, all of the individual device binaries that are extracted and associated with a given target are processed according to the type of binary involved. The default device code representation is LLVM-IR, which requires an additional link step of the device code before it is wrapped and integrated into the final executable. As mentioned in the Packager section, device representation in SPIR-V should be considered, with the ability to link native SPIR-V modules. The device link of LLVM-IR is performed by llvm-link.

To reduce the potential size of the linked device binary, an additional host link step is performed to gather dependency information when static device libraries are involved. This information is sent through the clang-offload-deps tool to generate a dependency IR file, which is used during the device link step.

The use of the clang-linker-wrapper introduces support for LTO of device code. We can leverage this to move away from the dependency-gathering step using clang-offload-deps and use ThinLTO for device code instead.

Multiple device linking steps occur. The first step links together all of the objects and the required device libraries. The second includes all of the optional device libraries, the static device libraries, and the dependency information gathered above, and is performed with --only-needed to streamline the final device binary.
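A hedged sketch of the two llvm-link invocations (file and library names are hypothetical):

$ llvm-link user1.bc user2.bc libsycl-required.bc -o linked.bc // objects + required device libraries
$ llvm-link linked.bc libsycl-optional.bc deps.bc --only-needed -o device.bc // optional/static libraries + dependency info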

A list of device libraries that need to be linked in with user code is provided by the driver. The driver is also responsible for letting the clang-linker-wrapper know the location of the device libraries.

| Option | Expected Behavior |
| --- | --- |
| --sycl-device-libraries=<arg> | A comma separated list of device libraries that are linked during the device link |
| --sycl-device-library-location=<arg> | The location in which the device libraries reside |

Table: Options to pass device libraries to the clang-linker-wrapper

The device libraries are controlled via the -fno-sycl-device-lib=<arg> option; based on this option, the driver determines which libraries to tell the linker wrapper to pull in.

Post-link and SPIR-V translation

After the device binaries are linked together, two additional steps are performed to prepare the device binary either for consumption by an offline compilation tool (AOT) or for wrapping for JIT processing.

The sycl-post-link tool is used after the device link is performed, applying any changes such as optimizations and code splitting before passing off to the llvm-spirv tool, which translates the LLVM-IR to SPIR-V.
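A hedged sketch of this sequence (file names are hypothetical; -split=auto is one of the sycl-post-link splitting modes, and the split output naming is illustrative):

$ sycl-post-link -split=auto -o device.table linked.bc
$ llvm-spirv -o device_0.spv device_0.bc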

| Option | Expected Behavior |
| --- | --- |
| --sycl-post-link-options=<arg> | Options that will control sycl-post-link step |
| --llvm-spirv-options=<arg> | Options that will control llvm-spirv step |

Table: Options to pass sycl-post-link and llvm-spirv options to the clang-linker-wrapper

Options used by the clang-linker-wrapper when invoking the sycl-post-link and llvm-spirv tools are provided by the driver via the --sycl-post-link-options=<arg> and --llvm-spirv-options=<arg> options, respectively.
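For example, the driver might forward (option values illustrative):

clang-linker-wrapper --sycl-post-link-options="-split=auto" --llvm-spirv-options="-spirv-max-version=1.4" -- <linker opts>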

Ahead Of Time Compilation

The updated offloading model will integrate the Ahead of Time (AOT) compilation behaviors into the clang-linker-wrapper. The actions will typically take place after the device link, post-link, and LLVM-IR to SPIR-V translation steps.

Regardless of the AOT target, the flow is similar; only the offline compiler used to create the target device image changes. It is expected that each offline compiler will also use command lines unique to that tool to create the image.

To support the option passing triggered by use of the -Xsycl-target-backend option, as well as the implied options based on the optional device behaviors for AOT compilations for GPU, new command-line interfaces are needed to pass along this information.

| Target | Triple | Offline Tool | Option for Additional Args |
| --- | --- | --- | --- |
| CPU | spir64_x86_64 | opencl-aot | --cpu-tool-arg=<arg> |
| GPU | spir64_gen | ocloc | --gpu-tool-arg=<arg> |
| FPGA | spir64_fpga | aoc/opencl-aot | --fpga-tool-arg=<arg> |

Table: Ahead of Time Info

To complete the support needed for the various targets using the clang-linker-wrapper as the main interface, a few additional options will be needed to communicate from the driver to the tool. Further usage details are given below.

| Option Name | Purpose |
| --- | --- |
| --fpga-link-type=<arg> | Tells the link step to perform 'early' or 'image' processing to create archives for FPGA |
| --parallel-link-sycl=<arg> | Provide the number of parallel jobs that will be used when processing split jobs |

Table: Additional Options for clang-linker-wrapper

The clang-linker-wrapper provides an existing option named -wrapper-jobs that may be useful for our usage.

spir64_gen support

Compilation behaviors involving AOT for GPU involve an additional call to the OpenCL Offline Compiler (OCLOC). This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step performed by llvm-spirv. Additional options passed by the user via -Xsycl-target-backend=spir64_gen <opts>, as well as the implied options set via target options such as -fsycl-targets=intel_gpu_skl, will be processed by a new option to the wrapper, --gpu-tool-arg=<arg>.

To support multiple target specifications, for instance -fsycl-targets=intel_gpu_skl,intel_gpu_pvc, multiple --gpu-tool-arg options can be passed on the command line. Each instance is treated as a separate OCLOC call, passing along the <args> as options to that call. The compiler driver is responsible for putting together the full option list to be passed along.

-fsycl -fsycl-targets=spir64_gen,intel_gpu_skl -Xsycl-target-backend=spir64_gen "-device pvc -options -extraopt_pvc" -Xsycl-target-backend=intel_gpu_skl "-options -extraopt_skl"

Example: spir64_gen enabling options

--gpu-tool-arg="-device pvc -options -extraopt_pvc" --gpu-tool-arg="-options -extraopt_skl"

Example: clang-linker-wrapper options

Each OCLOC call will be represented as a separate device binary that is individually wrapped and linked into the final executable.

Additionally, the syntax can be expanded to enable passing specific options to a specific GPU device target for spir64_gen. The syntax will resemble --gpu-tool-arg=<arch> <arg>. This corresponds to the existing option syntax of -fsycl-targets=intel_gpu_<arch>, where <arch> can be one of a fixed set of targets.
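Using that expanded syntax, per-target options might look like the following (values illustrative):

--gpu-tool-arg=pvc "-options -extraopt_pvc" --gpu-tool-arg=skl "-options -extraopt_skl"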

--offload-arch

For SYCL offloading to Intel GPUs, Intel CPUs, and NVidia and AMD GPUs, specify the device architecture using the --offload-arch option: for instance, --offload-arch=sm_80 to target an NVidia Tesla A100, --offload-arch=gfx90a to target an AMD Instinct MI250X, or --offload-arch=sm_80,gfx90a to target both.

For the Intel Graphics AOT target, valid values for --offload-arch are mapped to valid device names accepted by OCLOC (the Intel GPU AOT compiler) via its -device option.

SYCL offloading with --offload-arch for Intel CPUs and Intel GPUs is currently enabled only with the --offload-new-driver option.

Example:

$ clang++ -fsycl -offload-arch=bdw --offload-new-driver -c foo.cpp // SYCL AOT for Intel GPU.
$ clang++ -fsycl -offload-arch=broadwell --offload-new-driver -c foo.cpp // SYCL AOT for Intel CPU.

The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel GPUs and the corresponding -device value passed to OCLOC.

| Intel GPU device | --offload-arch accepted value | OCLOC -device value |
| --- | --- | --- |
| Intel(R) microarchitecture code name Broadwell Intel graphics architecture | bdw | bdw |
| Intel(R) microarchitecture code name Skylake Intel graphics architecture | skl | skl |
| Kaby Lake Intel graphics architecture | kbl | kbl |
| Coffee Lake Intel graphics architecture | cfl | cfl |
| Apollo Lake Intel graphics architecture | apl | apl |
| Broxton Intel graphics architecture | bxt | apl |
| Gemini Lake Intel graphics architecture | glk | glk |
| Whiskey Lake Intel graphics architecture | whl | whl |
| Amber Lake Intel graphics architecture | aml | aml |
| Comet Lake Intel graphics architecture | cml | cml |
| Ice Lake Intel graphics architecture | icl, icllp | icllp |
| Elkhart Lake Intel graphics architecture | ehl | ehl |
| Jasper Lake Intel graphics architecture | jsl | jsl |
| Tiger Lake Intel graphics architecture | tgl, tgllp | tgllp |
| Rocket Lake Intel graphics architecture | rkl | rkl |
| Alder Lake S Intel graphics architecture | adl_s | adl_s |
| Raptor Lake Intel graphics architecture | rpl_s | adl_s |
| Alder Lake P Intel graphics architecture | adl_p | adl_p |
| Alder Lake N Intel graphics architecture | adl_n | adl_n |
| DG1 Intel graphics architecture | dg1 | dg1 |
| Alchemist G10 Intel graphics architecture | acm_g10, dg2_g10 | acm_g10 |
| Alchemist G11 Intel graphics architecture | acm_g11, dg2_g11 | acm_g11 |
| Alchemist G12 Intel graphics architecture | acm_g12, dg2_g12 | acm_g12 |
| Ponte Vecchio Intel graphics architecture | pvc | pvc |
| Ponte Vecchio VG Intel graphics architecture | pvc_vg | pvc_vg |
| Meteor Lake U/S or Arrow Lake U/S Intel graphics architecture | mtl_u, mtl_s, arl_u | mtl_s |
| Meteor Lake H Intel graphics architecture | mtl_h | mtl_h |
| Arrow Lake H Intel graphics architecture | arl_h | arl_h |
| Battlemage G21 Intel graphics architecture | bmg_g21 | bmg_g21 |
| Lunar Lake Intel graphics architecture | lnl_m | lnl_m |

nvptx64-nvidia-cuda support

For SYCL offloading to NVidia GPUs using the --offload-arch option, the following table lists the accepted values.

| NVidia GPU device name | --offload-arch accepted values for NVidia GPUs |
| --- | --- |
| NVIDIA Maxwell architecture (compute capability 5.0) | sm_50 |
| NVIDIA Maxwell architecture (compute capability 5.2) | sm_52 |
| NVIDIA Maxwell architecture (compute capability 5.3) | sm_53 |
| NVIDIA Pascal architecture (compute capability 6.0) | sm_60 |
| NVIDIA Pascal architecture (compute capability 6.1) | sm_61 |
| NVIDIA Pascal architecture (compute capability 6.2) | sm_62 |
| NVIDIA Volta architecture (compute capability 7.0) | sm_70 |
| NVIDIA Volta architecture (compute capability 7.2) | sm_72 |
| NVIDIA Turing architecture (compute capability 7.5) | sm_75 |
| NVIDIA Ampere architecture (compute capability 8.0) | sm_80 |
| NVIDIA Ampere architecture (compute capability 8.6) | sm_86 |
| NVIDIA Jetson/Drive AGX Orin architecture | sm_87 |
| NVIDIA Ada Lovelace architecture | sm_89 |
| NVIDIA Hopper architecture | sm_90 |
| NVIDIA Hopper architecture (with wgmma and setmaxnreg instructions) | sm_90a |
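As with the Intel GPU examples above, a hedged driver invocation targeting an NVidia GPU could look like:

$ clang++ -fsycl --offload-arch=sm_80 --offload-new-driver -c foo.cpp // SYCL offload for an NVidia Ampere GPU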

amdgcn-amd-amdhsa support

For SYCL offloading to AMD GPUs using the --offload-arch option, the following table lists the accepted values.

| AMD GPU device name | --offload-arch accepted values for AMD GPUs |
| --- | --- |
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx700 |
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx701 |
| AMD GCN GFX7 (Sea Islands (CI)) architecture | gfx702 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx801 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx802 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx803 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx805 |
| AMD GCN GFX8 (Volcanic Islands (VI)) architecture | gfx810 |
| AMD GCN GFX9 (Vega) architecture | gfx900 |
| AMD GCN GFX9 (Vega) architecture | gfx902 |
| AMD GCN GFX9 (Vega) architecture | gfx904 |
| AMD GCN GFX9 (Vega) architecture | gfx906 |
| AMD GCN GFX9 (Vega) architecture | gfx908 |
| AMD GCN GFX9 (Vega) architecture | gfx909 |
| AMD GCN GFX9 (Vega) architecture | gfx90a |
| AMD GCN GFX9 (Vega) architecture | gfx90c |
| AMD GCN GFX9 (Vega) architecture | gfx940 |
| AMD GCN GFX9 (Vega) architecture | gfx941 |
| AMD GCN GFX9 (Vega) architecture | gfx942 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1010 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1011 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1012 |
| AMD GCN GFX10.1 (RDNA 1) architecture | gfx1013 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1030 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1031 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1032 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1033 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1034 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1035 |
| AMD GCN GFX10.3 (RDNA 2) architecture | gfx1036 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1100 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1101 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1102 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1103 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1150 |
| AMD GCN GFX11 (RDNA 3) architecture | gfx1151 |
| AMD GCN GFX12 (RDNA 4) architecture | gfx1200 |
| AMD GCN GFX12 (RDNA 4) architecture | gfx1201 |

spir64_fpga support

Compilation behaviors involving AOT for FPGA involve an additional call to either aoc (for Hardware/Simulation) or opencl-aot (for Emulation). This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step performed by llvm-spirv. Additional options passed by the user via -Xsycl-target-backend=spir64_fpga <opts> will be processed by a new option to the wrapper, --fpga-tool-arg=<arg>.

The FPGA target also has support for additional generated binaries that contain intermediate files specific for FPGA. These binaries (aoco, aocr and aocx) can reside in archives and are treated differently than traditional device binaries.

Generation of the AOCR and AOCX type binary is triggered by the command line option -fsycl-link, where -fsycl-link=image creates AOCX archives and -fsycl-link=early generates AOCR archives. The files generated by these options are handled in a specific manner when encountered.
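A hedged driver-level example of triggering the AOCR flow (file names illustrative):

$ clang++ -fsycl -fintelfpga -fsycl-link=early foo.cpp -o foo.a // produces an AOCR-based archive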

Any archive with an AOCR-type device binary will have the AOCR binary extracted and passed to aoc to produce an AOCX final image. This final image is wrapped and added to the final binary during the host link. The use of -fsycl-link=image with an AOCR binary will create an AOCX-based archive instead of completing the host link. Any archive with an AOCX-type device binary skips the aoc step and is wrapped and added to the final binary during the host link. Archives with any AOCO device binaries are extracted and passed through to aoc via -library-list=<listfile>.

As the clang-linker-wrapper is responsible for understanding the archives that are added on the command line, it will need to know when to look for these unique device binaries based on the expected compilation targets. The behavior of creating the AOCX/AOCR type archive will be triggered via an additional command line option specified by the driver when -fsycl-link options are used. The --fpga-link=<type> option will tell the wrapper when these handlings need to occur.

When using the -fintelfpga option to enable AOT for FPGA, there are additional expectations during the compilation. Use of the option enables debug information generation and also generates dependency information. The dependency information should be packaged along with the device binary for use during the link phase. It is expected that the full fat object, containing the host binary, device binary, and dependency file, is generated before being passed to the link phase. The dependency information is only used when compiling for hardware.

The clang-linker-wrapper tool will be responsible for determining which FPGA tool is used during the AOT device compilation phase. The use of -simulation or -hardware, as passed in by --fpga-tool-arg, signifies which tool is used.

spir64_x86_64 support

Compilation behaviors involving AOT for CPU involve an additional call to opencl-aot. This call occurs after the post-link step performed by sycl-post-link and the SPIR-V translation step performed by llvm-spirv. Additional options passed by the user via -Xsycl-target-backend=spir64_x86_64 <opts> will be processed by a new option to the wrapper, --cpu-tool-arg=<arg>.

Similar to SYCL offloading to Intel GPUs using --offload-arch, SYCL AOT for Intel CPUs will also leverage the --offload-arch option. The valid CPU device names accepted for --offload-arch are CPU names from clang's -march. These names are more verbose and do not overlap with the Intel GPU names. The user-input CPU names are mapped to the corresponding opencl-aot -march option.
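For example, a user-facing invocation and the corresponding underlying AOT call might look like the following (hedged; the opencl-aot command line is illustrative beyond the -march mapping shown in the table below):

$ clang++ -fsycl --offload-arch=skylake-avx512 --offload-new-driver -c foo.cpp // user invocation
$ opencl-aot -march=avx512 ... // underlying call constructed by the wrapper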

The following table shows a mapping of the accepted values for --offload-arch to enable SYCL offloading to Intel CPUs and the corresponding -march value passed to opencl-aot.

| Intel CPU device | --offload-arch accepted value | opencl-aot -march value |
| --- | --- | --- |
| Intel(R) Advanced Vector Extensions 512 | skylake-avx512 | avx512 |
| Intel(R) Advanced Vector Extensions 2 | core-avx2 | avx2 |
| Intel(R) Advanced Vector Extensions | corei7-avx | avx |
| Intel(R) Streaming SIMD Extensions 4.2 | corei7 | sse4.2 |
| Intel(R) microarchitecture code name Westmere | westmere | wsm |
| Intel(R) microarchitecture code name Sandy Bridge | sandybridge | snb |
| Intel(R) microarchitecture code name Ivy Bridge | ivybridge | ivyb |
| Intel(R) microarchitecture code name Broadwell | broadwell | bdw |
| Intel(R) microarchitecture code name Coffee Lake | coffeelake | cfl |
| Intel(R) microarchitecture code name Alder Lake | alderlake | adl |
| Intel(R) microarchitecture code name Skylake (client) | skylake | skylake |
| Intel(R) microarchitecture code name Skylake (server) | skx | skx |
| Intel(R) microarchitecture code name Cascade Lake | cascadelake | clk |
| Intel(R) microarchitecture code name Ice Lake (client) | icelake-client | icl |
| Intel(R) microarchitecture code name Ice Lake (server) | icelake-server | icx |
| Intel(R) microarchitecture code name Sapphire Rapids | sapphirerapids | spr |
| Intel(R) microarchitecture code name Granite Rapids | graniterapids | gnr |

Wrapping of device image

Once the device binary is extracted from the fat binary, it must be wrapped and given the entry points needed for use during execution. This is performed during the link phase and controlled by the clang-linker-wrapper.

It is expected that the wrapping information generated around the device binary will match the wrapping information used in the existing offload model. In the old model, this wrapping is performed by the clang-offload-wrapper tool.

Integration of llvm-foreach

llvm-foreach is used frequently during the offloading process. The functionality was introduced due to the toolchain-building limitations of the driver and the need to perform multiple compilations during a specific step of the compilation. For example, the device code splitting performed by the sycl-post-link tool requires consumers of the split code to process each item individually. The driver toolchain does not know that this split occurs, or how many splits were performed; llvm-foreach hides this from the driver.

With the new model, the compilation sequence is built dynamically within the clang-linker-wrapper, making the use of llvm-foreach obsolete. The clang-linker-wrapper is then responsible for performing the needed individual calls for each work item, which can be done using a parallel for.

Beyond llvm-foreach and similar job hiding tools

Tools like llvm-foreach, file-table-tform, and spirv-to-ir-wrapper were all introduced to provide a way to implement behaviors that can only be determined while the compiler toolchain is running. They were needed to work around the fact that the toolchain commands constructed by the driver are a fixed set of commands.

Moving the functionality into the clang-linker-wrapper presents the opportunity to step away from static command construction and create the call chain on the fly, based on real-time output from the corresponding tools being called.

Transitioning from old model to new model

The binary representation of the fat objects differs between the old and the new model. Behavior of the new model will be guarded by the --offload-new-driver compiler switch. This will allow the model to be implemented without disturbing existing behavior. When we are ready to make the switchover, it is simply a matter of making the new behavior the default.

Initially, the fat objects will use the existing format created by the clang-offload-bundler. This will allow older binaries to continue to be consumed with the updated offloading model. We will transition to the packager representation at a future date.