Core programming guide#

Memory management#

Objects are either shared, unique, or managed. The memory policy for each kind is detailed in the following.

Shared objects#

Objects are created, retained, and released. At creation, the object is constructed and its reference count is set to 1. The retain operation increases the reference count by 1 and the release operation decreases it by 1. When the reference count reaches 0, the object is destroyed and the memory is freed. Releasing an object is always safe after passing it to a function, because the function either does not need the object anymore, or it increased the reference count by one.

The reference count of an object of type "type" is managed with

  • tinytc_type_create

  • tinytc_type_retain

  • tinytc_type_release
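
For illustration, a minimal sketch of the counting rules, using the program object as a concrete shared object (tinytc_prog_retain is an assumption here, inferred from the generic naming pattern above; tinytc_prog_release appears in the parser example below):

// program was created with reference count 1, e.g. by the parser
tinytc_prog_retain(program);  // count: 2 (assumed name, following the pattern above)
consume(program);             // hypothetical function that may retain the object itself
tinytc_prog_release(program); // count: 1; safe even right after passing the object on
tinytc_prog_release(program); // count: 0; the object is destroyed and memory is freed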

Unique objects#

Unique objects always have a single owner. In the C-API, when the object is passed to a function, the ownership may be passed to another object when the function’s documentation says so. For example, when adding an instruction to a region, the region takes ownership of the instruction and the user must not destroy the instruction as that would lead to a double free. In C++, the copy constructor is deleted and unique objects must be moved when a function transfers ownership.

An object is created and deleted with

  • tinytc_type_create

  • tinytc_type_destroy
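
For illustration, a sketch of the ownership rules with a hypothetical unique object type inst (the names follow the generic pattern above and are not the exact API):

tinytc_inst_t instr = NULL;           // hypothetical unique object type
tinytc_inst_create(&instr /* ... */); // single owner: the caller
// If a function documents that it takes ownership (e.g. adding the instruction
// to a region), the caller must NOT destroy the object afterwards (double free).
// Otherwise, the caller remains the owner and cleans up:
tinytc_inst_destroy(instr);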

Managed objects#

Some objects are never created or deleted directly but are looked up in a parent object. The user therefore never destroys the managed object itself, only its parent object. Care must be taken that the parent object is not deleted while the managed object is still in use.

The object is obtained with

  • tinytc_type_get
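
For illustration, a sketch with the generic placeholder names; note that no destroy or release call pairs with the get:

tinytc_type_t ty = NULL;
tinytc_type_get(&ty, parent /* ... */); // looked up in the parent object, not created
// use ty only while parent is alive; destroy the parent, never ty, when done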

Error#

The C-API returns error codes, whereas the C++-API throws exceptions.

See tinytc_status_t for the list of error codes. Level Zero and OpenCL error codes are mapped to tinytc_status_compute_runtime_error.
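
For illustration, the C checking idiom (a sketch; the create call is taken from the parser example below):

tinytc_status_t status = tinytc_compiler_context_create(&ctx);
if (status != tinytc_status_success) {
    // Level Zero and OpenCL errors arrive as tinytc_status_compute_runtime_error
    printf("Error: %d\n", status);
}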

Parser#

Programs written in the tensor language are parsed from a file, stdin, or a string. The tinytc_compiler_context_t object controls the optimization level, optimization flags, and error logging. (The default compiler context does not print or log errors.) When an error reporter is installed via tinytc_compiler_context_set_error_reporter, errors are printed along with source code locations and source context. For example:

test/lit/opt/check-ir/type_mismatch0.ir:6.8-23: Type of operand must match return type
func @kernel(%K0: memref<f32>) {
  %0 = load %K0[] : f64
       ~~~~~~~~~~~~~~~~
test/lit/opt/check-ir/type_mismatch0.ir:5.14-16: value defined here
func @kernel(%K0: memref<f32>) {
             ~~~

Example:

tinytc_status_t status;
tinytc_compiler_context_t ctx = NULL;
tinytc_prog_t program = NULL;
status = tinytc_compiler_context_create(&ctx);
// ... check status ...
status = tinytc_compiler_context_set_error_reporter(ctx, error_callback, NULL);
// ... check status ...
status = tinytc_parse_file(&program, "test/lit/opt/check-ir/type_mismatch0.ir", ctx);
if (status != tinytc_status_success) {
    printf("Error: %d\n", status);
    goto err;
}
// ...
err:
tinytc_prog_release(program);
tinytc_compiler_context_release(ctx);

Compiler#

Program objects (tinytc_prog_t) are online-compiled using the tinytc_prog_compile_to_spirv_and_assemble function (compile_to_spirv_and_assemble in C++). The program object is modified in the process, as compiler passes rewrite it. A binary object containing the SPIR-V binary is returned.

Some compiler passes specialize the code based on properties of the GPU device; therefore, a tinytc_core_info_t object is required. It is recommended to query the core info from the runtime using one of the tinytc_runtime_core_info_create functions (create_core_info in C++), but one may also look up the core info from a table, as done in the example code below.

Example:

tinytc_status_t status;
tinytc_core_info_t info = NULL;
tinytc_binary_t bin = NULL;
status = tinytc_core_info_intel_create_from_arch(&info, tinytc_intel_gpu_architecture_pvc);
// ... check status ...
status = tinytc_prog_compile_to_spirv_and_assemble(&bin, program, info);
// ... check status ...
tinytc_binary_release(bin);
tinytc_core_info_release(info);

Note

Code generation targets SPIR-V. As a minimum, the Addresses, SubgroupDispatch, and Int64 capabilities must be supported by the runtime.

Further capabilities are required for specific functionality:

  • Int(8|16) for i8, i16 ints

  • Float(16|64) for f16, f64 floats

  • Int64Atomics for atomics on i64

  • Groups for work group operations (e.g. broadcast)

  • AtomicFloat(16|32|64)AddExt for atomics on f16, f32, f64 (SPV_EXT_shader_atomic_float[16]_add extensions)

  • BFloat16ConversionINTEL for bf16 support (SPV_INTEL_bfloat16_conversion extension)

  • SubgroupBufferBlockIOINTEL for efficient block loads and stores (SPV_INTEL_subgroups extension)

Device info#

Kernels are specialized for properties of the target device, such as the subgroup size, the maximum work-group size, and the register space available to a subgroup. Moreover, the device's support level can be queried from the runtime.

tinytc_support_level_t level;
tinytc_ze_get_support_level(device, &level);
// ... check status ...
if (level >= tinytc_support_level_basic) {
    tinytc_core_info_t info = NULL;
    tinytc_ze_core_info_create(&info, device);
    // ... check status, then compile programs against info ...
    tinytc_core_info_release(info);
}

Runtime#

The JIT compiler compiles tensor programs into SPIR-V binaries. The library provides functions to create the runtime's kernel bundle object (cl_program, sycl::kernel_bundle, ze_module_handle_t) from a binary object. The runtime's kernel objects are obtained using the native API or the Tiny Tensor Compiler API (if applicable). Kernel arguments must be set following the calling convention. The Tiny Tensor Compiler should be used to translate the 2D work-group size of the tensor language to a 3D work-group size, and to translate the group size to the global size that is passed to the runtime.

Example for “func @foo(%a: i32, …) { … }” (without error handling code):

ze_module_handle_t bundle = NULL;
ze_kernel_handle_t kernel = NULL;
int32_t a = 42; // the kernel parameter %a has type i32
tinytc_ze_kernel_bundle_create_with_binary(&bundle, context, device, bin);
tinytc_ze_kernel_create(&kernel, bundle, "foo"); // Sets the work-group size
zeKernelSetArgumentValue(kernel, 0, sizeof(a), &a);
// ...
ze_group_count_t group_count = tinytc_ze_get_group_count(howmany);
zeCommandListAppendLaunchKernel(command_list, kernel, &group_count, NULL, 0, NULL);
// ...
zeKernelDestroy(kernel);
zeModuleDestroy(bundle);

Note

Kernel bundles can also be created from program objects directly, e.g. with tinytc_cl_kernel_bundle_create_with_program or tinytc_ze_kernel_bundle_create_with_program.
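
For illustration, a sketch of the program-object variant for Level Zero; the argument order is an assumption, chosen by analogy with tinytc_ze_kernel_bundle_create_with_binary above:

ze_module_handle_t bundle = NULL;
// assumed signature, by analogy with the ..._with_binary variant
tinytc_ze_kernel_bundle_create_with_program(&bundle, context, device, program);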

Recipe#

Recipes provide code generators for common applications. Compared to writing, parsing, and compiling the tensor code by hand, their usage is simple, as all of these steps are encapsulated in the recipe.

Recipes are submitted to the runtime using a recipe handler. The general usage of a recipe is as follows:

tinytc_recipe_t recipe = NULL;
tinytc_recipe_handler_t handler = NULL;
tinytc_recipe_<recipe_name>_create(&recipe, info, <recipe_parameters>, ctx);
tinytc_ze_recipe_handler_create(&handler, context, device, recipe, ctx);
tinytc_recipe_<recipe_name>_set_args(handler, <recipe_args>);
tinytc_ze_recipe_handler_submit(handler, command_list, NULL, 0, NULL);
// ...
tinytc_recipe_handler_release(handler);
tinytc_recipe_release(recipe);

Memory objects are either buffers (e.g. cl_mem in OpenCL), Unified Shared Memory (USM) pointers, or Shared Virtual Memory (SVM) pointers. The unified interface requires the memory object to be passed as a void-pointer, annotated with tinytc_mem_type_t. For example:

// OpenCL
cl_mem A = ...;
tinytc_recipe_<recipe_name>_set_args(..., A, tinytc_mem_type_buffer, ...);

// Level Zero
void* A = ...;
tinytc_recipe_<recipe_name>_set_args(..., A, tinytc_mem_type_usm_pointer, ...);

In C++, one only needs to pass the memory object: it is implicitly converted to the mem type, which automatically determines whether a pointer or a cl_mem object was given. A pointer maps to tinytc_mem_type_usm_pointer and a cl_mem object maps to tinytc_mem_type_buffer. For SVM pointers, one needs to call mem(pointer, tinytc_mem_type_svm_pointer) explicitly.

Batched small GEMM#

The batched small GEMM recipe implements the following tensor operation:

\[C_i = \alpha \text{op}_A(A_i) \text{op}_B(B_i) + \beta C_i\]

where \(\text{op}_A(A_i) \in \mathbb{R}^{M\times K}\), \(\text{op}_B(B_i) \in \mathbb{R}^{K\times N}\), \(C_i \in \mathbb{R}^{M\times N}\), \(i\) is the group id, and

\[\begin{split}\text{op}_{X}(Y) = \left\{\begin{array}{rcl} Y^T & \text{if} & t_X = T, \\ Y & \text{if} & t_X = N. \end{array}\right.\end{split}\]

The matrices in a matrix batch are separated by a fixed stride; that is, for a matrix batch X the address is computed as follows:

X[m + n * ldX + i * strideX] // accesses X_i(m,n)
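
For illustration, a sketch instantiating this recipe via the generic template above; the function name tinytc_recipe_small_gemm_batched_create and its parameter list (scalar type, transpose modes t_A and t_B, the sizes M, N, K, the leading dimensions ldX, and the batch strides strideX) are assumptions derived from the formula and the <recipe_parameters> placeholder:

tinytc_recipe_t recipe = NULL;
// assumed creation call; the parameters mirror op_A, op_B, the matrix shapes,
// and the fixed batch strides from the addressing formula above
tinytc_recipe_small_gemm_batched_create(&recipe, info, tinytc_scalar_type_f32,
                                        tinytc_transpose_N, tinytc_transpose_T,
                                        M, N, K, ldA, strideA, ldB, strideB,
                                        ldC, strideC, ctx);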

Tall and skinny GEMM#

The tall and skinny GEMM recipe implements the following tensor operation:

\[C = \alpha AB + \beta C\]

where \(A \in \mathbb{R}^{M\times K}\), \(B \in \mathbb{R}^{K\times N}\), \(C \in \mathbb{R}^{M\times N}\), and \(M \gg K\), \(M \gg N\).
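
For illustration, a sketch analogous to the batched recipe; the function name tinytc_recipe_tall_and_skinny_create and its parameter list (only the small dimensions N and K are fixed at recipe creation) are assumptions following the same pattern:

tinytc_recipe_t recipe = NULL;
// assumed creation call; M is not a compile-time parameter here because M >> K, N
// and is expected to be supplied when the kernel arguments are set
tinytc_recipe_tall_and_skinny_create(&recipe, info, tinytc_scalar_type_f32, N, K, ctx);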