Tensor language reference#
The grammar is given in ABNF syntax.
Execution model#
The unit of execution described by a function written in the tensor language is called a kernel. Kernels are launched in batches, where each instance of the kernel is called a work-group. The kernel has access to a three-dimensional group id that is used to select the work done in the work-group. Each work-group consists of a fixed number of subgroups that execute concurrently. Subgroups can be further divided into work-items, where the number of work-items per subgroup is given by the subgroup size.
The language distinguishes between collective, SPMD, and mixed instructions. A collective instruction distributes the work among the work-items in an implementation-defined manner. Local variables passed to or returned from a collective instruction are always uniform, meaning that each work-item holds the same value. An SPMD instruction follows the OpenCL execution model, where local variables may have a different value for each work-item. Mixed instructions accept both varying and uniform local variables.
In an SPMD region, we call an argument dynamically uniform if all work-items in a subgroup have the same value.
Regions come in two kinds: collective and SPMD. A collective instruction must only appear in a collective region, and an SPMD instruction must only appear in an SPMD region. Mixed instructions may appear in both kinds of regions. SPMD regions may be nested in collective regions, but collective regions must not be nested in SPMD regions.
Core rules#
White space is used to separate tokens, where a token is either an identifier, a literal, a keyword, or characters such as punctuation or delimiters. Otherwise, white space has no meaning.
Comments start with ; and stop at the end of the line (\n).
Identifier#
Identifiers are either named or unnamed.
Named identifiers are a letter followed by letters, underscores, or digits.
Unnamed identifiers are simply numbers.
As in LLVM, local identifiers are prefixed with %, whereas global identifiers
are prefixed with @.
identifier = unnamed-identifier / named-identifier
unnamed-identifier = 1*DIGIT
named-identifier = ALPHA *(ALPHA / DIGIT / "_")
local-identifier = "%" identifier
global-identifier = "@" identifier
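The identifier rules above can be checked mechanically. The following is a minimal sketch that transcribes the ABNF into a regular expression; the helper name `classify_identifier` is hypothetical and not part of the language or its tooling.

```python
import re

# Transcription of the ABNF rules: unnamed = 1*DIGIT,
# named = ALPHA *(ALPHA / DIGIT / "_"), prefixed by "%" (local) or "@" (global).
UNNAMED = r"[0-9]+"
NAMED = r"[A-Za-z][A-Za-z0-9_]*"
IDENTIFIER = re.compile(rf"(?P<sigil>[%@])(?P<name>{UNNAMED}|{NAMED})")


def classify_identifier(token):
    """Return (scope, kind) for a valid identifier token, or None otherwise."""
    match = IDENTIFIER.fullmatch(token)
    if match is None:
        return None
    scope = "local" if match.group("sigil") == "%" else "global"
    kind = "unnamed" if match.group("name").isdigit() else "named"
    return scope, kind
```

For instance, `%0` is a local unnamed identifier and `@gemm` a global named one, whereas `%_x` is rejected because a named identifier must start with a letter.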
Constants#
constant = boolean-constant / integer-constant / floating-constant / complex-constant
boolean-constant = "true" / "false"
integer-constant = [sign] 1*DIGIT
sign = "-" / "+"
floating-constant = [sign] (*DIGIT "." 1*DIGIT ["e" [sign] 1*DIGIT] / "inf" / "nan")
mantissa-dec = *DIGIT "." 1*DIGIT / 1*DIGIT "."
mantissa-hex = *HEXDIG "." 1*HEXDIG / 1*HEXDIG "."
exponent = [sign] 1*DIGIT
floating-constant-dec = [sign] (mantissa-dec ["e" exponent] / 1*DIGIT "e" exponent)
floating-constant-hex = [sign] "0x" (mantissa-hex ["p" exponent] / 1*HEXDIG "p" exponent)
floating-constant = floating-constant-dec / floating-constant-hex
complex-constant = "[" floating-constant "," floating-constant "]"
Integer constants must lie in the range \(-2^{63}+1,\dots,2^{63}-1\).
Floating point constants are given in C syntax and expected to be in the range of double precision numbers. The hexadecimal floating point syntax is supported, too. strtod can be used for parsing floating point numbers.
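As an illustration of the two notations, the sketch below parses decimal and hexadecimal floating constants with Python's built-in `float` and `float.fromhex`, which play the role that strtod plays in C. The function name `parse_floating_constant` is hypothetical. Python's parsers accept some spellings that the grammar above does not, so this is a sketch of the value semantics, not a validator.

```python
def parse_floating_constant(text):
    """Parse a decimal or hexadecimal floating constant into a double."""
    # Strip an optional sign only to detect the "0x" prefix.
    body = text.lstrip("+-")
    if body.lower().startswith("0x"):
        # Hexadecimal notation: the exponent after "p" is a power of two.
        return float.fromhex(text)
    # Decimal notation, including "inf" and "nan".
    return float(text)
```

For example, `0x1.8p1` denotes 1.5 times 2, that is 3.0.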
Attributes#
attribute = array-attribute /
boolean-attribute /
dictionary-attribute /
integer-attribute /
string-attribute
array-attribute = "[" [attribute *("," attribute)] "]"
boolean-attribute = boolean-constant
dictionary-attribute = "{" [named-attribute *("," named-attribute)] "}"
named-attribute = attribute-name "=" attribute
attribute-name = "alignment" /
"shape_gcd" /
"stride_gcd" /
"subgroup_size" /
"unroll" /
"work_group_size" /
string-attribute
integer-attribute = integer-constant
string-attribute = %x22 *(%x20-21 / %x23-7E) %x22
Attributes add information about an operation, for example to assert properties or to direct the compiler.
Functions#
function-definition = "func" global-identifier "(" [argument-list] ")"
["attributes" dictionary-attribute] region
argument-list = argument *("," argument)
argument = local-identifier ":" type [dictionary-attribute]
Defines a function that is callable from the host.
Attributes#
Subgroup size and work-group size are determined automatically by the compiler, but can be overridden using the function’s attribute dictionary:
| Name | Type | Description |
|---|---|---|
| subgroup_size | integer-attribute | Subgroup size; valid values depend on the target device (typically 16 or 32) |
| work_group_size | array-attribute with 2 integer-attribute entries | Two-dimensional work-group size in number of work-items |
The work-group size attribute defines the size of the local work-group. Due to the focus on matrix operations, the work-group size is always two-dimensional, where the first mode is used to tile the rows and the second mode is used to tile the columns. The first mode must be a multiple of the subgroup size. If the subgroup size is omitted, then the first mode must be a multiple of one of the subgroup sizes supported by the device. The product of the work-group size modes must be less than or equal to the maximum work-group size of the device.
The subgroup size attribute enforces a particular subgroup size that must be supported by the device.
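The constraints on the work-group size can be summarized in a small check. This is a minimal sketch under the rules stated above; the function name and the `max_work_group_size` parameter are hypothetical, and a real device query would supply the actual limits.

```python
def check_work_group_size(work_group_size, subgroup_size, max_work_group_size):
    """Return a list of violated rules; an empty list means the size is valid."""
    if len(work_group_size) != 2:
        return ["work_group_size must have exactly 2 entries"]
    errors = []
    rows, cols = work_group_size
    # The first mode tiles the rows and must be a multiple of the subgroup size.
    if rows % subgroup_size != 0:
        errors.append("first mode must be a multiple of the subgroup size")
    # The total number of work-items must not exceed the device limit.
    if rows * cols > max_work_group_size:
        errors.append("product of modes exceeds the maximum work-group size")
    return errors
```

With a subgroup size of 16 and an assumed device limit of 256 work-items, `[64, 4]` passes while `[24, 2]` and `[64, 8]` each violate one rule.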
Parameter attributes#
Parameters with memref or group type accept the following named attributes:
| Name | Type | Description |
|---|---|---|
| alignment | integer-attribute | Minimum pointer alignment |
| shape_gcd | array-attribute of integer-attribute | Greatest common divisors of shape |
| stride_gcd | array-attribute of integer-attribute | Greatest common divisors of stride |
Cf. the documentation of the memref type and the group type.
Restrictions#
Arguments must not have coopmatrix type.
Regions#
region = "{" *instruction "}"
A region is an ordered list of instructions. An instruction might contain a region. A region has access to values from its enclosing region, but the enclosing region does not have access to values assigned in the region.
Types#
type = void-type / boolean-type / number-type / memref-type / group-type
void-type = "void"
Boolean type#
boolean-type = "bool"
Boolean type that only has two states (true or false).
Scalar types#
number-type = integer-type / floating-type / complex-type
integer-type = "i8" / "i16" / "i32" / "i64" / "index"
floating-type = "bf16" / "f16" / "f32" / "f64"
complex-type = "c32" / "c64"
Scalar types are either signless integer (“i”), floating point (“f”), or complex floating point (“c”). The number behind the scalar type prefix denotes the number of bits, e.g. “f64” are double precision floating point numbers. The “bf16” type encodes bfloat16 floating point numbers. The “index” type is an integer type whose width is platform-specific.
Type sizes in bytes are given by
| \(\alpha\) | i8 | i16 | i32 | i64 | bf16 | f16 | f32 | f64 | c32 | c64 |
|---|---|---|---|---|---|---|---|---|---|---|
| \(\text{size}(\alpha)\) | 1 | 2 | 4 | 8 | 2 | 2 | 4 | 8 | 8 | 16 |
Mixed precision operands might be allowed in instructions if the operands’ types are promotable. The scalar type \(\alpha\) may be promoted to the scalar type \(\beta\) if all values an operand of type \(\alpha\) may take can be exactly represented in type \(\beta\). Formally, \(\alpha\) is promotable to \(\beta\) if \(\alpha \preceq \beta\), where the partial order \(\preceq\) is defined by the following relation matrix:
\(\preceq\) |
i8 |
i16 |
i32 |
i64 |
bf16 |
f16 |
f32 |
f64 |
c32 |
c64 |
|---|---|---|---|---|---|---|---|---|---|---|
i8 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
i16 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
|||
i32 |
1 |
1 |
1 |
1 |
1 |
|||||
i64 |
1 |
|||||||||
bf16 |
1 |
1 |
1 |
1 |
1 |
|||||
f16 |
1 |
1 |
1 |
1 |
1 |
|||||
f32 |
1 |
1 |
1 |
1 |
||||||
f64 |
1 |
1 |
||||||||
c32 |
1 |
1 |
||||||||
c64 |
1 |
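Moreover, for scalar types \(\alpha,\beta\) we define
\[\text{promote}(\alpha, \beta) = \begin{cases} \beta & \text{if } \alpha \preceq \beta, \\ \alpha & \text{if } \beta \preceq \alpha, \\ \text{fail} & \text{otherwise.} \end{cases}\]
Here, “fail” means that the promotion is not allowed and the compiler should throw an error.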
Memref type#
memref-type = "memref<" number-type tensor-shape ["," memory-layout] ["," address-space] ">"
constant-or-dynamic = integer-constant / "?"
tensor-shape = *("x" constant-or-dynamic)
address-space = "global" / "local"
A memref is a reference to a region of memory.
In analogy to the C/C++-language, the memref can be thought of as a pointer,
but with additional information on the size and memory layout of the memory region.
The size information can be either fixed or dynamic.
For example, the memref<f32x4x8> is analogous to float* with the additional information
that the memory region contains 32 floats structured in 4 rows and 8 columns.
The memref<f32x4x?> type is analogous to float*, too, but here the number of floats
and the number of columns are only known at run-time.
Run-time size information is stored in a dope vector; the calling convention for memrefs is implementation-defined.
The memref can have order 0. E.g. memref<f32> can be thought of as a pointer to a single precision float.
A vector is a tensor of order 1, e.g. memref<f64x4>.
A matrix is a tensor of order 2, e.g. memref<f64x4x4>.
A tensor of order n is given by memref<f32xs_1x...xs_n>.
Dynamic mode sizes are written using a question mark in place of an integer constant.
The default memory layout is the packed dense layout.
E.g. the memory layout of memref<f32x5x6x7> is strided<1,5,30>.
We note that memref<f32x5x6x7> and memref<f32x5x6x7,strided<1,5,30>>
are the same type.
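The default strides follow from the shape alone: the first stride is 1 and each further stride is the previous stride times the previous mode size. A minimal sketch, with the hypothetical helper name `packed_strides` and static shapes only:

```python
def packed_strides(shape):
    """Strides of the packed dense layout for a fully static shape."""
    strides = []
    stride = 1
    for size in shape:
        strides.append(stride)
        # The next mode starts after all elements of the current mode.
        stride *= size
    return strides
```

Calling `packed_strides([5, 6, 7])` yields `[1, 5, 30]`, matching the example above.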
Memrefs have an optional address space attribute. The global address space refers to memory objects allocated from the global memory pool that is shared by all work-groups. The local address space is shared by all work-items of the work-group but inaccessible to other work-groups. The default address space is “global”; memrefs with “local” address space are returned by the alloca instruction.
Definitions#
Let V be a value of memref type. The \(\text{order}(V)\) operation returns the memref’s order. The \(\text{shape}(V)\) returns the tensor shape as tuple. \(\text{rows}(V)\) and \(\text{columns}(V)\) return the size of the first and second mode, respectively. The \(\text{element_type}(V)\) operation gives the underlying scalar type.
For example, let B be a value of memref<f32x8x16x4> type, then
\(\text{order}(B) = 3\)
\(\text{shape}(B) = (8,16,4)\)
\(\text{rows}(B) = 8\)
\(\text{columns}(B) = 16\)
\(\text{element_type}(B) = \text{f32}\)
Memory layout#
memory-layout = strided-layout
Strided layout#
strided-layout = "strided<" [constant-or-dynamic-list] ">"
constant-or-dynamic-list = constant-or-dynamic *("," constant-or-dynamic)
The strided layout is a sequence of integers \(S_1,S_2,...,S_n\), where n must be equal to the order of the tensor. The strided layout is defined as the map
\[(i_1, i_2, \dots, i_n) \mapsto i_1 S_1 + i_2 S_2 + \dots + i_n S_n\]
We further impose the following restriction for a tensor with shape \(s_1\times s_2 \times ... \times s_n\):
\(1 \leq S_1\)
\(\forall i \in [2,n]: S_{i-1}s_{i-1} \leq S_i\)
Therefore, we have the “column-major” layout. The default packed dense layout is given by
\(1 = S_1\)
\(\forall i \in [2,n]: S_{i-1}s_{i-1} = S_i\)
Stride modes might be dynamic as well, indicated by a question mark.
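The restrictions on the strides can be checked programmatically. The sketch below assumes static shapes and strides, zero-based indices, and that the layout maps an index tuple to the element offset given by the sum of index times stride over all modes; both function names are hypothetical.

```python
def is_valid_strided_layout(shape, strides):
    """Check 1 <= S_1 and S_{i-1} * s_{i-1} <= S_i for all later modes."""
    if len(shape) != len(strides):
        return False
    if not shape:
        return True  # order-0 memref: nothing to check
    if strides[0] < 1:
        return False
    return all(strides[i - 1] * shape[i - 1] <= strides[i]
               for i in range(1, len(shape)))


def linear_offset(index, strides):
    """Element offset of a zero-based index tuple (assumed layout map)."""
    return sum(i * s for i, s in zip(index, strides))
```

A padded layout such as `strided<1,8,48>` for shape 5x6x7 satisfies the restrictions, whereas `strided<1,4,30>` does not because the second mode would overlap the first.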
Alignment attribute#
The alignment=X attribute gives the alignment X of the memref’s base pointer in bytes. That is, for the pointer P pointing to the first element of the memref we must have \(P = 0 \pmod{X}\).
Restriction: The alignment must be a multiple of the size of the memref’s element type.
Greatest common divisor (GCD) attributes#
The shape_gcd=[d_1,…,d_k] attribute asserts that \(s_i = 0 \pmod{d_i}, i=1,\dots,k\), where k is less than or equal to the order of the tensor n and \(s_i\) is the i-th entry of the shape vector. The divisors are understood to be the greatest common divisors for the set of shapes that the kernel is used for. For example, if we know that \(s_1\) is always a multiple of 4 then we can set shape_gcd=[4].
The stride_gcd=[D_1,…,D_m] attribute asserts that \(S_i = 0 \pmod{D_i}, i=1,\dots,m\), where m is less than or equal to the order of the tensor n and \(S_i\) is the i-th entry of the stride vector. The divisors are understood to be the greatest common divisors for the set of strides that the kernel is used for. For example, if we know that \(S_2\) is always a multiple of 4 then we can set stride_gcd=[1,4].
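If the set of shapes or strides a kernel is used for is known up front, suitable attribute values can be derived by taking the greatest common divisor per mode. A minimal sketch; the helper name `gcd_attribute` and the sample shapes are illustrative assumptions.

```python
from functools import reduce
from math import gcd


def gcd_attribute(vectors):
    """Per-mode greatest common divisor over a collection of shape or stride vectors."""
    # zip(*vectors) groups the i-th entries of all vectors together.
    return [reduce(gcd, column) for column in zip(*vectors)]
```

For the shapes (64,48), (32,80), and (96,16) this gives `[32, 16]`, which could be passed as shape_gcd=[32,16].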
Group type#
group-type = "group<" memref-type "x" constant-or-dynamic ["," "offset" ":" constant-or-dynamic] ">"
The group type collects unstructured pointers to memrefs with potentially different dynamic mode sizes.
The C-analogy of a group is a pointer-to-a-pointer.
For example, the C-analogue of a group<memref<f32x16x16>x?> is a float**.
The group shape is always one-dimensional and may be queried using the size instruction.
The optional offset parameter is used to offset each pointer by the given number of elements.
Given the C-analogue float** group, loading element i with offset off gives the
pointer float* tmp = group[i] + off.
The default offset is 0.
Dynamic values (‘?’) may appear in the memref-type, in the group shape, and in the offset. These values are stored in the dope vector; the calling convention for groups is implementation-defined.
Attributes#
Attributes applied on a group type are passed through to the memrefs. That is, when a memref is loaded from the group then the memref attributes are equal to the attributes of the group.
Cooperative matrix type#
coopmatrix-type = "coopmatrix<" number-type 2*2("x" integer-constant) "," matrix-use ">"
matrix-use = "matrix_a" / "matrix_b" / "matrix_acc"
The coopmatrix represents a matrix distributed across a subgroup, where each work-item in a subgroup stores a part of the matrix. The number-type specifies the matrix element type, the first integer-constant the number of rows, and the second integer-constant the number of columns. The matrix-use may affect the distribution of the matrix in the subgroup, and the name refers to the position of the matrix in a matrix multiplication.
Not all matrix shapes need to be supported in the implementation. The supported matrix shapes may depend on data type, matrix use, and target hardware.
An argument to any instruction that has coopmatrix type must be dynamically uniform.
Definitions#
Let V be a value of coopmatrix type. The \(\text{rows}(V)\) and \(\text{columns}(V)\) functions return the size of the first and second mode, respectively, and \(\text{shape}(V)\) returns rows and cols as tuple. The \(\text{component_type}(V)\) operation gives the underlying scalar type and \(\text{use}(V)\) returns the use.
For example, let B be a value of coopmatrix<f32x8x16,matrix_acc> type, then
\(\text{shape}(B) = (8,16)\)
\(\text{rows}(B) = 8\)
\(\text{columns}(B) = 16\)
\(\text{component_type}(B) = \text{f32}\)
\(\text{use}(B) = \text{matrix_acc}\)
Instructions#
Instructions may return zero, one, or multiple values, and follow the following format:
value-instruction-assignment = local-identifier "=" value-instruction
multi-value-instruction-assignment = [local-identifier-list "="] multi-value-instruction
local-identifier-list = local-identifier *("," local-identifier)
instruction = value-instruction-assignment
/ multi-value-instruction-assignment
That is, on the left-hand side we have a list of values that are produced by the instruction followed by an equals sign, or an empty string if the instruction does not produce values. On the right-hand side, after the equals sign or empty string, the name of the instruction is written, e.g. “ger”, optionally followed by instruction modifiers, e.g. “ger.atomic”. Then, a list of operands follows that is usually comma-separated but might also be printed in a custom format (e.g. for “load”, “store”, “subview”, etc.). If the instruction produces values, then the types of the returned values must be annotated after a colon.
Collective instructions#
Alloca#
value-instruction = "alloca" [dictionary-attribute] ":" memref-type
Overview#
The alloca instruction allocates temporary memory that is freed automatically at the end of the block that contains the alloca.
Attributes#
Alloca accepts the following named attributes:
| Name | Type | Description |
|---|---|---|
| alignment | integer-attribute | Base pointer alignment; must not be larger than the default alignment. |
Restrictions#
The memref’s size must be known at compile-time, i.e. the tensor shape must not contain any dynamic modes.
The address space must be “local”.
Axpby#
transpose = ".t" / ".n"
instruction =/ "axpby" [".atomic"] [transpose] local-identifier "," local-identifier ","
local-identifier "," local-identifier
Overview#
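Axpby implements
\[B := \alpha \text{op}(A) + \beta B\]
for vectors and matrices, where \(\text{op}(X)\) is defined as
\[\text{op}(X) := \begin{cases} X^T & \text{if transpose} = \text{“.t”}, \\ X & \text{otherwise.} \end{cases}\]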
If the atomic flag is set, B is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | A |
| 3 | number-type | \(\beta\) |
| 4 | memref-type | B |
Restrictions#
\(\text{shape}(B) = \text{shape}(\text{op}(A))\)
\(\text{order}(B) = 0 \lor \text{order}(B) = 1 \lor \text{order}(B) = 2\)
\(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(B)\)
\(\text{type}(\beta) \preceq \text{element_type}(B)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
Cumulative sum#
instruction =/ "cumsum" [".atomic"] local-identifier "," local-identifier "," integer-constant ","
local-identifier "," local-identifier
Overview#
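Computes the n-mode cumulative sum
\[B := \alpha A \times_n L_{s_n} + \beta B,\]
where \(L_{s_n}\) is the lower triangular matrix of ones of size \(s_n\times s_n\) and \(s_n\) is the n-th entry of the shape vector of A. In index notation, we have equivalently
\[B_{i_1 \dots i_{n-1} i_n i_{n+1} \dots i_M} := \alpha \sum_{k=1}^{i_n} A_{i_1 \dots i_{n-1} k i_{n+1} \dots i_M} + \beta B_{i_1 \dots i_{n-1} i_n i_{n+1} \dots i_M},\]
where \(M = \text{order}(A)\).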
If the atomic flag is set, B is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | A |
| 3 | integer-constant | n (summation mode) |
| 4 | number-type | \(\beta\) |
| 5 | memref-type | B |
Restrictions#
\(\text{order}(A) \geq 1\)
\(\text{shape}(A) = \text{shape}(B)\)
\(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(B)\)
\(\text{type}(\beta) \preceq \text{element_type}(B)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
Foreach#
instruction =/ "foreach" "(" local-identifier-list ")" "="
"(" local-identifier-list ")" "," "(" local-identifier-list ")" region
Overview#
A foreach loop that executes the loop’s range without any sequence guarantee. The region of a foreach is an SPMD region.
The three local identifier lists define the loop range and the local identifiers that make the trip count available within the loop body. All three lists must have the same length and have the following format:
\[(\text{var}_1, \dots, \text{var}_N) = (\text{from}_1, \dots, \text{from}_N), (\text{to}_1, \dots, \text{to}_N)\]
where \(N\) is the common length of each of the three lists. The loop range is defined as the cartesian product of the half-open intervals \([\text{from}_i; \text{to}_i)\) such that the trip count takes the values
\[(\text{var}_1, \dots, \text{var}_N) \in [\text{from}_1; \text{to}_1) \times \dots \times [\text{from}_N; \text{to}_N)\]
The integer type of a “from” and “to” pair must match.
The mapping of trip count to work-item is implementation-defined.
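The loop range can be illustrated by enumerating the cartesian product of the half-open intervals. This is a sketch only; the function name is hypothetical, and a set is returned on purpose because the language gives no ordering guarantee and leaves the mapping to work-items to the implementation.

```python
from itertools import product


def foreach_trip_counts(from_, to):
    """All trip-count tuples of a foreach loop, as an unordered set."""
    # One half-open interval [from_i, to_i) per loop mode.
    ranges = [range(lo, hi) for lo, hi in zip(from_, to)]
    return set(product(*ranges))
```

A loop over `(0,2),(2,4)` therefore visits the four tuples (0,2), (0,3), (1,2), and (1,3) in an unspecified order.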
Foreach tile#
instruction =/ "foreach_tile" "(" local-identifier-list ")" "="
"(" local-identifier-list ")" "," "(" local-identifier-list ")"
"as" "(" local-identifier-list ")" "<=" "(" integer-list ")"
region
integer-list = integer-constant *("," integer-constant)
Overview#
A foreach loop that partitions the loop range into tiles. The region of a foreach_tile is an SPMD region.
The first three local identifier lists define the loop range and the local identifiers that make the tile offset available within the loop body. All three lists must have the same length and have the following format:
\[(\text{var}_1, \dots, \text{var}_N) = (\text{from}_1, \dots, \text{from}_N), (\text{to}_1, \dots, \text{to}_N)\]
where \(N\) is the common length of each of the three lists and the integer type of a “from” and “to” pair must match. After “as” comes an identifier list that makes the tile shape available in the loop body and the constant upper bound for the tile shape, following the format
\[\text{as } (\text{size}_1, \dots, \text{size}_N) \text{ <= } (\text{tile_shape}_1, \dots, \text{tile_shape}_N)\]
The number of tiles in mode \(i=1,\dots,N\) is given by \(K_i = \lceil(\text{to}_i-\text{from}_i) / \text{tile_shape}_i\rceil\) and the tile offset takes the following values:
\[\text{var}_i \in \{\text{from}_i + k\,\text{tile_shape}_i : k = 0, \dots, K_i - 1\}\]
The size variable is given by
\[\text{size}_i = \min(\text{tile_shape}_i, \text{to}_i - \text{var}_i)\]
Therefore, the size is equal to the tile shape except for the loop remainder.
Restrictions#
The first entry of the tile shape (\(\text{tile_shape}_1\)) must be a multiple of the subgroup size. The tile offsets (\(\text{var}_i\)) must be dynamically uniform.
Example#
foreach_tile (%i,%j)=(%c0,%c0),(%c70,%c64) as (%ti,%tj)<=(32,32) {
    %c32 = constant 32 : index
    %is_remainder = less_than %ti, %c32 : bool
    if %is_remainder {
        %tile = cooperative_matrix_load.rows_checked %A[%i,%j] : coopmatrix<f32x32x32,matrix_acc>
        cooperative_matrix_store.rows_checked %tile, %B[%i,%j]
    } else {
        %tile = cooperative_matrix_load %A[%i,%j] : coopmatrix<f32x32x32,matrix_acc>
        cooperative_matrix_store %tile, %B[%i,%j]
    }
}
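To see which tiles the example produces, the tiling of one mode can be sketched as follows. The sketch assumes that tile offsets advance from the lower bound in steps of the tile shape and that the size is the tile shape clipped to the remaining range, which is consistent with the statement that the size equals the tile shape except for the loop remainder. The function name `tiles` is hypothetical.

```python
def tiles(from_, to, tile_shape):
    """List of (offset, size) pairs for one mode of a foreach_tile loop."""
    # Integer ceiling division gives the number of tiles K.
    num_tiles = -(-(to - from_) // tile_shape)
    result = []
    for k in range(num_tiles):
        offset = from_ + k * tile_shape
        # The last tile may be smaller than the tile shape (loop remainder).
        size = min(tile_shape, to - offset)
        result.append((offset, size))
    return result
```

For the row mode of the example (range 0 to 70, tile shape 32) this gives the offsets 0, 32, and 64 with sizes 32, 32, and 6, so only the last tile takes the rows_checked branch. The column mode (range 0 to 64) has two full tiles.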
GEMM#
instruction =/ "gemm" [".atomic"] [transpose] [transpose] local-identifier "," local-identifier ","
local-identifier "," local-identifier "," local-identifier
Overview#
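GEMM implements the well-known GEMM BLAS-3 operation.
\[C := \alpha \text{op}_1(A) \text{op}_2(B) + \beta C\]
The functions \(\text{op}_1\) and \(\text{op}_2\) are defined as
\[\text{op}_k(X) := \begin{cases} X^T & \text{if transpose}k = \text{“.t”}, \\ X & \text{otherwise,} \end{cases} \quad k \in \{1,2\},\]
where transpose1 and transpose2 refer to the first and second transpose modifier, respectively.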
If the atomic flag is set, C is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | A |
| 3 | memref-type | B |
| 4 | number-type | \(\beta\) |
| 5 | memref-type | C |
Restrictions#
\(\text{order}(A) = \text{order}(B) = \text{order}(C) = 2\)
\(\text{columns}(\text{op}_1(A)) = \text{rows}(\text{op}_2(B))\)
\(\text{rows}(C) = \text{rows}(\text{op}_1(A))\)
\(\text{columns}(C) = \text{columns}(\text{op}_2(B))\)
\(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(B)) \preceq \text{element_type}(C)\)
\(\text{type}(\beta) \preceq \text{element_type}(C)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
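The shape restrictions and the effect of the transpose modifiers can be illustrated with a plain reference implementation. This is a sketch that assumes the standard BLAS semantics, i.e. C is replaced by alpha times op1(A) times op2(B) plus beta times C, with ".t" meaning a plain transpose; the names `gemm_reference`, `trans_a`, and `trans_b` are hypothetical, and matrices are nested lists indexed as `m[row][column]`.

```python
def transpose(matrix):
    """Plain (non-conjugating) transpose of a nested-list matrix."""
    return [list(column) for column in zip(*matrix)]


def gemm_reference(alpha, a, b, beta, c, trans_a=False, trans_b=False):
    """Return alpha * op1(a) * op2(b) + beta * c as a new matrix."""
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    m, k, n = len(op_a), len(op_b), len(c[0])
    # Shape restrictions: columns(op1(A)) = rows(op2(B)),
    # rows(C) = rows(op1(A)), columns(C) = columns(op2(B)).
    if len(op_a[0]) != k or len(c) != m or len(op_b[0]) != n:
        raise ValueError("shape mismatch")
    return [[alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(n)]
            for i in range(m)]
```

The sketch ignores type promotion and the atomic flag; it only mirrors the arithmetic and the shape checks.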
GEMV#
instruction =/ "gemv" [".atomic"] [transpose] local-identifier "," local-identifier ","
local-identifier "," local-identifier "," local-identifier
Overview#
GEMV implements the well-known GEMV BLAS-2 operation.
\[c := \alpha \text{op}_1(A) b + \beta c\]
where \(\text{op}_1\) is defined as in GEMM.
If the atomic flag is set, c is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | A |
| 3 | memref-type | b |
| 4 | number-type | \(\beta\) |
| 5 | memref-type | c |
Restrictions#
\(\text{order}(A) = 2\)
\(\text{order}(b) = \text{order}(c) = 1\)
\(\text{columns}(\text{op}_1(A)) = \text{rows}(b)\)
\(\text{rows}(c) = \text{rows}(\text{op}_1(A))\)
\(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(b)) \preceq \text{element_type}(c)\)
\(\text{type}(\beta) \preceq \text{element_type}(c)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
GER#
instruction =/ "ger" [".atomic"] local-identifier "," local-identifier ","
local-identifier "," local-identifier "," local-identifier
Overview#
Computes the general rank-1 update:
\[C := \alpha a b^T + \beta C\]
If the atomic flag is set, C is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | a |
| 3 | memref-type | b |
| 4 | number-type | \(\beta\) |
| 5 | memref-type | C |
Restrictions#
\(\text{order}(a) = \text{order}(b) = 1\)
\(\text{order}(C) = 2\)
\(\text{rows}(C) = \text{rows}(a)\)
\(\text{columns}(C) = \text{rows}(b)\)
\(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(a), \text{element_type}(b)) \preceq \text{element_type}(C)\)
\(\text{type}(\beta) \preceq \text{element_type}(C)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
Hadamard product#
instruction =/ "hadamard" [".atomic"] local-identifier "," local-identifier ","
local-identifier "," local-identifier "," local-identifier
Overview#
Computes the Hadamard product of two vectors or two matrices. That is, in index notation we have
\[c_i := \alpha a_i b_i + \beta c_i\]
for vectors and
\[C_{ij} := \alpha A_{ij} B_{ij} + \beta C_{ij}\]
for matrices. If the atomic flag is set, c/C is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | a/A |
| 3 | memref-type | b/B |
| 4 | number-type | \(\beta\) |
| 5 | memref-type | c/C |
Restrictions#
\(\text{order}(a) = \text{order}(b) = \text{order}(c) = o\) with \(o\in\{1,2\}\)
\(\text{shape}(a) = \text{shape}(b) = \text{shape}(c)\)
\(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(a), \text{element_type}(b)) \preceq \text{element_type}(c)\)
\(\text{type}(\beta) \preceq \text{element_type}(c)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
Parallel#
instruction =/ "parallel" region
Overview#
Opens an SPMD region.
Sum#
instruction =/ "sum" [".atomic"] [transpose] local-identifier "," local-identifier ","
local-identifier "," local-identifier
Overview#
Computes the matrix-vector product or the dot product of A with a vector of ones. That is, if the result is a vector we have
\[b := \alpha \text{op}(A) \vec{1} + \beta b,\]
where \(\text{op}(A)\) is defined as in the axpby instruction, and if the result is a scalar we have
\[b := \alpha \langle A, \vec{1} \rangle + \beta b\]
If the atomic flag is set, b is updated atomically.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | \(\alpha\) |
| 2 | memref-type | A |
| 3 | number-type | \(\beta\) |
| 4 | memref-type | b |
Restrictions#
\(\text{order}(b) = 1 \lor \text{order}(b) = 0\)
\(\text{order}(A) = \text{order}(b)+1\)
\(\text{rows}(b) = \text{rows}(\text{op}(A)) \text{ if } \text{order}(b) = 1\)
\(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(b)\)
\(\text{type}(\beta) \preceq \text{element_type}(b)\)
If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).
Additional instructions#
instruction =/ "lifetime_stop" local-identifier
Mixed instructions#
Arithmetic (binary)#
arith-binary-type = "add" /
"sub" /
"mul" /
"div" /
"rem" /
"max" /
"min" /
"shl" /
"shr" /
"and" /
"or" /
"xor"
value-instruction =/ arith-binary-type local-identifier "," local-identifier
":" (boolean-type / number-type / coopmatrix-type)
Overview#
Binary arithmetic operation on scalars and cooperative matrices. Both operands, as well as the returned type, have the same scalar or component type. Arithmetic on cooperative matrices is done component-wise.
The following table shows the operations’ description and the types that are allowed for the operation. The backslash “\” is used to exclude types from the list of allowed types.
| Op | Allowed type | Description |
|---|---|---|
| add | number-type | Sum of operands |
| sub | number-type | Difference of operands |
| mul | number-type | Product of operands |
| div | number-type | Quotient of operands |
| rem | number-type \ complex-type | Remainder from the division of operands |
| max | number-type \ complex-type | Maximum of operands |
| min | number-type \ complex-type | Minimum of operands |
| shl | integer-type | Left shift first operand by second operand |
| shr | integer-type | Arithmetic right shift first operand by second operand |
| and | boolean-type / integer-type | Bitwise and |
| or | boolean-type / integer-type | Bitwise or |
| xor | boolean-type / integer-type | Bitwise xor |
Arithmetic (unary)#
arith-unary-type = "abs" /
"neg" /
"not" /
"conj" /
"im" /
"re"
value-instruction =/ arith-unary-type local-identifier
":" (number-type / coopmatrix-type)
Overview#
Unary arithmetic operation on scalars and cooperative matrices. For integer and floating point input, the operand must have the same type as the returned value. For complex input, the returned value has the component floating point type for “abs”, “im”, and “re”, and the returned value has the same type as the operand for “neg” and “conj”. Arithmetic on cooperative matrices is done component-wise.
The following table shows the operations’ description and the types that are allowed for the operation.
| Op | Allowed type | Description |
|---|---|---|
| abs | number-type | Compute absolute value |
| neg | number-type | Negation |
| not | boolean-type / integer-type | Bitwise not |
| conj | complex-type | Complex conjugate |
| im | complex-type | Extract imaginary part |
| re | complex-type | Extract real part |
Associated#
value-instruction =/ "associated" local-identifier ":" boolean-type
Overview#
Checks whether a memref or group is associated. Returns true if the base address is non-null.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type / group-type | tensor |
Returns#
True if the memref or group is associated, that is, if the base address is non-null; false if the base address is a null pointer.
Atomic load#
value-instruction =/ "atomic_load" [memory_scope] [memory_semantics]
local-identifier "[" [local-identifier-list] "]"
":" number-type
memory_scope = ".cross_device" /
".device" /
".work_group" /
".subgroup"
memory_semantics = ".relaxed" /
".acquire" /
".release" /
".acquire_release" /
".sequentially_consistent"
Overview#
Load the element given by the index list from a memref atomically. The number of indices must match the order of the memref and a single index must be given for a group.
The load is atomic; the default scope is “work_group” and the default memory semantics is “relaxed”.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type / group-type | tensor |
| 2… | index | index list |
Returns#
A value of the memref’s element type.
Atomic store#
instruction =/ "atomic_store" [memory_scope] [memory_semantics] local-identifier ","
local-identifier "[" [local-identifier-list] "]"
Overview#
Store a scalar value (first operand) in a memref (second operand) at the position given by the index list. The number of indices must match the order of the memref.
The store is atomic and the default scope is “work_group” and the default memory semantics is “relaxed”.
When storing a complex value the store may be pseudo-atomic, meaning that an atomic store is used for the real and imaginary parts separately.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | value |
| 2 | memref-type | tensor |
| 3… | index | index list |
Restrictions#
\(\text{type}(value) = \text{element_type}(tensor)\)
Atomic update#
atomic-update-op = "atomic_add" /
"atomic_min" /
"atomic_max"
value-instruction =/ atomic-update-op [memory_scope] [memory_semantics] local-identifier ","
local-identifier "[" [local-identifier-list] "]"
":" number-type
Overview#
Atomically update a memref (second operand) with a scalar value (first operand) at the position given by the index list. The number of indices must match the order of the memref, and the return type must match the memref’s element type.
The following steps are done atomically: The value at the memory location is fetched, the fetched value is combined with the operand value using the selected operation (add, min, or max), and the resulting value is stored at the memory location. The default scope is “work_group” and the default memory semantics is “relaxed”.
When updating a complex value the update may be pseudo-atomic, meaning that an atomic update is used for the real and imaginary parts separately.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | value |
| 2 | memref-type | tensor |
| 3… | index | index list |
Restrictions#
\(\text{type}(value) = \text{element_type}(tensor)\)
Barrier#
instruction =/ "barrier" [".global"] [".local"]
Overview#
Note: Barriers are inserted automatically in collective regions, but not in SPMD regions. Manual barrier insertion should only be necessary in SPMD regions.
Control barrier. The barrier must be encountered by all work-items. A work-item in a work-group is not allowed to continue until all work-items in the work-group have reached the barrier.
Additional memory fences are controlled by the following attributes:
| Attribute | Description |
|---|---|
| .global | Ensure that global memory accesses become visible to the work-group. |
| .local | Ensure that local memory accesses become visible to the work-group. |
Builtin (mixed)#
mixed-builtin-type = "group_id" comp3 /
"num_groups" comp3 /
"num_subgroups" comp3 /
"subgroup_size"
comp3 = ".x" / ".y" / ".z"
value-instruction =/ mixed-builtin-type ":" integer-type
Overview#
Returns a builtin value.
The group id is three dimensional; the mode is selected with the .x, .y, and .z suffix. Each mode starts with zero and is limited by the corresponding num_groups mode. That is,
\[0 \leq \text{group_id}_i < \text{num_groups}_i, \quad i \in \{x, y, z\}.\]
The number of subgroups is related to the 2-dimensional work-group size as follows:
The following table shows the builtins’ description and the types that are returned.
| Builtin | Type | OpenCL analogue | Description |
|---|---|---|---|
| group_id | index | get_group_id | Returns the x, y, or z mode of the group id |
| num_groups | index | get_num_groups | Returns number of groups in the x, y, or z mode |
| num_subgroups | i32 | N/A | Returns the number of subgroups in the x, y, or z mode |
| subgroup_size | i32 | get_max_sub_group_size | Returns the subgroup size |
Cast#
value-instruction =/ "cast" local-identifier ":" number-type
value-instruction =/ "cast" local-identifier ":" coopmatrix-type
Overview#
Cast scalar values or cooperative matrices to type indicated after the colon.
The source type must be a coopmatrix type if the destination type is a coopmatrix type, and the shapes must match. The coopmatrix use must either match, or the use of the source type must be matrix_acc and the use of the destination type must be matrix_a or matrix_b.
Casts from complex types to non-complex types are forbidden. The following table summarizes the casts and the mapping to SPIR-V (the casts are done component-wise for coopmatrix types):
| Operand type | Result type | SPIR-V Op |
|---|---|---|
| integer-type | integer-type | OpSConvert |
| floating-type | floating-type | OpFConvert |
| complex-type | complex-type | OpFConvert (on vector2) |
| integer-type | floating-type | OpConvertSToF |
| floating-type | integer-type | OpConvertFToS |
| floating-type | complex-type | OpFConvert on real part, imaginary part is zero |
| integer-type | complex-type | OpConvertSToF on real part, imaginary part is zero |
| complex-type | integer-type | Forbidden |
| complex-type | floating-type | Forbidden |
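The cast table can be read as a lookup from the pair (operand kind, result kind) to a SPIR-V operation. The following Python sketch only restates the table; the names `CAST_TABLE` and `spirv_cast_op` and the kind strings are hypothetical and not part of the tensor language or of any compiler API.

```python
# Hypothetical lookup that mirrors the cast table; all names are illustrative only.
CAST_TABLE = {
    ("integer", "integer"): "OpSConvert",
    ("floating", "floating"): "OpFConvert",
    ("complex", "complex"): "OpFConvert (on vector2)",
    ("integer", "floating"): "OpConvertSToF",
    ("floating", "integer"): "OpConvertFToS",
    ("floating", "complex"): "OpFConvert on real part, imaginary part is zero",
    ("integer", "complex"): "OpConvertSToF on real part, imaginary part is zero",
}


def spirv_cast_op(operand_kind, result_kind):
    """Return the SPIR-V op listed for a cast, or raise if the cast is forbidden."""
    if operand_kind == "complex" and result_kind != "complex":
        raise ValueError("casts from complex to non-complex types are forbidden")
    return CAST_TABLE[(operand_kind, result_kind)]
```

For example, `spirv_cast_op("integer", "floating")` yields the entry for an integer-to-floating cast, while a complex-to-floating request raises an error.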
Comparison#
comparison-type = "equal" /
"not_equal" /
"greater_than" /
"greater_than_equal" /
"less_than" /
"less_than_equal"
value-instruction =/ comparison-type local-identifier "," local-identifier ":" "bool"
Overview#
Scalar comparison. Both operands must have the same scalar type and the returned value has boolean type.
The following table shows the comparisons’ description and the types that are allowed for the comparison. The backslash “\” is used to exclude types from the list of allowed types.
| Cond | Allowed type | Description |
|---|---|---|
| equal | number-type | Equal |
| not_equal | number-type | Not equal |
| greater_than | number-type \ complex-type | Greater than |
| greater_than_equal | number-type \ complex-type | Greater than or equal |
| less_than | number-type \ complex-type | Less than |
| less_than_equal | number-type \ complex-type | Less than or equal |
Constant#
value-instruction =/ "constant" constant ":" (boolean-type / number-type / coopmatrix-type)
Overview#
Sets the result value to a constant value. The type of the constant must match the scalar or component type (e.g. an integer type requires an integer-constant and a floating type requires a floating-constant).
When the result is a cooperative matrix, all entries are set to the same constant value.
Expand#
value-instruction =/ "expand" local-identifier "[" integer-constant "->" expand-shape "]" ":" memref-type
expand-shape = integer-constant-or-identifier 1*("x" integer-constant-or-identifier)
integer-constant-or-identifier = integer-constant / local-identifier
Overview#
The expand instruction returns a view on a tensor with a mode viewed as higher-order mode.
Operands#
The first argument must point to a value of memref type. The first integer constant before “->” gives the mode that shall be expanded. The expand shape coming after “->” gives the new shape of the mode. Dynamic values in the expand shape must have index type.
Restrictions#
The memref type of the result must conform with the following rules:
Element type and address space must match the operand’s memref type.
Shape: The mode size is replaced with the expand shape. The product of the expand shape must equal the size of the expanded mode.
expand %0[1 -> 2x8] : memref<f32x32x2x8x8> ; %0: memref<f32x32x16x8>
expand %0[1 -> 2x2x2x2] : memref<f32x32x2x2x2x2x8> ; %0: memref<f32x32x16x8>
Identifiers: Local identifiers in the expand shape are dynamic in the resulting memref type. The product of the dynamic expand shape must equal the size of the expanded mode.
expand %0[1 -> %1 x 2] : memref<f32x32x?x2> ; %0: memref<f32x32x?>
expand %0[1 -> 2 x %1] : memref<f32x32x2x?> ; %0: memref<f32x32x?>
expand %0[1 -> %1 x 2] : memref<f32x32x?x2> ; %0: memref<f32x32x16>
expand %0[1 -> %1 x 2] : memref<f32x32x?x2> ; %0: memref<f32x32x?>
expand %0[1 -> %1 x %2 x 2] : memref<f32x32x?x?x2> ; %0: memref<f32x32x16>
expand %0[1 -> %2 x 2 x %1] : memref<f32x32x?x2x?> ; %0: memref<f32x32x16>
expand %0[1 -> %1 x %2] : memref<f32x32x?x?> ; %0: memref<f32x32x?>
expand %0[1 -> %1 x %2] : memref<f32x32x?x?> ; %0: memref<f32x32x16>
Note: In the third example above, %1 must be equal to 8. The output mode corresponding to %1 is still dynamic.
Stride: A new stride entry is entered that follows the canonical stride computation. It is also permissible to put ‘?’ for a stride instead of the constant value.
expand %0[0->4 x 8] : memref<f32x4x8x7,strided<2,8,64>> ; %0: memref<f32x32x7,strided<2,64>>
expand %0[0->4 x 8] : memref<f32x4x8x7,strided<2,?,?>> ; %0: memref<f32x32x7,strided<2,64>>
expand %0[0->%1 x 4] : memref<f32x?x4x7,strided<2,?,?>> ; %0: memref<f32x?x7,strided<2,?>>
expand %0[0->4 x %1] : memref<f32x4x?x7,strided<2,8,?>> ; %0: memref<f32x?x7,strided<2,?>>
expand %0[0->4 x %1] : memref<f32x4x?x7,strided<2,?,?>> ; %0: memref<f32x?x7,strided<2,?>>
Further restrictions:
The product of the expand shape must be the same as the mode size.
If the product of the expand shape is only known at runtime, then it is undefined behaviour if the dynamic product does not match the mode size.
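The shape and stride rules for expand can be sketched in Python for the fully static case. The sketch assumes that "canonical stride computation" means each new mode's stride is the stride of the previous new mode times that mode's extent, which agrees with the strided example above; the function name `expand_mode` is hypothetical.

```python
def expand_mode(shape, strides, mode, expand_shape):
    """Return (shape, strides) after expanding `mode` into `expand_shape` (static sizes only)."""
    product = 1
    for extent in expand_shape:
        product *= extent
    if product != shape[mode]:
        raise ValueError("product of the expand shape must equal the mode size")

    # Assumed canonical strides: the first new mode keeps the old stride,
    # every following mode multiplies by the extent of its predecessor.
    new_strides = []
    stride = strides[mode]
    for extent in expand_shape:
        new_strides.append(stride)
        stride *= extent

    new_shape = tuple(shape[:mode]) + tuple(expand_shape) + tuple(shape[mode + 1:])
    return new_shape, tuple(strides[:mode]) + tuple(new_strides) + tuple(strides[mode + 1:])
```

With shape `(32, 7)` and strides `(2, 64)`, expanding mode 0 into `(4, 8)` gives shape `(4, 8, 7)` and strides `(2, 8, 64)`.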
For#
multi-value-instruction = "for" local-identifier "="
local-identifier "," local-identifier ["," local-identifier]
["init" "(" init-value-list ")" "->" "(" return-type-list ")" ] region
["attributes" dictionary-attribute]
init-value-list = init-value *("," init-value)
init-value = local-identifier "=" local-identifier
return-type-list = return-type *("," return-type)
return-type = boolean-type / number-type / coopmatrix-type
Overview#
A for loop. Instructions in the for loop execute sequentially and its region is a mixed region.
Arguments#
The loop variable is stored in the first local identifier and is accessible within the loop body. The loop’s range [from; to) is given by the first and the second local identifier after the equals sign, and a step size may be given with the third local identifier after the equals sign. The step size defaults to 1 if omitted. The integer type of “from”, “to”, and “step” must be identical, and the integer type of the loop variable follows the loop range’s type.
Values that are given in the init-value-list may be carried from one iteration to the next. The local identifier on the left-hand side of the init-value expression gives the name of the loop-carried value as it is accessible in the loop body. The local identifier given on the right-hand side of the init-value expression determines the initial value of the loop-carried value, and its type must coincide with the corresponding entry of the return-type-list. When loop-carried values are present, the loop’s last instruction must be a yield instruction that updates the loop-carried values for the next iteration. The number and types of the yielded values must correspond to the return-type-list.
Returns#
The final values of the loop-carried values are returned by the for instruction.
Example:
%from = constant 2 : i32
%to = constant 6 : i32
%f0 = constant 0 : i64
%f1 = constant 1 : i64
%fn_1, %fn = for %n=%from,%to init(%fn_2=%f0,%fn_1=%f1) -> (i64,i64) {
    %fn = add %fn_2, %fn_1 : i64
    yield (%fn_1, %fn)
}
; %fn_1 contains the fourth Fibonacci number and %fn the fifth Fibonacci number
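The loop-carried value mechanism of the example can be traced with a small Python model. This is only a sketch of the semantics described above; `run_for` and `fib_body` are hypothetical helper names, and the returned tuple of the body plays the role of the yield instruction.

```python
def run_for(start, stop, step, init, body):
    """Call body(n, *carried) for n in [start, stop) and return the final carried values."""
    carried = tuple(init)
    n = start
    while n < stop:
        # The tuple returned by the body replaces the loop-carried values,
        # like `yield` does at the end of the loop region.
        carried = tuple(body(n, *carried))
        n += step
    return carried


def fib_body(n, fn_2, fn_1):
    fn = fn_2 + fn_1
    return fn_1, fn


result = run_for(2, 6, 1, (0, 1), fib_body)  # (3, 5)
```

The four iterations (n = 2, 3, 4, 5) carry the pairs (1, 1), (1, 2), (2, 3), and (3, 5), so the loop returns 3 and 5.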
Attributes#
The following named attributes may be passed in the attribute dictionary:
| Name | Type | Description |
|---|---|---|
| unroll | boolean-attribute or integer-attribute | true: request to unroll loop, false: request to not unroll loop, integer: partial unroll count |
Fuse#
value-instruction =/ "fuse" local-identifier "[" integer-constant "," integer-constant "]"
":" memref-type
Overview#
The fuse instruction returns a view on a tensor with two or more adjacent modes viewed as a single mode.
Fused modes are specified as the interval [from, to], where counting starts from 0. From and to must refer to existing modes, that is, we require \(0 \leq \text{from} < \text{to} < \text{order}(\text{tensor})\). Moreover, the stride vector S and the shape vector s must satisfy the following compatibility condition:
\(\forall k \in [\text{from},\text{to}): S_{k}s_{k} = S_{k+1}\)
If S(i:j) and s(i:j) are known at compile time, the fuse instruction is illegal if the compatibility condition is not satisfied. If a single entry in S(i:j) or s(i:j) is dynamic, then fusing modes that violate the compatibility condition is undefined behaviour, e.g.
; Illegal, modes cannot be fused
fuse %0[0,1] : memref<f32x128> ; %0: memref<f32x8x16,strided<1,10>>
; Undefined behaviour if dynamic stride != 8
fuse %0[0,1] : memref<f32x128,strided<1,?>> ; %0: memref<f32x8x16,strided<1,?>>
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type | tensor |
| 2 | integer-constant | from |
| 3 | integer-constant | to |
Restrictions#
The memref type of the result must conform with the following rules:
Element type and address space must match the operand’s memref type.
Shape: The mode size of the fused modes is the product of the mode sizes. If one mode is dynamic the fused mode size is dynamic.
fuse %0[1,3] : memref<f32x32x512x42> ; %0: memref<f32x32x16x8x4x42>
fuse %0[1,3] : memref<f32x32x?x42,strided<1,32,?>> ; %0: memref<f32x32x16x?x4x42,strided<1,32,?,?,?>>
Stride: Strides remain unchanged or are replaced by ‘?’.
fuse %0[1,2] : memref<f32x32x32x2,strided<1,48,1536>> ; %0: memref<f32x32x16x2x2,strided<1,48,768,1536>>
fuse %0[1,2] : memref<f32x32x32x2,strided<1,?,?>> ; %0: memref<f32x32x16x2x2,strided<1,48,768,1536>>
fuse %0[0,1] : memref<f32x?x32,strided<1,?>> ; %0: memref<f32x8x?x32,strided<1,?,?>>
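For static shapes and strides, the compatibility condition and the resulting memref type can be checked with a few lines of Python. This is a sketch of the rules in this section only; `fuse_modes` is a hypothetical name and dynamic ('?') entries are not modelled.

```python
def fuse_modes(shape, strides, first, last):
    """Fuse modes first..last (inclusive) of a tensor with static shape and strides."""
    if not (0 <= first < last < len(shape)):
        raise ValueError("need 0 <= from < to < order(tensor)")

    # Compatibility condition: S_k * s_k == S_{k+1} for all k in [from, to).
    for k in range(first, last):
        if strides[k] * shape[k] != strides[k + 1]:
            raise ValueError("modes %d and %d cannot be fused" % (k, k + 1))

    fused_size = 1
    for k in range(first, last + 1):
        fused_size *= shape[k]

    new_shape = tuple(shape[:first]) + (fused_size,) + tuple(shape[last + 1:])
    # The fused mode keeps the stride of its first mode; other strides are unchanged.
    new_strides = tuple(strides[:first + 1]) + tuple(strides[last + 1:])
    return new_shape, new_strides
```

Fusing modes 1 and 2 of shape `(32, 16, 2, 2)` with strides `(1, 48, 768, 1536)` gives shape `(32, 32, 2)` and strides `(1, 48, 1536)`, whereas shape `(8, 16)` with strides `(1, 10)` is rejected.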
If#
multi-value-instruction =/ "if" local-identifier ["->" "(" return-type-list ")"]
region ["else" region]
Overview#
An if statement. Both regions are mixed regions.
The condition (first operand) must have boolean type.
Returns#
The if instruction may return multiple values, where the number of values and the value types are given by the return-type-list. If values are returned, the last instruction in both the “then”-region and the “else”-region must be a yield instruction (the “else”-region cannot be omitted).
Example:
%c16 = constant 16 : i32
%1 = less_than %0, %c16 : bool
%x = if %1 -> (i32) {
    yield (%0)
} else {
    yield (%c16)
}
Load#
value-instruction =/ "load" local-identifier "[" [local-identifier-list] "]"
":" scalar-or-memref-type
scalar-or-memref-type = number-type / memref-type
Overview#
Load the element given by the index list from a memref or group. The number of indices must match the order of the memref and a single index must be given for a group.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type / group-type | tensor |
| 2… | index | index list |
Returns#
A value of the memref’s element type or the group’s memref type. Examples:
load %0[] : f32 ; %0: memref<f32>
load %0[5, %1] : f32 ; %0: memref<f32x10x?>
load %0[%1] : memref<f32x42> ; %0: group<memref<f32x42>x?>
load %0[%1] : memref<f32x42> ; %0: group<memref<f32x42>x?, offset: ?>
Math (unary)#
math-unary-type = "cos" /
"sin" /
"exp" /
"exp2" /
"log" /
"log2" /
"native_cos" /
"native_sin" /
"native_exp" /
"native_exp2" /
"native_log" /
"native_log2"
value-instruction =/ math-unary-type local-identifier ":" number-type
Overview#
Unary math operation on scalars. The operand must have the same type as the returned value.
The following table shows the operations’ description and the types that are allowed for the operation.
| Op | Allowed type | Description |
|---|---|---|
| cos | floating-type | Compute cosine function |
| sin | floating-type | Compute sine function |
| exp | floating-type / complex-type | Compute base-e exponential function |
| exp2 | floating-type / complex-type | Compute base-2 exponential function |
| log | floating-type | Compute base-e logarithm function |
| log2 | floating-type | Compute base-2 logarithm function |
| native_cos | floating-type | Compute cosine function with implementation-defined error |
| native_sin | floating-type | Compute sine function with implementation-defined error |
| native_exp | floating-type / complex-type | Compute base-e exponential function with implementation-defined error |
| native_exp2 | floating-type / complex-type | Compute base-2 exponential function with implementation-defined error |
| native_log | floating-type | Compute base-e logarithm function with implementation-defined error |
| native_log2 | floating-type | Compute base-2 logarithm function with implementation-defined error |
Size#
value-instruction =/ "size" local-identifier "[" integer-constant "]" ":" "index"
Overview#
The size instruction returns the i-th entry of the tensor’s shape, where “i” is given by the integer constant in square brackets. “i” must be in bounds, i.e. \(0 \leq i < \text{order}(tensor)\).
For group types, the group size is returned and “i” must be 0.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type / group-type | tensor |
| 2 | integer-constant | mode index |
Subview#
value-instruction =/ "subview" local-identifier "[" [index-or-slice-list] "]"
":" memref-type
index-or-slice-list = index-or-slice *("," index-or-slice)
index-or-slice = integer-constant-or-identifier [":" integer-constant-or-identifier]
Overview#
The subview instruction returns a view on a tensor.
The first argument must point to a value of memref type. The number of indices in square brackets must match the order of the memref type. The indices are either given as single index or as a slice, where slices are given in offset plus size notation (“%offset : %size”). E.g. the slice “%0 : %1” extracts a block of %1 elements beginning from %0, which is equivalent to the index interval [%0, %0 + %1).
Note
A slice is often defined as “%0 : %1” being the index interval [%0, %1). However, then the compiler needs to figure out whether %1 - %0 is constant or not in order to determine whether the mode size is known at compile-time or not. Therefore, we prefer the offset plus size notation.
Zero sizes are used to encode that a rank-reduction is required, that is, the mode with size 0 is removed from the output memref type. A single index is syntactic sugar for offset plus size 0, e.g. %0 is syntactic sugar for %0:0. (Note that a zero-sized mode, e.g. in memref<f32x8x0>, is nonsensical, because any multi-index passed to the memref would be out-of-bounds. However, a one-sized mode, e.g. memref<f32x8x1>, might be desirable.) A dynamic size of zero is undefined behaviour.
There is no run-time check whether the indices are within bounds. Offset and size must be of index type. Offset must be non-negative and size must be positive.
Restrictions#
The memref type of the result must conform with the following rules:
Element type and address space must match the operand’s memref type.
Invariant-stride: The stride is not changed or replaced with ‘?’.
subview %0[4:8,8:4] : memref<f32x8x4,strided<1,32>> ; %0: memref<f32x32x16>
subview %0[4:8,8:4] : memref<f32x8x4,strided<1,?>> ; %0: memref<f32x32x16>
Rank-reduction: A mode accessed by offset only or a mode with size statically known to be 0 is removed from the output tensor.
subview %0[2:4, %1] : memref<f32x4> ; %0: memref<f32x16x8>
subview %0[2:4, %1:0] : memref<f32x4> ; %0: memref<f32x16x8>
subview %0[2:4, %1:1] : memref<f64x4x1,strided<1,16>> ; %0: memref<f64x16x8>
Output-mode size: The size of the output mode is determined by the size field of a slice and may be dynamic.
subview %0[%1:4] : memref<f32x4> ; %0: memref<f32x16>
subview %0[%2:%2] : memref<f32x?> ; %0: memref<f32x16>
subview %0[2:4, %2:%2, 6:7] : memref<f32x4x?x7,strided<1,16,672>> ; %0: memref<f32x16x42x13>
subview %0[2:4, %2:%2, 6:7] : memref<f32x4x?x7,strided<1,?,?>> ; %0: memref<f32x16x42x13,strided<1,?,?>>
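The three rules (invariant stride, rank-reduction, output-mode size) can be combined into a short Python sketch for static offsets and sizes. The function name `subview_type` is hypothetical, and the bounds check is added for illustration only, since the language itself performs no run-time check.

```python
def subview_type(shape, strides, slices):
    """Return (shape, strides) of a subview; each slice is (offset, size), size 0 drops the mode."""
    if len(slices) != len(shape):
        raise ValueError("need one index or slice per mode")
    new_shape, new_strides = [], []
    for extent, stride, (offset, size) in zip(shape, strides, slices):
        # Illustrative check only; a single index (size 0) still touches one element.
        if offset < 0 or offset + max(size, 1) > extent:
            raise ValueError("slice out of bounds")
        if size == 0:
            continue  # rank-reduction: the mode is removed from the result
        new_shape.append(size)      # output-mode size is the size field of the slice
        new_strides.append(stride)  # invariant stride
    return tuple(new_shape), tuple(new_strides)
```

For a 32x16 operand with strides `(1, 32)`, the slices `(4, 8)` and `(8, 4)` give shape `(8, 4)` with unchanged strides; replacing the second slice by a single index removes that mode.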
Store#
instruction =/ "store" local-identifier ","
local-identifier "[" [local-identifier-list] "]"
Overview#
Store a scalar value (first operand) in a memref (second operand) at the position given by the index list. The number of indices must match the order of the memref.
Note: Store should only be used in SPMD regions as otherwise the same memory location is written from all work-items.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | value |
| 2 | memref-type | tensor |
| 3… | index | index list |
Restrictions#
\(\text{type}(value) = \text{element_type}(tensor)\)
Yield#
instruction =/ "yield" "(" [local-identifier-list] ")"
Overview#
Yield returns values from an if or for instruction.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1… | boolean-type / number-type / coopmatrix-type | value |
SPMD instructions#
Builtin (SPMD)#
spmd-builtin-type = "subgroup_id" comp3 /
"subgroup_linear_id" /
"subgroup_local_id"
value-instruction =/ spmd-builtin-type ":" integer-type
Overview#
Returns a builtin value.
The mode of the subgroup id is selected with the .x, .y, and .z suffix. Each mode starts with zero and is limited by the corresponding num_subgroups mode. That is,
\[0 \leq \text{subgroup_id}_i < \text{num_subgroups}_i, \quad i \in \{x, y, z\}.\]
The subgroup linear id combines the x, y, and z modes of the subgroup id as follows (note that \(\text{subgroup_id}_z = 0\) due to \(\text{num_subgroups}_z = 1\)):
The subgroup local id is the invocation id within the subgroup and ranges from 0 to subgroup_size-1.
The following table shows the builtins’ description and the types that are returned.
| Builtin | Type | OpenCL analogue | Description |
|---|---|---|---|
| subgroup_id | i32 | N/A | Returns the x, y, or z mode of the subgroup id |
| subgroup_linear_id | i32 | get_sub_group_id | Returns linear subgroup id |
| subgroup_local_id | i32 | get_sub_group_local_id | Returns the local invocation id in the subgroup |
Cooperative matrix apply#
value-instruction =/ "cooperative_matrix_apply"
"(" local-identifier "," local-identifier "," local-identifier ")"
"=" local-identifier
"->" coopmatrix-type region
Overview#
Apply an action on every component of a coopmatrix and update the component with the result of the action. The action is described in the parallel region of the instruction.
Arguments#
The first three local identifiers introduce SSA values for the row index, column index, and component value. The row and column indices have i32 type and the component value has the same type as the component type of the resulting coopmatrix type. The fourth identifier, after “=”, gives the input coopmatrix, and its type must match the result type.
The region must yield exactly one value whose scalar type is identical to the component type of the coopmatrix.
Example:
%0 = ... ; contains a coopmatrix of type coopmatrix<f32x16x16,matrix_acc>
%1 = cooperative_matrix_apply (%i,%j,%v)=%0 -> coopmatrix<f32x16x16,matrix_acc> {
    %mask = less_than_equal %i, %j : bool
    %exp_v_masked = if %mask -> (f32) {
        %exp_v = native_exp %v : f32
        yield (%exp_v)
    } else {
        %zero = constant 0.0 : f32
        yield (%zero)
    }
    yield (%exp_v_masked)
}
; The entries of %1 are given by %1[i,j] = exp(%0[i,j]) if i <= j else 0
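The effect of the example can be reproduced on a small matrix with a Python model in which the region becomes a callback that receives the row index, column index, and component value. The names `coopmatrix_apply` and `masked_exp` are hypothetical, and a list of rows stands in for the coopmatrix.

```python
import math


def coopmatrix_apply(matrix, action):
    """Return a new matrix where every component is replaced by action(i, j, value)."""
    return [
        [action(i, j, value) for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


def masked_exp(i, j, v):
    # Same action as the region in the example: exp on and above the diagonal, 0 below.
    return math.exp(v) if i <= j else 0.0


result = coopmatrix_apply([[0.0, 1.0], [2.0, 0.0]], masked_exp)
```

Here the entry below the diagonal becomes 0, and all other entries are exponentiated.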
Cooperative matrix atomic load#
value-instruction =/ "cooperative_matrix_atomic_load" [transpose] [checked-flag]
[memory_scope] [memory_semantics]
local-identifier "[" local-identifier "," local-identifier "]"
":" coopmatrix-type
Overview#
Atomic matrix load. Atomicity is meant component-wise; there is no atomicity w.r.t. the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.
Except for atomicity, the instruction is identical to the Cooperative matrix load instruction.
Cooperative matrix atomic store#
instruction =/ "cooperative_matrix_atomic_store" [transpose] [checked-flag]
[memory_scope] [memory_semantics]
local-identifier "," local-identifier
"[" local-identifier "," local-identifier "]"
Overview#
Atomic matrix store. Atomicity is meant component-wise; there is no atomicity w.r.t. the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.
Except for atomicity, the instruction is identical to the Cooperative matrix store instruction.
Cooperative matrix atomic update#
cooperative-matrix-atomic-update-op = "cooperative_matrix_atomic_add" /
"cooperative_matrix_atomic_max" /
"cooperative_matrix_atomic_min"
value-instruction =/ cooperative-matrix-atomic-update-op [transpose] [checked-flag]
[memory_scope] [memory_semantics]
local-identifier "," local-identifier
"[" local-identifier "," local-identifier "]"
":" coopmatrix-type
Overview#
Atomic matrix update. Atomicity is meant component-wise; there is no atomicity w.r.t. the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.
See Cooperative matrix store instruction for further description.
Cooperative matrix construct#
value-instruction =/ "cooperative_matrix_construct" local-identifier ":" coopmatrix-type
Overview#
Returns a coopmatrix whose entries are initialized to the given dynamically uniform number. The type of the number must match the component type of the coopmatrix type.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | Number |
Restrictions#
The number must be dynamically uniform.
Cooperative matrix extract#
value-instruction =/ "cooperative_matrix_extract"
local-identifier "[" integer-constant "]" ":" number-type
Overview#
Return an element of the coopmatrix’s work-item vector. The index is supplied in square brackets and must be greater than or equal to zero and smaller than the length of the work-item vector, cf. Coopmatrix layout.
The scalar type of the returned value must match the component type of the coopmatrix.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | coopmatrix-type | Cooperative matrix |
| 2 | integer-constant | Index into work-item vector |
Cooperative matrix insert#
value-instruction =/ "cooperative_matrix_insert" local-identifier ","
local-identifier "[" integer-constant "]" ":" coopmatrix-type
Overview#
Return a copy of the coopmatrix, while modifying one entry of the coopmatrix. The index is supplied in square brackets and must be greater than or equal to zero and smaller than the length of the work-item vector, cf. Coopmatrix layout.
The coopmatrix type of the returned value must match the coopmatrix type of the incoming matrix. The scalar type of the inserted scalar must match the component type of the coopmatrix.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | Inserted scalar |
| 2 | coopmatrix-type | Cooperative matrix |
| 3 | integer-constant | Index into work-item vector |
Cooperative matrix load#
value-instruction =/ "cooperative_matrix_load" [transpose] [checked-flag]
local-identifier "[" local-identifier "," local-identifier "]"
":" coopmatrix-type
checked-flag = ".rows_checked" / ".cols_checked" / ".both_checked"
Overview#
Load a cooperative matrix from a 2d-memref at the position given by the indices in square brackets. The position gives the starting row and column index, that is, when a coopmatrix of size \(X\times Y\) is loaded from memref \(M\) at position \(x, y\), then the components \(A_{ij}\) of the coopmatrix are given by
\[A_{ij} := M[(x + i) S_1 + (y + j) S_2], \quad 0 \leq i < X,\ 0 \leq j < Y,\]
where \(S_1\) and \(S_2\) are the entries of the memref’s stride array. When the transpose modifier “.t” is given, we have
\[A_{ij} := M[(x + j) S_1 + (y + i) S_2].\]
When the checked flag is set, the following out-of-bound checks are added (with memref shape \(s_1\times s_2\)):
| Flag | Description |
|---|---|
| .n.rows_checked | \(A_{ij} := M[...] \text{ if } 0 \leq x+i < s_1 \text{ else } 0\) |
| .t.rows_checked | \(A_{ij} := M[...] \text{ if } 0 \leq y+i < s_2 \text{ else } 0\) |
| .n.cols_checked | \(A_{ij} := M[...] \text{ if } 0 \leq y+j < s_2 \text{ else } 0\) |
| .t.cols_checked | \(A_{ij} := M[...] \text{ if } 0 \leq x+j < s_1 \text{ else } 0\) |
| .n.both_checked | .n.rows_checked and .n.cols_checked |
| .t.both_checked | .t.rows_checked and .t.cols_checked |
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | memref-type | M |
| 2 | index | x |
| 3 | index | y |
Restrictions#
\(\text{order}(M) = 2\)
\(\text{component_type}(A) = \text{element_type}(M)\)
All arguments must be dynamically uniform.
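The interplay of the transpose modifier and the checked flags can be modelled in Python on a flat buffer. The sketch assumes, as the bounds in the flag table suggest, that the non-transposed load reads row x+i and column y+j of M and that the transposed load reads row x+j and column y+i; `coopmatrix_load` and its keyword arguments are hypothetical names.

```python
def coopmatrix_load(memory, shape, strides, x, y, rows, cols,
                    transpose=False, rows_checked=False, cols_checked=False):
    """Load a rows x cols matrix from a flat buffer viewed as a strided 2d memref."""
    s1, s2 = shape
    S1, S2 = strides
    result = []
    for i in range(rows):
        line = []
        for j in range(cols):
            # Assumed addressing: the transposed load swaps the roles of i and j.
            r, c = (x + j, y + i) if transpose else (x + i, y + j)
            # rows_checked guards the index built from i, cols_checked the one built from j.
            i_index_ok = (0 <= c < s2) if transpose else (0 <= r < s1)
            j_index_ok = (0 <= r < s1) if transpose else (0 <= c < s2)
            if (rows_checked and not i_index_ok) or (cols_checked and not j_index_ok):
                line.append(0)  # out-of-bounds components are set to zero
            else:
                line.append(memory[r * S1 + c * S2])
        result.append(line)
    return result
```

With a 3x4 memref stored as `list(range(12))` and strides `(1, 3)`, loading a 2x2 matrix at position (2, 0) with `rows_checked=True` fills the second row with zeros because row 3 lies outside the memref.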
Cooperative matrix mul add#
value-instruction =/ "cooperative_matrix_mul_add" local-identifier ","
local-identifier "," local-identifier ":" coopmatrix-type
Overview#
Matrix mul add returns the value of
\[D := AB + C,\]
where A, B, and C are matrices given by the three operands.
The number of rows of the matrices A, C, and D must be a multiple of the subgroup size.
Operands#
| Op.-No. | Type | Use | Description |
|---|---|---|---|
| 1 | coopmatrix-type | matrix_a | A |
| 2 | coopmatrix-type | matrix_b | B |
| 3 | coopmatrix-type | matrix_acc | C |
Restrictions#
\(\forall X\in\{A,C,D\}: \text{rows}(X) \bmod \text{subgroup_size} = 0\)
\(\text{columns}(A) = \text{rows}(B)\)
\(\text{rows}(C) = \text{rows}(A) \land \text{columns}(C) = \text{columns}(B)\)
\(\text{shape}(D) = \text{shape}(C)\)
\(\text{use}(D) = \text{matrix_acc}\)
\(\text{promote}(\text{component_type}(A), \text{component_type}(B)) \preceq \text{component_type}(C)\)
Cast of \(\text{component_type}(C)\) to \(\text{component_type}(D)\) must be allowed
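The shape restrictions can be illustrated with a plain Python reference computation. The sketch assumes the instruction computes D = AB + C, as the operand names and the shape restrictions suggest; `coopmatrix_mul_add` is a hypothetical name, and the subgroup-size and component-type restrictions are not modelled.

```python
def coopmatrix_mul_add(a, b, c):
    """Return D = A B + C for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("columns(A) must equal rows(B)")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have rows(A) rows and columns(B) columns")
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) + c[i][j] for j in range(n)]
        for i in range(m)
    ]
```

The result has the shape of C, which corresponds to the restriction shape(D) = shape(C).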
Cooperative matrix prefetch#
instruction =/ "cooperative_matrix_prefetch" integer-constant ","
local-identifier "[" local-identifier "," local-identifier "]" ","
integer-constant "," integer-constant
Overview#
Cooperatively prefetch memory into device cache. The cache level is given by the first non-negative integer constant, where “0” is the cache closest to the core and core distance increases with increasing cache level. The prefetch instruction is ignored if the cache level does not exist in the target device. The position in square brackets gives the starting row and column index. The last two positive integer constants give the size of the memory region to fetch (in rows by columns). The following memory locations are prefetched:
\[M[(x + i) S_1 + (y + j) S_2], \quad 0 \leq i < X,\ 0 \leq j < Y,\]
where \(S_1\) and \(S_2\) are the entries of the memref’s stride array and \(X\times Y\) is the size of the memory region.
Prefetch is an optimization hint and may be disregarded by the compiler.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | integer-constant | Cache-level |
| 2 | memref-type | M |
| 3 | index | x |
| 4 | index | y |
| 5 | integer-constant | X |
| 6 | integer-constant | Y |
Restrictions#
All arguments must be dynamically uniform.
Cooperative matrix reduce#
coopmatrix-reduce-op = "cooperative_matrix_reduce_add" /
"cooperative_matrix_reduce_max" /
"cooperative_matrix_reduce_min"
value-instruction =/ coopmatrix-reduce-op reduce-mode local-identifier ":" coopmatrix-type
reduce-mode = ".row" / ".column"
Overview#
Computes the sum, maximum, or minimum over either the rows or columns of a coopmatrix.
The component type and use of the returned value’s coopmatrix type must match the component type and use of the incoming matrix.
For a row reduction the resulting shape must be \(M\times 1\) and for a column reduction the resulting shape must be \(1\times N\), where the shape of the incoming matrix is \(M\times N\).
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | coopmatrix-type | Incoming cooperative matrix |
Restrictions#
\(\text{rows}(A) \bmod \text{subgroup_size} = 0\)
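The result shapes of the two reduce modes can be seen in a small Python model. It assumes that a row reduction collapses each row to one entry (giving \(M\times 1\)) and a column reduction collapses each column (giving \(1\times N\)), in line with the shapes stated above; `coopmatrix_reduce` is a hypothetical name.

```python
def coopmatrix_reduce(matrix, op, mode):
    """Reduce a matrix (list of rows) with op (sum, max, or min) along ".row" or ".column"."""
    if mode == ".row":
        return [[op(row)] for row in matrix]              # M x 1 result
    if mode == ".column":
        return [[op(column) for column in zip(*matrix)]]  # 1 x N result
    raise ValueError("mode must be '.row' or '.column'")
```

For the 2x3 matrix `[[1, 2, 3], [4, 5, 6]]`, the row sum is a 2x1 matrix and the column maximum is a 1x3 matrix.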
Cooperative matrix scale#
value-instruction =/ "cooperative_matrix_scale" local-identifier "," local-identifier
":" coopmatrix-type
Overview#
Scale a coopmatrix by a scalar. The scalar type of the scalar and the component type of the coopmatrix must match, and the returned value must have the same coopmatrix type as the matrix operand.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | scalar |
| 2 | coopmatrix-type | matrix |
Restrictions#
\(\text{type}(scalar) = \text{component_type}(matrix)\)
\(\text{type}(result) = \text{type}(matrix)\)
Cooperative matrix store#
instruction =/ "cooperative_matrix_store" [transpose] [checked-flag]
local-identifier "," local-identifier
"[" local-identifier "," local-identifier "]"
Overview#
Store a cooperative matrix value in a 2d-memref at the position given by the indices in square brackets. The position gives the starting row and column index, that is, when a coopmatrix of size \(X\times Y\) is written to memref \(M\) at position \(x, y\), then the components \(A_{ij}\) of the coopmatrix are written to
\[M[(x + i) S_1 + (y + j) S_2] := A_{ij}, \quad 0 \leq i < X,\ 0 \leq j < Y,\]
where \(S_1\) and \(S_2\) are the entries of the memref’s stride array. When the transpose modifier “.t” is given, we have
\[M[(x + j) S_1 + (y + i) S_2] := A_{ij}.\]
When the checked flag is set, the following out-of-bound checks are added (with memref shape \(s_1\times s_2\)):
| Flag | Description |
|---|---|
| .n.rows_checked | Only execute store if \(0 \leq x+i < s_1\) |
| .t.rows_checked | Only execute store if \(0 \leq y+i < s_2\) |
| .n.cols_checked | Only execute store if \(0 \leq y+j < s_2\) |
| .t.cols_checked | Only execute store if \(0 \leq x+j < s_1\) |
| .n.both_checked | .n.rows_checked + .n.cols_checked |
| .t.both_checked | .t.rows_checked + .t.cols_checked |
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | coopmatrix-type | A |
| 2 | memref-type | M |
| 3 | index | x |
| 4 | index | y |
Restrictions#
\(\text{component_type}(A) = \text{element_type}(M)\)
All arguments must be dynamically uniform.
Subgroup broadcast#
value-instruction =/ "subgroup_broadcast" local-identifier "," local-identifier ":" number-type
Overview#
Broadcast a scalar to all work-items in the subgroup. The scalar type of the first operand and the type of the result must match. The second identifier must have i32 type.
Operands#
| Op.-No. | Type | Description |
|---|---|---|
| 1 | number-type | Value that is to be distributed to all work-items of the sub-group |
| 2 | i32 | Subgroup local index that identifies the work-item whose value is returned to all other work-items |
Restrictions#
The second operand must be dynamically uniform.
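A subgroup can be modelled in Python as a list with one entry per work-item, which makes the broadcast semantics explicit. The name `subgroup_broadcast` is reused here only for illustration; the list representation and the bounds check are assumptions of this sketch.

```python
def subgroup_broadcast(values, local_id):
    """values[i] is the operand held by work-item i; every work-item receives values[local_id]."""
    if not 0 <= local_id < len(values):
        raise ValueError("local id must be in [0, subgroup_size)")
    return [values[local_id]] * len(values)
```

Because the second operand is dynamically uniform, a single `local_id` suffices for the whole subgroup in this model.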
Subgroup operation#
subgroup-operation-type = "subgroup_exclusive_scan_add" /
"subgroup_exclusive_scan_max" /
"subgroup_exclusive_scan_min" /
"subgroup_inclusive_scan_add" /
"subgroup_inclusive_scan_max" /
"subgroup_inclusive_scan_min" /
"subgroup_reduce_add" /
"subgroup_reduce_max" /
"subgroup_reduce_min"
value-instruction =/ subgroup-operation-type local-identifier ":" number-type
Overview#
Let \([x_0,x_1,\dots,x_{n-1}]\) be the input vector contributed by a subgroup of size n. (The work-item with subgroup local id i contributes \(x_i\).) Let \(\diamond\) be the binary operator and I the identity. We define the output vector of size n for the group operations in the following table:
| Operation type | Result |
|---|---|
| exclusive_scan | \([I, x_0, (x_0 \diamond x_1), \dots, x_0 \diamond x_1 \diamond \dots \diamond x_{n-2}]\) |
| inclusive_scan | \([x_0, (x_0 \diamond x_1), \dots, x_0 \diamond x_1 \diamond \dots \diamond x_{n-1}]\) |
| reduce | \([s,s,\dots,s] \text{ with } s := x_0 \diamond \dots \diamond x_{n-1}\) |
Add#
Computes the subgroup operation with \(\diamond:=+\) and \(I:=0\).
Max#
Computes the subgroup operation with \(\diamond:=\max\) and identity as given in the following table:
| Type | Identity |
|---|---|
| integer-type | Smallest integer representable by integer type |
| floating-type | \(-\infty\) |
| complex-type | Forbidden |
Min#
Computes the subgroup operation with \(\diamond:=\min\) and identity as given in the following table:
| Type | Identity |
|---|---|
| integer-type | Largest integer representable by integer type |
| floating-type | \(+\infty\) |
| complex-type | Forbidden |
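The three operation types can be written down directly from the result table, with a list standing in for the vector contributed by the subgroup. This Python sketch is a sequential reference model, not a statement about how a device evaluates the operation; the function names are hypothetical.

```python
import math


def exclusive_scan(xs, op, identity):
    """Entry i combines x_0 .. x_{i-1}; entry 0 is the identity."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out


def inclusive_scan(xs, op, identity):
    """Entry i combines x_0 .. x_i."""
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)
        out.append(acc)
    return out


def subgroup_reduce(xs, op, identity):
    """Every work-item receives the combination of all contributions."""
    acc = identity
    for x in xs:
        acc = op(acc, x)
    return [acc] * len(xs)
```

For the floating-point max and min variants, `-math.inf` and `math.inf` play the role of the identity listed in the tables above.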