Tensor language reference

Contents

Tensor language reference#

The grammar is given in ABNF syntax.

Execution model#

The unit of execution described by a function written in the tensor language is called a kernel. Kernels are launched in batches, where each instance of the kernel is called a work-group. The kernel has access to a three dimensional group id that is used to select the work done in the work group. Each work group consists of a fixed number of subgroups that execute concurrently. Subgroups can be further divided into work-items, where the number of work-items per subgroup is given by the subgroup size.

The language distinguishes between collective, SPMD, and mixed instructions. A collective instruction distributes the work among the work-items in an implementation-defined manner. Local variables passed to or returned from a collective instruction are always uniform, meaning that each work-item holds the same value. An SPMD instruction follows the OpenCL execution model, where local variables may have a different value for each work-item. Mixed instructions accept both varying and uniform local variables.

In an SPMD region, we call an argument dynamically uniform if all work-items in a subgroup have the same value.

Regions come in two different kinds: collective and SPMD. A collective instructions must only appear in a collective region, and an SPMD instruction must only appear in a SPMD region. Mixed instructions might appear in both kinds of regions. SPMD regions may be nested in collective regions but collective regions must not be nested in SPMD regions.

Core rules#

White space is used to separate tokens, where a token is either an identifier, a literal, a keyword, or characters such as punctuation or delimiters. Otherwise, white space has no meaning.

Comments start with ; and stop at the end of the line (\n).

Identifier#

Identifiers are either named or unnamed. Named identifiers are letter followed by letters, underscores, or digits. Unnamed identifiers are simply numbers. As in LLVM, local identifiers are prefixed with %, whereas global identifiers are prefixed with @.

identifier                  = unnamed-identifier / named-identifier
unnamed-identifier          = 1*DIGIT
named-identifier            = ALPHA *(ALPHA / DIGIT / "_")
local-identifier            = "%" identifier
global-identifier           = "@" identifier

Constants#

constant                    = boolean-constant / integer-constant / floating-constant / complex-constant
boolean-constant            = "true" / "false"
integer-constant            = [sign] 1*DIGIT
sign                        = "-" / "+"
floating-constant           = [sign] (*DIGIT "." 1*DIGIT ["e" [sign] 1*DIGIT] / "inf" / "nan")
mantissa-dec                = *DIGIT "." 1*DIGIT / 1*DIGIT "."
mantissa-hex                = *HEXDIG "." 1*HEXDIG / 1*HEXDIG "."
exponent                    = [sign] 1*DIGIT
floating-constant-dec       = [sign] (mantissa-dec ["e" exponent] / 1*DIGIT "e" exponent)
floating-constant-hex       = [sign] "0x" (mantissa-hex ["p" exponent] / 1*HEXDIG "p" exponent)
floating-constant           = floating-constant-dec / floating-constant-hex
complex-constant            = "[" floating-constant "," floating-constant "]"

Integer constants must lie in the range \(-2^{63}+1,\dots,2^{63}-1\).

Floating point constants are given in C syntax and expected to be in the range of double precision numbers. The hexadecimal floating point syntax is supported, too. strtod can be used for parsing floating point numbers.

Attributes#

attribute                   = array-attribute /
                              boolean-attribute /
                              dictionary-attribute /
                              integer-attribute /
                              string-attribute
array-attribute             = "[" [attribute *(", " attribute)] "]"
boolean-attribute           = boolean-constant
dictionary-attribute        = "{" [named-attribute *("," named-attribute)] "}"
named-attribute             = attribute-name "=" attribute
attribute-name              = "alignment" /
                              "shape_gcd" /
                              "stride_gcd" /
                              "subgroup_size" /
                              "unroll" /
                              "work_group_size" /
                              string-attribute
integer-attribute           = integer-constant
string-attribute            = %x22 *(%x20-21 / %x23-7E) %x22

Attributes add information about an operation, for example to assert properties or to direct the compiler.

Functions#

function-definition         = "func" global-identifier "(" [argument-list] ")"
                              ["attributes" dictionary-attribute] region
argument-list               = argument *("," argument)
argument                    = local-identifier ":" type [dictionary-attribute]

Defines a function that is callable from the host.

Attributes#

Subgroup size and work-group size are determined automatically by the compiler, but can be overriden using the function’s attribute dictionary:

Name

Type

Description

subgroup_size

integer-attribute

Subgroup size; valid values depend on the target device (typically 16 or 32)

work_group_size

array-attribute with 2 integer-attribute entries

Two dimensional work-group size in number of work-items

The work-group size attribute defines the size of the local work group. Due to the focus on matrix operations, the work-group size is always two-dimensional, where the first mode is used to tile the rows and the second mode is used to tile the columns. The first mode must be a multiple of the subgroup size. If the subgroup size is omitted, then the first mode must be a multiple of one of the subgroup sizes supported by the device. The product of the work-group size modes must be smaller or equal than the maximum work-group size of device.

The subgroup size attribute enforces a particular subgroup size that must be supported by the device.

Parameter attributes#

Parameters with memref or group type accept the following named attributes:

Name

Type

Description

alignment

integer-attribute

Minimum pointer alignment

shape_gcd

array-attribute of integer-attribute

Greatest common divisors of shape

stride_gcd

array-attribute of integer-attribute

Greatest common divisors of stride

Cf. the documentation of the memref type and the group type.

Restrictions#

  • Arguments must not have coopmatrix type.

Regions#

region                      = "{" *instruction "}"

A region is an ordered list of instructions. An instruction might contain a region. Regions have access to values from its enclosing region, but the enclosing region does not have access to values assigned in the region.

Types#

type                        = void-type / boolean-type / number-type / memref-type / group-type
void-type                   = "void"

Boolean type#

boolean-type                = "bool"

Boolean type that only has two states (true or false).

Scalar types#

number-type                 = integer-type / floating-type / complex-type
integer-type                = "i8" / "i16" / "i32" / "i64" / "index"
floating-type               = "bf16" / "f16" / "f32" / "f64"
complex-type                = "c32" / "c64"

Scalar types are either signless integer (“i”), floating point (“f”), or complex floating point (“c”). The number behind the scalar type prefix denotes the number of bits, e.g. “f64” are double precision floating point numbers. The “bf16” type encodes bfloat16 floating point numbers. The “index” type is an integer type whose width is platform-specific.

Type sizes in bytes are given by

\(\alpha\)

i8

i16

i32

i64

bf16

f16

f32

f64

c32

c64

\(\text{size}(\alpha)\)

1

2

4

8

2

2

4

8

8

16

Mixed precision operands might be allowed in instructions if the operands’ types are promotable. The scalar type \(\alpha\) may be promoted to the scalar type \(\beta\) if all values an operand of type \(\alpha\) may take can be exactly represented in type \(\beta\). Formally, \(\alpha\) is promotable to \(\beta\) if \(\alpha \preceq \beta\), where the partial order \(\preceq\) is defined by the following relation matrix:

\(\preceq\)

i8

i16

i32

i64

bf16

f16

f32

f64

c32

c64

i8

1

1

1

1

1

1

1

1

1

1

i16

1

1

1

1

1

1

1

i32

1

1

1

1

1

i64

1

bf16

1

1

1

1

1

f16

1

1

1

1

1

f32

1

1

1

1

f64

1

1

c32

1

1

c64

1

Moreover, for scalar types \(\alpha,\beta\) we define

\[\begin{split}\text{promote}(\alpha, \beta) = \left\{\begin{array}{rcl} \beta & \text{ if } & \alpha \preceq \beta, \\ \alpha & \text{ if } & \beta \preceq \alpha, \\ \text{fail} & \text{ else.} \end{array}\right.\end{split}\]

Here, “fail” means that the promotion is not allowed and the compiler should throw an error.

Memref type#

memref-type                 = "memref<" number-type tensor-shape ["," memory-layout] ["," address-space] ">"
constant-or-dynamic         = integer-constant / "?"
tensor-shape                = *("x" constant-or-dynamic)
address-space               = "global" / "local"

A memref is a reference to a region of memory. In analogy to the C/C++-language, the memref can be thought of as a pointer, but with additional information on the size and memory layout of the memory region. The size information can be either fixed or dynamic. For example, the memref<f32x4x8> is analogue to float* with the additional information that the memory region contains 32 floats structured in 4 rows and 8 columns. The memref<f32x4x?> type is analogue to float*, too, but here the number of floats and the number of columns is only known at run-time.

Run-time size information is stored in a dope vector; the calling convention for memrefs is implementation-defined.

The memref can have order 0. E.g. memref<f32> can be thought of as a pointer to a single precision float. A vector is a tensor of order 1, e.g. memref<f64x4>. A matrix is a tensor of order 2, e.g. memref<f64x4x4>. A tensor of order n is given by memref<f32xs_1x...xs_n>.

Dynamic mode sizes are written using a question mark in place of an integer constant.

The default memory layout is the packed dense layout. E.g. the memory layout of memref<f32x5x6x7> is strided<1,5,30>. We note that memref<f32x5x6x7> and memref<f32x5x6x7,strided<1,5,30>> are the same type.

Memrefs have an optional address space attribute. The global address space referse to memory objects allocated from the global memory pool that is shared by all work groups. The local memory space is shared by all work-items of the work-group but inaccessible to another work-group. The default address space is “global”, memrefs with “local” address space are returned by the alloca instruction.

Definitions#

Let V be a value of memref type. The \(\text{order}(V)\) operation returns the memref’s order. The \(\text{shape}(V)\) returns the tensor shape as tuple. \(\text{rows}(V)\) and \(\text{columns}(V)\) return the size of the first and second mode, respectively. The \(\text{element_type}(V)\) operation gives the underlying scalar type.

For example, let B be a value of memref<f32x8x16x4> type, then

  • \(\text{order}(B) = 3\)

  • \(\text{shape}(B) = (8,16,4)\)

  • \(\text{rows}(B) = 8\)

  • \(\text{columns}(B) = 16\)

  • \(\text{element_type}(B) = \text{f32}\)

Memory layout#

memory-layout               = strided-layout
Strided layout#
strided-layout              = "strided<" [constant-or-dynamic-list] ">"
constant-or-dynamic-list    = constant-or-dynamic *("," constant-or-dynamic)

The strided layout is a sequence of integers \(S_1,S_2,...,S_n\), where n must be equal to the order of the tensor. The strided layout is defined as the map

\[(i_1,i_2,...,i_n) \mapsto i_1 S_1 + i_2 S_2 + ... + i_n S_n\]

We further impose the following restriction for a tensor with shape \(s_1\times s_2 \times ... \times s_n\):

  • \(1 \leq S_1\)

  • \(\forall i \in [2,n]: S_{i-1}s_{i-1} \leq S_i\)

Therefore, we have the “column-major” layout. The default packed dense layout is given by

  • \(1 = S_1\)

  • \(\forall i \in [2,n]: S_{i-1}s_{i-1} = S_i\)

Stride modes might be dynamic as well, indicated by a question mark.

Alignment attribute#

The alignment=X attribute gives the alignment X of the memref’s base pointer in bytes. That is, for the pointer P pointing to the first element of the memref we must have \(P = 0 \pmod{X}\).

Restriction: The alignment must be a multiple of the size of the memref’s element type.

Greatest common divisor (GCD) attributes#

The shape_gcd=[d_1,…,d_k] attribute asserts that \(s_i = 0 \pmod{d_i}, i=1,\dots,k\), where k is smaller or equal than the order of the tensor n and \(s_i\) is the i-th entry of the shape vector. The divisors are understood to be the greatest common divisors for the set of shapes that the kernel is used for. For example, if we know that \(s_1\) is always a multiple of 4 then we can set shape_gcd=[4].

The stride_gcd=[D_1,…,D_m] attribute asserts that \(S_i = 0 \pmod{D_i}, i=1,\dots,m\), where m is smaller or equal than the order of the tensor n and \(S_i\) is the i-th entry of the stride vector. The divisors are understood to be the greatest common divisors for the set of strides that the kernel is used for. For example, if we know that \(S_2\) is always a multiple of 4 then we can set stride_gcd=[1,4].

Group type#

group-type                  = "group<" memref-type "x" constant-or-dynamic ["," "offset" ":" constant-or-dynamic] ">"

The group type collects unstructured pointers to memref’s with potentially different dynamic mode sizes. The C-analogy of a group is a pointer-to-a-pointer. For example, the C-analogue of a group<memref<f32x16x16>x?> is a float**.

The group shape is always one-dimensional and may be queried using the size instruction.

The optional offset parameter is used to offset each pointer by the given number of elements. Given the C-analogue float** group, loading element i with offset off gives the pointer float* tmp = group[i] + off. The default offset is 0.

Dynamic values (‘?’) may appear in the memref-type, in the group shape, and in the offset. These values are stored in the dope vector; the calling convention for groups is implementation-defined.

Attributes#

Attributes applied on a group type are passed through to the memrefs. That is, when a memref is loaded from the group then the memref attributes are equal to the attributes of the group.

Cooperative matrix type#

coopmatrix-type             = "coopmatrix<" number-type 2*2("x" integer-constant) "," matrix-use ">"
matrix-use                  = "matrix_a" / "matrix_b" / "matrix_acc"

The coopmatrix represents a matrix distributed across a subgroup, where each work-item in a subgroup stores a part of the matrix. The number-type specifies the matrix element type, the first integer-constant the number of rows, and the second integer-constant the number of columns. The matrix-use may affect the distribution of the matrix in the subgroup, and the name refers to the position of the matrix in a matrix multiplication.

Not all matrix shapes need to be supported in the implementation. The supported matrix shapes may depend on data type, matrix use, and target hardware.

An argument to any instruction that has coopmatrix type must be dynamically uniform.

Definitions#

Let V be a value of coopmatrix type. The \(\text{rows}(V)\) and \(\text{columns}(V)\) functions return the size of the first and second mode, respectively, and \(\text{shape}(V)\) returns rows and cols as tuple. The \(\text{component_type}(V)\) operation gives the underlying scalar type and \(\text{use}(V)\) returns the use.

For example, let B be a value of coopmatrix<f32x8x16,matrix_acc> type, then

  • \(\text{shape}(B) = (8,16)\)

  • \(\text{rows}(B) = 8\)

  • \(\text{columns}(B) = 16\)

  • \(\text{component_type}(B) = \text{f32}\)

  • \(\text{use}(B) = \text{matrix_acc}\)

Instructions#

Instructions may return zero, one, or multiple values, and follow the following format:

value-instruction-assignment        = local-identifier "=" value-instruction
multi-value-instruction-assignment  = [local-identifier-list "="] multi-value-instruction
local-identifier-list               = local-identifier *("," local-identifier)
instruction                         = value-instruction-assignment
                                      / multi-value-instruction-assignment

That is, on the left-hand side we have list of values that are produced by the instruction followed by an equals sign, or an empty string, if the instruction does not produce values. On the right-hand side, after the equals sign or empty string, the name of the instruction is written, e.g. “ger”, optionally followed by instruction modifiers, e.g. “ger.atomic”. Then, a list of operands follows that is usually comma-seperated but might also be printed in a custom format (e.g. for “load”, “store”, “subview”, etc.). If the instruction produces values, then the types of the returned values must be annotated after a colon.

Collective instructions#

Alloca#

value-instruction   = "alloca" [dictionary-attribute] ":" memref-type
Overview#

The alloca instruction allocates temporary memory that is freed automatically at the end of the block that contains the alloca.

Attributes#

Alloca accepts the following named attributes:

Name

Type

Description

alignment

integer-attribute

Base pointer alignment; must not be larger than the default alignment.

Restrictions#
  • The memref’s size must known at compile-time, i.e. the tensor shape must not contain any dynamic modes.

  • The address space must be “local”.

Axpby#

transpose       =  ".t" / ".n"
instruction     =/ "axpby" [".atomic"] [transpose] local-identifier "," local-identifier ","
                           local-identifier "," local-identifier
Overview#

Axpby implements

\[B := \alpha \text{op}(A) + \beta B\]

for vectors and matrices, where \(\text{op}(X)\) is defined as

\[\begin{split}\text{op}(X) := \left\{ \begin{array}{rcl} X^T & \text{ if } & \text{transpose} = \text{".t"} \wedge \text{order}(X) = 2,\\ X & \text{ else. } \end{array} \right.\end{split}\]

If the atomic flag is set, B is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

A

3

number-type

\(\beta\)

4

memref-type

B

Restrictions#
  • \(\text{shape}(B) = \text{shape}(\text{op}(A))\)

  • \(\text{order}(B) = 0 \lor \text{order}(B) = 1 \lor \text{order}(B) = 2\)

  • \(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(B)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(B)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

Cumulative sum#

instruction     =/ "cumsum" [".atomic"] local-identifier "," local-identifier "," integer-constant ","
                            local-identifier "," local-identifier
Overview#

Computes the n-mode cumulative sum

\[B := \alpha A \times_{n} L_{s_n} + \beta B,\]

where \(L_{s_n}\) is the lower triangular matrix of ones of size \(s_n\times s_n\) and \(s_n\) is the n-th entry of the shape vector of A. In index notation, we have equivalently

\[B_{i_1\dots i_{n-1}ji_{n+1}\dots i_M} := \alpha \sum_{i_n=1}^{j}A_{i_1\dots i_{n-1}i_ni_{n+1}\dots i_M} + \beta B_{i_1\dots i_{n-1}ji_{n+1}\dots i_M},\]

If the atomic flag is set, B is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

A

3

integer-constant

n (summation mode)

4

number-type

\(\beta\)

5

memref-type

B

Restrictions#
  • \(\text{order}(A) \geq 1\)

  • \(\text{shape}(A) = \text{shape}(B)\)

  • \(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(B)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(B)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

Foreach#

instruction     =/ "foreach" "(" local-identifier-list ")" "="
                   "(" local-identifier-list ")" "," "(" local-identifier-list ")" region
Overview#

A foreach loop that executes the loop’s range without any sequence guarantee. The region of a foreach is a spmd region.

The three local identifier lists define the loop range and the local identifiers that make the trip count available within the loop body. All three lists must have the same length and have the following format:

\[(\text{var}_1, \dots, \text{var}_N) = (\text{from}_1, \dots, \text{from}_N), (\text{to}_1, \dots, \text{to}_N),\]

where \(N\) is the common length of each of the three lists. The loop range is defined as the cartesian product of the half-open intervals \([\text{from}_i; \text{to}_i)\) such that the trip count take the values

\[(\text{var}_1, \dots, \text{var}_N) \in [\text{from}_1; \text{to}_1) \times \dots \times [\text{from}_N; \text{to}_N)\]

The integer type of a “from” and “to” pair must match.

The mapping of trip count to work-item is implementation-defined.

Foreach tile#

instruction     =/ "foreach_tile" "(" local-identifier-list ")" "="
                   "(" local-identifier-list ")" "," "(" local-identifier-list ")"
                   "as" "(" local-identifier-list ")" "<=" "(" integer-list ")"
                   region
integer-list    = integer-constant *("," integer-constant)
Overview#

A foreach loop that partitions the loop range into tiles. The region of a foreach_tile is a spmd region.

The first three local identifier lists define the loop range and the local identifiers that make the tile offset available within the loop body. All three lists must have the same length and have the following format:

\[(\text{var}_1, \dots, \text{var}_N) = (\text{from}_1, \dots, \text{from}_N), (\text{to}_1, \dots, \text{to}_N),\]

where \(N\) is the common length of each of the three lists and the integer type of a “from” and “to” pair must match. After “as” comes an identifier list that makes the tile shape available in the loop body and the constant upper bound for the tile shape, following the format

\[(\text{size}_1, \dots, \text{size}_N) \leq (\text{tile_shape}_1, \dots, \text{tile_shape}_N).\]

The number of tiles in mode \(i=1,\dots,N\) is given by \(K_i = \lceil(\text{to}_i-\text{from}_i) / \text{tile_shape}_i\rceil\) and the tile offset takes the following values:

\[\text{var}_i = \text{from}_i+k\cdot \text{tile_shape}_i, \quad k=0,\dots,K_i-1.\]

The size variable is given by

\[\text{size}_i = \min(\text{tile_shape}_i, \text{to}_i - \text{var}_i).\]

Therefore, the size is equal to the tile shape except for the loop remainder.

Restrictions#

The first entry of the tile shape (\(\text{tile_shape}_1\)) must be a multiple of the subgroup size. The tile offsets (\(\text{var}_i\)) must be dynamically uniform.

Example#
foreach_tile (%i,%j)=(%c0,%c0),(%c70,%c64) as (%ti,%tj)<=(32,32) {
    %c32 = constant 32 : index
    %is_remainder = less_than %ti, %c32 : bool
    if %is_remainder {
        %tile = cooperative_matrix_load.rows_checked %A[%i,%j] : coopmatrix<f32x32x32,matrix_acc>
        cooperative_matrix_store.rows_checked %tile, %B[%i,%j]
    } else {
        %tile = cooperative_matrix_load %A[%i,%j] : coopmatrix<f32x32x32,matrix_acc>
        cooperative_matrix_store %tile, %B[%i,%j]
    }
}

GEMM#

instruction     =/ "gemm" [".atomic"] [transpose] [transpose] local-identifier "," local-identifier ","
                          local-identifier "," local-identifier "," local-identifier
Overview#

GEMM implements the well-known GEMM BLAS-3 operation.

\[C := \alpha \text{op}_1(A) \text{op}_2(B) + \beta C\]

The functions \(\text{op}_1\) and \(\text{op}_2\) are defined as

\[\begin{split}\text{op}_i(X) := \left\{ \begin{array}{rcl} X^T & \text{ if } & \text{transpose}_i = \text{".t"},\\ X & \text{ else. } \end{array} \right.\end{split}\]

where transpose1 and transpose2 refer to the first and second transpose modifier, respectively.

If the atomic flag is set, C is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

A

3

memref-type

B

4

number-type

\(\beta\)

5

memref-type

C

Restrictions#
  • \(\text{order}(A) = \text{order}(B) = \text{order}(C) = 2\)

  • \(\text{colums}(\text{op}_1(A)) = \text{rows}(\text{op}_2(B))\)

  • \(\text{rows}(C) = \text{rows}(\text{op}_1(A))\)

  • \(\text{columns}(C) = \text{columns}(\text{op}_2(B))\)

  • \(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(B)) \preceq \text{element_type}(C)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(C)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

GEMV#

instruction     =/ "gemv" [".atomic"] [transpose] local-identifier "," local-identifier ","
                          local-identifier "," local-identifier "," local-identifier
Overview#

GEMV implements the well-known GEMM BLAS-2 operation.

\[c := \alpha \text{op}_1(A) b + \beta c\]

where \(\text{op}_1\) is defined as in GEMM.

If the atomic flag is set, c is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

A

3

memref-type

b

4

number-type

\(\beta\)

5

memref-type

c

Restrictions#
  • \(\text{order}(A) = 2\)

  • \(\text{order}(b) = \text{order}(c) = 1\)

  • \(\text{colums}(\text{op}_1(A)) = \text{rows}(b)\)

  • \(\text{rows}(c) = \text{rows}(\text{op}_1(A))\)

  • \(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(b)) \preceq \text{element_type}(C)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(C)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

GER#

instruction     =/ "ger" [".atomic"] local-identifier "," local-identifier ","
                         local-identifier "," local-identifier "," local-identifier
Overview#

Computes the general rank-1 update:

\[C := \alpha a b^T + \beta C\]

If the atomic flag is set, C is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

a

3

memref-type

b

4

number-type

\(\beta\)

5

memref-type

C

Restrictions#
  • \(\text{order}(a) = \text{order}(b) = 1\)

  • \(\text{order}(C) = 2\)

  • \(\text{rows}(C) = \text{rows}(a)\)

  • \(\text{columns}(C) = \text{rows}(b)\)

  • \(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(b)) \preceq \text{element_type}(C)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(C)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

Hadamard product#

instruction     =/ "hadamard" [".atomic"] local-identifier "," local-identifier ","
                              local-identifier "," local-identifier "," local-identifier
Overview#

Computes the Hadamard product of two vectors or two matrices. That is, in index notation we have

\[c_{i} := \alpha a_{i} b_{i} + \beta c_{i}\]

for vectors and

\[C_{ij} := \alpha A_{ij} B_{ij} + \beta C_{ij}\]

for matrices. If the atomic flag is set, c/C is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

a/A

3

memref-type

b/B

4

number-type

\(\beta\)

5

memref-type

c/C

Restrictions#
  • \(\text{order}(a) = \text{order}(b) = \text{order}(c) = o\) with \(o\in\{1,2\}\)

  • \(\text{shape}(a) = \text{shape}(b) = \text{shape}(c)\)

  • \(\text{type}(\alpha) \preceq \text{promote}(\text{element_type}(A), \text{element_type}(b)) \preceq \text{element_type}(C)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(C)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

Parallel#

instruction     =/ "parallel" region
Overview#

Opens an spmd region.

Sum#

instruction     =/ "sum" [".atomic"] [transpose] local-identifier "," local-identifier ","
                         local-identifier "," local-identifier
Overview#

Computes the matrix-vector product or the dot product of A with a vector of ones. That is, if the result is a vector we have

\[b := \alpha \text{op}(A) \vec{1} + \beta b,\]

where \(\text{op}(A)\) is defined as in the axpby instruction, and if the result is a scalar we have

\[b := \alpha \left<A,\vec{1}\right> + \beta b\]

If the atomic flag is set, b is updated atomically.

Operands#

Op.-No.

Type

Description

1

number-type

\(\alpha\)

2

memref-type

A

3

number-type

\(\beta\)

4

memref-type

b

Restrictions#
  • \(\text{order}(b) = 1 \lor \text{order}(b) = 0\)

  • \(\text{order}(A) = \text{order}(b)+1\)

  • \(\text{rows}(b) = \text{rows}(\text{op}(A)) \text{ if } \text{order}(b) = 1\)

  • \(\text{type}(\alpha) \preceq \text{element_type}(A) \preceq \text{element_type}(B)\)

  • \(\text{type}(\beta) \preceq \text{element_type}(B)\)

  • If the atomic flag is set, \(\beta\) must be constant and \(\beta \in \{0,1\}\).

Additional instructions#

instruction             =/ "lifetime_stop" local-identifier

Mixed instructions#

Arithmetic (binary)#

arith-binary-type       =  "add" /
                           "sub" /
                           "mul" /
                           "div" /
                           "rem" /
                           "max" /
                           "min" /
                           "shl" /
                           "shr" /
                           "and" /
                           "or"  /
                           "xor"
value-instruction       =/ arith-binary-type local-identifier "," local-identifier
                           ":" (boolean-type / number-type / coopmatrix-type)
Overview#

Binary arithmetic operation on scalars and cooperative matrices. Both operands, as well as the returned type, have the same scalar or component type. Arithmetic on cooperative matrices is done component-wise.

The following table shows the operations’ description and the types that are allowed for the operation. The backslash “\” is used to exclude types from the list of allowed types.

Op

Allowed type

Description

add

number-type

Sum of operands

sub

number-type

Difference of operands

mul

number-type

Product of operands

div

number-type

Quotient of operands

rem

number-type \ complex-type

Remainder from the division of operands

max

number-type \ complex-type

Maximum of operands

min

number-type \ complex-type

Minimum of operands

shl

integer-type

Left shift first operand by second operand

shr

integer-type

Arithmetic right shift first operand by second operand

and

boolean-type / integer-type

Bitwise and

or

boolean-type / integer-type

Bitwise or

xor

boolean-type / integer-type

Bitwise xor

Arithmetic (unary)#

arith-unary-type        =  "abs" /
                           "neg" /
                           "not" /
                           "conj" /
                           "im" /
                           "re"
value-instruction       =/ arith-unary-type local-identifier
                           ":" (number-type / coopmatrix-type)
Overview#

Unary arithmetic operation on scalars and cooperative matrices. For integer and floating point input, the operand must have the same type as the returned value. For complex input, the returned value has the component floating point type for “.abs”, “.im”, and “.re”, and the returned value has the same type as the operand for “.neg” and “.conj”. Arithmetic on cooperative matrices is done component-wise.

The following table shows the operations’ description and the types that are allowed for the operation.

Op

Allowed type

Description

abs

number-type

Compute absolute value

neg

number-type

Negation

not

boolean-type / integer-type

Bitwise not

conj

complex-type

Complex conjugate

im

complex-type

Extract imaginary part

re

complex-type

Extract real part

Associated#

value-instruction =/ "associated" local-identifier ":" bool-type
Overview#

Checks whether if a memref or group is associated. Returns true if the base address is non-null.

Operands#

Op.-No.

Type

Description

1

memref-type / group-type

tensor

Returns#

True if the memref is associated and false otherwise, that is, if the base address is a null pointer.

Atomic load#

value-instruction =/ "atomic_load" [memory_scope] [memory_semantics]
                                   local-identifier "[" [local-identifier-list] "]"
                                   ":" scalar-type
scope             =  ".cross_device" /
                     ".device" /
                     ".work_group" /
                     ".subgroup"
memory_semantics  =  ".relaxed" /
                     ".acquire" /
                     ".release" /
                     ".acquire_release" /
                     ".sequentially_consistent"
Overview#

Load the element given by the index list from a memref atomically. The number of indices must match the order of the memref and a single index must be given for a group.

The store is atomic and the default scope is “work_group” and the default memory semantics is “relaxed”.

Operands#

Op.-No.

Type

Description

1

memref-type / group-type

tensor

2…

index

index list

Returns#

A value of the memref’s element type.

Atomic store#

instruction     =/ "atomic_store" [memory_scope] [memory_semantics] local-identifier ","
                                  local-identifier "[" [local-identifier-list] "]"
Overview#

Store a scalar value (first operand) in a memref (second operand) at the position given by the index list. The number of indices must match the order of the memref.

The store is atomic and the default scope is “work_group” and the default memory semantics is “relaxed”.

When storing a complex value the update may be pseudo-atomic, meaning that an atomic store is used for the the real and imaginary separately.

Operands#

Op.-No.

Type

Description

1

number-type

value

2

memref-type

tensor

3…

index

index list

Restrictions#
  • \(\text{type}(value) = \text{element_type}(tensor)\)

Atomic update#

atomic-update-op  =  "atomic_add" /
                     "atomic_min" /
                     "atomic_max"
value-instruction =/ atomic-update-scope [memory_scope] [memory_semantics] local-identifier ","
                                         local-identifier "[" [local-identifier-list] "]"
                                         ":" number-type
Overview#

Store a scalar value (first operand) in a memref (second operand) at the position given by the index list. The number of indices must match the order of the memref, and the return type must match the memref’s element type.

The following steps are done atomically: The value at the memory location is fetched, the fetched value is updated with the fetched value, and the resulting value is stored at the memory location. The default scope is “work_group” and the default memory semantics is “relaxed”.

When storing a complex value the update may be pseudo-atomic, meaning that an atomic update is used for the the real and imaginary separately.

Operands#

Op.-No.

Type

Description

1

number-type

value

2

memref-type

tensor

3…

index

index list

Restrictions#
  • \(\text{type}(value) = \text{element_type}(tensor)\)

Barrier#

instruction             =/ "barrier" [".global"] [".local"]
Overview#

Note: Barriers are inserted automatically in collective regions, but not in SPMD regions. Manual barrier insertion should only be only necessesary in SPMD regions.

Control barrier. The barrier must be encountered by all work-items. A work-item in a work-group is not allowed to continue until all work-items in the work-group have reached the barrier.

Aditional memory fences are controlled by the following attributes:

Attribute

Description

.global

Ensure that global memory accesses become visible to the work-group.

.local

Ensure that local memory accesses become visible to the work-group.

Builtin (mixed)#

mixed-builtin-type      =  "group_id" comp3      /
                           "num_groups" comp3    /
                           "num_subgroups" comp3 /
                           "subgroup_size"
comp3                   = ".x" / ".y" / ".z"
value-instruction       =/ mixed-builtin-type ":" integer-type
Overview#

Returns a builtin value.

The group id is three dimensional; the mode is selected with the .x, .y, and .z suffix. Each mode starts with zero and is limited by the corresponding num_groups mode. That is,

\[\forall d \in \{x,y,z\} : 0 \leq \text{group_id}_d < \text{num_groups}_d\]

The number of subgroups is related to the 2-dimensional work-group size as following:

\[\begin{split}\begin{aligned} \text{num_subgroups}_x &= \frac{\text{work_group_size[0]}}{\text{subgroup_size}} \\ \text{num_subgroups}_y &= \text{work_group_size[1]} \\ \text{num_subgroups}_z &= 1 \end{aligned}\end{split}\]

The following table shows the builtins’ description and the types that are returned.

Builtin

Type

OpenCL analogue

Description

group_id

index

get_group_id

Returns the x, y, or z mode of the group id

num_groups

index

get_num_groups

Returns number of groups in the x, y, or z mode

num_subgroups

i32

N/A

Returns the number of subgroups in the x, y, or z mode

subgroup_size

i32

get_max_sub_group_size

Returns the subgroup size

Cast#

value-instruction       =/ "cast" local-identifier ":" number-type
value-instruction       =/ "cast" local-identifier ":" coopmatrix-type
Overview#

Cast scalar values or cooperative matrices to type indicated after the colon.

The source type must be a coopmatrix type if the destination type is a coopmatrix type, and the shapes must match. The coopmatrix use must either match, or the use of the source type must be matrix_acc and the use of the destination type must be matrix_a or matrix_b.

Casts from complex types to non-complex types are forbidden. The following table summarizes the casts and the mapping to SPIR-V (the casts are done component-wise for coopmatrix types):

Operand type

Result type

SPIR-V Op

integer-type

integer-type

OpSConvert

floating-type

floating-type

OpFConvert

complex-type

complex-type

OpFConvert (on vector2)

integer-type

floating-type

OpConvertSToF

floating-type

integer-type

OpConvertFToS

floating-type

complex-type

OpFConvert on real part, imaginary part is zero

integer-type

complex-type

OpConvertSToF on real part, imaginary part is zero

complex-type

integer-type

Forbidden

complex-type

floating-type

Forbidden

Comparison#

comparison-type         =  "equal" /
                           "not_equal" /
                           "greater_than" /
                           "greater_than_equal" /
                           "less_than" /
                           "less_than_equal"
value-instruction       =/ comparison-type local-identifier "," local-identifier ":" "bool"
Overview#

Scalar comparison. Both operands must have the same scalar type and the returned value has boolean type.

The following table shows the comparisons’ description and the types that are allowed for the comparison. The backslash “\” is used to exclude types from the list of allowed types.

Cond

Allowed type

Description

equal

number-type

Equal

not_equal

number-type

Not equal

greater_than

number-type \ complex-type

Greater than

greather_than_equal

number-type \ complex-type

Greater than or equal

less_than

number-type \ complex-type

Less than

less_than_equal

number-type \ complex-type

Less than or equal

Constant#

value-instruction       =/ "constant" constant ":" (boolean-type / number-type / coopmatrix-type)
Overview#

Sets the result value to a constant value. The type of the constant must match the scalar or component type (e.g. an integer type requires an integer-constant and a floating type requires a floating-constant).

When the result is a cooperative matrix, all entries are set to the same constant value.

Expand#

value-instruction       =/ "expand" local-identifier "[" integer-constant "->" expand-shape "]" ":" memref-type
expand-shape            =  integer-constant-or-identifier 1*("x" integer-constant-or-identifier)
integer-constant-or-identifier = integer-constant / local-identifier
Overview#

The expand instruction returns a view on a tensor with a mode viewed as higher-order mode.

Operands#

The first argument must point to a value of memref type. The first integer constant before “->” gives the mode that shall be expanded. The expand shape coming after “->” gives the new shape of the mode. Dynamic values in the expand shape must have index type.

Restrictions#

The memref type of the result must conform with the following rules:

  1. Element type and address space must match the operand’s memref type.

  2. Shape: The mode size is replaced with the expand shape. The product of the expand shape must equal the size of the expanded mode.

    expand %0[1 -> 2x8]      : memref<f32x32x2x8x8>     ; %0: memref<f32x32x16x8>
    expand %0[1 -> 2x2x2x2]  : memref<f32x32x2x2x2x2x8> ; %0: memref<f32x32x16x8>
    
  3. Identifiers: Local identifiers in the expand shape are dynamic in the resulting memref type. The product of the dynamic expand shape must equal the size of the expanded mode.

    expand %0[1 -> %1 x 2]      : memref<f32x32x?x2>   ; %0: memref<f32x32x?>
    expand %0[1 -> 2 x %1]      : memref<f32x32x2x?>   ; %0: memref<f32x32x?>
    expand %0[1 -> %1 x 2]      : memref<f32x32x?x2>   ; %0: memref<f32x32x16>
    expand %0[1 -> %1 x 2]      : memref<f32x32x?x2>   ; %0: memref<f32x32x?>
    expand %0[1 -> %1 x %2 x 2] : memref<f32x32x?x?x2> ; %0: memref<f32x32x16>
    expand %0[1 -> %2 x 2 x %1] : memref<f32x32x?x2x?> ; %0: memref<f32x32x16>
    expand %0[1 -> %1 x %2]     : memref<f32x32x?x?>   ; %0: memref<f32x32x?>
    expand %0[1 -> %1 x %2]     : memref<f32x32x?x?>   ; %0: memref<f32x32x16>
    

    Note: In the third example above, %1 must be equal to 8. The output mode corresponding to %1 is still dynamic.

  4. Stride: A new stride entry is entered that follows the canonical stride computation. It is also permissible to put ‘?’ for a stride instead of the constant value.

    expand %0[0->4 x 8]  : memref<f32x4x8x7,strided<2,8,64>> ; %0: memref<f32x32x7,strided<2,64>>
    expand %0[0->4 x 8]  : memref<f32x4x8x7,strided<2,?,?>>  ; %0: memref<f32x32x7,strided<2,64>>
    expand %0[0->%1 x 4] : memref<f32x?x4x7,strided<2,?,?>>  ; %0: memref<f32x?x7,strided<2,?>>
    expand %0[0->4 x %1] : memref<f32x4x?x7,strided<2,8,?>>  ; %0: memref<f32x?x7,strided<2,?>>
    expand %0[0->4 x %1] : memref<f32x4x?x7,strided<2,?,?>>  ; %0: memref<f32x?x7,strided<2,?>>
    

Further restrictions:

  • The product of the expand shape must be the same as the mode size.

  • If the product of the expand shape is only known at runtime, then it is undefined behaviour if the dynamic product does not match the mode size.

For#

multi-value-instruction = "for" local-identifier "="
                                local-identifier "," local-identifier ["," local-identifier]
                          ["init" "(" init-value-list ")" "->" "(" return-type-list ")" ] region
                          ["attributes" dictionary-attribute]
init-value-list         = init-value *("," init-value)
init-value              = local-identifier "=" local-identifier
return-type-list        = return-type *("," return-type)
return-type             = boolean-type / number-type / coopmatrix-type
Overview#

A for loop. Instructions in the for loop execute sequentially and its region is a mixed region.

Arguments#

The trip count is stored in the first local identifier and is accessible within the loop body. The loop’s range [from; to) is given by the first and the second local identifier after the equals sign, and a step size may be given with the third local identifier after the equals sign. The step size defaults to 1 if omitted. The integer type of “from”, “to”, and “step” must be identical, and the integer type of the loop variable follows the loop range’s type.

Values that are given in the init-value-list may be carried from one iteration to the next. The local identifier gives the name of the loop-carried value as it is accessible in the loop body. The local identifier given on the right-hand side of the init-value expression determines the initial value of the loop-carried value, and its type must coincide with the number-type-list. When loop-carried values are present, the loop’s last instruction must be a yield instruction that updates the loop-carried values for the next iteration. The number and types of the yielded values must correspond the number-type-list.

Returns#

The final value of the loop-carried values are returned by the for instruction.

Example:

%from = constant 2 : i32
%to = constant 6 : i32
%f0 = constant 0 : i64
%f1 = constant 1 : i64
%fn_1, %fn = for %n=%from,%to init(%fn_2=%f0,%fn_1=%f1) -> (i64,i64) {
    %fn = add %fn_2, %fn_1 : i64
    yield (%fn_1, %fn)
}
; %fn_1 contains the fourth Fibonacci number and %fn the fifth Fibonacci number
Attributes#

The following named attributes may be passed in the attribute dictionary:

Name

Type

Description

unroll

boolean-attribute or integer-attribute

true: request to unroll loop, false: request to not unroll loop, integer: partial unroll count

Fuse#

value-instruction       =/ "fuse" local-identifier "[" integer-constant "," integer-constant "]"
                                  ":" memref-type
Overview#

The fuse instruction returns a view on a tensor with two or more adjacent modes viewed as a single mode.

Fused modes are specified as the interval [from, to], where counting starts from 0. From and to must refer to existing modes, that is, we require \(0 \leq \text{from} < \text{to} < \text{order}(\text{tensor})\). Moreover, the stride vector S and the shape vector s must satisify the following compatibility condition:

\(\forall k \in [\text{from},\text{to}): S_{k}s_{k} = S_{k+1}\)

If S(i:j) and s(i:j) are known at compile time, the fuse instruction is illegal if the compatibility condition is not satisfied. If a single entry in S(i:j) or s(i:j) is dynamic, then fusing modes that violate the compatbility condition is undefined beheaviour, e.g.

; Illegal, modes cannot be fused
fuse %0[0,1] : memref<f32x128>              ; %0: memref<f32x8x16,strided<1,10>>
; Undefined behaviour if dynamic stride != 8
fuse %0[0,1] : memref<f32x128,strided<1,?>> ; %0: memref<f32x8x16,strided<1,?>>
Operands#

Op.-No.

Type

Description

1

memref-type

tensor

2

integer-constant

from

3

integer-constant

to

Restrictions#

The memref type of the result must conform with the following rules:

  1. Element type and address space must match the operand’s memref type.

  2. Shape: The mode size of the fused modes is the product of the mode sizes. If one mode is dynamic the fused mode size is dynamic.

    fuse %0[1,3] : memref<f32x32x512x42>               ; %0: memref<f32x32x16x8x4x42>
    fuse %0[1,3] : memref<f32x32x?x42,strided<1,32,?>> ; %0: memref<f32x32x16x?x4x42,strided<1,16,?,?,?>>
    
  3. Stride: Strides remain unchanged or are replaced by ‘?’.

    fuse %0[1,2] : memref<f32x32x32x2,strided<1,48,1536>> ; %0: memref<f32x32x16x2x2,strided<1,48,768,1536>>
    fuse %0[1,2] : memref<f32x32x32x2,strided<1,?,?>>     ; %0: memref<f32x32x16x2x2,strided<1,48,768,1536>>
    fuse %0[0,1] : memref<f32x?x32,strided<1,?>>          ; %0: memref<f32x8x?x32,strided<1,?,?>>
    

If#

multi-value-instruction =/ "if" local-identifier ["->" "(" return-type-list ")"]
                           region ["else" region]
Overview#

An if statement. Both regions are mixed regions.

The condition (first operand) must have boolean type.

Returns#

The if instruction may return multiple values, where the number of values and the value types are given by the return-type-list. If values are returned, the last instruction in both the “then”-region and the “else”-region must be a yield instruction (the “else”-region cannot be omitted).

Example:

%1 = cmp.lt %0, 16 : i32
%x = if %1 -> (i32) {
    yield (%0)
} else {
    %c16 = constant 16 : i32
    yield (%c16)
}

Load#

value-instruction           =/ "load" local-identifier "[" [local-identifier-list] "]"
                                      ":" scalar-or-memref-type
scalar-or-memref-type       =  number-type / memref-type
Overview#

Load the element given by the index list from a memref or group. The number of indices must match the order of the memref and a single index must be given for a group.

Operands#

Op.-No.

Type

Description

1

memref-type / group-type

tensor

2…

index

index list

Returns#

A value of the memref’s element type or the group’s memref type. Examples:

  1. load %0[] : f32 ; %0: memref<f32>

  2. load %0[5, %1] : f32 ; %0: memref<f32x10x?>

  3. load %0[%1] : memref<f32x42> ; %0: group<memref<f32x42>x?>

  4. load %0[%1] : memref<f32x42> ; %0: group<memref<f32x42>x?, offset: ?>

Math (unary)#

math-unary-type         =  "cos" /
                           "sin" /
                           "exp" /
                           "exp2" /
                           "log" /
                           "log2" /
                           "native_cos" /
                           "native_sin"
                           "native_exp" /
                           "native_exp2"
                           "native_log" /
                           "native_log2"
value-instruction       =/ math-unary-type local-identifier ":" number-type
Overview#

Unary math operation on scalars. The operand must have the same type as the returned value.

The following table shows the operations’ description and the types that are allowed for the operation.

Op

Allowed type

Description

cos

floating-type

Compute cosine function

sin

floating-type

Compute sine function

exp

floating-type / complex-type

Compute base-e exponential function

exp2

floating-type / complex-type

Compute base-2 exponential function

log

floating-type

Compute base-e logarithm function

log2

floating-type

Compute base-2 logarithm function

native_cos

floating-type

Compute cosine function with implementation-defined error

native_sin

floating-type

Compute sine function with implementation-defined error

native_exp

floating-type / complex-type

Compute base-e exponential function with implementation-defined error

native_exp2

floating-type / complex-type

Compute base-2 exponential function with implementation-defined error

native_log

floating-type

Compute base-e logarithm function with implementation-defined error

native_log2

floating-type

Compute base-2 logarithm function with implementation-defined error

Size#

value-instruction       =/ "size" local-identifier "[" integer-constant "]" ":" "index"
Overview#

The size instruction returns the i-th entry of the tensor’s shape, where “i” is given by the integer constant in square brackets. “i” must be in bounds, i.e. \(0 \leq i < \text{order}(tensor)\).

For group types, the group size is returned and “i” must be 0.

Operands#

Op.-No.

Type

Description

1

memref-type / group-type

tensor

2

integer-constant

mode index

Subview#

value-instruction       =/ "subview" local-identifier "[" [index-or-slice-list] "]"
                                     ":" memref-type
index-or-slice-list     =  index-or-slice *("," index-or-slice)
index-or-slice          =  integer-constant-or-identifier [":" integer-constant-or-identifier]
Overview#

The subview instruction returns a view on a tensor.

The first argument must point to a value of memref type. The number of indices in square brackets must match the order of the memref type. The indices are either given as single index or as a slice, where slices are given in offset plus size notation (“%offset : %size”). E.g. the slice “%0 : %1” extracts a block of %1 elements beginning from %0, which is equivalent to the index interval [%0, %0 + %1).

Note

A slice is often defined as “%0 : %1” being the index interval [%0, %1). However, then the compiler needs to figure out whether %1 - %0 is constant or not in order to determine whether the mode size is known at compile-time or not. Therefore, we prefer the offset plus size notation.

Zero sizes are used to encode that a rank-reduction is required, that is, the rank of size 0 is removed from the output memref type. A single index is syntactic sugar for offset plus size 0, e.g. %0 is syntactic sugar for %0:0. (Note that a zero-size rank, e.g. in memref<f32x8x0>, is non-sense, because any multi-index passed to the memref would be out-of-bounds. However, a one-sized rank, e.g. memref<f32x8x1>, might be desirable.) A dynamic size of zero is undefined behaviour.

There is no run-time check whether the indices are within bounds. Offset and size must be of index type. Offset must be non-negative and size must be positive.

Restrictions#

The memref type of the result must conform with the following rules:

  1. Element type and address space must match the operand’s memref type.

  2. Invariant-stride: The stride is not changed or replaced with ‘?’.

    subview %0[4:8,8:4]  : memref<f32x8x4,strided<1,32>> ; %0: memref<f32x32x16>
    subview %0[4:8,8:4]  : memref<f32x8x4,strided<1,?>>  ; %0: memref<f32x32x16>
    
  3. Rank-reduction: A mode accessed by offset only or a mode with size statically known to be 0 is removed from the output tensor.

    subview %0[2:4, %1]   : memref<f32x4>                 ; %0: memref<f32x16x8>
    subview %0[2:4, %1:0] : memref<f32x4>                 ; %0: memref<f32x16x8>
    subview %0[2:4, %1:1] : memref<f64x4x1,strided<1,16>> ; %0: memref<f64x16x8>
    
  4. Output-mode size: The size of the output mode is determined by the size field of a slice and may be dynamic.

    subview %0[%1:4]            : memref<f32x4>                      ; %0: memref<f32x16>
    subview %0[%2:%2]           : memref<f32x?>                      ; %0: memref<f32x16>
    subview %0[2:4, %2:%2, 6:7] : memref<f32x4x?x7,strided<1,16,672> ; %0: memref<f32x16x42x13>
    subview %0[2:4, %2:%2, 6:7] : memref<f32x4x?x7,strided<1,?,?>    ; %0: memref<f32x16x42x13,strided<1,?,?>>
    

Store#

instruction     =/ "store" local-identifier ","
                           local-identifier "[" [local-identifier-list] "]"
Overview#

Store a scalar value (first operand) in a memref (second operand) at the position given by the index list. The number of indices must match the order of the memref.

Note: Store should only be used in SPMD regions as otherwise the same memory location is written from all work-items.

Operands#

Op.-No.

Type

Description

1

number-type

value

2

memref-type

tensor

3…

index

index list

Restrictions#
  • \(\text{type}(value) = \text{element_type}(tensor)\)

Yield#

instruction                 =/ "yield" "(" [local-identifier-list] ")"
Overview#

Yield returns values from an if or for instruction.

Operands#

Op.-No.

Type

Description

1…

boolean-type / number-type / coopmatrix-type

value

SPMD instructions#

Builtin (SPMD)#

spmd-builtin-type       =  "subgroup_id" comp3 /
                           "subgroup_linear_id"     /
                           "subgroup_local_id"
value-instruction       =/ spmd-builtin-type ":" integer-type
Overview#

Returns a builtin value.

The mode of the subgroup id is selected with the .x, .y, and .z suffix. Each mode starts with zero and is limited by the corresponding num_subgroups mode. That is,

\[\forall d \in \{x,y,z\} : 0 \leq \text{subgroup_id}_d < \text{num_subgroups}_d\]

The subgroup linear id combines the x, y, and z modes of the subgroup id as following (note that that \(\text{subgroup_id}_z = 0\) due to \(\text{num_subgroups}_z = 1\)):

\[\text{subgroup_linear_id} = \text{subgroup_id}_x + \text{subgroup_id}_y\cdot \text{num_subgroups}_x\]

The subgroup local id is the invocation id within the subgroup and ranges from 0 to subgroup_size-1.

The following table shows the builtins’ description and the types that are returned.

Builtin

Type

OpenCL analogue

Description

subgroup_id

i32

N/A

Returns the x, y, or z mode of the subgroup id

subgroup_linear_id

i32

get_sub_group_id

Returns linear subgroup id

subgroup_local_id

i32

get_sub_group_local_id

Returns the local invocation id in the subgroup

Cooperative matrix apply#

value-instruction           =/ "cooperative_matrix_apply"
                               "(" local-identifier "," local-identifier "," local-identifier ")"
                               "=" local-identifier
                               "->" coopmatrix-type region
Overview#

Apply an action on every component of a coopmatrix and update the component with the result of the action. The action is described in the parallel region of the instruction.

Arguments#

The first three local identifier introduce SSA values for the row index, column index, and component value. The row and columns values have i32 type and the component value has the same component type as the resulting coopmatrix type. The fourth identifer, after “in”, gives the input coopmatrix, and its type must match the result type.

The region must yield exactly one value whose scalar type is identical to the component type of the coopmatrix.

Example:

%0 = ... ; contains a coopmatrix of type coopmatrix<f32x16x16,matrix_acc>
%1 = cooperative_matrix_apply (%i,%j,%v)=%0 -> coopmatrix<f32x16x16,matrix_acc> {
    %mask = cmp.le %i, %j : bool
    %exp_v_masked = if %mask -> (f32) {
        %exp_v = math.native_exp %v : f32
        yield (%exp_v)
    } else {
        %zero = constant 0.0 : f32
        yield (%zero)
    }
    yield (%exp_v_masked)
}
; The entries of %1 are given by %1[i,j] = exp(%0[i,j]) if i <= j else 0

Cooperative matrix atomic load#

value-instruction =/ "cooperative_matrix_atomic_load" [transpose] [checked-flag]
                                                      [memory_scope] [memory_semantics]
                     local-identifier "[" local-identifier "," local-identifier "]"
                     ":" coopmatrix-type
Overview#

Atomic matrix load. Atomic is meant component-wise, there is no atomicity w.r.t. to the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.

Except for atomicity, the instruction is idential to the Cooperative matrix load instruction.

Cooperative matrix atomic store#

instruction     =/ "cooperative_matrix_atomic_store" [transpose] [checked-flag]
                                                     [memory_scope] [memory_semantics]
                   local-identifier "," local-identifier
                   "[" local-identifier "," local-identifier "]"
Overview#

Atomic matrix store. Atomic is meant component-wise, there is no atomicity w.r.t. to the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.

Except for atomicity, the instruction is idential to the Cooperative matrix store instruction.

Cooperative matrix atomic update#

cooperative-matrix-atomic-update-op = "cooperative_matrix_atomic_add" /
                                      "cooperative_matrix_atomic_max" /
                                      "cooperative_matrix_atomic_min"
value-instruction =/ cooperative-matrix-atomic-update-op [transpose] [checked-flag]
                                                         [memory_scope] [memory_semantics]
                     local-identifier "," local-identifier
                     "[" local-identifier "," local-identifier "]"
                     ":" coopmatrix-type
Overview#

Atomic matrix update. Atomic is meant component-wise, there is no atomicity w.r.t. to the whole matrix. The default scope is “work_group” and the default memory semantics is “relaxed”.

See Cooperative matrix store instruction for further description.

Cooperative matrix construct#

value-instruction       =/ "cooperative_matrix_construct" local-identifier ":" coopmatrix-type
Overview#

Returns a coopmatrix whose entries are initialized to the given dynamically uniform number. The type of the number must match the component type of the coopmatrix type.

Operands#

Op.-No.

Type

Description

1

number-type

Number

Restrictions#

The number must be dynamically uniform.

Cooperative matrix extract#

value-instruction       =/ "cooperative_matrix_extract"
                            local-identifier "[" integer-constant "]" ":" number-type
Overview#

Return an element of the coopmatrix’s work-item vector. The index is supplied in square brackets and must be greater or equal than zero and smaller than the length of the work-item vector, cf. Coopmatrix layout.

The scalar type of the returned value must match the component type of the coopmatrix.

Operands#

Op.-No.

Type

Description

1

coopmatrix-type

Cooperative matrix

2

integer-constant

Index into work-item vector

Cooperative matrix insert#

value-instruction       =/ "cooperative_matrix_insert" local-identifier ","
                            local-identifier "[" integer-constant "]" ":" coopmatrix-type
Overview#

Return a copy the coopmatrix, while modifying one entry of the coopmatrix. The index is supplied in square brackets and must be greater or equal than zero and smaller than the length of the work-item vector, cf. Coopmatrix layout.

The coopmatrix type of the returned value must match the coopmatrix type of the incoming matrix. The scalar type of the inserted scalar must match the component type of the coopmatrix.

Operands#

Op.-No.

Type

Description

1

number-type

Inserted scalar

2

coopmatrix-type

Cooperative matrix

3

integer-constant

Index into work-item vector

Cooperative matrix load#

value-instruction           =/ "cooperative_matrix_load" [transpose] [checked-flag]
                               local-identifier "[" local-identifier "," local-identifier "]"
                               ":" coopmatrix-type
checked-flag                = ".rows_checked" / ".cols_checked" / ".both_checked"
Overview#

Load a cooperative matrix from a 2d-memref at the position given by the indices in square brackets. The position gives the starting row and column index, that is, when a coopmatrix of size \(X\times Y\) is loaded from memref \(M\) at position \(x, y\), then the components \(A_{ij}\) of the coopmatrix are given by

\[\forall i \in [0,X), j \in [0,Y): A_{ij} := M[(x + i) S_1 + (y + j) S_2],\]

where \(S_1\) and \(S_2\) are the entries of the memref’s stride array. When the transpose modifier “.t” is given, we have

\[\forall i \in [0,X), j \in [0,Y): A_{ij} := M[(x + j) S_1 + (y + i) S_2]\]

When the checked flag is set, the following out-of-bound checks are added (with memref shape \(s_1\times s_2\)):

Flag

Description

.n.rows_checked

\(A_{ij} := M[...] \text{ if } 0 \leq x+i < s_1 \text{ else } 0\)

.t.rows_checked

\(A_{ij} := M[...] \text{ if } 0 \leq y+i < s_2 \text{ else } 0\)

.n.cols_checked

\(A_{ij} := M[...] \text{ if } 0 \leq y+j < s_2 \text{ else } 0\)

.t.cols_checked

\(A_{ij} := M[...] \text{ if } 0 \leq x+j < s_1 \text{ else } 0\)

.n.both_checked

.n.rows_checked.n and .n.cols_checked

.t.both_checked

.t.rows_checked.t and .t.cols_checked

Operands#

Op.-No.

Type

Description

1

memref-type

M

2

index

x

3

index

y

Restrictions#
  • \(\text{order}(M) = 2\)

  • \(\text{component_type}(A) = \text{element_type}(M)\)

  • All arguments must be dynamically uniform.

Cooperative matrix mul add#

value-instruction           =/ "cooperative_matrix_mul_add" local-identifier ","
                               local-identifier "," local-identifier ":" coopmatrix-type
Overview#

Matrix mul add returns the value of

\[D := AB + C,\]

where A, B, and C are matrices given by the three operands.

The number of rows of matrix A,C, and D must be a multiple of the subgroup size.

Operands#

Op.-No.

Type

Use

Description

1

coopmatrix-type

matrix_a

A

2

coopmatrix-type

matrix_b

B

3

coopmatrix-type

matrix_acc

C

Restrictions#
  • \(\forall X\in\{A,C,D\}: \text{rows}(X) \bmod \text{subgroup_size} = 0\)

  • \(\text{columns}(A) = \text{rows}(B)\)

  • \(\text{rows}(C) = \text{rows}(A) \land \text{columns}(C) = \text{columns}(B)\)

  • \(\text{shape}(D) = \text{shape}(C)\)

  • \(\text{use}(D) = \text{matrix_acc}\)

  • \(\text{promote}(\text{component_type}(A), \text{component_type}(B)) \preceq \text{component_type}(C)\)

  • Cast of \(\text{component_type}(C)\) to \(\text{component_type}(D)\) must be allowed

Cooperative matrix prefetch#

instruction     =/ "cooperative_matrix_prefetch" integer-constant ","
                    local-identifier "[" local-identifier "," local-identifier "]" ","
                    integer-constant "," integer-constant
Overview#

Cooperatively prefetch memory into device cache. The cache level is given by the first non-negative integer constant, where “0” is the cache closest the core and core distance increases with increasing cache level. The prefetch instruction is ignored if the cache level does not exist in the target device. The position in square brackets gives the starting row and column index. The last two positive integer constants give the size of the memory region to fetch (in rows by columns). The following memory locations are prefetched:

\[\{\forall i \in [0,X), j \in [0,Y): M[(x + i) S_1 + (y + j) S_2]\}\]

Prefetch is an optimization hint and may be disregarded by the compiler.

Operands#

Op.-No.

Type

Description

1

integer-constant

Cache-level

2

memref-type

M

3

index

x

4

index

y

5

integer-constant

X

6

integer-constant

Y

Restrictions#
  • All arguments must be dynamically uniform.

Cooperative matrix reduce#

coopmatrix-reduce-op    =  "cooperative_matrix_reduce_add" /
                           "cooperative_matrix_reduce_max" /
                           "cooperative_matrix_reduce_min" /
value-instruction       =/ coopmatrix-reduce-op reduce-mode local-identifier ":" coopmatrix-type
reduce-mode             =  ".row" / ".column"
Overview#

Computes the sum, maximum, or minimum over either the rows or columns of a coopmatrix.

The component type and use of the the returned value’s coopmatrix type must match the component type and use of the incoming matrix.

For a row reduction the resulting shape must be \(M\times 1\) and for a column reduction the resulting shape must be \(1\times N\), where the shape of the incoming matrix is \(M\times N\).

Operands#

Op.-No.

Type

Description

1

coopmatrix-type

Incoming cooperative matrix

Restrictions#
  • \(\text{rows}(A) \bmod \text{subgroup_size} = 0\)

Cooperative matrix scale#

value-instruction           =/ "cooperative_matrix_scale" local-identifier "," local-identifier
                               ":" coopmatrix-type
Overview#

Scale a coopmatrix by a scalar. The scalar type of the scalar and the component type of the coopmatrix must match, and the returned must have the same coopmatrix type as the matrix operand.

Operands#

Op.-No.

Type

Description

1

number-type

scalar

2

coopmatrix-type

matrix

Restrictions#
  • \(\text{type}(scalar) = \text{component_type}(matrix)\)

  • \(\text{type}(result) = \text{type}(matrix)\)

Cooperative matrix store#

instruction     =/ "cooperative_matrix_store" [transpose] [checked-flag]
                   local-identifier "," local-identifier
                   "[" local-identifier "," local-identifier "]"
Overview#

Store a cooperative matrix value in a 2d-memref at the position given by the indices in square brackets. The position gives the starting row and column index, that is, when a coopmatrix of size \(X\times Y\) is written to memref \(M\) at position \(x, y\), then the components \(A_{ij}\) of the coopmatrix are written to

\[\forall i \in [0,X), j \in [0,Y): M[(x + i) S_1 + (y + j) S_2] := A_{ij},\]

where \(S_1\) and \(S_2\) are the entries of the memref’s stride array. When the transpose modifier “.t” is given, we have

\[\forall i \in [0,X), j \in [0,Y): M[(x + j) S_1 + (y + i) S_2] := A_{ij}\]

When the checked flag is set, the following out-of-bound checks are added (with memref shape \(s_1\times s_2\)):

Flag

Description

.n.rows_checked

Only execute store if \(0 \leq x+i < s_1\)

.t.rows_checked

Only execute store if \(0 \leq y+i < s_2\)

.n.cols_checked

Only execute store if \(0 \leq y+j < s_2\)

.t.cols_checked

Only execute store if \(0 \leq x+j < s_1\)

.n.both_checked

.n.rows_checked + .n.cols_checked

.t.both_checked

.t.rows_checked + .t.cols_checked

Operands#

Op.-No.

Type

Description

1

coopmatrix-type

A

2

memref-type

M

3

index

x

4

index

y

Restrictions#
  • \(\text{component_type}(A) = \text{element_type}(B)\)

  • All arguments must be dynamically uniform.

Subgroup broadcast#

value-instruction       =/ "subgroup_broadcast" local-identifier "," local-identifier ":" number-type
Overview#

Broadcast a scalar to all work-items in the subgroup. The scalar type of the first operand and the type of the result must match. The second identifier must have i32 type.

Operands#

Op.-No.

Type

Description

1

number-type

Value that is to be distributed to all work-items of the sub-group

2

i32

Subgroup local index that identifies the work-item whose value is returned to all other work-items

Restrictions#
  • The second operand must be dynamically uniform.

Subgroup operation#

subgroup-operation-type = "subgroup_exclusive_scan_add" /
                          "subgroup_exclusive_scan_max" /
                          "subgroup_exclusive_scan_min" /
                          "subgroup_inclusive_scan_add" /
                          "subgroup_inclusive_scan_max" /
                          "subgroup_inclusive_scan_min" /
                          "subgroup_reduce_add" /
                          "subgroup_reduce_max" /
                          "subgroup_reduce_min"
value-instruction       =/ subgroup-operation-type local-identifier ":" number-type
Overview#

Let \([x_0,x_1,\dots,x_{n-1}]\) be the input vector contributed by a subgroup of size n. (The work-item with subgroup local id i contributes \(x_i\).) Let \(\diamond\) be the binary operator and I the identity. We define the output vector of size n for the group operations in the following table:

Operation type

Result

exclusive_scan

\([I, x_0, (x_0 \diamond x_1), \dots, x_0 \diamond x_1 \diamond \dots \diamond x_{n-2}]\)

inclusive_scan

\([x_0, (x_0 \diamond x_1), \dots, x_0 \diamond x_1 \diamond \dots \diamond x_{n-1}]\)

reduce

\([s,s,\dots,s] \text{ with } s := x_0 \diamond \dots \diamond x_{n-1}\)

Add#

Computes the subgroup operation with \(\diamond:=+\) and \(I:=0\).

Max#

Computes the subgroup operation with \(\diamond:=\max\) and identity as given in the following table:

Identity

Value

integer-type

Smallest integer representable by integer type

floating-type

\(-\infty\)

complex type

Forbidden

Min#

Computes the subgroup operation with \(\diamond:=\min\) and identity as given in the following table:

Identity

Value

integer-type

Largest integer representable by integer type

floating-type

\(+\infty\)

complex type

Forbidden