The matrix multiplication (MatMul) primitive computes the product of two 2D tensors with optional bias addition:
\[ dst(m, n) = \sum_{k=0}^{K-1} \left( src(m, k) \cdot weights(k, n) \right) + bias(m, n) \]
The MatMul primitive also supports batching multiple independent matrix multiplication operations, in which case the tensors must be 3D:
\[ dst(mb, m, n) = \sum_{k=0}^{K-1} \left( src(mb, m, k) \cdot weights(mb, k, n) \right) + bias(mb, m, n) \]
The bias tensor is optional and supports implicit broadcast semantics: any of its dimensions can be 1, in which case the same value is used across the corresponding dimension. However, \(bias\) must have the same number of dimensions as \(dst\).
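The formulas and the bias broadcast rule above can be sketched as a plain reference loop. This is an illustration only, not the library's implementation: `Tensor3`, `bcast`, and `matmul_ref` are hypothetical names introduced here, and the tensors are assumed to be dense, row-major, and `float`. A 2D problem corresponds to `MB = 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of the library): a dense row-major 3D tensor.
struct Tensor3 {
    std::size_t d0, d1, d2;
    std::vector<float> data;

    Tensor3(std::size_t a, std::size_t b, std::size_t c)
        : d0(a), d1(b), d2(c), data(a * b * c, 0.0f) {}

    float &at(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * d1 + j) * d2 + k];
    }
    float at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(i * d1 + j) * d2 + k];
    }
    // Broadcast read: a dimension of size 1 ignores the index passed for it.
    float bcast(std::size_t i, std::size_t j, std::size_t k) const {
        return at(d0 == 1 ? 0 : i, d1 == 1 ? 0 : j, d2 == 1 ? 0 : k);
    }
};

// dst(mb, m, n) = sum over k of src(mb, m, k) * weights(mb, k, n) + bias(mb, m, n).
// Pass bias = nullptr to skip the bias term.
Tensor3 matmul_ref(const Tensor3 &src, const Tensor3 &weights,
                   const Tensor3 *bias = nullptr) {
    const std::size_t MB = src.d0, M = src.d1, K = src.d2, N = weights.d2;
    assert(weights.d0 == MB && weights.d1 == K);

    Tensor3 dst(MB, M, N);
    for (std::size_t mb = 0; mb < MB; ++mb)
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                float acc = bias ? bias->bcast(mb, m, n) : 0.0f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += src.at(mb, m, k) * weights.at(mb, k, n);
                dst.at(mb, m, n) = acc;
            }
    return dst;
}
```

For example, a bias of shape \(1 \times 1 \times N\) adds the same row vector to every row of every batch, while a bias of shape \(1 \times 1 \times 1\) adds a single scalar everywhere.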
The MatMul primitive supports input and output tensors with runtime-specified shapes and memory formats. Runtime-specified dimensions or strides are declared with the DNNL_RUNTIME_DIM_VAL wildcard value during the primitive initialization and creation stage. At the execution stage, the user must pass fully specified memory objects so that the primitive is able to perform the computations. Note that the less information about shapes or formats is available at the creation stage, the less performant the execution will be. In particular, if the shape is not known at the creation stage, one cannot use the special format tag dnnl::memory::format_tag::any to let the implementation choose the most appropriate memory format for the corresponding input or output shapes. On the other hand, runtime-specified shapes enable users to create a primitive once and reuse it for problems of different sizes.
The MatMul primitive supports the following combinations of data types for source, destination, weights, and bias tensors:
Source  Weights  Destination  Bias 

f32  f32  f32  f32 
f16  f16  f16  f16 
bf16  bf16  bf16  bf16, f32 
u8, s8  s8, u8  u8, s8, s32, f32  u8, s8, s32, f32 
The MatMul primitive expects the following tensors:
Dims  Source  Weights  Destination  Bias 

2D  \(M \times K\)  \(K \times N\)  \(M \times N\)  None or \((M \text{ or } 1) \times (N \text{ or } 1)\) 
3D  \(MB \times M \times K\)  \(MB \times K \times N\)  \(MB \times M \times N\)  None or \((MB \text{ or } 1) \times (M \text{ or } 1) \times (N \text{ or } 1)\) 
The MatMul primitive is generally optimized for the case in which memory objects use plain memory formats (with some restrictions; see the table below). However, it is recommended to use the placeholder memory format dnnl::memory::format_tag::any if an input tensor is reused across multiple executions. In this case, the primitive will set the most appropriate memory format for the corresponding input tensor.
The table below shows the combinations of memory formats for which the MatMul primitive is optimized. The memory format of the destination tensor should always be dnnl::memory::format_tag::ab for the 2D case and dnnl::memory::format_tag::abc for the 3D one.
Dims  Logical tensors  MatMul is optimized for the following memory formats 

2D  Source: \(M \times K\) Weights: \(K \times N\)  Source: dnnl_ab or dnnl_ba Weights: dnnl_ab or dnnl_ba 
3D  Source: \(MB \times M \times K\) Weights: \(MB \times K \times N\)  Source: dnnl_abc or dnnl_acb Weights: dnnl_abc or dnnl_acb 
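To make the format names in the table above concrete, the sketch below computes where a logical element lives in memory for the two plain 2D formats. It assumes dense tensors with no padding; `strides_ab`, `strides_ba`, and `offset` are hypothetical helpers introduced for illustration and are not library functions.

```cpp
#include <array>
#include <cstddef>

using Strides2 = std::array<std::size_t, 2>;

// "ab": the last logical dimension is contiguous (row-major storage).
inline Strides2 strides_ab(std::size_t /*rows*/, std::size_t cols) {
    return Strides2{{cols, 1}};
}

// "ba": the first logical dimension is contiguous (transposed storage).
inline Strides2 strides_ba(std::size_t rows, std::size_t /*cols*/) {
    return Strides2{{1, rows}};
}

// Offset, in elements, of logical element (i, j) for the given strides.
inline std::size_t offset(const Strides2 &s, std::size_t i, std::size_t j) {
    return i * s[0] + j * s[1];
}
```

The logical shape stays the same in both cases; only the strides differ. Under the same assumptions, the 3D formats follow the same pattern: abc has strides \((M \cdot K, K, 1)\) and acb has strides \((M \cdot K, 1, M)\) for an \(MB \times M \times K\) source tensor.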
Attributes and postops enable modifying the behavior of the MatMul primitive. The following attributes and postops are supported:
Type  Operation  Restrictions  Description 

Attribute  Output scales  None  Scales the result by given scale factor(s) 
Attribute  Zero points  Int8 computations only  Sets zero point(s) for the corresponding tensors 
Postop  Eltwise  None  Applies an Eltwise operation to the result 
Postop  Sum  None  Adds the operation result to the destination tensor instead of overwriting it 
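The effect of the output scale and the two post-ops can be pictured with a small element-wise sketch. It assumes one particular configuration: a single common output scale, followed by a ReLU Eltwise post-op, followed by a Sum post-op. The function name `apply_scale_relu_sum` is hypothetical, and `acc` stands for the already computed matrix product (plus bias, if any).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// acc: matrix product (plus bias) before attributes are applied.
// dst: holds the previous destination values on entry and the final result on exit.
void apply_scale_relu_sum(const std::vector<float> &acc, float output_scale,
                          std::vector<float> &dst) {
    assert(acc.size() == dst.size());
    for (std::size_t i = 0; i < acc.size(); ++i) {
        float v = output_scale * acc[i]; // Attribute: output scale
        v = std::max(v, 0.0f);           // Postop: Eltwise (ReLU assumed here)
        dst[i] += v;                     // Postop: Sum, accumulate instead of overwrite
    }
}
```

Without the Sum post-op, the last step would be a plain assignment `dst[i] = v`, which overwrites whatever the destination held before.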
To facilitate dynamic quantization, the primitive supports runtime output scales. If the scales are not known at the primitive descriptor creation stage, a user can configure the attributes with output scales set to the DNNL_RUNTIME_F32_VAL wildcard value instead of the actual scales. In this case, the user must provide the scales as an additional input memory object with argument DNNL_ARG_ATTR_OUTPUT_SCALES during the execution stage.
Similarly to runtime output scales, the primitive supports runtime zero points. The wildcard value for zero points is DNNL_RUNTIME_S32_VAL. During the execution stage, the corresponding memory object must be passed as the argument with index (DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_${MEMORY_INDEX}). For example, the zero points for the source tensor are passed with index (DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_SRC).
... u8 data type for weights.
Engine  Name  Comments 

CPU  MatMul Tutorial: Comparison with SGEMM  C++ API example demonstrating MatMul as a replacement for SGEMM functions. 
CPU/GPU  MatMul Tutorial: INT8 Inference  C++ API example demonstrating how to use MatMul fused with ReLU in INT8 inference. 
CPU  MatMul Tutorial: Quantization  C++ API example demonstrating how to perform reduced-precision matrix-matrix multiplication using MatMul and how the accuracy of the result compares with floating-point computations. 