Benchmarks Framework Guide#
The goal of the benchmarks is to provide performance guidance for various use-case scenarios of the Intel® Data Mover Library (Intel® DML). To cover these cases, the benchmarks provide several performance metrics and modes of operation.
Attention
Currently, the Intel DML benchmarks framework offers limited support.
Only mem_move cases are currently supported.
Compiling and running benchmarks is available only on Linux operating systems.
Additionally, running benchmarks is available only on Intel® Skylake (or later) microarchitecture.
Quick Start#
The benchmarks are based on the Google Benchmark library and are built as part of Intel DML. Refer to the Installation page for details on how to build the library and resolve all prerequisites.
The example below demonstrates running a benchmark on a memory-move operation via the Low-Level C API using the accelerator, a queue size of 16, and 1 MB of data.
Warning
Make sure to resolve the requirements for running on the hardware path and to configure the Intel® Data Streaming Accelerator (Intel® DSA) before executing the example.
./<install_dir>/bin/dml_benchmarks --benchmark_min_time=0.3 --benchmark_filter="move/api:c/path:.*/exec:.*/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/.*"
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
move/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time 833022 ns 831025 ns 503 Latency=833.022us Latency/Op=52.0639us Throughput=20.1402G/s
move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time/threads:1 763158 ns 760850 ns 549 Latency=763.158us Latency/Op=47.6974us Throughput=21.9839G/s
move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time/threads:2 361855 ns 722324 ns 1134 Latency=361.855us Latency/Op=45.2319us Throughput=23.1822G/s
move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time/threads:4 182710 ns 729806 ns 2132 Latency=182.71us Latency/Op=45.6775us Throughput=22.9561G/s
move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time/threads:8 144228 ns 1151468 ns 2896 Latency=144.228us Latency/Op=72.114us Throughput=14.5405G/s
move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/out_mem:cc_llc/timer:full/size:1048576/real_time/threads:16 81297 ns 1297643 ns 5120 Latency=81.2965us Latency/Op=81.2965us Throughput=12.8982G/s
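The user counters in the sample output above appear to be related by simple arithmetic. The sketch below is inferred from those numbers only, not taken from the framework's source. It assumes that Latency/Op is the loop latency divided by the number of operations each thread keeps in flight (qsize divided by threads), and that Throughput is the data size divided by Latency/Op, reported in units of 10^9 bytes per second. The function names are hypothetical.

```python
def latency_per_op_us(latency_us, qsize, threads=1):
    """Assumed relation: loop latency divided by operations per thread."""
    ops_per_thread = qsize // threads
    return latency_us / ops_per_thread


def throughput_g_per_s(size_bytes, lat_per_op_us):
    """Assumed relation: bytes per operation over time per operation (10^9 B/s)."""
    return size_bytes / (lat_per_op_us * 1e-6) / 1e9


# Values copied from the exec:sync row (qsize:16, size:1048576).
sync_lat_op = latency_per_op_us(833.022, qsize=16)
sync_tput = throughput_g_per_s(1048576, sync_lat_op)

# Values copied from the exec:async, threads:8 row.
async_lat_op = latency_per_op_us(144.228, qsize=16, threads=8)
async_tput = throughput_g_per_s(1048576, async_lat_op)

print(f"sync:  {sync_lat_op:.4f} us/op, {sync_tput:.4f} G/s")
print(f"async: {async_lat_op:.4f} us/op, {async_tput:.4f} G/s")
```

Under these assumptions the computed values agree with the Latency/Op and Throughput counters reported in the corresponding rows, which is a quick way to sanity-check a run.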
Key Terms#
API: Currently supported are the Low-Level C API (case naming: api:c), the High-Level C++ API (case naming: api:cpp), and the standard glibc mem_move and mem_copy implementations (case naming: api::glibc).
Path: Represents the execution path for the library. Case naming: path:dsa for executing on the accelerator, path:cpu for running on the CPU.
Warning
Executing on path:dsa requires the accelerator to be already configured.
Refer to the accelerator configuration page.
Execution mode: Defines synchronous or asynchronous execution. Case naming: exec:async, exec:sync.
Sync mode: In each measurement loop, one call to an Intel DML operation is submitted and is always followed by a blocking wait. Thus only one operation is processed at a time and only one engine of the accelerator is loaded. Sync mode is not affected by the --threads argument. The main output of sync mode is the Latency metric. --queue_size can be used to submit several operations at once to measure the latency of various queue sizes.
Async mode: Each case spawns --threads threads, and each thread runs its own measurement loop. In each loop, the benchmark submits --queue_size/--threads operations without a blocking wait and resubmits each operation as soon as it completes, always keeping the device busy. The main output of async mode is the Throughput metric. For small workloads, a higher number of threads may be required to saturate the devices; for big workloads, even one thread may reach capacity.
Blocks: Input data is split into blocks of size --block_size=XXXX (with XXXX in bytes), and each block is processed separately.
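The key terms map onto the slash-separated key:value fields seen in the case names of the sample output. The following is a minimal sketch, assuming that layout holds for every case; parse_case_name and ops_per_thread are hypothetical helpers, not part of the framework. It splits a case name into its fields and derives how many operations each thread keeps in flight in async mode (queue size divided by threads).

```python
def parse_case_name(name):
    """Split a case name into the operation and its key:value fields."""
    operation, *fields = name.split("/")
    params = {}
    for field in fields:
        key, sep, value = field.partition(":")
        if sep:  # skip bare tokens such as "real_time"
            params[key] = value
    return operation, params


def ops_per_thread(params):
    """Operations kept in flight by each thread: queue size / threads."""
    threads = int(params.get("threads", 1))
    return int(params["qsize"]) // threads


# Case name copied from the sample output in Quick Start.
name = ("move/api:c/path:dsa/exec:async/qsize:16/bsize:0/in_mem:llc/"
        "out_mem:cc_llc/timer:full/size:1048576/real_time/threads:4")
operation, params = parse_case_name(name)
print(operation, params["exec"], ops_per_thread(params))
```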
Using the Benchmarks Framework#
The Intel DML benchmark framework is based on the Google Benchmark library. Run ./dml_benchmarks --help to get a full list of supported commands and input arguments with detailed descriptions.
Attention
If no accelerators are available on the system, you can use --no_hw to suppress the Intel DSA initialization check warning.
To set up a specific run configuration, use --benchmark_filter, which takes a regex that is matched against the case name.
For example, use the following expression to launch a copy operation using the Low-Level C API with synchronous execution on the CPU: --benchmark_filter="copy/api:c/path:cpu/exec:sync/.*".
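To preview which cases a filter expression would select, the regex can be tried against a list of case names. The sketch below uses Python's re module as a stand-in for the benchmark binary's own regex engine, which may differ in details, and assumes the filter matches anywhere in the name. Only the first case name is copied from the sample output; the others are hypothetical names that follow the same layout.

```python
import re

# First name is from the sample output; the rest are hypothetical.
case_names = [
    "move/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:llc/"
    "out_mem:cc_llc/timer:full/size:1048576/real_time",
    "copy/api:c/path:cpu/exec:sync/qsize:1/bsize:0/in_mem:llc/"
    "out_mem:cc_llc/timer:full/size:4096/real_time",
    "copy/api:cpp/path:cpu/exec:sync/qsize:1/bsize:0/in_mem:llc/"
    "out_mem:cc_llc/timer:full/size:4096/real_time",
    "copy/api:c/path:cpu/exec:async/qsize:16/bsize:0/in_mem:llc/"
    "out_mem:cc_llc/timer:full/size:4096/real_time/threads:1",
]


def select_cases(pattern, names):
    """Return the names in which the pattern matches anywhere."""
    regex = re.compile(pattern)
    return [n for n in names if regex.search(n)]


selected = select_cases(r"copy/api:c/path:cpu/exec:sync/.*", case_names)
for case in selected:
    print(case)
```

Note that api:c/ in the pattern does not match api:cpp/, so only the Low-Level C API sync case on the CPU is selected from this list.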
Executing using Accelerators#
Attention
It is the user’s responsibility to configure the accelerator and ensure the availability of the device(s).
Make sure to resolve the requirements for running on the hardware path and to configure the accelerator before executing the example.
The benchmark framework does not support choosing a specific Intel DSA instance for execution.
However, it is possible to limit execution to devices from a certain NUMA node only, using the --node=<integer> option.
Note
By default, when --node=<integer> is not used or when --node is set to -1, the behavior is as follows:
If the Intel DML version is < 1.2.0, the library will auto-detect the NUMA node of the calling process and use the device(s) located on the same NUMA node.
If the Intel DML version is >= 1.2.0, the library will use the device(s) located on the socket of the calling thread.
Latency Tests#
For reporting or tracking the latency metric, use sync mode, a single Intel DSA instance, and a single thread.
Below are examples for mem_copy (copy) and mem_move (move) using a 4 KB data size (size:4096):
sudo ./<install_dir>/bin/dml_benchmarks --benchmark_filter="copy/.*/path:dsa/exec:sync/.*/size:4096/.*" --benchmark_min_time=0.1
sudo ./<install_dir>/bin/dml_benchmarks --benchmark_filter="move/.*/path:dsa/exec:sync/.*/size:4096/.*" --benchmark_min_time=0.1
Throughput Tests#
For reporting or tracking the throughput metric, use async mode, 1 to 4 Intel DSA devices, and multiple threads.
Below are examples for mem_copy (copy) and mem_move (move) using a 4 KB data size (size:4096) and queue_size=128 on the Low-Level C API:
sudo ./<install_dir>/bin/dml_benchmarks --benchmark_filter="copy/api:c/path:dsa/exec:async/qsize:128/.*/size:4096/.*/threads:8" --benchmark_min_time=0.5
sudo ./<install_dir>/bin/dml_benchmarks --benchmark_filter="move/api:c/path:dsa/exec:async/qsize:128/.*/size:4096/.*/threads:8" --benchmark_min_time=0.5
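When latency and throughput runs are scripted, the filter expressions above can be composed from a few parameters. This is a minimal sketch: build_filter and build_command are hypothetical helpers, the binary path keeps the <install_dir> placeholder from the examples, and only the --benchmark_filter and --benchmark_min_time flags shown in this guide are used. The commands are printed, not executed.

```python
import shlex


def build_filter(operation, exec_mode, size, api=None, qsize=None, threads=None):
    """Compose a filter regex for path:dsa following the case-name layout."""
    api_part = f"api:{api}" if api else ".*"
    qsize_part = f"qsize:{qsize}/.*" if qsize else ".*"
    pattern = (f"{operation}/{api_part}/path:dsa/exec:{exec_mode}/"
               f"{qsize_part}/size:{size}/.*")
    if threads:
        pattern += f"/threads:{threads}"
    return pattern


def build_command(binary, pattern, min_time):
    """Return the argument list for one benchmark invocation."""
    return [binary,
            f"--benchmark_filter={pattern}",
            f"--benchmark_min_time={min_time}"]


binary = "./<install_dir>/bin/dml_benchmarks"  # placeholder path, as in the guide

latency_filter = build_filter("copy", "sync", 4096)
throughput_filter = build_filter("copy", "async", 4096,
                                 api="c", qsize=128, threads=8)

print(shlex.join(build_command(binary, latency_filter, 0.1)))
print(shlex.join(build_command(binary, throughput_filter, 0.5)))
```

Passing the argument list to a process launcher instead of a shell string avoids quoting problems with the regex characters in the filter.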