Benchmark
Introduction
Intel Neural Compressor provides the incbench command to launch performance benchmarks on Intel CPUs.
To reach peak performance on an Intel Xeon CPU, a single instance should not span multiple NUMA nodes.
Therefore, by default, incbench launches 1 instance on the first NUMA node.
Support Matrix
Platform | Status |
---|---|
Linux | ✔ |
Windows | ✔ |
Usage
Parameters | Default | Comments |
---|---|---|
num_instances | 1 | Number of instances |
num_cores_per_instance | None | Number of cores in each instance |
C, cores | 0-${num_cores_on_NUMA-1} | Range of cores visible to the benchmark |
cross_memory | False | Whether to allocate memory across NUMA nodes |
Note: cross_memory is set to True only when memory is insufficient.
General Use Cases
- incbench main.py: run 1 instance on NUMA:0.
- incbench --num_i 2 main.py: run 2 instances on NUMA:0.
- incbench --num_c 2 main.py: run multiple instances with 2 cores per instance on NUMA:0.
- incbench -C 24-47 main.py: run 1 instance on cores 24-47.
- incbench -C 24-47 --num_c 4 main.py: run multiple instances with 4 cores per instance on cores 24-47.
Note:
- num_i works the same as num_instances
- num_c works the same as num_cores_per_instance
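To make the last use case concrete, the following sketch shows one way a core range such as 24-47 could be divided into instances of 4 cores each. It is not incbench's implementation: the helper names parse_core_range and partition_cores are hypothetical, and the handling of leftover cores (dropped here) is an assumption.

```python
def parse_core_range(spec):
    """Expand a range string such as "24-47" into a list of core IDs."""
    start, end = (int(part) for part in spec.split("-"))
    return list(range(start, end + 1))


def partition_cores(cores, num_cores_per_instance):
    """Split cores into consecutive groups of equal size.

    Assumption: cores left over after the last full group are not used.
    """
    last_start = len(cores) - num_cores_per_instance
    return [
        cores[i:i + num_cores_per_instance]
        for i in range(0, last_start + 1, num_cores_per_instance)
    ]


# Mirrors "incbench -C 24-47 --num_c 4 main.py" under the assumptions above.
groups = partition_cores(parse_core_range("24-47"), 4)
for index, group in enumerate(groups):
    print(f"instance {index}: cores {group[0]}-{group[-1]}")
```

Under these assumptions, the 24 visible cores yield 6 instances, each pinned to its own block of 4 consecutive cores.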
Dump Throughput and Latency Summary
To merge benchmark results from multiple instances, incbench automatically scans the log files for "throughput" and "latency" messages that match the following patterns.
throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
Demo usage
print("Throughput: {:.3f} samples/sec".format(throughput))
print("Latency: {:.3f} ms".format(latency * 10**3))
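As a quick check that output printed this way is picked up, the sketch below applies the two patterns to sample log text from two instances. The log contents and the extract helper are hypothetical, and the way results are combined (summing throughput, averaging latency) is an assumption for illustration rather than a description of what incbench reports.

```python
import re

# Patterns copied from the section above.
throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"

# Hypothetical log output, one string per instance.
logs = [
    "Throughput: 120.500 samples/sec\nLatency: 8.299 ms\n",
    "Throughput: 118.250 samples/sec\nLatency: 8.457 ms\n",
]


def extract(pattern, text):
    """Return (value, unit) for the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


throughputs = [extract(throughput_pattern, log) for log in logs]
latencies = [extract(latency_pattern, log) for log in logs]

# Assumed aggregation: total throughput and mean latency across instances.
total_throughput = sum(value for value, _ in throughputs)
mean_latency = sum(value for value, _ in latencies) / len(latencies)

print(f"Total throughput: {total_throughput:.3f} {throughputs[0][1]}")
print(f"Mean latency: {mean_latency:.3f} {latencies[0][1]}")
```

Each pattern captures the numeric value in its first group and the unit in its second, so a script only needs to keep the "Throughput:" or "Latency:" prefix and the number-then-unit order for its lines to be recognized.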