oneDNN object cache optimization

Introduction

The creation overhead of oneDNN objects (primitive, primitive descriptor, and memory) becomes noticeable, especially in small-model latency scenarios.

The oneDNN object cache optimization is an experimental feature that improves model latency by binding oneDNN objects to TensorFlow graph nodes. You can enable this feature by setting the environment variable 'ITEX_CACHE_ONEDNN_OBJECT' to on. By default, it is set to off.
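For illustration, a minimal C++ sketch of how such a flag might be consulted internally follows; the helper function is hypothetical, and only the environment variable name and its on/off values come from the description above.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: reports whether the oneDNN object cache is enabled.
// Only the ITEX_CACHE_ONEDNN_OBJECT variable and its "on"/"off" values are
// taken from the description above; the function itself is illustrative.
bool IsOneDnnObjectCacheEnabled() {
  const char* value = std::getenv("ITEX_CACHE_ONEDNN_OBJECT");
  // Off by default: treat a missing variable or any value other than "on" as off.
  return value != nullptr && std::strcmp(value, "on") == 0;
}
```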

The optimization accounts for the following TensorFlow execution behaviors:

  • Dynamic Shape - TensorFlow supports dynamic shapes, which means a node can receive inputs of different shapes. The optimization invalidates the cache by checking the input dims/shape against the oneDNN meta input (used in layout propagation).

  • Operator Parallel Execution - TensorFlow supports operator parallel execution, which means a node may execute on different scheduler threads. oneDNN is thread safe in this scenario only when a user-managed scratchpad is used and the oneDNN stream is created on demand; the optimization is implemented to satisfy this oneDNN requirement.

  • Concurrent Execution - TensorFlow supports concurrent execution, which means a node may be executed on different threads concurrently. The optimization handles this case with a mutex lock, as shown in the sketch after this list.
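The following C++ sketch ties the three scenarios together. It is an illustration under stated assumptions, not the actual ITEX implementation: the struct and function names are hypothetical, while dnnl::stream, DNNL_ARG_SCRATCHPAD, and the primitive execute() API are standard oneDNN.

```cpp
#include <mutex>
#include <unordered_map>

#include "dnnl.hpp"

// Hypothetical per-node cache entry illustrating the three scenarios above.
struct OneDnnNodeCache {
  std::mutex mu;                       // Concurrent Execution: guards the entry.
  dnnl::memory::dims cached_src_dims;  // Dynamic Shape: key used for invalidation.
  dnnl::convolution_forward prim;      // The cached, stateless primitive.
  bool valid = false;
};

// Dynamic Shape: the cache is stale whenever the incoming input dims differ
// from the dims the cached objects were created for.
bool IsCacheValid(OneDnnNodeCache& cache, const dnnl::memory::dims& src_dims) {
  std::lock_guard<std::mutex> lock(cache.mu);  // serialize concurrent visits
  return cache.valid && cache.cached_src_dims == src_dims;
}

// Operator Parallel Execution: a cached primitive may run on different
// scheduler threads only if each execution supplies its own user-managed
// scratchpad and creates its dnnl::stream on demand.
void ExecuteCached(OneDnnNodeCache& cache, dnnl::engine& eng,
                   std::unordered_map<int, dnnl::memory> args,
                   const dnnl::memory& scratchpad) {
  dnnl::stream s(eng);                     // stream created on demand
  args[DNNL_ARG_SCRATCHPAD] = scratchpad;  // per-execution scratchpad buffer
  cache.prim.execute(s, args);
  s.wait();
}
```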

Optimization in convolution

The convolution optimization caches the oneDNN objects dnnl::memory, dnnl::primitive, dnnl::primitive_desc, and dnnl_exec_arg_t (a sketch of the cached state follows at the end of this section):

  • dnnl::memory - input/weight/bias/output/scratchpad memory and two temporary memory areas for input and weight reorder if needed.

  • dnnl::primitive - convolution primitive and input/weight reorder primitive if needed.

  • dnnl::primitive_desc - convolution primitive descriptor.

  • dnnl_exec_arg_t - convolution primitive arguments and input/weight reorder primitive arguments if needed.

Temporary device memory includes the scratchpad memory and, if needed, the input/weight reorder output device memory.
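As a concrete illustration, the cached state for one convolution node might look like the sketch below. The struct and field names are hypothetical; the types are standard oneDNN, and dnnl_exec_arg_t is represented here by the C++ argument maps that execute() accepts.

```cpp
#include <unordered_map>

#include "dnnl.hpp"

// Hypothetical layout of the cached state for a single convolution node;
// field names are illustrative, not the actual ITEX data structure.
struct ConvolutionCacheEntry {
  // dnnl::primitive_desc: convolution primitive descriptor.
  dnnl::convolution_forward::primitive_desc conv_pd;

  // dnnl::primitive: the convolution plus optional reorder primitives.
  dnnl::convolution_forward conv_prim;
  dnnl::reorder input_reorder;    // created only if the input needs reordering
  dnnl::reorder weight_reorder;   // created only if the weights need reordering

  // dnnl::memory: I/O memory plus temporary device memory.
  dnnl::memory input_mem, weight_mem, bias_mem, output_mem;
  dnnl::memory scratchpad_mem;         // temporary device memory
  dnnl::memory input_reordered_mem;    // temporary reorder output, if needed
  dnnl::memory weight_reordered_mem;   // temporary reorder output, if needed

  // dnnl_exec_arg_t: argument maps for the convolution and the reorders.
  std::unordered_map<int, dnnl::memory> conv_args;
  std::unordered_map<int, dnnl::memory> input_reorder_args;
  std::unordered_map<int, dnnl::memory> weight_reorder_args;
};
```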