Creating and using plans

Creating and using plans#

Complex-valued tensors are the default input and output of a Fast Fourier Transform, that is, the N1××ND complex-valued tensor x is mapped to the same-sized complex output tensor X by the following formula:

X(k1,,kD)=j1=0N11jD=0ND1x(j1,,jD)exp(σ2πil=1Dkljl/Nl)

The sign of the transform σ{1,1} is also called the direction. We adopt the convention that the forward transform as negative sign and that the backward transform has positive sign.

In practice, the input tensor x is often real-valued, and in that case the following symmetry holds for the output tensor X:

X(k1,,kD)=X(N1k1,,kD),

where the star denotes the complex conjugate.

Note

Index k1 is taken modulo N1, that is, X(N1,,kD)=X(0,,kD). Thus, X(0,,kD)=X(0,,kD) must be real. Moreover, similar symmetries holds in modes 2,,D, too, but those are not necessary for the discussion.

Due to the symmetry, only the first N1/2+1 entries need to be stored in the output tensor. So for the real-valued N1××ND input tensor one typically stores the N1/2+1××ND complex-valued output tensor.

The above considerations lead to specialized FFT routines for complex-valued FFTs, FFTs with real-valued input data, or FFTs with conjugate even symmetry. These are called complex-to-complex (c2c), real-to-complex (r2c), and complex-to-real (c2r).

Bbfft uses just-in-time (JIT) compilation to obtain fast FFTs for every problem size, transform type, tensor strides, and transform direction. JIT compilation is rather expensive, thus every configuration is captured in a plan object that is reused for multiple FFT invocations of the same type.

Attention

Code samples in the following assume you have using namespace bbfft; in your code.

One dimension#

Let x be the input tensor of shape M×N×K, where the FFT is taken over the second mode (N-mode). Plan generation is unified in the Double-Batched FFT Library via the configuration struct. Initializer-list syntax may be used to conveniently create configurations:

configuration cfg = {1,              // One dimensional
                     {M, N, K},      // Tensor shape
                     precision,      // Single (f32) or double (f64)
                     direction,      // Forward (-1) or backward (+1)
                     transform_type, // c2c, r2c, c2r
                     input_strides,  // Strides of input tensor
                     output_strides  // Strides of output tensor
                    };

Input and output strides are explained in more detail in Understanding data layouts. One may omit input_strides and output_strides. Then, the default layout for an in-place transform is set. In order to choose the default layout for an out-of-place transform, use

bool inplace = false;
cfg.set_strides_default(inplace);

SYCL#

Using SYCL, the plan is created using the make_plan() factory function as following:

#include "bbfft/sycl/make_plan.hpp"

auto Q = sycl::queue{sycl::gpu_selector_v};
auto plan = make_plan(cfg, Q);

Plans are executed using

auto event = plan.execute(input, output);
event.wait();

in out-of-place mode or using

auto event = plan.execute(inout);
event.wait();

in in-place mode. Input, output, or inout are pointers to either device memory, shared memory, or host memory, using one of the sycl::malloc_<device,shared,host> functions. Note that in-place mode is also used if input = output.

The data type of input and output depends on the transform type:

// c2c
std::complex<real_t>* input, output;
// r2c
real_t* input;
std::complex<real_t>* output;
// c2r
std::complex<real_t>* input;
real_t* output;

where real_t=float for f32 precision and real_t=double for f64 precision. In SYCL-mode the library uses either the OpenCL or Level Zero back-end. The back-end can be selected at run-time by setting the SYCL_DEVICE_FILTER appropriately.

OpenCL#

For OpenCL, use

#include "bbfft/cl/make_plan.hpp"

cl_command_queue Q = clCreateCommandQueueWithProperties(...);
auto plan = make_plan(cfg, Q);

Execute functions return a cl_event that shall be used for synchronization. Input and output buffers must be created with the cl<Device,Shared,Host>MemAllocINTEL functions from the cl_intel_unified_shared_memory extension.

Level Zero#

#include "bbfft/ze/make_plan.hpp"

ze_context_handle_t;
ze_device_handle_t device;
ze_command_list_handle_t cmd_list;
zeContextCreate(..., &context);
zeDeviceGet(.., &device);
zeCommandListCreateImmediate(..., &cmd_list);
auto plan = make_plan(cfg, cmd_list, context, device);

Execute functions return a ze_event_handle_t that shall be used for synchronization. Input and output buffers must be created with the zeMemAlloc<Device,Shared,Host> functions.

Two or three dimensions#

Let x be the input tensor of shape M×N1××ND×K, where the FFT is taken over the N-modes. Everything works the same way as in the one-dimensional case, except that dimension, shape, and strides need to be adjusted. For example,

cfg.dim = 3;
cfg.shape = {M, N1, N2, N3, K};

Warning

Two- and three-dimensional FFTs are currently under development. Do not expect that special data layouts beside the default data layout give the correct results.