Creating and using plans#

Complex-valued tensors are the default input and output of a Fast Fourier Transform, that is, the $N_{1} \times \dots \times N_{D}$ complex-valued tensor $x$ is mapped to the same-sized complex output tensor $X$ by the following formula:

X (k_{1}, \dots, k_{D}) = \sum_{j_{1} = 0}^{N_{1} - 1} \dots \sum_{j_{D} = 0}^{N_{D} - 1} x (j_{1}, \dots, j_{D}) \cdot \exp (σ \cdot 2 π i \sum_{l = 1}^{D} k_{l} j_{l} / N_{l})

The sign of the transform $σ \in {- 1, 1}$ is also called the direction. We adopt the convention that the forward transform as negative sign and that the backward transform has positive sign.

In practice, the input tensor $x$ is often real-valued, and in that case the following symmetry holds for the output tensor $X$ :

X (k_{1}, \dots, k_{D}) = X^{*} (N_{1} - k_{1}, \dots, k_{D}),

where the star denotes the complex conjugate.

Note

Index $k_{1}$ is taken modulo $N_{1}$ , that is, $X^{*} (N_{1}, \dots, k_{D}) = X^{*} (0, \dots, k_{D})$ . Thus, $X (0, \dots, k_{D}) = X^{*} (0, \dots, k_{D})$ must be real. Moreover, similar symmetries holds in modes $2, \dots, D$ , too, but those are not necessary for the discussion.

Due to the symmetry, only the first $⌊ N_{1} / 2 ⌋ + 1$ entries need to be stored in the output tensor. So for the real-valued $N_{1} \times \dots \times N_{D}$ input tensor one typically stores the $⌊ N_{1} / 2 ⌋ + 1 \times \dots \times N_{D}$ complex-valued output tensor.

The above considerations lead to specialized FFT routines for complex-valued FFTs, FFTs with real-valued input data, or FFTs with conjugate even symmetry. These are called complex-to-complex (c2c), real-to-complex (r2c), and complex-to-real (c2r).

Bbfft uses just-in-time (JIT) compilation to obtain fast FFTs for every problem size, transform type, tensor strides, and transform direction. JIT compilation is rather expensive, thus every configuration is captured in a plan object that is reused for multiple FFT invocations of the same type.

Attention

Code samples in the following assume you have using namespace bbfft; in your code.

One dimension#

Let $x$ be the input tensor of shape $M \times N \times K$ , where the FFT is taken over the second mode (N-mode). Plan generation is unified in the Double-Batched FFT Library via the configuration struct. Initializer-list syntax may be used to conveniently create configurations:

configuration cfg = {1,              // One dimensional
                     {M, N, K},      // Tensor shape
                     precision,      // Single (f32) or double (f64)
                     direction,      // Forward (-1) or backward (+1)
                     transform_type, // c2c, r2c, c2r
                     input_strides,  // Strides of input tensor
                     output_strides  // Strides of output tensor
                    };

Input and output strides are explained in more detail in Understanding data layouts. One may omit input_strides and output_strides. Then, the default layout for an in-place transform is set. In order to choose the default layout for an out-of-place transform, use

bool inplace = false;
cfg.set_strides_default(inplace);

SYCL#

Using SYCL, the plan is created using the make_plan() factory function as following:

#include "bbfft/sycl/make_plan.hpp"

auto Q = sycl::queue{sycl::gpu_selector_v};
auto plan = make_plan(cfg, Q);

Plans are executed using

auto event = plan.execute(input, output);
event.wait();

in out-of-place mode or using

auto event = plan.execute(inout);
event.wait();

in in-place mode. Input, output, or inout are pointers to either device memory, shared memory, or host memory, using one of the sycl::malloc_<device,shared,host> functions. Note that in-place mode is also used if input = output.

The data type of input and output depends on the transform type:

// c2c
std::complex<real_t>* input, output;
// r2c
real_t* input;
std::complex<real_t>* output;
// c2r
std::complex<real_t>* input;
real_t* output;

where real_t=float for f32 precision and real_t=double for f64 precision. In SYCL-mode the library uses either the OpenCL or Level Zero back-end. The back-end can be selected at run-time by setting the SYCL_DEVICE_FILTER appropriately.

OpenCL#

For OpenCL, use

#include "bbfft/cl/make_plan.hpp"

cl_command_queue Q = clCreateCommandQueueWithProperties(...);
auto plan = make_plan(cfg, Q);

Execute functions return a cl_event that shall be used for synchronization. Input and output buffers must be created with the cl<Device,Shared,Host>MemAllocINTEL functions from the cl_intel_unified_shared_memory extension.

Level Zero#

#include "bbfft/ze/make_plan.hpp"

ze_context_handle_t;
ze_device_handle_t device;
ze_command_list_handle_t cmd_list;
zeContextCreate(..., &context);
zeDeviceGet(.., &device);
zeCommandListCreateImmediate(..., &cmd_list);
auto plan = make_plan(cfg, cmd_list, context, device);

Execute functions return a ze_event_handle_t that shall be used for synchronization. Input and output buffers must be created with the zeMemAlloc<Device,Shared,Host> functions.

Two or three dimensions#

Let $x$ be the input tensor of shape $M \times N_{1} \times \dots \times N_{D} \times K$ , where the FFT is taken over the N-modes. Everything works the same way as in the one-dimensional case, except that dimension, shape, and strides need to be adjusted. For example,

cfg.dim = 3;
cfg.shape = {M, N1, N2, N3, K};

Warning

Two- and three-dimensional FFTs are currently under development. Do not expect that special data layouts beside the default data layout give the correct results.