Creating and using plans#
Complex-valued tensors are the default input and output of a Fast Fourier Transform, that is, the \(N_1\times\dots\times N_D\) complex-valued tensor \(x\) is mapped to the same-sized complex output tensor \(X\) by the following formula:
The sign of the transform \(\sigma \in \{-1,1\}\) is also called the direction. We adopt the convention that the forward transform as negative sign and that the backward transform has positive sign.
In practice, the input tensor \(x\) is often real-valued, and in that case the following symmetry holds for the output tensor \(X\):
where the star denotes the complex conjugate.
Note
Index \(k_1\) is taken modulo \(N_1\), that is, \(X^*(N_1,\dots,k_D) = X^*(0,\dots,k_D)\). Thus, \(X(0,\dots,k_D)=X^*(0,\dots,k_D)\) must be real. Moreover, similar symmetries holds in modes \(2,\dots,D\), too, but those are not necessary for the discussion.
Due to the symmetry, only the first \(\lfloor N_1/2\rfloor+1\) entries need to be stored in the output tensor. So for the real-valued \(N_1\times\dots\times N_D\) input tensor one typically stores the \(\lfloor N_1/2\rfloor+1\times\dots\times N_D\) complex-valued output tensor.
The above considerations lead to specialized FFT routines for complex-valued FFTs, FFTs with real-valued input data, or FFTs with conjugate even symmetry. These are called complex-to-complex (c2c), real-to-complex (r2c), and complex-to-real (c2r).
Bbfft uses just-in-time (JIT) compilation to obtain fast FFTs for every problem size, transform type, tensor strides, and transform direction. JIT compilation is rather expensive, thus every configuration is captured in a plan object that is reused for multiple FFT invocations of the same type.
Attention
Code samples in the following assume you have using namespace bbfft;
in your code.
One dimension#
Let \(x\) be the input tensor of shape \(M \times N \times K\),
where the FFT is taken over the second mode (N-mode).
Plan generation is unified in the Double-Batched FFT Library via the configuration
struct.
Initializer-list syntax may be used to conveniently create configurations:
configuration cfg = {1, // One dimensional
{M, N, K}, // Tensor shape
precision, // Single (f32) or double (f64)
direction, // Forward (-1) or backward (+1)
transform_type, // c2c, r2c, c2r
input_strides, // Strides of input tensor
output_strides // Strides of output tensor
};
Input and output strides are explained in more detail in Understanding data layouts. One may omit input_strides and output_strides. Then, the default layout for an in-place transform is set. In order to choose the default layout for an out-of-place transform, use
bool inplace = false;
cfg.set_strides_default(inplace);
SYCL#
Using SYCL, the plan is created using the make_plan()
factory function as following:
#include "bbfft/sycl/make_plan.hpp"
auto Q = sycl::queue{sycl::gpu_selector_v};
auto plan = make_plan(cfg, Q);
Plans are executed using
auto event = plan.execute(input, output);
event.wait();
in out-of-place mode or using
auto event = plan.execute(inout);
event.wait();
in in-place mode. Input, output, or inout are pointers to either device memory, shared memory, or host
memory, using one of the sycl::malloc_<device,shared,host>
functions.
Note that in-place mode is also used if input = output.
The data type of input and output depends on the transform type:
// c2c
std::complex<real_t>* input, output;
// r2c
real_t* input;
std::complex<real_t>* output;
// c2r
std::complex<real_t>* input;
real_t* output;
where real_t=float for f32 precision and real_t=double for f64 precision. In SYCL-mode the library uses either the OpenCL or Level Zero back-end. The back-end can be selected at run-time by setting the SYCL_DEVICE_FILTER appropriately.
OpenCL#
For OpenCL, use
#include "bbfft/cl/make_plan.hpp"
cl_command_queue Q = clCreateCommandQueueWithProperties(...);
auto plan = make_plan(cfg, Q);
Execute functions return a cl_event
that shall be used for synchronization.
Input and output buffers must be created with the cl<Device,Shared,Host>MemAllocINTEL
functions
from the
cl_intel_unified_shared_memory extension.
Level Zero#
#include "bbfft/ze/make_plan.hpp"
ze_context_handle_t;
ze_device_handle_t device;
ze_command_list_handle_t cmd_list;
zeContextCreate(..., &context);
zeDeviceGet(.., &device);
zeCommandListCreateImmediate(..., &cmd_list);
auto plan = make_plan(cfg, cmd_list, context, device);
Execute functions return a ze_event_handle_t
that shall be used for synchronization.
Input and output buffers must be created with the zeMemAlloc<Device,Shared,Host>
functions.
Two or three dimensions#
Let \(x\) be the input tensor of shape \(M \times N_1 \times \dots \times N_D \times K\), where the FFT is taken over the N-modes. Everything works the same way as in the one-dimensional case, except that dimension, shape, and strides need to be adjusted. For example,
cfg.dim = 3;
cfg.shape = {M, N1, N2, N3, K};
Warning
Two- and three-dimensional FFTs are currently under development. Do not expect that special data layouts beside the default data layout give the correct results.