Understanding data layouts#

The most important message comes first: Every data layout in the Double-Batched FFT Library is defined in column major order.

Strides#

Suppose we are given a M×N1××ND×K tensor. The strides for the packed tensor layout (in column major order) is

packed(M,N1,,ND,K)=(1,M,MN1,,MN1ND)

Let s=packed(M,N1,,ND,K). Offsets of the entry (m,n1,,nD,k) are computed with

linear_index=ms0+j=1Dnjsj+ksD+1

such that x[linear_index] gives the correct entry, where x is the base address of the tensor’s data. Here, x might be either real or complex, that is, the linear_index is taken w.r.t. to the underlying data type.

Default strides may be overriden in the configuration.

Warning

s01 is unsupported.

Default c2c#

Default c2c strides are

input_strides=packed(M,N1,,ND,K)output_strides=packed(M,N1,,ND,K)

There is no distinction between in-place and out-of-place transforms.

Default r2c#

For r2c, the input tensor is real and the output tensor is complex. We only need to store half of the modes due to symmetry, therefore define

N1=N1/2+1

We need to distinguish between the default out-of-place layout and the default in-place layout.

Out-of-place#

For the out-of-place transform the input tensor uses the default packed format

input_strides=packed(M,N1,,ND,K)

For the output tensor we use the default packed format but truncate the first FFT mode:

output_strides=packed(M,N1,,ND,K)

In-place#

The input tensor is overwritten during the FFT, hence it needs enough space to store the output tensor. Therefore, the first FFT mode needs to be padded. Let

N1=2N1

The strides are

input_strides=packed(M,N1,,ND,K)output_strides=packed(M,N1,,ND,K)

Example: Let N1=8. Then N1=5 and we store 5 complex values in the first FFT mode. As 1 complex value requires the space of 2 real values we pad the input tensor with 2 extra reals and have N1=10.

Default c2r#

c2r is the converse of r2c, so we simply swap input and output strides.

Tensor indexer#

The tensor_indexer is a helpful class to work with input tensors. E.g. in one dimension for a r2c in-place transform we can use the following code:

std::size_t N_out = N / 2 + 1;
auto xi = tensor_indexer<std::size_t, 3, layout::col_major>({M, N, K}, {1, M, M * N_out});
auto x = malloc_device<T>(xi.size(), Q);
for (std::size_t k = 0; k < K; ++k) {
   for (std::size_t n = 0; n < N; ++n) {
      for (std::size_t m = 0; m < M; ++m) {
         x[xi(m, n, k)] = ...; // Load data for entry (m, n, k)
      }
   }
}

Tip

The tensor_indexer and configuration strides are compatible. For example, given the configuration cfg, one can initialize xi with

auto xi = tensor_indexer<std::size_t, 3, layout::col_major>(
              fit_array<3>(cfg.shape), fit_array<3>(cfg.istride));