Understanding data layouts#

The most important message comes first: Every data layout in the Double-Batched FFT Library is defined in column major order.

Strides#

Suppose we are given an $M \times N_{1} \times \dots \times N_{D} \times K$ tensor. The strides for the packed tensor layout (in column major order) are

packed (M, N_{1}, \dots, N_{D}, K) = (1, M, M \cdot N_{1}, \dots, M \cdot N_{1} \cdot \dots \cdot N_{D})

Let $s = packed (M, N_{1}, \dots, N_{D}, K)$ . Offsets of the entry $(m, n_{1}, \dots, n_{D}, k)$ are computed with

linear_index = m s_{0} + \sum_{j = 1}^{D} n_{j} s_{j} + k s_{D + 1}

such that x[linear_index] gives the correct entry, where x is the base address of the tensor’s data. Here, x might be either real or complex; that is, linear_index is counted in elements of the underlying data type, not in bytes.
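The two formulas above can be sketched in a few lines of C++. This is only an illustration: `packed_strides` and `linear_index` are hypothetical helper names introduced here, not functions of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): packed column-major strides.
// For shape (d_0, d_1, ..., d_{n-1}) the result is (1, d_0, d_0 * d_1, ...).
std::vector<std::size_t> packed_strides(std::vector<std::size_t> const &shape) {
    std::vector<std::size_t> strides(shape.size());
    std::size_t s = 1;
    for (std::size_t i = 0; i < shape.size(); ++i) {
        strides[i] = s; // stride of mode i is the product of all earlier extents
        s *= shape[i];
    }
    return strides;
}

// Hypothetical helper: linear_index = sum_i index[i] * strides[i],
// counted in elements of the underlying data type.
std::size_t linear_index(std::vector<std::size_t> const &index,
                         std::vector<std::size_t> const &strides) {
    assert(index.size() == strides.size());
    std::size_t offset = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        offset += index[i] * strides[i];
    }
    return offset;
}
```

For example, with $M = 2$, $N_{1} = 8$, $K = 3$ the packed strides are $(1, 2, 16)$, and the entry $(1, 3, 2)$ lies at offset $1 \cdot 1 + 3 \cdot 2 + 2 \cdot 16 = 39$.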

Default strides may be overridden in the configuration.

Warning

$s_{0} \neq 1$ is unsupported.

Default c2c#

Default c2c strides are

input_strides = packed (M, N_{1}, \dots, N_{D}, K)

output_strides = packed (M, N_{1}, \dots, N_{D}, K)

There is no distinction between in-place and out-of-place transforms.

Default r2c#

For r2c, the input tensor is real and the output tensor is complex. Because the transform of real data is conjugate-symmetric, we only need to store about half of the modes in the first FFT dimension. We therefore define

N_{1}^{'} = ⌊ N_{1} / 2 ⌋ + 1

We need to distinguish between the default out-of-place layout and the default in-place layout.

Out-of-place#

For the out-of-place transform the input tensor uses the default packed format

input_strides = packed (M, N_{1}, \dots, N_{D}, K)

For the output tensor we use the default packed format but truncate the first FFT mode:

output_strides = packed (M, N_{1}^{'}, \dots, N_{D}, K)
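As a sketch of the out-of-place defaults for a single FFT dimension ($D = 1$), the following C++ computes both stride triples. The names `packed3`, `r2c_strides`, and `default_r2c_out_of_place` are hypothetical and only serve this illustration; they are not library functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: packed column-major strides of an M x N x K tensor.
// The last extent K does not enter the strides, only the total size.
std::array<std::size_t, 3> packed3(std::size_t M, std::size_t N, std::size_t /*K*/) {
    return {1, M, M * N};
}

struct r2c_strides {
    std::array<std::size_t, 3> input;  // counted in real values
    std::array<std::size_t, 3> output; // counted in complex values
};

// Default out-of-place r2c strides for one FFT dimension (D = 1).
r2c_strides default_r2c_out_of_place(std::size_t M, std::size_t N, std::size_t K) {
    assert(N > 0);
    std::size_t N_out = N / 2 + 1; // N' = floor(N / 2) + 1
    return {packed3(M, N, K), packed3(M, N_out, K)};
}
```

With $M = 4$, $N_{1} = 8$, $K = 3$ this gives input strides $(1, 4, 32)$ in reals and output strides $(1, 4, 20)$ in complex values.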

In-place#

The input tensor is overwritten during the FFT, hence it needs enough space to store the output tensor. Therefore, the first FFT mode needs to be padded. Let

N_{1}^{″} = 2 N_{1}^{'}

The strides are

input_strides = packed (M, N_{1}^{″}, \dots, N_{D}, K)

output_strides = packed (M, N_{1}^{'}, \dots, N_{D}, K)

Example: Let $N_{1} = 8$ . Then $N_{1}^{'} = 5$, so we store 5 complex values in the first FFT mode. Since 1 complex value occupies the space of 2 real values, the input needs room for 10 reals; we therefore pad the input tensor with 2 extra reals and have $N_{1}^{″} = 10$ .
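The padding arithmetic can be checked with a small C++ sketch. The struct and function names below are hypothetical and not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for the sizes involved in an in-place r2c transform.
struct r2c_inplace_sizes {
    std::size_t n_complex;     // N'  = floor(N / 2) + 1 complex outputs
    std::size_t n_real_padded; // N'' = 2 * N' reals reserved in the first FFT mode
    std::size_t n_pad;         // N'' - N reals of padding
};

r2c_inplace_sizes inplace_sizes(std::size_t N) {
    assert(N > 0);
    std::size_t n_complex = N / 2 + 1;         // integer division rounds down
    std::size_t n_real_padded = 2 * n_complex; // one complex = two reals
    return {n_complex, n_real_padded, n_real_padded - N};
}
```

For $N_{1} = 8$ this yields 5, 10, and 2. In general the padding is 2 reals for even $N_{1}$ and 1 real for odd $N_{1}$.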

Default c2r#

c2r is the converse of r2c, so we simply swap input and output strides.
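Written out, and assuming that "swap" means exchanging the input and output roles of the r2c defaults, the default c2r strides would read as follows. The input is now the complex tensor and the output the real tensor.

```latex
\begin{aligned}
\text{out-of-place:}\quad
  & \text{input\_strides} = \text{packed}(M, N_{1}^{'}, \dots, N_{D}, K), \\
  & \text{output\_strides} = \text{packed}(M, N_{1}, \dots, N_{D}, K) \\
\text{in-place:}\quad
  & \text{input\_strides} = \text{packed}(M, N_{1}^{'}, \dots, N_{D}, K), \\
  & \text{output\_strides} = \text{packed}(M, N_{1}^{''}, \dots, N_{D}, K)
\end{aligned}
```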

Tensor indexer#

The tensor_indexer is a helper class for working with input tensors. For example, in one dimension for an r2c in-place transform we can use the following code:

std::size_t N_out = N / 2 + 1; // N' = floor(N / 2) + 1 complex outputs
// In-place: the first FFT mode is padded to 2 * N_out reals, so the batch stride is M * 2 * N_out
auto xi = tensor_indexer<std::size_t, 3, layout::col_major>({M, N, K}, {1, M, M * 2 * N_out});
auto x = malloc_device<T>(xi.size(), Q);
for (std::size_t k = 0; k < K; ++k) {
   for (std::size_t n = 0; n < N; ++n) {
      for (std::size_t m = 0; m < M; ++m) {
         x[xi(m, n, k)] = ...; // Load data for entry (m, n, k)
      }
   }
}

Tip

The tensor_indexer and configuration strides are compatible. For example, given the configuration cfg, one can initialize xi with

auto xi = tensor_indexer<std::size_t, 3, layout::col_major>(
              fit_array<3>(cfg.shape), fit_array<3>(cfg.istride));
