pith. sign in

arxiv: 2506.12718 · v3 · submitted 2025-06-15 · 🧮 math.NA · cs.DS· cs.MS· cs.NA

Permutation-Avoiding FFT-Based Convolution

Pith reviewed 2026-05-19 10:05 UTC · model grok-4.3

classification 🧮 math.NA cs.DScs.MScs.NA
keywords FFT convolutionCooley-Tukeyindex-reversal permutationdiscrete convolutionmulti-dimensional FFTpermutation avoidancegeneral radix
0
0 comments X

The pith

The three sets of index-reversal permutations in Cooley-Tukey FFT convolution cancel exactly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when discrete convolution is computed through FFTs in the Cooley-Tukey framework, the index-reversal permutations required for the two forward transforms and the one inverse transform cancel one another completely. This cancellation removes every permutation step from the procedure. A reader would care because the permutations create irregular memory access that lowers arithmetic intensity, even when the total floating-point work stays the same. The study extends the observation to multi-dimensional arrays and arbitrary radices, then benchmarks the resulting kernels against conventional FFT-based convolution.

Core claim

In discrete convolution performed via FFTs that follow the Cooley-Tukey decomposition, the index-reversal permutations arising from the forward transforms of the two input arrays and the inverse transform of their product cancel exactly. The cancellation holds in the multi-dimensional, general-radix setting and leaves both the mathematical correctness of the convolution and the expected floating-point operation count unchanged. As a direct result, the convolution can be carried out without executing any index-reversal steps.

What carries the argument

Exact cancellation of the three index-reversal permutation sets that appear in the forward and inverse stages of an FFT-based convolution pipeline.

If this is right

  • Permutation-free FFT convolution kernels become feasible for general-radix, multi-dimensional cases.
  • Memory access patterns improve while the arithmetic operation count remains identical to standard FFT convolution.
  • Arithmetic intensity rises because the irregular accesses induced by permutations are eliminated.
  • Developers of FFT libraries should consider adding explicit support for these kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cancellation logic may apply to other composite FFT operations that chain multiple forward and inverse transforms.
  • Hardware-specific implementations could further reduce cache misses once the permutation steps are removed.
  • Performance comparisons on modern accelerators would test whether the theoretical arithmetic-intensity gain translates to wall-clock speedup.

Load-bearing premise

The index-reversal permutations from the two forward FFTs and the inverse FFT cancel exactly and without remainder in the multi-dimensional general-radix setting.

What would settle it

A concrete multi-dimensional input size and radix choice where the permutation-free procedure produces a result different from a verified standard convolution or where the floating-point operation count increases.

Figures

Figures reproduced from arXiv: 2506.12718 by Hartwig Anzt, Nicolas Venkovic.

Figure 1
Figure 1. Figure 1: Average per-element permutation cost for computing systems #1 (solid lines) and #2 (dashed lines) [PITH_FULL_IMAGE:figures/full_fig_p021_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average per-element butterfly cost for computing systems #1 (solid lines) and #2 (dashed lines) [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗
read the original abstract

Fast Fourier Transform (FFT) libraries are widely used for evaluating discrete convolutions. Most FFT implementations follow some variant of the Cooley-Tukey framework, in which the transform is decomposed into butterfly operations and index-reversal permutations. While butterfly operations dominate the floating-point operation count, the memory access patterns induced by index-reversal permutations significantly degrade the FFT's arithmetic intensity. When performing discrete convolution, the three sets of index-reversal permutations which occur in FFT-based implementations using Cooley-Tukey frameworks cancel out, thus paving the way to implementations free of any permutation. To the best of our knowledge, such permutation-free variants of FFT-based discrete convolution are not commonly used in practice, making such kernels worth investigating. Here, we look into such permutation-avoiding convolution procedures for multi-dimensional cases within a general radix Cooley-Tukey framework. We perform numerical experiments to benchmark the algorithms presented against state-of-the-art FFT-based convolution implementations. Our results suggest that developers of FFT libraries should consider supporting permutation-avoiding convolution kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that the three index-reversal (digit-reversal) permutations arising in Cooley-Tukey FFT-based discrete convolution—one from each forward transform and one from the inverse—cancel exactly, enabling permutation-free convolution kernels. It extends the observation to the multi-dimensional general-radix setting and supplies numerical benchmarks comparing the resulting algorithms against standard FFT-based convolution implementations.

Significance. If the cancellation identity holds without residual reordering or extra correction steps, the result would improve arithmetic intensity of convolution by removing memory-bound permutation stages whose cost is not negligible in practice. The algebraic character of the argument (no fitted parameters) and the provision of concrete numerical benchmarks against state-of-the-art libraries are positive features that would make the work useful to FFT library developers.

major comments (1)
  1. [multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.
minor comments (1)
  1. [Abstract] The abstract states that numerical experiments were performed but supplies neither error-bar statistics nor an explicit list of tested edge cases (e.g., odd lengths, mixed radices). Adding a short sentence or table reference would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the single major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.

    Authors: We appreciate the referee raising this point. In the multi-dimensional setting the reversal operator R is the tensor product of the per-dimension digit-reversal operators and is therefore identical for both forward transforms. Because the frequency-domain product is pointwise, the identity (R u) .* (R v) = R (u .* v) holds for any permutation matrix R; consequently the product remains in the domain expected by the inverse transform and the three reversals continue to cancel exactly. No residual global reordering is introduced. To make this explicit we will add, in the revised manuscript, a self-contained derivation for a 2-D convolution using radix 3 along the first axis and radix 4 along the second axis, together with a direct operation-count verification on that example. revision: yes

Circularity Check

0 steps flagged

No circularity detected: permutation cancellation is a direct algebraic consequence of FFT composition

full rationale

The paper derives the cancellation of the three index-reversal permutations (two forward FFTs and one inverse) as a property of the Cooley-Tukey decomposition applied to the convolution theorem. This holds by explicit composition of the permutation operators in the general-radix multi-dimensional setting and does not rely on any fitted parameters, self-citation chains, or re-labeling of inputs as outputs. The central claim is an identity that follows from the standard definitions of the transforms; numerical benchmarks then compare performance rather than validate the identity itself. No load-bearing step reduces to a tautology or to prior self-cited results that would need to be assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the standard decomposition of Cooley-Tukey FFTs into butterflies and index-reversal permutations; no free parameters, ad-hoc axioms, or new entities are introduced.

axioms (1)
  • standard math Cooley-Tukey FFT decomposes into butterfly operations and index-reversal permutations.
    Invoked in the opening paragraph of the abstract as the basis for the permutation count.

pith-pipeline@v0.9.0 · 5712 in / 1238 out tokens · 43981 ms · 2026-05-19T10:05:14.407824+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    The journal of Supercomputing 4 (1990), 23–35

    FFTs in external or hierarchical memory. The journal of Supercomputing 4 (1990), 23–35. Leo Bluestein

  2. [2]

    IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455

    A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455. Grace Chan and Andrew TA Wood

  3. [3]

    Applied Statistics (1997), 171–181

    Algorithm AS 312: An Algorithm for simulating stationary Gaussian random fields. Applied Statistics (1997), 171–181. James W Cooley and John W Tukey

  4. [4]

    Mathematics of computation 19, 90 (1965), 297–301

    An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297–301. Colin R Dietrich and Garry N Newsam

  5. [5]

    Water Resources Research 29, 8 (1993), 2861–2869

    A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research 29, 8 (1993), 2861–2869. Claude R Dietrich and Garry N Newsam

  6. [6]

    SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107

    Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107. Pierre Duhamel and Henk Hollmann

  7. [7]

    Matteo Frigo and Steven G Johnson

    ‘Split radix’ FFT algorithm.Electronics letters 20, 1 (1984), 14–16. Matteo Frigo and Steven G Johnson

  8. [8]

    In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat

    FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181) , Vol

  9. [9]

    The design and implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. W Morven Gentleman and Gordon Sande

  10. [10]

    In Proceedings of the November 7-10, 1966, fall joint computer conference

    Fast fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference . 563–578. Rafael C Gonzalez

  11. [11]

    Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372

    The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372. Intel Corporation

  12. [12]

    Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94

    A numerical method for computing the overall response of nonlinear composites with complex microstructure. Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94. Alan V Oppenheim

  13. [13]

    Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56, 6 (1968), 1107–1108. Paul N Swarztrauber

  14. [14]

    SIAM Rev

    The methods of cyclic reduction, Fourier analysis and the FACR algorithm for the discrete solution of Poisson’s equation on a rectangle. SIAM Rev. 19, 3 (1977), 490–501. Llewellyn H Thomas

  15. [15]

    Applications of digital computers (1963), 44–45

    Using a computer to solve problems in physics. Applications of digital computers (1963), 44–45. Charles Van Loan

  16. [16]

    Journal of computational and graphical statistics 3, 4 (1994), 409–432

    Simulation of stationary Gaussian processes in [0, 1]𝑑 . Journal of computational and graphical statistics 3, 4 (1994), 409–432. R Yavne

  17. [17]

    AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125

    An economical method for calculating the discrete Fourier transform. AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125. Jan Zeman, Jaroslav Vondřejc, Jan Novák, and Ivo Marek

  18. [18]

    Accelerating a FFT-based solver for numerical homogenization of periodic media by conjugate gradients. J. Comput. Phys. 229, 21 (2010), 8065–8071. A FFT and convolution sub-kernel algorithms In the algorithms presented in Section 5, for any given 𝑥 ∈ C𝑛, we need to evaluate 𝐴𝑟,𝑛𝑥, 𝐴𝑟,𝑛𝑥 and 𝐴𝑇 𝑟,𝑛𝑥. Irrespective of the radix, these kernels are implemented...

  19. [19]

    , 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1,

    38 Nicolas Venkovic and Hartwig Anzt Algorithm 18 Transposed butterfly kernel of radix-8 Input: 𝑥 ∈ C𝑛, 𝑛 = 8𝑡 Output: 𝑥 := 𝐴𝑇 8,𝑛𝑥 1: for 𝑞 = 𝑡, . . . , 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1, . . . , 𝑛𝑏 do 6: for 𝑗 = 1 . . . , ℓ do 7: 𝑧1 := 𝑥 (𝑏 −1)𝑘+𝑗 8: 𝑧2 := 𝑥 (𝑏 −1)𝑘+ℓ+𝑗 9: 𝑧3 := 𝑥 (𝑏 −1)𝑘+2ℓ+𝑗 10: 𝑧4 := 𝑥 (𝑏 −1)𝑘+3ℓ+𝑗 11: 𝑧5 := 𝑥 (𝑏 ...