Permutation-Avoiding FFT-Based Convolution
Pith reviewed 2026-05-19 10:05 UTC · model grok-4.3
The pith
The three sets of index-reversal permutations in Cooley-Tukey FFT convolution cancel exactly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In discrete convolution performed via FFTs that follow the Cooley-Tukey decomposition, the index-reversal permutations arising from the forward transforms of the two input arrays and the inverse transform of their product cancel exactly. The cancellation holds in the multi-dimensional, general-radix setting and leaves both the mathematical correctness of the convolution and the expected floating-point operation count unchanged. As a direct result, the convolution can be carried out without executing any index-reversal steps.
What carries the argument
Exact cancellation of the three index-reversal permutation sets that appear in the forward and inverse stages of an FFT-based convolution pipeline.
If this is right
- Permutation-free FFT convolution kernels become feasible for general-radix, multi-dimensional cases.
- Memory access patterns improve while the arithmetic operation count remains identical to standard FFT convolution.
- Arithmetic intensity rises because the irregular accesses induced by permutations are eliminated.
- Developers of FFT libraries should consider adding explicit support for these kernels.
Where Pith is reading between the lines
- The same cancellation logic may apply to other composite FFT operations that chain multiple forward and inverse transforms.
- Hardware-specific implementations could further reduce cache misses once the permutation steps are removed.
- Performance comparisons on modern accelerators would test whether the theoretical arithmetic-intensity gain translates to wall-clock speedup.
Load-bearing premise
The index-reversal permutations from the two forward FFTs and the inverse FFT cancel exactly and without remainder in the multi-dimensional general-radix setting.
What would settle it
A concrete multi-dimensional input size and radix choice where the permutation-free procedure produces a result different from a verified standard convolution or where the floating-point operation count increases.
Figures
read the original abstract
Fast Fourier Transform (FFT) libraries are widely used for evaluating discrete convolutions. Most FFT implementations follow some variant of the Cooley-Tukey framework, in which the transform is decomposed into butterfly operations and index-reversal permutations. While butterfly operations dominate the floating-point operation count, the memory access patterns induced by index-reversal permutations significantly degrade the FFT's arithmetic intensity. When performing discrete convolution, the three sets of index-reversal permutations which occur in FFT-based implementations using Cooley-Tukey frameworks cancel out, thus paving the way to implementations free of any permutation. To the best of our knowledge, such permutation-free variants of FFT-based discrete convolution are not commonly used in practice, making such kernels worth investigating. Here, we look into such permutation-avoiding convolution procedures for multi-dimensional cases within a general radix Cooley-Tukey framework. We perform numerical experiments to benchmark the algorithms presented against state-of-the-art FFT-based convolution implementations. Our results suggest that developers of FFT libraries should consider supporting permutation-avoiding convolution kernels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the three index-reversal (digit-reversal) permutations arising in Cooley-Tukey FFT-based discrete convolution—one from each forward transform and one from the inverse—cancel exactly, enabling permutation-free convolution kernels. It extends the observation to the multi-dimensional general-radix setting and supplies numerical benchmarks comparing the resulting algorithms against standard FFT-based convolution implementations.
Significance. If the cancellation identity holds without residual reordering or extra correction steps, the result would improve arithmetic intensity of convolution by removing memory-bound permutation stages whose cost is not negligible in practice. The algebraic character of the argument (no fitted parameters) and the provision of concrete numerical benchmarks against state-of-the-art libraries are positive features that would make the work useful to FFT library developers.
major comments (1)
- [multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.
minor comments (1)
- [Abstract] The abstract states that numerical experiments were performed but supplies neither error-bar statistics nor an explicit list of tested edge cases (e.g., odd lengths, mixed radices). Adding a short sentence or table reference would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our manuscript. We address the single major comment below and will revise the paper accordingly.
read point-by-point responses
-
Referee: [multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.
Authors: We appreciate the referee raising this point. In the multi-dimensional setting the reversal operator R is the tensor product of the per-dimension digit-reversal operators and is therefore identical for both forward transforms. Because the frequency-domain product is pointwise, the identity (R u) .* (R v) = R (u .* v) holds for any permutation matrix R; consequently the product remains in the domain expected by the inverse transform and the three reversals continue to cancel exactly. No residual global reordering is introduced. To make this explicit we will add, in the revised manuscript, a self-contained derivation for a 2-D convolution using radix 3 along the first axis and radix 4 along the second axis, together with a direct operation-count verification on that example. revision: yes
Circularity Check
No circularity detected: permutation cancellation is a direct algebraic consequence of FFT composition
full rationale
The paper derives the cancellation of the three index-reversal permutations (two forward FFTs and one inverse) as a property of the Cooley-Tukey decomposition applied to the convolution theorem. This holds by explicit composition of the permutation operators in the general-radix multi-dimensional setting and does not rely on any fitted parameters, self-citation chains, or re-labeling of inputs as outputs. The central claim is an identity that follows from the standard definitions of the transforms; numerical benchmarks then compare performance rather than validate the identity itself. No load-bearing step reduces to a tautology or to prior self-cited results that would need to be assumed.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Cooley-Tukey FFT decomposes into butterfly operations and index-reversal permutations.
Reference graph
Works this paper leans on
-
[1]
The journal of Supercomputing 4 (1990), 23–35
FFTs in external or hierarchical memory. The journal of Supercomputing 4 (1990), 23–35. Leo Bluestein
work page 1990
-
[2]
IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455
A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455. Grace Chan and Andrew TA Wood
work page 1970
-
[3]
Applied Statistics (1997), 171–181
Algorithm AS 312: An Algorithm for simulating stationary Gaussian random fields. Applied Statistics (1997), 171–181. James W Cooley and John W Tukey
work page 1997
-
[4]
Mathematics of computation 19, 90 (1965), 297–301
An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297–301. Colin R Dietrich and Garry N Newsam
work page 1965
-
[5]
Water Resources Research 29, 8 (1993), 2861–2869
A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research 29, 8 (1993), 2861–2869. Claude R Dietrich and Garry N Newsam
work page 1993
-
[6]
SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107
Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107. Pierre Duhamel and Henk Hollmann
work page 1997
-
[7]
Matteo Frigo and Steven G Johnson
‘Split radix’ FFT algorithm.Electronics letters 20, 1 (1984), 14–16. Matteo Frigo and Steven G Johnson
work page 1984
-
[8]
FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181) , Vol
work page 1998
-
[9]
The design and implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. W Morven Gentleman and Gordon Sande
work page 2005
-
[10]
In Proceedings of the November 7-10, 1966, fall joint computer conference
Fast fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference . 563–578. Rafael C Gonzalez
work page 1966
-
[11]
Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372
The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372. Intel Corporation
work page 1958
-
[12]
Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94
A numerical method for computing the overall response of nonlinear composites with complex microstructure. Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94. Alan V Oppenheim
work page 1998
-
[13]
Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56, 6 (1968), 1107–1108. Paul N Swarztrauber
work page 1968
- [14]
-
[15]
Applications of digital computers (1963), 44–45
Using a computer to solve problems in physics. Applications of digital computers (1963), 44–45. Charles Van Loan
work page 1963
-
[16]
Journal of computational and graphical statistics 3, 4 (1994), 409–432
Simulation of stationary Gaussian processes in [0, 1]𝑑 . Journal of computational and graphical statistics 3, 4 (1994), 409–432. R Yavne
work page 1994
-
[17]
An economical method for calculating the discrete Fourier transform. AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125. Jan Zeman, Jaroslav Vondřejc, Jan Novák, and Ivo Marek
work page 1968
-
[18]
Accelerating a FFT-based solver for numerical homogenization of periodic media by conjugate gradients. J. Comput. Phys. 229, 21 (2010), 8065–8071. A FFT and convolution sub-kernel algorithms In the algorithms presented in Section 5, for any given 𝑥 ∈ C𝑛, we need to evaluate 𝐴𝑟,𝑛𝑥, 𝐴𝑟,𝑛𝑥 and 𝐴𝑇 𝑟,𝑛𝑥. Irrespective of the radix, these kernels are implemented...
work page 2010
-
[19]
, 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1,
38 Nicolas Venkovic and Hartwig Anzt Algorithm 18 Transposed butterfly kernel of radix-8 Input: 𝑥 ∈ C𝑛, 𝑛 = 8𝑡 Output: 𝑥 := 𝐴𝑇 8,𝑛𝑥 1: for 𝑞 = 𝑡, . . . , 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1, . . . , 𝑛𝑏 do 6: for 𝑗 = 1 . . . , ℓ do 7: 𝑧1 := 𝑥 (𝑏 −1)𝑘+𝑗 8: 𝑧2 := 𝑥 (𝑏 −1)𝑘+ℓ+𝑗 9: 𝑧3 := 𝑥 (𝑏 −1)𝑘+2ℓ+𝑗 10: 𝑧4 := 𝑥 (𝑏 −1)𝑘+3ℓ+𝑗 11: 𝑧5 := 𝑥 (𝑏 ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.