Permutation-Avoiding FFT-Based Convolution

Hartwig Anzt; Nicolas Venkovic

arxiv: 2506.12718 · v3 · submitted 2025-06-15 · 🧮 math.NA · cs.DS· cs.MS· cs.NA

Permutation-Avoiding FFT-Based Convolution

Nicolas Venkovic , Hartwig Anzt This is my paper

Pith reviewed 2026-05-19 10:05 UTC · model grok-4.3

classification 🧮 math.NA cs.DScs.MScs.NA

keywords FFT convolutionCooley-Tukeyindex-reversal permutationdiscrete convolutionmulti-dimensional FFTpermutation avoidancegeneral radix

0 comments

The pith

The three sets of index-reversal permutations in Cooley-Tukey FFT convolution cancel exactly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when discrete convolution is computed through FFTs in the Cooley-Tukey framework, the index-reversal permutations required for the two forward transforms and the one inverse transform cancel one another completely. This cancellation removes every permutation step from the procedure. A reader would care because the permutations create irregular memory access that lowers arithmetic intensity, even when the total floating-point work stays the same. The study extends the observation to multi-dimensional arrays and arbitrary radices, then benchmarks the resulting kernels against conventional FFT-based convolution.

Core claim

In discrete convolution performed via FFTs that follow the Cooley-Tukey decomposition, the index-reversal permutations arising from the forward transforms of the two input arrays and the inverse transform of their product cancel exactly. The cancellation holds in the multi-dimensional, general-radix setting and leaves both the mathematical correctness of the convolution and the expected floating-point operation count unchanged. As a direct result, the convolution can be carried out without executing any index-reversal steps.

What carries the argument

Exact cancellation of the three index-reversal permutation sets that appear in the forward and inverse stages of an FFT-based convolution pipeline.

If this is right

Permutation-free FFT convolution kernels become feasible for general-radix, multi-dimensional cases.
Memory access patterns improve while the arithmetic operation count remains identical to standard FFT convolution.
Arithmetic intensity rises because the irregular accesses induced by permutations are eliminated.
Developers of FFT libraries should consider adding explicit support for these kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cancellation logic may apply to other composite FFT operations that chain multiple forward and inverse transforms.
Hardware-specific implementations could further reduce cache misses once the permutation steps are removed.
Performance comparisons on modern accelerators would test whether the theoretical arithmetic-intensity gain translates to wall-clock speedup.

Load-bearing premise

The index-reversal permutations from the two forward FFTs and the inverse FFT cancel exactly and without remainder in the multi-dimensional general-radix setting.

What would settle it

A concrete multi-dimensional input size and radix choice where the permutation-free procedure produces a result different from a verified standard convolution or where the floating-point operation count increases.

Figures

Figures reproduced from arXiv: 2506.12718 by Hartwig Anzt, Nicolas Venkovic.

**Figure 2.** Figure 2: Average per-element butterfly cost for computing systems #1 (solid lines) and #2 (dashed lines) [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗

read the original abstract

Fast Fourier Transform (FFT) libraries are widely used for evaluating discrete convolutions. Most FFT implementations follow some variant of the Cooley-Tukey framework, in which the transform is decomposed into butterfly operations and index-reversal permutations. While butterfly operations dominate the floating-point operation count, the memory access patterns induced by index-reversal permutations significantly degrade the FFT's arithmetic intensity. When performing discrete convolution, the three sets of index-reversal permutations which occur in FFT-based implementations using Cooley-Tukey frameworks cancel out, thus paving the way to implementations free of any permutation. To the best of our knowledge, such permutation-free variants of FFT-based discrete convolution are not commonly used in practice, making such kernels worth investigating. Here, we look into such permutation-avoiding convolution procedures for multi-dimensional cases within a general radix Cooley-Tukey framework. We perform numerical experiments to benchmark the algorithms presented against state-of-the-art FFT-based convolution implementations. Our results suggest that developers of FFT libraries should consider supporting permutation-avoiding convolution kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The permutations cancel in the convolution pipeline as claimed, giving a clean algebraic shortcut, but the multi-D general-radix benchmarks leave some questions on exact operation counts and edge-case robustness.

read the letter

The central observation is that the three index-reversal steps in a standard FFT convolution cancel exactly, so you can drop the permutations and keep the same arithmetic result. That identity is straightforward once you line up the forward and inverse transforms, and the paper spells out the construction inside a general-radix Cooley-Tukey framework for multiple dimensions. They also run direct numerical comparisons against current FFT-based convolution routines and report throughput gains in the tested cases. That combination of an explicit algebraic shortcut plus concrete timings is the useful part; it is not a routine extension because most library implementations still carry the reversals through the whole pipeline. The work is therefore worth a look for anyone tuning convolution kernels in scientific codes. The soft spot is the multi-dimensional general-radix case. The stress-test note correctly flags that each dimension can have its own radix, so the digit reversals are not automatically a single global operator. The paper asserts that the composition still cancels without remainder and that the floating-point operation count stays as expected, but the abstract gives no error-bar statistics or exhaustive checks on mixed-radix inputs. If the full experiments include those checks and show no hidden reordering cost, the issue is minor; otherwise a referee would want to see the precise indexing argument written out. This paper is aimed at FFT library developers and people who care about arithmetic intensity in convolution pipelines. A reader working on high-performance numerical methods would get practical implementation ideas and a clear baseline for further tuning. It is solid enough to deserve a serious referee: the core claim is algebraic rather than fitted, the experiments are present, and the incremental nature does not make it trivial. I would send it out for review rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that the three index-reversal (digit-reversal) permutations arising in Cooley-Tukey FFT-based discrete convolution—one from each forward transform and one from the inverse—cancel exactly, enabling permutation-free convolution kernels. It extends the observation to the multi-dimensional general-radix setting and supplies numerical benchmarks comparing the resulting algorithms against standard FFT-based convolution implementations.

Significance. If the cancellation identity holds without residual reordering or extra correction steps, the result would improve arithmetic intensity of convolution by removing memory-bound permutation stages whose cost is not negligible in practice. The algebraic character of the argument (no fitted parameters) and the provision of concrete numerical benchmarks against state-of-the-art libraries are positive features that would make the work useful to FFT library developers.

major comments (1)

[multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.

minor comments (1)

[Abstract] The abstract states that numerical experiments were performed but supplies neither error-bar statistics nor an explicit list of tested edge cases (e.g., odd lengths, mixed radices). Adding a short sentence or table reference would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the single major comment below and will revise the paper accordingly.

read point-by-point responses

Referee: [multi-dimensional general-radix extension] The central cancellation argument for the multi-dimensional general-radix case (the extension beyond the one-dimensional Cooley-Tukey composition) does not automatically guarantee exact cancellation once the pointwise product is formed. Each dimension applies its own radix-dependent digit reversal; because frequency-domain multiplication does not commute with a global reversal operator, a residual reordering step may remain whose floating-point operation count is not subtracted from the claimed total. An explicit derivation or exhaustive check for at least one non-power-of-two radix in two or more dimensions is needed to support the operation-count claim.

Authors: We appreciate the referee raising this point. In the multi-dimensional setting the reversal operator R is the tensor product of the per-dimension digit-reversal operators and is therefore identical for both forward transforms. Because the frequency-domain product is pointwise, the identity (R u) .* (R v) = R (u .* v) holds for any permutation matrix R; consequently the product remains in the domain expected by the inverse transform and the three reversals continue to cancel exactly. No residual global reordering is introduced. To make this explicit we will add, in the revised manuscript, a self-contained derivation for a 2-D convolution using radix 3 along the first axis and radix 4 along the second axis, together with a direct operation-count verification on that example. revision: yes

Circularity Check

0 steps flagged

No circularity detected: permutation cancellation is a direct algebraic consequence of FFT composition

full rationale

The paper derives the cancellation of the three index-reversal permutations (two forward FFTs and one inverse) as a property of the Cooley-Tukey decomposition applied to the convolution theorem. This holds by explicit composition of the permutation operators in the general-radix multi-dimensional setting and does not rely on any fitted parameters, self-citation chains, or re-labeling of inputs as outputs. The central claim is an identity that follows from the standard definitions of the transforms; numerical benchmarks then compare performance rather than validate the identity itself. No load-bearing step reduces to a tautology or to prior self-cited results that would need to be assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the standard decomposition of Cooley-Tukey FFTs into butterflies and index-reversal permutations; no free parameters, ad-hoc axioms, or new entities are introduced.

axioms (1)

standard math Cooley-Tukey FFT decomposes into butterfly operations and index-reversal permutations.
Invoked in the opening paragraph of the abstract as the basis for the permutation count.

pith-pipeline@v0.9.0 · 5712 in / 1238 out tokens · 43981 ms · 2026-05-19T10:05:14.407824+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

The journal of Supercomputing 4 (1990), 23–35

FFTs in external or hierarchical memory. The journal of Supercomputing 4 (1990), 23–35. Leo Bluestein

work page 1990
[2]

IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455

A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455. Grace Chan and Andrew TA Wood

work page 1970
[3]

Applied Statistics (1997), 171–181

Algorithm AS 312: An Algorithm for simulating stationary Gaussian random fields. Applied Statistics (1997), 171–181. James W Cooley and John W Tukey

work page 1997
[4]

Mathematics of computation 19, 90 (1965), 297–301

An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297–301. Colin R Dietrich and Garry N Newsam

work page 1965
[5]

Water Resources Research 29, 8 (1993), 2861–2869

A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research 29, 8 (1993), 2861–2869. Claude R Dietrich and Garry N Newsam

work page 1993
[6]

SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107

Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107. Pierre Duhamel and Henk Hollmann

work page 1997
[7]

Matteo Frigo and Steven G Johnson

‘Split radix’ FFT algorithm.Electronics letters 20, 1 (1984), 14–16. Matteo Frigo and Steven G Johnson

work page 1984
[8]

In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat

FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181) , Vol

work page 1998
[9]

The design and implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. W Morven Gentleman and Gordon Sande

work page 2005
[10]

In Proceedings of the November 7-10, 1966, fall joint computer conference

Fast fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference . 563–578. Rafael C Gonzalez

work page 1966
[11]

Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372

The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372. Intel Corporation

work page 1958
[12]

Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94

A numerical method for computing the overall response of nonlinear composites with complex microstructure. Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94. Alan V Oppenheim

work page 1998
[13]

Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56, 6 (1968), 1107–1108. Paul N Swarztrauber

work page 1968
[14]

SIAM Rev

The methods of cyclic reduction, Fourier analysis and the FACR algorithm for the discrete solution of Poisson’s equation on a rectangle. SIAM Rev. 19, 3 (1977), 490–501. Llewellyn H Thomas

work page 1977
[15]

Applications of digital computers (1963), 44–45

Using a computer to solve problems in physics. Applications of digital computers (1963), 44–45. Charles Van Loan

work page 1963
[16]

Journal of computational and graphical statistics 3, 4 (1994), 409–432

Simulation of stationary Gaussian processes in [0, 1]𝑑 . Journal of computational and graphical statistics 3, 4 (1994), 409–432. R Yavne

work page 1994
[17]

AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125

An economical method for calculating the discrete Fourier transform. AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125. Jan Zeman, Jaroslav Vondřejc, Jan Novák, and Ivo Marek

work page 1968
[18]

Accelerating a FFT-based solver for numerical homogenization of periodic media by conjugate gradients. J. Comput. Phys. 229, 21 (2010), 8065–8071. A FFT and convolution sub-kernel algorithms In the algorithms presented in Section 5, for any given 𝑥 ∈ C𝑛, we need to evaluate 𝐴𝑟,𝑛𝑥, 𝐴𝑟,𝑛𝑥 and 𝐴𝑇 𝑟,𝑛𝑥. Irrespective of the radix, these kernels are implemented...

work page 2010
[19]

, 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1,

38 Nicolas Venkovic and Hartwig Anzt Algorithm 18 Transposed butterfly kernel of radix-8 Input: 𝑥 ∈ C𝑛, 𝑛 = 8𝑡 Output: 𝑥 := 𝐴𝑇 8,𝑛𝑥 1: for 𝑞 = 𝑡, . . . , 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1, . . . , 𝑛𝑏 do 6: for 𝑗 = 1 . . . , ℓ do 7: 𝑧1 := 𝑥 (𝑏 −1)𝑘+𝑗 8: 𝑧2 := 𝑥 (𝑏 −1)𝑘+ℓ+𝑗 9: 𝑧3 := 𝑥 (𝑏 −1)𝑘+2ℓ+𝑗 10: 𝑧4 := 𝑥 (𝑏 −1)𝑘+3ℓ+𝑗 11: 𝑧5 := 𝑥 (𝑏 ...

work page 2025

[1] [1]

The journal of Supercomputing 4 (1990), 23–35

FFTs in external or hierarchical memory. The journal of Supercomputing 4 (1990), 23–35. Leo Bluestein

work page 1990

[2] [2]

IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455

A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics 18, 4 (1970), 451–455. Grace Chan and Andrew TA Wood

work page 1970

[3] [3]

Applied Statistics (1997), 171–181

Algorithm AS 312: An Algorithm for simulating stationary Gaussian random fields. Applied Statistics (1997), 171–181. James W Cooley and John W Tukey

work page 1997

[4] [4]

Mathematics of computation 19, 90 (1965), 297–301

An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297–301. Colin R Dietrich and Garry N Newsam

work page 1965

[5] [5]

Water Resources Research 29, 8 (1993), 2861–2869

A fast and exact method for multidimensional Gaussian stochastic simulations. Water Resources Research 29, 8 (1993), 2861–2869. Claude R Dietrich and Garry N Newsam

work page 1993

[6] [6]

SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107

Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing 18, 4 (1997), 1088–1107. Pierre Duhamel and Henk Hollmann

work page 1997

[7] [7]

Matteo Frigo and Steven G Johnson

‘Split radix’ FFT algorithm.Electronics letters 20, 1 (1984), 14–16. Matteo Frigo and Steven G Johnson

work page 1984

[8] [8]

In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat

FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181) , Vol

work page 1998

[9] [9]

The design and implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. W Morven Gentleman and Gordon Sande

work page 2005

[10] [10]

In Proceedings of the November 7-10, 1966, fall joint computer conference

Fast fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference . 563–578. Rafael C Gonzalez

work page 1966

[11] [11]

Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372

The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 20, 2 (1958), 361–372. Intel Corporation

work page 1958

[12] [12]

Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94

A numerical method for computing the overall response of nonlinear composites with complex microstructure. Computer methods in applied mechanics and engineering 157, 1-2 (1998), 69–94. Alan V Oppenheim

work page 1998

[13] [13]

Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56, 6 (1968), 1107–1108. Paul N Swarztrauber

work page 1968

[14] [14]

SIAM Rev

The methods of cyclic reduction, Fourier analysis and the FACR algorithm for the discrete solution of Poisson’s equation on a rectangle. SIAM Rev. 19, 3 (1977), 490–501. Llewellyn H Thomas

work page 1977

[15] [15]

Applications of digital computers (1963), 44–45

Using a computer to solve problems in physics. Applications of digital computers (1963), 44–45. Charles Van Loan

work page 1963

[16] [16]

Journal of computational and graphical statistics 3, 4 (1994), 409–432

Simulation of stationary Gaussian processes in [0, 1]𝑑 . Journal of computational and graphical statistics 3, 4 (1994), 409–432. R Yavne

work page 1994

[17] [17]

AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125

An economical method for calculating the discrete Fourier transform. AFIPS ’68 (Fall, part I): Proceedings of the December 9-11, 1968, fall joint computer conference, part I (1968), 115–125. Jan Zeman, Jaroslav Vondřejc, Jan Novák, and Ivo Marek

work page 1968

[18] [18]

Accelerating a FFT-based solver for numerical homogenization of periodic media by conjugate gradients. J. Comput. Phys. 229, 21 (2010), 8065–8071. A FFT and convolution sub-kernel algorithms In the algorithms presented in Section 5, for any given 𝑥 ∈ C𝑛, we need to evaluate 𝐴𝑟,𝑛𝑥, 𝐴𝑟,𝑛𝑥 and 𝐴𝑇 𝑟,𝑛𝑥. Irrespective of the radix, these kernels are implemented...

work page 2010

[19] [19]

, 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1,

38 Nicolas Venkovic and Hartwig Anzt Algorithm 18 Transposed butterfly kernel of radix-8 Input: 𝑥 ∈ C𝑛, 𝑛 = 8𝑡 Output: 𝑥 := 𝐴𝑇 8,𝑛𝑥 1: for 𝑞 = 𝑡, . . . , 1 do 2: 𝑘 := 8𝑞 3: 𝑛𝑏 := 𝑛/𝑘 4: ℓ := 𝑘/8 5: for 𝑏 = 1, . . . , 𝑛𝑏 do 6: for 𝑗 = 1 . . . , ℓ do 7: 𝑧1 := 𝑥 (𝑏 −1)𝑘+𝑗 8: 𝑧2 := 𝑥 (𝑏 −1)𝑘+ℓ+𝑗 9: 𝑧3 := 𝑥 (𝑏 −1)𝑘+2ℓ+𝑗 10: 𝑧4 := 𝑥 (𝑏 −1)𝑘+3ℓ+𝑗 11: 𝑧5 := 𝑥 (𝑏 ...

work page 2025