The Equivalence of Fast Algorithms for Convolution, Parallel FIR Filters, Polynomial Modular Multiplication, and Pointwise Multiplication in DFT/NTT Domain

Keshab K. Parhi

arxiv: 2512.01974 · v1 · submitted 2025-12-01 · 📡 eess.SP · cs.CR

The Equivalence of Fast Algorithms for Convolution, Parallel FIR Filters, Polynomial Modular Multiplication, and Pointwise Multiplication in DFT/NTT Domain

Keshab K. Parhi This is my paper

Pith reviewed 2026-05-17 02:37 UTC · model grok-4.3

classification 📡 eess.SP cs.CR

keywords fast convolutionparallel FIR filterspolynomial modular multiplicationDFT domainNTT domainCook-Toom algorithmWinograd algorithmhomomorphic encryption

0 comments

The pith

Fast convolution structures serve as the foundation for fast algorithms in parallel FIR filtering, polynomial modular multiplication, and pointwise multiplications over DFT and NTT domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that established fast convolution methods based on Cook-Toom and Winograd algorithms directly translate into efficient designs for parallel FIR filters, polynomial modular multiplication, and pointwise multiplications in the DFT and NTT domains. This equivalence allows any fast structure developed for convolution to be repurposed for the other problems while preserving the same reduction in multiplication count. Readers would care because the mappings unify signal processing tasks with operations central to post-quantum cryptography and homomorphic encryption. Short-length structures can be iterated for longer lengths in all domains through the same algebraic connections.

Core claim

Well known fast convolution structures form the basis for design of fast algorithms in four other problem domains: fast parallel filters, fast polynomial modular multiplication, and fast pointwise multiplication in the DFT and NTT domains. Fast polynomial modular multiplication and fast pointwise multiplication problems are important for cryptosystem applications such as post-quantum cryptography and homomorphic encryption. By establishing the equivalence of these problems, a fast structure from one domain can be used to design a fast structure for another domain.

What carries the argument

Algebraic mappings from fast convolution structures, including those from Cook-Toom and Winograd algorithms, that connect convolution to parallel FIR filters and to pointwise multiplications in DFT and NTT domains while preserving multiplication complexity reductions.

Load-bearing premise

The algebraic mappings between convolution, FIR filtering, polynomial modular multiplication, and pointwise DFT/NTT multiplication preserve both correctness and the claimed reduction in multiplication complexity without additional overhead or hidden assumptions on field characteristics or transform lengths.

What would settle it

A concrete case of a specific length and field where mapping a known fast convolution structure to polynomial modular multiplication produces either an incorrect result or requires more multiplications than a direct method.

Figures

Figures reproduced from arXiv: 2512.01974 by Keshab K. Parhi.

**Figure 2.** Figure 2: A 2-by-2 convolution structure [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: A 4-by-4 iterated fast convolution [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: A 2-parallel FIR filter [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: A L-parallel filter. IV. FAST POYNOMIAL MODULAR MULTIPLICATION A polynomial modular multiplication is defined as: p[x] = a[x]b[x] mod (x n + 1), (16) where a[x] = a0 + a1x + a2x 2 + . . . + an−1x n−1 , (17) b[x] = b0 + b1x + b2x 2 + . . . + bn−1x n−1 . (18) The polynomials a(x), b(x), and p(x) are of length-n. We can define [7] a[x] = a0[x 2 ] + xa1[x 2 ], (19) b[x] = b0[x 2 ] + xb1[x 2 ]. (20) [PITH_FULL… view at source ↗

**Figure 6.** Figure 6: A 4-parallel fast filter derived from a fast 4-by-4 convolution. Fast polynomial modular multiplication is of interest in PQC applications [7]. By applying fast algorithm similar to the 2- parallel FIR filter, we can derive a 2-parallel fast polynomial modular multiplier as described below: p[x] =a[x] · b[x] mod (x n + 1) (21) =p0[x 2 ] + x · p1[x 2 ] (22) p0[x 2 ] =a0[x 2 ] · b0[x 2 ] + x 2 · a1[x 2 ] · b… view at source ↗

**Figure 7.** Figure 7: Polynomial modular multiplication (y = x 2 ). V. FAST POINTWISE MULTIPLICATION IN THE DFT DOMAIN We can compute y[n] = h[n] ∗ x[n] in the DFT domain using Y [k] = X[k]H[k], where Y [k] denotes the N-point DFT of y[n]. Other notations can be defined similarly. All sequences are assumed to be of length-N. Fast computation of X[k]H[k] is referred to as fast pointwise multiplication in the DFT domain. We can e… view at source ↗

**Figure 9.** Figure 9: Pointwise modular multiplication in the NTT domain. [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗

**Figure 8.** Figure 8: Pointwise multiplication in the DFT domain. [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

read the original abstract

Fast time-domain algorithms have been developed in signal processing applications to reduce the multiplication complexity. For example, fast convolution structures using Cook-Toom and Winograd algorithms are well understood. Short length fast convolutions can be iterated to obtain fast convolution structures for long lengths. In this paper, we show that well known fast convolution structures form the basis for design of fast algorithms in four other problem domains: fast parallel filters, fast polynomial modular multiplication, and fast pointwise multiplication in the DFT and NTT domains. Fast polynomial modular multiplication and fast pointwise multiplication problems are important for cryptosystem applications such as post-quantum cryptography and homomorphic encryption. By establishing the equivalence of these problems, we show that a fast structure from one domain can be used to design a fast structure for another domain. This understanding is important as there are many well known solutions for fast convolution that can be used in other signal processing and cryptosystem applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames known fast convolution algorithms as reusable for parallel FIR, poly mod mult, and DFT/NTT pointwise mult, but the abstract leaves the exact mappings and complexity preservation unshown.

read the letter

The main thing to know is that Parhi claims well-known fast convolution structures based on Cook-Toom and Winograd can be directly repurposed for fast parallel FIR filters, fast polynomial modular multiplication, and fast pointwise multiplication in the DFT and NTT domains, with the savings carrying over to post-quantum crypto and homomorphic encryption work. The paper positions this as a unifying observation rather than a brand-new algorithm. It does a reasonable job of spelling out why this reuse matters: short-length convolution blocks can be iterated for longer lengths, and the same structures apply across these problem domains without starting from scratch each time. That kind of cross-area pointer can save implementers time when they already have convolution code or hardware in hand. The abstract is clear on the intended applications and on the fact that many existing convolution solutions exist to borrow from. The soft spots are in the missing details. The central claim rests on algebraic equivalences that preserve multiplication count exactly, yet the abstract supplies no derivation steps, no explicit mappings, and no worked examples that would let a reader check for extra scalar multiplications or domain-conversion overhead. The stress-test concern about field characteristics and transform-length restrictions is fair to raise here; NTT requires a primitive root of unity whose order divides q-1, and similar constraints appear for DFT and modular polynomial rings. If any mapping introduces hidden general multiplications or forces shorter usable lengths than the original convolution allows, the claimed reduction does not fully transfer. Without seeing those constructions in the full text it is hard to judge how clean the equivalences actually are. This paper is for hardware or algorithm designers who already know the fast-convolution literature and want a quick map to crypto arithmetic tasks. It is not aimed at readers looking for first-principles derivations or new complexity bounds. I would send it to peer review so the mappings and any length or field assumptions can be checked in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that well-known fast convolution algorithms (Cook-Toom and Winograd) and their iterated long-length versions are algebraically equivalent to fast algorithms for parallel FIR filtering, polynomial modular multiplication, and pointwise multiplication in the DFT and NTT domains. By establishing these equivalences, structures developed for one domain can be reused in the others while preserving the same multiplication-complexity reduction, with particular relevance to post-quantum cryptography and homomorphic encryption.

Significance. If the mappings are exact equivalences that introduce neither extra general multiplications nor domain-specific overhead, the paper supplies a unifying design principle that lets the extensive literature on fast convolution be ported directly to cryptographic polynomial arithmetic. This would be a useful contribution for practitioners who need low-multiplication implementations in rings with specific algebraic constraints.

major comments (2)

[§4] §4 (NTT equivalence): the claim that the fast convolution structure transfers without change in multiplication count is not accompanied by an explicit statement of the necessary condition that n divides q-1 for the existence of a primitive n-th root of unity. Without this, it is unclear whether the usable lengths are restricted relative to the original convolution case, undermining the assertion that the complexity reduction carries over unchanged.
[§3.2] §3.2 (polynomial modular multiplication mapping): the reduction step modulo the polynomial (e.g., x^n-1 or x^n+1) is described only at the level of the linear convolution; the manuscript does not count or bound any additional scalar multiplications or additions required for the modular reduction after the fast convolution block. If these operations are non-negligible, the claimed multiplication savings are not preserved.

minor comments (2)

[Abstract] The abstract and introduction repeatedly use the phrase 'equivalence of these problems' without a precise definition of what is being equated (input-output map, arithmetic complexity, or both). A short clarifying sentence would help readers.
[§5] Notation for the transform lengths (n, N, etc.) is introduced inconsistently across the DFT and NTT sections; a single table summarizing the parameter correspondences would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity while preserving the core claims of equivalence.

read point-by-point responses

Referee: [§4] §4 (NTT equivalence): the claim that the fast convolution structure transfers without change in multiplication count is not accompanied by an explicit statement of the necessary condition that n divides q-1 for the existence of a primitive n-th root of unity. Without this, it is unclear whether the usable lengths are restricted relative to the original convolution case, undermining the assertion that the complexity reduction carries over unchanged.

Authors: We agree that a primitive n-th root of unity exists only when n divides q-1. The equivalences in the manuscript are stated for NTT lengths where such a root exists, which is the standard assumption in the cryptographic settings (post-quantum cryptography and homomorphic encryption) where NTTs are employed; q is chosen to support the desired n. Under this condition the multiplication count remains identical to the convolution case, with no extra multiplications introduced. We will revise §4 to state this necessary condition explicitly and note that usable lengths are restricted to those for which the NTT is defined, consistent with the lengths considered in the convolution setting. revision: yes
Referee: [§3.2] §3.2 (polynomial modular multiplication mapping): the reduction step modulo the polynomial (e.g., x^n-1 or x^n+1) is described only at the level of the linear convolution; the manuscript does not count or bound any additional scalar multiplications or additions required for the modular reduction after the fast convolution block. If these operations are non-negligible, the claimed multiplication savings are not preserved.

Authors: For modular reduction modulo x^n-1 or x^n+1, the linear convolution of two degree-(n-1) polynomials yields a degree-(2n-2) result. Reduction is performed by adding (for x^n-1) or subtracting (for x^n+1) the upper n-1 coefficients to the lower n-1 coefficients. This step uses only additions and subtractions and requires no scalar multiplications. Consequently the multiplication count of the fast structure is unchanged. We will revise §3.2 to describe the reduction explicitly, bound the added operations, and confirm that they introduce no multiplications. revision: yes

Circularity Check

0 steps flagged

No circularity: algebraic equivalences link independent known structures

full rationale

The paper claims that known fast convolution algorithms (Cook-Toom, Winograd) can be reused in four other domains via algebraic mappings that preserve correctness and multiplication complexity. The abstract and description present this as an observation of equivalence rather than a quantity defined in terms of itself. No equation reduces a derived result to a fitted input by construction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The mappings are asserted to be equivalences that transfer existing structures without introducing new overhead, and the paper does not rely on self-referential definitions or load-bearing self-citations for its central claim. The derivation is therefore self-contained and draws on externally established fast convolution results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on algebraic equivalence between the listed operations. No free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

pith-pipeline@v0.9.0 · 5465 in / 1132 out tokens · 31986 ms · 2026-05-17T02:37:36.409927+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1]

R. E. Blahut,Fast algorithms for signal processing. Cambridge University Press, 2010

work page 2010
[2]

Fast FIR filtering: Algorithms and imple- mentations,

Z.-J. Mou and P. Duhamel, “Fast FIR filtering: Algorithms and imple- mentations,”Signal Processing, vol. 13, no. 4, pp. 377–384, 1987

work page 1987
[3]

Short-length FIR filters and their use in fast nonrecursive filtering,

——, “Short-length FIR filters and their use in fast nonrecursive filtering,” IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322–1332, 1991

work page 1991
[4]

P. P. Vaidyanathan,Multirate Systems and Filter Banks. Prentice Hall, Aug. 1993

work page 1993
[5]

K. K. Parhi,VLSI Digital Signal Processing Systems: Design and Imple- mentation. John Wiley & Sons, 1999

work page 1999
[6]

Hardware efficient fast parallel FIR filter structures based on iterated short convolution,

C. Cheng and K. K. Parhi, “Hardware efficient fast parallel FIR filter structures based on iterated short convolution,”IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 8, pp. 1492–1500, 2004

work page 2004
[7]

High-speed vlsi architectures for modular polynomial multiplication via fast filtering and applications to lattice-based cryptography,

W. Tan, A. Wang, X. Zhang, Y . Lao, and K. K. Parhi, “High-speed vlsi architectures for modular polynomial multiplication via fast filtering and applications to lattice-based cryptography,”IEEE Transactions on Computers, vol. 72, no. 9, pp. 2454–2466, 2023

work page 2023
[8]

KyberMat: Efficient accelerator for matrix-vector polynomial multiplication in CRYSTALS-Kyber scheme via NTT and polyphase decomposition,

W. Tan, Y . Lao, and K. K. Parhi, “KyberMat: Efficient accelerator for matrix-vector polynomial multiplication in CRYSTALS-Kyber scheme via NTT and polyphase decomposition,” in2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–9

work page 2023
[9]

Low-area/power parallel FIR digital filter implementations,

D. A. Parker and K. K. Parhi, “Low-area/power parallel FIR digital filter implementations,”Journal of VLSI signal processing systems for signal, image and video technology, vol. 17, no. 1, pp. 75–92, 1997

work page 1997

[1] [1]

R. E. Blahut,Fast algorithms for signal processing. Cambridge University Press, 2010

work page 2010

[2] [2]

Fast FIR filtering: Algorithms and imple- mentations,

Z.-J. Mou and P. Duhamel, “Fast FIR filtering: Algorithms and imple- mentations,”Signal Processing, vol. 13, no. 4, pp. 377–384, 1987

work page 1987

[3] [3]

Short-length FIR filters and their use in fast nonrecursive filtering,

——, “Short-length FIR filters and their use in fast nonrecursive filtering,” IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322–1332, 1991

work page 1991

[4] [4]

P. P. Vaidyanathan,Multirate Systems and Filter Banks. Prentice Hall, Aug. 1993

work page 1993

[5] [5]

K. K. Parhi,VLSI Digital Signal Processing Systems: Design and Imple- mentation. John Wiley & Sons, 1999

work page 1999

[6] [6]

Hardware efficient fast parallel FIR filter structures based on iterated short convolution,

C. Cheng and K. K. Parhi, “Hardware efficient fast parallel FIR filter structures based on iterated short convolution,”IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 8, pp. 1492–1500, 2004

work page 2004

[7] [7]

High-speed vlsi architectures for modular polynomial multiplication via fast filtering and applications to lattice-based cryptography,

W. Tan, A. Wang, X. Zhang, Y . Lao, and K. K. Parhi, “High-speed vlsi architectures for modular polynomial multiplication via fast filtering and applications to lattice-based cryptography,”IEEE Transactions on Computers, vol. 72, no. 9, pp. 2454–2466, 2023

work page 2023

[8] [8]

KyberMat: Efficient accelerator for matrix-vector polynomial multiplication in CRYSTALS-Kyber scheme via NTT and polyphase decomposition,

W. Tan, Y . Lao, and K. K. Parhi, “KyberMat: Efficient accelerator for matrix-vector polynomial multiplication in CRYSTALS-Kyber scheme via NTT and polyphase decomposition,” in2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–9

work page 2023

[9] [9]

Low-area/power parallel FIR digital filter implementations,

D. A. Parker and K. K. Parhi, “Low-area/power parallel FIR digital filter implementations,”Journal of VLSI signal processing systems for signal, image and video technology, vol. 17, no. 1, pp. 75–92, 1997

work page 1997