arxiv: 2603.13331 · v2 · submitted 2026-03-05 · 💻 cs.AI · cs.LG

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

Pith reviewed 2026-05-15 16:14 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords grokking · generalization delay · norm separation · regularized optimization · phase transition · Lyapunov contraction · deep learning dynamics

The pith

Grokking delay equals the inverse of the optimizer contraction rate times the log of the memorizing-to-generalizing norm ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a quantitative law for the delay between perfect memorization and sudden generalization in neural networks. It models grokking as the moment when a lower-norm generalizing representation overtakes a higher-norm memorizing one under regularization-driven contraction. The central result is the Norm-Separation Delay Law, which expresses the delay directly in terms of the effective contraction rate and the squared-norm ratio. Experiments across modular arithmetic and parity tasks confirm the predicted inverse scaling with weight decay and learning rate plus logarithmic dependence on the norm ratio. The work also shows that only certain optimizers permit the required decoupling of memorization from contraction.

Core claim

Grokking is a norm-driven representational phase transition in regularised training dynamics. The delay satisfies $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD and $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW. The upper bound follows from a discrete Lyapunov contraction argument, while the matching lower bound follows from the dynamical constraints of regularised first-order optimisation.
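As a rough worked reading of the law, the arithmetic below plugs in the approximate values quoted in the Figure 2 caption (V_mem ≈ 3900, V_post ≈ 300, ηλ = 0.001, γ_fit ≈ 1.41 · ηλ). Two assumptions are made here that the source does not state: the constant hidden by Θ is taken to be 1, and the ratio V_mem/V_post is treated as the squared-norm ratio.

```latex
% Assumptions (not from the paper): Theta-constant = 1; V_mem / V_post used as the squared-norm ratio.
\[
\frac{1}{\gamma_{\mathrm{eff}}}\,
\log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2}
\;\approx\; \frac{\log(3900/300)}{1.41 \times 10^{-3}}
\;=\; \frac{\log 13}{1.41 \times 10^{-3}}
\;\approx\; \frac{2.565}{1.41 \times 10^{-3}}
\;\approx\; 1.8 \times 10^{3}\ \text{steps}.
\]
```

Under these assumptions the estimate is of the same order as the grokking times in that caption (mean 1840 ± 215 steps), although those figures are T_grok rather than the delay T_grok − T_mem, so the comparison is only indicative.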

What carries the argument

The Norm-Separation Delay Law, which uses discrete Lyapunov contraction under regularization to quantify the time required for the smaller-norm generalizing interpolator to overtake the larger-norm memorizing one.

If this is right

Grokking delay scales inversely with weight decay strength across tasks.
Grokking delay scales inversely with learning rate.
Grokking occurs reliably with AdamW but fails entirely with SGD at identical hyperparameters.
A simple three-input predictor using contraction rate, norm ratio, and memorization time achieves 34.6 percent mean absolute error on held-out runs.
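The text here does not specify the predictor's exact form, so the following is only a minimal sketch of what a three-input predictor could look like, assuming it adds the law's delay term (with the Θ constant set to 1) to the memorization time. The function names `predict_t_grok` and `mean_abs_pct_error`, the percentage-error definition, and the held-out runs are all hypothetical.

```python
import math

def predict_t_grok(t_mem, gamma_eff, norm_sq_ratio):
    """Hypothetical predictor: T_mem + log(ratio) / gamma_eff.

    The constant hidden by the Theta notation is assumed to be 1; this is
    an illustration, not the paper's algorithm.
    """
    if gamma_eff <= 0 or norm_sq_ratio <= 1:
        raise ValueError("need gamma_eff > 0 and norm_sq_ratio > 1")
    return t_mem + math.log(norm_sq_ratio) / gamma_eff

def mean_abs_pct_error(predicted, observed):
    """Mean absolute error, expressed as a percentage of the observed values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed))
    return 100.0 * total / len(observed)

# Made-up held-out runs: (t_mem, gamma_eff, norm_sq_ratio, observed_t_grok).
held_out = [
    (150.0, 1.4e-3, 13.0, 2100.0),
    (200.0, 2.8e-3, 10.0, 900.0),
    (120.0, 0.7e-3, 12.0, 4200.0),
]
predictions = [predict_t_grok(t, g, r) for t, g, r, _ in held_out]
observed = [obs for *_, obs in held_out]
print(f"MAE: {mean_abs_pct_error(predictions, observed):.1f}%")
```

All three inputs are available at memorization time only if the post-grokking norm can be estimated in advance, which is the point the circularity audit on this page raises.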

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Measuring the norm ratio at the moment of memorization could allow early prediction of when generalization will appear.
Optimizers or regularizers could be designed to control or shorten the delay by altering contraction rates or norm gaps.
The same contraction-plus-norm-separation mechanism may govern other delayed-generalization phenomena beyond the tasks tested here.

Load-bearing premise

That grokking is caused by norm separation between two competing interpolating representations under regularization, with the generalizing solution having the smaller norm.

What would settle it

A controlled training run in which the observed grokking delay fails to scale inversely with weight decay or learning rate, or fails to show the predicted logarithmic dependence on the measured norm ratio at memorization time.
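One way to phrase the inverse-scaling check is as a log-log regression: if delay ∝ 1/λ, the fitted slope of log(delay) against log(λ) should be close to −1. The sketch below uses synthetic delays generated from the law itself, so it only demonstrates the check; the helper name `loglog_slope` and the constants are assumptions, while the λ grid matches the one quoted in the Figure 3 caption.

```python
import math

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

eta = 1e-3                       # learning rate (illustrative)
log_ratio = math.log(13.0)       # log norm ratio (illustrative)
weight_decays = [0.5, 1.0, 2.0]  # lambda grid from the Figure 3 caption

# Synthetic delays that follow the law exactly; real runs would replace these.
delays = [log_ratio / (eta * lam) for lam in weight_decays]

slope = loglog_slope(weight_decays, delays)
print(f"fitted slope: {slope:.3f}")
```

On measured delays, a slope far from −1 (or far from −1 against η) would count against the law, subject to whatever tolerance the experimenter fixes beforehand.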

Figures

Figures reproduced from arXiv: 2603.13331 by Luu Duc Trung, Phan Thanh Duc, Truong Quynh Hoa, Truong Xuan Khanh.

**Figure 1.** Conceptual overview of the Norm-Separation Delay Law. After memorisation (T_mem), weight decay contracts parameter norms exponentially from the high-norm memorisation region toward the low-norm Fourier manifold. The grokking delay is the time required for this exponential contraction to traverse the norm gap log(‖θ_mem‖²/‖θ_post‖²). Generalisation (T_grok) occurs once parameters enter the Fourier region and…

**Figure 2.** Lyapunov escape validation (real data). (a) Fitted contraction rates ρ across 10 seeds; all exceed the weight-decay baseline 1 − ηλ = 0.999 (green), confirming AdamW amplification (γ_fit ≈ 1.41 · ηλ). (b) Distribution of grokking times across 10 seeds (mean 1840 ± 215 steps). (c) Norm separation: V_mem ≈ 3900 vs V_post ≈ 300 across seeds. … ηλ = 0.001 by ∼40% because AdamW applies adaptive per-parameter learning …

**Figure 3.** Scaling laws. (a) T_grok vs 1/λ in Regime II (R² = 0.97). (b) T_grok vs 1/η (R² = 0.92). (c) Delay vs log norm ratio across 7 moduli (r = 0.91). Experiment B (joint η × λ grid). To test the combined scaling T_grok ∝ 1/(ηλ), we run a grid with η ∈ {2×10⁻³, 10⁻³, 5×10⁻⁴} and λ ∈ {0.5, 1.0, 2.0}, 5 seeds per cell

**Figure 4.** Cross-task generalization. (a) Modular multiplication: all 20 runs grok across 4 moduli. (b) Exponential fit quality (R²) for all 20 multiplication runs; all exceed 0.988. (c) Sparse parity: V_final > V_mem in all 15 runs, so no norm separation and no grokking. 4.7 SGD vs AdamW Ablation. The theory in Section 3 is derived for SGD with weight decay, while all experiments in Sections 4.2–4.6 use AdamW. To close this …

**Figure 5.** AdamW contraction analysis. (a) Fitted contraction rates across 10 seeds; all exceed the weight-decay baseline 1 − ηλ by ∼41%. (b) Grokking success: AdamW groks in 5/5 seeds; SGD fails entirely. (c) R² of exponential fits across all seeds (mean R² = 0.9990). 4.8 Generalization Beyond Modular Addition. A key question is whether the escape-time formula applies beyond modular addition. We test two structurally d…

**Figure 6.** Phase diagram and cross-task universality. (a) Phase diagram in (η, λ) space showing three regimes. Red triangle: SGD fails where AdamW succeeds. (b) Delay vs log norm ratio for both addition (S1, blue circles) and multiplication (S9, coral diamonds); the linear relationship holds across tasks. … delay is −120 steps). Crucially, the norm ratio is inverted: V_final/V_mem ∈ [1.64, 1.97], meaning the final parame…
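The captions for Figures 2 and 5 refer to contraction rates fitted to norm trajectories. A minimal sketch of such a fit is given below, assuming the Lyapunov value decays geometrically as V_t = V_0 · ρ^t after memorization, so that log V_t is linear in t. The function name `fit_contraction_rate` and the synthetic trajectory are illustrative, and the paper's actual fitting procedure may differ.

```python
import math

def fit_contraction_rate(values):
    """Least-squares fit of log V_t = log V_0 + t * log(rho); returns rho."""
    n = len(values)
    if n < 2 or min(values) <= 0:
        raise ValueError("need at least two strictly positive values")
    logs = [math.log(v) for v in values]
    mean_t = (n - 1) / 2
    mean_log = sum(logs) / n
    stt = sum((t - mean_t) ** 2 for t in range(n))
    stl = sum((t - mean_t) * (l - mean_log) for t, l in enumerate(logs))
    return math.exp(stl / stt)

eta_lambda = 1e-3                     # weight-decay baseline from the Figure 2 caption
true_rho = 1 - 1.41 * eta_lambda      # synthetic per-step factor (assumed)

# Synthetic trajectory starting at V_mem ~ 3900; real runs would log V_t instead.
trajectory = [3900.0 * true_rho ** t for t in range(500)]

rho_fit = fit_contraction_rate(trajectory)
gamma_fit = 1 - rho_fit
amplification = gamma_fit / eta_lambda
print(f"rho_fit={rho_fit:.5f}, gamma_fit/(eta*lambda)={amplification:.2f}")
```

Comparing `gamma_fit` with ηλ is one way to read the "AdamW amplification" reported in the captions; on real trajectories the fit window and any norm floor would need to be chosen with care.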

Original abstract

Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
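The abstract reports a bootstrap 95% confidence interval for the predictor's error but, in the text available here, not how it was computed. A percentile bootstrap over per-run errors is one common choice and is sketched below under that assumption. The function name `bootstrap_mean_ci`, the resample count, and the per-run errors are hypothetical.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical per-run absolute percentage errors for N = 60 runs (not the paper's data).
errors = [20.0 + (7 * i) % 30 for i in range(60)]
mean_error = sum(errors) / len(errors)

ci_low, ci_high = bootstrap_mean_ci(errors)
print(f"mean={mean_error:.1f}%, 95% CI=[{ci_low:.1f}%, {ci_high:.1f}%]")
```

The fixed seed only makes the sketch repeatable. It says nothing about the seeds or resampling scheme used in the paper.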

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clean scaling law for grokking delay from norm separation plus strong empirical checks, but the lower-bound part of the Θ claim rests on unshown dynamical constraints.

The punchline is that this work turns the grokking delay into a testable scaling relation: T_grok − T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), with the norm ratio taken at memorization. The experiments back the inverse dependence on weight decay and learning rate with R² of 0.97 and 0.92, and the log-norm term with Pearson r of 0.91, across 293 runs on modular addition, multiplication, and sparse parity. They also show that AdamW groks reliably where plain SGD does not at the same hyperparameters, which is a concrete optimizer distinction worth noting. The three-input predictor achieving 34.6 percent MAE on new seeds is a practical takeaway that could be used for early stopping checks.

What the paper does well is keep the claims falsifiable and report the scaling relations directly from the data rather than post-hoc curve fitting. The empirical coverage is broad enough to make the inverse and log patterns credible.

The soft spot is the derivation. The upper bound follows from a discrete Lyapunov contraction on the quadratic penalty, which is standard. The matching lower bound is attributed to dynamical constraints of regularized first-order optimization without an explicit sequence showing why any trajectory must spend at least that many steps before the smaller-norm interpolator dominates. If those constraints reduce to the loss staying flat until norms separate, the two-sided Θ is not fully justified and the result is closer to an upper-bound scaling plus observed correlation. The predictor also uses the measured norm ratio at memorization time, so it functions more as a diagnostic than a zero-shot forecast from initialization.

This paper is for researchers working on optimization dynamics and delayed generalization in deep networks. A reader interested in quantitative theories of grokking will get value from the scaling plots and the optimizer contrast even if they want tighter proofs. It deserves a serious referee because the empirical patterns are sharp and the central idea is simple enough to check or extend.

Referee Report

1 major / 2 minor

Summary. The paper claims that grokking arises as a norm-driven representational phase transition under regularized training. It establishes the Norm-Separation Delay Law T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimizer's effective contraction rate (ηλ for SGD, ≥ηλ for AdamW). The upper bound is derived from a discrete Lyapunov contraction argument on the quadratic norm penalty; the matching lower bound follows from dynamical constraints of regularized first-order optimization. Across 293 runs on modular addition, multiplication, and sparse parity, the work reports inverse scaling of delay with weight decay (R²=0.97) and learning rate (R²=0.92), logarithmic dependence on the norm ratio (Pearson r=0.91), failure of SGD to grok at hyperparameters where AdamW succeeds, and a three-input predictor achieving 34.6% MAE at memorization time.

Significance. If the derivation is completed, the result supplies the first quantitative, falsifiable scaling law for grokking delay grounded in optimization dynamics rather than phenomenology. The high R² fits, the explicit contrast between SGD and AdamW, and the practical early-stopping algorithm constitute clear strengths that could be directly useful for training analysis. The work reframes delayed generalization as a predictable consequence of norm separation between competing interpolators.

major comments (1)

[Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.

minor comments (2)

The practical three-input predictor is announced with a 34.6% MAE but its exact inputs, training procedure, and bootstrap details are not fully specified in the provided text; a short algorithmic box or pseudocode would improve reproducibility.
The definition of γ_eff for AdamW is given as ≥ηλ; an explicit expression or bound in terms of β1, β2, and ε would remove ambiguity when comparing optimizers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The central concern is that the upper and lower bounds in the Norm-Separation Delay Law are stated at a high level without explicit derivations. We agree this must be remedied and will supply the complete proofs in the revision.

Referee: [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.

Authors: We acknowledge that the present manuscript presents the bounds at a summary level. In the revised version we will expand the dedicated proof section to include: (i) the explicit Lyapunov function V(θ) = ½‖θ‖² together with the full contraction inequality ‖θ_{t+1}‖² ≤ (1 − 2ηλ + O(η²L))‖θ_t‖² + η²‖∇L‖² under standard smoothness assumptions, yielding the O(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) upper bound with explicit remainder terms; (ii) the matching lower-bound argument showing that any first-order trajectory must take at least Ω(γ_eff^{-1} log(ratio)) steps for the smaller-norm interpolator to dominate, because the loss gap between the two competing solutions closes at a rate bounded by the same contraction factor and cannot be accelerated beyond it while both remain interpolators. These additions will render the Θ statement fully rigorous while preserving the main-text summary. revision: yes
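For orientation, the unrolling step that the rebuttal promises can be sketched as follows. This is only an outline under added assumptions that do not appear in the source: q denotes the one-step factor 1 − 2ηλ + O(η²L) and is assumed to lie in (0, 1), and G is a hypothetical uniform bound on ‖∇L‖ along the post-memorization trajectory.

```latex
% Assumptions (not from the source): q := 1 - 2\eta\lambda + O(\eta^2 L) \in (0,1);
% G is a hypothetical uniform bound on \|\nabla L(\theta_t)\| for t \ge T_{\mathrm{mem}}.
\begin{align*}
\|\theta_{t+1}\|^2
  &\le q\,\|\theta_t\|^2 + \eta^2 \|\nabla L(\theta_t)\|^2
  && \text{(one-step recursion)} \\
\|\theta_{T_{\mathrm{mem}}+T}\|^2
  &\le q^{T}\,\|\theta_{\mathrm{mem}}\|^2 + \eta^2 G^2 \sum_{s=0}^{T-1} q^{s}
   \;\le\; q^{T}\,\|\theta_{\mathrm{mem}}\|^2 + \frac{\eta^2 G^2}{1-q}
  && \text{(unrolled over } T \text{ steps)} \\
q^{T}
  &\le e^{-T(1-q)}
  && \text{(since } \log q \le q - 1\text{)} \\
T \ge \frac{1}{1-q}\,
      \log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2}
  &\;\Longrightarrow\;
  \|\theta_{T_{\mathrm{mem}}+T}\|^2
  \le \|\theta_{\mathrm{post}}\|^2 + \frac{\eta^2 G^2}{1-q}.
\end{align*}
```

Because 1 − q = 2ηλ − O(η²L), which is Θ(ηλ) for small η, this recovers the O(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) upper bound for SGD up to the gradient remainder. It says nothing about the lower bound, which is the part the referee flags as unproven.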

Circularity Check

1 step flagged

Delay law uses measured norm ratio at memorization as direct input to quantitative prediction

specific steps

fitted input called prediction [Abstract (Norm-Separation Delay Law statement)]
"T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimiser's effective contraction rate... We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error"

The delay is expressed directly in terms of the norm ratio measured at T_mem; the 'prediction' algorithm therefore takes that observed ratio as an input rather than deriving the full delay length from hyperparameters and initial conditions alone. The Θ scaling is then fitted to data that already encodes the same norm separation.

full rationale

The central claim presents the Norm-Separation Delay Law as derived from a discrete Lyapunov contraction (upper bound) plus dynamical constraints (lower bound). However, the explicit formula for the delay directly incorporates the observed ‖θ_mem‖² / ‖θ_post‖² ratio measured at T_mem, and the practical three-input prediction algorithm is evaluated on that same measured ratio. This makes the quantitative output statistically dependent on post-memorization observations rather than a parameter-free derivation from hyperparameters alone. The empirical R² and Pearson correlations are reported on the same runs, but no self-citation chain or self-definitional loop is present; the derivation steps themselves are not shown to collapse by construction. Overall partial circularity from fitted-input usage.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on optimization-dynamics assumptions identifying norm separation as the driver of the phase transition; no new physical entities are introduced and the norm ratio is treated as an observable rather than a fitted constant.

free parameters (1)

γ_eff
Effective contraction rate defined as ηλ for SGD and bounded for AdamW; its precise value for AdamW may require empirical calibration.

axioms (2)

domain assumption: Discrete Lyapunov contraction governs the upper bound on delay under regularized first-order optimization.
Invoked to establish the upper-bound half of the Θ claim on T_grok - T_mem.
domain assumption: Grokking arises as a representational phase transition driven by norm separation between memorizing and generalizing interpolators.
Core framing that converts the delay into a norm-ratio problem.

pith-pipeline@v0.9.0 · 5666 in / 1437 out tokens · 81458 ms · 2026-05-15T16:14:07.811515+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

T_grok − T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) ... upper bound follows from a discrete Lyapunov contraction argument

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · unclear

UNCLEAR: Relation between the paper passage and the cited Recognition theorem.

exponential contraction of parameter norms ... rate 1−ηλ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
cs.LG 2026-04 unverdicted novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
