pith. machine review for the scientific record.

arxiv: 2605.04230 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

Recognition: unknown

Layerwise LQR for Geometry-Aware Optimization of Deep Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords geometry-aware optimization · natural gradient · LQR · preconditioner · second-order methods · layerwise · deep learning · K-FAC

The pith

The steepest-descent step under Newton and natural-gradient geometries is equivalent to a finite-horizon LQR problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the steepest-descent direction under a family of quadratic geometry models, including Newton, Gauss-Newton, and Fisher/natural gradient, admits an exact rewriting as the solution to a finite-horizon linear quadratic regulator problem. This rewriting makes the layerwise cost structure and cross-layer interactions explicit without ever forming the full curvature matrix. The authors then relax the LQR objective to learn reusable diagonal or Kronecker-factored inverse preconditioners that can be applied inside existing first-order loops. Experiments on ResNets and Transformers indicate that the resulting updates improve training dynamics and often final test accuracy while incurring only modest extra cost.

Core claim

The steepest-descent step under a broad class of divergence-induced quadratic models—including Newton, Gauss-Newton, Fisher/natural-gradient, and intermediate-layer metrics—can be written as a finite-horizon Linear Quadratic Regulator (LQR) problem. This formulation serves as a reference that exposes the layerwise dynamics and cost matrices encoding the original dense geometry. A scalable relaxation then learns structured inverse preconditioners by minimizing the LQR objective and reusing them across iterations.
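
For orientation, here is a minimal sketch of the claimed correspondence in notation assumed for illustration; the per-layer matrices Q_i, R_i, A_i, B_i stand in for the paper's cost and dynamics terms, which are not reproduced on this page.

    % Steepest descent under a quadratic geometry model with curvature M
    % (Hessian, Gauss-Newton, or Fisher):
    \[
      \Delta\theta^{\star}
        = \arg\min_{\Delta\theta}\; g^{\top}\Delta\theta
          + \tfrac{1}{2}\,\Delta\theta^{\top} M\,\Delta\theta
        = -\,M^{-1} g .
    \]
    % Claimed finite-horizon LQR rewriting: layer perturbations \delta x_i act as
    % states, per-layer updates \Delta\theta_i as controls, and the stage costs
    % encode the curvature blocks.
    \[
      \min_{\{\Delta\theta_i\}}\;
        \sum_{i=1}^{N}
          \Bigl( \tfrac{1}{2}\,\delta x_i^{\top} Q_i\,\delta x_i
               + q_i^{\top}\delta x_i
               + \tfrac{1}{2}\,\Delta\theta_i^{\top} R_i\,\Delta\theta_i
               + r_i^{\top}\Delta\theta_i \Bigr)
      \quad \text{s.t.} \quad
        \delta x_{i+1} = A_i\,\delta x_i + B_i\,\Delta\theta_i,
        \qquad \delta x_1 = 0 .
    \]

On this reading, the backward Riccati recursion for the control problem returns the same step as −M⁻¹g without materializing M; the exact forms of Q_i, R_i, A_i, B_i for each divergence are what the referee's first major comment asks to see derived in full.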

What carries the argument

The finite-horizon LQR equivalence for the steepest-descent step under quadratic geometry models: the LQR's stage cost matrices encode the original curvature, and its solution yields the update direction.
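
To make that mechanism concrete, the sketch below (toy matrices, assumed purely for illustration and unrelated to the paper's construction) solves one small finite-horizon LQR two ways: by the backward Riccati recursion, and by eliminating the dynamics and minimizing a single dense quadratic over all controls. The two routes agree to round-off, which is the sense in which an LQR rewriting can deliver a curvature-aware step without forming the coupled matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    N, n, m = 6, 4, 3                    # horizon ("layers"), state dim, control dim

    # Toy stage data; placeholders, not the paper's cost/dynamics matrices.
    A = [rng.standard_normal((n, n)) / n for _ in range(N)]
    B = [rng.standard_normal((n, m)) for _ in range(N)]
    Q = [np.eye(n) * (k + 1.0) for k in range(N + 1)]   # state costs, incl. terminal
    R = [np.eye(m) * 0.5 for _ in range(N)]             # control costs
    x0 = rng.standard_normal(n)

    # Route 1: backward Riccati recursion, then a forward rollout.
    P = Q[N]
    K = [None] * N
    for k in reversed(range(N)):
        K[k] = np.linalg.solve(R[k] + B[k].T @ P @ B[k], B[k].T @ P @ A[k])
        P = Q[k] + A[k].T @ P @ (A[k] - B[k] @ K[k])
    x, u_riccati = x0, []
    for k in range(N):
        u = -K[k] @ x
        u_riccati.append(u)
        x = A[k] @ x + B[k] @ u
    u_riccati = np.concatenate(u_riccati)

    # Route 2: eliminate the dynamics and minimize one dense quadratic over all
    # controls at once (the analogue of forming the full coupled curvature).
    Phi = [np.eye(n)]
    for k in range(N):
        Phi.append(A[k] @ Phi[-1])                      # x_k = Phi[k] x0 + Gamma u
    Gamma = np.zeros((n * (N + 1), m * N))
    for k in range(1, N + 1):
        for j in range(k):
            blk = np.eye(n)
            for t in range(j + 1, k):
                blk = A[t] @ blk
            Gamma[k * n:(k + 1) * n, j * m:(j + 1) * m] = blk @ B[j]
    Qbar = np.zeros((n * (N + 1), n * (N + 1)))
    for k in range(N + 1):
        Qbar[k * n:(k + 1) * n, k * n:(k + 1) * n] = Q[k]
    Rbar = np.zeros((m * N, m * N))
    for k in range(N):
        Rbar[k * m:(k + 1) * m, k * m:(k + 1) * m] = R[k]
    H_dense = Gamma.T @ Qbar @ Gamma + Rbar
    g_dense = Gamma.T @ Qbar @ (np.vstack(Phi) @ x0)
    u_dense = -np.linalg.solve(H_dense, g_dense)

    print(np.max(np.abs(u_riccati - u_dense)))          # agrees to round-off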

If this is right

  • Standard optimizers can be wrapped with the learned preconditioners while retaining a principled link to second-order geometry (a minimal wrapping sketch follows this list).
  • No global curvature matrix needs to be formed or inverted at any point.
  • Training dynamics improve on ResNets and Transformers, frequently translating into better final test performance.
  • The same LQR-derived reference can serve to evaluate other scalable approximations to second-order methods.
  • Only modest wall-clock overhead is added beyond the base optimizer.
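
To ground the first bullet above, here is a minimal sketch of what wrapping a first-order loop could look like: a diagonal inverse preconditioner is fit, blended by an exponential moving average, and reused between periodic refreshes. The preconditioner here is just the inverse Hessian diagonal of a toy regularized logistic-regression loss, a stand-in for LLQR's learned U rather than the paper's procedure; the learning rate, refresh interval, and EMA decay are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, lam = 400, 20, 0.05
    X = rng.standard_normal((n, d)) * rng.uniform(0.5, 2.0, d)  # unevenly scaled features
    y = (X @ rng.standard_normal(d) + 0.3 * rng.standard_normal(n) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad(w):                                   # ridge-regularized logistic loss
        return X.T @ (sigmoid(X @ w) - y) / n + lam * w

    def inv_diag_curvature(w):
        # Stand-in for a learned diagonal inverse preconditioner: the inverse of the
        # Hessian diagonal of this loss, NOT the paper's LLQR fitting procedure.
        p = sigmoid(X @ w)
        return 1.0 / (X.T ** 2 @ (p * (1 - p)) / n + lam)

    w = np.zeros(d)
    lr, refresh_every, ema = 0.1, 50, 0.9          # assumed hyper-parameters
    u = inv_diag_curvature(w)                      # fit once ...
    for step in range(300):
        if step and step % refresh_every == 0:
            u = ema * u + (1 - ema) * inv_diag_curvature(w)  # ... refresh and reuse
        w -= lr * u * grad(w)                      # wrapped first-order update

    print(float(np.linalg.norm(grad(w))))          # should shrink toward zero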

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The control-theoretic lens could let researchers import LQR stability certificates to analyze convergence of the learned preconditioners.
  • The horizon length in the LQR formulation might be tuned per architecture to capture longer-range layer dependencies than current Kronecker methods allow.
  • If the relaxation succeeds, analogous optimal-control rewritings could be attempted for other non-convex problems outside deep learning where quadratic models arise.

Load-bearing premise

That minimizing the global LQR objective over structured diagonal or Kronecker inverse preconditioners preserves enough of the original geometry to improve optimization dynamics without introducing instability or bias.

What would settle it

On a small network where the exact Newton or natural-gradient direction can be computed directly, showing that the structured LLQR direction yields a measurably slower per-step loss decrease than the true geometry-aware direction would falsify the claim that the relaxation retains useful geometry.
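
A minimal version of that measurement on a synthetic ill-conditioned quadratic, rather than a real network, is sketched below. The diagonal-Hessian direction stands in for a structured preconditioner (it is also the diagonal baseline in Figure 4) and is not the paper's LLQR, so the printed gap only illustrates the kind of per-step comparison that would settle the question.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 20
    # Ill-conditioned quadratic L(w) = 0.5 (w - w*)^T H (w - w*), a stand-in for a
    # small model where the exact Newton direction is computable.
    Qrot, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H = Qrot @ np.diag(np.logspace(0, 3, n)) @ Qrot.T
    w_star = rng.standard_normal(n)

    def loss(w):
        return 0.5 * (w - w_star) @ H @ (w - w_star)

    def grad(w):
        return H @ (w - w_star)

    def best_decrease(w, d):
        # Loss decrease under an exact line search along d, so directions are
        # compared independently of any learning-rate choice.
        gw = grad(w)
        alpha = -(gw @ d) / (d @ H @ d)
        return loss(w) - loss(w + alpha * d)

    w0 = rng.standard_normal(n)
    g = grad(w0)
    directions = {
        "newton": -np.linalg.solve(H, g),         # exact geometry-aware direction
        "diag-H": -(1.0 / np.diag(H)) * g,        # diagonal-Hessian stand-in (cf. Fig. 4)
        "grad": -g,                               # plain gradient, for reference
    }
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    for name, d in directions.items():
        print(f"{name:7s} cos-to-newton={cos(d, directions['newton']):.3f} "
              f"best per-step decrease={best_decrease(w0, d):.3f}")

If the structured direction's per-step decrease lags the exact direction by a wide margin, the load-bearing premise above is in trouble; if it tracks it closely, the relaxation is doing real work.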

Figures

Figures reproduced from arXiv:2605.04230 by Aristide Baratin, Pierre-Luc Bacon, Razvan Pascanu, Simon Dufort-Labbé, Simon Lacoste-Julien.

Figure 1: ImageNet and IWSLT14 training curves under the NGD-induced divergence.
Figure 2: Validation of the LQR equivalence and relaxation on the Rosenbrock function. (Red and orange) Newton's step and the Riccati solution overlap, confirming exact equivalence. (Green) Relaxation with block-diagonal U converges faster as the update frequency increases, though it remains approximate. (Blue) Relaxation with dense U on the single-layer formulation matches the convergence rate of Newton's step. (Violet) …
Figure 3: The grokking phenomenon, a long plateau before sudden generalization, serves as a …
Figure 4: Rosenbrock extended diagnostics. Left: optimization trajectories over Rosenbrock level curves for Newton's method, exact LQR, two-layer LLQR, and a diagonal-Hessian Newton approximation. Right: cosine similarity between each approximate update direction and the exact Newton direction −H⁻¹g. Despite using diagonal degrees of freedom, LLQR learns an update-aligned correction that remains close to Newton, wh…
Figure 5: ResNet-18 training curves on CIFAR-10 and CIFAR-100 with SGDM and AdamW under …
read the original abstract

Geometry-aware optimizers such as Newton and natural gradient can improve conditioning in deep learning, but scalable variants such as K-FAC, Shampoo, and related preconditioners usually impose structural approximations early, often discarding cross-layer interactions induced by the network computation. We introduce Layerwise LQR (LLQR), a framework for learning structured inverse preconditioners under a global layerwise optimal-control objective. The starting point is an exact equivalence: the steepest-descent step under a broad class of divergence-induced quadratic models--including Newton, Gauss-Newton, Fisher/natural-gradient, and intermediate-layer metrics--can be written as a finite-horizon Linear Quadratic Regulator (LQR) problem. This formulation serves as a reference that exposes the layerwise dynamics and cost matrices encoding the original dense geometry. We then derive a scalable relaxation that learns diagonal, (E-)Kronecker-factored, or other structured inverse preconditioners by minimizing the LQR objective and reusing them across iterations. The resulting optimizer wraps standard methods while retaining a principled connection to second-order geometry, without forming or inverting the global curvature matrix. Experiments on ResNets and Transformers show that LLQR improves optimization dynamics and often translates these gains into improved final test performance, while adding only modest wall-clock overhead. It establishes LLQR as a practical framework for geometry-aware second-order methods and a reference for evaluating scalable approximations.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Layerwise LQR (LLQR), a framework that establishes an exact equivalence between the steepest-descent step under a class of divergence-induced quadratic models (Newton, Gauss-Newton, Fisher/natural gradient, and intermediate-layer metrics) and a finite-horizon Linear Quadratic Regulator (LQR) problem. It then derives a scalable relaxation that learns structured inverse preconditioners (diagonal or (E-)Kronecker-factored) by minimizing the LQR objective and reuses them across iterations. The resulting optimizer is shown to improve optimization dynamics and test performance on ResNets and Transformers with modest overhead, while avoiding explicit formation or inversion of the global curvature matrix.

Significance. If the equivalence holds exactly and the structured relaxation preserves sufficient geometry without introducing uncontrolled bias or instability, the work provides a principled bridge between optimal-control formulations and scalable second-order methods in deep learning. It supplies a reference objective against which other layerwise approximations can be evaluated and could guide the design of preconditioners that explicitly account for cross-layer interactions induced by the network computation.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (LQR equivalence derivation): The central claim of an 'exact equivalence' to a finite-horizon LQR problem models layer-to-layer propagation as linear dynamics with quadratic per-stage costs. For networks containing nonlinear activations (ReLU, GELU, etc.), the effective mapping from parameter perturbations to loss is not linear, so the cost matrices derived from the geometry may not recover the original quadratic objective without additional linearization or approximation steps. The manuscript must supply the full derivation, state any such steps explicitly, and quantify the resulting modeling error.
  2. [§4] §4 (structured relaxation): Minimizing the LQR objective over chosen structures (diagonal, Kronecker) reuses the global reference but treats structure choice as a free parameter. No error bound or stability analysis is provided showing that the resulting preconditioners retain enough of the dense geometry-aware benefits to improve dynamics without introducing bias or divergence. A quantitative comparison of the relaxed versus dense LQR cost, or a perturbation analysis around the dense solution, is required to support the claim that the relaxation is principled rather than heuristic.
  3. [§5] §5 (experiments): The reported gains on ResNets and Transformers are promising, yet the experimental design lacks controls that isolate the contribution of the LQR objective from the particular choice of structure or hyper-parameter tuning. Direct comparisons to K-FAC or Shampoo using identical Kronecker structures, together with ablations that disable the LQR minimization step, are needed to establish that the layerwise optimal-control formulation is responsible for the observed improvements.
minor comments (2)
  1. [§2–3] Notation in §2–3: Define the per-stage cost matrices Q_t, R_t explicitly in terms of the divergence-induced Hessian or Fisher blocks so that readers can verify they recover the original quadratic model.
  2. [Figures] Figure captions and legends: Include the number of independent runs and error bars (or shaded regions) for all optimization-trajectory plots to allow assessment of statistical reliability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment point by point below, outlining the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (LQR equivalence derivation): The central claim of an 'exact equivalence' to a finite-horizon LQR problem models layer-to-layer propagation as linear dynamics with quadratic per-stage costs. For networks containing nonlinear activations (ReLU, GELU, etc.), the effective mapping from parameter perturbations to loss is not linear, so the cost matrices derived from the geometry may not recover the original quadratic objective without additional linearization or approximation steps. The manuscript must supply the full derivation, state any such steps explicitly, and quantify the resulting modeling error.

    Authors: The equivalence holds exactly with respect to the quadratic model of the loss, which is the explicit foundation of the steepest-descent steps in Newton, Gauss-Newton, Fisher/natural-gradient, and related methods. The quadratic approximation is formed locally around the current parameters; within this model the objective is quadratic by construction, so the layer-to-layer propagation of perturbations can be represented as linear dynamics with quadratic per-stage costs derived from the curvature. Nonlinear activations are incorporated when computing the local geometry (via forward and backward passes), but they do not alter the quadratic character of the model itself. We will expand §3 with the complete derivation, explicitly stating the quadratic-model assumption and clarifying that the modeling error is precisely the standard second-order approximation error. To quantify this error we will add empirical plots comparing the quadratic-model loss to the true loss on small networks in the revised manuscript. revision: yes

  2. Referee: [§4] §4 (structured relaxation): Minimizing the LQR objective over chosen structures (diagonal, Kronecker) reuses the global reference but treats structure choice as a free parameter. No error bound or stability analysis is provided showing that the resulting preconditioners retain enough of the dense geometry-aware benefits to improve dynamics without introducing bias or divergence. A quantitative comparison of the relaxed versus dense LQR cost, or a perturbation analysis around the dense solution, is required to support the claim that the relaxation is principled rather than heuristic.

    Authors: We agree that a more rigorous characterization of the structured relaxation would strengthen the paper. Deriving general theoretical error bounds is difficult in the high-dimensional non-convex setting, but we will add two concrete analyses in the revision: (1) a quantitative comparison of the achieved LQR objective value for the dense solution versus the diagonal and Kronecker-structured solutions on small-scale problems where the dense solution remains computable, and (2) a perturbation analysis around the dense LQR solution demonstrating that the structured preconditioners preserve the dominant geometric directions. These additions will provide empirical and local-analytic support for the claim that the relaxation remains principled. revision: partial

  3. Referee: [§5] §5 (experiments): The reported gains on ResNets and Transformers are promising, yet the experimental design lacks controls that isolate the contribution of the LQR objective from the particular choice of structure or hyper-parameter tuning. Direct comparisons to K-FAC or Shampoo using identical Kronecker structures, together with ablations that disable the LQR minimization step, are needed to establish that the layerwise optimal-control formulation is responsible for the observed improvements.

    Authors: We will strengthen the experimental section by adding direct comparisons against K-FAC and Shampoo that employ identical Kronecker-factored structures. We will also include ablations in which the preconditioner structures are obtained without minimizing the global LQR objective (e.g., via standard per-layer K-FAC updates or fixed non-optimized structures). These controls will isolate the contribution of the layerwise optimal-control formulation from structure choice and hyper-parameter effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity in LQR equivalence or structured relaxation

full rationale

The paper's core derivation establishes a mathematical equivalence between steepest descent under divergence-induced quadratic models and a finite-horizon LQR problem, then uses that LQR objective as a reference to derive scalable structured approximations by explicit minimization over diagonal or Kronecker forms. This is a standard exact-to-approximate technique with independent content in the equivalence proof and the relaxation; it does not reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. No quoted steps exhibit the enumerated circular patterns, and the framework remains falsifiable via experiments on ResNets and Transformers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the stated equivalence between steepest descent and LQR plus the validity of the structured relaxation; both are treated as domain assumptions rather than derived from more basic principles.

free parameters (1)
  • preconditioner structure (diagonal, (E-)Kronecker)
    The paper selects specific structured forms for the learned inverse preconditioners; these choices are not derived but imposed for scalability.
axioms (1)
  • domain assumption: the steepest-descent step under divergence-induced quadratic models equals a finite-horizon LQR
    This equivalence is presented as the exact starting point in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1283 out tokens · 86755 ms · 2026-05-08T16:53:33.850741+00:00 · methodology

discussion (0)

