
arxiv: 2605.16134 · v1 · pith:SZBKRAHXnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Navigating Potholes with Geometry-Aware Sharpness Minimization

Pith reviewed 2026-05-20 19:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sharpness-aware minimization · loss landscape geometry · preconditioner · potholes · flat minima · two-timescale optimization · neural network training

The pith

A slow geometry preconditioner combined with sharpness-aware minimization amplifies escape from local loss potholes while keeping wide flat basins stable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLQR+SAM, which extends sharpness-aware minimization with a learned preconditioner that captures smoothed loss landscape geometry. The preconditioner comes from the LLQR framework and is maintained as a slow exponential moving average, giving a low-resolution view of average curvature. SAM perturbations then act on top of this view at a faster timescale, amplifying the escape signal in directions that appear flat on average but are locally sharp. If the two-timescale separation works as described, optimizers gain a way to navigate loss surfaces with hidden traps without destabilizing good regions. The reported empirical gains on vision and sequence tasks support treating slow geometry and fast sharpness correction as complementary rather than redundant.

Core claim

The central claim is that the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp, called potholes, while wide flat basins remain stable. This occurs because the preconditioner is updated sparsely as a slow exponential moving average from LLQR, capturing smoothed geometry, and the SAM perturbation probes curvature at a faster timescale on top of that geometry.
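The two-timescale structure described above can be sketched as a short NumPy loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: the preconditioner is diagonal, `curv_fn` is a hypothetical stand-in for whatever curvature estimate LLQR produces (this page does not specify it), and the perturbation is taken as the steepest-ascent direction in the norm induced by the preconditioner. The function names and all hyperparameter values are illustrative.

```python
import numpy as np

def sam_perturbation(g, p_inv, rho):
    """Steepest-ascent perturbation of size rho in the norm induced by P = 1 / p_inv."""
    d = p_inv * g                              # preconditioned gradient P^{-1} g
    scale = np.sqrt(np.sum(g * d)) + 1e-12     # sqrt(g^T P^{-1} g), guarded against zero
    return rho * d / scale

def train(w0, grad_fn, curv_fn, steps=200, lr=0.02, rho=0.05,
          alpha=0.05, every=10, damping=1e-3):
    """Slow, sparse EMA refresh of a diagonal P; SAM-style step at every iteration."""
    w = np.asarray(w0, dtype=float).copy()
    p = np.ones_like(w)                        # start from the identity geometry
    for t in range(steps):
        if t % every == 0:                     # slow timescale: sparse EMA refresh
            p = (1.0 - alpha) * p + alpha * (curv_fn(w) + damping)
        p_inv = 1.0 / p
        eps = sam_perturbation(grad_fn(w), p_inv, rho)   # fast timescale: local probe
        w = w - lr * p_inv * grad_fn(w + eps)  # preconditioned step on the perturbed gradient
    return w, p

# Hypothetical ill-conditioned quadratic, used only to exercise the loop.
h = np.array([1.0, 25.0])
loss = lambda w: 0.5 * np.sum(h * w * w)
grad = lambda w: h * w
curv = lambda w: h                             # stand-in curvature estimate (assumption)
w_final, p_final = train([1.0, 1.0], grad, curv)
```

The EMA is refreshed only every `every` iterations, so `p` changes slowly relative to the SAM probe, which is recomputed at each step. Whether this diagonal stand-in reproduces the amplification effect claimed for LLQR is not something the sketch can show.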

What carries the argument

The argument rests on a two-timescale structure: a slow LLQR-derived preconditioner, maintained as an exponential moving average, supplies the average loss geometry on which a faster SAM perturbation acts.

If this is right

  • The method produces consistent gains over both SAM and LLQR alone on standard vision and sequence modeling benchmarks.
  • The preconditioner selectively boosts escape from directions that look flat on average but are sharp at finer scale.
  • Wide flat basins stay stable under the combined updates rather than being destabilized.
  • Slow geometry learning and fast sharpness probing function as complementary mechanisms rather than alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same slow-fast separation could be tested with other adaptive or second-order optimizers to see if geometry awareness improves their sharpness handling.
  • If the low-resolution geometry picture holds, one could experiment with even slower update rates or different smoothing windows on very large models.
  • The pothole concept suggests checking whether similar local sharpness within flat regions appears in other optimization problems outside neural networks.

Load-bearing premise

The slow exponential moving average of the LLQR-derived preconditioner supplies a sufficiently accurate low-resolution picture of loss geometry that reliably distinguishes locally sharp potholes from stable flat basins.

What would settle it

A test on a synthetic loss surface containing explicit potholes and wide flat regions where LLQR+SAM shows no improvement in escaping the potholes relative to plain SAM would disprove the amplification effect.
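A starting point for such a test might look like the following one-dimensional harness. It is a sketch under assumptions: the surface (a wide quadratic basin plus one narrow Gaussian well) and every constant are hypothetical, and only plain gradient descent and Euclidean SAM are implemented. LLQR+SAM is not included because this page does not specify the LLQR preconditioner; a third step function with the same signature could be passed to `run` for the comparison the test calls for.

```python
import math

# Hypothetical surface constants: basin curvature, pothole centre, depth, width.
A, C, D, S = 0.02, 3.0, 0.5, 0.1

def grad(x):
    """Gradient of L(x) = 0.5*A*x^2 - D*exp(-(x - C)^2 / (2*S^2))."""
    u = x - C
    return A * x + D * u / S**2 * math.exp(-u * u / (2 * S**2))

def gd_step(x, lr, rho):
    """Plain gradient descent; rho is unused and kept for a uniform signature."""
    return x - lr * grad(x)

def sam_step(x, lr, rho):
    """Euclidean SAM in 1-D: perturb by rho along the gradient sign, then descend."""
    eps = rho * math.copysign(1.0, grad(x))   # at grad == 0 this picks +rho
    return x - lr * grad(x + eps)

def run(step_fn, x0=3.05, steps=2000, lr=0.01, rho=0.5):
    """Iterate a step function from a start inside the pothole, off its minimum."""
    x = x0
    for _ in range(steps):
        x = step_fn(x, lr, rho)
    return x

x_gd = run(gd_step)     # baseline without sharpness probing
x_sam = run(sam_step)   # baseline with Euclidean SAM
```

With these particular constants the perturbation radius is several times the pothole width, so the SAM probe samples the gradient outside the well; gradient descent is expected to stay inside it. How far and how fast each optimizer moves once outside is the quantity a real LLQR+SAM comparison would need to measure.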

Figures

Figures reproduced from arXiv: 2605.16134 by Aristide Baratin, Damien Scieur, Ioannis Mitliagkas, Mehrab Hamidi, Razvan Pascanu, Simon Dufort-Labbé.

Figure 1: Interaction between FSAM and LLQR on ResNet-50/ImageNet. Left: Top-1 error for SGDM and FSAM, a SAM variant, with and without LLQR. Although both SAM-style methods and LLQR can be interpreted as curvature-correcting mechanisms, their combination yields gains over either component alone, suggesting complementary effects. Right: Test loss versus elapsed training time. Despite the usual concern that second-or…

Figure 2: Toy sharp-well escape mechanism. The surface has a flat basin at the origin and a sharp annular basin near radius 5. All four optimizers use the same learning rate, and the SAM variants use the same radius, with starts chosen in the sharp basin, but not at minima. The non-SAM variants remain trapped, while the SAM variants leave the sharp well; the LLQR+SAM trajectory reaches the flat region with faster l…

Figure 3: Gradient-noise escape from the sharp minimum. All variants start at the bottom of the sharp well and receive a shared deterministic Gaussian perturbation schedule in the update gradient. The non-SAM variants remain near the sharp well, while SAM variants are ejected. At variance 10⁻⁹, both SAM variants reach the flat basin, but LLQR+SAM has substantially shorter path length than Euclidean SAM, consistent…

Figure 4: IWSLT14 German-to-English convergence. Validation BLEU and token error curves for the fairseq Transformer benchmark. Pairing LLQR with SAM accelerates optimization while offering the modest best-performance gains reported in …

Figure 5: LLQR cadence sweep for ViT-B/16 and ViT-L/16. Each panel shows compile-adjusted …

Figure 6: Across ViT scales, preconditioner-update time grows nearly linearly for both LLQR and LLQR+SAM, far from the quadratic scaling typically associated with second-order methods.

We introduced LLQR+SAM, which pairs a slowly-updated LLQR preconditioner with a SAM perturbation evaluated and transported in the induced geometry. On a quadratic two-scale model the dynamics is closed-form: SAM prevents the iterate …
Original abstract

Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.
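For reference, the objects named in the abstract can be written out as follows. The first display is the standard SAM objective with its usual first-order perturbation. The second is one plausible way to express a perturbation taken in the geometry of a positive-definite preconditioner P; the symbols L, w, ρ, P and the choice of the P-norm ball are assumptions made here for illustration, since this page does not give the paper's exact formulation.

```latex
% Standard SAM: minimise the worst-case loss over a Euclidean ball of radius \rho
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\hat{\epsilon}(w) = \rho \, \frac{\nabla L(w)}{\|\nabla L(w)\|_2}.

% Assumed preconditioned analogue: ball measured by \|\epsilon\|_P^2 = \epsilon^\top P \, \epsilon
\hat{\epsilon}_P(w) = \rho \,
  \frac{P^{-1} \nabla L(w)}{\sqrt{\nabla L(w)^\top P^{-1} \nabla L(w)}},
\qquad
\hat{\epsilon}_P(w)^\top P \, \hat{\epsilon}_P(w) = \rho^2 .
```

With P equal to the identity, the second expression reduces to the first.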

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LLQR+SAM, which augments sharpness-aware minimization (SAM) with a learned preconditioner derived from the LLQR framework. The preconditioner is updated sparsely via a slow exponential moving average to capture a smoothed view of loss geometry at a slower timescale, while SAM operates at a faster timescale. The central theoretical claim is that this two-timescale structure amplifies the SAM escape signal specifically in directions that are flat under the average geometry but locally sharp (potholes), whereas wide flat basins remain stable. Empirical results are reported to show consistent gains over SAM and LLQR alone on vision and sequence modeling benchmarks.

Significance. If the two-timescale separation can be rigorously justified, the work provides a concrete mechanism for making SAM geometry-aware without uniform treatment of directions, which could improve the discovery of flat minima in deep networks. The empirical consistency across benchmarks, if reproducible with full experimental details, would indicate that slow learned geometry and fast sharpness correction are complementary rather than redundant.

major comments (2)
  1. Section 3 (theoretical analysis): the derivation treats the LLQR-derived preconditioner as fixed over the fast SAM timescale and claims amplification of escape signals in pothole directions, but provides no perturbation analysis, error bound, or fixed-point argument quantifying how much the slow EMA incorporates information from the fast SAM perturbations. This assumption is load-bearing for the central claim that the preconditioner reliably distinguishes locally sharp potholes from stable wide basins.
  2. Abstract and Section 3: the theoretical amplification result is stated without explicit assumptions, derivation steps, or the precise definition of the two-timescale separation; it is therefore unclear whether the result introduces independent grounding or reduces to quantities already present in the LLQR framework.
minor comments (2)
  1. Provide the exact EMA decay rate, update frequency of the preconditioner, and full experimental protocol (including hyperparameter ranges and number of runs) so that the reported gains can be reproduced.
  2. Define all LLQR-specific notation (e.g., the form of the preconditioner) at first use in the theoretical section to improve readability for readers unfamiliar with the base framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments on the theoretical analysis are well-taken and point to opportunities for strengthening the presentation of the two-timescale argument. We address each major comment below and commit to revisions that improve rigor and clarity without altering the core claims.

Point-by-point responses
  1. Referee: Section 3 (theoretical analysis): the derivation treats the LLQR-derived preconditioner as fixed over the fast SAM timescale and claims amplification of escape signals in pothole directions, but provides no perturbation analysis, error bound, or fixed-point argument quantifying how much the slow EMA incorporates information from the fast SAM perturbations. This assumption is load-bearing for the central claim that the preconditioner reliably distinguishes locally sharp potholes from stable wide basins.

    Authors: We agree that the current derivation would benefit from an explicit perturbation analysis to justify treating the preconditioner as approximately fixed. In the revised manuscript we will add a first-order perturbation argument showing that the contribution of fast SAM steps to the slow EMA update is bounded by the timescale separation ratio (specifically O(α / η), where α is the EMA decay and η the perturbation step size). This bound confirms that the preconditioner continues to reflect the averaged geometry at leading order, thereby preserving the differential amplification between pothole directions and wide basins. The added analysis will be placed in Section 3 with supporting calculations in an appendix. revision: yes

  2. Referee: Abstract and Section 3: the theoretical amplification result is stated without explicit assumptions, derivation steps, or the precise definition of the two-timescale separation; it is therefore unclear whether the result introduces independent grounding or reduces to quantities already present in the LLQR framework.

    Authors: We acknowledge that the assumptions and derivation steps can be stated more explicitly. The two-timescale separation is defined as the preconditioner update frequency being at least an order of magnitude slower than the SAM perturbation frequency, with the EMA decay factor satisfying α ≪ 1. Under this separation the amplification result follows from a direct expansion of the preconditioned sharpness term, which isolates an extra positive contribution precisely in directions that are flat on average but locally sharp. This interaction term is not present in the original LLQR analysis and therefore supplies independent grounding. In the revision we will list all assumptions at the beginning of Section 3, expand the derivation with intermediate steps, and move supporting lemmas to an appendix. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's central theoretical claim—that the slow EMA preconditioner from the LLQR framework amplifies the SAM escape signal specifically in directions flat on average but locally sharp—is presented as a new derivation in Section 3 that exploits the two-timescale separation. This does not reduce by construction to a self-definition, a fitted parameter, or an unverified self-citation chain; the LLQR framework is invoked as an external base method whose properties are used to ground the new amplification result, while the paper supplies its own analysis of the interaction with SAM perturbations. Empirical gains on vision and sequence benchmarks supply independent external support. No load-bearing step equates the claimed prediction to its inputs by definition or forces the outcome through renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the LLQR framework as a black-box source of geometry and on the modeling choice that a slow EMA supplies a useful average geometry distinct from instantaneous curvature.

free parameters (1)
  • EMA decay rate for preconditioner
    Controls how slowly the low-resolution geometry picture is updated; its value is chosen to separate timescales.
axioms (1)
  • domain assumption: The loss landscape admits a meaningful separation between average geometry (captured by slow EMA) and local curvature (probed by fast SAM).
    Invoked to justify why the preconditioner amplifies escape only in pothole directions.
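To make the role of the EMA decay rate concrete, the arithmetic below converts a decay value into an effective memory length. It assumes the common update convention P ← (1 − α)·P + α·P_new, which this page does not confirm, and the values of `alpha` and `every` are hypothetical rather than taken from the paper.

```python
import math

def ema_half_life(alpha):
    """Number of EMA refreshes after which an old estimate's weight falls to 1/2."""
    return math.log(0.5) / math.log(1.0 - alpha)

def residual_weight(alpha, n_updates):
    """Weight still carried by the estimate held n_updates refreshes ago."""
    return (1.0 - alpha) ** n_updates

alpha, every = 0.01, 10                          # hypothetical decay and refresh interval
half_life_updates = ema_half_life(alpha)         # memory measured in preconditioner refreshes
half_life_steps = every * half_life_updates      # same memory measured in optimizer steps
```

Under these assumed values the preconditioner's memory spans roughly 69 refreshes, or several hundred optimizer steps when refreshed every tenth step, which is the sense in which a small decay rate separates the slow geometry estimate from the per-step SAM probe.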

pith-pipeline@v0.9.0 · 5747 in / 1287 out tokens · 51589 ms · 2026-05-20T19:34:43.879465+00:00 · methodology


