Navigating Potholes with Geometry-Aware Sharpness Minimization
Pith reviewed 2026-05-20 19:34 UTC · model grok-4.3
The pith
A slow geometry preconditioner combined with sharpness-aware minimization amplifies escape from local loss potholes while keeping wide flat basins stable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp, called potholes, while wide flat basins remain stable. This occurs because the preconditioner is updated sparsely as a slow exponential moving average from LLQR, capturing smoothed geometry, and the SAM perturbation probes curvature at a faster timescale on top of that geometry.
What carries the argument
The two-timescale structure of a slow LLQR-derived preconditioner maintained as an exponential moving average that supplies average loss geometry for a faster SAM perturbation to act on.
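The two-timescale structure can be made concrete with a toy loop. The following is a minimal sketch, not the paper's algorithm: `geometry_fn` is a hypothetical stand-in for the LLQR preconditioner estimate (here a diagonal one), the way the preconditioner enters the perturbation and the descent step is an assumption, and all names and default values are illustrative.

```python
import numpy as np

def two_timescale_sam(grad_fn, geometry_fn, w0, steps=200, lr=0.1,
                      perturb_radius=0.05, alpha=0.01, update_every=10):
    """Toy loop: SAM-style step every iteration, preconditioner EMA every `update_every` iterations.

    `geometry_fn` is a hypothetical stand-in for the LLQR preconditioner estimate;
    it returns a diagonal (elementwise) preconditioner for the current weights.
    """
    w = np.asarray(w0, dtype=float).copy()
    precond = np.ones_like(w)  # slow state, starts as the identity (diagonal of ones)
    for t in range(steps):
        # Slow timescale: sparse EMA update of the geometry estimate.
        if t % update_every == 0:
            precond = (1.0 - alpha) * precond + alpha * geometry_fn(w)
        # Fast timescale: perturb along the preconditioned gradient (assumed form).
        direction = precond * grad_fn(w)
        norm = np.linalg.norm(direction)
        perturbation = perturb_radius * direction / norm if norm > 0 else np.zeros_like(w)
        # Descend using the gradient evaluated at the perturbed point.
        w = w - lr * precond * grad_fn(w + perturbation)
    return w, precond

# Toy quadratic with curvatures 1 and 10; its ideal diagonal preconditioner is 1 / curvature.
curv = np.array([1.0, 10.0])
w_final, precond_final = two_timescale_sam(
    grad_fn=lambda w: curv * w,
    geometry_fn=lambda w: 1.0 / curv,
    w0=[1.0, 1.0],
)
final_loss = 0.5 * np.sum(curv * w_final ** 2)
```

With these illustrative settings the preconditioner receives only 20 EMA updates in 200 steps, so it drifts slowly toward the ideal value while the perturbation is recomputed at every step.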
If this is right
- The method produces consistent gains over both SAM and LLQR alone on standard vision and sequence modeling benchmarks.
- The preconditioner selectively boosts escape from directions that look flat on average but are sharp at finer scale.
- Wide flat basins stay stable under the combined updates rather than being destabilized.
- Slow geometry learning and fast sharpness probing function as complementary mechanisms rather than alternatives.
Where Pith is reading between the lines
- The same slow-fast separation could be tested with other adaptive or second-order optimizers to see if geometry awareness improves their sharpness handling.
- If the low-resolution geometry picture holds, one could experiment with even slower update rates or different smoothing windows on very large models.
- The pothole concept suggests checking whether similar local sharpness within flat regions appears in other optimization problems outside neural networks.
Load-bearing premise
The slow exponential moving average of the LLQR-derived preconditioner supplies a sufficiently accurate low-resolution picture of loss geometry that reliably distinguishes locally sharp potholes from stable flat basins.
What would settle it
A test on a synthetic loss surface containing explicit potholes and wide flat regions where LLQR+SAM shows no improvement in escaping the potholes relative to plain SAM would disprove the amplification effect.
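One possible construction of such a surface is sketched below, assuming a one-dimensional wide quadratic basin with a single narrow Gaussian dip standing in for a pothole. The function names, shape parameters, and window size are hypothetical choices, not taken from the paper. The sketch only builds the surface and compares curvature at two scales; running SAM and LLQR+SAM on it is left out.

```python
import numpy as np

def pothole_loss(x, basin_curv=0.01, depth=0.5, width=0.05, center=1.0):
    """Wide flat basin 0.5 * basin_curv * x^2 with one narrow Gaussian dip at `center`."""
    bowl = 0.5 * basin_curv * x ** 2
    dip = -depth * np.exp(-0.5 * ((x - center) / width) ** 2)
    return bowl + dip

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

center = 1.0
# Fine scale: curvature at the bottom of the pothole.
local_curv = second_derivative(pothole_loss, center)
# Coarse scale: curvature averaged over a window much wider than the pothole.
window = np.linspace(center - 1.0, center + 1.0, 2001)
average_curv = float(np.mean(second_derivative(pothole_loss, window)))
```

On this surface the averaged curvature stays close to the basin curvature while the local curvature at the dip is several orders of magnitude larger, which is the "flat on average, locally sharp" situation the claim refers to.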
Original abstract
Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.
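For orientation, the updates the abstract describes in words can be written out as follows. The first two lines are the standard SAM step and the third is a generic sparse EMA. The last line, which applies the preconditioner to the descent step, is an assumption made here for illustration, since the abstract does not give the exact LLQR+SAM update (the perturbation itself may also be preconditioned). The symbols α (EMA decay) and η (perturbation size) follow the usage in the rebuttal; γ (learning rate), K (update interval), and \hat{P}_t (fresh LLQR estimate) are notation introduced here.

```latex
\begin{align*}
\epsilon_t &= \eta \, \frac{\nabla L(w_t)}{\lVert \nabla L(w_t) \rVert_2}
  && \text{(standard SAM perturbation)} \\
w_{t+1} &= w_t - \gamma \, \nabla L(w_t + \epsilon_t)
  && \text{(standard SAM step)} \\
P_t &= \begin{cases}
  (1-\alpha)\, P_{t-1} + \alpha\, \hat{P}_t & \text{if } t \bmod K = 0 \\
  P_{t-1} & \text{otherwise}
\end{cases}
  && \text{(sparse EMA of the preconditioner)} \\
w_{t+1} &= w_t - \gamma \, P_t \, \nabla L(w_t + \epsilon_t)
  && \text{(assumed preconditioned step)}
\end{align*}
```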
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LLQR+SAM, which augments sharpness-aware minimization (SAM) with a learned preconditioner derived from the LLQR framework. The preconditioner is updated sparsely via a slow exponential moving average to capture a smoothed view of loss geometry at a slower timescale, while SAM operates at a faster timescale. The central theoretical claim is that this two-timescale structure amplifies the SAM escape signal specifically in directions that are flat under the average geometry but locally sharp (potholes), whereas wide flat basins remain stable. Empirical results are reported to show consistent gains over SAM and LLQR alone on vision and sequence modeling benchmarks.
Significance. If the two-timescale separation can be rigorously justified, the work provides a concrete mechanism for making SAM geometry-aware without uniform treatment of directions, which could improve the discovery of flat minima in deep networks. The empirical consistency across benchmarks, if reproducible with full experimental details, would indicate that slow learned geometry and fast sharpness correction are complementary rather than redundant.
major comments (2)
- [Section 3] Section 3 (theoretical analysis): the derivation treats the LLQR-derived preconditioner as fixed over the fast SAM timescale and claims amplification of escape signals in pothole directions, but provides no perturbation analysis, error bound, or fixed-point argument quantifying how much the slow EMA incorporates information from the fast SAM perturbations. This assumption is load-bearing for the central claim that the preconditioner reliably distinguishes locally sharp potholes from stable wide basins.
- [Abstract and Section 3] Abstract and Section 3: the theoretical amplification result is stated without explicit assumptions, derivation steps, or the precise definition of the two-timescale separation; it is therefore unclear whether the result introduces independent grounding or reduces to quantities already present in the LLQR framework.
minor comments (2)
- Provide the exact EMA decay rate, update frequency of the preconditioner, and full experimental protocol (including hyperparameter ranges and number of runs) so that the reported gains can be reproduced.
- Define all LLQR-specific notation (e.g., the form of the preconditioner) at first use in the theoretical section to improve readability for readers unfamiliar with the base framework.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments on the theoretical analysis are well-taken and point to opportunities for strengthening the presentation of the two-timescale argument. We address each major comment below and commit to revisions that improve rigor and clarity without altering the core claims.
- Referee: [Section 3] Section 3 (theoretical analysis): the derivation treats the LLQR-derived preconditioner as fixed over the fast SAM timescale and claims amplification of escape signals in pothole directions, but provides no perturbation analysis, error bound, or fixed-point argument quantifying how much the slow EMA incorporates information from the fast SAM perturbations. This assumption is load-bearing for the central claim that the preconditioner reliably distinguishes locally sharp potholes from stable wide basins.
  Authors: We agree that the current derivation would benefit from an explicit perturbation analysis to justify treating the preconditioner as approximately fixed. In the revised manuscript we will add a first-order perturbation argument showing that the contribution of fast SAM steps to the slow EMA update is bounded by the timescale separation ratio (specifically O(α / η), where α is the EMA decay and η the perturbation step size). This bound confirms that the preconditioner continues to reflect the averaged geometry at leading order, thereby preserving the differential amplification between pothole directions and wide basins. The added analysis will be placed in Section 3 with supporting calculations in an appendix.
  Revision: yes
- Referee: [Abstract and Section 3] Abstract and Section 3: the theoretical amplification result is stated without explicit assumptions, derivation steps, or the precise definition of the two-timescale separation; it is therefore unclear whether the result introduces independent grounding or reduces to quantities already present in the LLQR framework.
  Authors: We acknowledge that the assumptions and derivation steps can be stated more explicitly. The two-timescale separation is defined as the preconditioner update frequency being at least an order of magnitude slower than the SAM perturbation frequency, with the EMA decay factor satisfying α ≪ 1. Under this separation the amplification result follows from a direct expansion of the preconditioned sharpness term, which isolates an extra positive contribution precisely in directions that are flat on average but locally sharp. This interaction term is not present in the original LLQR analysis and therefore supplies independent grounding. In the revision we will list all assumptions at the beginning of Section 3, expand the derivation with intermediate steps, and move supporting lemmas to an appendix.
  Revision: yes
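The separation condition stated in the rebuttal (updates at least an order of magnitude less frequent, α ≪ 1) can be turned into rough numbers. The sketch below is an illustration under stated assumptions: SAM is assumed to perturb at every optimizer step, so "an order of magnitude slower" is read as an update interval of at least 10 steps, and the helper names are hypothetical. It does not reproduce the O(α / η) bound promised by the authors; it only shows the elementary fact that a single EMA update moves the state by at most α times the gap to the new estimate.

```python
import math

def ema_update(state, estimate, alpha):
    """One exponential-moving-average update with decay (mixing weight) alpha."""
    return (1.0 - alpha) * state + alpha * estimate

def ema_half_life(alpha):
    """Number of EMA updates after which an old estimate's weight falls to one half."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(1.0 - alpha)

def separation_summary(alpha, update_every):
    """Express the slow timescale in optimizer steps (assumes SAM perturbs every step)."""
    half_life_updates = ema_half_life(alpha)
    return {
        "half_life_updates": half_life_updates,
        "half_life_steps": half_life_updates * update_every,
        "order_of_magnitude_slower": update_every >= 10,
    }

summary = separation_summary(alpha=0.01, update_every=10)
# A single update moves the state by alpha * (estimate - state).
one_step_change = abs(ema_update(1.0, 5.0, alpha=0.01) - 1.0)
```

For these illustrative values the preconditioner's memory spans several hundred optimizer steps, which gives a sense of how slowly a single sharp estimate can shift the averaged geometry.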
Circularity Check
No significant circularity in the derivation chain
The paper's central theoretical claim—that the slow EMA preconditioner from the LLQR framework amplifies the SAM escape signal specifically in directions flat on average but locally sharp—is presented as a new derivation in Section 3 that exploits the two-timescale separation. This does not reduce by construction to a self-definition, a fitted parameter, or an unverified self-citation chain; the LLQR framework is invoked as an external base method whose properties are used to ground the new amplification result, while the paper supplies its own analysis of the interaction with SAM perturbations. Empirical gains on vision and sequence benchmarks supply independent external support. No load-bearing step equates the claimed prediction to its inputs by definition or forces the outcome through renaming or ansatz smuggling.
Axiom & Free-Parameter Ledger
free parameters (1)
- EMA decay rate for preconditioner
axioms (1)
- Domain assumption: the loss landscape admits a meaningful separation between average geometry (captured by slow EMA) and local curvature (probed by fast SAM).
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable.
- IndisputableMonolith/Foundation/ArrowOfTime.lean, forward_accumulates (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: two-timescale structure... slow exponential moving average... fast timescale
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.