Optimizer-Induced Mode Connectivity: From AdamW to Muon
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3
The pith
In two-layer ReLU networks at sufficiently large width, solutions reached by a single optimizer form a connected set, a consequence of optimizer-induced implicit regularization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For two-layer ReLU networks, solutions from a single optimizer in the Lion-K family form a connected set at sufficiently large width. Optimizer-induced regions can be disjoint or overlap at large width depending on regularization, while at small width AdamW and Muon reach disconnected zero-loss components separated by a provable loss barrier. In GPT-2 pretraining, same-optimizer paths preserve each model's spectrum and cross-optimizer paths traverse a smooth transition.
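To make the barrier language concrete, here is a minimal PyTorch sketch of how a loss barrier along the linear path between two solutions is typically estimated. The network shape, toy data, and loss are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def two_layer_relu(width=512, d_in=10):
    # Two-layer ReLU network, matching the paper's setting in spirit.
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))

def linear_barrier(model, theta_a, theta_b, X, y, n_points=25):
    """Max loss on the segment between two flat parameter vectors, minus the
    larger endpoint loss. A positive value means a bump on the linear path."""
    losses = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, n_points):
            vector_to_parameters((1 - t) * theta_a + t * theta_b, model.parameters())
            losses.append(nn.functional.mse_loss(model(X), y))
    losses = torch.stack(losses)
    return (losses.max() - torch.max(losses[0], losses[-1])).item()

# Hypothetical usage: theta_a and theta_b would come from two independent
# training runs of the same optimizer; here a perturbation stands in.
X, y = torch.randn(256, 10), torch.randn(256, 1)
net = two_layer_relu()
theta_a = parameters_to_vector(net.parameters()).detach().clone()
theta_b = theta_a + 0.1 * torch.randn_like(theta_a)
print("linear-path barrier:", linear_barrier(net, theta_a, theta_b, X, y))
```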
What carries the argument
Optimizer-induced implicit regularization that restricts solutions to connected regions within each optimizer's reachable set.
Load-bearing premise
That the implicit regularization imposed by each optimizer is strong enough to force all its solutions into one connected component once the network width is large.
What would settle it
Identifying two AdamW solutions on a sufficiently wide two-layer ReLU network whose connecting paths all encounter a positive loss barrier.
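A numerical companion to that criterion, sketched below under illustrative assumptions (toy regression data, width, and step counts chosen arbitrarily): train two AdamW solutions, optimize a quadratic Bézier connector between them in the spirit of standard curve-finding methods from the mode-connectivity literature, and check whether the best curve found still shows a positive barrier. A failed search is evidence, not a proof.

```python
import torch

def forward_flat(theta, X, width):
    # Functional forward pass of a two-layer ReLU net from a flat parameter
    # vector, so gradients can flow back through interpolated parameters.
    d = X.shape[1]
    n1 = width * d
    W1, b1 = theta[:n1].view(width, d), theta[n1:n1 + width]
    W2, b2 = theta[n1 + width:n1 + 2 * width].view(1, width), theta[-1:]
    return torch.relu(X @ W1.T + b1) @ W2.T + b2

def train_adamw(X, y, width, steps=2000, seed=0):
    torch.manual_seed(seed)
    theta = (0.1 * torch.randn(width * X.shape[1] + 2 * width + 1)).requires_grad_(True)
    opt = torch.optim.AdamW([theta], lr=1e-2, weight_decay=1e-4)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(forward_flat(theta, X, width), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()

def best_curve_max_loss(theta_a, theta_b, X, y, width, steps=1000):
    # Quadratic Bezier p(t) = (1-t)^2 a + 2t(1-t) phi + t^2 b with a trainable
    # control point phi, trained on the expected loss over random t.
    phi = ((theta_a + theta_b) / 2).clone().requires_grad_(True)
    opt = torch.optim.Adam([phi], lr=1e-2)
    for _ in range(steps):
        t = torch.rand(())
        p = (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * phi + t ** 2 * theta_b
        loss = torch.nn.functional.mse_loss(forward_flat(p, X, width), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return max(
            torch.nn.functional.mse_loss(
                forward_flat((1 - t) ** 2 * theta_a + 2 * t * (1 - t) * phi
                             + t ** 2 * theta_b, X, width), y).item()
            for t in torch.linspace(0.0, 1.0, 25))

width = 256
X, y = torch.randn(256, 10), torch.randn(256, 1)
theta_a, theta_b = train_adamw(X, y, width, seed=0), train_adamw(X, y, width, seed=1)
print("max loss along best curve found:", best_curve_max_loss(theta_a, theta_b, X, y, width))
```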
Original abstract
Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that optimizer-induced implicit regularization structures the loss landscape such that, for two-layer ReLU networks, solutions obtained from any single optimizer (AdamW, Muon, or members of the Lion-K family) form a connected set at sufficiently large width. It further shows that regions induced by different optimizers may be disjoint or overlap depending on regularization strength, with a concrete small-width example in which AdamW and Muon zero-loss solutions are separated by a provable loss barrier. Empirically, linear paths between GPT-2 models trained with the same optimizer preserve spectral properties, while cross-optimizer paths exhibit smooth transitions.
Significance. If the central claims hold, the work supplies a new axis for mode-connectivity analysis by tying connectivity to the implicit bias of concrete optimizers rather than to the loss alone. The two-layer ReLU results are not implied by prior connectivity theorems, and the GPT-2 observations indicate that the phenomenon is observable in practical training. These contributions could inform both theoretical understanding of optimization dynamics and practical choices among optimizers.
major comments (2)
- [§3] Two-layer ReLU connectivity: the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows from the optimizer update rule without additional unstated assumptions.
- [Small-width example] Interaction between the AdamW and Muon regions: the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full (the barrier quantity at stake is set out below).
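For reference, the barrier quantity behind this comment can be written as follows (standard mode-connectivity notation, assumed here rather than taken from the paper):

```latex
% Loss barrier between an AdamW solution \theta_A and a Muon solution \theta_M:
% the smallest worst-case excess loss over all continuous connecting paths.
\[
  B(\theta_A,\theta_M)
    \;=\; \inf_{\substack{p \,\in\, C([0,1],\,\Theta) \\ p(0)=\theta_A,\ p(1)=\theta_M}}
          \max_{t\in[0,1]} f\bigl(p(t)\bigr)
    \;-\; \max\bigl\{f(\theta_A),\,f(\theta_M)\bigr\}.
\]
% The small-width disconnection claim is that B(\theta_A,\theta_M) > 0 even
% though both endpoints are zero-loss: f(\theta_A) = f(\theta_M) = 0.
```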
minor comments (2)
- [Abstract] The Lion-K family is referenced in the abstract without an immediate definition or citation; a short clarifying sentence in the introduction would improve readability.
- [Empirical GPT-2 section] GPT-2 experiments: the description of how spectra are extracted and compared along interpolation paths should state the precise metric, the number of independent runs, and any statistical controls behind the 'preserve' versus 'smooth transition' statements (a minimal version of such a probe is sketched after this list).
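A minimal version of the probe this comment asks to pin down, assuming the spectra are singular values of individual weight matrices compared along the linear interpolation path; the layer choice, grid, and summary statistics are illustrative assumptions.

```python
import torch

def spectra_along_path(W_a, W_b, n_points=11):
    """Singular-value spectra of (1 - t) * W_a + t * W_b on a uniform grid in t."""
    return [(t.item(), torch.linalg.svdvals((1 - t) * W_a + t * W_b))
            for t in torch.linspace(0.0, 1.0, n_points)]

# Hypothetical usage with two same-shape weight matrices taken from the same
# layer of two checkpoints (e.g. an attention projection in GPT-2):
W_a, W_b = torch.randn(768, 768), torch.randn(768, 768)
for t, sv in spectra_along_path(W_a, W_b):
    stable_rank = sv.pow(2).sum() / sv[0] ** 2
    print(f"t={t:.2f}  top singular value={sv[0]:.3f}  stable rank={stable_rank:.1f}")
```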
Simulated Author's Rebuttal
We are grateful to the referee for the positive evaluation and the recommendation for minor revision. We address the two major comments point by point below, agreeing to provide the requested clarifications and expansions in the revised manuscript.
Point-by-point responses
- Referee: [§3] Two-layer ReLU connectivity: the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows from the optimizer update rule without additional unstated assumptions.
  Authors: We appreciate this feedback. The current manuscript sketches the connectivity via implicit regularization but does not fully detail the width dependence. In the revised version, we will include the precise argument showing that, for widths exceeding a threshold determined by the architecture and the regularization parameters, the optimizer-specific constraints define a connected component. Connectivity then follows from the convexity of the constrained set induced by the update rules (see the sketch after these responses), without additional assumptions. Revision: yes.
- Referee: [Small-width example] Interaction between the AdamW and Muon regions: the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full.
  Authors: The referee is correct that the barrier claim requires a complete argument. While the example is given, the full proof ruling out lower-loss paths is only outlined. In the revision we will supply the explicit construction (including the specific small-width network weights for the AdamW and Muon solutions) and the detailed argument demonstrating that the loss must increase along any connecting path, owing to the mismatch in the effective regularizers. Revision: yes.
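The load-bearing step in the first response, stated here as a generic fact with assumed notation: if the optimizer-constrained solution set is convex, path-connectedness is immediate.

```latex
% Let S = \arg\min_\theta f(\theta) \cap \{\theta : R(\theta) \le 1/\lambda\}
% be the optimizer-constrained solution set. If S is convex, then for any
% \theta_0, \theta_1 \in S the straight line
\[
  p(t) \;=\; (1-t)\,\theta_0 + t\,\theta_1, \qquad t \in [0,1],
\]
% lies entirely in S, so S is path-connected. The substantive burden the
% referee identifies is therefore to prove convexity (or another connectivity
% mechanism) for the set the update rule actually induces at large width.
```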
Circularity Check
No significant circularity detected
Full rationale
The paper's central derivation establishes optimizer-induced mode connectivity for two-layer ReLU networks at large width via analysis of implicit regularization specific to AdamW, Muon, and Lion-K family optimizers. This is explicitly positioned as not implied by prior work, with additional characterization of inter-optimizer region interactions and empirical GPT-2 observations. No load-bearing step reduces by construction to a fitted input, self-citation chain, or renamed known result; the argument relies on new theoretical constraints and verifiable empirical paths rather than re-using quantities internal to the paper. The derivation is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We introduce the notions of intra-optimizer and inter-optimizer mode connectivity... For AdamW and several widely used optimizers in the Lion-K framework... we show that intra-optimizer connectivity holds for two-layer ReLU networks if the width is sufficiently large."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "the regularized solution set associated with R is O_R(λ) := arg min f(θ) ∩ {θ | R(θ) ≤ 1/λ}" (rendered in display form below).
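The quoted definition, set in display form (f is the training loss, R the optimizer-induced regularizer, and λ > 0 its strength; notation follows the quote):

```latex
\[
  \mathcal{O}_R(\lambda)
    \;=\; \arg\min_{\theta} f(\theta)
    \;\cap\; \bigl\{\,\theta \;\big|\; R(\theta) \le 1/\lambda \,\bigr\}.
\]
% Intra-optimizer connectivity asks whether \mathcal{O}_R(\lambda) is connected;
% inter-optimizer questions compare \mathcal{O}_{R_1}(\lambda) and
% \mathcal{O}_{R_2}(\lambda) for the regularizers induced by two optimizers.
```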
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.