Optimizer-Induced Mode Connectivity: From AdamW to Muon
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3
The pith
In two-layer ReLU networks at sufficiently large width, solutions reached by a single optimizer form a connected set, a consequence of optimizer-induced implicit regularization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For two-layer ReLU networks, solutions from a single optimizer in the Lion-K family form a connected set at sufficiently large width. Optimizer-induced regions can be disjoint or overlap at large width depending on regularization, while at small width AdamW and Muon reach disconnected zero-loss components separated by a provable loss barrier. In GPT-2 pretraining, same-optimizer paths preserve each model's spectrum and cross-optimizer paths traverse a smooth transition.
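To make the barrier language concrete, here is a minimal PyTorch sketch of how a loss barrier along the linear path between two solutions is typically estimated. The network shape, toy data, and loss are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def two_layer_relu(width=512, d_in=10):
    # Two-layer ReLU network, matching the paper's setting in spirit.
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))

def linear_barrier(model, theta_a, theta_b, X, y, n_points=25):
    """Max loss on the segment between two flat parameter vectors, minus the
    larger endpoint loss. A positive value means a bump on the linear path."""
    losses = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, n_points):
            vector_to_parameters((1 - t) * theta_a + t * theta_b, model.parameters())
            losses.append(nn.functional.mse_loss(model(X), y))
    losses = torch.stack(losses)
    return (losses.max() - torch.max(losses[0], losses[-1])).item()

# Hypothetical usage: theta_a and theta_b would come from two independent
# training runs of the same optimizer; here a perturbation stands in.
X, y = torch.randn(256, 10), torch.randn(256, 1)
net = two_layer_relu()
theta_a = parameters_to_vector(net.parameters()).detach().clone()
theta_b = theta_a + 0.1 * torch.randn_like(theta_a)
print("linear-path barrier:", linear_barrier(net, theta_a, theta_b, X, y))
```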
What carries the argument
Optimizer-induced implicit regularization that restricts solutions to connected regions within each optimizer's reachable set.
Load-bearing premise
That the implicit regularization imposed by each optimizer is strong enough to force all its solutions into one connected component once the network width is large.
What would settle it
Identifying two AdamW solutions on a sufficiently wide two-layer ReLU network whose connecting paths all encounter a positive loss barrier.
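A numerical companion to that criterion, sketched below under illustrative assumptions (toy regression data, width, and step counts chosen arbitrarily): train two AdamW solutions, optimize a quadratic Bézier connector between them in the spirit of standard curve-finding methods from the mode-connectivity literature, and check whether the best curve found still shows a positive barrier. A failed search is evidence, not a proof.

```python
import torch

def forward_flat(theta, X, width):
    # Functional forward pass of a two-layer ReLU net from a flat parameter
    # vector, so gradients can flow back through interpolated parameters.
    d = X.shape[1]
    n1 = width * d
    W1, b1 = theta[:n1].view(width, d), theta[n1:n1 + width]
    W2, b2 = theta[n1 + width:n1 + 2 * width].view(1, width), theta[-1:]
    return torch.relu(X @ W1.T + b1) @ W2.T + b2

def train_adamw(X, y, width, steps=2000, seed=0):
    torch.manual_seed(seed)
    theta = (0.1 * torch.randn(width * X.shape[1] + 2 * width + 1)).requires_grad_(True)
    opt = torch.optim.AdamW([theta], lr=1e-2, weight_decay=1e-4)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(forward_flat(theta, X, width), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()

def best_curve_max_loss(theta_a, theta_b, X, y, width, steps=1000):
    # Quadratic Bezier p(t) = (1-t)^2 a + 2t(1-t) phi + t^2 b with a trainable
    # control point phi, trained on the expected loss over random t.
    phi = ((theta_a + theta_b) / 2).clone().requires_grad_(True)
    opt = torch.optim.Adam([phi], lr=1e-2)
    for _ in range(steps):
        t = torch.rand(())
        p = (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * phi + t ** 2 * theta_b
        loss = torch.nn.functional.mse_loss(forward_flat(p, X, width), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return max(
            torch.nn.functional.mse_loss(
                forward_flat((1 - t) ** 2 * theta_a + 2 * t * (1 - t) * phi
                             + t ** 2 * theta_b, X, width), y).item()
            for t in torch.linspace(0.0, 1.0, 25))

width = 256
X, y = torch.randn(256, 10), torch.randn(256, 1)
theta_a, theta_b = train_adamw(X, y, width, seed=0), train_adamw(X, y, width, seed=1)
print("max loss along best curve found:", best_curve_max_loss(theta_a, theta_b, X, y, width))
```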
Original abstract
Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that optimizer-induced implicit regularization structures the loss landscape such that, for two-layer ReLU networks, solutions obtained from any single optimizer (AdamW, Muon, or members of the Lion-K family) form a connected set at sufficiently large width. It further shows that regions induced by different optimizers may be disjoint or overlap depending on regularization strength, with a concrete small-width example in which AdamW and Muon zero-loss solutions are separated by a provable loss barrier. Empirically, linear paths between GPT-2 models trained with the same optimizer preserve spectral properties, while cross-optimizer paths exhibit smooth transitions.
Significance. If the central claims hold, the work supplies a new axis for mode-connectivity analysis by tying connectivity to the implicit bias of concrete optimizers rather than to the loss alone. The two-layer ReLU results are not implied by prior connectivity theorems, and the GPT-2 observations indicate that the phenomenon is observable in practical training. These contributions could inform both theoretical understanding of optimization dynamics and practical choices among optimizers.
major comments (2)
- [§3] Two-layer ReLU connectivity: the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows from the optimizer update rule without additional unstated assumptions.
- [Small-width example] Interaction between the AdamW and Muon regions: the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full (the barrier quantity at stake is set out below).
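For reference, the barrier quantity behind this comment can be written as follows (standard mode-connectivity notation, assumed here rather than taken from the paper):

```latex
% Loss barrier between an AdamW solution \theta_A and a Muon solution \theta_M:
% the smallest worst-case excess loss over all continuous connecting paths.
\[
  B(\theta_A,\theta_M)
    \;=\; \inf_{\substack{p \,\in\, C([0,1],\,\Theta) \\ p(0)=\theta_A,\ p(1)=\theta_M}}
          \max_{t\in[0,1]} f\bigl(p(t)\bigr)
    \;-\; \max\bigl\{f(\theta_A),\,f(\theta_M)\bigr\}.
\]
% The small-width disconnection claim is that B(\theta_A,\theta_M) > 0 even
% though both endpoints are zero-loss: f(\theta_A) = f(\theta_M) = 0.
```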
minor comments (2)
- [Abstract] The Lion-K family is referenced in the abstract without an immediate definition or citation; a short clarifying sentence in the introduction would improve readability.
- [Empirical GPT-2 section] GPT-2 experiments: the description of how spectra are extracted and compared along interpolation paths should state the precise metric, the number of independent runs, and any statistical controls behind the 'preserve' versus 'smooth transition' statements (a minimal version of such a probe is sketched after this list).
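A minimal version of the probe this comment asks to pin down, assuming the spectra are singular values of individual weight matrices compared along the linear interpolation path; the layer choice, grid, and summary statistics are illustrative assumptions.

```python
import torch

def spectra_along_path(W_a, W_b, n_points=11):
    """Singular-value spectra of (1 - t) * W_a + t * W_b on a uniform grid in t."""
    return [(t.item(), torch.linalg.svdvals((1 - t) * W_a + t * W_b))
            for t in torch.linspace(0.0, 1.0, n_points)]

# Hypothetical usage with two same-shape weight matrices taken from the same
# layer of two checkpoints (e.g. an attention projection in GPT-2):
W_a, W_b = torch.randn(768, 768), torch.randn(768, 768)
for t, sv in spectra_along_path(W_a, W_b):
    stable_rank = sv.pow(2).sum() / sv[0] ** 2
    print(f"t={t:.2f}  top singular value={sv[0]:.3f}  stable rank={stable_rank:.1f}")
```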
Simulated Author's Rebuttal
We are grateful to the referee for the positive evaluation and the recommendation for minor revision. We address the two major comments point by point below, agreeing to provide the requested clarifications and expansions in the revised manuscript.
Point-by-point responses
- Referee: [§3] Two-layer ReLU connectivity: the argument that single-optimizer solution sets are connected at large width rests on the claim that implicit regularization sufficiently constrains the feasible set; the manuscript must exhibit the precise width-dependent argument and confirm that connectivity follows from the optimizer update rule without additional unstated assumptions.
  Authors: We appreciate this feedback. The current manuscript sketches the connectivity via implicit regularization but does not fully detail the width dependence. In the revised version, we will include the precise argument showing that, for widths exceeding a threshold determined by the architecture and the regularization parameters, the optimizer-specific constraints define a connected component. Connectivity then follows from the convexity of the constrained set induced by the update rules (see the sketch after these responses), without additional assumptions. Revision: yes.
- Referee: [Small-width example] Interaction between the AdamW and Muon regions: the claim of a provable loss barrier separating the two zero-loss components is load-bearing for the disconnection result; the explicit construction of the barrier and the argument ruling out any lower-loss path between the components must be supplied in full.
  Authors: The referee is correct that the barrier claim requires a complete argument. While the example is given, the full proof ruling out lower-loss paths is only outlined. In the revision we will supply the explicit construction (including the specific small-width network weights for the AdamW and Muon solutions) and the detailed argument demonstrating that the loss must increase along any connecting path, owing to the mismatch in the effective regularizers. Revision: yes.
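The load-bearing step in the first response, stated here as a generic fact with assumed notation: if the optimizer-constrained solution set is convex, path-connectedness is immediate.

```latex
% Let S = \arg\min_\theta f(\theta) \cap \{\theta : R(\theta) \le 1/\lambda\}
% be the optimizer-constrained solution set. If S is convex, then for any
% \theta_0, \theta_1 \in S the straight line
\[
  p(t) \;=\; (1-t)\,\theta_0 + t\,\theta_1, \qquad t \in [0,1],
\]
% lies entirely in S, so S is path-connected. The substantive burden the
% referee identifies is therefore to prove convexity (or another connectivity
% mechanism) for the set the update rule actually induces at large width.
```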
Circularity Check
No significant circularity detected
Full rationale
The paper's central derivation establishes optimizer-induced mode connectivity for two-layer ReLU networks at large width via analysis of implicit regularization specific to AdamW, Muon, and Lion-K family optimizers. This is explicitly positioned as not implied by prior work, with additional characterization of inter-optimizer region interactions and empirical GPT-2 observations. No load-bearing step reduces by construction to a fitted input, self-citation chain, or renamed known result; the argument relies on new theoretical constraints and verifiable empirical paths rather than re-using quantities internal to the paper. The derivation is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We introduce the notions of intra-optimizer and inter-optimizer mode connectivity... For AdamW and several widely used optimizers in the Lion-K framework... we show that intra-optimizer connectivity holds for two-layer ReLU networks if the width is sufficiently large."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "the regularized solution set associated with R is O_R(λ) := arg min f(θ) ∩ {θ | R(θ) ≤ 1/λ}" (rendered in display form below).
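The quoted definition, set in display form (f is the training loss, R the optimizer-induced regularizer, and λ > 0 its strength; notation follows the quote):

```latex
\[
  \mathcal{O}_R(\lambda)
    \;=\; \arg\min_{\theta} f(\theta)
    \;\cap\; \bigl\{\,\theta \;\big|\; R(\theta) \le 1/\lambda \,\bigr\}.
\]
% Intra-optimizer connectivity asks whether \mathcal{O}_R(\lambda) is connected;
% inter-optimizer questions compare \mathcal{O}_{R_1}(\lambda) and
% \mathcal{O}_{R_2}(\lambda) for the regularizers induced by two optimizers.
```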
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.