Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks
Pith reviewed 2026-05-10 12:19 UTC · model grok-4.3
The pith
A secant-based correction augments first-order optimizers to speed convergence and raise accuracy when training PINNs on anisotropic loss landscapes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that consecutive gradient differences supply a sufficient proxy for local geometric change, allowing a step-normalized secant curvature indicator to control an adaptive predictive correction that can be added to any first-order optimizer. This correction mitigates anisotropy and rapid variation in the PINN loss, producing consistent gains in convergence speed, training stability, and accuracy on diverse PDE benchmarks.
What carries the argument
The step-normalized secant curvature indicator, computed from successive gradient differences, adaptively scales the strength of a predictive correction term added to the first-order update rule.
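Concretely, the appendix listings define the indicator as the Frobenius inner product of the parameter step with the gradient difference, normalized by the squared step norm, gating the correction strength through a tanh. A minimal sketch of those two formulas (function names and the default α_base are ours, not the paper's):

```python
import numpy as np

def secant_curvature(delta_theta, delta_grad, eps=1e-12):
    """Step-normalized secant indicator: kappa = <dθ, dG>_F / ||dθ||_F^2."""
    return float(np.sum(delta_theta * delta_grad) / (np.sum(delta_theta ** 2) + eps))

def correction_strength(kappa, alpha_base=0.1):
    """Adaptive strength: alpha = alpha_base * (1 + tanh(-kappa)).

    Strong positive curvature along the step drives alpha toward 0;
    flat or negative curvature drives it toward 2 * alpha_base.
    """
    return alpha_base * (1.0 + np.tanh(-kappa))
```

On a locally quadratic slice of the loss, kappa is the Rayleigh quotient of the Hessian along the step, which is what makes two consecutive gradients a cheap curvature probe.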
If this is right
- Convergence speed increases on high-dimensional and nonlinear PDE problems without added matrix costs.
- Training stability improves for reaction-diffusion systems such as Gray-Scott and Belousov-Zhabotinsky.
- Solution accuracy rises on the 2D Kuramoto-Sivashinsky equation relative to standard first-order methods.
- The correction remains compatible with popular optimizers while staying computationally light.
- No explicit second-order information is required, preserving the efficiency of first-order training loops.
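The plug-and-play claim in the bullets above can be made concrete by wrapping the correction around plain gradient descent, following the structure recoverable from the paper's appendix algorithms (the EMA coefficient, learning rate, and toy objective here are illustrative choices, not the paper's; the actual experiments wrap optimizers such as SOAP and Muon):

```python
import numpy as np

def curvature_aware_gd(grad_fn, theta0, lr=0.05, alpha_base=0.1, beta=0.9, steps=200):
    """Plain gradient descent augmented with the secant-gated correction.

    A_k is an EMA of consecutive gradient differences; kappa_k is the
    step-normalized secant indicator; the boosted gradient G + alpha*A
    replaces G in the base update. Hyperparameters are illustrative.
    """
    theta = theta0.astype(float).copy()
    prev_theta = theta.copy()
    prev_grad = np.zeros_like(theta)
    A = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        dg = g - prev_grad
        A = beta * A + (1.0 - beta) * dg                 # EMA of gradient differences
        dtheta = theta - prev_theta
        denom = float(np.sum(dtheta ** 2))
        kappa = float(np.sum(dtheta * dg)) / denom if denom > 0 else 0.0
        alpha = alpha_base * (1.0 + np.tanh(-kappa))     # adaptive correction strength
        g_boost = g + alpha * A                          # predictive correction
        prev_theta, prev_grad = theta.copy(), g
        theta = theta - lr * g_boost
    return theta

# Anisotropic quadratic: gradient of 0.5 * (x^2 + 10 * y^2)
theta = curvature_aware_gd(lambda t: np.array([t[0], 10.0 * t[1]]), np.array([1.0, 1.0]))
```

Because the correction only reuses quantities the base loop already computes (gradients and parameters from the previous step), the per-step overhead stays at first-order cost.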
Where Pith is reading between the lines
- The same gradient-difference proxy could be tested on other scientific machine-learning tasks that face curved loss surfaces, such as operator learning or inverse problems.
- Combining the secant indicator with existing momentum or adaptive-rate schemes might produce further gains that the current experiments do not explore.
- The approach may reduce sensitivity to learning-rate choices in practical PINN deployments, though this would need separate verification.
- Scalability to very large networks or three-dimensional time-dependent PDEs remains untested and could expose limits of the secant proxy.
Load-bearing premise
Consecutive gradient differences supply a reliable enough proxy for local geometric change in the anisotropic PINN loss landscape to let the secant indicator set correction strength effectively.
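One reason the premise is plausible: on a locally quadratic slice of the loss, the gradient difference equals the Hessian acting on the step, so the secant indicator reduces to a Rayleigh quotient bounded by the local extreme curvatures. A small self-contained check (the quadratic and its eigenvalues are our illustration, not the paper's):

```python
import numpy as np

# On a quadratic loss L(theta) = 0.5 * theta^T H theta, the gradient is
# H theta, so the gradient difference is exactly dG = H dtheta and the
# secant indicator kappa = <dtheta, H dtheta> / ||dtheta||^2 is the
# Rayleigh quotient of H along the step.
rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0, 100.0])              # anisotropic curvature spectrum
theta0, theta1 = rng.normal(size=3), rng.normal(size=3)
dtheta = theta1 - theta0
dgrad = H @ theta1 - H @ theta0              # equals H @ dtheta on a quadratic
kappa = float(dtheta @ dgrad / (dtheta @ dtheta))
# kappa lies between the smallest and largest eigenvalue of H
assert 1.0 <= kappa <= 100.0
```

Away from the quadratic regime the identity dG = H dtheta only holds approximately, which is exactly where the premise could fail.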
What would settle it
If repeated runs on the high-dimensional heat equation or the 2D Kuramoto-Sivashinsky system show no measurable improvement in convergence speed or final accuracy when the secant correction is added versus the unaugmented baseline optimizer, the claimed benefit would be falsified.
read the original abstract
Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a lightweight curvature-aware optimization framework for Physics-Informed Neural Networks (PINNs) that augments existing first-order optimizers with an adaptive predictive correction. Consecutive gradient differences serve as a cheap proxy for local geometric change in the loss landscape, combined with a step-normalized secant curvature indicator to modulate correction strength. The method is presented as plug-and-play and computationally efficient without forming second-order matrices. Experiments on PDE benchmarks including the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system are claimed to demonstrate consistent gains in convergence speed, training stability, and solution accuracy over standard optimizers and baselines.
Significance. If the empirical claims hold under rigorous validation, the work would provide a practical, low-overhead technique to mitigate known optimization challenges in PINNs arising from anisotropic loss landscapes. This could meaningfully extend the usability of PINNs for complex PDEs by improving reliability without the expense of full second-order methods, representing a useful incremental advance in the field.
major comments (2)
- [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates the relevant Hessian action better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.
- [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.
minor comments (1)
- [Abstract] The abstract refers to 'strong baselines' without naming them; the manuscript should explicitly list the compared optimizers and any prior PINN-specific methods.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the abstract and supporting evidence.
read point-by-point responses
-
Referee: [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates the relevant Hessian action better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.
Authors: We agree that explicit validation of the secant proxy against Hessian actions would strengthen the geometric interpretation. Full Hessian computation is infeasible for the high-dimensional PDEs in our benchmarks, but we have added targeted ablations in the revised manuscript comparing the secant-based correction against pure first-order heuristics (e.g., gradient clipping variants and momentum-only baselines) on the same problems. These show that the curvature modulation yields statistically significant gains beyond first-order adjustments alone. Where dimensionality permits, we have also included correlation plots between the secant indicator and diagonal Hessian approximations. revision: yes
-
Referee: [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.
Authors: The full manuscript already contains tables of relative L2 errors, convergence curves, and ablation studies across the four PDE benchmarks, with results averaged over multiple random seeds. We have revised the abstract to include specific quantitative highlights (e.g., relative error reductions and stability metrics) and now explicitly reference the multi-seed statistics and implementation details (hyperparameters, computational overhead) that appear in the main text and supplementary material. Error bars have been added to all relevant figures. revision: yes
Circularity Check
No significant circularity; derivation uses independent first-order observables
full rationale
The paper defines its adaptive correction explicitly from consecutive gradient differences as an external, observable proxy for local change, together with a normalized secant indicator; this construction does not reduce to the target PDE accuracy or convergence metrics by definition or fitting. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text, and the central claims rest on empirical results across independent benchmarks rather than tautological re-derivation of inputs. The method is presented as a plug-and-play augmentation compatible with existing first-order optimizers, keeping the derivation chain self-contained and testable against external validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- correction-strength control parameter (α_base)
axioms (1)
- domain assumption: Consecutive gradient differences act as a cheap proxy for local geometric change in the loss landscape
Forward citations
Cited by 2 Pith papers
-
Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow
Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...
-
Two-scale Neural Networks for Singularly Perturbed Dynamical Systems with Multiple Parameters
A neural network augmented with the geometric mean of multiple small parameters approximates solutions to singularly perturbed dynamical systems with satisfactory accuracy on tested coupled cases.
Reference graph
Works this paper leans on
-
[1]
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations
Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019
2019
-
[2]
Physics-informed machine learning
George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nat. Rev. Phys., 3(6):422–440, 2021
2021
-
[3]
Physics-informed neural networks for studying heat transfer in porous media
Jiaxuan Xu, Han Wei, and Hua Bao. Physics-informed neural networks for studying heat transfer in porous media. Int. J. Heat Mass Transfer, 217:124671, 2023
2023
-
[4]
Physics-informed neural networks for heat transfer problems
Shengze Cai, Zhicheng Wang, Sifan Wang, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks for heat transfer problems. J. Heat Transfer, 143(6):060801, 2021
2021
-
[5]
Initialization-enhanced physics-informed neural network with domain decomposition (IDPINN)
Chenhao Si and Ming Yan. Initialization-enhanced physics-informed neural network with domain decomposition (IDPINN). J. Comput. Phys., 530:113914, 2025
2025
-
[6]
HxPINN: A hypernetwork-based physics-informed neural network for real-time monitoring of an industrial heat exchanger
Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, and Venkataramana Runkana. HxPINN: A hypernetwork-based physics-informed neural network for real-time monitoring of an industrial heat exchanger. Numer. Heat Transfer B, 86(6):1910–1931, 2025
2025
-
[7]
Physics-informed neural networks (PINN) for computational solid mechanics: Numerical frameworks and applications
Haoteng Hu, Lehua Qi, and Xujiang Chao. Physics-informed neural networks (PINN) for computational solid mechanics: Numerical frameworks and applications. Thin-Walled Struct., 205:112495, 2024
2024
-
[8]
Physics-guided, physics-informed, and physics-encoded neural networks and operators in scientific computing: Fluid and solid mechanics
Salah A Faroughi, Nikhil M Pawar, Celio Fernandes, Maziar Raissi, Subasish Das, Nima K Kalantari, and Seyed Kourosh Mahjour. Physics-guided, physics-informed, and physics-encoded neural networks and operators in scientific computing: Fluid and solid mechanics. J. Comput. Inf. Sci. Eng., 24(4):040802, 2024
2024
-
[9]
Learning in modal space: Solving time-dependent stochastic PDEs using physics-informed neural networks
Dongkun Zhang, Ling Guo, and George Em Karniadakis. Learning in modal space: Solving time-dependent stochastic PDEs using physics-informed neural networks. SIAM J. Sci. Comput., 42(2):A639–A665, 2020
2020
-
[10]
Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks
Xiaoli Chen, Liu Yang, Jinqiao Duan, and George Em Karniadakis. Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks. SIAM J. Sci. Comput., 43(3):B811–B830, 2021
2021
-
[11]
Adversarial uncertainty quantification in physics-informed neural networks
Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys., 394:136–152, 2019
2019
-
[12]
Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems
Dongkun Zhang, Lu Lu, Ling Guo, and George Em Karniadakis. Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys., 397:108850, 2019
2019
-
[13]
B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data
Liu Yang, Xuhui Meng, and George Em Karniadakis. B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data. J. Comput. Phys., 425:109913, 2021
2021
-
[14]
Self-adaptive physics-informed neural networks
Levi D McClenny and Ulisses M Braga-Neto. Self-adaptive physics-informed neural networks. J. Comput. Phys., 474:111722, 2023
2023
-
[15]
Understanding and mitigating gradient flow pathologies in physics-informed neural networks
Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput., 43(5):A3055–A3081, 2021
2021
-
[16]
On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks
Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 384:113938, 2021
2021
-
[17]
Characterizing possible failure modes in physics-informed neural networks
Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 26548–26560, 2021
2021
-
[18]
Binary structured physics-informed neural networks for solving equations with rapidly changing solutions
Yanzhi Liu, Ruifan Wu, and Ying Jiang. Binary structured physics-informed neural networks for solving equations with rapidly changing solutions. J. Comput. Phys., 518:113341, 2024
2024
-
[19]
When and why PINNs fail to train: A neural tangent kernel perspective
Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys., 449:110768, 2022
2022
-
[20]
Residual-based attention in physics-informed neural networks
Sokratis J Anagnostopoulos, Juan Diego Toscano, Nikolaos Stergiopulos, and George Em Karniadakis. Residual-based attention in physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 421:116805, 2024
2024
-
[21]
A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning
Juan Diego Toscano, Daniel T Chen, Vivek Oommen, Jérôme Darbon, and George Em Karniadakis. A variational framework for residual-based adaptivity in neural PDE solvers and operator learning. arXiv preprint arXiv:2509.14198, 2025
2025
-
[22]
Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective
Chenhao Si and Ming Yan. Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective. J. Comput. Phys., 555:114773, 2026
2026
-
[23]
Visualizing the loss landscape of neural nets
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018
2018
-
[24]
Exploring landscapes for better minima along valleys
Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, and Weile Jia. Exploring landscapes for better minima along valleys. In Advances in Neural Information Processing Systems, 2025
2025
-
[25]
Challenges in training PINNs: A loss landscape perspective
Pratik Rathore, Weimu Lei, Zachary Frangella, Lu Lu, and Madeleine Udell. Challenges in training PINNs: A loss landscape perspective. In International Conference on Machine Learning, pages 42159–42191, 2024
2024
-
[26]
Optimizing the optimizer for physics-informed neural networks and Kolmogorov-Arnold networks
Elham Kiyani, Khemraj Shukla, Jorge F Urbán, Jérôme Darbon, and George Em Karniadakis. Optimizing the optimizer for physics-informed neural networks and Kolmogorov-Arnold networks. Comput. Methods Appl. Mech. Engrg., 446:118308, 2025
2025
-
[27]
Unveiling the optimization process of physics informed neural networks: How accurate and competitive can PINNs be?
Jorge F Urbán, Petros Stefanou, and José A Pons. Unveiling the optimization process of physics informed neural networks: How accurate and competitive can PINNs be? J. Comput. Phys., 523:113656, 2025
2025
-
[28]
Gradient alignment in physics-informed neural networks: A second-order optimization perspective
Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, and Paris Perdikaris. Gradient alignment in physics-informed neural networks: A second-order optimization perspective. In Advances in Neural Information Processing Systems, 2025
2025
-
[29]
Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks
Anas Jnini, Elham Kiyani, Khemraj Shukla, Jorge F. Urban, Nazanin Ahmadi Daryakenari, Johannes Muller, Marius Zeinhofer, and George Em Karniadakis. Curvature-aware optimization for high-accuracy physics-informed neural networks.arXiv preprint arXiv:2604.05230, 2026
2026
-
[30]
The limit points of (optimistic) gradient descent in min-max optimization
Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, 2018
2018
-
[31]
Optimization, learning, and games with predictable sequences
Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, 2013
2013
-
[32]
Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile
Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018
2018
-
[33]
A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach
Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507, 2020
2020
-
[34]
A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights
Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17(153):1–43, 2016
2016
-
[35]
Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):9508–9520, 2024
2024
-
[36]
A method for solving the convex programming problem with convergence rate O(1/k²)
Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27(2):372–376, 1983
1983
-
[37]
Introductory Lectures on Convex Optimization: A Basic Course
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013
2013
-
[38]
A simple convergence proof of Adam and Adagrad
Alexandre Défossez, Leon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad. Trans. Mach. Learn. Res., 2022
2022
-
[39]
ASGO: Adaptive structured gradient optimization
Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. In Advances in Neural Information Processing Systems, 2025
2025
-
[40]
A novel convergence analysis for algorithms of the Adam family
Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the Adam family. arXiv preprint arXiv:2112.03459, 2021
2021
-
[41]
ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks
Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914, 2021
2021
-
[42]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
2014
-
[43]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
2019
-
[44]
SOAP: Improving and stabilizing shampoo using Adam for language modeling
Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using Adam for language modeling. In International Conference on Learning Representations, 2025
2025
-
[45]
Muon: An optimizer for hidden layers in neural networks, 2024
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024
2024
-
[46]
A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks
Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. J. Comput. Phys., 435:110242, 2021
2021
-
[47]
Regularity and error estimates in physics-informed neural networks for the Kuramoto-Sivashinsky equation
Mohammad Mahabubur Rahman and Deepanshu Verma. Regularity and error estimates in physics-informed neural networks for the Kuramoto-Sivashinsky equation. arXiv preprint arXiv:2511.09728, 2025
2025
appendix algorithms (recovered)
The extraction also pulled in Appendix A of the reviewed paper ("Algorithms for SOAP and Muon", Algorithms 2–5). The recoverable core of the curvature-aware variants replaces the raw gradient Gₖ with a boosted gradient before the base optimizer's update:
Aₖ = β Aₖ₋₁ + (1 − β)(Gₖ − Gₖ₋₁)
Δθₖ = θₖ − θₖ₋₁
κₖ = ⟨Δθₖ, Gₖ − Gₖ₋₁⟩_F / ‖Δθₖ‖²_F
αₖ = α_base (1 + tanh(−κₖ))
G_boost,k = Gₖ + αₖ Aₖ
The boosted gradient then feeds the otherwise unmodified base update: SOAP's eigenbasis preconditioning (EMA factors Lₖ, Rₖ with eigendecompositions refreshed every f steps) or Muon's Newton–Schulz orthogonalization, with step size ηₖ and weight decay λ unchanged.
discussion (0)