Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks
Pith reviewed 2026-05-10 12:19 UTC · model grok-4.3
The pith
A secant-based correction augments first-order optimizers to speed convergence and raise accuracy when training PINNs on anisotropic loss landscapes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that consecutive gradient differences supply a sufficient proxy for local geometric change, allowing a step-normalized secant curvature indicator to control an adaptive predictive correction that can be added to any first-order optimizer. This correction mitigates anisotropy and rapid variation in the PINN loss, producing consistent gains in convergence speed, training stability, and accuracy on diverse PDE benchmarks.
What carries the argument
The step-normalized secant curvature indicator, computed from successive gradient differences, adaptively scales the strength of a predictive correction term added to the first-order update rule.
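Concretely, the appendix listings define the indicator as the Frobenius inner product of the parameter step with the gradient difference, normalized by the squared step norm, gating the correction strength through a tanh. A minimal sketch of those two formulas (function names and the default α_base are ours, not the paper's):

```python
import numpy as np

def secant_curvature(delta_theta, delta_grad, eps=1e-12):
    """Step-normalized secant indicator: kappa = <dθ, dG>_F / ||dθ||_F^2."""
    return float(np.sum(delta_theta * delta_grad) / (np.sum(delta_theta ** 2) + eps))

def correction_strength(kappa, alpha_base=0.1):
    """Adaptive strength: alpha = alpha_base * (1 + tanh(-kappa)).

    Strong positive curvature along the step drives alpha toward 0;
    flat or negative curvature drives it toward 2 * alpha_base.
    """
    return alpha_base * (1.0 + np.tanh(-kappa))
```

On a locally quadratic slice of the loss, kappa is the Rayleigh quotient of the Hessian along the step, which is what makes two consecutive gradients a cheap curvature probe.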
If this is right
- Convergence speed increases on high-dimensional and nonlinear PDE problems without added matrix costs.
- Training stability improves for reaction-diffusion systems such as Gray-Scott and Belousov-Zhabotinsky.
- Solution accuracy rises on the 2D Kuramoto-Sivashinsky equation relative to standard first-order methods.
- The correction remains compatible with popular optimizers while staying computationally light.
- No explicit second-order information is required, preserving the efficiency of first-order training loops.
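The plug-and-play claim in the bullets above can be made concrete by wrapping the correction around plain gradient descent, following the structure recoverable from the paper's appendix algorithms (the EMA coefficient, learning rate, and toy objective here are illustrative choices, not the paper's; the actual experiments wrap optimizers such as SOAP and Muon):

```python
import numpy as np

def curvature_aware_gd(grad_fn, theta0, lr=0.05, alpha_base=0.1, beta=0.9, steps=200):
    """Plain gradient descent augmented with the secant-gated correction.

    A_k is an EMA of consecutive gradient differences; kappa_k is the
    step-normalized secant indicator; the boosted gradient G + alpha*A
    replaces G in the base update. Hyperparameters are illustrative.
    """
    theta = theta0.astype(float).copy()
    prev_theta = theta.copy()
    prev_grad = np.zeros_like(theta)
    A = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        dg = g - prev_grad
        A = beta * A + (1.0 - beta) * dg                 # EMA of gradient differences
        dtheta = theta - prev_theta
        denom = float(np.sum(dtheta ** 2))
        kappa = float(np.sum(dtheta * dg)) / denom if denom > 0 else 0.0
        alpha = alpha_base * (1.0 + np.tanh(-kappa))     # adaptive correction strength
        g_boost = g + alpha * A                          # predictive correction
        prev_theta, prev_grad = theta.copy(), g
        theta = theta - lr * g_boost
    return theta

# Anisotropic quadratic: gradient of 0.5 * (x^2 + 10 * y^2)
theta = curvature_aware_gd(lambda t: np.array([t[0], 10.0 * t[1]]), np.array([1.0, 1.0]))
```

Because the correction only reuses quantities the base loop already computes (gradients and parameters from the previous step), the per-step overhead stays at first-order cost.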
Where Pith is reading between the lines
- The same gradient-difference proxy could be tested on other scientific machine-learning tasks that face curved loss surfaces, such as operator learning or inverse problems.
- Combining the secant indicator with existing momentum or adaptive-rate schemes might produce further gains that the current experiments do not explore.
- The approach may reduce sensitivity to learning-rate choices in practical PINN deployments, though this would need separate verification.
- Scalability to very large networks or three-dimensional time-dependent PDEs remains untested and could expose limits of the secant proxy.
Load-bearing premise
Consecutive gradient differences supply a reliable enough proxy for local geometric change in the anisotropic PINN loss landscape to let the secant indicator set correction strength effectively.
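One reason the premise is plausible: on a locally quadratic slice of the loss, the gradient difference equals the Hessian acting on the step, so the secant indicator reduces to a Rayleigh quotient bounded by the local extreme curvatures. A small self-contained check (the quadratic and its eigenvalues are our illustration, not the paper's):

```python
import numpy as np

# On a quadratic loss L(theta) = 0.5 * theta^T H theta, the gradient is
# H theta, so the gradient difference is exactly dG = H dtheta and the
# secant indicator kappa = <dtheta, H dtheta> / ||dtheta||^2 is the
# Rayleigh quotient of H along the step.
rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0, 100.0])              # anisotropic curvature spectrum
theta0, theta1 = rng.normal(size=3), rng.normal(size=3)
dtheta = theta1 - theta0
dgrad = H @ theta1 - H @ theta0              # equals H @ dtheta on a quadratic
kappa = float(dtheta @ dgrad / (dtheta @ dtheta))
# kappa lies between the smallest and largest eigenvalue of H
assert 1.0 <= kappa <= 100.0
```

Away from the quadratic regime the identity dG = H dtheta only holds approximately, which is exactly where the premise could fail.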
What would settle it
If repeated runs on the high-dimensional heat equation or the 2D Kuramoto-Sivashinsky system show no measurable improvement in convergence speed or final accuracy when the secant correction is added versus the unaugmented baseline optimizer, the claimed benefit would be falsified.
read the original abstract
Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a lightweight curvature-aware optimization framework for Physics-Informed Neural Networks (PINNs) that augments existing first-order optimizers with an adaptive predictive correction. Consecutive gradient differences serve as a cheap proxy for local geometric change in the loss landscape, combined with a step-normalized secant curvature indicator to modulate correction strength. The method is presented as plug-and-play and computationally efficient without forming second-order matrices. Experiments on PDE benchmarks including the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system are claimed to demonstrate consistent gains in convergence speed, training stability, and solution accuracy over standard optimizers and baselines.
Significance. If the empirical claims hold under rigorous validation, the work would provide a practical, low-overhead technique to mitigate known optimization challenges in PINNs arising from anisotropic loss landscapes. This could meaningfully extend the usability of PINNs for complex PDEs by improving reliability without the expense of full second-order methods, representing a useful incremental advance in the field.
major comments (2)
- [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates the relevant Hessian action better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.
- [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.
minor comments (1)
- [Abstract] The abstract refers to 'strong baselines' without naming them; the manuscript should explicitly list the compared optimizers and any prior PINN-specific methods.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the abstract and supporting evidence.
read point-by-point responses
-
Referee: [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates the relevant Hessian action better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.
Authors: We agree that explicit validation of the secant proxy against Hessian actions would strengthen the geometric interpretation. Full Hessian computation is infeasible for the high-dimensional PDEs in our benchmarks, but we have added targeted ablations in the revised manuscript comparing the secant-based correction against pure first-order heuristics (e.g., gradient clipping variants and momentum-only baselines) on the same problems. These show that the curvature modulation yields statistically significant gains beyond first-order adjustments alone. Where dimensionality permits, we have also included correlation plots between the secant indicator and diagonal Hessian approximations. revision: yes
-
Referee: [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.
Authors: The full manuscript already contains tables of relative L2 errors, convergence curves, and ablation studies across the four PDE benchmarks, with results averaged over multiple random seeds. We have revised the abstract to include specific quantitative highlights (e.g., relative error reductions and stability metrics) and now explicitly reference the multi-seed statistics and implementation details (hyperparameters, computational overhead) that appear in the main text and supplementary material. Error bars have been added to all relevant figures. revision: yes
Circularity Check
No significant circularity; derivation uses independent first-order observables
full rationale
The paper defines its adaptive correction explicitly from consecutive gradient differences as an external, observable proxy for local change, together with a normalized secant indicator; this construction does not reduce to the target PDE accuracy or convergence metrics by definition or fitting. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text, and the central claims rest on empirical results across independent benchmarks rather than tautological re-derivation of inputs. The method is presented as a plug-and-play augmentation compatible with existing first-order optimizers, keeping the derivation chain self-contained and testable against external validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- correction-strength control parameter (α_base)
axioms (1)
- domain assumption: Consecutive gradient differences act as a cheap proxy for local geometric change in the loss landscape
Forward citations
Cited by 2 Pith papers
-
Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow
Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...
-
Two-scale Neural Networks for Singularly Perturbed Dynamical Systems with Multiple Parameters
A neural network augmented with the geometric mean of multiple small parameters approximates solutions to singularly perturbed dynamical systems with satisfactory accuracy on tested coupled cases.
Reference graph
Works this paper leans on
-
[1]
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations
Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019
2019
-
[2]
Physics-informed machine learning
George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nat. Rev. Phys., 3(6):422–440, 2021
2021
-
[3]
Physics-informed neural networks for studying heat transfer in porous media
Jiaxuan Xu, Han Wei, and Hua Bao. Physics-informed neural networks for studying heat transfer in porous media. Int. J. Heat Mass Transfer, 217:124671, 2023
2023
-
[4]
Physics-informed neural networks for heat transfer problems
Shengze Cai, Zhicheng Wang, Sifan Wang, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks for heat transfer problems. J. Heat Transfer, 143(6):060801, 2021
2021
-
[5]
Initialization-enhanced physics-informed neural network with domain decomposition (IDPINN)
Chenhao Si and Ming Yan. Initialization-enhanced physics-informed neural network with domain decomposition (IDPINN). J. Comput. Phys., 530:113914, 2025
2025
-
[6]
HxPINN: A hypernetwork-based physics-informed neural network for real-time monitoring of an industrial heat exchanger
Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, and Venkataramana Runkana. HxPINN: A hypernetwork-based physics-informed neural network for real-time monitoring of an industrial heat exchanger. Numer. Heat Transfer B, 86(6):1910–1931, 2025
2025
-
[7]
Physics-informed neural networks (PINN) for computational solid mechanics: Numerical frameworks and applications
Haoteng Hu, Lehua Qi, and Xujiang Chao. Physics-informed neural networks (PINN) for computational solid mechanics: Numerical frameworks and applications. Thin-Walled Struct., 205:112495, 2024
2024
-
[8]
Physics-guided, physics-informed, and physics-encoded neural networks and operators in scientific computing: Fluid and solid mechanics
Salah A Faroughi, Nikhil M Pawar, Celio Fernandes, Maziar Raissi, Subasish Das, Nima K Kalantari, and Seyed Kourosh Mahjour. Physics-guided, physics-informed, and physics-encoded neural networks and operators in scientific computing: Fluid and solid mechanics. J. Comput. Inf. Sci. Eng., 24(4):040802, 2024
2024
-
[9]
Learning in modal space: Solving time-dependent stochastic PDEs using physics-informed neural networks
Dongkun Zhang, Ling Guo, and George Em Karniadakis. Learning in modal space: Solving time-dependent stochastic PDEs using physics-informed neural networks. SIAM J. Sci. Comput., 42(2):A639–A665, 2020
2020
-
[10]
Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks
Xiaoli Chen, Liu Yang, Jinqiao Duan, and George Em Karniadakis. Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks. SIAM J. Sci. Comput., 43(3):B811–B830, 2021
2021
-
[11]
Adversarial uncertainty quantification in physics-informed neural networks
Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys., 394:136–152, 2019
2019
-
[12]
Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems
Dongkun Zhang, Lu Lu, Ling Guo, and George Em Karniadakis. Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys., 397:108850, 2019
2019
-
[13]
B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data
Liu Yang, Xuhui Meng, and George Em Karniadakis. B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data. J. Comput. Phys., 425:109913, 2021
2021
-
[14]
Self-adaptive physics-informed neural networks
Levi D McClenny and Ulisses M Braga-Neto. Self-adaptive physics-informed neural networks. J. Comput. Phys., 474:111722, 2023
2023
-
[15]
Understanding and mitigating gradient flow pathologies in physics-informed neural networks
Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput., 43(5):A3055–A3081, 2021
2021
-
[16]
On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks
Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 384:113938, 2021
2021
-
[17]
Characterizing possible failure modes in physics-informed neural networks
Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 26548–26560, 2021
2021
-
[18]
Binary structured physics-informed neural networks for solving equations with rapidly changing solutions
Yanzhi Liu, Ruifan Wu, and Ying Jiang. Binary structured physics-informed neural networks for solving equations with rapidly changing solutions. J. Comput. Phys., 518:113341, 2024
2024
-
[19]
When and why PINNs fail to train: A neural tangent kernel perspective
Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys., 449:110768, 2022
2022
-
[20]
Residual-based attention in physics-informed neural networks
Sokratis J Anagnostopoulos, Juan Diego Toscano, Nikolaos Stergiopulos, and George Em Karniadakis. Residual-based attention in physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 421:116805, 2024
2024
-
[21]
A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning
Juan Diego Toscano, Daniel T Chen, Vivek Oommen, Jérôme Darbon, and George Em Karniadakis. A variational framework for residual-based adaptivity in neural PDE solvers and operator learning. arXiv preprint arXiv:2509.14198, 2025
2025
-
[22]
Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective
Chenhao Si and Ming Yan. Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective. J. Comput. Phys., 555:114773, 2026
2026
-
[23]
Visualizing the loss landscape of neural nets
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018
2018
-
[24]
Exploring landscapes for better minima along valleys
Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, and Weile Jia. Exploring landscapes for better minima along valleys. In Advances in Neural Information Processing Systems, 2025
2025
-
[25]
Challenges in training PINNs: A loss landscape perspective
Pratik Rathore, Weimu Lei, Zachary Frangella, Lu Lu, and Madeleine Udell. Challenges in training PINNs: A loss landscape perspective. In International Conference on Machine Learning, pages 42159–42191, 2024
2024
-
[26]
Optimizing the optimizer for physics-informed neural networks and Kolmogorov-Arnold networks
Elham Kiyani, Khemraj Shukla, Jorge F Urbán, Jérôme Darbon, and George Em Karniadakis. Optimizing the optimizer for physics-informed neural networks and Kolmogorov-Arnold networks. Comput. Methods Appl. Mech. Engrg., 446:118308, 2025
2025
-
[27]
Unveiling the optimization process of physics informed neural networks: How accurate and competitive can PINNs be?
Jorge F Urbán, Petros Stefanou, and José A Pons. Unveiling the optimization process of physics informed neural networks: How accurate and competitive can PINNs be? J. Comput. Phys., 523:113656, 2025
2025
-
[28]
Gradient alignment in physics-informed neural networks: A second-order optimization perspective
Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, and Paris Perdikaris. Gradient alignment in physics-informed neural networks: A second-order optimization perspective. In Advances in Neural Information Processing Systems, 2025
2025
-
[29]
Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks
Anas Jnini, Elham Kiyani, Khemraj Shukla, Jorge F. Urban, Nazanin Ahmadi Daryakenari, Johannes Muller, Marius Zeinhofer, and George Em Karniadakis. Curvature-aware optimization for high-accuracy physics-informed neural networks.arXiv preprint arXiv:2604.05230, 2026
2026
-
[30]
The limit points of (optimistic) gradient descent in min-max optimization
Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, 2018
2018
-
[31]
Optimization, learning, and games with predictable sequences
Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, 2013
2013
-
[32]
Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile
Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018
2018
-
[33]
A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach
Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507, 2020
2020
-
[34]
A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights
Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17(153):1–43, 2016
2016
-
[35]
Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):9508–9520, 2024
2024
-
[36]
A method for solving the convex programming problem with convergence rate O(1/k²)
Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27(2):372–376, 1983
1983
-
[37]
Introductory Lectures on Convex Optimization: A Basic Course
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013
2013
-
[38]
A simple convergence proof of Adam and Adagrad
Alexandre Défossez, Leon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad. Trans. Mach. Learn. Res., 2022
2022
-
[39]
ASGO: Adaptive structured gradient optimization
Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. In Advances in Neural Information Processing Systems, 2025
2025
-
[40]
A novel convergence analysis for algorithms of the Adam family
Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the Adam family. arXiv preprint arXiv:2112.03459, 2021
2021
-
[41]
ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks
Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914, 2021
2021
-
[42]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
2014
-
[43]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
2019
-
[44]
SOAP: Improving and stabilizing shampoo using Adam for language modeling
Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using Adam for language modeling. In International Conference on Learning Representations, 2025
2025
-
[45]
Muon: An optimizer for hidden layers in neural networks, 2024
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024
2024
-
[46]
A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks
Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. J. Comput. Phys., 435:110242, 2021
2021
-
[47]
Regularity and error estimates in physics-informed neural networks for the Kuramoto-Sivashinsky equation
Mohammad Mahabubur Rahman and Deepanshu Verma. Regularity and error estimates in physics-informed neural networks for the Kuramoto-Sivashinsky equation. arXiv preprint arXiv:2511.09728, 2025
2025
appendix algorithms (recovered)
The extraction also pulled in Appendix A of the reviewed paper ("Algorithms for SOAP and Muon", Algorithms 2–5). The recoverable core of the curvature-aware variants replaces the raw gradient Gₖ with a boosted gradient before the base optimizer's update:
Aₖ = β Aₖ₋₁ + (1 − β)(Gₖ − Gₖ₋₁)
Δθₖ = θₖ − θₖ₋₁
κₖ = ⟨Δθₖ, Gₖ − Gₖ₋₁⟩_F / ‖Δθₖ‖²_F
αₖ = α_base (1 + tanh(−κₖ))
G_boost,k = Gₖ + αₖ Aₖ
The boosted gradient then feeds the otherwise unmodified base update: SOAP's eigenbasis preconditioning (EMA factors Lₖ, Rₖ with eigendecompositions refreshed every f steps) or Muon's Newton–Schulz orthogonalization, with step size ηₖ and weight decay λ unchanged.
discussion (0)