pith. machine review for the scientific record.

arxiv: 2604.15392 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords physics-informed neural networks · PINNs · optimization · secant approximation · curvature-aware training · PDE benchmarks · gradient differences · loss landscape adaptation

The pith

A secant-based correction augments first-order optimizers to speed convergence and raise accuracy when training PINNs on anisotropic loss landscapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets slow and unstable training of physics-informed neural networks by adding a lightweight predictive correction to existing first-order optimizers. It estimates local curvature change from the difference between consecutive gradients and scales a correction term with a step-normalized indicator. This avoids forming any second-order matrices while adapting to the rapidly varying geometry of the PINN loss. Experiments on the high-dimensional heat equation and the Gray-Scott, Belousov-Zhabotinsky, and 2D Kuramoto-Sivashinsky systems report faster convergence, greater stability, and improved solution accuracy over standard methods and strong baselines. The framework is presented as plug-and-play and computationally cheap enough for broad use with current training code.

Core claim

The central claim is that consecutive gradient differences supply a sufficient proxy for local geometric change, allowing a step-normalized secant curvature indicator to control an adaptive predictive correction that can be added to any first-order optimizer. That correction is said to mitigate anisotropy and rapid variation in the PINN loss, producing consistent gains in convergence speed, training stability, and accuracy on diverse PDE benchmarks.

What carries the argument

The step-normalized secant curvature indicator, computed from successive gradient differences. It adaptively scales the strength of a predictive correction term added to the first-order update rule; a minimal sketch follows.
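Concretely, the appendix pseudocode that surfaces in the reference extraction at the bottom of this page (Algorithms 3 and 5) pins the indicator and its strength schedule down to two lines each. A minimal NumPy sketch of that reading; the epsilon guard and the default base strength are illustrative assumptions, not values reported in the extracted text:

```python
import numpy as np

def secant_indicator(delta_theta, delta_grad, eps=1e-12):
    """Step-normalized secant curvature indicator
    kappa_k = <delta_theta_k, g_k - g_{k-1}> / ||delta_theta_k||^2:
    an estimate of curvature along the most recent step that needs only
    two consecutive gradients, never a Hessian or Hessian-vector product."""
    # eps guards the first iteration, where the step is zero (assumption).
    return float(np.vdot(delta_theta, delta_grad)
                 / (np.vdot(delta_theta, delta_theta) + eps))

def correction_strength(kappa, alpha_base=0.1):
    """alpha_k = alpha_base * (1 + tanh(-kappa_k)): strength rises toward
    2 * alpha_base where curvature along the step is flat or negative and
    decays toward zero where it is large and positive. alpha_base is the
    tunable scalar flagged in the ledger below; 0.1 is a placeholder."""
    return alpha_base * (1.0 + np.tanh(-kappa))
```

On a quadratic slice of the loss this kappa is exactly the Rayleigh quotient of the Hessian along the step, which is what lets a two-gradient difference stand in for curvature.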

If this is right

  • Convergence speed increases on high-dimensional and nonlinear PDE problems without added matrix costs.
  • Training stability improves for reaction-diffusion systems such as Gray-Scott and Belousov-Zhabotinsky.
  • Solution accuracy rises on the 2D Kuramoto-Sivashinsky equation relative to standard first-order methods.
  • The correction remains compatible with popular optimizers while staying computationally light (a minimal plug-in sketch follows this list).
  • No explicit second-order information is required, preserving the efficiency of first-order training loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same gradient-difference proxy could be tested on other scientific machine-learning tasks that face curved loss surfaces, such as operator learning or inverse problems.
  • Combining the secant indicator with existing momentum or adaptive-rate schemes might produce further gains that the current experiments do not explore.
  • The approach may reduce sensitivity to learning-rate choices in practical PINN deployments, though this would need separate verification.
  • Scalability to very large networks or three-dimensional time-dependent PDEs remains untested and could expose limits of the secant proxy.

Load-bearing premise

Consecutive gradient differences supply a reliable enough proxy for local geometric change in the anisotropic PINN loss landscape to let the secant indicator set correction strength effectively.
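The premise is exact in the quadratic limit: if the loss is f(θ) = ½θᵀHθ, then g_k − g_{k−1} = H(θ_k − θ_{k−1}), so the indicator equals the Rayleigh quotient of H along the step. A minimal numerical check of that identity on a synthetic ill-conditioned landscape (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Anisotropic quadratic: eigenvalues spanning six orders of magnitude,
# a cartoon of the ill-conditioned PINN loss landscape.
eigs = np.logspace(-3, 3, 50)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
H = Q @ np.diag(eigs) @ Q.T

theta_prev = rng.standard_normal(50)
theta = theta_prev - 1e-4 * (H @ theta_prev)    # one small gradient step
delta = theta - theta_prev                      # Delta theta_k
grad_diff = H @ theta - H @ theta_prev          # g_k - g_{k-1}, exact here

kappa = delta @ grad_diff / (delta @ delta)     # secant indicator
rayleigh = delta @ H @ delta / (delta @ delta)  # curvature along the step
assert np.isclose(kappa, rayleigh)              # identical on a quadratic
```

Away from the quadratic regime the identity degrades to an approximation, which is precisely the gap the referee's first major comment asks the authors to quantify.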

What would settle it

If repeated runs on the high-dimensional heat equation or the 2D Kuramoto-Sivashinsky system show no measurable improvement in convergence speed or final accuracy when the secant correction is added versus the unaugmented baseline optimizer, the claimed benefit would be falsified.
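That test is cheap to operationalize: fix everything but the correction, sweep seeds, and compare final errors with a paired test. A hypothetical harness; the callables, seed count, and choice of Wilcoxon test are all our assumptions:

```python
import numpy as np
from scipy import stats

def compare_optimizers(run_baseline, run_corrected, n_seeds=10):
    """Paired multi-seed comparison of final relative L2 errors.
    run_baseline / run_corrected are hypothetical callables mapping a seed
    to the final error of one full training run; a one-sided Wilcoxon test
    asks whether the secant correction lowers the error."""
    base = np.array([run_baseline(seed) for seed in range(n_seeds)])
    corr = np.array([run_corrected(seed) for seed in range(n_seeds)])
    _, p_value = stats.wilcoxon(base, corr, alternative="greater")
    return {"baseline_mean": base.mean(),
            "corrected_mean": corr.mean(),
            "p_value": p_value}
```

A p-value that stays large across benchmarks, or corrected means that match the baseline, would be the falsification the paragraph above describes.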

Figures

Figures reproduced from arXiv: 2604.15392 by Chenhao Si, Kang An, Ming Yan, Shiqian Ma.

Figure 1. Projection of the loss landscape L = L_I + λL_F for the viscous Burgers' equation onto a 2D parameter subspace using random directions. The vertical axis shows the logarithm of the scalar loss value, while the two horizontal axes represent the perturbation scale along the two random directions. The left subfigure shows the landscape for the data loss alone (λ = 0), consisting of the initial and boundary conditions…

Figure 2. Histories of the secant curvature indicator…

Figure 3. …and the final numerical results are summarized in…

Figure 4. History of the relative L2 errors for the Gray-Scott system. Because training is performed using a time-marching strategy, the error curves exhibit periodic spikes at the transition between consecutive time windows (every 20,000 iterations), where the model must adapt to a newly initialized subproblem. Within each window, the error decreases as optimization progresses…

Figure 5. Spatiotemporal heatmaps for the Gray-Scott system using the best-performing optimizer CA-SOAP. For each…

Figure 6. Spatiotemporal heatmaps for the Belousov-Zhabotinsky system using CA-SOAP. For each species…

Figure 7. History of the relative L2 errors for the Belousov-Zhabotinsky system.

Figure 8. History of the relative L2 errors for the 2D Kuramoto–Sivashinsky system. The curves show the training evolution of the relative L2 errors for both u and v under different optimizers.
original abstract

Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a lightweight curvature-aware optimization framework for Physics-Informed Neural Networks (PINNs) that augments existing first-order optimizers with an adaptive predictive correction. Consecutive gradient differences serve as a cheap proxy for local geometric change in the loss landscape, combined with a step-normalized secant curvature indicator to modulate correction strength. The method is presented as plug-and-play and computationally efficient without forming second-order matrices. Experiments on PDE benchmarks including the high-dimensional heat equation, Gray-Scott system, Belousov-Zhabotinsky system, and 2D Kuramoto-Sivashinsky system are claimed to demonstrate consistent gains in convergence speed, training stability, and solution accuracy over standard optimizers and baselines.

Significance. If the empirical claims hold under rigorous validation, the work would provide a practical, low-overhead technique to mitigate known optimization challenges in PINNs arising from anisotropic loss landscapes. This could meaningfully extend the usability of PINNs for complex PDEs by improving reliability without the expense of full second-order methods, representing a useful incremental advance in the field.

major comments (2)
  1. [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates relevant Hessian actions better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.
  2. [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract refers to 'strong baselines' without naming them; the manuscript should explicitly list the compared optimizers and any prior PINN-specific methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the abstract and supporting evidence.

point-by-point responses
  1. Referee: [Abstract] The central claim of consistent improvements in convergence speed, stability, and accuracy rests on the secant curvature indicator, derived from consecutive gradient differences, serving as a reliable proxy for the curvature directions that drive PINN training issues. However, PINN losses are composite and highly anisotropic; the manuscript must demonstrate (via correlation analysis or targeted ablations) that this two-point secant approximates relevant Hessian actions better than first-order heuristics, or the gains may be incidental rather than geometrically motivated.

    Authors: We agree that explicit validation of the secant proxy against Hessian actions would strengthen the geometric interpretation. Full Hessian computation is infeasible for the high-dimensional PDEs in our benchmarks, but we have added targeted ablations in the revised manuscript comparing the secant-based correction against pure first-order heuristics (e.g., gradient clipping variants and momentum-only baselines) on the same problems. These show that the curvature modulation yields statistically significant gains beyond first-order adjustments alone. Where dimensionality permits, we have also included correlation plots between the secant indicator and diagonal Hessian approximations (a verification sketch follows these responses). revision: yes

  2. Referee: [Abstract] No quantitative results, error bars, ablation studies, or implementation details are supplied to support the benchmark claims. The full manuscript must include these (e.g., tables of relative L2 errors, convergence curves with statistics over multiple seeds) to make the data-to-claim link verifiable; without them the strongest claim cannot be assessed.

    Authors: The full manuscript already contains tables of relative L2 errors, convergence curves, and ablation studies across the four PDE benchmarks, with results averaged over multiple random seeds. We have revised the abstract to include specific quantitative highlights (e.g., relative error reductions and stability metrics) and now explicitly reference the multi-seed statistics and implementation details (hyperparameters, computational overhead) that appear in the main text and supplementary material. Error bars have been added to all relevant figures. revision: yes
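On problems small enough for exact Hessian-vector products, the correlation the authors describe can be checked directly: compare the gradient change along the last step against the exact Hessian action on that step. A sketch of such a check via double backward (the Pearlmutter trick); the function names and the in-place perturbation are our scaffolding, not the authors' code:

```python
import torch

def flat_grad(loss, params, create_graph=False):
    """Gradient of loss w.r.t. params, flattened into one vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def secant_vs_hessian_action(loss_fn, params, step):
    """Cosine similarity between the secant g(theta + step) - g(theta) and
    the exact Hessian-vector product H @ step. Values near 1 support the
    gradient-difference curvature proxy; note params stay perturbed."""
    g0 = flat_grad(loss_fn(), params, create_graph=True)
    hvp = flat_grad((g0 * step).sum(), params)  # H @ step, double backward
    offset = 0
    with torch.no_grad():                       # move params to theta + step
        for p in params:
            n = p.numel()
            p.add_(step[offset:offset + n].view_as(p))
            offset += n
    g1 = flat_grad(loss_fn(), params)
    secant = g1 - g0.detach()
    return torch.nn.functional.cosine_similarity(secant, hvp, dim=0)
```

For finite steps the secant averages curvature along the segment rather than sampling it at a point, so correlations below 1 do not by themselves refute the proxy; trends across training are the more informative readout.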

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent first-order observables

full rationale

The paper defines its adaptive correction explicitly from consecutive gradient differences as an external, observable proxy for local change, together with a normalized secant indicator; this construction does not reduce to the target PDE accuracy or convergence metrics by definition or fitting. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text, and the central claims rest on empirical results across independent benchmarks rather than tautological re-derivation of inputs. The method is presented as a plug-and-play augmentation compatible with existing first-order optimizers, keeping the derivation chain self-contained and open to external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that gradient differences serve as a valid curvature proxy and on the existence of tunable control for correction strength; no new physical entities are introduced.

free parameters (1)
  • correction strength control parameter
    The abstract states that a step-normalized secant curvature indicator is used to control correction strength, implying at least one tunable scalar (the base strength α_base in the extracted appendix pseudocode).
axioms (1)
  • domain assumption: consecutive gradient differences act as a cheap proxy for local geometric change in the loss landscape
    Explicitly invoked in the abstract as the basis for the adaptive correction without forming second-order matrices.

pith-pipeline@v0.9.0 · 5460 in / 1259 out tokens · 60566 ms · 2026-05-10T12:19:05.431074+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow

    cs.LG 2026-05 unverdicted novelty 7.0

    Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...

  2. Two-scale Neural Networks for Singularly Perturbed Dynamical Systems with Multiple Parameters

    math.NA 2026-05 unverdicted novelty 4.0

    A neural network augmented with the geometric mean of multiple small parameters approximates solutions to singularly perturbed dynamical systems with satisfactory accuracy on tested coupled cases.

Reference graph

Works this paper leans on

56 extracted references · 6 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.
  2. [2] George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nat. Rev. Phys., 3(6):422–440, 2021.
  3. [3] Jiaxuan Xu, Han Wei, and Hua Bao. Physics-informed neural networks for studying heat transfer in porous media. Int. J. Heat Mass Transfer, 217:124671, 2023.
  4. [4] Shengze Cai, Zhicheng Wang, Sifan Wang, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks for heat transfer problems. J. Heat Transfer, 143(6):060801, 2021.
  5. [5] Chenhao Si and Ming Yan. Initialization-enhanced physics-informed neural network with domain decomposition (IDPINN). J. Comput. Phys., 530:113914, 2025.
  6. [6] Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, and Venkataramana Runkana. HxPINN: A hypernetwork-based physics-informed neural network for real-time monitoring of an industrial heat exchanger. Numer. Heat Transfer B, 86(6):1910–1931, 2025.
  7. [7] Haoteng Hu, Lehua Qi, and Xujiang Chao. Physics-informed neural networks (PINN) for computational solid mechanics: Numerical frameworks and applications. Thin-Walled Struct., 205:112495, 2024.
  8. [8] Salah A. Faroughi, Nikhil M. Pawar, Celio Fernandes, Maziar Raissi, Subasish Das, Nima K. Kalantari, and Seyed Kourosh Mahjour. Physics-guided, physics-informed, and physics-encoded neural networks and operators in scientific computing: Fluid and solid mechanics. J. Comput. Inf. Sci. Eng., 24(4):040802, 2024.
  9. [9] Dongkun Zhang, Ling Guo, and George Em Karniadakis. Learning in modal space: Solving time-dependent stochastic PDEs using physics-informed neural networks. SIAM J. Sci. Comput., 42(2):A639–A665, 2020.
  10. [10] Xiaoli Chen, Liu Yang, Jinqiao Duan, and George Em Karniadakis. Solving inverse stochastic problems from discrete particle observations using the Fokker-Planck equation and physics-informed neural networks. SIAM J. Sci. Comput., 43(3):B811–B830, 2021.
  11. [11] Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys., 394:136–152, 2019.
  12. [12] Dongkun Zhang, Lu Lu, Ling Guo, and George Em Karniadakis. Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys., 397:108850, 2019.
  13. [13] Liu Yang, Xuhui Meng, and George Em Karniadakis. B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data. J. Comput. Phys., 425:109913, 2021.
  14. [14] Levi D. McClenny and Ulisses M. Braga-Neto. Self-adaptive physics-informed neural networks. J. Comput. Phys., 474:111722, 2023.
  15. [15] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput., 43(5):A3055–A3081, 2021.
  16. [16] Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 384:113938, 2021.
  17. [17] Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W. Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 26548–26560, 2021.
  18. [18] Yanzhi Liu, Ruifan Wu, and Ying Jiang. Binary structured physics-informed neural networks for solving equations with rapidly changing solutions. J. Comput. Phys., 518:113341, 2024.
  19. [19] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys., 449:110768, 2022.
  20. [20] Sokratis J. Anagnostopoulos, Juan Diego Toscano, Nikolaos Stergiopulos, and George Em Karniadakis. Residual-based attention in physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 421:116805, 2024.
  21. [21] Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, Jérôme Darbon, and George Em Karniadakis. A variational framework for residual-based adaptivity in neural PDE solvers and operator learning. arXiv preprint arXiv:2509.14198, 2025.
  22. [22] Chenhao Si and Ming Yan. Convolution-weighting method for the physics-informed neural network: A primal-dual optimization perspective. J. Comput. Phys., 555:114773, 2026.
  23. [23] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018.
  24. [24] Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, and Weile Jia. Exploring landscapes for better minima along valleys. In Advances in Neural Information Processing Systems, 2025.
  25. [25] Pratik Rathore, Weimu Lei, Zachary Frangella, Lu Lu, and Madeleine Udell. Challenges in training PINNs: A loss landscape perspective. In International Conference on Machine Learning, pages 42159–42191, 2024.
  26. [26] Elham Kiyani, Khemraj Shukla, Jorge F. Urbán, Jérôme Darbon, and George Em Karniadakis. Optimizing the optimizer for physics-informed neural networks and Kolmogorov-Arnold networks. Comput. Methods Appl. Mech. Engrg., 446:118308, 2025.
  27. [27] Jorge F. Urbán, Petros Stefanou, and José A. Pons. Unveiling the optimization process of physics-informed neural networks: How accurate and competitive can PINNs be? J. Comput. Phys., 523:113656, 2025.
  28. [28] Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, and Paris Perdikaris. Gradient alignment in physics-informed neural networks: A second-order optimization perspective. In Advances in Neural Information Processing Systems, 2025.
  29. [29] Anas Jnini, Elham Kiyani, Khemraj Shukla, Jorge F. Urban, Nazanin Ahmadi Daryakenari, Johannes Muller, Marius Zeinhofer, and George Em Karniadakis. Curvature-aware optimization for high-accuracy physics-informed neural networks. arXiv preprint arXiv:2604.05230, 2026.
  30. [30] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, 2018.
  31. [31] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, 2013.
  32. [32] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
  33. [33] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507, 2020.
  34. [34] Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17(153):1–43, 2016.
  35. [35] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):9508–9520, 2024.
  36. [36] Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27(2):372–376, 1983.
  37. [37] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
  38. [38] Alexandre Défossez, Leon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad. Trans. Mach. Learn. Res., 2022.
  39. [39] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. In Advances in Neural Information Processing Systems, 2025.
  40. [40] Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the Adam family. arXiv preprint arXiv:2112.03459, 2021. (The extracted title line for this entry, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM J. Optim., 23(4):2341–2368, 2013, belongs to a different work merged in during extraction.)
  41. [41] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914, 2021.
  42. [42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  43. [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  44. [44] Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In International Conference on Learning Representations, 2025.
  45. [45] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.
  46. [46] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. J. Comput. Phys., 435:110242, 2021.
  47. [47] Mohammad Mahabubur Rahman and Deepanshu Verma. Regularity and error estimates in physics-informed neural networks for the Kuramoto-Sivashinsky equation. arXiv preprint arXiv:2511.09728, 2025.
  48.–56. [48]–[56] Appendix spillover rather than bibliography entries: the extractor captured the pseudocode of Algorithms 2–5 (SOAP, Curvature-Aware SOAP, Muon, and Curvature-Aware Muon). The recoverable curvature-aware steps shared by CA-SOAP and CA-Muon are: A_k = β A_{k−1} + (1 − β)(G_k − G_{k−1}) (decay β_a in CA-SOAP, µ in CA-Muon); Δθ_k = θ_k − θ_{k−1}; κ_k = ⟨Δθ_k, G_k − G_{k−1}⟩_F / ‖Δθ_k‖²_F; α_k = α_base(1 + tanh(−κ_k)); G_boost,k = G_k + α_k A_k. G_boost,k then replaces G_k in the base update, i.e., SOAP's Shampoo-style two-sided preconditioning with eigenbasis refreshes every f steps, or Muon's momentum plus Newton-Schulz orthogonalization.