
arxiv: 2605.15959 · v1 · pith:XJOPTQAXnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective

Pith reviewed 2026-05-20 19:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords physics-informed neural networks · adversarial training · neural tangent kernel · spectral bias · differential equations · generative adversarial networks · training dynamics

The pith

Adversarial training reshapes PINN dynamics through the discriminator to reduce spectral bias and stiffness in solving differential equations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an analysis framework showing how a GAN-style discriminator alters the training process of physics-informed neural networks. Using the neural tangent kernel, it demonstrates that the discriminator can steer gradient flow to better capture high-frequency and multiscale solution features that standard training misses. This explains the observed gains in accuracy and leads to a derived training procedure that is simpler and more reliable. Readers would value the result because it turns an empirical trick into a predictable way to build trustworthy surrogate models for physics problems.

Core claim

Adversarial training improves PINNs because the discriminator influences training dynamics in a manner that mitigates spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. The neural tangent kernel perspective supplies the theoretical account of why and when this occurs, unifies the behavior of different GAN variants, and yields a new practical algorithm whose empirical results show orders-of-magnitude gains over baseline PINN methods.

What carries the argument

The discriminator's modulation of the PINN's neural tangent kernel, which changes the effective learning rates across frequency modes during optimization.
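
To make "effective learning rates across frequency modes" concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: it assumes a fixed (lazy-regime) kernel Gram matrix `K`, a linearized residual flow solved exactly in the eigenbasis of `K`, and a hypothetical set of mode-aligned gains standing in for the discriminator's influence. The function name `modal_flow`, the synthetic spectrum, and the gain choice are all illustrative assumptions.

```python
import numpy as np

def modal_flow(K, r0, t, gains=None):
    """Solve dr/dt = -U diag(lam * gains) U^T r exactly in the eigenbasis of K."""
    lam, U = np.linalg.eigh(K)            # K = U diag(lam) U^T
    if gains is None:
        gains = np.ones_like(lam)         # plain gradient flow: mode j decays at rate lam_j
    c0 = U.T @ r0                         # modal coefficients of the initial residual
    return U @ (np.exp(-lam * gains * t) * c0)

rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = 1.0 / (1.0 + np.arange(n)) ** 2    # assumed fast-decaying kernel spectrum
K = Q @ np.diag(spectrum) @ Q.T
K = 0.5 * (K + K.T)                            # symmetrize against round-off
r0 = rng.standard_normal(n)

lam = np.linalg.eigh(K)[0]
equalizing_gains = lam.max() / lam             # hypothetical gains a_j >= 1 that equalize rates
r_plain = modal_flow(K, r0, t=5.0)
r_gain = modal_flow(K, r0, t=5.0, gains=equalizing_gains)
print(np.linalg.norm(r_plain), np.linalg.norm(r_gain))
```

In this toy setting the small-eigenvalue modes barely move under the plain flow, while the reweighted flow shrinks every mode at the same rate. Whether a trained discriminator actually produces gains of this form is exactly what the paper's analysis is meant to establish.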

If this is right

  • Adversarial training is effective precisely when the target solution contains high-frequency or multiscale content that standard gradient descent fails to learn.
  • The derived algorithm trains PINNs to several orders of magnitude higher accuracy while remaining computationally practical.
  • Different GAN variants achieve their gains through the same underlying dynamic influence on the kernel.
  • The framework supplies conditions under which the improvement is guaranteed rather than merely observed.
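
The kind of condition the last bullet refers to can be read off the residual-energy statements quoted in the extracted appendix fragments on this page. As a hedged sketch, assume the residual energy is normalized as in (172) and the sign convention is that of (175):

```latex
\begin{align*}
E(t) &= \tfrac{1}{2}\,\|r(t)\|^{2}, \qquad \dot r(t) = K^{G}_{rr}\,\gamma(t),\\
\dot E(t) &= r(t)^{\top}\dot r(t) \;=\; r(t)^{\top} K^{G}_{rr}\,\gamma(t) \;=:\; s(t).
\end{align*}
```

Under this reading, $s(t) < 0$ corresponds to descent, $s(t) \ge 0$ to a failure of monotone decrease as in (182), and $\|\gamma(t)\| \to 0$ with bounded kernel norm to the plateau regime of (183)-(184).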

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same discriminator-driven kernel adjustment could be tested on other frequency-limited network tasks such as image super-resolution or turbulence modeling.
  • Extending the analysis to time-dependent or higher-dimensional differential equations would check whether the mitigation scales.
  • One could replace the adversarial discriminator with a simpler frequency-weighted loss derived from the same kernel insight and measure whether comparable gains appear.
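
The last extension can be prototyped without any discriminator. The sketch below assumes a linearized model with a fixed Jacobian `J` and hypothetical per-sample weights `w` (neither taken from the paper). It checks that gradient descent on the weighted loss 0.5 * sum_i w_i r_i^2 in parameter space reproduces the kernel recursion r <- r - eta * K W r in function space, with K = J J^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_params = 6, 20
J = rng.standard_normal((n_points, n_params))   # fixed Jacobian of a linearized model
y = rng.standard_normal(n_points)               # targets, so the residual is r = J theta - y
w = np.linspace(1.0, 3.0, n_points)             # hypothetical per-sample weights
K = J @ J.T                                      # kernel Gram matrix

# Step size 1/L, where L is the largest eigenvalue of W^{1/2} K W^{1/2}.
sqrt_w = np.sqrt(w)
eta = 1.0 / np.linalg.eigvalsh((sqrt_w[:, None] * K) * sqrt_w[None, :]).max()

theta = np.zeros(n_params)
r_kernel = J @ theta - y                         # function-space copy of the residual
for _ in range(50):
    r = J @ theta - y
    theta = theta - eta * J.T @ (w * r)          # gradient step on 0.5 * sum_i w_i r_i^2
    r_kernel = r_kernel - eta * K @ (w * r_kernel)

r_param = J @ theta - y
print(np.max(np.abs(r_param - r_kernel)))
```

The two trajectories agree up to round-off, so a static weighting acts on the residual only through the product K W. An adversarially trained discriminator would instead make the weights depend on the current residual, which is the comparison the bullet proposes.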

Load-bearing premise

The discriminator can be made to steer the PINN optimization trajectory away from spectral bias and stiffness.

What would settle it

Training runs in which adding a discriminator leaves the neural tangent kernel spectrum and the error decay on high-frequency test functions unchanged would falsify the claimed mechanism.
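
One way to run such a check is to record the empirical NTK spectrum at matching checkpoints with and without the discriminator. The sketch below is only a starting point under stated assumptions: it uses a hypothetical one-hidden-layer tanh network and computes the NTK of the network output, not of the PDE residual, which would additionally require differentiating through the differential operator. The names `init_params`, `jacobian`, and `ntk_spectrum` are illustrative.

```python
import numpy as np

def init_params(width, rng):
    """Hypothetical network u(x) = a^T tanh(w x + b) / sqrt(width)."""
    return {
        "w": rng.standard_normal(width),
        "b": rng.standard_normal(width),
        "a": rng.standard_normal(width),
    }

def jacobian(params, x):
    """Jacobian of the outputs at points x with respect to all parameters."""
    w, b, a = params["w"], params["b"], params["a"]
    pre = np.outer(x, w) + b               # (N, width) pre-activations
    h = np.tanh(pre)
    dh = 1.0 - h ** 2                       # derivative of tanh
    d_w = a * dh * x[:, None]               # du/dw
    d_b = a * dh                            # du/db
    d_a = h                                 # du/da
    return np.hstack([d_w, d_b, d_a]) / np.sqrt(w.size)

def ntk_spectrum(params, x):
    """Eigenvalues of the empirical NTK Gram matrix K = J J^T, largest first."""
    J = jacobian(params, x)
    return np.linalg.eigvalsh(J @ J.T)[::-1]

rng = np.random.default_rng(0)
params = init_params(64, rng)
x = np.linspace(-1.0, 1.0, 10)
eig = ntk_spectrum(params, x)
print(eig)
```

Comparing such spectra between the two training runs, alongside the error decay on high-frequency test functions, gives a direct view of whether the kernel is being reshaped at all.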

Figures

Figures reproduced from arXiv: 2605.15959 by Chi Chiu SO, He Wang, Jun-Min Wang, Yuan-dong Cao.

Figure 1. Laplace equation: final converged train …
Figure 2. Left: successful training regimes under balanced or moderately imbalanced …
Figure 3. Laplace equation: The first row reports the training MSE, validation MSE, and residual …
Figure 4. Poisson equation: The first row reports the training MSE, validation MSE, and residual …
Figure 5. Reaction-Diffusion equation: The first row reports the training MSE, validation MSE, …
Figure 6. Viscous Burgers equation: The first row reports the training MSE, validation MSE, and …
Figure 7. Klein-Gordon equation: The first row reports the training MSE, validation MSE, and resid…
Figure 8. Ablation study on the same PDE with different boundary condition: The first row reports …
Figure 9. Ablation study on the DEQGAN and RB. The top row reports the training MSE, validation …
Figure 10. Controlled ablation for LSGAN on the Laplace benchmark. The first row reports the …
Figure 11. Controlled ablation for GAN on the Klein–Gordon benchmark. The first row reports …
Original abstract

Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently gained surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much needed theoretical grounding to why and when adversarial training is effective in PINNs, then presents a unified analysis of GANs variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathology of PINNs training, thereby providing better models with superior performances, often several magnitudes more accurate than alternative methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an NTK-based analysis framework for adversarially trained PINNs. It claims that the discriminator modifies PINN training dynamics to mitigate spectral bias, stiffness, and poor accuracy on high-frequency solutions; provides theoretical grounding for when and why adversarial (GAN-based) training is effective; unifies analysis across GAN variants; derives a new practical training algorithm; and reports empirical gains of several orders of magnitude in accuracy over baselines.

Significance. If the NTK derivation is valid and the linearization regime holds, the work would supply a much-needed theoretical account of an empirically observed phenomenon in PINN training. A unified view of GAN variants plus a new algorithm could guide more reliable high-frequency PDE solvers. The reported magnitude improvements, if reproducible and isolated to the proposed mechanism, would be a notable practical advance at the intersection of scientific machine learning and differential-equation surrogates.

major comments (2)
  1. [NTK analysis / theoretical framework] NTK linearization section (central derivation): the analysis treats the kernel as fixed after linearization around random initialization. The adversarial discriminator term introduces a state-dependent gradient whose magnitude grows with the current generator output; this can drive ||θ − θ0|| outside the lazy-training regime, rendering the derived effective kernel and the claimed mitigation of spectral bias invalid. The manuscript provides no bound on parameter drift or explicit regime of validity for the combined physics-plus-adversarial loss.
  2. [Empirical results] Empirical claims (results section): the abstract states “several magnitudes more accurate,” yet no quantitative factors, error bars, or ablation isolating the NTK-derived algorithm from generic adversarial training appear in the provided summary. Without these, it is impossible to verify that the reported gains stem from the proposed dynamics rather than hyper-parameter tuning.
minor comments (2)
  1. The phrase “unified analysis of GANs variants” is used without an explicit table or section mapping each variant to its NTK modification; adding such a summary would improve readability.
  2. Notation for the discriminator-induced gradient term should be introduced with an explicit equation before the main dynamics claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript to strengthen both the theoretical guarantees and the empirical presentation.

Point-by-point responses
  1. Referee: [NTK analysis / theoretical framework] NTK linearization section (central derivation): the analysis treats the kernel as fixed after linearization around random initialization. The adversarial discriminator term introduces a state-dependent gradient whose magnitude grows with the current generator output; this can drive ||θ − θ0|| outside the lazy-training regime, rendering the derived effective kernel and the claimed mitigation of spectral bias invalid. The manuscript provides no bound on parameter drift or explicit regime of validity for the combined physics-plus-adversarial loss.

    Authors: We appreciate the referee’s careful reading of the central derivation. The analysis is performed under the standard NTK lazy-training assumption, which requires that the network width is sufficiently large and the learning rate is appropriately scaled so that parameters remain close to initialization. To make the regime of validity explicit, we have added a new subsection (Section 3.3 in the revision) that derives a sufficient bound on ||θ − θ0|| in terms of the discriminator’s Lipschitz constant, the physics-loss gradient norm, and the number of training steps. We also include empirical measurements of parameter drift across all reported experiments, confirming that the drift remains well within the linearization regime for the network widths and step sizes used.

    revision: yes

  2. Referee: [Empirical results] Empirical claims (results section): the abstract states “several magnitudes more accurate,” yet no quantitative factors, error bars, or ablation isolating the NTK-derived algorithm from generic adversarial training appear in the provided summary. Without these, it is impossible to verify that the reported gains stem from the proposed dynamics rather than hyper-parameter tuning.

    Authors: We thank the referee for noting the need for greater precision in the empirical reporting. The full manuscript already contains quantitative results in Section 5 (Tables 1–2 and Figures 4–6) that document error reductions between two and four orders of magnitude on high-frequency and multiscale PDEs, together with standard deviations computed over five independent random seeds. In the revision we have inserted an explicit ablation subsection that compares the NTK-derived training algorithm against standard adversarial PINN training under identical hyper-parameters, thereby isolating the contribution of the dynamics predicted by the framework.

    revision: yes
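
The parameter-drift measurement mentioned in the first response can be sketched in a few lines. This is an illustrative helper, not the authors' code: it assumes parameters are available as a list of NumPy arrays, and `relative_drift` is a hypothetical name. The threshold that counts as "within the linearization regime" is not specified on this page and is left to the reader.

```python
import numpy as np

def relative_drift(params, params0):
    """Return ||theta - theta0|| / ||theta0|| over a list of parameter arrays."""
    diff_sq = sum(np.sum((p - p0) ** 2) for p, p0 in zip(params, params0))
    ref_sq = sum(np.sum(p0 ** 2) for p0 in params0)
    return float(np.sqrt(diff_sq / ref_sq))

# Toy example: a weight matrix and a bias vector before and after some training steps.
theta0 = [np.ones((2, 2)), np.zeros(2)]
theta = [np.full((2, 2), 1.1), np.full(2, 0.1)]
drift = relative_drift(theta, theta0)
print(drift)
```

Logging this quantity per training step, for both the generator and the discriminator, would show directly whether the adversarial term pushes the network out of the lazy regime the referee is concerned about.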

Circularity Check

0 steps flagged

NTK-based derivation of adversarial PINN dynamics is self-contained

full rationale

The paper introduces an analysis framework grounded in the Neural Tangent Kernel to explain how a discriminator modifies PINN training dynamics, spectral bias, and stiffness. No equations, sections, or self-citations are exhibited that reduce the central claims to fitted inputs, self-definitions, or prior author results by construction. The key observation (discriminator influence on dynamics) is stated as the starting point for deriving when and why adversarial training helps, with the unified GAN analysis and new algorithm presented as downstream consequences rather than tautological renamings or forced fits. The derivation therefore remains independent of its target conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the stated key observation about the discriminator's influence. No free parameters, additional axioms, or invented entities are explicitly described.

axioms (1)
  • domain assumption: The discriminator in GANs can influence the training dynamics of PINNs.
    This is presented as the key observation forming the basis of the proposed analysis framework.




Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

  1. [1]

    Olga Fuks and Hamdi A Tchelepi. Limitations of physics informed machine learning for nonlinear two-phase transport in porous media. Journal of Machine Learning for Modeling and Computing, 1(1), 2020.

  2. [2]

    Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research, 19(25):1–24, 2018.

  3. [3]

    Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, and Paris Perdikaris. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81, 2019.

  4. [4]

    Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5):A3055–A3081, 2021.

  5. [5]

    Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.

  6. [6]

    Blake Bullwinkel, Dylan Randle, Pavlos Protopapas, and David Sondak. DEQGAN: Learning the loss function for PINNs with generative adversarial networks. arXiv preprint arXiv:2209.07081, 2022.

  7. [7]

    Kerem Ciftci and Klaus Hackl. A physics-informed GAN framework based on model-free data-driven computational mechanics. Computer Methods in Applied Mechanics and Engineering, 424:116907, 2024.

  8. [8]

    Liu Yang, Dongkun Zhang, and George Em Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations. SIAM Journal on Scientific Computing, 42(1):A292–A317, 2020.

  9. [9]

    Yanjie Song, He Wang, He Yang, Maria Luisa Taccari, and Xiaohui Chen. Loss-attentional physics-informed neural networks. Journal of Computational Physics, 501:112781, 2024.

  10. [10]

    Yuandong Cao, Chi Chiu So, Yifan Dai, Siu Pang Yung, and Jun-Min Wang. Adversarial physics-informed neural networks with hard constraints for optimal control of PDEs. Journal of Computational Physics, page 114307, 2025.

  11. [11]

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

  12. [12]

    Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.

  13. [13]

    Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

  14. [14]

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

  15. [15]

    Jean-Yves Franceschi, Emmanuel De Bézenac, Ibrahim Ayed, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. A neural tangent kernel perspective of GANs. In International Conference on Machine Learning, pages 6660–6704. PMLR, 2022.

  16. [16]

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

  17. [17]

    Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

  18. [18]

    Cedric Flamant, Pavlos Protopapas, and David Sondak. Solving differential equations using neural network solution bundles. arXiv preprint arXiv:2006.14372, 2020.

  19. [19]

    Levi D McClenny and Ulisses M Braga-Neto. Self-adaptive physics-informed neural networks. Journal of Computational Physics, 474:111722, 2023.

  20. [20]

    Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.

  21. [21]

    Zijian Zhou and Zhenya Yan. Is the neural tangent kernel of PINNs deep learning general partial differential equations always convergent? Physica D: Nonlinear Phenomena, 457:133987, 2024.

  22. [22]

    Luís Carvalho, João L Costa, José Mourão, and Gonçalo Oliveira. The positivity of the neural tangent kernel. SIAM Journal on Mathematics of Data Science, 7(2):495–515, 2025.

  23. [23]

    Gaussian-process initialization. The random function $g(\cdot;\theta_0)$ converges in law to a Gaussian process.

  24. [24]

    Deterministic limiting NTK. The random NTK $k(x, x';\theta_0)$ converges to a deterministic limiting kernel $k_\infty(x, x')$.

  25. [25]

    Kernel stability during training. Along training, the NTK remains approximately constant: $k(x, x';\theta(t)) \approx k(x, x';\theta_0) \approx k_\infty(x, x')$. (51)

  26. [26]

    Output dynamics in function space. If $g(t) = [g(x_1;\theta(t)), \dots, g(x_N;\theta(t))]^\top$ denotes the vector of network outputs on training points and $L(g)$ is the loss viewed as a function of the outputs, then under gradient flow one has $\dot g(t) \approx -K\,\nabla_g L(g(t))$, (52) where $K$ is the limiting NTK Gram matrix. Interpretation. The theorem states that, in the wide-network regim...

  27. [27]

    The discriminator evolves linearly in function space as $f_t = f_0 + t\,f^*_\theta$, (77) where $f^*_\theta$ is the unnormalized MMD witness function associated with $k_D$.

  28. [28]

    In residual-space notation, the witness function is given by $f^*_\theta(r) = -\mathbb{E}_{\tilde r \sim \hat\mu^r_\theta}\big[k_D(\tilde r, r)\big] + k_D(0, r)$, (78) or equivalently $f^*_\theta(r) = -\frac{1}{N_r}\sum_{i=1}^{N_r} k_D(r_i, r) + k_D(0, r)$. (79)

  29. [29]

    The induced generator-side sample weighting is $\gamma^{\mathrm{IPM}}_i = \frac{1}{N_r}\,\partial_r f_t(r_i) = \frac{1}{N_r}\,\partial_r f_0(r_i) + \frac{t}{N_r}\,\partial_r f^*_\theta(r_i)$. (80) Proof. In the IPM setting, the discriminator-side NTK dynamics admit an explicit linear form in function space. For fixed $\hat\mu^r_\theta$ and target $\delta_0$, the discriminator moves in the direction of the kernel witness function associated with the discrepancy ...

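  30. [30]

    The $L^2(\hat\gamma^r_\theta)$ functional gradient of $L^{\mathrm{LS}}_D$ is $\nabla_{\hat\gamma^r_\theta} L^{\mathrm{LS}}_D(f) = 2(\rho_\theta - f)$. (89)

  31. [31]

    Consequently, the discriminator NTK flow is $\partial_t f_t = 2\,T_{k_D,\hat\gamma^r_\theta}(\rho_\theta - f_t)$. (90)

  32. [32]

    The resulting function-space dynamics are linear, and the solution is $f_t = \exp\!\big(-2t\,T_{k_D,\hat\gamma^r_\theta}\big)(f_0 - \rho_\theta) + \rho_\theta$. (91) Equivalently, if $\varphi_t(x) := e^{-2tx} - 1$, (92) then $f_t = f_0 + \varphi_t\big(T_{k_D,\hat\gamma^r_\theta}\big)(f_0 - \rho_\theta)$. (93)

  33. [33]

    The induced generator-side sample weighting is $\gamma^{\mathrm{LSGAN}}_i = -\frac{1}{N_r}\big(f_t(r_i) - 1\big)\,\partial_r f_t(r_i)$. (94) Proof. Starting from (86), we rewrite the discriminator loss with respect to the empirical measure $\hat\gamma^r_\theta$. Since $\hat\gamma^r_\theta = \frac{1}{2}\hat\mu^r_\theta + \frac{1}{2}\delta_0$, we may express $L^{\mathrm{LS}}_D$ as $L^{\mathrm{LS}}_D(f) = \int \Big[-\frac{1}{2}\frac{d\hat\mu^r_\theta}{d\hat\gamma^r_\theta}\,f^2 - \frac{1}{2}\frac{d\delta_0}{d\hat\gamma^r_\theta}\,(f-1)^2\Big]\,d\hat\gamma^r_\theta$. (95) Taking the functional derivative...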

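  34. [34]

    The $L^2(\hat\gamma^r_\theta)$ functional gradient of $L^{\mathrm{GAN}}_D$ is $\nabla_{\hat\gamma^r_\theta} L^{\mathrm{GAN}}_D(f) = 2\big(\rho_\theta - \sigma(f)\big)$. (105)

  35. [35]

    Consequently, the discriminator NTK flow is $\partial_t f_t = 2\,T_{k_D,\hat\gamma^r_\theta}\big(\rho_\theta - \sigma(f_t)\big)$. (106)

  36. [36]

    If $f_\infty$ is a stationary point of (106), then $T_{k_D,\hat\gamma^r_\theta}\big(\rho_\theta - \sigma(f_\infty)\big) = 0$. (107) In particular, if the kernel operator is injective on the empirical support, then $\sigma(f_\infty) = \rho_\theta$ on $\mathrm{supp}(\hat\gamma^r_\theta)$. (108)

  37. [37]

    Near a stationary point $f_\infty$, writing $f_t = f_\infty + h_t$ with $\|h_t\| \ll 1$, one has the first-order approximation $\partial_t h_t = -2\,T_{k_D,\hat\gamma^r_\theta}\big[\sigma'(f_\infty)\,h_t\big] + O(\|h_t\|^2)$. (109)

  38. [38]

    The induced generator-side sample weighting is $\gamma^{\mathrm{GAN}}_i = \frac{1}{N_r}\,\frac{1}{1 - D_t(r_i)}\,\partial_r D_t(r_i) = \frac{1}{N_r}\,D_t(r_i)\,\partial_r f_t(r_i)$. (110) Proof. Rewriting (102) over the empirical measure $\hat\gamma^r_\theta$, we have $L^{\mathrm{GAN}}_D(f) = \int \Big[\frac{d\delta_0}{d\hat\gamma^r_\theta}\,\log\sigma(f) + \frac{d\hat\mu^r_\theta}{d\hat\gamma^r_\theta}\,\log\big(1 - \sigma(f)\big)\Big]\,d\hat\gamma^r_\theta$. (111) Differentiating with respect to $f$ yields $\nabla_{\hat\gamma^r_\theta} L^{\mathrm{GAN}}_D(f) = \frac{d\delta_0}{d\hat\gamma^r_\theta}\,\frac{\sigma'(f)}{\sigma(f)} - \frac{d\hat\mu^r_\theta}{d\hat\gamma^r_\theta}\,\frac{\sigma'(f)}{1 - \cdots}$ ...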

  39. [39]

    Since $H$ is symmetric positive semidefinite, all eigenvalues of $M$ are real and nonnegative. Moreover, if $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_{N_r}(A) \ge 0$ (156) denote the eigenvalues of $A$, then for each $j = 1, \dots, N_r$, $\lambda_{\min}(K)\,\lambda_j(A) \le \lambda_j(M) = \lambda_j(H) \le \lambda_{\max}(K)\,\lambda_j(A)$. (157)

  40. [40]

    Since $H = K^{1/2} A K^{1/2}$ is symmetric positive semidefinite, its spectrum is real and nonnegative, and therefore the same is true for $M$.

    Introducing the transformed variable $z(t) := K^{-1/2} r(t)$, (158) the dynamics (153) become $\dot z(t) = -H z(t)$. (159) Therefore, $\|z(t)\|^2 \le e^{-2\lambda_{\min}(H)t}\,\|z(0)\|^2$. (160) Since $\lambda_{\min}(K)\|z\|^2 \le \|r\|^2 \le \lambda_{\max}(K)\|z\|^2$, (161) the residual energy satisfies $E(t) \le \kappa(K)\,E(0)\,e^{-2\lambda_{\min}(H)t}$, $\kappa(K) := \lambda_{\max}(K)/\lambda_{\min}(K)$. (162) In particular, using (157), $E(t) \le \kappa(K)\,E(0)\,e^{-2\lambda_{\min}(K)\lambda_{\min}(A)t}$. (163) Pr...

  41. [41]

    The residual dynamics in modal coordinates are $\dot{\tilde r}(t) = \Lambda\,\tilde\gamma(t)$, (167) that is, $\dot{\tilde r}_j(t) = \lambda_j\,\tilde\gamma_j(t)$, $j = 1, \dots, N_r$. (168)

  42. [42]

    If the discriminator-induced weighting is modewise aligned in the sense that $\tilde\gamma_j(t) = -a_j(t)\,\tilde r_j(t)$, $a_j(t) \ge a_* > 0$, (169) then each mode satisfies $\dot{\tilde r}_j(t) = -\lambda_j a_j(t)\,\tilde r_j(t)$, (170) and therefore $|\tilde r_j(t)| \le |\tilde r_j(0)|\,e^{-\lambda_j a_* t}$. (171) Consequently, the residual energy satisfies $E(t) \le \frac{1}{2}\sum_{j=1}^{N_r} \tilde r_j(0)^2\,e^{-2\lambda_j a_* t}$. (172)

  43. [43]

    More generally, if the feedback remains descent-oriented but no longer scales linearly with the residual, and one only has $r(t)^\top K^G_{rr}\,\gamma(t) \le -c\,\|r(t)\|^{2\alpha}$, $\alpha > 1$, (173) then the residual energy obeys the algebraic decay estimate $E(t) \le C(1 + t)^{-1/(\alpha - 1)}$ (174) for some constant $C > 0$. Proof. Since $\dot r(t) = K^G_{rr}\,\gamma(t)$, (175) left-multiplying by $U^\top$ and using $K^G_{rr} = U \Lambda U^\top$ gives $\dot{\tilde r}(t) = U^\top \dot r(t) = U^\top K^G_{rr}\,\gamma(t) = U^\top U \Lambda U^\top \gamma(t) = \Lambda\,\tilde\gamma(t)$, (176) which proves (167).

  44. [44]

    If $r(t)^\top K^G_{rr}(t)\,\gamma(t) \ge 0$ (182) on some time interval, then $E(t)$ fails to decrease monotonically on that interval. In the strictly positive case, the residual energy increases.

  45. [45]

    If $\|\gamma(t)\| \to 0$ (183) while $\|K^G_{rr}(t)\|$ remains bounded, then $\|\dot r(t)\| \le \|K^G_{rr}(t)\|\,\|\gamma(t)\| \to 0$. (184) Thus the first-order residual dynamics stall. If this happens before $r(t)$ becomes small, training enters a plateau regime.

  46. [46]

    Let $s(t) := r(t)^\top K^G_{rr}(t)\,\gamma(t)$. (185) If $s(t)$ changes sign repeatedly on a time interval, namely if there exist sequences $t_1 < t_2 < t_3 < \cdots$ (186) such that $s(t_{2m-1}) > 0$, $s(t_{2m}) < 0$, (187) for all admissible $m$, then the residual-energy slope alternates between ascent-oriented and descent-oriented phases. In this case, $E(t)$ does not exhibit a stable monotone deca...

  47. [47]

    The generator gradient is $\nabla_\theta L^{(X2)}_G = \frac{2}{N_r}\sum_{i=1}^{N_r} r_i\,R'\big(f^{(X2)}_i\big)\,\partial_s f^{(X2)}_i\,\nabla_\theta r_i$, (195) where $f^{(X2)}_i := f(s_i;\phi)$ and $\partial_s f^{(X2)}_i := \partial_s f(s_i;\phi)$.

  48. [48]

    The generator gradient flow becomes $\dot\theta = -\frac{2}{N_r}\sum_{i=1}^{N_r} r_i\,R'\big(f^{(X2)}_i\big)\,\partial_s f^{(X2)}_i\,\nabla_\theta r_i$. (196)

  49. [49]

    Let $J_r$ denote the residual Jacobian and $K^G_{rr}(\theta) = J_r J_r^\top$ the generator residual NTK. Then the residual dynamics satisfy $\dot r = -K^G_{rr}(\theta)\,\gamma^{(X2)}$, (197) where $\gamma^{(X2)}_i = \frac{2}{N_r}\,r_i\,R'\big(f(r_i^2;\phi)\big)\,\partial_s f(r_i^2;\phi)$. (198)

  50. [50]

    Defining $\tilde\gamma_i := -\frac{2}{N_r}\,R'\big(f(r_i^2;\phi)\big)\,\partial_s f(r_i^2;\phi)$, (199) and $\tilde\Gamma = \mathrm{diag}(\tilde\gamma_1, \dots, \tilde\gamma_{N_r})$, (200) one has $\gamma^{(X2)} = -\tilde\Gamma r$, (201) and therefore $\dot r = K^G_{rr}(\theta)\,\tilde\Gamma r$. (202)

  51. [51]

    The corresponding residual-energy law is $\dot E = r^\top K^G_{rr}(\theta)\,\tilde\Gamma r$. (203) Proof. Since $s_i = r_i^2$, we have $\nabla_\theta s_i = 2 r_i \nabla_\theta r_i$. Therefore, $\nabla_\theta L^{(X2)}_G = \frac{1}{N_r}\sum_{i=1}^{N_r} R'\big(f(s_i;\phi)\big)\,\partial_s f(s_i;\phi)\,\nabla_\theta s_i = \frac{2}{N_r}\sum_{i=1}^{N_r} r_i\,R'\big(f^{(X2)}_i\big)\,\partial_s f^{(X2)}_i\,\nabla_\theta r_i$, (204) which proves the first statement. The gradient-flow equation (196) follows immediately. Next, multiplying by the residual Jacobi...

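  52. [52]

    If the discriminator input is the residual itself, so that $\dot r = K \Gamma \mathbf{1}$, (212) then the modal coefficients satisfy $\dot c_j = u_j^\top \dot r = \lambda_j\,u_j^\top \Gamma \mathbf{1}$. (213)

  53. [53]

    If the discriminator input is the squared residual, so that $\dot r = K \tilde\Gamma r$, (214) then the modal coefficients satisfy $\dot c_j = u_j^\top \dot r = \lambda_j \sum_{k=1}^{N_r} c_k\,u_j^\top \tilde\Gamma u_k$. (215) Proof. For the residual-input dynamics, $\dot r = K \Gamma \mathbf{1}$, we compute $\dot c_j = u_j^\top \dot r = u_j^\top K \Gamma \mathbf{1} = \lambda_j\,u_j^\top \Gamma \mathbf{1}$, (216) which proves (213). For the squared-residual-input dynamics, $\dot r = K \tilde\Gamma r$, we similarly obtain $\dot c_j = u_j^\top$ ...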

  54. [54]

    Since $H$ is symmetric, all eigenvalues of $M$ are real.

  55. [55]

    Since $H$ is congruent to $\tilde\Gamma$, Sylvester’s law of inertia implies that $H$ and $\tilde\Gamma$ have the same inertia, and therefore so does $M$.

  56. [56]

    If $\tilde\Gamma \preceq 0$, then all eigenvalues of $M$ are nonpositive; if $\tilde\Gamma \succeq 0$, then all eigenvalues of $M$ are nonnegative; if $\tilde\Gamma$ is indefinite, then $M$ has both positive and negative eigenvalues. Proof. The similarity relation follows directly from $K^{-1/2} M K^{1/2} = K^{-1/2}(K \tilde\Gamma) K^{1/2} = K^{1/2} \tilde\Gamma K^{1/2} = H$. (219) Hence $M$ and $H$ have the same eigenvalues. Since $H$ is symmetric whenever $\tilde\Gamma$ is symme...

  57. [57]

    If $\tilde\Gamma \succeq 0$, then the extreme eigenvalues of $M$ satisfy $\lambda_{\min}(M) = \lambda_{\min}(H) \ge \lambda_{\min}(K)\,\lambda_{\min}(\tilde\Gamma)$ and $\lambda_{\max}(M) = \lambda_{\max}(H) \le \lambda_{\max}(K)\,\lambda_{\max}(\tilde\Gamma)$. (220)

  58. [58]

    Introduce $z := K^{-1/2} r$. (221) Then the squared-residual-input dynamics become $\dot z = H z$, (222) and the residual energy satisfies $\lambda_{\min}(K)\|z\|^2 \le \|r\|^2 \le \lambda_{\max}(K)\|z\|^2$. (223)

  59. [59]

    If $\tilde\Gamma \preceq 0$, then $H \preceq 0$. In the strictly negative definite case $\tilde\Gamma \prec 0$, $E(t) \le \kappa(K)\,E(0)\,e^{2\lambda_{\max}(H)t} \le \kappa(K)\,E(0)\,e^{-2\lambda_{\min}(K)|\lambda_{\max}(\tilde\Gamma)|t}$, (224) where $\lambda_{\max}(H) < 0$ and $\kappa(K) := \lambda_{\max}(K)/\lambda_{\min}(K)$. (225)

  60. [60]

    If $\tilde\Gamma \succeq 0$, then the transformed energy $\frac{1}{2}\|z(t)\|^2$ is nondecreasing. In the strictly positive definite case $\tilde\Gamma \succ 0$, the transformed dynamics are expansive, and the original residual energy admits the lower bound $E(t) \ge \kappa(K)^{-1} E(0)\,e^{2\lambda_{\min}(H)t} \ge \kappa(K)^{-1} E(0)\,e^{2\lambda_{\min}(K)\lambda_{\min}(\tilde\Gamma)t}$. (226)

  61. [61]

    If $\tilde\Gamma$ is indefinite, then some modes decay while others grow, and the residual energy is not guaranteed to be monotone. Proof. For the extreme eigenvalue bounds, note that $H = K^{1/2} \tilde\Gamma K^{1/2}$ is symmetric. If $\tilde\Gamma \succeq 0$, then for any $x \ne 0$, $x^\top H x = (K^{1/2}x)^\top \tilde\Gamma (K^{1/2}x) \ge \lambda_{\min}(\tilde\Gamma)\,\|K^{1/2}x\|^2 \ge \lambda_{\min}(K)\,\lambda_{\min}(\tilde\Gamma)\,\|x\|^2$. (227) Taking the minimum over unit vectors $x$ yields λmin(H) ≥ λmin(K...

  62. [62]

    the self-kernel blocks $K^G_{rr}$, $K^G_{bb}$, $K^G_{00}$,

  63. [63]

    the cross-kernel blocks $K^G_{rb}$, $K^G_{r0}$, $K^G_{b0}$,

  64. [64]

    the three discriminator-induced sample-weight vectors $\gamma^{(r)}$, $\gamma^{(b)}$, $\gamma^{(0)}$. This means that improving one channel adversarially may either help or hurt the others, depending on the sign and structure of the corresponding cross-kernel couplings. E.7 Constant-kernel regime and NTK interpretation. In the infinite-width or lazy-training regime, it is natural to...
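
The similarity argument quoted in the extracted fragments (219) and (224), namely that M = K Γ̃ shares its spectrum with the symmetric matrix H = K^{1/2} Γ̃ K^{1/2}, is easy to sanity-check numerically. The sketch below uses a random symmetric positive definite `K` and a hypothetical negative diagonal weighting; both are synthetic stand-ins, not quantities from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                     # symmetric positive definite kernel
gamma = -rng.uniform(0.5, 2.0, size=n)          # hypothetical negative diagonal weights
Gamma = np.diag(gamma)

lam, U = np.linalg.eigh(K)
K_half = U @ np.diag(np.sqrt(lam)) @ U.T        # symmetric square root of K

M = K @ Gamma                                    # generally non-symmetric
H = K_half @ Gamma @ K_half                      # symmetric and similar to M
eig_M = np.sort(np.linalg.eigvals(M).real)
eig_H = np.linalg.eigvalsh(H)                    # ascending order

print(eig_M)
print(eig_H)
```

With a negative definite weighting every eigenvalue of H is negative, and the largest one is bounded above by λmin(K) times the largest (least negative) weight, which is the rate that appears in the exponent of (224).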