
arxiv: 2605.08856 · v2 · pith:FK2RKBBYnew · submitted 2026-05-09 · 💻 cs.LG

Controlling Transient Amplification Improves Long-horizon Rollouts

Pith reviewed 2026-05-19 17:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords autoregressive models, long-horizon rollouts, transient amplification, commutativity regularization, Jacobian analysis, neural simulators, physical system prediction, out-of-distribution generalization

The pith

Non-normal and non-commuting Jacobians along rollout trajectories cause transient error amplification and long-horizon drift even in asymptotically stable systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive neural simulators match classical solvers on short predictions of physical systems but lose accuracy rapidly over long horizons. The paper traces the drift to transient amplification of perturbations whenever the Jacobians encountered along a trajectory are non-normal and fail to commute with one another. Commutativity regularization adds two penalties, one on the normality defect of each Jacobian and one on the commutator norm between consecutive Jacobians, both estimated cheaply with Jacobian-vector products. These penalties incur no extra cost at inference time and come with a propagator bound that quantifies rollout error under approximate normality and commutativity. On 1D and 2D spatio-temporal tasks and on ERA5 climate data the regularized models remain stable for thousands of steps where unregularized baselines diverge, with the largest gains appearing on out-of-distribution initial conditions.
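
The mechanism can be written out as a short derivation. This is a sketch in notation assumed here rather than taken from the paper: F is the one-step model, x_t the rollout state, e_t the perturbation, and \lambda_n, \lambda_c the two penalty weights. The Frobenius norm and the squared form of the penalties are illustrative choices; the paper's exact definitions may differ.

```latex
% One-step model F, rollout x_{t+1} = F(x_t), perturbation e_t around the rollout.
e_{t+1} = F(x_t + e_t) - F(x_t) = J_t e_t + O(\|e_t\|^2),
\qquad J_t = \frac{\partial F}{\partial x}(x_t)

% To first order the error is carried by the propagator, an ordered product of Jacobians.
e_T \approx \Phi_T e_0, \qquad \Phi_T = J_{T-1} J_{T-2} \cdots J_0

% Quantities targeted by the two penalties (Frobenius norm assumed).
\text{normality defect: } \|J_t^\top J_t - J_t J_t^\top\|_F,
\qquad
\text{commutator norm: } \|J_{t+1} J_t - J_t J_{t+1}\|_F

% Assumed form of the training objective.
\mathcal{L} = \mathcal{L}_{\text{one-step}}
  + \lambda_n \|J_t^\top J_t - J_t J_t^\top\|_F^2
  + \lambda_c \|J_{t+1} J_t - J_t J_{t+1}\|_F^2
```

When every J_t is normal and all pairs commute, the Jacobians share a unitary eigenbasis, so \|\Phi_T\|_2 \le \rho^T with \rho the largest eigenvalue modulus along the trajectory, and no transient growth is possible for \rho \le 1.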

Core claim

When the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in rollout drift even when the overall system is asymptotically stable. Commutativity regularization combines two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps; the penalties are estimated with Jacobian-vector products and come with a propagator bound that quantifies rollout error under approximate commutativity and normality.
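
A toy example may help make transient amplification concrete. The sketch below is not from the paper: it uses two hand-picked 2x2 matrices (`normal_a`, `nonnormal_a`, both hypothetical) with the same spectral radius 0.9 and tracks the norm of a perturbation under repeated application. It illustrates only the non-normal half of the claim, with a single fixed Jacobian.

```python
import math

def matvec(a, v):
    """Multiply a 2x2 matrix (nested lists) by a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

def error_norms(a, e0, steps):
    """Return [||e_0||, ||A e_0||, ..., ||A^steps e_0||] for the linear map e -> A e."""
    norms, e = [math.hypot(*e0)], list(e0)
    for _ in range(steps):
        e = matvec(a, e)
        norms.append(math.hypot(*e))
    return norms

# Both matrices have spectral radius 0.9, so both are asymptotically stable.
normal_a = [[0.9, 0.0], [0.0, 0.9]]     # normal: A^T A == A A^T
nonnormal_a = [[0.9, 5.0], [0.0, 0.9]]  # non-normal: large off-diagonal coupling

e0 = [0.0, 1.0]
normal_norms = error_norms(normal_a, e0, 200)
nonnormal_norms = error_norms(nonnormal_a, e0, 200)

print(f"normal:     peak {max(normal_norms):.2f}, final {normal_norms[-1]:.2e}")
print(f"non-normal: peak {max(nonnormal_norms):.2f}, final {nonnormal_norms[-1]:.2e}")
```

The normal matrix shrinks the perturbation at every step, while the non-normal one first amplifies it by more than an order of magnitude before the eventual decay sets in. In a learned simulator, a transient of that size can push the state out of the training distribution long before asymptotic stability helps.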

What carries the argument

Commutativity regularization, the combination of penalties on Jacobian normality defect and inter-step commutator norm, estimated via Jacobian-vector products with no inference-time overhead.
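
One way such penalties could be estimated from matrix-vector products alone is with random probe vectors. The sketch below is an assumption-laden illustration, not the authors' implementation: explicit small matrices stand in for Jacobians, `matvec` stands in for a Jacobian-vector product, the normality estimate additionally assumes transpose products are available, and all function names are hypothetical.

```python
import random

def matvec(a, v):
    """Dense matrix-vector product on nested lists (stand-in for a JVP)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sq_norm(v):
    return sum(x * x for x in v)

def normality_penalty(j, probes):
    """Estimate ||J^T J - J J^T||_F^2 as the mean of ||(J^T J - J J^T) v||^2."""
    jt = transpose(j)  # assumption: transpose products are available
    total = 0.0
    for v in probes:
        left = matvec(jt, matvec(j, v))   # J^T J v
        right = matvec(j, matvec(jt, v))  # J J^T v
        total += sq_norm([l - r for l, r in zip(left, right)])
    return total / len(probes)

def commutator_penalty(j1, j2, probes):
    """Estimate ||J1 J2 - J2 J1||_F^2 using only matrix-vector products."""
    total = 0.0
    for v in probes:
        left = matvec(j1, matvec(j2, v))   # J1 J2 v
        right = matvec(j2, matvec(j1, v))  # J2 J1 v
        total += sq_norm([l - r for l, r in zip(left, right)])
    return total / len(probes)

rng = random.Random(0)
# Rademacher probes satisfy E[v v^T] = I, so E||M v||^2 = ||M||_F^2.
probes = [[rng.choice((-1.0, 1.0)) for _ in range(2)] for _ in range(16)]

j_a = [[0.9, 5.0], [0.0, 0.9]]    # non-normal
j_b = [[0.5, 0.0], [1.0, 0.5]]    # does not commute with j_a
j_sym = [[0.9, 0.2], [0.2, 0.5]]  # symmetric, hence normal

print(normality_penalty(j_a, probes))
print(commutator_penalty(j_a, j_b, probes))
print(normality_penalty(j_sym, probes))
```

Neither estimator ever forms a Jacobian product explicitly, which is what keeps the cost to a few extra matrix-vector products per training step and leaves inference untouched. For these particular 2x2 matrices the defect and commutator happen to be diagonal, so every sign probe returns the exact squared Frobenius norm; in general the estimate is noisy and improves with more probes.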

If this is right

  • UNet and FNO variants achieve stable rollouts over thousands of steps on both synthetic and real 1D and 2D spatio-temporal data.
  • FourCastNet climate forecasts on ERA5 improve without any new training data.
  • The largest accuracy gains appear on out-of-distribution initial conditions where baseline models quickly leave the training distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transient-amplification mechanism may explain rollout drift in other autoregressive sequence tasks such as video prediction or long time-series forecasting.
  • The propagator bound supplies a quantitative tool for analyzing error growth in any approximately commuting dynamical model.
  • Monitoring Jacobian normality and commutator norms during training could serve as a practical diagnostic for future rollout instability.

Load-bearing premise

Linearization around the model's own rollout trajectories captures the dominant source of long-horizon error, and the two penalties can be tuned to reduce normality defect and commutator norm without creating new instabilities or degrading short-horizon accuracy.
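
The premise can be probed on a toy system by comparing the linearized perturbation with the true one. The sketch below is purely illustrative and not drawn from the paper: the logistic map stands in for a learned one-step model, and `step`, `jacobian`, and `linearization_gap` are hypothetical names.

```python
R = 2.5  # logistic-map parameter; the fixed point x* = 0.6 is stable (|f'(x*)| = 0.5)

def step(x):
    """Toy one-step 'model': the logistic map."""
    return R * x * (1.0 - x)

def jacobian(x):
    """Derivative of step at x (a 1x1 Jacobian)."""
    return R * (1.0 - 2.0 * x)

def linearization_gap(x0, delta0, steps):
    """Relative gap between linearized and true perturbation after `steps` steps."""
    x, x_pert, delta_lin = x0, x0 + delta0, delta0
    for _ in range(steps):
        delta_lin = jacobian(x) * delta_lin  # propagate with the Jacobian on the rollout
        x, x_pert = step(x), step(x_pert)    # propagate both states with the full map
    delta_true = x_pert - x
    return abs(delta_lin - delta_true) / abs(delta_true)

small_gap = linearization_gap(0.3, 1e-6, steps=3)
large_gap = linearization_gap(0.3, 0.2, steps=3)
print(f"small perturbation gap: {small_gap:.2e}")
print(f"large perturbation gap: {large_gap:.2e}")
```

For the tiny perturbation the linearized prediction tracks the true error closely, while for the large one it is off by a sizeable fraction after only three steps. That is the regime in which Jacobian-based penalties are no longer guaranteed to target the dominant error source.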

What would settle it

A controlled experiment in which models with persistently high normality defects and large commutator norms nevertheless produce stable long-horizon rollouts, or in which regularization successfully lowers those quantities yet rollout drift remains unchanged.

Figures

Figures reproduced from arXiv: 2605.08856 by Adeel Pervez, Francesco Locatello.

Figure 1. Normal vs. non-normal transient growth on …
Figure 2. Latent advance architecture used by commutativity regularization. Other configu…
Figure 3. KdV rollout nMSE vs. time, averaged over 50 held-out test trajectories. The dashed …
Figure 4. KdV UNet variant rollout …
Figure 7. BVE rollout RMSE on the held-out test set. Setup. The backbone is a 2D UNet with circular padding, and an 8×8×256 bottleneck on which the regulariser acts. Both regimes are trained with one-step MSE on 200-frame trajectories. Further details are deferred to Appendix E. Result: the baseline destabilises inside the training window …
Figure 8. Barotropic vorticity unregularized (middle) and regularized (bottom) UNet rollout …
Figure 9. ERA5 rollout RMSE vs. lead time on t2m and z500 for held-out years 2018 and 2019 together with finetuning without regularization. Finetuning-only destroys long-horizon t2m; the same data with commutativity regularisation pulls well below the frozen FCN baseline. … three-year finetuning. Architecture, hyperparameters, visualizations and other details are in Appendix F. Plain finetuning worsens t2m, the regula…
Figure 10. SST rollout RMSE (normalised units) versus lead time (cf. Appendix G).
Figure 11. Latent block commutativity regularization. Use only for the FourCastNet …
Figure 12. KdV space–time plots of u(x, t) for representative in-distribution test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 5000 steps (UNet variants) from a single initial condition. Color scale is symmetric and shared per trajectory. D.7 Out-of-distribution per-trajectory snapshots. See …
Figure 13. KdV space–time plots of u(x, t) for representative OOD test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 5000 steps (UNet variants). … on the doubly-periodic domain [0, 2π]^2 with N=64 grid points per side. The solver is pseudospectral in space with the 2/3-rule dealiasing mask and RK4 in time. The Jacobian J(ψ, q) = ∂_x ψ ∂_y q − ∂_y ψ ∂_x q is evaluated in real space; all linear operations (Laplacian inversion, derivatives, hyperviscosity) are evaluated in spectral space. Physical and numerical parameters are summarised in …
Figure 14. KdV space–time plots of u(x, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 2000 steps (FNO variants). … sampled in spectral space with i.i.d. uniform phases on [0, 2π), then rescaled in real space so that the RMS vorticity is 1.5. Each trajectory is integrated through a 2 s spin-up that is discarded, after which 200 snapshots are recorded at Δt_out = 0.05 s, giving a 10 s recording window. Splits and normalisation. The dataset contains 300 trajectories, split deterministically into 240 / 30 / 30 for train / val …
Figure 15. Vorticity snapshots ζ(x, y, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition. Color scale is symmetric and shared between truth and predictions per trajectory; absolute error uses a separate scale. The baseline visibly amplifies vorticity to several times the natur…
Figure 16. Further vorticity snapshots ζ(x, y, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition.
Figure 17. Qualitative day-6, 8 and 10 rollout error snapshots of t2m on a single ERA5 2018 initial condition for 2m temperature. Land pixels are filled with 0 in the raw NetCDF and are masked only in the visualisations (Appendix G.6); the network sees them as zero pixels. Padding. The network requires spatial dimensions divisible by 2^4 = 16. We pad each frame from 180×360 to 192×384 using boundary-aware padding: 6 rows of reflect padding at each pole (no real wrap-around at the poles), and 12 columns of wrap (periodic) padding at the 0°/360° longitude se…
Figure 18. Spatial snapshots of the SST autoregressive rollout at selected lead times for the …
Original abstract

Autoregressive neural simulators now match classical solvers on short-horizon prediction of physical systems, yet their accuracy degrades rapidly when rolled out over long horizons. In this work, we identify transient amplification of perturbations around rollout trajectories as a structural mechanism driving rollout error. Using a linearization analysis we show that when the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in model rollout drift even when the overall system is asymptotically stable. Building on the analysis, we propose commutativity regularization: a combination of two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps. The penalties are estimated with Jacobian-vector products and have no inference-time cost. We show a propagator bound that quantifies rollout error under approximate commutativity and normality. We evaluate UNet and FNO variants with commutativity regularization on 1D and 2D spatio-temporal data in synthetic and real settings, showing successful long-horizon rollouts over thousands of steps. Further, we show that the method improves FourCastNet climate forecasts on ERA5 without using any new data. The gain is most pronounced out-of-distribution: trained on trajectories of a few hundred steps, regularized models remain in-distribution for thousands of rollout steps on initial conditions where baselines diverge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that non-normal and non-commuting Jacobians along autoregressive trajectories in neural simulators cause transient amplification of perturbations, leading to long-horizon rollout drift even in asymptotically stable systems. It introduces commutativity regularization (two penalties on normality defect and commutator norm, estimated via Jacobian-vector products) to mitigate this, derives a propagator bound quantifying error under approximate commutativity/normality, and reports improved long-horizon performance for UNet/FNO variants and FourCastNet on synthetic 1D/2D data and real ERA5 climate forecasts, with gains most evident out-of-distribution over thousands of steps.

Significance. If the central mechanism and regularization hold, the work supplies a practical, inference-free technique for stabilizing autoregressive neural simulators of physical systems, backed by a theoretical bound and strong empirical results on held-out long rollouts and real data. This could meaningfully advance reliable long-term forecasting in climate and fluid dynamics without requiring additional training data.

major comments (3)
  1. [§3] §3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.
  2. [§4] §4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.
  3. [§5] §5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.
minor comments (2)
  1. [Methods] Clarify the precise definitions and estimation procedures for the normality defect and commutator norm (including any approximations in the Jacobian-vector products) so that the penalties can be reproduced exactly.
  2. [Figures] In the rollout-error figures, add shaded regions or multiple seeds to indicate variability and confirm that the regularized models remain stable beyond the training horizon on the reported ERA5 initial conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the work's potential impact. We address each of the major comments below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.

    Authors: We acknowledge that the linearization is an approximation and that a direct comparison to nonlinear propagation would provide stronger validation. In the revised manuscript, we will add experiments that compare the error growth predicted by the linearized model to the actual nonlinear rollout errors on trajectories where perturbations have grown to moderate sizes, thereby testing the validity of the assumption in the relevant regime. revision: yes

  2. Referee: [§4] §4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.

    Authors: The propagator bound is derived under the small-perturbation assumption for which the linearization holds, so its dependence on perturbation size enters only implicitly through the error terms. We will revise the manuscript to state the dependence on the initial perturbation norm explicitly and to discuss the range of validity, including the regime where states deviate significantly, noting that the bound serves as a guiding theoretical tool rather than a tight prediction for large deviations. revision: partial

  3. Referee: [§5] §5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.

    Authors: We agree that additional details are necessary for reproducibility and to isolate the effect. In the revised version, we will expand the experimental section to include the hyperparameter search procedure, the specific values used, and additional ablation studies with alternative stability-promoting penalties under matched computational budgets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper performs a standard linearization of the autoregressive map along trajectories and derives a propagator bound from the resulting Jacobian properties (non-normality and non-commutativity). This is a conventional first-order analysis rather than a self-referential definition or a fitted quantity renamed as a prediction. The proposed commutativity regularization is an independent penalty term motivated by the analysis but not required for the bound itself to hold mathematically. Empirical results on held-out long-horizon rollouts and ERA5 data provide an external check that does not reduce to the derivation. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors are invoked as load-bearing steps. The chain remains independent of its target claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a linearization analysis whose validity depends on the trajectory remaining close enough to the true dynamics for first-order approximations to hold, plus the assumption that the chosen regularization weights can be selected without harming short-horizon performance.

free parameters (1)
  • regularization coefficients
    Two scalar weights balancing the normality and commutator penalties are hyperparameters whose values are chosen to achieve the reported gains.
axioms (1)
  • domain assumption: Local Jacobian properties along the rollout trajectory dominate long-term error accumulation even when the underlying dynamical system is asymptotically stable.
    Invoked in the linearization analysis that identifies transient amplification as the structural mechanism.

pith-pipeline@v0.9.0 · 5763 in / 1430 out tokens · 42534 ms · 2026-05-19T17:39:39.855830+00:00 · methodology

