Controlling Transient Amplification Improves Long-horizon Rollouts

Adeel Pervez; Francesco Locatello

Review: 3 major objections · 2 minor · 21 references

Non-normal and non-commuting Jacobians along rollout trajectories cause transient error amplification and long-horizon drift even in asymptotically stable systems.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-19 17:39 UTC pith:FK2RKBBY

Load-bearing objection: The commutativity regularization stabilizes long rollouts in practice on climate models, though the linearization story may not fully account for the improvement once deviations grow large.

arxiv 2605.08856 v2 pith:FK2RKBBY submitted 2026-05-09 cs.LG

classification cs.LG

keywords autoregressive models, long-horizon rollouts, transient amplification, commutativity regularization, Jacobian analysis, neural simulators, physical system prediction, out-of-distribution generalization

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive neural simulators match classical solvers on short predictions of physical systems but lose accuracy rapidly over long horizons. The paper traces the drift to transient amplification of perturbations whenever the Jacobians encountered along a trajectory are non-normal and fail to commute with one another. Commutativity regularization adds two penalties, one on the normality defect of each Jacobian and one on the commutator norm between consecutive Jacobians, both estimated cheaply with Jacobian-vector products. These penalties incur no extra cost at inference time and come with a propagator bound that quantifies rollout error under approximate normality and commutativity. On 1D and 2D spatio-temporal tasks and on ERA5 climate data the regularized models remain stable for thousands of steps where unregularized baselines diverge, with the largest gains appearing on out-of-distribution initial conditions.

Core claim

When the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in rollout drift even when the overall system is asymptotically stable. Commutativity regularization combines two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps; the penalties are estimated with Jacobian-vector products and come with a propagator bound that quantifies rollout error under approximate commutativity and normality.
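
To make the transient-amplification claim concrete, here is a minimal sketch on a hypothetical 2×2 example that is not taken from the paper. The toy Jacobians `J_NORMAL` and `J_NONNORMAL` (names and values are assumptions chosen for illustration) both have spectral radius 0.9, so every perturbation decays eventually, yet the non-normal one lets a perturbation grow by more than an order of magnitude first. Plain Python lists keep the sketch dependency-free.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def perturbation_norms(jacobians, delta0):
    """Norms of delta_t under the linearized recursion delta_{t+1} = J_t delta_t."""
    delta, norms = list(delta0), [norm(delta0)]
    for jac in jacobians:
        delta = matvec(jac, delta)
        norms.append(norm(delta))
    return norms

# Hypothetical toy Jacobians; both have spectral radius 0.9 (asymptotically stable).
J_NORMAL = [[0.9, 0.0], [0.0, 0.9]]     # normal: J J^T equals J^T J
J_NONNORMAL = [[0.9, 5.0], [0.0, 0.9]]  # non-normal: strong off-diagonal coupling

normal_norms = perturbation_norms([J_NORMAL] * 200, [0.0, 1.0])
nonnormal_norms = perturbation_norms([J_NONNORMAL] * 200, [0.0, 1.0])

print("peak norm, normal:     ", max(normal_norms))
print("peak norm, non-normal: ", max(nonnormal_norms))
print("final norm, non-normal:", nonnormal_norms[-1])
```

Because the Jacobian is constant here, consecutive Jacobians commute trivially, so the sketch isolates only the non-normality half of the claim. The non-commuting half would need a sequence of distinct matrices.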

What carries the argument

Commutativity regularization, the combination of penalties on Jacobian normality defect and inter-step commutator norm, estimated via Jacobian-vector products with no inference-time overhead.
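
The review does not give the paper's exact definitions of the two penalties, so the following is only a sketch under stated assumptions: the normality defect is taken to be the squared Frobenius norm of J^T J - J J^T, the commutator term the squared Frobenius norm of J_a J_b - J_b J_a, and both are estimated by averaging over random sign (Rademacher) probe vectors. Small explicit matrices stand in for network Jacobians; in a real model the products `J v` and `J^T v` would come from an autodiff library's Jacobian-vector and vector-Jacobian products. Note that this sketch needs a transpose product for the normality term, which is an assumption of the sketch and may differ from the paper's estimator. All function names are hypothetical.

```python
import itertools
import random

def matvec(m, v):
    """Matrix-vector product; stands in for a Jacobian-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

def normality_defect_sq(jac, probes):
    """Probe estimate of ||J^T J - J J^T||_F^2 (assumed definition)."""
    jac_t = transpose(jac)
    total = 0.0
    for v in probes:
        a = matvec(jac_t, matvec(jac, v))  # J^T (J v)
        b = matvec(jac, matvec(jac_t, v))  # J (J^T v)
        total += sq_norm([x - y for x, y in zip(a, b)])
    return total / len(probes)

def commutator_sq(jac_a, jac_b, probes):
    """Probe estimate of ||J_a J_b - J_b J_a||_F^2 (assumed definition)."""
    total = 0.0
    for v in probes:
        ab = matvec(jac_a, matvec(jac_b, v))  # J_a (J_b v)
        ba = matvec(jac_b, matvec(jac_a, v))  # J_b (J_a v)
        total += sq_norm([x - y for x, y in zip(ab, ba)])
    return total / len(probes)

def rademacher_probes(dim, count, seed=0):
    """Random +/-1 probe vectors, as a list of lists."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(count)]

# Hypothetical toy Jacobians for two consecutive steps.
J_A = [[0.9, 5.0], [0.0, 0.9]]
J_B = transpose(J_A)

# In two dimensions all four sign vectors can be enumerated,
# which makes the probe average equal to the Frobenius norm exactly.
ALL_SIGNS = [list(s) for s in itertools.product((-1.0, 1.0), repeat=2)]

print("normality defect:", normality_defect_sq(J_A, ALL_SIGNS))
print("commutator:      ", commutator_sq(J_A, J_B, ALL_SIGNS))
print("commutator (random probes):", commutator_sq(J_A, J_B, rademacher_probes(2, 64)))
```

For this toy pair both quantities come out at about 1250, while a diagonal matrix or a matrix paired with itself gives zero. In high dimensions the sign vectors cannot be enumerated, so a handful of random probes per minibatch would be the practical choice.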

Load-bearing premise

Linearization around the model's own rollout trajectories captures the dominant source of long-horizon error, and the two penalties can be tuned to reduce normality defect and commutator norm without creating new instabilities or degrading short-horizon accuracy.
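
The linearization this premise refers to can be written out as follows. This is a sketch in assumed notation (F for the one-step model map, assumed twice differentiable; x_t for the rollout state; J_t for the Jacobian of F at x_t; δ_t for a perturbation), using only standard facts rather than the paper's exact statements.

```latex
\begin{aligned}
\delta_{t+1} &= F(x_t + \delta_t) - F(x_t)
              = J_t\,\delta_t + O\!\left(\|\delta_t\|^2\right), \\
\delta_T &\approx \Phi_T\,\delta_0,
  \qquad \Phi_T = J_{T-1} J_{T-2} \cdots J_0, \\
\|\Phi_T\|_2 &\le \prod_{t=0}^{T-1} \|J_t\|_2
  \quad \text{(always)}, \\
\|\Phi_T\|_2 &\le \prod_{t=0}^{T-1} \rho(J_t)
  \quad \text{(if all } J_t \text{ are normal and commute pairwise)}.
\end{aligned}
```

For a non-normal J_t the spectral norm can exceed the spectral radius ρ(J_t), which is why the general bound permits transient growth even when every ρ(J_t) is below one. The premise is load-bearing because the first line drops the quadratic term, which is only justified while δ_t stays small.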

What would settle it

A controlled experiment in which models with persistently high normality defects and large commutator norms nevertheless produce stable long-horizon rollouts, or in which regularization successfully lowers those quantities yet rollout drift remains unchanged.

If this is right

UNet and FNO variants achieve stable rollouts over thousands of steps on both synthetic and real 1D and 2D spatio-temporal data.
FourCastNet climate forecasts on ERA5 improve without any new training data.
The largest accuracy gains appear on out-of-distribution initial conditions where baseline models quickly leave the training distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transient-amplification mechanism may explain rollout drift in other autoregressive sequence tasks such as video prediction or long time-series forecasting.
The propagator bound supplies a quantitative tool for analyzing error growth in any approximately commuting dynamical model.
Monitoring Jacobian normality and commutator norms during training could serve as a practical diagnostic for future rollout instability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The commutativity regularization stabilizes long rollouts in practice on climate models, though the linearization story may not fully account for the improvement once deviations grow large.

This paper links non-normal and non-commuting Jacobians in autoregressive neural simulators to transient error amplification, and shows that regularizing those properties with two penalties estimated via Jacobian-vector products improves long-horizon stability. The new part is the specific pair of penalties (one on the normality defect of each Jacobian, one on the commutator norm across steps) plus the propagator bound that quantifies the error under approximate normality and commutativity. The method adds no cost at inference.

On the positive side, the experiments demonstrate that regularized UNet and FNO models handle thousands of rollout steps on synthetic data, and the approach lifts FourCastNet performance on ERA5 data, particularly for out-of-distribution initial conditions where unregularized versions diverge quickly. The ERA5 result is useful because it requires no new data.

The main soft spot is the linearization. The analysis assumes perturbations stay small enough that the local Jacobians govern the dynamics, but in practice rollout drift quickly makes deviations large. The paper does not report a direct check of whether linearized predictions match full nonlinear error growth on failing trajectories, which leaves the causal story for why the regularization works less tight than it could be. Minor issues include the need for more detail on how the regularization coefficients were chosen and whether short-term accuracy is preserved across all cases.

This paper is for researchers developing learned simulators for fluids, climate, and other physical systems that need reliable multi-step forecasts. A practitioner looking for a simple add-on to stabilize existing architectures would find it worth trying. It deserves a serious referee because the real-data gains are concrete and the method is lightweight. I recommend sending it out for peer review.

Referee Report

3 major / 2 minor

Summary. The paper claims that non-normal and non-commuting Jacobians along autoregressive trajectories in neural simulators cause transient amplification of perturbations, leading to long-horizon rollout drift even in asymptotically stable systems. It introduces commutativity regularization (two penalties on normality defect and commutator norm, estimated via Jacobian-vector products) to mitigate this, derives a propagator bound quantifying error under approximate commutativity/normality, and reports improved long-horizon performance for UNet/FNO variants and FourCastNet on synthetic 1D/2D data and real ERA5 climate forecasts, with gains most evident out-of-distribution over thousands of steps.

Significance. If the central mechanism and regularization hold, the work supplies a practical technique with no inference-time cost for stabilizing autoregressive neural simulators of physical systems, backed by a theoretical bound and strong empirical results on held-out long rollouts and real data. This could meaningfully advance reliable long-term forecasting in climate and fluid dynamics without requiring additional training data.

major comments (3)

§3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.
§4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.
§5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.
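
The comparison requested in the §3 comment can be sketched on a toy problem. The map `step`, its analytic Jacobian, and all names below are hypothetical stand-ins for a trained simulator, chosen only to show the shape of the check: propagate a perturbation once through the linearized recursion and once through the full nonlinear map, then report the largest relative gap between the two.

```python
import math

def step(x):
    """Hypothetical toy map F: linear decay plus a tanh coupling."""
    return [0.9 * x[0] + 2.0 * math.tanh(x[1]), 0.9 * x[1]]

def jacobian(x):
    """Analytic Jacobian of `step` at x."""
    return [[0.9, 2.0 * (1.0 - math.tanh(x[1]) ** 2)], [0.0, 0.9]]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def max_relative_mismatch(x0, delta0, steps):
    """Largest relative gap between linearized and true error over a rollout."""
    x = list(x0)
    x_pert = [a + b for a, b in zip(x0, delta0)]
    delta_lin = list(delta0)
    worst = 0.0
    for _ in range(steps):
        # Linearized update uses the Jacobian at the unperturbed state x_t.
        delta_lin = matvec(jacobian(x), delta_lin)
        # True error is the difference of two full nonlinear rollouts.
        x, x_pert = step(x), step(x_pert)
        delta_true = [a - b for a, b in zip(x_pert, x)]
        gap = norm([a - b for a, b in zip(delta_true, delta_lin)])
        worst = max(worst, gap / norm(delta_true))
    return worst

small = max_relative_mismatch([0.0, 0.5], [0.0, 1e-6], steps=20)
large = max_relative_mismatch([0.0, 0.5], [0.0, 1.0], steps=20)
print("relative mismatch, small perturbation:", small)
print("relative mismatch, large perturbation:", large)
```

On this toy map the linearized prediction tracks the true error closely for the tiny perturbation and misses it by a large fraction for the order-one perturbation, which is the regime the referee is concerned about. Whether the same holds for the paper's models is exactly what the requested experiment would establish.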

minor comments (2)

[Methods] Clarify the precise definitions and estimation procedures for the normality defect and commutator norm (including any approximations in the Jacobian-vector products) so that the penalties can be reproduced exactly.
[Figures] In the rollout-error figures, add shaded regions or multiple seeds to indicate variability and confirm that the regularized models remain stable beyond the training horizon on the reported ERA5 initial conditions.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the work's potential impact. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: §3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.

Authors: We acknowledge that the linearization is an approximation and that a direct comparison to nonlinear propagation would provide stronger validation. In the revised manuscript, we will add experiments that compare the error growth predicted by the linearized model to the actual nonlinear rollout errors on trajectories where perturbations have grown to moderate sizes, thereby testing the validity of the assumption in the relevant regime.
Revision: yes

Referee: §4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.

Authors: The propagator bound is derived under the assumption of small perturbations where the linearization holds, and its explicit dependence on perturbation size is implicit in the error terms. We will revise the manuscript to explicitly state the dependence on the initial perturbation norm and discuss the range of validity, including when states deviate significantly, noting that the bound serves as a guiding theoretical tool rather than a tight prediction for large deviations.
Revision: partial

Referee: §5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.

Authors: We agree that additional details are necessary for reproducibility and to isolate the effect. In the revised version, we will expand the experimental section to include the hyperparameter search procedure, the specific values used, and additional ablation studies with alternative stability-promoting penalties under matched computational budgets.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper performs a standard linearization of the autoregressive map along trajectories and derives a propagator bound from the resulting Jacobian properties (non-normality and non-commutativity). This is a conventional first-order analysis rather than a self-referential definition or a fitted quantity renamed as a prediction. The proposed commutativity regularization is an independent penalty term motivated by the analysis but not required for the bound itself to hold mathematically. Empirical results on held-out long-horizon rollouts and ERA5 data provide an external check that does not reduce to the derivation. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors are invoked as load-bearing steps. The chain remains independent of its target claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a linearization analysis whose validity depends on perturbations around the model's rollout trajectory remaining small enough for first-order approximations to hold, plus the assumption that the chosen regularization weights can be selected without harming short-horizon performance.

free parameters (1)

regularization coefficients
Two scalar weights balancing the normality and commutator penalties are hyperparameters whose values are chosen to achieve the reported gains.

axioms (1)

domain assumption: Local Jacobian properties along the rollout trajectory dominate long-term error accumulation even when the underlying dynamical system is asymptotically stable.
Invoked in the linearization analysis that identifies transient amplification as the structural mechanism.

pith-pipeline@v0.9.0 · 5763 in / 1430 out tokens · 42534 ms · 2026-05-19T17:39:39.855830+00:00 · methodology

Original abstract

Autoregressive neural simulators now match classical solvers on short-horizon prediction of physical systems, yet their accuracy degrades rapidly when rolled out over long horizons. In this work, we identify transient amplification of perturbations around rollout trajectories as a structural mechanism driving rollout error. Using a linearization analysis we show that when the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in model rollout drift even when the overall system is asymptotically stable. Building on the analysis, we propose commutativity regularization: a combination of two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps. The penalties are estimated with Jacobian-vector products and have no inference-time cost. We show a propagator bound that quantifies rollout error under approximate commutativity and normality. We evaluate UNet and FNO variants with commutativity regularization on 1D and 2D spatio-temporal data in synthetic and real settings, showing successful long-horizon rollouts over thousands of steps. Further, we show that the method improves FourCastNet climate forecasts on ERA5 without using any new data. The gain is most pronounced out-of-distribution: trained on trajectories of a few hundred steps, regularized models remain in-distribution for thousands of rollout steps on initial conditions where baselines diverge.

Figures

Figures reproduced from arXiv: 2605.08856 by Adeel Pervez, Francesco Locatello.

**Figure 2.** Latent advance architecture used by commutativity regularization. Other configu…

**Figure 3.** KdV rollout nMSE vs. time, averaged over 50 held-out test trajectories. The dashed …

**Figure 4.** KdV UNet variant rollout …

**Figure 7.** BVE rollout RMSE on the held-out test set. Setup: the backbone is a 2D UNet with circular padding and an 8×8×256 bottleneck on which the regulariser acts. Both regimes are trained with one-step MSE on 200-frame trajectories. Further details are deferred to Appendix E. Result: the baseline destabilises inside the training window …

**Figure 8.** Barotropic vorticity unregularized (middle) and regularized (bottom) UNet rollout …

**Figure 9.** ERA5 rollout RMSE vs. lead time on t2m and z500 for held-out years 2018 and 2019 together with finetuning without regularization. Finetuning-only destroys long-horizon t2m; the same data with commutativity regularisation pulls well below the frozen FCN baseline. … three-year finetuning. Architecture, hyperparameters, visualizations and other details are in Appendix F. Plain finetuning worsens t2m, the regula…

**Figure 10.** SST rollout RMSE (normalised units) versus lead time (cf. Appendix G).

**Figure 11.** Latent block commutativity regularization. Use only for the FourCastNet …

**Figure 12.** KdV space–time plots of u(x, t) for representative in-distribution test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 5000 steps (UNet variants) from a single initial condition. Color scale is symmetric and shared per trajectory.

**Figure 13.** KdV space–time plots of u(x, t) for representative OOD test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 5000 steps (UNet variants). … on the doubly-periodic domain [0, 2π]² with N=64 grid points per side. The solver is pseudospectral in space with the 2/3-rule dealiasing mask and RK4 in time. The Jacobian J(ψ, q) = ∂xψ ∂yq − ∂yψ ∂xq is evaluated in …

**Figure 14.** KdV space–time plots of u(x, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, run for the full 2000 steps (FNO variants). … sampled in spectral space with i.i.d. uniform phases on [0, 2π), then rescaled in real space so that the RMS vorticity is 1.5. Each trajectory is integrated through a 2 s spin-up that is discarded, after which 200 snapshots…

**Figure 15.** Vorticity snapshots ζ(x, y, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition. Color scale is symmetric and shared between truth and predictions per trajectory; absolute error uses a separate scale. The baseline visibly amplifies vorticity to several times the natur…

**Figure 16.** Further vorticity snapshots ζ(x, y, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition.

**Figure 17.** Qualitative day-6, 8 and 10 rollout error snapshots of t2m on a single ERA5 2018 initial condition for 2m temperature. Land pixels are filled with 0 in the raw NetCDF and are masked only in the visualisations (Appendix G.6); the network sees them as zero pixels. Padding: the network requires spatial dimensions divisible by 2⁴ = 16. We pad each frame from 180×360 to 192×384 using boundary-aware padding: 6 rows of reflect padding at each pole (no real wrap-around at the poles), and 12 columns of wrap (periodic) padding at the 0°/360° longitude se…

**Figure 18.** Spatial snapshots of the SST autoregressive rollout at selected lead times for the …

[1] [1]

URLhttps://openreview.net/forum?id=MKP1g8wU0P. H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers, A. Simmons, C. Soci, S. Abdalla, X. Abellan, G. Balsamo, P. Bechtold, G. Biavati, J. Bidlot, M. Bonavita, G. De Chiara, P. Dahlgren, D. Dee, M. Diamantakis, R. Dragani, J. Flemming, R. ...

work page 1999

[2] [2]

Hersbach, B

doi: https://doi.org/10.1002/qj.3803. URLhttps://rmets.onlinelibrary. wiley.com/doi/abs/10.1002/qj.3803. R. A. Horn and C. R. Johnson.Matrix Analysis. Cambridge University Press, Cambridge,

work page doi:10.1002/qj.3803

[3] [3]

Benchmarking autoregressive conditional diffusion models for turbulent flow simulation

ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2026.108641. URL https://www.sciencedirect. com/science/article/pii/S0893608026001036. H.-O. Kreiss. Über die Stäbilitätsdefinition für Differenzengleichungen die partielle Differ- entialgleichungen approximieren.BIT Numerical Mathematics, 2(3):153–181,

work page doi:10.1016/j.neunet.2026.108641 2026

[4] [4]

doi: 10.1007/BF01957346. R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, Dec

work page doi:10.1007/bf01957346

[5] [5]

Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

doi: 10.1126/science.adi2336. URL https://www.science.org/doi/10.1126/science.adi2336. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anand- kumar. Fourier neural operator for parametric partial differential equations. InInter- national Conference on Learning Representations,

work page doi:10.1126/science.adi2336

[6] [6]

URLhttps://arxiv.org/abs/ 2010.08895. P. Lippe, B. S. Veeling, P. Perdikaris, R. E. Turner, and J. Brandstetter. PDE-refiner: Achieving accurate long rollouts with neural PDE solvers. InThirty-seventh Conference on Neural Information Processing Systems,

[7]

Differentiability in Unrolled Training of Neural Physics Simulators on Transient Dynamics. Computer Methods in Applied Mechanics and Engineering, 433:117441, 2025

ISSN 0045-7825. doi: 10.1016/j.cma.2024.117441. URL https://www.sciencedirect.com/science/article/pii/S0045782524006960. M. McCabe, P. Harrington, S. Subramanian, and J. Brown. Towards stability of autoregressive neural operators. Transactions on Machine Learning Research,

[8]

doi: 10.1007/s11071-005-2824-x. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: an imperative style, high-performance deep learning library. In Proceedings of...

[9]

2019, in Advances in Neural Information Processing Systems 32, 8024–8035, doi: 10.5555/3454287.3455008

URL https://dl.acm.org/doi/10.5555/3454287.3455008. J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2...

[10]

URL https://arxiv.org/abs/2202.11214. T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations (ICLR),

[11]

doi: 10.1175/1520-0442(2002)015<1609:AIISAS>2.0.CO;2. O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241,

[12]

doi: 10.1007/978-3-319-24574-4_28. W. J. Rugh. Nonlinear system theory. Johns Hopkins University Press Baltimore,

[13]

doi: 10.1146/annurev.fluid.38.050304.092139. D. Scieur, G. Gidel, Q. Bertrand, and F. Pedregosa. The curse of unrolling: Rate of differentiating through optimization. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems,

[14]

doi: 10.1126/science.261.5121.578. G. Wen, Z. Li, K. Azizzadenesheli, A. Anandkumar, and S. M. Benson. U-FNO–an enhanced Fourier neural operator-based deep-learning model for multiphase flow. Advances in Water Resources, page 104180,

[15]

A Data-Driven Approximation of the Koopman Operator: Extending Dynamic Mode Decomposition

doi: 10.1007/s00332-015-9258-5. Appendix A. Proof of Theorem 2.1. We first record the exact result when the Jacobians commute and are individually normal. Proposition A.1 (Exact commuting case). Let $J_0, \dots, J_{T-1} \in \mathbb{R}^{n \times n}$ be simultaneously diagonalisable as $J_t = U \Lambda_t U^\top$ for a common orthogonal matrix $U$ and diagonal $\Lambda_t = \operatorname{diag}(\lambda_{1,t}, \dots, \lambda_{n,t})$. If $\rho = \max$ ...

[16]

When $\varepsilon = \eta = 0$, each $J_t$ is normal and all pairs commute

Proof. When $\varepsilon = \eta = 0$, each $J_t$ is normal and all pairs commute. Commuting normal matrices are simultaneously diagonalizable by a common orthogonal matrix (Horn and Johnson, 1985), giving $\|\Phi_T\|_2 \le \rho^T$ by Proposition A.1. For $\varepsilon, \eta > 0$, the joint conditions (iii)–(iv) imply that $\{J_t\}$ lies within distance $\delta(\varepsilon, \eta)$ of the closed set of simultaneously orthogonally d...
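
As an illustrative numerical check (not taken from the paper), the sketch below builds commuting normal Jacobians that share one orthogonal basis and confirms that the propagator norm stays below $\rho^T$. It then contrasts this with a single non-normal matrix whose powers grow transiently even though its spectral radius is below one. It assumes $\rho$ is the largest eigenvalue magnitude over all steps, since the definition is cut off in this excerpt; the names `Phi`, `lams`, and `J_nonnormal` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 6, 20

# Commuting normal case: J_t = U diag(lam_t) U^T with one shared orthogonal U.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lams = rng.uniform(-0.9, 0.9, size=(T, n))   # eigenvalues of each J_t
rho = np.abs(lams).max()                     # assumed definition of rho

Phi = np.eye(n)
for lam in lams:
    J_t = (U * lam) @ U.T                    # U diag(lam) U^T
    Phi = J_t @ Phi                          # propagator J_{T-1} ... J_0

norm_commuting = np.linalg.norm(Phi, 2)      # spectral norm of the propagator
bound = rho ** T

# Non-normal case: spectral radius 0.9, but a large off-diagonal coupling.
J_nonnormal = np.array([[0.9, 5.0],
                        [0.0, 0.9]])
power_norms = [
    np.linalg.norm(np.linalg.matrix_power(J_nonnormal, k), 2)
    for k in range(1, 201)
]
peak = max(power_norms)                      # transient amplification peak
final = power_norms[-1]                      # asymptotic decay still wins

print(f"commuting normal: ||Phi_T||_2 = {norm_commuting:.3e}, rho^T = {bound:.3e}")
print(f"non-normal: peak ||J^k||_2 = {peak:.3f}, ||J^200||_2 = {final:.3e}")
```

The second case is the behaviour the regulariser is meant to suppress: the eigenvalues alone predict decay, yet the norm of the propagator exceeds one for many steps before decaying.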

[17]

with a single denoising network conditioned on the previous state, the current (noisy) prediction, and a step index $k \in \{0, \dots, M\}$. The backbone is identical to the UNet above; we use $M = 4$ refinement iterations per rollout step and the geometric noise schedule of Lippe et al. (2023) with $\sigma_{\min} = 10^{-7}$. PDE-Refiner therefore costs $M + 1 = 5\times$ backbone evalua...

[18]

but is overtaken by both alternatives by step ∼100 and is more than an order of magnitude worse than UNet+CR already inside the training window (step 200). PDE-Refiner is the strongest unregularised model up to step ∼1000, paying 5× inference-time cost; from step ∼2000 onwards UNet+CR overtakes it on both the in-distribution and the out-of-distribution split....

[19]

Color scale is symmetric and shared between truth and predictions per trajectory; absolute error uses a separate scale

Figure 15: Vorticity snapshots $\zeta(x, y, t)$ for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition. Color scale is symmetric and shared between truth and predictions per trajectory; absolute error uses a separate scale. The bas...

[20]

+ normality; $\lambda_c = 10^{-5}$; $\lambda_n = 10^{-5}$; JVP frequency: every minibatch (comm_freq=1); Skip blocks: first 10 AFNO blocks detached; Comm

Optimiser: AdamW
Peak learning rate: $5 \times 10^{-6}$
Weight decay: 0
Schedule: cosine annealing
Epochs: 50
Batch size: 2 (single-GPU, A100 80GB)
Loss (one-step): latitude-weighted MSE between $\hat{X}_{t+1}$ and ERA5
Regulariser: latent comm. + normality
$\lambda_c$: $10^{-5}$
$\lambda_n$: $10^{-5}$
JVP frequency: every minibatch (comm_freq=1)
Skip blocks: first 10 AFNO blocks detached
Comm. pair: adjacent-step pair (X_t,...
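
To make the regulariser rows above concrete, here is a minimal sketch of how commutator and normality penalties between adjacent-step Jacobians can be estimated from Jacobian-vector products alone. It is not the paper's implementation: the toy step map `tanh(W x)`, the hand-written `step_jvp` and `step_vjp`, the Rademacher probes, and the function name `cr_penalties` are all assumptions made for illustration. Only the weights $\lambda_c = \lambda_n = 10^{-5}$ come from the settings listed above. In a real model the products would come from the autodiff framework rather than closed-form expressions.

```python
import numpy as np

def step_jvp(W, x, v):
    """J(x) v for the toy step f(x) = tanh(W x), where J(x) = diag(1 - tanh^2(W x)) W."""
    return (1.0 - np.tanh(W @ x) ** 2) * (W @ v)

def step_vjp(W, x, u):
    """J(x)^T u for the same toy step."""
    return W.T @ ((1.0 - np.tanh(W @ x) ** 2) * u)

def cr_penalties(W, x_t, probes):
    """Probe averages of ||[J_{t+1}, J_t] v||^2 and ||(J_t^T J_t - J_t J_t^T) v||^2.

    If the probes satisfy mean(v v^T) = I, the averages equal the squared
    Frobenius norms of the commutator and of the normality defect.
    """
    x_next = np.tanh(W @ x_t)                       # adjacent rollout state
    comm, norm = 0.0, 0.0
    for v in probes:
        # Commutator applied to v: J_{t+1} J_t v - J_t J_{t+1} v.
        c = (step_jvp(W, x_next, step_jvp(W, x_t, v))
             - step_jvp(W, x_t, step_jvp(W, x_next, v)))
        # Normality defect applied to v: J_t^T J_t v - J_t J_t^T v.
        d = (step_vjp(W, x_t, step_jvp(W, x_t, v))
             - step_jvp(W, x_t, step_vjp(W, x_t, v)))
        comm += c @ c
        norm += d @ d
    return comm / len(probes), norm / len(probes)

rng = np.random.default_rng(0)
n = 5
W = 0.5 * rng.standard_normal((n, n))               # hypothetical toy weights
x = rng.standard_normal(n)
probes = rng.choice([-1.0, 1.0], size=(8, n))       # Rademacher probe vectors

comm_pen, norm_pen = cr_penalties(W, x, probes)
lam_c = lam_n = 1e-5                                # weights from the settings above
reg = lam_c * comm_pen + lam_n * norm_pen           # term added to the one-step loss
```

Neither Jacobian is ever formed explicitly, so the cost per probe is a handful of forward-mode and reverse-mode products, and nothing changes at inference time because the term only enters the training loss.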

[21]

The version we use contains 1727 weeks beginning in

G.1 Data and preprocessing Source. NOAA Optimum Interpolation Sea-Surface Temperature, weekly-mean product (sst.wkmean.1990-present) (Reynolds et al., 2002), covering the global ocean on a 1° (180×360) grid at weekly cadence. The version we use contains 1727 weeks beginning in
