Controlling Transient Amplification Improves Long-horizon Rollouts
Pith reviewed 2026-05-19 17:39 UTC · model grok-4.3
The pith
Non-normal and non-commuting Jacobians along rollout trajectories cause transient error amplification and long-horizon drift even in asymptotically stable systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in rollout drift even when the overall system is asymptotically stable. Commutativity regularization combines two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps; the penalties are estimated with Jacobian-vector products and come with a propagator bound that quantifies rollout error under approximate commutativity and normality.
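The transient-amplification mechanism can be seen on a toy example. The sketch below is not taken from the paper: the 2×2 matrix `J` is a hypothetical stand-in for a rollout Jacobian, chosen so that both eigenvalues are 0.9 (asymptotically stable) while a large off-diagonal entry makes it strongly non-normal. The norm of its powers grows for several steps before it decays.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius(a):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))

# Hypothetical toy Jacobian (an assumption for illustration only):
# eigenvalues are both 0.9, but the off-diagonal 5.0 makes it non-normal.
J = [[0.9, 5.0],
     [0.0, 0.9]]

def power_norms(jac, steps):
    """Return [||jac^1||_F, ..., ||jac^steps||_F]."""
    norms = []
    prop = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 identity
    for _ in range(steps):
        prop = matmul(jac, prop)
        norms.append(frobenius(prop))
    return norms

norms = power_norms(J, 200)
peak_step = max(range(len(norms)), key=norms.__getitem__) + 1
print(f"||J^1||_F   = {norms[0]:.2f}")
print(f"peak norm   = {max(norms):.2f} at step {peak_step}")
print(f"||J^200||_F = {norms[-1]:.2e}")
```

The eigenvalues alone would predict monotone decay at rate 0.9 per step; the transient growth comes entirely from non-normality, which is the quantity the regularizer targets.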
What carries the argument
Commutativity regularization, the combination of penalties on Jacobian normality defect and inter-step commutator norm, estimated via Jacobian-vector products with no inference-time overhead.
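For concreteness, the two penalized quantities can be written out on small explicit matrices. The definitions below are common Frobenius-norm forms and are an assumption; the summary does not state which norms or scalings the paper uses. The matrices `rotation` and `shear` are hypothetical examples.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    """Transpose of a matrix given as nested lists."""
    return [list(col) for col in zip(*a)]

def sub(a, b):
    """Entrywise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frobenius(a):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))

def normality_defect(j):
    """||J J^T - J^T J||_F, which is zero exactly when a real J is normal."""
    jt = transpose(j)
    return frobenius(sub(matmul(j, jt), matmul(jt, j)))

def commutator_norm(ja, jb):
    """||Ja Jb - Jb Ja||_F, which is zero exactly when Ja and Jb commute."""
    return frobenius(sub(matmul(ja, jb), matmul(jb, ja)))

# Hypothetical examples: a scaled rotation (normal) and a shear (non-normal).
rotation = [[0.0, -0.9], [0.9, 0.0]]
shear = [[0.9, 5.0], [0.0, 0.9]]

print("defect(rotation)      =", normality_defect(rotation))
print("defect(shear)         =", normality_defect(shear))
print("[rotation, shear] norm =", commutator_norm(rotation, shear))
```

Forming full Jacobians like this is only feasible for tiny systems; for a neural simulator the same quantities would have to be estimated from matrix-vector products.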
If this is right
- UNet and FNO variants achieve stable rollouts over thousands of steps on both synthetic and real 1D and 2D spatio-temporal data.
- FourCastNet climate forecasts on ERA5 improve without any new training data.
- The largest accuracy gains appear on out-of-distribution initial conditions where baseline models quickly leave the training distribution.
Where Pith is reading between the lines
- The same transient-amplification mechanism may explain rollout drift in other autoregressive sequence tasks such as video prediction or long time-series forecasting.
- The propagator bound supplies a quantitative tool for analyzing error growth in any approximately commuting dynamical model.
- Monitoring Jacobian normality and commutator norms during training could serve as a practical diagnostic for future rollout instability.
Load-bearing premise
Linearization around the model's own rollout trajectories captures the dominant source of long-horizon error, and the two penalties can be tuned to reduce normality defect and commutator norm without creating new instabilities or degrading short-horizon accuracy.
What would settle it
A controlled experiment in which models with persistently high normality defects and large commutator norms nevertheless produce stable long-horizon rollouts, or in which regularization successfully lowers those quantities yet rollout drift remains unchanged.
Original abstract
Autoregressive neural simulators now match classical solvers on short-horizon prediction of physical systems, yet their accuracy degrades rapidly when rolled out over long horizons. In this work, we identify transient amplification of perturbations around rollout trajectories as a structural mechanism driving rollout error. Using a linearization analysis we show that when the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in model rollout drift even when the overall system is asymptotically stable. Building on the analysis, we propose commutativity regularization: a combination of two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps. The penalties are estimated with Jacobian-vector products and have no inference-time cost. We show a propagator bound that quantifies rollout error under approximate commutativity and normality. We evaluate UNet and FNO variants with commutativity regularization on 1D and 2D spatio-temporal data in synthetic and real settings, showing successful long-horizon rollouts over thousands of steps. Further, we show that the method improves FourCastNet climate forecasts on ERA5 without using any new data. The gain is most pronounced out-of-distribution: trained on trajectories of a few hundred steps, regularized models remain in-distribution for thousands of rollout steps on initial conditions where baselines diverge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that non-normal and non-commuting Jacobians along autoregressive trajectories in neural simulators cause transient amplification of perturbations, leading to long-horizon rollout drift even in asymptotically stable systems. It introduces commutativity regularization (two penalties on normality defect and commutator norm, estimated via Jacobian-vector products) to mitigate this, derives a propagator bound quantifying error under approximate commutativity/normality, and reports improved long-horizon performance for UNet/FNO variants and FourCastNet on synthetic 1D/2D data and real ERA5 climate forecasts, with gains most evident out-of-distribution over thousands of steps.
Significance. If the central mechanism and regularization hold, the work supplies a practical, inference-free technique for stabilizing autoregressive neural simulators of physical systems, backed by a theoretical bound and strong empirical results on held-out long rollouts and real data. This could meaningfully advance reliable long-term forecasting in climate and fluid dynamics without requiring additional training data.
major comments (3)
- [§3] §3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.
- [§4] §4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.
- [§5] §5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.
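The comparison requested in the first major comment can be illustrated on a toy system. The sketch below is an assumption-laden stand-in, not the paper's experiment: `step` is a hypothetical two-dimensional map, and `compare` rolls out a perturbed and an unperturbed trajectory while propagating the initial perturbation through the Jacobians of the unperturbed trajectory.

```python
import math

def step(x):
    """Hypothetical toy autoregressive map (an assumption, not from the paper)."""
    return [math.tanh(0.9 * x[0] + 2.0 * x[1]), math.tanh(0.9 * x[1])]

def jacobian(x):
    """Analytic Jacobian of `step` evaluated at x."""
    s0 = 1.0 - math.tanh(0.9 * x[0] + 2.0 * x[1]) ** 2
    s1 = 1.0 - math.tanh(0.9 * x[1]) ** 2
    return [[0.9 * s0, 2.0 * s0], [0.0, 0.9 * s1]]

def compare(x0, delta, steps):
    """Return (nonlinear error norm, linearized error norm) after `steps`."""
    x = list(x0)
    xp = [a + d for a, d in zip(x0, delta)]  # perturbed initial state
    e = list(delta)                          # linearized error
    for _ in range(steps):
        j = jacobian(x)  # linearize around the unperturbed rollout
        e = [j[0][0] * e[0] + j[0][1] * e[1],
             j[1][0] * e[0] + j[1][1] * e[1]]
        x, xp = step(x), step(xp)
    true_err = math.hypot(xp[0] - x[0], xp[1] - x[1])
    lin_err = math.hypot(e[0], e[1])
    return true_err, lin_err

for size in (1e-6, 0.5):
    true_err, lin_err = compare([0.1, 0.1], [0.0, size], steps=5)
    gap = abs(true_err - lin_err) / lin_err
    print(f"perturbation {size:g}: nonlinear {true_err:.3e}, "
          f"linearized {lin_err:.3e}, relative gap {gap:.2e}")
```

On this toy map the linearized prediction tracks the nonlinear error closely for the tiny perturbation and overestimates it substantially for the large one, which is the regime the referee asks the authors to examine.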
minor comments (2)
- [Methods] Clarify the precise definitions and estimation procedures for the normality defect and commutator norm (including any approximations in the Jacobian-vector products) so that the penalties can be reproduced exactly.
- [Figures] In the rollout-error figures, add shaded regions or multiple seeds to indicate variability and confirm that the regularized models remain stable beyond the training horizon on the reported ERA5 initial conditions.
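One way a commutator penalty could be estimated from Jacobian-vector products alone is sketched below. This is not the paper's procedure, which the summary does not specify. Everything here is an assumption: the Hutchinson-style random probes, the central finite-difference `jvp` used in place of an autodiff JVP, and the toy map `toy_step`. The sketch covers only the commutator term.

```python
import math
import random

def jvp(f, x, v, eps=1e-6):
    """Central finite-difference stand-in for an autodiff JVP: J_f(x) @ v."""
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    return [(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))]

def commutator_penalty(f, x_a, x_b, probes=8, seed=0):
    """Estimate ||J_a J_b - J_b J_a||_F^2 with random sign probes.

    J_a and J_b are the Jacobians of f at x_a and x_b. Only products of a
    Jacobian with a vector are used; no Jacobian is ever formed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(probes):
        v = [rng.choice((-1.0, 1.0)) for _ in x_a]
        ab = jvp(f, x_a, jvp(f, x_b, v))  # J_a (J_b v)
        ba = jvp(f, x_b, jvp(f, x_a, v))  # J_b (J_a v)
        total += sum((p - q) ** 2 for p, q in zip(ab, ba))
    return total / probes

def toy_step(x):
    """Hypothetical one-step model used only to exercise the estimator."""
    return [math.tanh(0.9 * x[0] + 0.5 * x[1]),
            math.tanh(-0.5 * x[0] + 0.9 * x[1])]

x_t = [0.0, 0.0]
x_other = [1.0, -0.5]
print("same point      :", commutator_penalty(toy_step, x_t, x_t))
print("different points:", commutator_penalty(toy_step, x_t, x_other))
```

In a training loop the two evaluation points would presumably be states from adjacent rollout steps, and the finite-difference `jvp` would be replaced by the autodiff framework's forward-mode product so that the penalty is differentiable with respect to the model parameters.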
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the work's potential impact. We address each of the major comments below and outline the revisions we will make to the manuscript.
-
Referee: [§3] §3 (linearization analysis): the first-order linearization of the autoregressive map F around the model's own trajectory assumes perturbations remain small enough for higher-order Taylor terms to be negligible, yet the paper provides no direct comparison of linearized versus full nonlinear error propagation on trajectories that have already begun to diverge. This assumption is load-bearing for the claim that targeting Jacobian normality and commutativity will control the dominant source of long-horizon drift.
Authors: We acknowledge that the linearization is an approximation and that a direct comparison to nonlinear propagation would provide stronger validation. In the revised manuscript, we will add experiments that compare the error growth predicted by the linearized model to the actual nonlinear rollout errors on trajectories where perturbations have grown to moderate sizes, thereby testing the validity of the assumption in the relevant regime.
Revision: yes
-
Referee: [§4] §4 (propagator bound): the bound is stated to quantify rollout error under approximate commutativity and normality, but the manuscript does not report the explicit dependence on perturbation size or the conditions under which the bound remains predictive once states deviate by amounts comparable to the signal itself.
Authors: The propagator bound is derived under the assumption of small perturbations where the linearization holds, and its explicit dependence on perturbation size is implicit in the error terms. We will revise the manuscript to explicitly state the dependence on the initial perturbation norm and discuss the range of validity, including when states deviate significantly, noting that the bound serves as a guiding theoretical tool rather than a tight prediction for large deviations.
Revision: partial
-
Referee: [§5] §5 (experiments): hyperparameter selection for the two regularization coefficients and the precise baseline controls (e.g., equivalent compute or alternative stability penalties) are not detailed enough to isolate the contribution of reduced transient amplification from other possible effects on the observed long-horizon gains.
Authors: We agree that additional details are necessary for reproducibility and to isolate the effect. In the revised version, we will expand the experimental section to include the hyperparameter search procedure, the specific values used, and additional ablation studies with alternative stability-promoting penalties under matched computational budgets.
Revision: yes
Circularity Check
No significant circularity; derivation is self-contained
The paper performs a standard linearization of the autoregressive map along trajectories and derives a propagator bound from the resulting Jacobian properties (non-normality and non-commutativity). This is a conventional first-order analysis rather than a self-referential definition or a fitted quantity renamed as a prediction. The proposed commutativity regularization is an independent penalty term motivated by the analysis but not required for the bound itself to hold mathematically. Empirical results on held-out long-horizon rollouts and ERA5 data provide an external check that does not reduce to the derivation. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors are invoked as load-bearing steps. The chain remains independent of its target claims.
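The first-order analysis referred to here can be written out as follows. This is the textbook form of the linearization, not a reproduction of the paper's statement: the perturbation symbol $e_t$ is an assumption introduced for illustration, while $F$, $J_t$, $\Phi_T$ and $\rho$ follow the notation used elsewhere on this page.

```latex
% Autoregressive rollout and a perturbed copy of it
x_{t+1} = F(x_t), \qquad \tilde{x}_t = x_t + e_t .

% First-order Taylor expansion around the rollout trajectory
e_{t+1} = F(x_t + e_t) - F(x_t)
        = J_t\, e_t + O\!\left(\lVert e_t \rVert^2\right),
\qquad J_t = \left.\frac{\partial F}{\partial x}\right|_{x = x_t} .

% Dropping the higher-order terms gives the propagator
e_T \approx \Phi_T\, e_0 ,
\qquad \Phi_T = J_{T-1} J_{T-2} \cdots J_0 ,
\qquad \lVert e_T \rVert_2 \lesssim \lVert \Phi_T \rVert_2 \, \lVert e_0 \rVert_2 .
```

When the $J_t$ are normal and commute, the appendix excerpt quoted in the reference graph gives $\lVert \Phi_T \rVert_2 \le \rho^T$; without those properties $\lVert \Phi_T \rVert_2$ can be much larger at intermediate $T$ even if every $J_t$ has eigenvalues inside the unit circle.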
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization coefficients
axioms (1)
- domain assumption: Local Jacobian properties along the rollout trajectory dominate long-term error accumulation even when the underlying dynamical system is asymptotically stable.
Reference graph
Works this paper leans on
-
[1]
URL https://openreview.net/forum?id=MKP1g8wU0P. H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers, A. Simmons, C. Soci, S. Abdalla, X. Abellan, G. Balsamo, P. Bechtold, G. Biavati, J. Bidlot, M. Bonavita, G. De Chiara, P. Dahlgren, D. Dee, M. Diamantakis, R. Dragani, J. Flemming, R. ...
-
[2]
doi: https://doi.org/10.1002/qj.3803. URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.3803. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge,
-
[3]
ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2026.108641. URL https://www.sciencedirect.com/science/article/pii/S0893608026001036. H.-O. Kreiss. Über die Stabilitätsdefinition für Differenzengleichungen die partielle Differentialgleichungen approximieren. BIT Numerical Mathematics, 2(3):153–181,
-
[4]
doi: 10.1007/BF01957346. R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, Dec
-
[5]
doi: 10.1126/science.adi2336. URL https://www.science.org/doi/10.1126/science.adi2336. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations,
-
[6]
URL https://arxiv.org/abs/2010.08895. P. Lippe, B. S. Veeling, P. Perdikaris, R. E. Turner, and J. Brandstetter. PDE-refiner: Achieving accurate long rollouts with neural PDE solvers. In Thirty-seventh Conference on Neural Information Processing Systems,
-
[7]
ISSN 0045-7825. doi: 10.1016/j.cma.2024.117441. URL https://www.sciencedirect.com/science/article/pii/S0045782524006960. M. McCabe, P. Harrington, S. Subramanian, and J. Brown. Towards stability of autoregressive neural operators. Transactions on Machine Learning Research,
-
[8]
doi: 10.1007/s11071-005-2824-x. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: an imperative style, high-performance deep learning library. In Proceedings of...
-
[9]
URL https://dl.acm.org/doi/10.5555/3454287.3455008. J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2...
-
[10]
URL https://arxiv.org/abs/2202.11214. T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations (ICLR),
-
[11]
doi: 10.1175/1520-0442(2002)015<1609:AIISAS>2.0.CO;2. O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241,
-
[12]
doi: 10.1007/978-3-319-24574-4_28. W. J. Rugh. Nonlinear system theory. Johns Hopkins University Press, Baltimore,
-
[13]
doi: 10.1146/annurev.fluid.38.050304.092139. D. Scieur, G. Gidel, Q. Bertrand, and F. Pedregosa. The curse of unrolling: Rate of differentiating through optimization. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems,
-
[14]
doi: 10.1126/science.261.5121.578. G. Wen, Z. Li, K. Azizzadenesheli, A. Anandkumar, and S. M. Benson. U-FNO–an enhanced Fourier neural operator-based deep-learning model for multiphase flow. Advances in Water Resources, page 104180,
-
[15]
doi: 10.1007/s00332-015-9258-5.
Appendix A. Proof of Theorem 2.1
We first record the exact result when the Jacobians commute and are individually normal.
Proposition A.1 (Exact commuting case). Let $J_0, \dots, J_{T-1} \in \mathbb{R}^{n \times n}$ be simultaneously diagonalisable as $J_t = U \Lambda_t U^\top$ for a common orthogonal matrix $U$ and diagonal $\Lambda_t = \operatorname{diag}(\lambda_{1,t}, \dots, \lambda_{n,t})$. If $\rho = \max$ ...
-
[16]
Proof. When $\varepsilon = \eta = 0$, each $J_t$ is normal and all pairs commute. Commuting normal matrices are simultaneously diagonalizable by a common orthogonal matrix (Horn and Johnson, 1985), giving $\lVert \Phi_T \rVert_2 \le \rho^T$ by Proposition A.1. For $\varepsilon, \eta > 0$, the joint conditions (iii)–(iv) imply that $\{J_t\}$ lies within distance $\delta(\varepsilon, \eta)$ of the closed set of simultaneously orthogonally d...
-
[17]
with a single denoising network conditioned on the previous state, the current (noisy) prediction, and a step index $k \in \{0, \dots, M\}$. The backbone is identical to the UNet above; we use $M = 4$ refinement iterations per rollout step and the geometric noise schedule of Lippe et al. (2023) with $\sigma_{\min} = 10^{-7}$. PDE-Refiner therefore costs $M + 1 = 5\times$ backbone evalua...
-
[18]
but is overtaken by both alternatives by step ∼100 and is more than an order of magnitude worse than UNet+CR already inside the training window (step 200). PDE-Refiner is the strongest unregularised model up to step ∼1000, paying 5× inference-time cost; from step ∼2000 onwards UNet+CR overtakes it on both the in-distribution and the out-of-distribution split....
-
[19]
Figure 15: Vorticity snapshots ζ(x, y, t) for representative test trajectories: ground truth, baseline rollout, and commutativity-regularised rollout, all run for the full 199 rollout steps (∼10 s) from a single initial condition. Color scale is symmetric and shared between truth and predictions per trajectory; absolute error uses a separate scale. The bas...
-
[20]
Optimiser: AdamW
Peak learning rate: 5×10^−6
Weight decay: 0
Schedule: cosine annealing
Epochs: 50
Batch size: 2 (single-GPU, A100 80GB)
Loss (one-step): latitude-weighted MSE between X̂_{t+1} and ERA5
Regulariser: latent comm. + normality
λ_c: 10^−5
λ_n: 10^−5
JVP frequency: every minibatch (comm_freq=1)
Skip blocks: first 10 AFNO blocks detached
Comm. pair: adjacent-step pair (X_t,...
-
[21]
G.1 Data and preprocessing
Source. NOAA Optimum Interpolation Sea-Surface Temperature, weekly-mean product (sst.wkmean.1990-present) (Reynolds et al., 2002), covering the global ocean on a 1° (180×360) grid at weekly cadence. The version we use contains 1727 weeks beginning in