Stable Fine-Time-Step Long-Horizon Turbulence Prediction with a Multi-Stepsize Mixture-of-Experts Neural Operator
Pith reviewed 2026-05-10 14:10 UTC · model grok-4.3
The pith
Mixture-of-experts neural operators stay stable during long turbulence forecasts at fine time steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Ms-MoE-IFactFormer architecture conditions on the requested relative stride and uses a time-step router to activate scale-specific routed experts alongside a shared expert, so that a single model represents multiple time-advancement operators. On forced homogeneous isotropic turbulence and turbulent channel flow datasets with up to 20 times finer temporal resolution than in previous studies, this yields more stable long-horizon autoregressive predictions and better agreement with long-time-averaged statistics.
What carries the argument
Multi-stepsize mixture-of-experts (Ms-MoE) neural operator on an implicit factorized Transformer backbone, using a router that selects experts based on the input relative stride to handle different temporal scales.
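The routing idea can be illustrated with a minimal sketch. It is not the paper's implementation: the distance-based scoring rule, the `centers` list of preferred strides, the `sharpness` parameter, and the toy experts are all hypothetical stand-ins for a learned router and learned expert networks. The sketch only shows the control flow: the relative stride selects the routed experts, and the shared expert is always applied.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def route(s_rel, centers, top_k=1, sharpness=4.0):
    """Hypothetical router: score each routed expert by how close
    log(s_rel) is to the log of that expert's preferred stride."""
    logits = [-sharpness * (math.log(s_rel) - math.log(c)) ** 2 for c in centers]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize the gates over the selected experts only.
    return {i: probs[i] / norm for i in chosen}

def ms_moe_layer(x, s_rel, routed_experts, shared_expert, centers, top_k=1):
    """Shared expert output plus the gated sum of the selected routed experts."""
    gates = route(s_rel, centers, top_k)
    out = shared_expert(x)
    for i, g in gates.items():
        y = routed_experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Toy experts (assumptions): each routed expert just rescales its input.
centers = [1.0, 5.0, 20.0]
routed = [lambda x, a=a: [a * v for v in x] for a in (0.1, 0.5, 2.0)]
shared = lambda x: list(x)

out, gates = ms_moe_layer([1.0, -2.0], 20.0, routed, shared, centers)
print(gates, out)  # {2: 1.0} [3.0, -6.0]
```

With `top_k=1` and a relative stride of 20, only the expert whose preferred stride is 20 is activated, and its output is added to the shared expert's output.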
If this is right
- Long-horizon autoregressive rollouts remain stable at fine temporal resolutions instead of degrading quickly.
- Improved matching to time-averaged flow statistics on both homogeneous isotropic turbulence and channel flow cases.
- One architecture can serve as a family of stride-parameterized operators without retraining separate models.
- Opens the way for applying similar techniques to more complex turbulent flows beyond the tested cases.
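Why fine time steps stress an autoregressive surrogate can be seen from simple step counting. The sketch below uses made-up numbers (`horizon`, `coarse_dt`, `eps`) and a deliberately crude toy assumption: every model application multiplies the error by the same factor, regardless of step size. Real error growth in turbulence surrogates is more complicated, so this only illustrates the arithmetic.

```python
def rollout(step, u0, n_steps):
    """Apply a one-step model autoregressively, feeding each output back in."""
    states = [u0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

def steps_for_horizon(horizon, dt):
    """Number of model applications needed to cover a physical time horizon."""
    return round(horizon / dt)

horizon = 10.0            # hypothetical physical time to cover
coarse_dt = 0.2           # hypothetical coarse sampling interval
fine_dt = coarse_dt / 20  # 20 times finer temporal resolution

n_coarse = steps_for_horizon(horizon, coarse_dt)  # 50 steps
n_fine = steps_for_horizon(horizon, fine_dt)      # 1000 steps

# Toy assumption: each step amplifies the current error by (1 + eps).
eps = 1e-3
amplify = lambda e: e * (1.0 + eps)
coarse_err = rollout(amplify, 1.0, n_coarse)[-1]
fine_err = rollout(amplify, 1.0, n_fine)[-1]
print(n_coarse, n_fine, coarse_err < fine_err)  # 50 1000 True
```

Under this toy assumption the fine-step rollout needs 20 times as many model applications over the same horizon, so it has correspondingly more opportunities to accumulate error.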
Where Pith is reading between the lines
- If the routing mechanism generalizes, it could allow adaptive time-stepping in simulations where local flow features require varying resolutions.
- The approach might integrate with physics-informed constraints to further reduce drift in conserved quantities over long times.
- Testing on experimental data rather than filtered DNS could reveal sensitivity to noise or incomplete observations.
Load-bearing premise
That routing to stride-specific experts plus a shared expert based on relative stride is enough to control error accumulation in fine-step autoregressive rollouts for the range of turbulent flows considered.
What would settle it
A long-horizon rollout on the channel flow dataset that shows rapid growth in deviation from reference statistics or numerical instability within the tested time horizon.
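One way such a test could be quantified is sketched below, assuming predicted and reference fields are available as flat lists of values at matching times. The function names and the exponential-growth fit are illustrative choices, not metrics reported in the paper.

```python
import math
import statistics

def relative_l2(pred, ref):
    """Relative L2 error between a predicted and a reference field."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

def growth_rate(times, errors):
    """Least-squares slope of log(error) against time.
    A large positive slope indicates rapidly growing deviation."""
    logs = [math.log(e) for e in errors]
    t_mean = sum(times) / len(times)
    l_mean = sum(logs) / len(logs)
    cov = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var

def summarize(rates):
    """Mean and sample standard deviation over independent rollouts."""
    return statistics.mean(rates), statistics.stdev(rates)

# Synthetic check: errors that grow like exp(0.5 * t) give a rate near 0.5.
times = [0.0, 1.0, 2.0, 3.0]
errors = [0.01 * math.exp(0.5 * t) for t in times]
rate = growth_rate(times, errors)
```

Computing `growth_rate` for several independent rollouts and passing the results to `summarize` would give a rate with an error bar, which could then be compared across models or step sizes.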
Original abstract
Neural operators have been increasingly used as data-driven surrogates for time-marching predictions of turbulent flows. However, long-horizon autoregressive prediction is sensitive to error accumulation and the choice of prediction interval. Excessively small time increments may increase temporal redundancy and lengthen rollouts, which can degrade the stability of neural operators in turbulence forecasting. This work pursues a unified objective: stable long-horizon autoregressive prediction at fine temporal resolution for three-dimensional turbulence. We propose a multi-stepsize mixture-of-experts (Ms-MoE) neural operator built on an implicit factorized Transformer (IFactFormer) backbone. The model conditions on a requested relative stride and uses a time-step router to activate scale-specific routed experts together with a shared expert, yielding a single architecture that represents a family of stride-parameterized time-advancement operators. We evaluate the approach on forced homogeneous isotropic turbulence (HIT) and turbulent channel flow using filtered direct numerical simulation datasets. Relative to sampling intervals used in previous studies, we construct training datasets with up to 20 times finer temporal resolution and report long-horizon autoregressive rollouts using qualitative time-slice comparisons and long-time-averaged statistics. Ms-MoE-IFactFormer yields more stable long-horizon rollouts and improved agreement with long-time-averaged statistics on both HIT and turbulent channel flow, suggesting potential for stable time-marching at fine temporal resolution in more complex turbulent flows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-stepsize mixture-of-experts neural operator (Ms-MoE-IFactFormer) built on an implicit factorized Transformer backbone for stable long-horizon autoregressive prediction of 3D turbulent flows at fine temporal resolution. The model conditions on a requested relative stride, employs a time-step router to activate scale-specific routed experts plus a shared expert, and is evaluated on filtered DNS datasets for forced homogeneous isotropic turbulence (HIT) and turbulent channel flow. It claims more stable rollouts and improved agreement with long-time-averaged statistics relative to prior sampling intervals, suggesting applicability to more complex flows.
Significance. If the quantitative results hold, the stride-conditioned Ms-MoE approach provides a unified architecture for a family of time-advancement operators, addressing error accumulation in fine-time-step autoregressive rollouts of turbulence. This could meaningfully advance neural-operator surrogates in fluid dynamics by enabling stable long-horizon predictions at temporal resolutions up to 20 times finer than previous studies, with potential for broader use in multi-scale flow modeling.
major comments (2)
- Abstract and §4 (results): the central claim of 'more stable long-horizon rollouts and improved agreement with long-time-averaged statistics' is asserted without reported quantitative metrics (e.g., L2 error norms, kinetic-energy spectra, or stability measures with error bars over rollout horizon), training details, or ablation studies on the router/expert activation; this prevents verification that the stride conditioning actually suppresses error accumulation as hypothesized.
- §3 (method): the description of the time-step router and scale-specific experts is high-level; without explicit equations for the conditioning mechanism, router loss, or how relative stride is encoded into the IFactFormer layers, it is unclear whether the architecture guarantees the claimed parameter-free family of operators or merely interpolates between discrete strides.
minor comments (2)
- Abstract: 'up to 20 times finer temporal resolution' should specify the exact baseline sampling intervals from prior studies for reproducibility.
- Notation: the acronym 'Ms-MoE-IFactFormer' is introduced without expanding 'IFactFormer' on first use in the abstract.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
- Referee: Abstract and §4 (results): the central claim of 'more stable long-horizon rollouts and improved agreement with long-time-averaged statistics' is asserted without reported quantitative metrics (e.g., L2 error norms, kinetic-energy spectra, or stability measures with error bars over rollout horizon), training details, or ablation studies on the router/expert activation; this prevents verification that the stride conditioning actually suppresses error accumulation as hypothesized.
Authors: We agree that the presentation would benefit from additional quantitative support. The manuscript already reports long-time-averaged statistics (mean velocity profiles, Reynolds stresses for channel flow, and kinetic energy spectra for HIT) that quantify improved agreement relative to baselines. To directly address the concern, we will revise §4 to include L2 error norms of the velocity field over increasing rollout horizons, time-evolving kinetic energy spectra, and stability metrics (e.g., error growth rates) with error bars computed from multiple independent rollouts. Expanded training details (dataset sizes, optimizer settings, and hyperparameter choices) and ablation studies isolating the router and expert activation will also be added to demonstrate the contribution of stride conditioning to error suppression. These revisions will be incorporated in the next version.
Revision: yes
- Referee: §3 (method): the description of the time-step router and scale-specific experts is high-level; without explicit equations for the conditioning mechanism, router loss, or how relative stride is encoded into the IFactFormer layers, it is unclear whether the architecture guarantees the claimed parameter-free family of operators or merely interpolates between discrete strides.
Authors: The §3 description is concise by design, but we accept that explicit formulations are needed for reproducibility. The relative stride s_rel is encoded as a continuous scalar that is embedded and injected into the query, key, and value projections of the IFactFormer layers, enabling the shared backbone to modulate its temporal scale without any parameter changes. The router is a lightweight MLP whose output gates the scale-specific experts; it is trained with the primary prediction loss plus a load-balancing term that penalizes under-utilization of experts. This conditioning produces a single set of weights that realizes a continuous family of operators for arbitrary strides via the learned embedding, rather than discrete interpolation. We will expand §3 with the precise equations for stride embedding, router gating, and the composite loss, together with a schematic of the conditioning path, in the revised manuscript.
Revision: yes
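The composite loss mentioned in the response (prediction loss plus a load-balancing term) could take a form like the sketch below. The exact term used by the authors is not given here, so this substitutes one common choice from the mixture-of-experts literature: the squared coefficient of variation of per-expert importance. The weight `lam` and the function names are assumptions.

```python
def load_balance_penalty(gate_probs):
    """Squared coefficient of variation of per-expert importance, where
    importance is the sum of gate probabilities over a batch.
    It is 0 when experts are used equally and grows as usage collapses."""
    n_experts = len(gate_probs[0])
    importance = [sum(row[i] for row in gate_probs) for i in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    return var / (mean ** 2)

def composite_loss(pred_loss, gate_probs, lam=0.01):
    """Prediction loss plus a weighted balancing term (lam is hypothetical)."""
    return pred_loss + lam * load_balance_penalty(gate_probs)

# Each row holds the router's gate probabilities for one sample.
balanced = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
print(load_balance_penalty(balanced), load_balance_penalty(collapsed))  # 0.0 1.0
```

A penalty of this kind discourages the router from sending every stride to the same expert, which is the under-utilization the response refers to.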
Circularity Check
No significant circularity in the derivation chain
The paper proposes a new Ms-MoE-IFactFormer architecture that conditions on relative stride to activate scale-specific routed experts plus a shared expert, yielding a family of stride-parameterized operators. The central claims of improved stability and agreement with long-time-averaged statistics in long-horizon autoregressive rollouts are supported by direct evaluation on filtered DNS datasets for HIT and turbulent channel flow at up to 20x finer temporal resolution than prior work. No equations, fitted parameters, or results are shown to reduce by construction to the inputs or to prior self-citations; the model is introduced as an explicit design choice and the reported improvements are independent empirical outcomes rather than tautological renamings or self-referential fits.
Axiom & Free-Parameter Ledger
invented entities (1)
- Ms-MoE-IFactFormer: no independent evidence