Transformers for dynamical systems learn transfer operators in-context
Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3
The pith
A transformer trained on one dynamical system forecasts another by lifting time series with delay embeddings and identifying long-lived invariant sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention-based models apply a transfer-operator forecasting strategy in-context. They lift low-dimensional time series using delay embedding to detect the system's higher-dimensional dynamical manifold, and identify and forecast long-lived invariant sets that characterize the global flow on this manifold.
What carries the argument
Delay embedding to reconstruct the dynamical manifold, combined with identification of long-lived invariant sets that enable transfer-operator-style forecasting.
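As a minimal sketch of what delay embedding means in practice, the following lifts a scalar series into vectors of lagged values. The function name `delay_embed` and the parameters `dim` and `lag` are illustrative assumptions, not names from the paper, and this is not a claim about how the transformer computes the lift internally.

```python
from typing import List, Sequence, Tuple


def delay_embed(series: Sequence[float], dim: int, lag: int = 1) -> List[Tuple[float, ...]]:
    """Return delay vectors (x[t], x[t-lag], ..., x[t-(dim-1)*lag]).

    Hypothetical helper for illustration; `dim` is the embedding dimension
    and `lag` is the spacing between delayed samples.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    start = (dim - 1) * lag  # first index that has a full history
    return [
        tuple(series[t - k * lag] for k in range(dim))
        for t in range(start, len(series))
    ]


# A scalar series lifted into 3-dimensional delay coordinates with lag 2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
vectors = delay_embed(x, dim=3, lag=2)
print(vectors)  # [(4.0, 2.0, 0.0), (5.0, 3.0, 1.0)]
```

Each output vector is one point in the reconstructed state space; a longer series simply yields more points on the same reconstructed set.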
If this is right
- Transformers can adapt to entirely new physical systems at test time without any retraining.
- Attention mechanisms use global attractor structure to support short-term forecasts.
- Training dynamics show an early tradeoff between in-distribution accuracy and out-of-distribution generalization that produces double descent.
- Large foundation models for scientific machine learning implicitly learn transfer operators when forecasting dynamical systems.
Where Pith is reading between the lines
- In-context learning for physical forecasting may depend more on phase-space reconstruction than on memorization of training trajectories.
- Explicitly adding delay-embedding layers could improve robustness when applying transformers to chaotic or multi-scale flows.
- The same mechanism might underlie zero-shot transfer observed in larger models across turbulent regimes.
Load-bearing premise
The observed out-of-distribution forecasting performance arises specifically from delay embedding plus invariant-set tracking rather than generic statistical pattern matching.
What would settle it
A controlled model variant prevented from performing delay embedding or from tracking invariant sets would lose all out-of-distribution forecasting ability on new dynamical systems.
Original abstract
Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test time without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.
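To make the phrase "long-lived invariant sets" concrete, here is a toy sketch of an Ulam-style estimate: partition the state space into bins, count bin-to-bin transitions along a trajectory, and read off how likely the system is to stay in each bin. This is a standard textbook construction, shown under the assumption of a one-dimensional state and a hand-made series; it is not the procedure used in the paper, and the names `ulam_matrix` and `stay_probability` are hypothetical.

```python
from typing import List, Sequence


def ulam_matrix(series: Sequence[float], n_bins: int, lo: float, hi: float) -> List[List[float]]:
    """Row-stochastic estimate of bin-to-bin transition probabilities."""
    def bin_of(value: float) -> int:
        idx = int((value - lo) / (hi - lo) * n_bins)
        return min(max(idx, 0), n_bins - 1)  # clamp values on or outside the edges

    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for current, following in zip(series[:-1], series[1:]):
        counts[bin_of(current)][bin_of(following)] += 1.0

    matrix = []
    for row in counts:
        total = sum(row)
        # Bins that were never visited keep an all-zero row.
        matrix.append([c / total for c in row] if total > 0 else row[:])
    return matrix


def stay_probability(matrix: List[List[float]], i: int) -> float:
    """Probability of remaining in bin i after one step."""
    return matrix[i][i]


# Hand-made placeholder series (not data from the paper) that lingers
# in the lower half of [0, 1], then in the upper half, then returns.
series = [0.1, 0.2, 0.1, 0.3, 0.8, 0.9, 0.7, 0.8, 0.2, 0.1]
P = ulam_matrix(series, n_bins=2, lo=0.0, hi=1.0)
print(P)  # [[0.8, 0.2], [0.25, 0.75]]
```

Diagonal entries close to 1 mark bins the trajectory rarely leaves, which is the discrete analogue of an almost-invariant set.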
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript trains a small two-layer, single-head transformer to forecast one dynamical system and shows that the resulting model can forecast a different, unseen dynamical system in-context without retraining. It reports an early training tradeoff between in-distribution and out-of-distribution performance that appears as a secondary double-descent curve, and interprets the model's behavior as implementing a transfer-operator strategy: delay-embedding the input time series to recover the underlying manifold and identifying long-lived invariant sets that govern the global flow.
Significance. If the mechanistic interpretation is substantiated, the work supplies a concrete account of how attention-based models achieve zero-shot adaptation across physical regimes, linking in-context learning to classical dynamical-systems concepts such as delay embedding and transfer operators. This could inform the design of foundation models for scientific machine learning and clarify why attention mechanisms are particularly effective at exploiting global attractor structure for short-term prediction.
major comments (2)
- [Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.
- [Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.
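The kind of reporting requested in the second major comment could be sketched as follows: compute a mean squared error per random seed, then summarize the seeds with a mean and a sample standard deviation. The numbers below are made-up placeholders, not results from the paper, and the helper names `mse` and `summarize` are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error between a forecast and the reference trajectory."""
    if len(pred) != len(true):
        raise ValueError("forecast and reference must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def summarize(per_seed_errors: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 seeds)."""
    return {
        "mean": mean(per_seed_errors),
        "std": stdev(per_seed_errors),
        "n_seeds": float(len(per_seed_errors)),
    }


# Placeholder reference trajectory and one forecast per seed (synthetic values).
reference = [0.0, 0.0, 0.0]
forecasts_by_seed: List[List[float]] = [
    [1.0, 1.0, 1.0],
    [2.0, 1.0, 1.0],
    [2.0, 2.0, 1.0],
]

errors = [mse(forecast, reference) for forecast in forecasts_by_seed]
report = summarize(errors)
print(errors)  # [1.0, 2.0, 3.0]
print(report)
```

The per-seed errors are what a significance test would consume, so keeping them alongside the summary avoids recomputing forecasts later.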
minor comments (2)
- [Introduction] Add explicit references to Takens' embedding theorem and standard transfer-operator literature when introducing the delay-embedding and invariant-set mechanisms.
- [Methods] Clarify the precise definition of the transfer operator being approximated and how it is recovered from the attention weights or hidden states.
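For orientation, the standard textbook objects behind the two minor comments can be written down as below. These are general definitions, stated under the assumption of a discrete-time map F on a state space M with a scalar observable h; the summary does not say which specific operator or discretization the paper uses.

```latex
% Transfer (Perron-Frobenius) operator \mathcal{P} of a map F : M \to M,
% acting on densities \rho, for every measurable set A:
\int_A (\mathcal{P}\rho)(x)\,dx = \int_{F^{-1}(A)} \rho(x)\,dx

% Ulam discretization on a partition \{B_1, \dots, B_n\},
% with m denoting Lebesgue measure:
P_{ij} = \frac{m\bigl(B_i \cap F^{-1}(B_j)\bigr)}{m(B_i)}

% A set A is almost invariant with respect to an invariant measure \mu when
\frac{\mu\bigl(A \cap F^{-1}(A)\bigr)}{\mu(A)} \approx 1

% Delay-coordinate map for a scalar observable h and embedding dimension d:
\Phi_d(x) = \bigl(h(x),\, h(F(x)),\, \dots,\, h(F^{d-1}(x))\bigr)
```

Takens' theorem states that, for generic F and h on a compact manifold of dimension m, the map \Phi_d is an embedding when d \geq 2m + 1.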
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the evidence and reporting in the manuscript.
- Referee: [Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.
  Authors: We agree that the current evidence for the mechanistic interpretation is observational and that targeted causal interventions would provide stronger support. The manuscript demonstrates consistent OOD forecasting behavior across several dynamical systems that aligns with delay embedding followed by invariant-set forecasting, but we acknowledge that this does not yet rule out purely statistical pattern-matching alternatives. In the revised version we will add ablations using single-timestep inputs (to test the necessity of delay embedding) and attention masking over recent tokens (to test the role of long-range invariant-set identification), while preserving the core sequence-modeling capacity.
  Revision: yes
- Referee: [Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.
  Authors: We accept this criticism. The revised manuscript will include quantitative metrics (e.g., mean squared error with standard deviations computed over multiple random seeds and system instances), error bars on all reported curves, and appropriate statistical significance tests for both the OOD performance gains and the secondary double-descent phenomenon.
  Revision: yes
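The ablations promised in the first response could be expressed as boolean attention masks, sketched below. The phrase "attention masking over recent tokens" is ambiguous in the rebuttal, so two readings are shown: keeping only a recent window (which removes long-range context) and hiding the recent window (which removes short-range context). All function names and the `window` parameter are hypothetical, and this is not the authors' implementation.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def causal_mask(n: int) -> Mask:
    """Baseline: each position sees itself and everything before it."""
    return [[j <= i for j in range(n)] for i in range(n)]


def single_step_mask(n: int) -> Mask:
    """Single-timestep ablation: no history, so no delay vector can be formed."""
    return [[j == i for j in range(n)] for i in range(n)]


def local_window_mask(n: int, window: int) -> Mask:
    """Reading 1: keep only the current token and the `window` tokens before it."""
    return [[j <= i and i - j <= window for j in range(n)] for i in range(n)]


def hide_recent_mask(n: int, window: int) -> Mask:
    """Reading 2: keep the current token but hide the `window` tokens before it."""
    return [[j == i or j < i - window for j in range(n)] for i in range(n)]


def show(mask: Mask) -> str:
    """Render a mask as rows of 1 (visible) and 0 (blocked)."""
    return "\n".join("".join("1" if cell else "0" for cell in row) for row in mask)


print(show(local_window_mask(4, window=1)))
print()
print(show(hide_recent_mask(4, window=1)))
```

Comparing out-of-distribution error under each mask against the causal baseline would indicate which part of the context the forecasts depend on.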
Circularity Check
No circularity: empirical interpretation of observed transformer behavior
The paper reports empirical observations of an early training tradeoff, double descent, and in-context forecasting performance on dynamical systems. It interprets these behaviors as the model performing delay embedding to recover manifolds and identifying invariant sets to apply transfer-operator forecasting. No equations, fitted parameters, or first-principles derivations are shown that reduce the claimed mechanism to its own inputs by construction. The central claims rest on post-hoc analysis of model outputs rather than any self-definitional, fitted-input, or self-citation load-bearing step. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Dynamical systems possess attractors containing long-lived invariant sets that govern global flow
- standard math: Delay embedding reconstructs the higher-dimensional manifold from scalar time series
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold."
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the order of the best-approximating Markov chain scales linearly with the intrinsic dimension of Test-OOD ... consistent with Takens' theorem"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.