Transformers for dynamical systems learn transfer operators in-context

Anthony Bao; Jeffrey Lai; William Gilpin

arxiv: 2602.18679 · v2 · submitted 2026-02-21 · 💻 cs.LG · nlin.CD

Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3

keywords: transformers, in-context learning, dynamical systems, transfer operators, delay embedding, invariant sets, forecasting, attractors

The pith

A transformer trained on one dynamical system forecasts another by lifting time series with delay embeddings and identifying long-lived invariant sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a small two-layer transformer, after training to forecast one dynamical system, can zero-shot forecast a different system without retraining. It achieves this by using delay embedding to lift the observed low-dimensional time series into a higher-dimensional space that reveals the underlying dynamical manifold. The model then identifies and forecasts the long-lived invariant sets that govern the global flow on that manifold. This in-context strategy explains how attention-based models adapt to unseen physical systems at test time. Training also exhibits an early tradeoff between in-distribution and out-of-distribution performance that appears as a secondary double descent.

Core claim

Attention-based models apply a transfer-operator forecasting strategy in-context. They lift low-dimensional time series using delay embedding to detect the system's higher-dimensional dynamical manifold, and identify and forecast long-lived invariant sets that characterize the global flow on this manifold.

What carries the argument

Delay embedding to reconstruct the dynamical manifold combined with identification of long-lived invariant sets that enable transfer-operator style forecasting.
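As a rough illustration of the first ingredient, delay embedding can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `delay_embed` and its `dim` and `lag` parameters are hypothetical and chosen for illustration, and nothing here is taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def delay_embed(series: Sequence[float], dim: int, lag: int = 1) -> List[Tuple[float, ...]]:
    """Return delay vectors (x[t], x[t-lag], ..., x[t-(dim-1)*lag]).

    `dim` and `lag` are illustrative parameter names, not the paper's notation.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    start = (dim - 1) * lag  # first index that has a full history behind it
    return [
        tuple(series[t - j * lag] for j in range(dim))
        for t in range(start, len(series))
    ]


# A scalar series becomes a sequence of points in a dim-dimensional space.
embedded = delay_embed([0, 1, 2, 3, 4, 5], dim=3)
print(embedded)  # [(2, 1, 0), (3, 2, 1), (4, 3, 2), (5, 4, 3)]
```

Each output tuple is one point on the reconstructed trajectory; a larger `dim` exposes more of the hidden state at the cost of fewer usable points.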

If this is right

Transformers can adapt to entirely new physical systems at test time without any retraining.
Attention mechanisms use global attractor structure to support short-term forecasts.
Training dynamics show an early tradeoff between in-distribution accuracy and out-of-distribution generalization that produces double descent.
Large foundation models for scientific machine learning implicitly learn transfer operators when forecasting dynamical systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

In-context learning for physical forecasting may depend more on phase-space reconstruction than on memorization of training trajectories.
Explicitly adding delay-embedding layers could improve robustness when applying transformers to chaotic or multi-scale flows.
The same mechanism might underlie zero-shot transfer observed in larger models across turbulent regimes.

Load-bearing premise

The observed out-of-distribution forecasting performance arises specifically from delay embedding plus invariant-set tracking rather than generic statistical pattern matching.

What would settle it

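An ablation would settle it. If a controlled model variant that is prevented from performing delay embedding, or from tracking invariant sets, loses its out-of-distribution forecasting ability on new dynamical systems while keeping its generic sequence-modeling capacity, the mechanistic claim is supported. If it forecasts new systems about as well as the unablated model, the claim is undermined.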

Figures

Figures reproduced from arXiv: 2602.18679 by Anthony Bao, Jeffrey Lai, William Gilpin.

**Figure 1.** Double descent during in-context learning of dynamical systems. (A) Forecasts from a transformer trained on univariate time series from one dynamical system (Train-ID), then evaluated on its ability to forecast unseen trajectories from the same system (Test-ID, blue), versus forecasts of trajectories from an unseen system (Test-OOD, magenta). Gray curve shows the context, a subset of the total test data. …

**Figure 2.** Scaling laws for out-of-distribution generalization in dynamical systems. (A) The test error of the unseen system (cross-entropy, Test-OOD) versus the difference between the training and testing sets (KL divergence between the attractor of Test-OOD and Train-ID). (B) The error in the steady-state invariant distribution of the transformer dynamics relative to the true invariant distribution of Test-OOD, …

**Figure 3.** Transformers perform time-delay embedding during inference. (A) Empirical time-delayed next-token probabilities $\hat{p}(x_{t+1} \mid x_{t-k})$ averaged across Test-OOD context for a transformer trained on a different system (Train-ID). (B) Exact next-token probabilities $p_{\text{true}}(x_{t+1} \mid x_{t-k})$ obtained from fitting a Markov chain on a long sample of Test-OOD. (C) The order of the closest-approximating Markov chain, versus the tr…

**Figure 4.** Transformers learn in-context transfer operators on reconstructed dynamical manifolds. (A) The eigenvalue spectrum of the transfer operator estimated from the fully-observed Test-OOD attractor, $p(y_{t+1} \mid y_t)$, and the time-lagged transfer operator estimated by sampling contiguous length-$k$ sequences from the transformer, $\hat{p}(\hat{y}_{t+1} \mid \hat{y}_t) = \hat{p}(x_{t+1}, x_t, \ldots, x_{t-k} \mid x_t, x_{t-1}, \ldots, x_{t-k-1})$. (B) (Top) The invariant distributi…
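The Figure 4 caption refers to a transfer operator estimated from an attractor and to its invariant distribution. One common way to build such an estimate from a trajectory is Ulam's method: partition the state space into bins, count bin-to-bin transitions, and normalize each row. The sketch below applies it to the logistic map, a toy system chosen here for illustration. The helper names (`ulam_matrix`, `push_forward`, `stationary`), the bin count, and the choice of system are all assumptions, and this is not necessarily the estimator used in the paper.

```python
from typing import List


def ulam_matrix(series: List[float], n_bins: int, lo: float, hi: float) -> List[List[float]]:
    """Row-stochastic matrix of empirical bin-to-bin transition frequencies."""
    def bin_of(x: float) -> int:
        i = int((x - lo) / (hi - lo) * n_bins)
        return min(max(i, 0), n_bins - 1)  # clamp boundary values into range

    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(series[:-1], series[1:]):
        counts[bin_of(a)][bin_of(b)] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # Bins with no observed departures get a uniform row so rows still sum to 1.
        matrix.append([c / total for c in row] if total else [1.0 / n_bins] * n_bins)
    return matrix


def push_forward(dist: List[float], matrix: List[List[float]]) -> List[float]:
    """One application of the discretized transfer operator to a distribution."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def stationary(matrix: List[List[float]], n_iter: int = 2000) -> List[float]:
    """Approximate the invariant distribution by repeated push-forward (power iteration)."""
    dist = [1.0 / len(matrix)] * len(matrix)
    for _ in range(n_iter):
        dist = push_forward(dist, matrix)
    return dist


# Toy trajectory: logistic map x -> 4 x (1 - x), an illustrative chaotic system.
xs = [0.2]
for _ in range(5000):
    xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))

P = ulam_matrix(xs, n_bins=20, lo=0.0, hi=1.0)
pi = stationary(P)
print([round(p, 3) for p in pi])
```

The leading eigenvector of the matrix (eigenvalue 1) approximates the invariant distribution; eigenvalues close to 1 are the ones usually associated with long-lived, almost-invariant sets.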

Original abstract

Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test time without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small transformers show in-context adaptation to new dynamical systems via delay embedding and invariant-set forecasting, but the mechanism lacks isolating controls.

The main thing to know is that this paper trains a tiny two-layer single-head transformer on one dynamical system and finds it can forecast a different one at test time without retraining. The authors interpret this as the model lifting the input via delay embedding to recover the manifold and then tracking long-lived invariant sets that govern the flow, which they link to transfer-operator ideas from dynamical systems theory. They also report an early training tradeoff between in-distribution and out-of-distribution accuracy that produces a secondary double descent.

Referee Report

2 major / 2 minor

Summary. The manuscript trains a small two-layer single-head transformer on forecasting one dynamical system and shows that the resulting model can forecast a different, unseen dynamical system in-context without retraining. It reports an early training tradeoff between in-distribution and out-of-distribution performance that appears as a secondary double-descent curve, and interprets the model's behavior as implementing a transfer-operator strategy: delay-embedding the input time series to recover the underlying manifold and identifying long-lived invariant sets that govern the global flow.

Significance. If the mechanistic interpretation is substantiated, the work supplies a concrete account of how attention-based models achieve zero-shot adaptation across physical regimes, linking in-context learning to classical dynamical-systems concepts such as delay embedding and transfer operators. This could inform the design of foundation models for scientific machine learning and clarify why attention mechanisms are particularly effective at exploiting global attractor structure for short-term prediction.

major comments (2)

[Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.
[Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.
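The fixed-history ablation suggested in the first major comment can be illustrated on a toy problem without any transformer. The sketch below forecasts the x-coordinate of the Hénon map with a nearest-neighbour predictor that sees either one past value or two. Everything here is an assumption made for illustration: the system, the predictor, and the names `henon_x` and `nn_forecast_error` are hypothetical stand-ins, not the paper's setup.

```python
from typing import List


def henon_x(n: int) -> List[float]:
    """Scalar observable of the Henon map: x[t+1] = 1 - 1.4 x[t]^2 + 0.3 x[t-1]."""
    xs = [0.0, 1.0]
    while len(xs) < n:
        xs.append(1.0 - 1.4 * xs[-1] ** 2 + 0.3 * xs[-2])
    return xs


def nn_forecast_error(series: List[float], history: int, n_library: int) -> float:
    """Mean absolute one-step error of a nearest-neighbour forecaster.

    The forecaster only sees the last `history` values (its delay vector).
    """
    def state(t: int) -> List[float]:
        return [series[t - j] for j in range(history)]

    library = range(history - 1, n_library - 1)  # indices whose successor is in the library
    queries = range(n_library, len(series) - 1)  # held-out indices to forecast from
    total = 0.0
    for q in queries:
        sq = state(q)
        best = min(library, key=lambda t: sum((a - b) ** 2 for a, b in zip(state(t), sq)))
        total += abs(series[best + 1] - series[q + 1])
    return total / len(queries)


series = henon_x(1200)[200:]  # discard the initial transient
err_single = nn_forecast_error(series, history=1, n_library=700)  # ablated: one timestep
err_delay = nn_forecast_error(series, history=2, n_library=700)   # two-step delay vector
print(f"history=1: {err_single:.4f}   history=2: {err_delay:.4f}")
```

Because the next value of this observable depends on the two previous values, restricting the history to a single timestep removes information that the delay vector supplies. An analogous comparison on the transformer itself would be the causal test the referee asks for.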

minor comments (2)

[Introduction] Add explicit references to Takens' embedding theorem and standard transfer-operator literature when introducing the delay-embedding and invariant-set mechanisms.
[Methods] Clarify the precise definition of the transfer operator being approximated and how it is recovered from the attention weights or hidden states.
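For orientation on the last minor comment, one standard textbook definition of the transfer (Perron-Frobenius) operator and its Ulam discretization is sketched below. This is background notation assumed for illustration; the symbols $f$, $\rho$, $B_i$, and $m$ are not taken from the paper, and the operator the paper actually approximates may be defined differently.

```latex
% Transfer (Perron-Frobenius) operator of a map f acting on densities \rho:
(\mathcal{P}\rho)(y) = \int \delta\bigl(y - f(x)\bigr)\,\rho(x)\,dx ,
\qquad
\mathcal{P}\rho^{*} = \rho^{*} \quad \text{(invariant density)}.

% Ulam discretization on cells B_1, \dots, B_n with reference measure m:
P_{ij} = \frac{m\bigl(B_i \cap f^{-1}(B_j)\bigr)}{m(B_i)} ,
\qquad
\sum_{j} P_{ij} = 1 \quad \text{when the cells cover the image of } f .
```

In this notation the invariant distribution is the fixed point of $\mathcal{P}$, and eigenvalues of the discretized matrix close to 1 are the ones conventionally linked to almost-invariant sets.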

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the evidence and reporting in the manuscript.

Referee: [Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.

Authors: We agree that the current evidence for the mechanistic interpretation is observational and that targeted causal interventions would provide stronger support. The manuscript demonstrates consistent OOD forecasting behavior across several dynamical systems that aligns with delay embedding followed by invariant-set forecasting, but we acknowledge that this does not yet rule out purely statistical pattern-matching alternatives. In the revised version we will add ablations using single-timestep inputs (to test the necessity of delay embedding) and attention masking over recent tokens (to test the role of long-range invariant-set identification), while preserving the core sequence-modeling capacity.
Revision: yes

Referee: [Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.

Authors: We accept this criticism. The revised manuscript will include quantitative metrics (e.g., mean squared error with standard deviations computed over multiple random seeds and system instances), error bars on all reported curves, and appropriate statistical significance tests for both the OOD performance gains and the secondary double-descent phenomenon.
Revision: yes
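The kind of reporting promised here (a mean error with a spread over random seeds) can be sketched as follows. The example is a toy under stated assumptions: `persistence_mse` is a hypothetical stand-in for "train and evaluate one model with this seed", and the numbers it produces come from a noisy sine wave, not from the paper.

```python
import math
import random
import statistics
from typing import List


def persistence_mse(seed: int, n: int = 500, noise: float = 0.05) -> float:
    """MSE of the 'repeat the last value' forecast on a noisy sine wave.

    Toy stand-in for one seeded training-and-evaluation run.
    """
    rng = random.Random(seed)
    xs = [math.sin(0.1 * t) + rng.gauss(0.0, noise) for t in range(n)]
    return sum((b - a) ** 2 for a, b in zip(xs[:-1], xs[1:])) / (n - 1)


seeds = [0, 1, 2, 3, 4]
scores: List[float] = [persistence_mse(s) for s in seeds]
mean = statistics.mean(scores)
std = statistics.stdev(scores)      # sample standard deviation across seeds
sem = std / math.sqrt(len(scores))  # standard error of the mean
print(f"MSE = {mean:.5f} +/- {std:.5f} (std), SEM = {sem:.5f}, n = {len(scores)}")
```

Reporting the number of seeds alongside the spread lets a reader judge whether a difference between two curves is larger than the run-to-run variation.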

Circularity Check

0 steps flagged

No circularity: empirical interpretation of observed transformer behavior

The paper reports empirical observations of an early training tradeoff, double descent, and in-context forecasting performance on dynamical systems. It interprets these behaviors as the model performing delay embedding to recover manifolds and identifying invariant sets to apply transfer-operator forecasting. No equations, fitted parameters, or first-principles derivations are shown that reduce the claimed mechanism to its own inputs by construction. The central claims rest on post-hoc analysis of model outputs rather than any self-definitional, fitted-input, or self-citation load-bearing step. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard dynamical-systems concepts (delay embedding, invariant sets, transfer operators) treated as background knowledge rather than derived or fitted quantities.

axioms (2)

Domain assumption: Dynamical systems possess attractors containing long-lived invariant sets that govern global flow. Invoked when interpreting the transformer's in-context forecasts as operating on the manifold's invariant sets.
Standard math: Delay embedding reconstructs the higher-dimensional manifold from scalar time series. Standard Takens embedding theorem, assumed when stating that the model lifts low-dimensional series to detect the manifold.
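For readers who want the second axiom in symbols, the usual statement of the delay map behind Takens' theorem is sketched below. The notation ($M$, $f$, $h$, $d$, $m$) is assumed here for illustration and is not drawn from the paper.

```latex
% M: compact manifold of dimension d;  f : M \to M a diffeomorphism (the dynamics);
% h : M \to \mathbb{R} a smooth scalar observation.
\Phi_{f,h}(x) = \bigl(h(x),\, h(f(x)),\, \dots,\, h(f^{m-1}(x))\bigr) \in \mathbb{R}^{m},
\qquad m \ge 2d + 1 .
% Takens' theorem: for generic pairs (f, h), \Phi_{f,h} is an embedding of M into \mathbb{R}^{m}.
```

The condition $m \ge 2d + 1$ is a sufficient number of delays; in practice fewer delays are often enough for a given system and observable.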

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: "the order of the best-approximating Markov chain scales linearly with the intrinsic dimension of Test-OOD ... consistent with Takens' theorem"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1]

Quantifying atten- tion flow in transformers

Samira Abnar and Willem Zuidema. Quantifying atten- tion flow in transformers. InProceedings of the 58th an- nual meeting of the association for computational linguis- tics, pages 4190–4197, 2020

work page 2020
[2]

What learning algorithm is in- context learning? investigations with linear models

Ekin Aky¨ urek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in- context learning? investigations with linear models. In The Eleventh International Conference on Learning Rep- resentations, 2023

work page 2023
[3]

Chronos: Learning the language of time series.Transactions on Machine Learn- ing Research, 2024

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series.Transactions on Machine Learn- ing Research, 2024

work page 2024
[4]

Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

work page 2019
[5]

A theory of learning from different domains

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010

work page 2010
[6]

Invariant measures in time-delay coordinates for unique dynamical system identification.Physical Review Letters, 135(16):167202, 2025

Jonah Botvinick-Greenhouse, Robert Martin, and Yunan Yang. Invariant measures in time-delay coordinates for unique dynamical system identification.Physical Review Letters, 135(16):167202, 2025

work page 2025
[7]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[8]

Cambridge University Press, 2022

Steven L Brunton and J Nathan Kutz.Data-driven sci- ence and engineering: Machine learning, dynamical sys- tems, and control. Cambridge University Press, 2022

work page 2022
[9]

Jake Buzhardt, C Ricardo Constante-Amores, and Michael D Graham. On the relationship between koop- man operator approximations and neural ordinary dif- ferential equations for data-driven time-evolution predic- tions.Chaos: An Interdisciplinary Journal of Nonlinear Science, 35(4), 2025

work page 2025
[10]

Generative-machine-learning surro- gate model of plasma turbulence.Physical Review E, 111(1):L013202, 2025

B Clavier, D Zarzoso, Diego Del-Castillo-Negrete, and E Fr´ enod. Generative-machine-learning surro- gate model of plasma turbulence.Physical Review E, 111(1):L013202, 2025

work page 2025
[11]

Cvitanovi´ c, R

P. Cvitanovi´ c, R. Artuso, R. Mainieri, G. Tanner, and G. Vattay.Chaos: Classical and Quantum. Niels Bohr Inst., Copenhagen, 2016

work page 2016
[12]

A mechanistic analysis of transformers for dynamical systems.arXiv preprint arXiv:2512.21113, 2025

Gregory Duth´ e, Nikolaos Evangelou, Wei Liu, Ioannis G Kevrekidis, and Eleni Chatzi. A mechanistic analysis of transformers for dynamical systems.arXiv preprint arXiv:2512.21113, 2025

work page arXiv 2025
[13]

The evolution of sta- tistical induction heads: In-context learning markov chains.Advances in neural information processing sys- tems, 37:64273–64311, 2024

Ezra Edelman, Nikolaos Tsilivis, Benjamin L Edelman, Eran Malach, and Surbhi Goel. The evolution of sta- tistical induction heads: In-context learning markov chains.Advances in neural information processing sys- tems, 37:64273–64311, 2024

work page 2024
[14]

Detecting and locating near-optimal almost-invariant sets and cycles

Gary Froyland and Michael Dellnitz. Detecting and locating near-optimal almost-invariant sets and cycles. SIAM Journal on Scientific Computing, 24(6):1839– 1863, 2003

work page 2003
[15]

What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gre- gory Valiant. What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

work page 2022
[16]

Chaos as an interpretable benchmark for forecasting and data-driven modelling.NeurIPS, 34, 2021

William Gilpin. Chaos as an interpretable benchmark for forecasting and data-driven modelling.NeurIPS, 34, 2021

work page 2021
[17]

Out-of- domain generalization in dynamical systems reconstruc- tion

Niclas Alexander G¨ oring, Florian Hess, Manuel Bren- ner, Zahra Monfared, and Daniel Durstewitz. Out-of- domain generalization in dynamical systems reconstruc- tion. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Mach...

work page 2024
[18]

Poseidon: Efficient foundation models for pdes.Advances in Neural Information Processing Systems, 37:72525–72624, 2024

Maximilian Herde, Bogdan Raonic, Tobias Rohner, Roger K¨ appeli, Roberto Molinaro, Emmanuel De Bezenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes.Advances in Neural Information Processing Systems, 37:72525–72624, 2024

work page 2024
[19]

Panda: A pretrained forecast model for universal representation of chaotic dynamics

Jeffrey Lai, Anthony Bao, and William Gilpin. Panda: A pretrained forecast model for universal representation of chaotic dynamics. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[20]

Predictability: A problem partly solved

Edward N Lorenz. Predictability: A problem partly solved. InProc. Seminar on predictability, volume 1, pages 1–18. Reading, 1996

work page 1996
[21]

Domain adaptation: Learning bounds and algorithms

Yishay Mansour, Mehryar Mohri, and Afshin Ros- tamizadeh. Domain adaptation: Learning bounds and algorithms.arXiv preprint arXiv:0902.3430, 2009

work page arXiv 2009
[22]

Walrus: A cross-domain foundation model for continuum dynamics.arXiv preprint arXiv:2511.15684, 2025

Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze WK Wong, Hadi Sotoudeh, Alberto Bietti, et al. Walrus: A cross-domain foun- dation model for continuum dynamics.arXiv preprint arXiv:2511.15684, 2025

work page arXiv 2025
[23]

Multiple physics pretraining for 6 spatiotemporal surrogate models.Advances in Neural In- formation Processing Systems, 37:119301–119335, 2024

Michael McCabe, Bruno R´ egaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for 6 spatiotemporal surrogate models.Advances in Neural In- formation Processing Systems, 37:119301–119335, 2024

work page 2024
[24]

Deep double de- scent: Where bigger models and more data hurt.Jour- nal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double de- scent: Where bigger models and more data hurt.Jour- nal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021
[25]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InThe Eleventh International Conference on Learning Repre- sentations, 2023

work page 2023
[26]

Probabilistic weather forecasting with machine learning.Nature, 637(8044):84–90, 2025

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning.Nature, 637(8044):84–90, 2025

work page 2025
[27]

Pretraining codomain attention neural operators for solving multiphysics pdes.Advances in Neural Infor- mation Processing Systems, 37:104035–104064, 2024

Md Ashiqur Rahman, Robert Joseph George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics pdes.Advances in Neural Infor- mation Processing Systems, 37:104035–104064, 2024

work page 2024
[28]

The mechanistic basis of data depen- dence and abrupt learning in an in-context classification task

Gautam Reddy. The mechanistic basis of data depen- dence and abrupt learning in an in-context classification task. InThe Twelfth International Conference on Learn- ing Representations, 2024

work page 2024
[29]

Memorizing with- out overfitting: Bias, variance, and interpolation in overparameterized models.Physical review research, 4(1):013201, 2022

Jason W Rocks and Pankaj Mehta. Memorizing with- out overfitting: Bias, variance, and interpolation in overparameterized models.Physical review research, 4(1):013201, 2022

work page 2022
[30]

Magnetohydrody- namics with physics informed neural operators.Machine Learning: Science and Technology, 4(3):035002, 2023

Shawn G Rosofsky and Eliu A Huerta. Magnetohydrody- namics with physics informed neural operators.Machine Learning: Science and Technology, 4(3):035002, 2023

work page 2023
[31]

Implicit transfer operator learning: Multiple time- resolution models for molecular dynamics.Advances in Neural Information Processing Systems, 36:36449–36462, 2023

Mathias Schreiner, Ole Winther, and Simon Olsson. Implicit transfer operator learning: Multiple time- resolution models for molecular dynamics.Advances in Neural Information Processing Systems, 36:36449–36462, 2023

work page 2023
[32]

Towards foundation models for scien- tific machine learning: Characterizing scaling and trans- fer behavior.Advances in Neural Information Processing Systems, 36:71242–71262, 2023

Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael W Mahoney, and Amir Gholami. Towards foundation models for scien- tific machine learning: Characterizing scaling and trans- fer behavior.Advances in Neural Information Processing Systems, 36:71242–71262, 2023

work page 2023
[33]

Dynamical systems and turbulence.War- wick, 1980, pages 366–381, 1981

Floris Takens. Dynamical systems and turbulence.War- wick, 1980, pages 366–381, 1981

work page 1980
[34]

Ulam.A Collection of Mathematical Prob- lems

Stanislaw M. Ulam.A Collection of Mathematical Prob- lems. Interscience Publishers, New York, 1960

work page 1960
[35]

Transformers learn in-context by gradient descent

Johannes Von Oswald, Eyvind Niklasson, Ettore Ran- dazzo, Jo˜ ao Sacramento, Alexander Mordvintsev, An- drey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. InInternational Conference on Machine Learning, pages 35151–35174. PMLR, 2023

work page 2023
[36]

Trained transformers learn linear models in-context.Journal of Machine Learning Research, 25(49):1–55, 2024

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context.Journal of Machine Learning Research, 25(49):1–55, 2024

work page 2024
[37]

Context parrot- ing: A simple but tough-to-beat baseline for foundation models in scientific machine learning

Yuanzhao Zhang and William Gilpin. Context parrot- ing: A simple but tough-to-beat baseline for foundation models in scientific machine learning. InThe Fourteenth International Conference on Learning Representations, 2026. 1 APPENDIX CONTENTS References 5 Appendix 1 Appendix 1 Appendix A. Code Availability 1 Appendix B. Model architecture and training 1 A...

work page 2026
[38]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, 2020

work page 2020
[39]

Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series.Transactions on Machine Learning Research, 2024

work page 2024
[40]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[41]

Universal redundancies in time series foundation models.arXiv preprint arXiv:2602.01605, 2026

Anthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai, and William Gilpin. Universal redundancies in time series foundation models.arXiv preprint arXiv:2602.01605, 2026

work page arXiv 2026
[42]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[43]

Power-law distributions in empirical data.SIAM review, 51(4):661–703, 2009

Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data.SIAM review, 51(4):661–703, 2009

work page 2009
[44]

The evolution of statistical induction heads: In-context learning markov chains.Advances in neural information processing systems, 37:64273–64311, 2024

Ezra Edelman, Nikolaos Tsilivis, Benjamin L Edelman, Eran Malach, and Surbhi Goel. The evolution of statistical induction heads: In-context learning markov chains.Advances in neural information processing systems, 37:64273–64311, 2024

work page 2024
[45]

What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes.Advances in neural information processing systems, 35:30583–30598, 2022

work page 2022
[46]

Measuring the strangeness of strange attractors.Physica D: nonlinear phenomena, 9(1-2):189–208, 1983

Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors.Physica D: nonlinear phenomena, 9(1-2):189–208, 1983

work page 1983
[47]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 7

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021
[49]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[50]

Attention is all you need. Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[1]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, 2020

work page 2020

[2]

What learning algorithm is in-context learning? investigations with linear models

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[3]

Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024

work page 2024

[4]

Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

work page 2019

[5]

A theory of learning from different domains

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010

work page 2010

[6]

Invariant measures in time-delay coordinates for unique dynamical system identification. Physical Review Letters, 135(16):167202, 2025

Jonah Botvinick-Greenhouse, Robert Martin, and Yunan Yang. Invariant measures in time-delay coordinates for unique dynamical system identification. Physical Review Letters, 135(16):167202, 2025

work page 2025

[7]

Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 2020

[8]

Data-driven science and engineering: Machine learning, dynamical systems, and control

Steven L Brunton and J Nathan Kutz. Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press, 2022

work page 2022

[9]

Jake Buzhardt, C Ricardo Constante-Amores, and Michael D Graham. On the relationship between koopman operator approximations and neural ordinary differential equations for data-driven time-evolution predictions. Chaos: An Interdisciplinary Journal of Nonlinear Science, 35(4), 2025

work page 2025

[10]

Generative-machine-learning surrogate model of plasma turbulence. Physical Review E, 111(1):L013202, 2025

B Clavier, D Zarzoso, Diego Del-Castillo-Negrete, and E Frénod. Generative-machine-learning surrogate model of plasma turbulence. Physical Review E, 111(1):L013202, 2025

work page 2025

[11]

Chaos: Classical and Quantum

P. Cvitanović, R. Artuso, R. Mainieri, G. Tanner, and G. Vattay. Chaos: Classical and Quantum. Niels Bohr Inst., Copenhagen, 2016

work page 2016

[12]

A mechanistic analysis of transformers for dynamical systems. arXiv preprint arXiv:2512.21113, 2025

Gregory Duthé, Nikolaos Evangelou, Wei Liu, Ioannis G Kevrekidis, and Eleni Chatzi. A mechanistic analysis of transformers for dynamical systems. arXiv preprint arXiv:2512.21113, 2025

work page arXiv 2025

[13]

The evolution of statistical induction heads: In-context learning markov chains. Advances in neural information processing systems, 37:64273–64311, 2024

Ezra Edelman, Nikolaos Tsilivis, Benjamin L Edelman, Eran Malach, and Surbhi Goel. The evolution of statistical induction heads: In-context learning markov chains. Advances in neural information processing systems, 37:64273–64311, 2024

work page 2024

[14]

Detecting and locating near-optimal almost-invariant sets and cycles

Gary Froyland and Michael Dellnitz. Detecting and locating near-optimal almost-invariant sets and cycles. SIAM Journal on Scientific Computing, 24(6):1839–1863, 2003

work page 2003

[15]

What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022

work page 2022

[16]

Chaos as an interpretable benchmark for forecasting and data-driven modelling. NeurIPS, 34, 2021

William Gilpin. Chaos as an interpretable benchmark for forecasting and data-driven modelling. NeurIPS, 34, 2021

work page 2021

[17]

Out-of-domain generalization in dynamical systems reconstruction

Niclas Alexander Göring, Florian Hess, Manuel Brenner, Zahra Monfared, and Daniel Durstewitz. Out-of-domain generalization in dynamical systems reconstruction. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Mach...

work page 2024

[18]

Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024

Maximilian Herde, Bogdan Raonic, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel De Bezenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024

work page 2024

[19]

Panda: A pretrained forecast model for universal representation of chaotic dynamics

Jeffrey Lai, Anthony Bao, and William Gilpin. Panda: A pretrained forecast model for universal representation of chaotic dynamics. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026

[20]

Predictability: A problem partly solved

Edward N Lorenz. Predictability: A problem partly solved. In Proc. Seminar on predictability, volume 1, pages 1–18. Reading, 1996

work page 1996

[21]

Domain adaptation: Learning bounds and algorithms

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009

work page arXiv 2009

[22]

Walrus: A cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684, 2025

Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze WK Wong, Hadi Sotoudeh, Alberto Bietti, et al. Walrus: A cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684, 2025

work page arXiv 2025

[23]

Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024

Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024

work page 2024

[24]

Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021

[25]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[26]

Probabilistic weather forecasting with machine learning. Nature, 637(8044):84–90, 2025

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning. Nature, 637(8044):84–90, 2025

work page 2025

[27]

Pretraining codomain attention neural operators for solving multiphysics pdes. Advances in Neural Information Processing Systems, 37:104035–104064, 2024

Md Ashiqur Rahman, Robert Joseph George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics pdes. Advances in Neural Information Processing Systems, 37:104035–104064, 2024

work page 2024

[28]

The mechanistic basis of data dependence and abrupt learning in an in-context classification task

Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In The Twelfth International Conference on Learning Representations, 2024

work page 2024

[29]

Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models. Physical review research, 4(1):013201, 2022

Jason W Rocks and Pankaj Mehta. Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models. Physical review research, 4(1):013201, 2022

work page 2022

[30]

Magnetohydrodynamics with physics informed neural operators. Machine Learning: Science and Technology, 4(3):035002, 2023

Shawn G Rosofsky and Eliu A Huerta. Magnetohydrodynamics with physics informed neural operators. Machine Learning: Science and Technology, 4(3):035002, 2023

work page 2023

[31]

Implicit transfer operator learning: Multiple time-resolution models for molecular dynamics. Advances in Neural Information Processing Systems, 36:36449–36462, 2023

Mathias Schreiner, Ole Winther, and Simon Olsson. Implicit transfer operator learning: Multiple time-resolution models for molecular dynamics. Advances in Neural Information Processing Systems, 36:36449–36462, 2023

work page 2023

[32]

Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. Advances in Neural Information Processing Systems, 36:71242–71262, 2023

Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael W Mahoney, and Amir Gholami. Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. Advances in Neural Information Processing Systems, 36:71242–71262, 2023

work page 2023

[33]

Dynamical systems and turbulence. Warwick, 1980, pages 366–381, 1981

Floris Takens. Dynamical systems and turbulence. Warwick, 1980, pages 366–381, 1981

work page 1981

[34]

A Collection of Mathematical Problems

Stanislaw M. Ulam. A Collection of Mathematical Problems. Interscience Publishers, New York, 1960

work page 1960

[35]

Transformers learn in-context by gradient descent

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023

work page 2023

[36]

Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024

work page 2024

[37]

Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning

Yuanzhao Zhang and William Gilpin. Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026

[38]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, 2020

work page 2020

[39]

Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024

work page 2024

[40]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[41]

Universal redundancies in time series foundation models. arXiv preprint arXiv:2602.01605, 2026

Anthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai, and William Gilpin. Universal redundancies in time series foundation models. arXiv preprint arXiv:2602.01605, 2026

work page arXiv 2026

[42]

Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 2020

[43]

Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009

Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009

work page 2009

[44]

The evolution of statistical induction heads: In-context learning markov chains. Advances in neural information processing systems, 37:64273–64311, 2024

Ezra Edelman, Nikolaos Tsilivis, Benjamin L Edelman, Eran Malach, and Surbhi Goel. The evolution of statistical induction heads: In-context learning markov chains. Advances in neural information processing systems, 37:64273–64311, 2024

work page 2024

[45]

What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35:30583–30598, 2022

work page 2022

[46]

Measuring the strangeness of strange attractors. Physica D: nonlinear phenomena, 9(1-2):189–208, 1983

Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. Physica D: nonlinear phenomena, 9(1-2):189–208, 1983

work page 1983

[47]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[48]

Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021

[49]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020

[50]

Attention is all you need. Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017