
arxiv: 2602.18679 · v2 · submitted 2026-02-21 · 💻 cs.LG · nlin.CD

Transformers for dynamical systems learn transfer operators in-context

Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3

classification 💻 cs.LG nlin.CD
keywords transformers · in-context learning · dynamical systems · transfer operators · delay embedding · invariant sets · forecasting · attractors

The pith

A transformer trained on one dynamical system forecasts another by lifting time series with delay embeddings and identifying long-lived invariant sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a small two-layer, single-head transformer, once trained to forecast one dynamical system, can forecast a different system zero-shot, with no retraining. It does so by delay-embedding the observed low-dimensional time series, which lifts it into a higher-dimensional space that reveals the underlying dynamical manifold. The model then identifies and forecasts the long-lived invariant sets that govern the global flow on that manifold. The paper offers this in-context strategy as the explanation for how attention-based models adapt to unseen physical systems at test time. Training also exhibits an early tradeoff between in-distribution and out-of-distribution performance, which shows up as a secondary double descent.

Core claim

Attention-based models apply a transfer-operator forecasting strategy in-context. They lift low-dimensional time series using delay embedding to detect the system's higher-dimensional dynamical manifold, and identify and forecast long-lived invariant sets that characterize the global flow on this manifold.
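
The lifting step in this claim can be sketched in a few lines. This is a minimal illustration, assuming a uniformly sampled scalar series; the function name `delay_embed`, the parameters `dim` and `lag`, and the logistic-map example are illustrative choices, not taken from the paper.

```python
import numpy as np

def delay_embed(x, dim, lag=1):
    """Stack lagged copies of a 1-D series into delay vectors.

    Row i is (x[i], x[i + lag], ..., x[i + (dim - 1) * lag]).
    `dim` and `lag` are hypothetical hyperparameters chosen for illustration.
    """
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series too short for this dim and lag")
    return np.stack([x[j * lag : j * lag + n_rows] for j in range(dim)], axis=1)

# Example: a scalar series from the logistic map, lifted to 3 dimensions.
x = np.empty(200)
x[0] = 0.3
for t in range(199):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])

Y = delay_embed(x, dim=3, lag=1)  # one delay vector per row
```

Each row of `Y` is a point in the reconstructed space. How many delays a transformer effectively uses is an empirical question the paper addresses; the sketch only shows the classical construction.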

What carries the argument

Two ingredients carry it: delay embedding, which reconstructs the dynamical manifold, and identification of long-lived invariant sets, which enables transfer-operator-style forecasting.
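
For orientation, the standard textbook objects behind "transfer-operator-style forecasting" can be written down as follows. The notation (map f, density ρ, partition cells B_i, delay states y_t) is assumed here for illustration and is not quoted from the paper.

```latex
% Transfer (Perron-Frobenius) operator of a map f acting on a density \rho
(\mathcal{P}\rho)(y) = \int \delta\bigl(y - f(x)\bigr)\,\rho(x)\,dx,
\qquad
\mathcal{P}\rho^{*} = \rho^{*} \quad \text{(invariant density)}.

% Ulam-type finite approximation on a partition \{B_1, \dots, B_n\},
% estimated from one trajectory y_0, y_1, \dots
P_{ij} \approx
\frac{\#\{\, t : y_t \in B_i,\ y_{t+1} \in B_j \,\}}
     {\#\{\, t : y_t \in B_i \,\}},
\qquad
\pi P = \pi .
```

In this picture, the left eigenvector π with eigenvalue 1 approximates the invariant distribution. Other eigenvalues with modulus close to 1 are conventionally read as signatures of long-lived, almost-invariant sets.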

If this is right

  • Transformers can adapt to entirely new physical systems at test time without any retraining.
  • Attention mechanisms use global attractor structure to support short-term forecasts.
  • Training dynamics show an early tradeoff between in-distribution accuracy and out-of-distribution generalization that produces double descent.
  • Large foundation models for scientific machine learning may implicitly learn transfer operators when forecasting dynamical systems, as the minimal model studied here appears to.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In-context learning for physical forecasting may depend more on phase-space reconstruction than on memorization of training trajectories.
  • Explicitly adding delay-embedding layers could improve robustness when applying transformers to chaotic or multi-scale flows.
  • The same mechanism might underlie zero-shot transfer observed in larger models across turbulent regimes.

Load-bearing premise

The observed out-of-distribution forecasting performance arises specifically from delay embedding plus invariant-set tracking rather than generic statistical pattern matching.

What would settle it

A controlled model variant prevented from performing delay embedding or from tracking invariant sets would lose all out-of-distribution forecasting ability on new dynamical systems.
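
The logic of such an ablation can be illustrated with a toy stand-in. The sketch below does not use a transformer: it compares two count-based Markov predictors on a tokenized scalar observable, one restricted to a single previous token and one allowed two. The Hénon map, the bin count, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def henon_x(n, a=1.4, b=0.3, burn_in=100):
    """x-coordinate of the Henon map, used here as a partially observed system."""
    x, y = 0.0, 0.0
    out = np.empty(n)
    for t in range(n + burn_in):
        x, y = 1.0 - a * x * x + y, b * x
        if t >= burn_in:
            out[t - burn_in] = x
    return out

def tokenize(x, n_bins):
    """Uniform binning into integer tokens 0..n_bins-1."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
    return np.digitize(x, edges)

def markov_cross_entropy(train, test, order, n_bins):
    """Held-out cross-entropy (nats) of a count-based model with `order` tokens of history."""
    counts = {}
    for t in range(order, len(train)):
        ctx = tuple(train[t - order:t])
        if ctx not in counts:
            counts[ctx] = np.ones(n_bins)  # add-one (Laplace) smoothing
        counts[ctx][train[t]] += 1.0
    total = 0.0
    for t in range(order, len(test)):
        c = counts.get(tuple(test[t - order:t]), np.ones(n_bins))
        total -= np.log(c[test[t]] / c.sum())
    return total / (len(test) - order)

n_bins = 8
tokens = tokenize(henon_x(30000), n_bins)
train, test = tokens[:20000], tokens[20000:]

ce_single = markov_cross_entropy(train, test, order=1, n_bins=n_bins)   # "ablated": one timestep
ce_history = markov_cross_entropy(train, test, order=2, n_bins=n_bins)  # with one extra delay
```

If history carries information about the hidden coordinate, `ce_history` should come out lower than `ce_single`. The decisive experiment would apply the same comparison to the trained transformer itself, which this sketch does not attempt.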

Figures

Figures reproduced from arXiv: 2602.18679 by Anthony Bao, Jeffrey Lai, William Gilpin.

Figure 1. Double descent during in-context learning of dynamical systems. (A) Forecasts from a transformer trained on univariate time series from one dynamical system (Train-ID), then evaluated on its ability to forecast unseen trajectories from the same system (Test-ID, blue), versus forecasts of trajectories from an unseen system (Test-OOD, magenta). Gray curve shows the context, a subset of the total test data. …
Figure 2. Scaling laws for out-of-distribution generalization in dynamical systems. (A) The test error of the unseen system (cross-entropy, Test-OOD) versus the difference between the training and testing sets (KL divergence between the attractor of Test-OOD and Train-ID). (B) The error in the steady-state invariant distribution of the transformer dynamics relative to the true invariant distribution of Test-OOD, …
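
One way to compute a histogram-based KL divergence between two attractors, of the kind the caption's x-axis refers to, is sketched below. The binning, the smoothing constant `eps`, the direction of the divergence, and the two logistic-map series standing in for Train-ID and Test-OOD are all assumptions; the paper's exact estimator is not given in the truncated caption.

```python
import numpy as np

def histogram_kl(p_samples, q_samples, n_bins=32, eps=1e-9):
    """KL(p || q) between two sample sets, using shared uniform bins."""
    p_samples = np.asarray(p_samples, dtype=float)
    q_samples = np.asarray(q_samples, dtype=float)
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    # A small additive constant keeps the logarithm finite in empty bins.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def logistic_series(r, n, x0=0.3):
    """Illustrative scalar chaotic series; not one of the paper's systems."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = r * x[t] * (1.0 - x[t])
    return x

train_like = logistic_series(3.9, 5000)  # stand-in for Train-ID
ood_like = logistic_series(3.7, 5000)    # stand-in for Test-OOD

kl_same = histogram_kl(train_like, train_like)
kl_shift = histogram_kl(ood_like, train_like)
```

The value depends on the bin count and on `eps`, so any comparison across system pairs should hold both fixed.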
Figure 3. Transformers perform time-delay embedding during inference. (A) Empirical time-delayed next-token probabilities $\hat{p}(x_{t+1} \mid x_{t-k})$ averaged across Test-OOD context for a transformer trained on a different system (Train-ID). (B) Exact next-token probabilities $p_{\mathrm{true}}(x_{t+1} \mid x_{t-k})$ obtained from fitting a Markov chain on a long sample of Test-OOD. (C) The order of the closest-approximating Markov chain, versus the tr…
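
A count-based estimate of a time-delayed conditional such as p(x_{t+1} | x_{t-k}) from a token sequence can be sketched as follows. This is a generic estimator written for illustration; the function name and the toy periodic sequence are assumptions, and the paper's own procedure may differ.

```python
import numpy as np

def lagged_conditional(tokens, k, n_tokens):
    """Estimate Pr(x[t+1] = j | x[t-k] = i) from counts; row i, column j."""
    tokens = np.asarray(tokens)
    # x[t-k] and x[t+1] are separated by k + 1 steps.
    src = tokens[: len(tokens) - (k + 1)]
    dst = tokens[k + 1 :]
    counts = np.zeros((n_tokens, n_tokens))
    np.add.at(counts, (src, dst), 1.0)
    row = counts.sum(axis=1, keepdims=True)
    # Rows for tokens that never occur as a source are left at zero.
    return np.where(row > 0, counts / np.maximum(row, 1.0), 0.0)

# Toy check on a period-3 sequence 0, 1, 2, 0, 1, 2, ...
tokens = np.tile([0, 1, 2], 50)
P0 = lagged_conditional(tokens, k=0, n_tokens=3)  # ordinary one-step transitions
P1 = lagged_conditional(tokens, k=1, n_tokens=3)  # conditioning on the token two steps back
```

On the periodic toy sequence, `P0` maps 0 to 1 and `P1` maps 0 to 2 with probability one, which is an easy way to confirm the lag indexing.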
Figure 4. Transformers learn in-context transfer operators on reconstructed dynamical manifolds. (A) The eigenvalue spectrum of the transfer operator estimated from the fully-observed Test-OOD attractor $p(y_{t+1} \mid y_t)$, and the time-lagged transfer operator estimated by sampling contiguous length-$k$ sequences from the transformer $\hat{p}(\hat{y}_{t+1} \mid \hat{y}_t) = \hat{p}(x_{t+1}, x_t, \ldots, x_{t-k} \mid x_t, x_{t-1}, \ldots, x_{t-k-1})$. (B) (Top) The invariant distributi…
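
A transition matrix over delay states, built from windows of k + 1 consecutive tokens, together with its eigenvalue magnitudes, can be sketched as below. This is a rough illustration of the kind of object the caption describes, assuming a simple count-based estimate; the encoding scheme, the logistic-map token stream, and the names are hypothetical.

```python
import numpy as np

def delay_state_operator(tokens, k, n_tokens):
    """Row-stochastic transition matrix between delay states of k + 1 tokens.

    Only delay states that actually occur in `tokens` are kept.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    n = len(tokens) - k
    # Encode each window as one integer in base n_tokens (oldest token most significant).
    codes = np.zeros(n, dtype=np.int64)
    for j in range(k + 1):
        codes = codes * n_tokens + tokens[j : j + n]
    states, idx = np.unique(codes, return_inverse=True)
    counts = np.zeros((len(states), len(states)))
    np.add.at(counts, (idx[:-1], idx[1:]), 1.0)
    row = counts.sum(axis=1, keepdims=True)
    # A state with no observed successor gets a uniform row so P stays stochastic.
    P = np.where(row > 0, counts / np.maximum(row, 1.0), 1.0 / len(states))
    return P, states

# Illustrative token stream from a binned logistic map (not the paper's data).
x = np.empty(5000)
x[0] = 0.3
for t in range(4999):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])
tokens = np.digitize(x, np.linspace(x.min(), x.max(), 7)[1:-1])  # 6 bins

P, states = delay_state_operator(tokens, k=1, n_tokens=6)
spectrum = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]  # leading magnitude should be near 1
```

Comparing `spectrum` for an operator estimated from model samples against one estimated from the true system is the type of check panel (A) reports. The sketch covers only the estimation step.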

Abstract

Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test time without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript trains a small two-layer single-head transformer on forecasting one dynamical system and shows that the resulting model can forecast a different, unseen dynamical system in-context without retraining. It reports an early training tradeoff between in-distribution and out-of-distribution performance that appears as a secondary double-descent curve, and interprets the model's behavior as implementing a transfer-operator strategy: delay-embedding the input time series to recover the underlying manifold and identifying long-lived invariant sets that govern the global flow.

Significance. If the mechanistic interpretation is substantiated, the work supplies a concrete account of how attention-based models achieve zero-shot adaptation across physical regimes, linking in-context learning to classical dynamical-systems concepts such as delay embedding and transfer operators. This could inform the design of foundation models for scientific machine learning and clarify why attention mechanisms are particularly effective at exploiting global attractor structure for short-term prediction.

major comments (2)
  1. [Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.
  2. [Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.
minor comments (2)
  1. [Introduction] Add explicit references to Takens' embedding theorem and standard transfer-operator literature when introducing the delay-embedding and invariant-set mechanisms.
  2. [Methods] Clarify the precise definition of the transfer operator being approximated and how it is recovered from the attention weights or hidden states.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the evidence and reporting in the manuscript.

  1. Referee: [Results / Experiments] The central claim that the transformer implements delay embedding plus invariant-set detection (abstract and results) rests on post-hoc interpretation of OOD forecasting behavior and the observed early tradeoff/double descent. No causal interventions are described that selectively disable delay embedding (e.g., single-timestep or fixed-history ablations) or invariant-set identification (e.g., attention masking or representation probes) while preserving generic sequence modeling; consequently, alternative explanations based on statistical pattern matching cannot be ruled out.

    Authors: We agree that the current evidence for the mechanistic interpretation is observational and that targeted causal interventions would provide stronger support. The manuscript demonstrates consistent OOD forecasting behavior across several dynamical systems that aligns with delay embedding followed by invariant-set forecasting, but we acknowledge that this does not yet rule out purely statistical pattern-matching alternatives. In the revised version we will add ablations using single-timestep inputs (to test the necessity of delay embedding) and attention masking over recent tokens (to test the role of long-range invariant-set identification), while preserving the core sequence-modeling capacity. revision: yes

  2. Referee: [Abstract / Results] The abstract and main text provide no quantitative metrics, error bars, or statistical significance tests for the reported OOD forecasting gains or the double-descent phenomenon, making it impossible to evaluate the robustness or magnitude of the claimed effects.

    Authors: We accept this criticism. The revised manuscript will include quantitative metrics (e.g., mean squared error with standard deviations computed over multiple random seeds and system instances), error bars on all reported curves, and appropriate statistical significance tests for both the OOD performance gains and the secondary double-descent phenomenon. revision: yes
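
The kind of reporting promised here can be sketched as follows: a mean, a sample standard deviation, and a bootstrap confidence interval over per-seed errors. The function name, the bootstrap settings, and the five error values are placeholders invented for illustration; they are not results from the paper.

```python
import numpy as np

def summarize_over_seeds(errors, n_boot=2000, alpha=0.05, seed=0):
    """Mean, sample std, and percentile-bootstrap CI of the mean over seeds."""
    errors = np.asarray(errors, dtype=float)
    rng = np.random.default_rng(seed)
    resamples = rng.choice(errors, size=(n_boot, len(errors)), replace=True)
    boot_means = resamples.mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return {
        "mean": float(errors.mean()),
        "std": float(errors.std(ddof=1)),
        "ci": (float(lo), float(hi)),
    }

# Placeholder per-seed errors (made-up numbers, one entry per random seed).
placeholder_mse = [0.10, 0.12, 0.09, 0.11, 0.13]
summary = summarize_over_seeds(placeholder_mse)
```

With only a handful of seeds the interval is coarse, so the number of seeds should be reported alongside it.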

Circularity Check

0 steps flagged

No circularity: empirical interpretation of observed transformer behavior

The paper reports empirical observations of an early training tradeoff, double descent, and in-context forecasting performance on dynamical systems. It interprets these behaviors as the model performing delay embedding to recover manifolds and identifying invariant sets to apply transfer-operator forecasting. No equations, fitted parameters, or first-principles derivations are shown that reduce the claimed mechanism to its own inputs by construction. The central claims rest on post-hoc analysis of model outputs rather than any self-definitional, fitted-input, or self-citation load-bearing step. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard dynamical-systems concepts (delay embedding, invariant sets, transfer operators) treated as background knowledge rather than derived or fitted quantities.

axioms (2)
  • domain assumption: Dynamical systems possess attractors containing long-lived invariant sets that govern global flow.
    Invoked when interpreting the transformer's in-context forecasts as operating on the manifold's invariant sets.
  • standard math: Delay embedding reconstructs the higher-dimensional manifold from scalar time series.
    Takens' embedding theorem is assumed when stating that the model lifts low-dimensional series to detect the manifold.
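
For reference, the delay map behind the second axiom is usually stated as below. The symbols (state-space map f, scalar observable h, manifold dimension d, number of delays m) are standard textbook notation assumed here, not notation quoted from the paper.

```latex
% Delay-coordinate map for dynamics f on a compact d-dimensional manifold M
% with a scalar observable h : M \to \mathbb{R}
\Phi_{f,h}(x) = \bigl( h(x),\ h(f(x)),\ \dots,\ h(f^{m-1}(x)) \bigr),
\qquad m \ge 2d + 1 .

% In time-series form, with x_t = h(f^{t}(x_0)):
y_t = \bigl( x_t,\ x_{t+1},\ \dots,\ x_{t+m-1} \bigr) .
```

Takens' theorem says that for generic smooth f and h this map is an embedding, so the delay vectors y_t trace out a faithful copy of the original attractor. The theorem concerns noise-free, infinitely resolved data, so its application to tokenized, finite-length contexts is a modeling assumption.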

