Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling

Afiq Abdillah Effiezal Aswadi; Matthew Farrugia-Roberts; Oliver Britton; Ross Baker

arxiv: 2605.18281 · v1 · pith:4MRUOTQRnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords inductive bias · non-stationarity · temporal diversity · in-context learning · generalization · memorization · transformers · sequence modelling

The pith

Varying tasks over time during training biases small transformers toward generalization instead of memorization

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how non-stationary data distributions affect the inductive biases of neural networks in sequence modelling. It uses in-context linear regression as a testbed where small transformers are trained on tasks that change across training steps. The central finding is that this temporal variation increases the bias toward generalization over memorization compared to training on a fixed set of tasks. A sympathetic reader would care because many real applications involve shifting data, and the result suggests a way to shape model behavior through training dynamics rather than architecture alone.

Core claim

In in-context linear regression sequence modelling, diversifying the task distribution across training time leads to an increased bias towards generalisation over memorisation in small transformers.

What carries the argument

Temporal task diversity, the systematic variation of tasks in the training distribution over time, which creates non-stationarity and shifts inductive bias from memorization to generalization.
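The Figure 1 caption describes tasks "updating according to a MALA random walk with step size γ at each step of training". The following is a minimal sketch of one such update, assuming the standard Metropolis-adjusted Langevin algorithm with target N(0, I_D). The names `mala_step` and `evolve_tasks` are hypothetical, and the exact step-size parameterisation used in the paper is not stated on this page, so the convention below (drift γ·∇log π, noise variance 2γ) is an assumption.

```python
import math
import random


def log_density(t):
    # Log-density of the assumed target N(0, I_D), up to an additive constant.
    return -0.5 * sum(x * x for x in t)


def mala_step(t, gamma, rng):
    """One Metropolis-adjusted Langevin step targeting N(0, I_D) (assumed convention)."""
    if gamma == 0.0:
        return list(t)  # gamma = 0: the task never moves (stationary setting)
    noise_scale = math.sqrt(2.0 * gamma)
    # Langevin proposal: drift along grad log pi(t) = -t, plus Gaussian noise.
    proposal = [x - gamma * x + noise_scale * rng.gauss(0.0, 1.0) for x in t]

    def log_q(dst, src):
        # Log proposal density q(dst | src), up to a constant shared by both directions.
        return -sum((d - (s - gamma * s)) ** 2 for d, s in zip(dst, src)) / (4.0 * gamma)

    log_accept = (log_density(proposal) + log_q(t, proposal)
                  - log_density(t) - log_q(proposal, t))
    # Metropolis-Hastings correction keeps N(0, I_D) invariant.
    if math.log(1.0 - rng.random()) < min(0.0, log_accept):
        return proposal
    return list(t)


def evolve_tasks(tasks, gamma, rng):
    """Apply one independent MALA step to each of the M task vectors."""
    return [mala_step(t, gamma, rng) for t in tasks]
```

Under this convention, calling `evolve_tasks` once per training step with a larger `gamma` makes the task pool drift faster, while each task stays marginally distributed as N(0, I_D).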

If this is right

Models exhibit reduced reliance on memorizing specific task instances encountered during training.
Generalization to novel tasks improves under conditions where the data distribution shifts gradually.
Inductive biases toward safer or more robust solutions may emerge as a side effect of the same training schedule.
Non-stationarity can be treated as a controllable design choice rather than an obstacle to avoid.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same temporal-diversity mechanism might be tested in larger transformers or recurrent architectures to check whether the bias shift scales.
Training curricula in online or continual-learning settings could deliberately introduce controlled task variation to favor generalization.
The finding invites comparison with human learning, where exposure to changing environments often promotes flexible rather than rote strategies.

Load-bearing premise

The synthetic in-context linear regression task with small transformers serves as a faithful testbed whose observed generalization patterns will transfer to broader classes of deep learning models and real-world non-stationary data.

What would settle it

Training small transformers on temporally diverse tasks and finding no measurable reduction in prediction error on held-out tasks, compared with fixed-distribution training, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.18281 by Afiq Abdillah Effiezal Aswadi, Matthew Farrugia-Roberts, Oliver Britton, Ross Baker.

**Figure 1.** Increasing non-stationarity via MALA random walk shifts the dMMSE-ridge transition to lower task diversities. We show results using the final $M$ tasks at the end of pretraining (top row) and on new tasks drawn from $T_{\mathrm{True}} = \mathcal{N}(0, I_D)$ (bottom row), for transformers trained via tasks updating according to a MALA random walk with step size $\gamma$ at each step of training. We vary $\gamma$ from 0 (dark) to $10^{-2}$ (light). The…

**Figure 2.** Increasing non-stationarity via resampling shifts the dMMSE-ridge transition to lower task diversities. As in …

**Figure 3.** Predictions of transformers below the non-stationary task diversity threshold track dMMSE throughout training. We show the mean squared distance $\Delta_{\mathrm{PT},\mathrm{dMMSE}}$ and $\Delta_{\mathrm{PT},\mathrm{Ridge}}$ throughout training. We use a fixed task diversity $M = 32$ and MALA step sizes $\gamma \in \{0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. We evaluate on in-distribution sequences from $q_M^{(\tau)}$ (top row) and on out-of-distribution sequences from $q_\infty$ (bottom row). Th…

**Figure 4.** Predictive Monte Carlo reveals that below the non-stationary task diversity threshold, the transformer's implicit task distribution tracks the changing task distribution. We use predictive Monte Carlo to extract the transformer's implicit prior over task vectors throughout training, and compare to the finite task distribution $q_M^{(\tau)}(t)$ and infinite task diversity $q_\infty(t) = \mathcal{N}(0, I_D)$ priors via energy dista…

**Figure 5.** Implicit prior over a 1D task vector against the true task during training. We use predictive Monte Carlo to extract the transformer's implicit prior over the task vector $p(t)$ (purple) and compare it to the true task $t$ (black), for a one dimensional MALA setting with task dimension $D = 1$, task diversity $M = 1$, and $\gamma = 10^{-2}$. The top panel shows 0 to 100K training steps, and the bottom panel zooms into the …

**Figure 6.** Resampling non-stationarity analogue of …

**Figure 7.** Resampling non-stationarity analogue of …

**Figure 8.** In-distribution mean squared prediction differences throughout training under random walk non-stationarity. We show $\Delta_{\mathrm{PT},\mathrm{dMMSE}}$ and $\Delta_{\mathrm{PT},\mathrm{Ridge}}$ on in-distribution sequences from $q_M^{(\tau)}$ for each combination of task diversity $M$ and MALA step size $\gamma$.

**Figure 9.** Out-of-distribution mean squared prediction differences throughout training under random walk non-stationarity. We show $\Delta_{\mathrm{PT},\mathrm{dMMSE}}$ and $\Delta_{\mathrm{PT},\mathrm{Ridge}}$ on out-of-distribution sequences from $q_\infty$ for each combination of task diversity $M$ and MALA step size $\gamma$.

**Figure 10.** Energy distance between the transformer's implicit prior and the $\mathrm{Uniform}(T_M^{(\tau)})$ and $\mathcal{N}(0, I_D)$ priors throughout training under random walk non-stationarity. For each combination of task diversity $M$ and MALA step size $\gamma$, we use predictive Monte Carlo to extract the transformer's implicit prior over task vectors throughout training, and compare to the baseline priors via energy distance.

**Figure 11.** In-distribution mean squared prediction differences throughout training under resampling non-stationarity. We show $\Delta_{\mathrm{PT},\mathrm{dMMSE}}$ and $\Delta_{\mathrm{PT},\mathrm{Ridge}}$ on in-distribution sequences from $q_M^{(\tau)}$ for each combination of task diversity $M$ and sample number $R$.

**Figure 12.** Out-of-distribution mean squared prediction differences throughout training under resampling non-stationarity. We show $\Delta_{\mathrm{PT},\mathrm{dMMSE}}$ and $\Delta_{\mathrm{PT},\mathrm{Ridge}}$ on out-of-distribution sequences from $q_\infty$ for each combination of task diversity $M$ and sample number $R$.

**Figure 13.** Energy distance between the transformer's implicit prior and the $\mathrm{Uniform}(T_M^{(\tau)})$ and $\mathcal{N}(0, I_D)$ priors throughout training under resampling non-stationarity. For each combination of task diversity $M$ and sample number $R$, we use predictive Monte Carlo to extract the transformer's implicit prior over task vectors throughout training, and compare to the baseline priors via energy distance.

**Figure 14.** Implicit prior over a 1D task vector against the true task during training, across MALA step sizes. As in …

**Figure 15.** Implicit prior over a 1D task vector against the true task during training, across resampling rates. As in …

**Figure 16.** These broadly fell into two types of issues. 1. The first type of issue was a divergence of loss compared to the other seeds. In five of our runs across the Section 4.1 sweep and two runs across the Section 4.2 sweep, a single seed departed from the trajectory followed by the other seeds in the same configuration. The first two panels of …
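Several captions above compare the transformer's implicit prior to baseline priors "via energy distance". As a rough illustration, the sketch below estimates the energy statistic 2E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖ of Székely and Rizzo from two finite samples. The function names are hypothetical, and whether the paper reports this quantity or its square root is not stated on this page.

```python
import math


def _mean_pairwise_distance(xs, ys):
    # Average Euclidean distance over all pairs (x, y) with x in xs and y in ys.
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.dist(x, y)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """V-statistic estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'| from two samples."""
    return (2.0 * _mean_pairwise_distance(xs, ys)
            - _mean_pairwise_distance(xs, xs)
            - _mean_pairwise_distance(ys, ys))
```

Here `xs` would be task vectors sampled from the model's implicit prior and `ys` samples from a baseline prior. The estimate is zero when both samples coincide and grows as the two samples separate.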

original abstract

Modern deep learning science often assumes that neural networks learn from a fixed data distribution. However, many practically important learning problems involve data distributions that change throughout training. How does such non-stationarity impact the inductive biases of deep learning towards models with different structural, generalisation, and safety properties? A fruitful testbed for studying inductive bias is in-context linear regression sequence modelling, where small transformers display strikingly different generalisation patterns depending on the diversity of the (fixed) training task distribution. In this paper, we explore the effect of diversifying the task distribution across training time, finding that such temporal diversity leads to an increased bias towards generalisation over memorisation.
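In this testbed, "memorisation" and "generalisation" are commonly made concrete as two Bayes predictors, which the figure captions call dMMSE and ridge (following Raventós et al. in the reference list). The sketch below computes both for a one-dimensional task, assuming observations y = t·x + Gaussian noise with known variance `noise_var` and a unit-variance Gaussian prior for ridge. The function names and the noise model are illustrative assumptions, not the paper's code.

```python
import math


def ridge_estimate(xs, ys, noise_var):
    """Posterior mean of t under the prior t ~ N(0, 1): the generalising solution."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + noise_var)


def dmmse_estimate(xs, ys, tasks, noise_var):
    """Posterior mean of t under a uniform prior on a finite task set: the memorising solution."""
    # Gaussian log-likelihood of the context under each stored task.
    log_w = [-sum((y - t * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise_var)
             for t in tasks]
    top = max(log_w)  # subtract the max before exponentiating for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    return sum(wi * t for wi, t in zip(w, tasks)) / sum(w)
```

A prediction for a query input `x_query` is the estimate times `x_query`. With few stored tasks the two estimates differ sharply on new tasks, which is what makes the dMMSE-ridge contrast a usable measure of memorisation versus generalisation.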

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines how non-stationarity in task distributions during training affects inductive biases in small transformers performing in-context linear regression sequence modeling. It reports that introducing temporal diversity—by sequencing different task distributions over training time—produces an increased bias toward generalization rather than memorization, relative to stationary (fixed-distribution) training regimes.

Significance. If the central empirical finding holds after appropriate controls, the work would usefully extend existing synthetic testbeds for studying generalization in transformers to the non-stationary setting. This is relevant because real-world training often involves shifting distributions, and the result could inform curriculum or data-ordering strategies that favor generalization.

major comments (1)

[Section 4] Section 4 and associated figures: the reported shift toward generalization is attributed specifically to temporal ordering of task distributions. However, the design lacks an explicit stationary control that exposes the model to the same union of tasks (matched total coverage and compute) but in shuffled order. Without this, the effect could arise from greater cumulative task variety or curriculum-like dynamics rather than non-stationarity per se.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief statement of the precise transformer architecture (layers, heads, embedding dimension) and the exact in-context regression setup (number of in-context examples, input dimension) to allow replication.
Details on statistical significance, number of random seeds, and any exclusion criteria for runs are not visible in the provided text; adding these would strengthen the empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address the major comment below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses

Referee: [Section 4] Section 4 and associated figures: the reported shift toward generalization is attributed specifically to temporal ordering of task distributions. However, the design lacks an explicit stationary control that exposes the model to the same union of tasks (matched total coverage and compute) but in shuffled order. Without this, the effect could arise from greater cumulative task variety or curriculum-like dynamics rather than non-stationarity per se.

Authors: We agree that an explicit stationary control with matched total task coverage is necessary to isolate the contribution of temporal ordering. Our existing stationary baselines train on a single fixed task distribution for the entire run, while the temporal conditions cycle through a sequence of distributions. To address the concern, we will add a new stationary baseline that trains on the union of all tasks appearing in the temporal condition, presented in random shuffled order with identical total exposure and compute budget. We will include the results of this control in the revised Section 4, update the relevant figures, and revise the discussion to clarify whether the observed increase in generalization bias is specifically attributable to non-stationarity rather than cumulative variety alone.

revision: yes
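The control described in this exchange can be sketched as a permutation of task exposures across training steps. The function below is a hypothetical illustration, assuming the temporal condition is recorded as a list of per-step task lists. It is not taken from the paper.

```python
import random


def shuffled_union_control(schedule, seed=0):
    """Permute task exposures across training steps, keeping per-step counts fixed.

    `schedule` is a list with one entry per training step, each entry being the
    list of tasks (or task identifiers) used at that step in the temporal condition.
    """
    flat = [task for step in schedule for task in step]
    random.Random(seed).shuffle(flat)
    control, start = [], 0
    for step in schedule:
        control.append(flat[start:start + len(step)])
        start += len(step)
    return control
```

The returned schedule has the same multiset of task exposures and the same number of tasks per step, so total coverage and compute match while temporal ordering is destroyed.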

Circularity Check

0 steps flagged

No circularity: empirical exploration without self-referential derivations

full rationale

The paper is framed as an empirical study exploring the impact of temporal task diversity on generalization vs. memorization biases in small transformers trained on in-context linear regression tasks. No equations, derivations, or first-principles claims are presented that reduce by construction to fitted parameters, self-definitions, or self-citations from the same work. The central observation—that diversifying tasks across training time increases generalization bias—is reported from experimental results rather than any load-bearing mathematical reduction or renamed known pattern. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; ledger populated from stated premises in the abstract.

axioms (1)

domain assumption: In-context linear regression sequence modelling is a fruitful testbed for studying inductive biases under non-stationarity.
Explicitly stated in the abstract as the chosen experimental setting.

pith-pipeline@v0.9.0 · 5653 in / 1090 out tokens · 40591 ms · 2026-05-20T12:28:12.271351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear

Paper passage: "We explore the effect of diversifying the task distribution across training time, finding that such temporal diversity leads to an increased bias towards generalisation over memorisation."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Optimisation tends towards more stable models"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

[1] Battiston, M. and Cappello, L. Bayesian predictive inference beyond martingales, 2025. Preprint arXiv:2507.21874 [math.ST]. https://arxiv.org/abs/2507.21874

[2] Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, 2009

[3] Bishop, C. M. Mixture density networks. Technical Report NCGR/94/004, Neural Computing Research Group, 1994

[4] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python + NumPy programs. GitHub, 2018. URL http://github.com/jax-ml/jax

[5] Breiman, L. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001

[6] Carroll, L., Hoogland, J., Farrugia-Roberts, M., and Murfet, D. Dynamics of transient structure in in-context linear regression transformers, 2025. Preprint arXiv:2501.17745 [cs.LG]. https://arxiv.org/abs/2501.17745

[7] Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., and Hill, F. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems 35, pp. 18878–18891, 2022

[8] Chen, Z. and Liu, B. Lifelong Machine Learning. Morgan & Claypool, 2018

[9] Clements, M. P. and Hendry, D. F. Forecasting Non-Stationary Economic Time Series. MIT Press, 1999

[10] Ditzler, G., Roveri, M., Alippi, C., and Polikar, R. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015

[11] Effiezal Aswadi, A. A., Ma, H., and Wei, S. What does a Bayes-filtered transformer believe? A predictive Monte Carlo approach. In preparation, 2026

[12] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circ...

[13] Fong, E., Holmes, C., and Walker, S. G. Martingale posterior distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(5):1357–1391, 2023

[14] Fortini, S. and Petrone, S. Prediction-based uncertainty quantification for exchangeable sequences. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 381(2247):20220142, 2023

[15] Fortini, S. and Petrone, S. Exchangeability, prediction and predictive modeling in Bayesian statistics. Statistical Science, 40(1), January 2025

[16] French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999

[17] Garg, S., Tsipras, D., Liang, P., and Valiant, G. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems 35, pp. 30583–30598, 2022

[18] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2014. Published as a conference paper at ICLR 2014. Preprint arXiv:1312.6211 [stat.ML]. https://arxiv.org/abs/1312.6211

[19] He, T., Doshi, D., Das, A., and Gromov, A. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In Advances in Neural Information Processing Systems 37, pp. 13244–13273, 2024

[20] Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2025

[21] Igl, M., Farquhar, G., Luketina, J., Boehmer, W., and Whiteson, S. Transient non-stationarity and generalisation in deep reinforcement learning. In International Conference on Learning Representations, 2021

[22] Karpathy, A. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

[23] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2015. Published as a conference paper at ICLR 2015. Preprint arXiv:1412.6980 [cs.LG]. https://arxiv.org/abs/1412.6980

[24] McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989

[25] Milly, P. C. D., Betancourt, J., Falkenmark, M., Hirsch, R. M., Kundzewicz, Z. W., Lettenmaier, D. P., and Stouffer, R. J. Stationarity is dead: Whither water management? Science, 319(5863):573–574, 2008

[26] Mitchell, T. M. The need for biases in learning generalizations. Technical Report CBM-TR-117, Computer Science Department, Rutgers University, 1980

[27] Nestor, B., McDermott, M. B. A., Boag, W., Berner, G., Naumann, T., Hughes, M. C., Goldenberg, A., and Ghassemi, M. Feature robustness in non-stationary health records: Caveats to deployable model performance in common clinical machine learning tasks. In Proceedings of the 4th Machine Learning for Healthcare Conference, volume 106, pp. 381–405. PMLR, 2019

[28] Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 16828–16847. PMLR, 2022

[29] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In Advances in Neural Information Proces...

[30] Panwar, M., Ahuja, K., and Goyal, N. In-context learning through the Bayesian prism. In International Conference on Learning Representations, 2024

[31] Papoudakis, G., Christianos, F., Rahman, A., and Albrecht, S. V. Dealing with non-stationarity in multi-agent deep reinforcement learning, 2019. Preprint arXiv:1906.04737 [cs.LG]. https://arxiv.org/abs/1906.04737

[32] Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019

[33] Park, C. F., Lubana, E. S., Pres, I., and Tanaka, H. Competition dynamics shape algorithmic phases of in-context learning. In International Conference on Learning Representations, 2025

[34] Pepin Lehalleur, S., Hoogland, J., Farrugia-Roberts, M., Wei, S., Gietelink Oldenziel, A., Wang, G., Carroll, L., and Murfet, D. You are what you eat: AI alignment requires understanding how data shapes structure and generalisation, 2025. Preprint arXiv:2502.05475 [cs.LG]. https://arxiv.org/abs/2502.05475

[35] Phuong, M. and Hutter, M. Formal algorithms for transformers, 2022. Preprint arXiv:2207.09238 [cs.LG]. https://arxiv.org/abs/2207.09238

[36] Ratcliff, R. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990

[37] Raventós, A., Paul, M., Chen, F., and Ganguli, S. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. In Advances in Neural Information Processing Systems 36, pp. 14228–14246, 2023

[38] Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996

[39] Schlimmer, J. C. and Fisher, D. A case study of incremental concept induction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 5, 1986

[40] Singh, A., Chan, S., Moskovitz, T., Grant, E., Saxe, A., and Hill, F. The transient nature of emergent in-context learning in transformers. Advances in Neural Information Processing Systems 36, pp. 27801–27819, 2024

[41] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018

[42] Sutton, R. S. and Whitehead, S. D. Online learning with random representations. In Proceedings of the Tenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann, 1993

[43] Székely, G. J. Potential and kinetic energy in statistics. Lecture notes, Budapest Institute of Technology (Technical University), 1989. As cited in Székely and Rizzo (2013)

[44] Székely, G. J. and Rizzo, M. L. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013

[45] Thrun, S. and Mitchell, T. M. Lifelong robot learning. Robotics and Autonomous Systems, 15(1):25–46, 1995

[46] Tsymbal, A. The problem of concept drift: definitions and related work. Technical Report TCD-CS-2004-15, Department of Computer Science, Trinity College Dublin, 2004

[47] Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

[48] Wentworth, J. S. Selection theorems: A program for understanding agents. AI Alignment Forum, 2021. URL https://www.alignmentforum.org/posts/G2Lne2Fi7Qra5Lbuf

[49] Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996

[50] Wurgaft, D., Lubana, E. S., Park, C. F., Tanaka, H., Reddy, G., and Goodman, N. In-context learning strategies emerge rationally. In Advances in Neural Information Processing Systems 38, 2025

[1] [1]

Battiston and L

Battiston, M. and Cappello, L. Bayesian predictive inference beyond martingales, 2025. Preprint arXiv:2507.21874 https://arxiv.org/abs/2507.21874 [math.ST]

work page arXiv 2025

[2] [2]

Curriculum learning

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.\ 41--48, 2009

work page 2009

[3] [3]

Bishop, C. M. Mixture density networks. Technical Report NCGR/94/004, Neural Computing Research Group, 1994

work page 1994

[4] [4]

J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas , J., Wanderman-Milne , S., and Zhang, Q

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas , J., Wanderman-Milne , S., and Zhang, Q. JAX: composable transformations of Python + NumPy programs. GitHub, 2018. URL http://github.com/jax-ml/jax

work page 2018

[5] [5]

Statistical modeling: The two cultures (with comments and a rejoinder by the author)

Breiman, L. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16 0 (3): 0 199--231, 2001

work page 2001

[6] [6]

Dynamics of transient structure in in-context linear regression transformers, 2025

Carroll, L., Hoogland, J., Farrugia-Roberts, M., and Murfet, D. Dynamics of transient structure in in-context linear regression transformers, 2025. Preprint arXiv:2501.17745 https://arxiv.org/abs/2501.17745 [cs.LG]

work page arXiv 2025

[7] [7]

Data distributional properties drive emergent in-context learning in transformers

Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., and Hill, F. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems 35, pp.\ 18878--18891, 2022

work page 2022

[8] [8]

and Liu, B

Chen, Z. and Liu, B. Lifelong Machine Learning. Morgan & Claypool, 2018

work page 2018

[9] [9]

Clements, M. P. and Hendry, D. F. Forecasting Non-Stationary Economic Time Series. MIT Press, 1999

work page 1999

[10] [10]

Learning in nonstationary environments: A survey

Ditzler, G., Roveri, M., Alippi, C., and Polikar, R. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10 0 (4): 0 12--25, 2015

work page 2015

[11] [11]

A., Ma, H., and Wei, S

Effiezal Aswadi , A. A., Ma, H., and Wei, S. What does a Bayes -filtered transformer believe? A predictive Monte Carlo approach. In preparation, 2026

work page 2026

[12] [12]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circ...

work page 2021

[13] [13]

Fong, E., Holmes, C., and Walker, S. G. Martingale posterior distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85 0 (5): 0 1357--1391, 2023

work page 2023

[14] Fortini, S. and Petrone, S. Prediction-based uncertainty quantification for exchangeable sequences. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 381(2247):20220142, 2023

[15] Fortini, S. and Petrone, S. Exchangeability, prediction and predictive modeling in Bayesian statistics. Statistical Science, 40(1), January 2025

[16] French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999

[17] Garg, S., Tsipras, D., Liang, P., and Valiant, G. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems 35, pp. 30583–30598, 2022

[18] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2014. Published as a conference paper at ICLR 2014. Preprint arXiv:1312.6211 https://arxiv.org/abs/1312.6211 [stat.ML]

[19] He, T., Doshi, D., Das, A., and Gromov, A. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In Advances in Neural Information Processing Systems 37, pp. 13244–13273, 2024

[20] Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2025

[21] Igl, M., Farquhar, G., Luketina, J., Boehmer, W., and Whiteson, S. Transient non-stationarity and generalisation in deep reinforcement learning. In International Conference on Learning Representations, 2021

[22] Karpathy, A. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

[23] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2015. Published as a conference paper at ICLR 2015. Preprint arXiv:1412.6980 https://arxiv.org/abs/1412.6980 [cs.LG]

[24] McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989

[25] Milly, P. C. D., Betancourt, J., Falkenmark, M., Hirsch, R. M., Kundzewicz, Z. W., Lettenmaier, D. P., and Stouffer, R. J. Stationarity is dead: Whither water management? Science, 319(5863):573–574, 2008

[26] Mitchell, T. M. The need for biases in learning generalizations. Technical Report CBM-TR-117, Computer Science Department, Rutgers University, 1980

[27] Nestor, B., McDermott, M. B. A., Boag, W., Berner, G., Naumann, T., Hughes, M. C., Goldenberg, A., and Ghassemi, M. Feature robustness in non-stationary health records: Caveats to deployable model performance in common clinical machine learning tasks. In Proceedings of the 4th Machine Learning for Healthcare Conference, volume 106, pp. 381–405. PMLR, 2019

[28] Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 16828–16847. PMLR, 2022

[29] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In Advances in Neural Information Proces..., 2022

[30] Panwar, M., Ahuja, K., and Goyal, N. In-context learning through the Bayesian prism. In International Conference on Learning Representations, 2024

[31] Papoudakis, G., Christianos, F., Rahman, A., and Albrecht, S. V. Dealing with non-stationarity in multi-agent deep reinforcement learning, 2019. Preprint arXiv:1906.04737 https://arxiv.org/abs/1906.04737 [cs.LG]

[32] Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019

[33] Park, C. F., Lubana, E. S., Pres, I., and Tanaka, H. Competition dynamics shape algorithmic phases of in-context learning. In International Conference on Learning Representations, 2025

[34] Pepin Lehalleur, S., Hoogland, J., Farrugia-Roberts, M., Wei, S., Gietelink Oldenziel, A., Wang, G., Carroll, L., and Murfet, D. You are what you eat – AI alignment requires understanding how data shapes structure and generalisation, 2025. Preprint arXiv:2502.05475 https://arxiv.org/abs/2502.05475 [cs.LG]

[35] Phuong, M. and Hutter, M. Formal algorithms for transformers, 2022. Preprint arXiv:2207.09238 https://arxiv.org/abs/2207.09238 [cs.LG]

[36] Ratcliff, R. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990

[37] Raventós, A., Paul, M., Chen, F., and Ganguli, S. Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. In Advances in Neural Information Processing Systems 36, pp. 14228–14246, 2023

[38] Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996

[39] Schlimmer, J. C. and Fisher, D. A case study of incremental concept induction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 5, 1986

[40] Singh, A., Chan, S., Moskovitz, T., Grant, E., Saxe, A., and Hill, F. The transient nature of emergent in-context learning in transformers. In Advances in Neural Information Processing Systems 36, pp. 27801–27819, 2024

[41] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018

[42] Sutton, R. S. and Whitehead, S. D. Online learning with random representations. In Proceedings of the Tenth International Conference on Machine Learning, pp. 314–321. Morgan Kaufmann, 1993

[43] Székely, G. J. Potential and kinetic energy in statistics. Lecture notes, Budapest Institute of Technology (Technical University), 1989. As cited in Székely and Rizzo (2013)

[44] Székely, G. J. and Rizzo, M. L. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013

[45] Thrun, S. and Mitchell, T. M. Lifelong robot learning. Robotics and Autonomous Systems, 15(1):25–46, 1995

[46] Tsymbal, A. The problem of concept drift: definitions and related work. Technical Report TCD-CS-2004-15, Department of Computer Science, Trinity College Dublin, 2004

[47] Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

[48] Wentworth, J. S. Selection theorems: A program for understanding agents. AI Alignment Forum, 2021. URL https://www.alignmentforum.org/posts/G2Lne2Fi7Qra5Lbuf

[49] Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996

[50] Wurgaft, D., Lubana, E. S., Park, C. F., Tanaka, H., Reddy, G., and Goodman, N. In-context learning strategies emerge rationally. In Advances in Neural Information Processing Systems 38, 2025