Can In-Context Learning Support Intrinsic Curiosity?

Blaise Agüera y Arcas; Eric Elmoznino; Guillaume Lajoie; João Sacramento; Johannes von Oswald; Rajai Nasser; Rif A. Saurous; Sangnie Bhardwaj

arxiv: 2606.19476 · v1 · pith:GZJRWWPBnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 21:20 UTC · model grok-4.3

keywords: in-context learning · intrinsic curiosity · active learning · Markov decision processes · learning progress · Bayesian experimental design

The pith

In general MDPs, in-context learning cannot provide unbiased intrinsic rewards for curiosity, but it succeeds in non-temporal active learning settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether sequence models can use their in-context learning to compute intrinsic curiosity rewards based on learning progress, bypassing costly inner-loop gradient updates. It proves that in Markov decision processes, rewards built from ICL prediction errors either introduce biasing nuisance terms or cannot be implemented without bias. In a broad class of non-temporal settings that includes active learning and Bayesian experimental design, the same ICL-derived rewards bound and asymptotically converge to the true learning progress. Controlled experiments confirm that policies trained this way achieve optimal exploration in both continuous and symbolic environments.

Core claim

In general Markov decision processes, intrinsic rewards derived from an in-context learner's prediction errors either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, in a broad subclass of non-temporal settings encompassing active learning and Bayesian experimental design, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress.

What carries the argument

Prediction errors of an in-context learner under counterfactual context manipulations, used to estimate learning progress without gradient updates.
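
The idea of scoring an observation by comparing prediction errors across manipulated contexts can be sketched as follows. This is a minimal illustration, not the paper's method: the in-context learner is replaced by a conjugate Gaussian model with an unknown mean, and the names `predictive`, `log_loss`, and `counterfactual_reward`, the probe set, and the default variances are all hypothetical choices made for this example.

```python
import math

def predictive(context, noise_var=1.0, prior_var=10.0):
    """Posterior-predictive mean and variance for a Gaussian with unknown mean.

    Stand-in for an in-context learner: it conditions only on `context`
    and performs no gradient updates (assumed model, not the paper's).
    """
    n = len(context)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * sum(context) / noise_var
    return post_mean, post_var + noise_var

def log_loss(context, target, **model):
    """Prediction error: negative log predictive density of `target`."""
    mean, var = predictive(context, **model)
    return 0.5 * math.log(2 * math.pi * var) + (target - mean) ** 2 / (2 * var)

def counterfactual_reward(context, new_obs, probes, **model):
    """Average drop in probe loss when `new_obs` is added to the context.

    The counterfactual manipulation here is simply evaluating the same
    probes with and without the new observation in the context.
    """
    loss_without = sum(log_loss(context, y, **model) for y in probes)
    loss_with = sum(log_loss(context + [new_obs], y, **model) for y in probes)
    return (loss_without - loss_with) / len(probes)

probes = [1.9, 2.0, 2.1]                              # hypothetical held-out targets
early = counterfactual_reward([], 2.0, probes)         # first observation
late = counterfactual_reward([2.0] * 50, 2.0, probes)  # fifty-first observation
print(f"early reward: {early:.4f}, late reward: {late:.4f}")
```

In this toy setting the first observation earns a much larger reward than the fifty-first, which is the qualitative behaviour a learning-progress signal is meant to have.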

If this is right

Exploration policies can be trained to maximize learning progress using only ICL prediction errors in non-temporal domains.
No inner loops of gradient descent are required to compute curiosity rewards in active learning and experimental design.
The resulting policies achieve optimal data collection in both continuous and symbolic environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The bias in temporal settings may require hybrid signals that combine ICL with explicit update-based terms.
The positive result opens a route to scale curiosity-driven data selection inside large sequence models for static selection tasks.
Similar ICL-based bounds might apply to other progress measures beyond prediction error.

Load-bearing premise

That the in-context learner's prediction errors can be directly manipulated via counterfactual context changes without introducing additional bias terms that cannot be corrected.

What would settle it

An experiment in a simple MDP showing that a policy trained on ICL-derived rewards collects data that yields strictly worse world-model predictions than a policy using the true learning-progress signal, or an active-learning task where the ICL rewards fail to converge to the optimal selection policy.

Figures

Figures reproduced from arXiv: 2606.19476 by Blaise Agüera y Arcas, Eric Elmoznino, Guillaume Lajoie, João Sacramento, Johannes von Oswald, Rajai Nasser, Rif A. Saurous, Sangnie Bhardwaj.

**Figure 1.** Our method involves using a pretrained in-context learner ρ to construct intrinsic curiosity rewards for a policy π_ϕ. Trajectories unrolled by the policy are passed to the in-context learner, and the reward can be any function of the resulting observation prediction errors on manipulated sequence contexts. We give an example for the reward r_sum, which measures the improvement in future prediction errors …

**Figure 2.** Comparison of validation log-likelihood for Gaussian Processes for the Oracle and PFN predictive models. (Left) For r_sur, validation score decreases over training steps. (Center and right) Final validation scores: r_sum and r_dl achieve equivalent performance to training on r_task, but r_sur performs significantly worse than a random policy. A fresh function f is drawn at the start of each episode. At each…

**Figure 3.** Left: The colourbar depicts noise strength of the GP. Top 10% of random paths ranked by the rewards show that r_sum is high for the noiseless tiles; conversely, r_sur favours the noisy regions. Center: BIG across trajectories generated by trained policies. r_sur has the lowest information gain due to the Noisy TV problem. Right: Zoomed comparison. Higher BIG for r_sum, r_dl, and r_task indicates that these …

**Figure 4.** Validation scores for Mastermind and Alchemy. In the standard variant, all rewards are competitive. In both the noisy variants, r_sur's performance falls below the random baseline, whereas r_sum and r_dl remain robust. Mastermind is a code-breaking game parameterized by a secret code c ∈ {0,...,C−1}^L made of coloured pegs, where L is the code length and C the number of colours. At each step, the policy subm…
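
The Noisy TV problem named in the Figure 3 caption can be illustrated with a small closed-form comparison. The sketch below is not taken from the paper: it assumes a Gaussian source with an unknown mean, treats the reward that favours noisy regions as a surprise-style signal (expected log loss of the next observation), and uses expected information gain about the mean as a stand-in for learning progress. The source names and variances are hypothetical.

```python
import math

def expected_surprise(post_var, noise_var):
    """Entropy (nats) of the Gaussian posterior predictive.

    For a calibrated learner this equals the expected log loss of the
    next observation, i.e. a surprise-style reward (assumed reading).
    """
    return 0.5 * math.log(2 * math.pi * math.e * (post_var + noise_var))

def expected_info_gain(post_var, noise_var):
    """Mutual information between the unknown mean and one new observation."""
    return 0.5 * math.log(1.0 + post_var / noise_var)

# Hypothetical sources: same prior uncertainty about the mean, different noise.
sources = {"quiet": 0.01, "noisy": 25.0}   # observation-noise variances
prior_var = 1.0

by_surprise = max(sources, key=lambda s: expected_surprise(prior_var, sources[s]))
by_info_gain = max(sources, key=lambda s: expected_info_gain(prior_var, sources[s]))
print("surprise prefers:", by_surprise)
print("information gain prefers:", by_info_gain)
```

Under these assumptions the surprise-style score ranks the noisy source first, while the information-gain score ranks the quiet source first, because observations drowned in noise reveal little about the unknown mean.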

Original abstract

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.
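
One way to write down the learning-progress idea described in the abstract is given below. The notation is an assumption made for illustration and is not taken from the paper: $c_{<t}$ is the context before step $t$, $o_t$ the newly acquired observation, $x'$ a future observation, and $p_\rho$ the in-context learner's predictive distribution.

```latex
% Hedged sketch: learning progress as the expected reduction in log loss
% on future observations once o_t is appended to the context.
\[
  \mathrm{LP}_t
  \;=\;
  \mathbb{E}_{x'}\!\left[-\log p_\rho\!\left(x' \mid c_{<t}\right)\right]
  \;-\;
  \mathbb{E}_{x'}\!\left[-\log p_\rho\!\left(x' \mid c_{<t},\, o_t\right)\right]
\]
```

Both terms are prediction errors of the same frozen model under two different contexts, so under this reading no gradient update is needed to evaluate the difference.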

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ICL can't give unbiased learning-progress rewards in MDPs but bounds and converges to them in non-temporal settings.

The key point is that ICL prediction errors cannot produce unbiased estimates of learning progress in general MDPs, either because of uncorrectable nuisance terms or because the signals cannot be implemented that way. The paper proves a positive result instead for non-temporal settings that cover active learning and Bayesian experimental design, where the ICL-derived rewards bound the true progress and converge to it asymptotically.

What is new is the separation itself: the impossibility result for MDPs and the convergence claim for the subclass are tied specifically to ICL prediction errors and counterfactual context changes. Earlier curiosity work did not have these formal distinctions. The controlled experiments in continuous and symbolic environments show that the approach can train exploration policies that match optimal behavior under the theory.

The proofs are the main strength; they rest on mathematical arguments rather than fitted parameters. The experiments serve as a direct check rather than loose illustration.

The soft spot is the scope of the positive result. Non-temporal settings are narrower than many practical data-selection problems, and the central assumption that counterfactual manipulations introduce no uncorrectable bias needs careful checking in the derivations. The experiments are deliberately controlled, which limits how far they speak to noisy or large-scale models.

This paper is for people working on exploration, active learning, or scaling curiosity signals to sequence models. Readers who want formal limits on what ICL can do for data selection will find the case split useful. It deserves peer review because the theoretical separation is sharp enough to be worth referee scrutiny even if the positive case has limited reach.

Referee Report

2 major / 2 minor

Summary. The paper claims that in-context learning (ICL) from sequence models cannot yield unbiased intrinsic rewards for learning progress in general Markov decision processes, as the resulting signals either contain uncorrectable nuisance bias terms or cannot be realized via the model's prediction errors and counterfactual context changes. It proves a positive result for a broad class of non-temporal problems (including active learning and Bayesian experimental design), where ICL-derived rewards provably bound and asymptotically converge to true learning progress. The theory is corroborated by controlled experiments in continuous and symbolic environments showing that ICL-driven policies explore optimally.

Significance. If the separation between the impossibility result for temporal MDPs and the convergence result for non-temporal settings holds, the work supplies a precise theoretical boundary on when emergent ICL capabilities can replace expensive inner-loop gradient updates for curiosity-driven data selection. The formal proofs (negative and positive) and the explicit identification of nuisance terms constitute a substantive contribution to the literature on intrinsic motivation and scalable exploration.

major comments (2)

[Impossibility result for MDPs (statement and proof)] The impossibility theorem for general MDPs rests on the claim that counterfactual context manipulations cannot be performed without introducing bias terms beyond the identified nuisance factors. The derivation of this claim (and the precise conditions under which the bias is uncorrectable) must be checked against the weakest assumption noted in the reader's report; if the manipulation step is not fully formalized, the negative result is not yet load-bearing.
[Positive result for non-temporal settings (theorem statement)] The positive convergence claim for non-temporal settings asserts that ICL prediction errors asymptotically match true learning progress. The rate and the precise sense of convergence (e.g., almost-sure, in expectation, or in probability) should be stated explicitly, together with any dependence on the size of the in-context window or the model class.

minor comments (2)

[Introduction and abstract] Notation for the intrinsic reward constructed from ICL errors should be introduced once and used consistently; several passages in the abstract and introduction use slightly varying descriptions of the same quantity.
[Experiments] The experimental section would benefit from an explicit statement of the number of random seeds and the precise definition of 'optimally' in the reported exploration performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below with clarifications and planned revisions to strengthen the formal statements.

Referee: The impossibility theorem for general MDPs rests on the claim that counterfactual context manipulations cannot be performed without introducing bias terms beyond the identified nuisance factors. The derivation of this claim (and the precise conditions under which the bias is uncorrectable) must be checked against the weakest assumption noted in the reader's report; if the manipulation step is not fully formalized, the negative result is not yet load-bearing.

Authors: We appreciate the referee's scrutiny of the negative result. The impossibility theorem (Theorem 3) is derived under the weakest assumption that the ICL model receives only the observed sequence without direct access to the underlying MDP transition kernel or external state information. Counterfactual context changes are formalized in Definition 2 and the subsequent proof, which shows that any correction for nuisance bias requires either gradient updates or information outside the in-context window. We will add an explicit remark after the theorem stating the minimal assumptions and the precise conditions rendering the bias uncorrectable, to make the formalization fully load-bearing. (Revision: yes)
Referee: The positive convergence claim for non-temporal settings asserts that ICL prediction errors asymptotically match true learning progress. The rate and the precise sense of convergence (e.g., almost-sure, in expectation, or in probability) should be stated explicitly, together with any dependence on the size of the in-context window or the model class.

Authors: We agree that the positive result (Theorem 4) can be stated more precisely. The current proof establishes convergence in expectation as the number of in-context examples tends to infinity. We will revise the theorem to specify convergence in probability at rate O(1/√n) for universal model classes, with the result holding for any fixed in-context window size under the non-temporal assumption (no dependence on window length beyond a minimum threshold). These details will be added to the theorem statement and proof sketch. (Revision: yes)

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claims consist of formal mathematical proofs establishing impossibility results for general MDPs (due to nuisance bias or non-implementability) and positive convergence results for non-temporal subclasses. These derivations rely on direct analysis of prediction errors and counterfactual manipulations rather than any fitted parameters, self-definitional reductions, or load-bearing self-citations. No step reduces the target result to its own inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard definitions from reinforcement learning and in-context learning; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (2)

[standard math] Standard MDP formulation and definition of learning progress as improvement in predictive ability. Invoked when defining the intrinsic reward and the bias terms in the impossibility proof.
[domain assumption] In-context learner can perform counterfactual context manipulations. Required for implementing the ICL-derived rewards via prediction errors.



Pith/arXiv arXiv 1902

[15] [15]

The Bell system technical journal , volume=

A mathematical theory of communication , author=. The Bell system technical journal , volume=. 1948 , publisher=

1948

[16] [16]

A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers , booktitle =

Schmidhuber, Juergen , isbn =. A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers , booktitle =. 1991 , month =. doi:10.7551/mitpress/3115.003.0030 , url =

work page doi:10.7551/mitpress/3115.003.0030 1991

[17] [17]

arXiv preprint arXiv:2506.06725 , year=

WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making , author=. arXiv preprint arXiv:2506.06725 , year=

arXiv

[18] [18]

Intrinsically motivated model learning for a developing curious agent , year=

Hester, Todd and Stone, Peter , booktitle=. Intrinsically motivated model learning for a developing curious agent , year=

[19] [19]

International Conference on Machine Learning , pages=

Provably efficient maximum entropy exploration , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019

[20] [20]

Advances in neural information processing systems , volume=

Intrinsically motivated reinforcement learning , author=. Advances in neural information processing systems , volume=

[21] [21]

Journal of Computer and System Sciences , volume=

An analysis of model-based interval estimation for Markov decision processes , author=. Journal of Computer and System Sciences , volume=. 2008 , publisher=

2008

[22] [22]

Advances in neural information processing systems , volume=

Unifying count-based exploration and intrinsic motivation , author=. Advances in neural information processing systems , volume=

[23] [23]

International conference on machine learning , pages=

Count-based exploration with neural density models , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[24] [24]

Frontiers in Psychology , volume =

Barto, Andrew and Mirolli, Marco and Baldassarre, Gianluca , month = dec, year =. Novelty or Surprise? , volume =. Frontiers in Psychology , publisher =. doi:10.3389/fpsyg.2013.00907 , language =

work page doi:10.3389/fpsyg.2013.00907 2013

[25] [25]

International conference on machine learning , pages=

Curiosity-driven exploration by self-supervised prediction , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[26] [26]

2019 , editor =

Kim, Hyoungseok and Kim, Jaekyeom and Jeong, Yeonwoo and Levine, Sergey and Song, Hyun Oh , booktitle =. 2019 , editor =

2019

[27] [27]

International Conference on Learning Representations , year=

Large-Scale Study of Curiosity-Driven Learning , author=. International Conference on Learning Representations , year=

[28] [28]

International conference on artificial general intelligence , pages=

Planning to be surprised: Optimal bayesian exploration in dynamic environments , author=. International conference on artificial general intelligence , pages=. 2011 , organization=

2011

[29] [29]

Frontiers in neural circuits , volume=

Learning and exploration in action-perception loops , author=. Frontiers in neural circuits , volume=. 2013 , publisher=

2013

[30] [30]

Vision research , volume=

Bayesian surprise attracts human attention , author=. Vision research , volume=. 2009 , publisher=

2009

[31] [31]

Advances in neural information processing systems , volume=

Vime: Variational information maximizing exploration , author=. Advances in neural information processing systems , volume=

[32] [32]

arXiv preprint arXiv:1507.00814 , year=

Incentivizing exploration in reinforcement learning with deep predictive models , author=. arXiv preprint arXiv:1507.00814 , year=

Pith/arXiv arXiv

[33] [33]

The Moment Problem , series =

Schm. The Moment Problem , series =

[34] [34]

Neural Computation , volume=

Information-based objective functions for active data selection , author=. Neural Computation , volume=

[35] [35]

Fedorov, Valerii , year =

[36] [36]

arXiv preprint arXiv:1112.5745 , year=

Bayesian active learning for classification and preference learning , author=. arXiv preprint arXiv:1112.5745 , year=

Pith/arXiv arXiv

[37] [37]

Colloques Internationaux du Centre National de la Recherche Scientifique , volume=

Application of the theory of martingales , author=. Colloques Internationaux du Centre National de la Recherche Scientifique , volume=. 1949 , organization=

1949

[38] [38]

The Annals of Mathematical Statistics , volume=

On a measure of the information provided by an experiment , author=. The Annals of Mathematical Statistics , volume=. 1956 , publisher=

1956

[39] [39]

Journal of artificial intelligence research , volume=

Adapting behavior via intrinsic reward: A survey and empirical study , author=. Journal of artificial intelligence research , volume=

[40] [40]

2005 ieee congress on evolutionary computation , volume=

Empowerment: A universal agent-centric measure of control , author=. 2005 ieee congress on evolutionary computation , volume=. 2005 , organization=

2005

[41] [41]

European Conference on Artificial Life , pages=

All else being equal be empowered , author=. European Conference on Artificial Life , pages=. 2005 , organization=

2005

[42] [42]

Empowerment--An Introduction

Salge, Christoph and Glackin, Cornelius and Polani, Daniel. Empowerment--An Introduction. Guided Self-Organization: Inception. 2014. doi:10.1007/978-3-642-53734-9_4

work page doi:10.1007/978-3-642-53734-9_4 2014

[43] [43]

Advances in neural information processing systems , volume=

Variational information maximisation for intrinsically motivated reinforcement learning , author=. Advances in neural information processing systems , volume=

[44] [44]

What can

Yuqing Du and Eliza Kosoy and Alyssa Li Dayan and Maria Rufova and Alison Gopnik and Pieter Abbeel , booktitle=. What can. 2023 , url=

2023

[45] [45]

International conference on machine learning , pages=

Planning to explore via self-supervised world models , author=. International conference on machine learning , pages=. 2020 , organization=

2020

[46] [46]

International conference on machine learning , pages=

Model-based active exploration , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[47] [47]

(No Title) , year=

Why greatness cannot be planned: The myth of the objective , author=. (No Title) , year=

[48] [48]

arXiv preprint arXiv:2408.06292 , year=

The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=

Pith/arXiv arXiv

[49] [49]

Proceedings of the 41st International Conference on Machine Learning , pages =

Position: Open-Endedness is Essential for Artificial Superhuman Intelligence , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024

[50] [50]

2024 , url=

Jenny Zhang and Joel Lehman and Kenneth Stanley and Jeff Clune , booktitle=. 2024 , url=

2024

[51] [51]

Second Agent Learning in Open-Endedness Workshop , year=

Quality diversity through human feedback , author=. Second Agent Learning in Open-Endedness Workshop , year=

[52] [52]

Quality-Diversity through

Herbie Bradley and Andrew Dai and Hannah Benita Teufel and Jenny Zhang and Koen Oostermeijer and Marco Bellagente and Jeff Clune and Kenneth Stanley and Gregory Schott and Joel Lehman , booktitle=. Quality-Diversity through. 2024 , url=

2024

[53] [53]

Transactions on Machine Learning Research , issn=

Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024

[54] [54]

arXiv preprint arXiv:1901.01753 , year=

Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions , author=. arXiv preprint arXiv:1901.01753 , year=

Pith/arXiv arXiv 1901

[55] [55]

Entropy , volume=

An information-theoretic perspective on intrinsic motivation in reinforcement learning: A survey , author=. Entropy , volume=. 2023 , publisher=

2023

[56] [56]

Forty-second International Conference on Machine Learning , year=

In-Context Learning and Occam's Razor , author=. Forty-second International Conference on Machine Learning , year=

[57] [57]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[58] [58]

Proceedings of the 41st International Conference on Machine Learning , pages =

Learning Universal Predictors , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024

[59] [59]

Advances in neural information processing systems , volume=

Meta-trained agents implement bayes-optimal agents , author=. Advances in neural information processing systems , volume=

[60] [60]

International Conference on Learning Representations , year=

An Explanation of In-context Learning as Implicit Bayesian Inference , author=. International Conference on Learning Representations , year=

[61] [61]

International Conference on Learning Representations , year=

Transformers Can Do Bayesian Inference , author=. International Conference on Learning Representations , year=

[62] [62]

Behavioral and Brain Sciences , volume=

Meta-learned models of cognition , author=. Behavioral and Brain Sciences , volume=. 2024 , publisher=

2024

[63] [63]

International Conference on Machine Learning , pages=

Statistical foundations of prior-data fitted networks , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[64] [64]

Proceedings of the 40th International Conference on Machine Learning , pages =

Transformers Learn In-Context by Gradient Descent , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

2023

[65] [65]

arXiv preprint arXiv:1611.02779 , year=

RL ^2 : Fast reinforcement learning via slow reinforcement learning , author=. arXiv preprint arXiv:1611.02779 , year=

Pith/arXiv arXiv

[66] [66]

arXiv preprint arXiv:2207.01848 , year=

Tabpfn: A transformer that solves small tabular classification problems in a second , author=. arXiv preprint arXiv:2207.01848 , year=

Pith/arXiv arXiv

[67] [67]

The Eleventh International Conference on Learning Representations , year=

In-context Reinforcement Learning with Algorithm Distillation , author=. The Eleventh International Conference on Learning Representations , year=

[68] [68]

Advances in Neural Information Processing Systems , volume=

Supervised pretraining can learn in-context reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

[69] [69]

Advances in Neural Information Processing Systems , volume=

The description length of deep learning models , author=. Advances in Neural Information Processing Systems , volume=

[70] [70]

arXiv preprint arXiv:2601.03220 , year=

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence , author=. arXiv preprint arXiv:2601.03220 , year=

arXiv

[71] [71]

Proceedings of the National Academy of Sciences , volume=

Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

2020

[72] [72]

arXiv preprint arXiv:2009.07624 , year=

Measuring information transfer in neural networks , author=. arXiv preprint arXiv:2009.07624 , year=

arXiv 2009

[73] [73]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

[74] [74]

Cognition , volume=

Children are more exploratory and learn more than adults in an approach-avoid task , author=. Cognition , volume=. 2022 , publisher=

2022

[75] [75]

Philosophical Transactions of the Royal Society B , volume=

Childhood as a solution to explore--exploit tensions , author=. Philosophical Transactions of the Royal Society B , volume=. 2020 , publisher=

2020

[76] [76]

Developmental science , volume=

Preschool children learn about causal structure from conditional interventions , author=. Developmental science , volume=. 2007 , publisher=

2007

[77] [77]

Byrnes, Steven , month = feb, year =. [

[78] [78]

2024 , eprint=

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models , author=. 2024 , eprint=

2024

[79] [79]

2021 , journal=

Alchemy: A structured task distribution for meta-reinforcement learning , author=. 2021 , journal=

2021

[80] [80]

arXiv preprint arXiv:2508.10142 , year=

Multi-turn puzzles: Evaluating interactive reasoning and strategic dialogue in llms , author=. arXiv preprint arXiv:2508.10142 , year=

arXiv