pith. sign in

arxiv: 2502.07709 · v3 · submitted 2025-02-11 · 💻 cs.AI

MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces

Pith reviewed 2026-05-23 03:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords metacognitionlearning progressLLM agentsautotelic explorationgoal prioritizationcurriculum learningopen-ended learning
0
0 comments X

The pith

MAGELLAN equips LLM agents with online metacognitive predictions of learning progress to master large evolving goal spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that lets LLM agents learn to forecast their own competence and learning progress by using semantic links between goals. This metacognitive monitoring replaces the need for heavy sampling or hand-crafted goal groups, allowing the agent to focus effort where progress is highest even as the space grows or shifts. A sympathetic reader would care because open-ended agents otherwise waste time on unreachable or already-mastered goals in huge possibility spaces. If the approach works, agents can maintain efficient curricula without constant external supervision.

Core claim

MAGELLAN is a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space.

What carries the argument

MAGELLAN, the metacognitive framework that trains the LLM agent to predict its own competence and learning progress by exploiting semantic relationships among goals.

If this is right

  • Goal prioritization becomes more efficient because the agent avoids goals with low predicted progress.
  • The agent adapts its curriculum automatically when new goals appear without requiring expert re-grouping.
  • Learning progress estimation requires fewer environment samples than traditional methods.
  • Full mastery of the entire goal space becomes achievable where other approaches plateau.
  • Curriculum learning scales to open-ended, high-dimensional goal spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-prediction idea could be tested on non-LLM agents that have access to goal embeddings.
  • If semantic generalization works here, it may reduce the need for hand-designed task taxonomies in other exploration settings.
  • One could measure whether the metacognitive module itself improves when the agent is allowed to update its predictions after each episode.
  • The approach raises the question of how robust the predictions remain when the underlying LLM is swapped for a different model.

Load-bearing premise

Semantic relationships inside the LLM can be used to predict the agent's actual competence and learning progress accurately enough to guide prioritization without needing extensive new samples or expert groupings.

What would settle it

Run the agent with MAGELLAN in the same interactive environment; if it still fails to fully master the goal space or if its predicted learning progress shows no reliable correlation with measured progress, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2502.07709 by C\'edric Colas, Cl\'ement Romac, Loris Gaven, Olivier Sigaud, Pierre-Yves Oudeyer, Sylvain Lamprier, Thomas Carta.

Figure 1
Figure 1. Figure 1: Navigating large goal spaces with MAGELLAN: Dur￾ing training, our LLM agent uses MAGELLAN to estimate its past and current competence to compute absolute learning progress (ALP) on each goal. Given the per-goal ALP, the LLM agent’s goal selector chooses the next goal to practice proportionally to their ALP. The LLM agent then performs a trajectory to achieve this goal and the outcome is used to update both… view at source ↗
Figure 2
Figure 2. Figure 2: A) Little-Zoo’s tech tree. B) Little-Zoo’s goal space is composed of all the possible combinations between instructions and objects that can be in the scene. Most object configurations make an instruction infeasible (e.g. ”grow lion” is impossible with the second configuration, as water, needed to obtain plants, is missing). C) Little-Zoo provides a textual description that is given in our LLM agent’s prom… view at source ↗
Figure 3
Figure 3. Figure 3: Scaling of competence estimation error and competence estimation cost (i.e. total number of additional evaluation episodes) when increasing the goal space size. dataset (Face, 2025). In this setup, rather than training a real agent, we simulate the learning of an agent that progressively acquires skills in three categories: Algebra, then Geometry, and finally Number Theory. We compare the competence estima… view at source ↗
Figure 5
Figure 5. Figure 5: Evolution of the observed competence (SR) when eval￾uating policies on 64 training goals per category every 5000 episodes. We report the average SR over evaluated goals along with standard deviation (8 seeds). Icons indicate the average time step at which a method mastered a goal (i.e. SR > 90%). We add stars to MAGELLAN, denoting significantly earlier mastery of a category compared to the method with the … view at source ↗
Figure 4
Figure 4. Figure 4: Competence estimation on OpenR1-Math-220k. MAG￾ELLAN (blue) accurately tracks competence across Algebra, Ge￾ometry, and Number Theory, closely matching true success proba￾bilities and outperforming Online-ALP (orange). 4.2. Training an LLM agent with MAGELLAN (Q2) As demonstrated in 4.1, MAGELLAN provides a superior competence estimation than Online-ALP. We further in￾vestigate whether this improvement tra… view at source ↗
Figure 6
Figure 6. Figure 6: MAGELLAN’s LLM embedding space displayed using t-SNE with goals used in Q2 (Train) and Q3 (Test), along with the estimated success probability and linear interpolation between goals. We show the embedding space for a single seed (a) before training and (b) at the end of the 500k training steps. We see that impossible goals have been left aside, and that the other goals with a high estimated success probabi… view at source ↗
Figure 7
Figure 7. Figure 7: Adaptation tests: Using a single’s seed training of 500k episodes, we stop and replace all goals with unseen ones every 50k episodes. We then resume training and sample goals using each method for 50k training episodes. We show two isolated and representative points of goal replacement: (a) there is no ALP on any goal (after 50k training episodes), and (b) some goals (here, ”Grow carnivores” after 150k tra… view at source ↗
Figure 8
Figure 8. Figure 8: We first generate the full goal space whom the distribution is given Figure 8a then for computational reasons we sample a smaller goal space with the following distribution Figure 8b. B. Comparison of LP methods We compare prior work computing LP for automatic curriculum learning under the dimensions from Section D.2. We show the comparison in [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Different architectural choices in MAGELLAN: (a) we learn separate LoRA adapters between the policy and MAGELLAN (used in the paper); (b) we share adapters and update them using both the policy and MAGELLAN gradient; (c) we share adapters but they are only updated by the policy gradient; (d) MAGELLAN directly uses the latent representation produced by the pretrained LLM. 22 [PITH_FULL_IMAGE:figures/full_f… view at source ↗
Figure 10
Figure 10. Figure 10: Training curves of the four different possible architecture for MAGELLAN. We use 8 seeds to plot the mean and the standard deviation (shadow area around the solid line). (a) Architecture A. (b) Architecture B. (c) Architecture C. (d) Architecture D [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The LLM embedding space of MAGELLAN displayed using t-SNE with goals used in Q2 (Train) and Q3 (Test), along with MAGELLAN’s estimated success probability and linear interpolation between goals. We show the embedding space for a single seed for the four architectures described in Appendix D.1 at the end of the 500k training episodes. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Evolution of competence estimation for each ALP method on each goal category for 25k goals. We show the average competence and its standard deviation across 8 seeds that use EK-Eval-ALP to sample goals [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Evolution of competence estimation for each ALP method on each goal category for 50k goals. We show the average competence and its standard deviation across 8 seeds that use EK-Eval-ALP to sample goals. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Evolution of competence estimation for each ALP method on each goal category for 100k goals. We show the average competence and its standard deviation across 8 seeds that use EK-Eval-ALP to sample goals. D.2.2. COMPETENCE ESTIMATION ON THE BABYAI-TEXT GOAL-SPACE We replicated the experiment from Section 4.1, simulating a learning agent and estimating its competence online using MAGELLAN and Online-ALP. In… view at source ↗
Figure 15
Figure 15. Figure 15: Competence estimation on BabyAI-Text. MAGELLAN accurately tracks competence across five goal types of increasing difficulty and consistently outperforms Online-ALP. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Effect of LLM choice on MAGELLAN’s competence estimation for OpenR1-Math-220k. All models yield similar performance [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Effect of LLM choice on MAGELLAN’s competence estimation for BabyAI-Text. Performance remains consistent across models. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Evolution of average SR for each ALP method for each goal category. To get the success rate within a category, we evaluate the policy on 64 goals of this category. We use 8 seeds to plot the mean and the standard deviation. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Goal sampling strategies of MAGELLAN, EK-Online-ALP, Online-ALP. We do not take into account the 20% uniformly sampled goals from the ε-greedy exploration. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Per-goal estimation of the success probability for each method. The average result (over 8 seed) is the solid line and the shaded zone represents the standard deviation. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Chronogram of the embedding space of the seed 0, at the beginning and after mastering each type of goal. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: a illustrates the clustering of different categories of impossible goals. The impossible ”Grasp” goals form a compact cluster, while the ”Grow” goals—categorized as ”Grow plant”, ”Grow herbivore”, ”Grow carnivore”, and ”Grow furniture”—exhibit four less-defined clusters. Additionally, a large, mixed cluster contains various impossible ”Grow” goals. However, when examining the same embeddings through the l… view at source ↗
Figure 23
Figure 23. Figure 23: Adaptation tests: Using a single’s seed training of 500k episodes, we stop and replace all goals by new unseen goals every 50k episodes. We then resume training by sampling goals using each of our four methods’ ALP estimation (MAGELLAN, EK-Online-ALP, Online-ALP, Uniform) and perform 50k training episodes. We report the evolution of observed competence (SR) when evaluating the policies on 64 goals per cat… view at source ↗
Figure 24
Figure 24. Figure 24: Average sample efficiency (after κ, the length of the test) of each method average over the 10 tests. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_24.png] view at source ↗
read the original abstract

Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open-ended goal spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MAGELLAN, a metacognitive framework for LLM-based autotelic agents that learns to predict its own competence and learning progress (LP) online. By leveraging semantic relationships between goals encoded in the LLM, the method enables sample-efficient LP estimation and dynamic prioritization in large, evolving goal spaces without relying on extensive sampling or expert-defined groupings. Experiments in an interactive learning environment demonstrate improved LP prediction efficiency and goal prioritization, with the claim that MAGELLAN is the only method allowing the agent to fully master the goal space.

Significance. If the empirical results on mastery and sample efficiency hold under rigorous controls, the work would provide a concrete demonstration of how metacognitive monitoring can scale curriculum learning for open-ended LLM agents, addressing a key bottleneck in autotelic exploration. The approach of online competence prediction via LLM semantics, if validated, could influence designs for agents operating in high-dimensional goal spaces.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (Experiments): The claim that MAGELLAN is 'the only method allowing the agent to fully master a large and evolving goal space' is load-bearing for the central contribution, yet the abstract provides no quantitative details on mastery metrics (e.g., fraction of goals mastered), sample complexity curves, baseline failure modes, or statistical tests. Without these, it is impossible to verify whether the mastery gap arises from the metacognitive predictor or from other implementation differences.
  2. [§3, §4.2] §3 (MAGELLAN framework) and §4.2 (LP prediction): The key assumption that LLM semantic embeddings reliably encode competence-relevant similarities (rather than surface-level semantics) for generalization to new goals is stated but not directly tested. No ablation on embedding validation, nearest-neighbor analysis, or out-of-distribution goal performance is referenced, leaving the sample-efficiency advantage ungrounded.
minor comments (2)
  1. [Abstract] The abstract uses 'metacognitive monitoring' and 'LP prediction' without a brief definition or pointer to the formalization in §2; adding one sentence would improve accessibility.
  2. [Abstract] No mention of environment details (state space, goal generation process, or reward structure) appears in the abstract; these should be summarized in one sentence for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The claim that MAGELLAN is 'the only method allowing the agent to fully master a large and evolving goal space' is load-bearing for the central contribution, yet the abstract provides no quantitative details on mastery metrics (e.g., fraction of goals mastered), sample complexity curves, baseline failure modes, or statistical tests. Without these, it is impossible to verify whether the mastery gap arises from the metacognitive predictor or from other implementation differences.

    Authors: We agree that the abstract would be strengthened by quantitative support for the mastery claim. Section 4 reports that MAGELLAN reaches 100% goal mastery while all baselines plateau below 70%, with the difference attributable to the metacognitive predictor as shown by the ablations in §4.3. We will revise the abstract to include the mastery fractions, a reference to the sample-complexity results, and a note on the statistical comparisons performed in §4.4. revision: yes

  2. Referee: [§3, §4.2] §3 (MAGELLAN framework) and §4.2 (LP prediction): The key assumption that LLM semantic embeddings reliably encode competence-relevant similarities (rather than surface-level semantics) for generalization to new goals is stated but not directly tested. No ablation on embedding validation, nearest-neighbor analysis, or out-of-distribution goal performance is referenced, leaving the sample-efficiency advantage ungrounded.

    Authors: The framework builds on the documented ability of LLM embeddings to capture goal semantics, which is reflected in the measured improvement in LP prediction sample efficiency. We acknowledge that an explicit validation would further ground the claim and will therefore add a nearest-neighbor analysis of embedding clusters together with their predicted competence values, plus results on a held-out out-of-distribution goal set, to the revised §4.2. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical claims rest on environment interaction results

full rationale

The abstract presents MAGELLAN as a new metacognitive framework that learns competence and LP predictions online by leveraging LLM semantic relationships for generalization in evolving goal spaces. No equations, fitted parameters renamed as predictions, or self-citations are shown that would reduce the central result to its own inputs by construction. The claim of being the only method to fully master the space is presented as an empirical outcome from interactive learning experiments rather than a definitional or self-referential derivation. The derivation chain is therefore self-contained against external benchmarks of agent performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework relies on the LLM's pre-existing semantic capabilities and online RL updates, but these are not detailed.

pith-pipeline@v0.9.0 · 5727 in / 996 out tokens · 31981 ms · 2026-05-23T03:13:54.986855+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 1 internal anchor

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. M. J., Jeffrey, K., Jesmonth, S., Joshi, N. J., Julian, R. C., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quiamb...

  3. [3]

    Grounding language to autonomously-acquired skills via goal generation

    Akakzia, A., Colas, C., Oudeyer, P.-Y., Chetouani, M., and Sigaud, O. Grounding language to autonomously-acquired skills via goal generation. In International Conference on Learning Representations, 2021

  4. [4]

    and Mirolli, M

    Baldassarre, G. and Mirolli, M. Intrinsically motivated learning systems: an overview. Intrinsically motivated learning in natural and artificial systems, pp.\ 1--14, 2012

  5. [5]

    and Oudeyer, P.-Y

    Baranes, A. and Oudeyer, P.-Y. R- IAC : Robust intrinsically motivated exploration and active learning. IEEE Transactions on Autonomous Mental Development , 1 0 (3): 0 155--169, 2009. ISSN 1943-0612. doi:10.1109/TAMD.2009.2037513. Conference Name: IEEE Transactions on Autonomous Mental Development

  6. [6]

    and Oudeyer, P.-Y

    Baranes, A. and Oudeyer, P.-Y. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61 0 (1): 0 49--73, 2013. ISSN 0921-8890. doi:10.1016/j.robot.2012.05.008

  7. [7]

    Berlyne, D. E. A theory of human curiosity. British Journal of Psychology, 1954

  8. [8]

    Control what you can: Intrinsically motivated task-planning agent

    Blaes, S., Vlastelica Pogančić, M., Zhu, J., and Martius, G. Control what you can: Intrinsically motivated task-planning agent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

  9. [9]

    Grounding large language models in interactive environments with online reinforcement learning

    Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., and Oudeyer, P.-Y. Grounding large language models in interactive environments with online reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, pp.\ 3676--3713. PMLR , 2023. ISSN : 2640-3498

  10. [10]

    Stein variational goal generation for adaptive exploration in multi-goal reinforcement learning

    Castanet, N., Sigaud, O., and Lamprier, S. Stein variational goal generation for adaptive exploration in multi-goal reinforcement learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of Proceedings o...

  11. [11]

    H., and Bengio, Y

    Chevalier-Boisvert, M., Bahdanau, D., Lahlou, S., Willems, L., Saharia, C., Nguyen, T. H., and Bengio, Y. Babyai: A platform to study the sample efficiency of grounded language learning. In International Conference on Learning Representations, 2019

  12. [12]

    Multi-armed bandits for intelligent tutoring systems

    Clement, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining, 7 0 (2), 2015

  13. [13]

    CURIOUS : intrinsically motivated modular multi-goal reinforcement learning

    Colas, C., Fournier, P., Chetouani, M., Sigaud, O., and Oudeyer, P.-Y. CURIOUS : intrinsically motivated modular multi-goal reinforcement learning. In International conference on machine learning, pp.\ 1331--1340. PMLR, 2019

  14. [14]

    F., and Oudeyer, P.-Y

    Colas, C., Karch, T., Lair, N., Dussoux, J.-M., Moulin-Frier, C., Dominey, P. F., and Oudeyer, P.-Y. Language as a cognitive tool to imagine goals in curiosity-driven exploration. arXiv :2002.09253 [cs] , 2020

  15. [15]

    Language and culture internalization for human-like autotelic ai

    Colas, C., Karch, T., Moulin-Frier, C., and Oudeyer, P.-Y. Language and culture internalization for human-like autotelic ai. Nature Machine Intelligence, 4 0 (12): 0 1068--1076, 2022 a

  16. [16]

    Autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: a short survey

    Colas, C., Karch, T., Sigaud, O., and Oudeyer, P.-Y. Autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: a short survey. Journal of Artificial Intelligence Research, 74: 0 1159--1199, 2022 b

  17. [17]

    Emergent complexity and zero-shot transfer via unsupervised environment design

    Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., and Levine, S. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in Neural Information Processing Systems, 33, 2020

  18. [18]

    QL o RA : Efficient finetuning of quantized LLM s

    Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QL o RA : Efficient finetuning of quantized LLM s. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  19. [19]

    Where’s the reward? a review of reinforcement learning for instructional sequencing

    Doroudi, S., Aleven, V., and Brunskill, E. Where’s the reward? a review of reinforcement learning for instructional sequencing. International Journal of Artificial Intelligence in Education, 29: 0 568--620, 2019

  20. [20]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Face, H. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  21. [21]

    and Oudeyer, P.-Y

    Forestier, S. and Oudeyer, P.-Y. Modular active curiosity-driven discovery of tool use. In Proceedings of the 2016 IEEE / RSJ International Conference on Intelligent Robots and Systems , Proceedings of the 2016 IEEE / RSJ International Conference on Intelligent Robots and Systems, 2016

  22. [22]

    Intrinsically motivated goal exploration processes with automatic curriculum learning

    Forestier, S., Portelas, R., Mollard, Y., and Oudeyer, P.-Y. Intrinsically motivated goal exploration processes with automatic curriculum learning. Journal of Machine Learning Research, 23 0 (1), January 2022. ISSN 1532-4435

  23. [23]

    Accuracy-based curriculum learning in deep reinforcement learning, 2018

    Fournier, P., Sigaud, O., Chetouani, M., and Oudeyer, P.-Y. Accuracy-based curriculum learning in deep reinforcement learning, 2018

  24. [24]

    Sac-glam: Improving online rl for llm agents with soft actor-critic and hindsight relabeling, 2024

    Gaven, L., Romac, C., Carta, T., Lamprier, S., Sigaud, O., and Oudeyer, P.-Y. Sac-glam: Improving online rl for llm agents with soft actor-critic and hindsight relabeling, 2024

  25. [25]

    and Oudeyer, P.-Y

    Gottlieb, J. and Oudeyer, P.-Y. Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience, 19 0 (12): 0 758--770, 2018

  26. [26]

    Benchmarking the spectrum of agent capabilities

    Hafner, D. Benchmarking the spectrum of agent capabilities. In International Conference on Learning Representations, 2022

  27. [27]

    Reasoning with language model is planning with world model

    Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 8154--8173, Singapore, 2023. Association for Computational Linguistics. doi:10.18653/v1/202...

  28. [28]

    Automatic goal generation for reinforcement learning agents

    Held, D., Geng, X., Florensa, C., and Abbeel, P. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, 2017

  29. [29]

    J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

    Hu, E. J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  30. [30]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pp.\ 9118--9147. PMLR, 2022

  31. [31]

    Wordcraft: An environment for benchmarking commonsense agents

    Jiang, M., Luketina, J., Nardelli, N., Minervini, P., Torr, P., Whiteson, S., and Rockt \"a schel, T. Wordcraft: An environment for benchmarking commonsense agents. In Language in Reinforcement Learning Workshop at ICML 2020, 2020

  32. [32]

    Replay-guided adversarial environment design

    Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J., Grefenstette, E., and Rocktäschel, T. Replay-guided adversarial environment design. In Proceedings of the 35th International Conference on Neural Information Processing Systems , NIPS '21, pp.\ 1884--1897, Red Hook, NY, USA, 2021. Curran Associates Inc. ISBN 978-1-71384-539-3

  33. [33]

    General intelligence requires rethinking exploration

    Jiang, M., Rockt \"a schel, T., and Grefenstette, E. General intelligence requires rethinking exploration. Royal Society Open Science, 10 0 (6): 0 230539, 2023

  34. [34]

    The malmo platform for artificial intelligence experimentation

    Johnson, M., Hofmann, K., Hutton, T., and Bignell, D. The malmo platform for artificial intelligence experimentation. In Ijcai, volume 16, pp.\ 4246--4247, 2016

  35. [35]

    H., Houghton, B., Sampedro, R., Zhokhov, P., Baker, B., Ecoffet, A., Tang, J., Klimov, O., and Clune, J

    Kanitscheider, I., Huizinga, J., Farhi, D., Guss, W. H., Houghton, B., Sampedro, R., Zhokhov, P., Baker, B., Ecoffet, A., Tang, J., Klimov, O., and Clune, J. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft, 2021

  36. [36]

    and Oudeyer, P.-Y

    Kaplan, F. and Oudeyer, P.-Y. In search of the neural circuits of intrinsic motivation. Frontiers in neuroscience, 1: 0 9, 2007

  37. [37]

    and Hayden, B

    Kidd, C. and Hayden, B. Y. The psychology and neuroscience of curiosity. Neuron, 88 0 (3): 0 449--460, 2015

  38. [38]

    Grimgep: Learning progress for robust goal sampling in visual deep reinforcement learning

    Kovač, G., Laversanne-Finot, A., and Oudeyer, P.-Y. Grimgep: Learning progress for robust goal sampling in visual deep reinforcement learning. IEEE Transactions on Cognitive and Developmental Systems, 15 0 (3): 0 1396--1407, 2023. doi:10.1109/TCDS.2022.3216911

  39. [39]

    P., and Barry, J

    Kumar, N., Silver, T., McClinton, W., Zhao, L., Proulx, S., Lozano-Pérez, T., Kaelbling, L. P., and Barry, J. Practice makes perfect: Planning to learn skill parameter policies. In Robotics: Science and Systems (RSS), 2024

  40. [40]

    Curiosity driven exploration of learned disentangled goal spaces

    Laversanne-Finot, A., Pere, A., and Oudeyer, P.-Y. Curiosity driven exploration of learned disentangled goal spaces. In Billard, A., Dragan, A., Peters, J., and Morimoto, J. (eds.), Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pp.\ 487--504. PMLR, 29--31 Oct 2018

  41. [41]

    A., Cordrey, S

    Leonard, J. A., Cordrey, S. R., Liu, H. Z., and Mackey, A. P. Young children calibrate effort based on the trajectory of their performance. Developmental Psychology, 59 0 (3): 0 609, 2023

  42. [42]

    and Oudeyer, P.-Y

    Lopes, M. and Oudeyer, P.-Y. The strategic student approach for life-long exploration and learning. In 2012 IEEE international conference on development and learning and epigenetic robotics (ICDL), pp.\ 1--8. IEEE, 2012 a

  43. [43]

    and Oudeyer, P.-Y

    Lopes, M. and Oudeyer, P.-Y. The strategic student approach for life-long exploration and learning. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics ( ICDL ) , pp.\ 1--8, 2012 b . doi:10.1109/DevLrn.2012.6400807. ISSN : 2161-9476

  44. [44]

    Teacher–student curriculum learning, 2020

    Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. Teacher–student curriculum learning, 2020

  45. [45]

    Kinetix: Investigating the training of general agents through open-ended physics-based control tasks, 2024

    Matthews, M., Beukman, M., Lu, C., and Foerster, J. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks, 2024

  46. [46]

    and Oudeyer, P.-Y

    Moulin-Frier, C. and Oudeyer, P.-Y. Exploration strategies in developmental robotics: A unified probabilistic framework. In 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics ( ICDL ) , pp.\ 1--6, 2013. doi:10.1109/DevLrn.2013.6652535. ISSN : 2161-9476

  47. [47]

    M., and Oudeyer, P.-Y

    Moulin-Frier, C., Nguyen, S. M., and Oudeyer, P.-Y. Self-organization of early vocal development in infants and machines: the role of intrinsic motivation. Frontiers in Psychology, 4, 2014. ISSN 1664-1078. doi:10.3389/fpsyg.2013.01006. Publisher: Frontiers

  48. [48]

    and Smith, L

    Oudeyer, P.-Y. and Smith, L. B. How evolution may work through curiosity-driven developmental process. Topics in Cognitive Science, 8 0 (2): 0 492--502, 2016

  49. [49]

    Oudeyer, P.-Y., Kaplan, F., and Hafner, V. V. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation , 11 0 (2): 0 265--286, 2007. ISSN 1941-0026. doi:10.1109/TEVC.2006.890271. Conference Name: IEEE Transactions on Evolutionary Computation

  50. [50]

    Maximum Entropy Gain Exploration for Long Horizon Multi -goal Reinforcement Learning

    Pitis, S., Chan, H., Zhao, S., Stadie, B., and Ba, J. Maximum Entropy Gain Exploration for Long Horizon Multi -goal Reinforcement Learning . In Proceedings of the 37th International Conference on Machine Learning , pp.\ 7750--7761. PMLR, November 2020. ISSN: 2640-3498

  51. [51]

    B., and Hunnius, S

    Poli, F., Meyer, M., Mars, R. B., and Hunnius, S. Exploration in 4-year-old children is guided by learning progress and novelty. Child Development, 2024 a

  52. [52]

    X., Mars, R

    Poli, F., O’Reilly, J. X., Mars, R. B., and Hunnius, S. Curiosity and the dynamics of optimal exploration. Trends in Cognitive Sciences, 28 0 (5): 0 441--453, 2024 b

  53. [53]

    H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S

    Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. In International Conference on Machine Learning, 2019

  54. [54]

    Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments

    Portelas, R., Colas, C., Hofmann, K., and Oudeyer, P.-Y. Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments. In Conference on Robot Learning, pp.\ 835--853. PMLR, 2020 a

  55. [55]

    Automatic curriculum learning for deep rl: A short survey

    Portelas, R., Colas, C., Weng, L., Hofmann, K., and Oudeyer, P.-Y. Automatic curriculum learning for deep rl: A short survey. In International Joint Conference on Artificial Intelligence, 2020 b

  56. [56]

    ACES : Generating a diversity of challenging programming puzzles with autotelic generative models

    Pourcel, J., Colas, C., Molinaro, G., Oudeyer, P.-Y., and Teodorescu, L. ACES : Generating a diversity of challenging programming puzzles with autotelic generative models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  57. [57]

    Qwen2.5 Technical Report

    Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

  58. [58]

    Automated curriculum generation through setter-solver interactions

    Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., and Lillicrap, T. Automated curriculum generation through setter-solver interactions. In International Conference on Learning Representations, 2020

  59. [59]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 2020. ISSN 1533-7928

  60. [60]

    TeachMyAgent : a benchmark for automatic curriculum learning in deep RL

    Romac, C., Portelas, R., Hofmann, K., and Oudeyer, P.-Y. TeachMyAgent : a benchmark for automatic curriculum learning in deep RL . In International Conference on Machine Learning, pp.\ 9052--9063. PMLR , 2021. ISSN : 2640-3498

  61. [61]

    Learning progress mediates the link between cognitive effort and task engagement

    Sayal , C., Heling, E., and Cools, R. Learning progress mediates the link between cognitive effort and task engagement. Cognition, 236: 0 105418, 2023

  62. [62]

    PowerPlay : Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem

    Schmidhuber, J. PowerPlay : Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 4, 2013. ISSN 1664-1078. doi:10.3389/fpsyg.2013.00313. Publisher: Frontiers

  63. [63]

    Reflexion: language agents with verbal reinforcement learning

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: language agents with verbal reinforcement learning. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 8634--8652. Curran Associates, Inc., 2023

  64. [64]

    J., Perrin-Gilbert, N., and Santucci, V

    Sigaud, O., Baldassarre, G., Colas, C., Doncieux, S., Duro, R. J., Perrin-Gilbert, N., and Santucci, V. G. A definition of open-ended learning problems for goal-conditioned agents. ArXiv, abs/2311.00344, 2023

  65. [65]

    and Barto, A

    Stout, A. and Barto, A. G. Competence progress intrinsic motivation. In 2010 IEEE 9th International Conference on Development and Learning , pp.\ 257--262, 2010. doi:10.1109/DEVLRN.2010.5578835. ISSN : 2161-9476

  66. [66]

    Humans monitor learning progress in curiosity-driven exploration

    Ten, A., Kaushik, P., Oudeyer, P.-Y., and Gottlieb, J. Humans monitor learning progress in curiosity-driven exploration. Nature Communications, 12 0 (1): 0 5972, 2021. ISSN 2041-1723. doi:10.1038/s41467-021-26196-w. Publisher: Nature Publishing Group

  67. [67]

    and Hinton, G

    van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9 0 (86): 0 2579--2605, 2008

  68. [68]

    Voyager: An open-ended embodied agent with large language models, 2024

    Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models, 2024. ISSN 2835-8856

  69. [69]

    V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V

    Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. In International Conference on Learning Representations, 2019

  70. [70]

    Entropy-regularized token-level policy optimization for large language models, 2024 a

    Wen, M., Deng, C., Wang, J., Zhang, W., and Wen, Y. Entropy-regularized token-level policy optimization for large language models, 2024 a

  71. [71]

    Reinforcing LLM agents via policy optimization with action decomposition, 2024 b

    Wen, M., Wan, Z., Wang, J., Zhang, W., and Wen, Y. Reinforcing LLM agents via policy optimization with action decomposition, 2024 b

  72. [72]

    React: Synergizing reasoning and acting in language models

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2022

  73. [73]

    OMNI : Open-endedness via models of human notions of interestingness

    Zhang, J., Lehman, J., Stanley, K., and Clune, J. OMNI : Open-endedness via models of human notions of interestingness. In The Twelfth International Conference on Learning Representations, 2024

  74. [74]

    A r CH er: Training language model agents via hierarchical multi-turn RL

    Zhou, Y., Zanette, A., Pan, J., Levine, S., and Kumar, A. A r CH er: Training language model agents via hierarchical multi-turn RL . In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learni...