Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

DongHeun Han; HyeongYeop Kang; SeongRae Noh; SeungWon Seo

arxiv: 2605.16725 · v1 · pith:XIZ37DTTnew · submitted 2026-05-16 · 💻 cs.AI

SeungWon Seo, DongHeun Han, SeongRae Noh, HyeongYeop Kang

Pith reviewed 2026-05-19 21:43 UTC · model grok-4.3

keywords: executable world models · self-supervised dynamics discovery · prior misalignment · hypothesis classes · preservation conflicts · Baba Is You · online learning · model-based planning

The pith

Alice learns executable world models by refining failed candidate updates into hypothesis classes

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Alice for learning executable world models from raw interactions when an agent's priors are misaligned with the true environment dynamics. Executable models matter because they support reading, editing, and planning only when they encode actual transition laws rather than surface-level semantic shortcuts. Alice identifies structural signal in failed updates, where a candidate that covers new evidence breaks explanations of prior transitions, indicating conflated dynamics in the current program. It refines these conflicts into hypothesis classes that supply compact stratified counterexamples for correction and direct exploration toward novel or underrepresented transitions. Experiments on a relabeled Baba Is You environment show clear gains, with ablations confirming that both the class refinement and class-aware exploration steps are required.

Core claim

Alice is a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program.
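A preservation conflict can be pictured with a toy check. The sketch below is not the paper's implementation: the one-dimensional world, the `(position, wall_ahead)` state, and the three hand-written programs are all hypothetical stand-ins chosen for illustration. It only shows the acceptance test the claim describes, where a candidate must explain the new transition without losing transitions the current program already explained.

```python
from typing import Callable, List, Tuple

# Hypothetical toy types: a state is (position, wall_ahead) and a program
# predicts the next position after an action.
State = Tuple[int, bool]
Transition = Tuple[State, str, int]  # (state, action, next_position)
Program = Callable[[State, str], int]


def explains(program: Program, t: Transition) -> bool:
    """True if the program predicts the observed next position."""
    state, action, next_position = t
    return program(state, action) == next_position


def preservation_conflict(
    current: Program,
    candidate: Program,
    explained: List[Transition],
    new_transition: Transition,
) -> List[Transition]:
    """Return transitions the current program explained but the candidate loses.

    An empty list means the candidate preserves prior coverage.
    """
    if not explains(candidate, new_transition):
        raise ValueError("candidate does not explain the new transition")
    return [
        t for t in explained
        if explains(current, t) and not explains(candidate, t)
    ]


def always_move(state: State, action: str) -> int:
    return state[0] + 1  # current program: ignores walls


def never_move(state: State, action: str) -> int:
    return state[0]  # naive candidate: fits the new evidence only


def move_unless_wall(state: State, action: str) -> int:
    return state[0] if state[1] else state[0] + 1  # state-dependent repair


explained = [((0, False), "right", 1), ((1, False), "right", 2)]
new_transition = ((3, True), "right", 3)  # blocked by a wall

lost = preservation_conflict(always_move, never_move, explained, new_transition)
repaired = preservation_conflict(always_move, move_unless_wall, explained, new_transition)
print("lost by naive candidate:", lost)
print("lost by repaired candidate:", repaired)
```

In this toy case the rejected candidate loses exactly the free-movement transitions, which is the hint that the current program had merged the "wall ahead" and "no wall" cases into one rule.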

What carries the argument

Hypothesis class refinement from preservation conflicts, which converts failed-update signals into stratified classes that supply targeted counterexamples and steer class-aware exploration.
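One way to picture class-stratified counterexamples is as a small sample drawn per class rather than the full transition history. The sketch below rests on loud assumptions: the page does not say how Alice forms its classes, so the hand-written `signature` function, the `"lost"`/`"kept"` tags, and the `per_class` budget are hypothetical illustration devices, not the paper's mechanism.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

State = Tuple[int, bool]             # hypothetical: (position, wall_ahead)
Transition = Tuple[State, str, int]  # (state, action, next_position)
ClassKey = Tuple[str, str]           # (signature, "lost" or "kept")


def refine_classes(
    lost: List[Transition],
    kept: List[Transition],
    signature: Callable[[Transition], str],
) -> Dict[ClassKey, List[Transition]]:
    """Partition transitions by signature and by whether a rejected update lost them."""
    classes: Dict[ClassKey, List[Transition]] = defaultdict(list)
    for t in lost:
        classes[(signature(t), "lost")].append(t)
    for t in kept:
        classes[(signature(t), "kept")].append(t)
    return dict(classes)


def stratified_counterexamples(
    classes: Dict[ClassKey, List[Transition]], per_class: int = 1
) -> List[Transition]:
    """Take at most per_class transitions from every class, in a stable key order."""
    return [t for key in sorted(classes) for t in classes[key][:per_class]]


def wall_signature(t: Transition) -> str:
    # Hypothetical hand-written feature; a real system would have to derive this.
    return "wall" if t[0][1] else "free"


lost = [((0, False), "right", 1), ((1, False), "right", 2), ((2, False), "right", 3)]
kept = [((3, True), "right", 3)]

classes = refine_classes(lost, kept, wall_signature)
counterexamples = stratified_counterexamples(classes, per_class=1)
print(counterexamples)
```

The point of the stratification is that four stored transitions shrink to two counterexamples, one per class, so an update prompt stays compact while still covering each distinct behavior.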

If this is right

Agents can induce accurate state-dependent dynamics without rule descriptions, rewards, or trustworthy lexical priors.
Preservation conflicts from failed updates expose previously conflated dynamics in the current program.
Hypothesis classes supply compact, stratified counterexamples that support effective model updates.
Class-aware exploration directs attention to novel and underrepresented transitions for improved coverage.
Both refinement and exploration components are necessary, as shown by the ablation results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The conflict-refinement idea could extend to other interactive domains where labels encourage shortcut learning.
It provides a template for using internal model inconsistencies as self-supervision signals in model-based planning.
Testing the same mechanism on environments with continuous states or partial observability would be a direct next experiment.

Load-bearing premise

The method assumes that failed candidate updates provide structural signal revealing dynamics the current program had conflated, and that refining these conflicts into hypothesis classes yields compact preservation counterexamples sufficient for effective updates.

What would settle it

Remove the class-refinement step from Alice and check whether the agent still recovers the true transition laws on the Baba in Wonderland benchmark or remains stuck on semantic shortcuts.

Figures

Figures reproduced from arXiv: 2605.16725 by DongHeun Han, HyeongYeop Kang, SeongRae Noh, SeungWon Seo.

**Figure 1.** Alice in Baba in Wonderland. Left: simulator dynamics are preserved while rule-property labels are semantically remapped, removing lexical shortcuts. Right: exploration exposes prediction errors that drive successive executable world-model revisions. … confronts regimes without such scaffolds, but typically reasons inside the Large Language Model (LLM) rather than synthesizing a persistent program [29, 36]. …

**Figure 2.** Hypothesis-class refinement. Rejected target-explaining updates reveal lost transitions, split …

**Figure 3.** Update-guided frontier scoring. The Explorer selects frontier candidates by combining …

**Figure 4.** Left: hypothesis-class ablations over cumulative LLM calls; bars show calls before the …

**Figure 5.** Inductive program update prompt template.

**Figure 6.** t-SNE visualization of the Explorer’s learned state-action embedding space. Points …

**Figure 7.** Success Case 1 from Baba in Wonderland: a push chain under misleading rule-property labels. Each method panel shows the previous state, expected next state, and the method’s predicted next state. Panels: Previous State, Expected Next State, Alice Next State, CWM Next State, GIF-MCTS Next State, WorldCoder Next State.

**Figure 8.** Success Case 2 from Baba in Wonderland: simultaneous movement of multiple YOU objects with one controlled object blocked by a stop object. Each method panel shows the previous state, expected next state, and the method’s predicted next state. Failure Cases. A representative failure case is shown in the top row of …

**Figure 9.** Failure case studies from Baba in Wonderland. … motion initiated by YOU-controlled objects. In this case, however, the relevant overlap is created by a STAR that is displaced by another pushed object, not by BABA itself. As a result, the learned program misses the OPEN–SHUT resolution that the simulator applies, illustrating how Alice can fit a compact but overly narrow causal account of collision dynamics. …

**Figure 10.** Actual rejected candidate update from Alice in …

Original abstract

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.
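The relabeling idea in the abstract can be sketched as a bijective remap of property words. This is an illustration under stated assumptions, not the benchmark's code: the replacement words (`LAMP`, `RIVER`, `CLOUD`, `SPOON`), the helper names, and the seeded shuffle are hypothetical, and the actual remapping used in Baba in Wonderland may be constructed differently.

```python
import random
from typing import Dict, List, Tuple

Rule = Tuple[str, str, str]  # (noun, "IS", property)

# Property words from Baba Is You, and hypothetical unrelated replacements.
PROPERTIES = ["YOU", "PUSH", "STOP", "WIN"]
SURFACE_WORDS = ["LAMP", "RIVER", "CLOUD", "SPOON"]


def make_remap(words: List[str], surface: List[str], seed: int = 0) -> Dict[str, str]:
    """Build a one-to-one map from true property words to surface words."""
    if len(words) != len(surface):
        raise ValueError("remap must be one-to-one")
    rng = random.Random(seed)
    shuffled = surface[:]
    rng.shuffle(shuffled)
    return dict(zip(words, shuffled))


def invert(remap: Dict[str, str]) -> Dict[str, str]:
    return {shown: true for true, shown in remap.items()}


def relabel_rules(rules: List[Rule], remap: Dict[str, str]) -> List[Rule]:
    """Rewrite only the property word; nouns and rule structure are unchanged."""
    return [(noun, verb, remap[prop]) for noun, verb, prop in rules]


rules = [("BABA", "IS", "YOU"), ("WALL", "IS", "STOP"), ("ROCK", "IS", "PUSH")]
remap = make_remap(PROPERTIES, SURFACE_WORDS, seed=0)

shown_to_agent = relabel_rules(rules, remap)            # what the agent observes
recovered = relabel_rules(shown_to_agent, invert(remap))  # what the simulator acts on
print(shown_to_agent)
```

Because the map is invertible, the simulator can keep applying the original transition law while the agent sees only words that carry no useful prior, which is the separation of dynamics from vocabulary that the abstract describes.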

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Alice turns preservation conflicts from failed updates into hypothesis classes for refinement and exploration in executable world models under prior misalignment, which is a concrete step forward but rests on assumptions that need more visible validation in the results.

Hi,

The one or two things to know about this paper are that it proposes Alice, a system that uses failed candidate updates as a signal to refine dynamics into hypothesis classes for better preservation and exploration in learning executable world models, and that it tests this in a prior-misaligned Baba Is You variant called Baba in Wonderland. What is new is the specific way it turns preservation conflicts into class-stratified counterexamples and uses that to guide frontier exploration toward novel transitions. This differs from typical self-supervised dynamics learning or model-based RL by explicitly handling the case where priors are misaligned and no rule descriptions or rewards are available.

The paper does well in setting up an evaluation that isolates the misalignment issue while keeping the underlying simulator dynamics intact, which makes the results more relevant to building reusable models for planning. The ablations are a plus, as they indicate that both the class refinement and the class-aware exploration play a role in the reported improvements.

Where it is softer is in the details of the results. The abstract mentions substantial improvement and contributing ablations but does not include specific metrics, error bars, or comparisons that would let us see the magnitude or reliability of the gains. More importantly, the key assumption that these conflicts reveal dynamics the program had conflated in a way that leads to compact and effective refinements needs checking against the actual data. It is possible that in the Baba rule space this works, but if the conflicts are more syntax-dependent than transition-law-dependent, the method might not generalize as hoped. The stress-test concern about whether the properties hold is worth keeping in mind until the full experiments are reviewed.

Overall, this paper is for people interested in executable and editable world models, particularly in domains like games or robotics where agents must discover dynamics from scratch under imperfect priors. A reader who works on model-based planning or self-supervised learning in structured environments would get value from the approach and the testbed. It deserves a serious referee because it tackles a practical challenge with a novel closed-loop mechanism and provides an evaluation setup that can be scrutinized. I would recommend engaging with the work through peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Alice, a closed-loop online self-supervised system for learning executable world models under prior misalignment. In the Baba in Wonderland environment (a semantically relabeled variant of Baba Is You), Alice treats preservation conflicts arising from failed candidate program updates as structural signals. These conflicts are refined into hypothesis classes that supply compact, class-stratified counterexamples for program correction and guide class-aware frontier exploration toward underrepresented transitions. Experiments claim that Alice substantially outperforms baselines in inducing dynamics that capture the true transition law, with ablations confirming contributions from both class refinement and class-aware exploration.

Significance. If the central mechanism holds, the work offers a promising direction for self-supervised induction of readable, editable, and executable dynamics models without reliance on semantic priors or rewards. The use of update conflicts to drive both correction and exploration is a distinctive contribution that could generalize to other program-synthesis or model-learning settings where surface vocabulary misaligns with underlying rules.

major comments (2)

[Abstract and §4] (empirical evaluation): the central claim of substantial improvement under prior misalignment is stated without quantitative results, error bars, statistical tests, or per-run details; this prevents assessment of effect size, reproducibility, or whether the reported gains are driven by the hypothesized structural signal rather than implementation artifacts.
[§3.2] (conflict refinement): the assumption that a failed update (covering a new transition while breaking prior ones) necessarily reveals dynamics the current program had conflated, rather than syntactic artifacts of the candidate representation, is load-bearing for the method but receives no direct verification or counterexample analysis; without evidence that the resulting hypothesis classes yield compact, necessary-and-sufficient preservation counterexamples, the update loop risks introducing new errors or failing to converge.

minor comments (2)

[§3] Notation for hypothesis classes and preservation counterexamples should be defined more explicitly with a small example early in the method section to aid readability.
[§3.3] The description of class-aware exploration would benefit from a precise definition of 'underrepresented' (e.g., an information-theoretic or count-based criterion) rather than a high-level statement.
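The count-based criterion suggested in the last minor comment could look like the sketch below. It is only one possible reading: the inverse-square-root score, the class names, and the counts are hypothetical, and the page does not state how the paper's Explorer actually scores frontier candidates.

```python
import math
from collections import Counter
from typing import List, Mapping, Tuple

Candidate = Tuple[str, str]  # (candidate_name, predicted_class_id)


def underrepresentation_score(class_id: str, class_counts: Mapping[str, int]) -> float:
    """Score in (0, 1]: 1.0 for an unseen class, shrinking as the class fills up."""
    return 1.0 / math.sqrt(1 + class_counts.get(class_id, 0))


def rank_frontier(
    candidates: List[Candidate], class_counts: Mapping[str, int]
) -> List[Candidate]:
    """Order frontier candidates so the least-represented classes come first."""
    return sorted(
        candidates,
        key=lambda c: underrepresentation_score(c[1], class_counts),
        reverse=True,
    )


# Hypothetical counts of stored transitions per hypothesis class.
class_counts = Counter({"free_move": 24, "blocked": 3})
candidates = [("a", "free_move"), ("b", "blocked"), ("c", "push_chain")]

ranked = rank_frontier(candidates, class_counts)
print(ranked)
```

Under this reading, "novel" and "underrepresented" fall on one scale: a class with no stored transitions gets the maximum score, and a heavily sampled class is visited last.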

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting opportunities to strengthen the empirical claims and verify core methodological assumptions. We address each major comment below, agreeing where revisions are warranted and providing clarifications supported by the existing experimental design.

Referee: [Abstract and §4] (empirical evaluation): the central claim of substantial improvement under prior misalignment is stated without quantitative results, error bars, statistical tests, or per-run details; this prevents assessment of effect size, reproducibility, or whether the reported gains are driven by the hypothesized structural signal rather than implementation artifacts.

Authors: We agree that the abstract and §4 would benefit from explicit quantitative support. The experiments section already contains per-run results across multiple seeds, but these were not summarized with means, standard deviations, or significance tests in the abstract or main claims. In the revised manuscript we will add concrete metrics (e.g., transition-prediction accuracy and program-edit success rates), report means ± std over 10 independent runs, and include paired statistical tests (e.g., Wilcoxon signed-rank) comparing Alice against baselines. This will make the effect size and reproducibility transparent and help confirm that gains arise from the preservation-conflict mechanism.

revision: yes

Referee: [§3.2] (conflict refinement): the assumption that a failed update (covering a new transition while breaking prior ones) necessarily reveals dynamics the current program had conflated, rather than syntactic artifacts of the candidate representation, is load-bearing for the method but receives no direct verification or counterexample analysis; without evidence that the resulting hypothesis classes yield compact, necessary-and-sufficient preservation counterexamples, the update loop risks introducing new errors or failing to converge.

Authors: The referee correctly identifies that the interpretation of preservation conflicts as revealing conflated dynamics is central. While we do not provide an exhaustive counterexample analysis in the current draft, the class-refinement ablation demonstrates measurable gains in both final program accuracy and convergence speed, which would be unlikely if the classes were dominated by syntactic artifacts. To strengthen the claim we will add, in the revised §3.2, a short analysis of representative conflict cases from Baba in Wonderland showing that the induced hypothesis classes are both compact and necessary for restoring prior coverage. We will also report the average size of the counterexample sets and the number of refinement iterations required for convergence across runs.

revision: partial
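The reporting promised in the first response (mean ± std over runs plus a paired Wilcoxon signed-rank comparison) can be sketched in plain Python. The per-run scores below are made-up placeholders with no relation to the paper's results, and only the test statistic is computed here; a statistics library such as SciPy (`scipy.stats.wilcoxon`) would normally be used to obtain a p-value.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def wilcoxon_statistic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, int]:
    """Return (W, n): the smaller signed-rank sum and the number of non-zero pairs.

    Zero differences are dropped and tied absolute differences share an average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), len(diffs)


# PLACEHOLDER per-seed scores (correct predictions out of 100), invented for
# illustration only; they are not results from the paper.
placeholder_method = [81, 78, 85, 80, 83]
placeholder_baseline = [70, 72, 71, 75, 69]

w, n = wilcoxon_statistic(placeholder_method, placeholder_baseline)
print(f"method:   {mean(placeholder_method):.1f} ± {stdev(placeholder_method):.1f}")
print(f"baseline: {mean(placeholder_baseline):.1f} ± {stdev(placeholder_baseline):.1f}")
print(f"Wilcoxon W = {w} over n = {n} non-zero pairs")
```

A paired test fits this setting because each seed yields one score per method, so the comparison is made within seeds rather than across two independent samples.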

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external interaction evidence

full rationale

The paper presents an algorithmic system (Alice) that processes failed candidate updates from interaction evidence to generate hypothesis classes for counterexamples and exploration. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed result to its own inputs by construction. The central mechanism is described as driven by preservation conflicts arising from new transitions, which are external data rather than internal definitions or renamings. Ablations are referenced only to show component contributions, without indicating statistical forcing or self-referential justification. The approach is therefore self-contained against external benchmarks from the environment simulator.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that executable programs can capture state-dependent transition laws and that conflicts in explanatory coverage provide useful inductive signal for refinement.

axioms (1)

domain assumption: The environment possesses state-dependent dynamics capturable by an executable program rather than surface semantics.
Stated in the problem setup for executable world models under prior misalignment.

pith-pipeline@v0.9.0 · 5730 in / 1144 out tokens · 52003 ms · 2026-05-19T21:43:43.867594+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 6 internal anchors

[1] Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038, 2020.

[2] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

[3] Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. Generating code world models with large language models guided by monte carlo tree search. Advances in Neural Information Processing Systems, 37:60429–60474, 2024.

[4] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.

[5] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

[6] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.

[7] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pages 2555–2565. PMLR, 2019.

[8] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

[9] Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023.

[10] Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023.

[11] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29, 2016.

[12] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pages 9118–9147. PMLR, 2022.

[13] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.

[14] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems 33, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html

[15] Kuno Kim, Megumi Sano, Julian De Freitas, Nick Haber, and Daniel Yamins. Active world model learning with progress curiosity. In International conference on machine learning, pages 5306–5315. PMLR, 2020.

[16] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 34478–34491. Curran Associates, Inc., 2022. U...

[17] Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, et al. Code world models for general game playing. arXiv preprint arXiv:2510.04542, 2025.

[18] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023.

[19] Hao Liu and Pieter Abbeel. Behavior from the void: Unsupervised active pre-training. Advances in Neural Information Processing Systems, 34:18459–18473, 2021.

[20] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024.

[21] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. Advances in neural information processing systems, 37:74325–74362, 2024.

[22] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–

[23] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019.

[24] Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, and Kevin Ellis. Poe-world: Compositional world modeling with products of programmatic experts. arXiv preprint arXiv:2505.10819, 2025.

[25] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International conference on machine learning, pages 8583–8592. PMLR, 2020.

[26] SeungWon Seo, SeongRae Noh, Junhyeok Lee, SooBin Lim, Won Hee Lee, and HyeongYeop Kang. Reveca: Adaptive planning and trajectory-based validation in cooperative language agents using information relevance and relative proximity. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23295–23303, 2025.

[27] SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, and HyeongYeop Kang. From assumptions to actions: Turning llm reasoning into uncertainty-aware planning for embodied agents. arXiv preprint arXiv:2602.04326, 2026.

[28] Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment. Advances in Neural Information Processing Systems, 37:70148–70212, 2024.

[29] Xiaojuan Tang, Jiaqi Li, Yitao Liang, Song-chun Zhu, Muhan Zhang, and Zilong Zheng. Mars: Situated inductive reasoning in an open-world environment. Advances in Neural Information Processing Systems, 37:17830–17869, 2024.

[30] Shanchuan Wan, Yujin Tang, Yingtao Tian, and Tomoyuki Kaneko. Deir: efficient and robust exploration through discriminative-model-based episodic intrinsic rewards. arXiv preprint arXiv:2304.10770, 2023.

[31] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[32] Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. Can language models serve as text-based world simulators? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–17, 2024.

[33] Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025.

[34] Chunqiu Steven Xia and Lingming Zhang. Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. arXiv preprint arXiv:2304.00385, 2023.

[35] Manjie Xu, Isabella Yin, Xinyi Tu, Chi Zhang, and Yixin Zhu. Code over words: Overcoming semantic inertia via code-grounded reasoning. arXiv preprint arXiv:2601.18352, 2026.

[36] Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, and Chengqi Zhang. Wall-e 2.0: World alignment by neurosymbolic learning improves world model-based llm agents. arXiv preprint arXiv:2504.15785, 2025.

[37] Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Thanh Le-Cong, Junda He, Bach Le, and David Lo. Patchzero: Zero-shot automatic patch correctness assessment. arXiv preprint arXiv:2303.00202, 2023.

A Appendix

A.1 Baba Is You Game Rules and Dynamics. Baba Is You is a rule-manipulation puzzle game in which object behavior is determined by textual rules assemble...


work page 2016

[12] [12]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

work page 2022

[13] [13]

Reward-free exploration for reinforcement learning

Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. InInternational Conference on Machine Learning, pages 4870–4879. PMLR, 2020

work page 2020

[14] [14]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InAdvances in Neural Information Processing Systems 33, 2020. URL https://proceedings.neurips. cc/paper/2020/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html

work page 2020

[15] [15]

Active world model learning with progress curiosity

Kuno Kim, Megumi Sano, Julian De Freitas, Nick Haber, and Daniel Yamins. Active world model learning with progress curiosity. InInternational conference on machine learning, pages 5306–5315. PMLR, 2020. 10

work page 2020

[16] [16]

Unsupervised reinforcement learning with contrastive intrinsic control

Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems, volume 35, pages 34478–34491. Curran Associates, Inc., 2022. U...

work page 2022

[17] [17]

Code world models for general game playing.arXiv preprint arXiv:2510.04542, 2025

Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, et al. Code world models for general game playing.arXiv preprint arXiv:2510.04542, 2025

work page arXiv 2025

[18] [18]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

work page 2023

[19] [19]

Behavior from the void: Unsupervised active pre-training.Advances in Neural Information Processing Systems, 34:18459–18473, 2021

Hao Liu and Pieter Abbeel. Behavior from the void: Unsupervised active pre-training.Advances in Neural Information Processing Systems, 34:18459–18473, 2021

work page 2021

[20] [20]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

work page 2024

[21] [21]

Agentboard: An analytical evaluation board of multi-turn llm agents.Advances in neural information processing systems, 37:74325–74362, 2024

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents.Advances in neural information processing systems, 37:74325–74362, 2024

work page 2024

[22] [22]

Curiosity-driven exploration by self-supervised prediction

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. InInternational conference on machine learning, pages 2778–

work page

[23] [23]

Self-supervised exploration via disagree- ment

Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagree- ment. InInternational conference on machine learning, pages 5062–5071. PMLR, 2019

work page 2019

[24] [24]

Poe-world: Compositional world modeling with products of programmatic experts.arXiv preprint arXiv:2505.10819, 2025

Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, and Kevin Ellis. Poe-world: Compositional world modeling with products of programmatic experts.arXiv preprint arXiv:2505.10819, 2025

work page arXiv 2025

[25] [25]

Planning to explore via self-supervised world models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. InInternational conference on machine learning, pages 8583–8592. PMLR, 2020

work page 2020

[26] [26]

Reveca: Adaptive planning and trajectory-based validation in cooperative language agents using information relevance and relative proximity

SeungWon Seo, SeongRae Noh, Junhyeok Lee, SooBin Lim, Won Hee Lee, and HyeongYeop Kang. Reveca: Adaptive planning and trajectory-based validation in cooperative language agents using information relevance and relative proximity. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23295–23303, 2025

work page 2025

[27] [27]

From assumptions to actions: Turning llm reasoning into uncertainty-aware planning for embodied agents.arXiv preprint arXiv:2602.04326, 2026

SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, and HyeongYeop Kang. From assumptions to actions: Turning llm reasoning into uncertainty-aware planning for embodied agents.arXiv preprint arXiv:2602.04326, 2026

work page arXiv 2026

[28] [28]

Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment.Advances in Neural Information Processing Systems, 37:70148–70212, 2024

work page 2024

[29] [29]

Mars: Situated inductive reasoning in an open-world environment.Advances in Neural Information Processing Systems, 37:17830–17869, 2024

Xiaojuan Tang, Jiaqi Li, Yitao Liang, Song-chun Zhu, Muhan Zhang, and Zilong Zheng. Mars: Situated inductive reasoning in an open-world environment.Advances in Neural Information Processing Systems, 37:17830–17869, 2024

work page 2024

[30] [30]

Deir: efficient and robust exploration through discriminative-model-based episodic intrinsic rewards.arXiv preprint arXiv:2304.10770, 2023

Shanchuan Wan, Yujin Tang, Yingtao Tian, and Tomoyuki Kaneko. Deir: efficient and robust exploration through discriminative-model-based episodic intrinsic rewards.arXiv preprint arXiv:2304.10770, 2023. 11

work page arXiv 2023

[31] [31]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. Can language models serve as text-based world simulators? InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–17, 2024

work page 2024

[33] [33]

Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

work page arXiv 2025

[34] [34]

Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt.arXiv preprint arXiv:2304.00385, 2023

Chunqiu Steven Xia and Lingming Zhang. Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt.arXiv preprint arXiv:2304.00385, 2023

work page arXiv 2023

[35] [35]

Code over words: Overcoming semantic inertia via code-grounded reasoning.arXiv preprint arXiv:2601.18352, 2026

Manjie Xu, Isabella Yin, Xinyi Tu, Chi Zhang, and Yixin Zhu. Code over words: Overcoming semantic inertia via code-grounded reasoning.arXiv preprint arXiv:2601.18352, 2026

work page arXiv 2026

[36] [36]

Wall-e 2.0: World alignment by neurosymbolic learning improves world model-based llm agents.arXiv preprint arXiv:2504.15785, 2025

Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, and Chengqi Zhang. Wall-e 2.0: World alignment by neurosymbolic learning improves world model-based llm agents.arXiv preprint arXiv:2504.15785, 2025

work page arXiv 2025

[37] [37]

type": "rule_noun

Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Thanh Le-Cong, Junda He, Bach Le, and David Lo. Patchzero: Zero-shot automatic patch correctness assessment.arXiv preprint arXiv:2303.00202, 2023. 12 A Appendix A.1 Baba Is You Game Rules and Dynamics.Baba Is You is a rule-manipulation puzzle game in which object behavior is determined by textual rules assemble...

work page arXiv 2023