Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models
Pith reviewed 2026-05-19 21:43 UTC · model grok-4.3
The pith
Alice learns executable world models by refining failed candidate updates into hypothesis classes
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Alice is a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program.
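To make the notion of a preservation conflict concrete, here is a minimal sketch, not the paper's implementation. It assumes a world model can be treated as a function from (state, action) to next state; the names `Program`, `explains`, and `preservation_conflict`, and the toy one-dimensional world, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, List, Optional, Tuple

State = Hashable
Action = Hashable
Transition = Tuple[State, Action, State]    # (state, action, next_state)
Program = Callable[[State, Action], State]  # hypothetical executable world model


def explains(program: Program, t: Transition) -> bool:
    """True if the program predicts the observed next state."""
    state, action, next_state = t
    return program(state, action) == next_state


def preservation_conflict(
    current: Program,
    candidate: Program,
    explained: List[Transition],
    new_t: Transition,
) -> Optional[List[Transition]]:
    """Return the previously explained transitions that the candidate loses.

    Returns None when the candidate does not cover the new transition,
    and an empty list when the candidate can be accepted without conflict.
    """
    if not explains(candidate, new_t):
        return None
    return [
        t for t in explained
        if explains(current, t) and not explains(candidate, t)
    ]


# Toy world (assumption): a state is (x, kind); kind "A" moves right, kind "B" never moves.
def current(state, action):
    return state  # current program: nothing ever moves


def candidate(state, action):
    x, kind = state
    return (x + 1, kind) if action == "right" else state  # candidate: everything moves


explained = [((0, "B"), "right", (0, "B")), ((3, "B"), "right", (3, "B"))]
new_t = ((0, "A"), "right", (1, "A"))
lost = preservation_conflict(current, candidate, explained, new_t)
```

In this toy case the candidate explains the new "A" transition but loses both "B" transitions, so the conflict suggests that the current program had conflated two kinds of object that obey different dynamics.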
What carries the argument
Hypothesis class refinement from preservation conflicts, which converts failed-update signals into stratified classes that supply targeted counterexamples and steer class-aware exploration.
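One way to read "class-stratified preservation counterexamples" is as a small sample drawn evenly across hypothesis classes. The sketch below illustrates that reading only. This page does not specify how Alice assigns transitions to classes, so `class_of` is a hypothetical labeling function, and `stratified_counterexamples` and `per_class` are illustrative names.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple

Transition = Tuple[Hashable, Hashable, Hashable]  # (state, action, next_state)


def stratified_counterexamples(
    transitions: List[Transition],
    class_of: Callable[[Transition], str],  # hypothetical class assignment
    per_class: int = 1,
    seed: int = 0,
) -> List[Transition]:
    """Pick up to `per_class` transitions from every hypothesis class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for t in transitions:
        by_class[class_of(t)].append(t)

    selected = []
    for label in sorted(by_class):  # fixed order keeps the sample reproducible
        members = by_class[label]
        selected.extend(rng.sample(members, min(per_class, len(members))))
    return selected


# Toy data (assumption): the class is the object kind stored in the state.
history = [
    ((0, "A"), "right", (1, "A")),
    ((1, "A"), "right", (2, "A")),
    ((2, "A"), "right", (3, "A")),
    ((0, "B"), "right", (0, "B")),
]
sample = stratified_counterexamples(history, class_of=lambda t: t[0][1])
```

The sample holds one transition per class, so a rare class such as "B" is always represented, however many "A" transitions have been collected.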
If this is right
- Agents can induce accurate state-dependent dynamics without rule descriptions, rewards, or trustworthy lexical priors.
- Preservation conflicts from failed updates expose previously conflated dynamics in the current program.
- Hypothesis classes supply compact, stratified counterexamples that support effective model updates.
- Class-aware exploration directs attention to novel and underrepresented transitions for improved coverage.
- Both refinement and exploration components are necessary, as shown by the ablation results.
Where Pith is reading between the lines
- The conflict-refinement idea could extend to other interactive domains where labels encourage shortcut learning.
- It provides a template for using internal model inconsistencies as self-supervision signals in model-based planning.
- Testing the same mechanism on environments with continuous states or partial observability would be a direct next experiment.
Load-bearing premise
That failed candidate updates provide structural signal revealing dynamics the current program had conflated, and that refining these conflicts into hypothesis classes yields compact preservation counterexamples sufficient for effective updates.
What would settle it
Remove the class-refinement step from Alice and check whether the agent still recovers the true transition laws on the Baba in Wonderland benchmark or remains stuck on semantic shortcuts.
Original abstract
Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Alice, a closed-loop online self-supervised system for learning executable world models under prior misalignment. In the Baba in Wonderland environment (a semantically relabeled variant of Baba Is You), Alice treats preservation conflicts arising from failed candidate program updates as structural signals. These conflicts are refined into hypothesis classes that supply compact, class-stratified counterexamples for program correction and guide class-aware frontier exploration toward underrepresented transitions. Experiments claim that Alice substantially outperforms baselines in inducing dynamics that capture the true transition law, with ablations confirming contributions from both class refinement and class-aware exploration.
Significance. If the central mechanism holds, the work offers a promising direction for self-supervised induction of readable, editable, and executable dynamics models without reliance on semantic priors or rewards. The use of update conflicts to drive both correction and exploration is a distinctive contribution that could generalize to other program-synthesis or model-learning settings where surface vocabulary misaligns with underlying rules.
major comments (2)
- [Abstract and §4] Abstract and §4 (empirical evaluation): the central claim of substantial improvement under prior misalignment is stated without quantitative results, error bars, statistical tests, or per-run details; this prevents assessment of effect size, reproducibility, or whether the reported gains are driven by the hypothesized structural signal rather than implementation artifacts.
- [§3.2] §3.2 (conflict refinement): the assumption that a failed update (covering a new transition while breaking prior ones) necessarily reveals dynamics the current program had conflated, rather than syntactic artifacts of the candidate representation, is load-bearing for the method but receives no direct verification or counterexample analysis; without evidence that the resulting hypothesis classes yield compact, necessary-and-sufficient preservation counterexamples, the update loop risks introducing new errors or failing to converge.
minor comments (2)
- [§3] Notation for hypothesis classes and preservation counterexamples should be defined more explicitly with a small example early in the method section to aid readability.
- [§3.3] The description of class-aware exploration would benefit from a precise definition of 'underrepresented' (e.g., an information-theoretic or count-based criterion) rather than a high-level statement.
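As an illustration of what a count-based criterion for "underrepresented" might look like, the sketch below scores each hypothesis class by the inverse square root of its visit count. This is one common count-based choice, assumed here for illustration, and not necessarily the definition used in the paper. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def underrepresentation_scores(
    class_counts: Dict[str, int], known_classes: Iterable[str]
) -> Dict[str, float]:
    """Score each class by 1 / sqrt(1 + n), where n is its observed count."""
    return {
        c: 1.0 / math.sqrt(1 + class_counts.get(c, 0))
        for c in known_classes
    }


def pick_frontier_class(
    class_counts: Dict[str, int], known_classes: Iterable[str]
) -> str:
    """Return the class with the highest score; ties break alphabetically."""
    scores = underrepresentation_scores(class_counts, known_classes)
    return max(sorted(scores), key=scores.get)


# Toy counts (assumption): class "C" has been hypothesized but never observed.
counts = Counter({"A": 24, "B": 3})
scores = underrepresentation_scores(counts, ["A", "B", "C"])
target = pick_frontier_class(counts, ["A", "B", "C"])
```

Here the scores are 0.2 for "A", 0.5 for "B", and 1.0 for "C", so exploration would be steered toward the unobserved class "C".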
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting opportunities to strengthen the empirical claims and verify core methodological assumptions. We address each major comment below, agreeing where revisions are warranted and providing clarifications supported by the existing experimental design.
- Referee: [Abstract and §4] Abstract and §4 (empirical evaluation): the central claim of substantial improvement under prior misalignment is stated without quantitative results, error bars, statistical tests, or per-run details; this prevents assessment of effect size, reproducibility, or whether the reported gains are driven by the hypothesized structural signal rather than implementation artifacts.
  Authors: We agree that the abstract and §4 would benefit from explicit quantitative support. The experiments section already contains per-run results across multiple seeds, but these were not summarized with means, standard deviations, or significance tests in the abstract or main claims. In the revised manuscript we will add concrete metrics (e.g., transition-prediction accuracy and program-edit success rates), report means ± std over 10 independent runs, and include paired statistical tests (e.g., Wilcoxon signed-rank) comparing Alice against baselines. This will make the effect size and reproducibility transparent and help confirm that gains arise from the preservation-conflict mechanism.
  revision: yes
- Referee: [§3.2] §3.2 (conflict refinement): the assumption that a failed update (covering a new transition while breaking prior ones) necessarily reveals dynamics the current program had conflated, rather than syntactic artifacts of the candidate representation, is load-bearing for the method but receives no direct verification or counterexample analysis; without evidence that the resulting hypothesis classes yield compact, necessary-and-sufficient preservation counterexamples, the update loop risks introducing new errors or failing to converge.
  Authors: The referee correctly identifies that the interpretation of preservation conflicts as revealing conflated dynamics is central. While we do not provide an exhaustive counterexample analysis in the current draft, the class-refinement ablation demonstrates measurable gains in both final program accuracy and convergence speed, which would be unlikely if the classes were dominated by syntactic artifacts. To strengthen the claim we will add, in the revised §3.2, a short analysis of representative conflict cases from Baba in Wonderland showing that the induced hypothesis classes are both compact and necessary for restoring prior coverage. We will also report the average size of the counterexample sets and the number of refinement iterations required for convergence across runs.
  revision: partial
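The reporting promised in the first response (means ± std over runs and a paired Wilcoxon signed-rank test) could be computed roughly as follows. This is a standard-library sketch under stated assumptions: the per-seed scores are placeholder numbers invented for illustration and are not results from the paper, and the exact test enumerates every sign assignment, which is only practical for a small number of runs such as 10.

```python
from itertools import product
from statistics import mean, stdev


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped. Returns (statistic, p_value), where the
    statistic is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign assignment is equally likely.
    as_extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            as_extreme += 1
    return observed, as_extreme / 2 ** len(ranks)


# PLACEHOLDER per-seed accuracies in percent (invented, not from the paper).
placeholder_alice = [82, 79, 85, 88, 81, 84, 90, 77, 86, 83]
placeholder_baseline = [61, 66, 58, 70, 64, 59, 72, 60, 65, 63]

summary = {
    "alice": (mean(placeholder_alice), stdev(placeholder_alice)),
    "baseline": (mean(placeholder_baseline), stdev(placeholder_baseline)),
}
statistic, p_value = wilcoxon_signed_rank_exact(placeholder_alice, placeholder_baseline)
```

With 10 paired runs in which one method wins every pair, as in the placeholder data, the smallest attainable two-sided p-value is 2/1024, which shows how much resolution 10 seeds can offer. Integer percentages are used so that tied differences compare exactly.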
Circularity Check
No significant circularity; derivation relies on external interaction evidence
The paper presents an algorithmic system (Alice) that processes failed candidate updates from interaction evidence to generate hypothesis classes for counterexamples and exploration. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed result to its own inputs by construction. The central mechanism is described as driven by preservation conflicts arising from new transitions, which are external data rather than internal definitions or renamings. Ablations are referenced only to show component contributions, without indicating statistical forcing or self-referential justification. The approach is therefore self-contained against external benchmarks from the environment simulator.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The environment possesses state-dependent dynamics capturable by an executable program rather than surface semantics.
A Appendix
A.1 Baba Is You Game Rules and Dynamics. Baba Is You is a rule-manipulation puzzle game in which object behavior is determined by textual rules assemble...