
arxiv: 2605.21800 · v1 · pith:D2A6PAUTnew · submitted 2026-05-20 · 💻 cs.LG · cs.RO

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords world models, reproducibility, benchmarking, data pipelines, open source platform, machine learning, generalization, reinforcement learning

The pith

stable-worldmodel unifies data pipelines, baselines, and benchmarks under one framework to cut research overhead for world models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces stable-worldmodel as an open-source platform that addresses fragmentation in world model research caused by disparate codebases, slow data loading, and missing standardized benchmarks. It supplies a high-performance Lance-based data layer with conversion tools for MP4, HDF5, and LeRobot datasets, clean implementations of modern baselines and planning solvers, and environments extended with controllable visual, geometric, and physical factors. A sympathetic reader would care because these pieces together let researchers run reproducible experiments and fair comparisons without rebuilding pipelines from scratch each time. If the unification works as claimed, it would lower the barrier to developing agents that reason, plan, and generalize beyond training data.

Core claim

The authors state that unifying the full pipeline under a single scalable framework dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models. The data layer, baseline implementations, and extended environments are the concrete means of achieving standardized and reproducible evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization.
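
To make "evaluation of dynamics understanding" a little more concrete, the sketch below computes a trajectory-level prediction MSE and groups it by plan outcome, in the spirit of the paper's Figure 4. The episode dictionary layout and the names `trajectory_mse` and `mse_by_outcome` are assumptions made for illustration and are not taken from swm; the two toy episodes are hand-written inputs, not results.

```python
from statistics import fmean


def trajectory_mse(predicted, actual):
    """Mean squared error over all steps and dimensions of one trajectory."""
    if len(predicted) != len(actual):
        raise ValueError("trajectories must have the same length")
    errors = [
        (p - a) ** 2
        for pred_step, true_step in zip(predicted, actual)
        for p, a in zip(pred_step, true_step)
    ]
    return fmean(errors)


def mse_by_outcome(episodes):
    """Group per-trajectory MSE by whether the plan succeeded."""
    groups = {True: [], False: []}
    for episode in episodes:
        mse = trajectory_mse(episode["predicted"], episode["actual"])
        groups[episode["success"]].append(mse)
    return groups


# Hand-written toy episodes (hypothetical layout), only to exercise the code.
episodes = [
    {"predicted": [[0.0, 0.0], [1.0, 1.0]],
     "actual": [[0.0, 0.0], [1.0, 1.0]],
     "success": True},
    {"predicted": [[0.0, 0.0], [2.0, 0.0]],
     "actual": [[0.0, 0.0], [1.0, 1.0]],
     "success": False},
]
groups = mse_by_outcome(episodes)
print(groups)
```

Keeping the metric per trajectory rather than per step is what allows the two outcome groups to be compared as distributions.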

What carries the argument

The stable-worldmodel platform itself, which integrates a Lance-based data layer for fast native support and conversion across dataset formats, clean baseline implementations, and environments with controllable factors of variation for systematic testing.

If this is right

  • Native support and conversion tools for MP4, HDF5, and LeRobot datasets remove the need for custom video loaders in most experiments.
  • Well-tested baseline implementations and planning solvers let researchers focus effort on novel components rather than reimplementation.
  • Environments with controllable visual, geometric, and physical factors enable systematic measurement of out-of-distribution generalization.
  • The single framework makes it straightforward to reproduce and compare results across different research groups.
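
As a rough illustration of what "systematic measurement" over controllable factors could look like, the sketch below enumerates evaluation configurations that perturb exactly one factor at a time. Everything here is hypothetical: the `Factors` fields, the `OOD_VALUES` table, and `one_factor_sweep` are illustrative names, not swm's API, and the factor values are placeholders.

```python
from dataclasses import dataclass, replace
from typing import Iterator


@dataclass(frozen=True)
class Factors:
    """Hypothetical factor settings; field names are illustrative only."""
    background_color: str = "white"  # visual factor
    object_scale: float = 1.0        # geometric factor
    friction: float = 1.0            # physical factor


# Placeholder out-of-distribution values for each factor (assumption).
OOD_VALUES = {
    "background_color": ["red", "green"],
    "object_scale": [0.5, 1.5],
    "friction": [0.2, 2.0],
}


def one_factor_sweep(base: Factors) -> Iterator[tuple[str, Factors]]:
    """Yield (factor_name, config) pairs that change exactly one factor."""
    for name, values in OOD_VALUES.items():
        for value in values:
            yield name, replace(base, **{name: value})


sweep = list(one_factor_sweep(Factors()))
for name, config in sweep:
    print(name, config)
```

Changing one factor per configuration keeps any drop in performance attributable to that factor; a joint sweep over all factors would trade that attribution for broader coverage.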

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unification approach could be extended to real-robot data streams to test whether the platform's benefits transfer beyond simulation.
  • Neighbouring areas such as model-based reinforcement learning may adopt similar standardized layers to address their own reproducibility gaps.
  • A natural next measurement would be to track how many new papers cite or build on the platform's environments for their generalization claims.

Load-bearing premise

The provided Lance-based data layer, baseline implementations, and extended environments with controllable factors will be sufficient for systematic evaluation and fair comparison without requiring substantial additional custom engineering by users.

What would settle it

A direct comparison study in which independent teams implement the same new world model both inside and outside the platform and measure total engineering time plus result consistency would settle whether the claimed reduction in overhead holds.

Figures

Figures reproduced from arXiv: 2605.21800 by Ayush Chaurasia, Damien Scieur, Dan Haramati, Francesco Capuano, Lucas Maes, Luiz Facury, Nassim Massaudi, Quentin Le Lidec, Randall Balestriero, Richard Gao, Taj Gillin, Yann LeCun.

Figure 1
Overview of stable-worldmodel: data is efficiently collected from a world and used to train world models via provided baselines, then leveraged by solvers for control. The idea of using predictive models to guide decision-making dates back to the 1960s-1970s from the control theory community [1, 2]. Most of these approaches relied on analytical, closed-form models or hand-crafted simulators to predict the…
Figure 2
Environment families supported by swm. Top row: default (unperturbed) renderings of each environment. Bottom row: all visual factors of variation (e.g., agent, object, scene, geometry, lighting) jointly perturbed. Dynamic physical parameters (e.g., mass, density, gravity, or friction) can also be modified, but are omitted here as they are not visible in a single frame. standardized benchmarks, and reproduc…
Figure 3
Performance comparison of different data formats for a dataset from the Push-T environment.
Figure 4
Distribution of trajectory-level prediction MSE for successful (blue) and failed (red) plans
Figure 5
Robustness analysis on Push-T. Left: effect of visual distractors. Right: planning success
Figure 6
Examples of visual perturbations that can be applied on-the-fly to any supported envi
Figure 7
Factors of variation supported per environment. Light blue bars indicate total factors
Figure 8
Performance comparison of different data formats for a dataset from the Two-Room
Figure 9
Push-T prediction error under increasing distribution shift for
Figure 10
PLDM counterpart of Fig
Figure 11
Planning success rate of LeWM on Push-T as a function of background color intensity
Figure 12
Benchmarking stable-worldmodel against standard baselines across diverse environments. Solid lines depict the mean episode reward over 5 random seeds, while shaded areas denote the standard deviation. TD-MPC2 consistently achieves faster convergence in continuous control tasks. [Plot residue, apparently from the PCA figure that follows: axes PC1 (44.7% var) and PC2 (10.0% var); legend text: Expert Actor]
Figure 13
PCA projection of TD-MPC2’s latent state space on Push-T. Gray points show the
read the original abstract

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.
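
For readers unfamiliar with what a "planning solver" does with a world model, here is a minimal sketch of the cross-entropy method, which appears in the paper's reference list [23]. It is not swm's implementation: `toy_model` is a hypothetical 1-D integrator standing in for a learned model, and `cem_plan`, its parameters, and the cost function are assumptions chosen for brevity.

```python
import random
import statistics


def toy_model(state: float, action: float) -> float:
    """Stand-in for a learned world model: a 1-D integrator (assumption)."""
    return state + action


def rollout_cost(state: float, actions: list[float], goal: float) -> float:
    """Roll the model forward and score the final state against the goal."""
    for action in actions:
        state = toy_model(state, action)
    return (state - goal) ** 2


def cem_plan(state, goal, horizon=5, samples=64, elites=8, iters=10, seed=0):
    """Cross-entropy method over open-loop action sequences."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from a diagonal Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        # Keep the lowest-cost sequences as elites.
        candidates.sort(key=lambda seq: rollout_cost(state, seq, goal))
        top = candidates[:elites]
        # Refit the sampling distribution to the elites.
        mean = [statistics.fmean(column) for column in zip(*top)]
        std = [statistics.pstdev(column) + 1e-6 for column in zip(*top)]
    return mean


plan = cem_plan(state=0.0, goal=3.0)
final_cost = rollout_cost(0.0, plan, 3.0)
print(f"final cost: {final_cost:.4f}")
```

The planner only ever queries the model through `rollout_cost`, which is why prediction error in the model translates directly into planning failures.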

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research. It claims to deliver three components: (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets; (2) clean, well-tested implementations of modern world model baselines and planning solvers; and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation. By unifying the full pipeline under a single scalable framework, the paper asserts that swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

Significance. If the delivered components prove jointly sufficient for end-to-end reproducible experiments and fair comparisons with minimal custom engineering, the platform could meaningfully address fragmentation in world-model research by enabling systematic in-silico evaluation of dynamics understanding, control, representation quality, and out-of-distribution generalization. The provision of factorized environments and baseline implementations is a constructive contribution toward standardized benchmarks. However, the manuscript supplies no usage traces, ablation studies, or overhead measurements, so the claimed significance remains prospective rather than demonstrated.

major comments (2)
  1. Abstract: the central claim that unifying the pipeline under swm 'dramatically reduces research overhead' is unsupported; the text describes the three components but provides neither timing measurements relative to prior fragmented codebases nor ablation results quantifying remaining custom engineering required by users.
  2. Abstract: the assertions of 'high-performance' Lance data layer, 'clean, well-tested' baselines, and 'broad suite' of extended environments are presented without any implementation details, performance benchmarks, validation results, or concrete usage examples that would substantiate sufficiency for zero-custom-engineering systematic evaluation.
minor comments (2)
  1. Consider adding explicit quick-start code snippets or a minimal reproducible experiment trace in the main text or supplementary material to illustrate end-to-end usage of the Lance layer, a baseline, and a factorized environment.
  2. Clarify the exact scope of 'controllable factors of variation' (visual, geometric, physical) with a table listing which factors are exposed per environment.
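
The timing measurements requested in major comment 1 could start from a harness as simple as the one sketched below. It is a generic throughput timer, not something taken from the paper: `batches_per_second` and `toy_loader` are hypothetical names, and the toy loader merely stands in for whichever MP4, HDF5, or Lance loader is being compared.

```python
import time
from typing import Callable, Iterable

_DONE = object()  # sentinel marking an exhausted loader


def batches_per_second(
    make_loader: Callable[[], Iterable],
    n_batches: int = 100,
    warmup: int = 5,
) -> float:
    """Measure loader throughput, excluding a few warm-up batches."""
    iterator = iter(make_loader())
    for _ in range(warmup):
        next(iterator, _DONE)  # warm caches; ignore early exhaustion
    served = 0
    start = time.perf_counter()
    for _ in range(n_batches):
        if next(iterator, _DONE) is _DONE:
            break  # loader ran out before n_batches
        served += 1
    elapsed = time.perf_counter() - start
    return served / elapsed if elapsed > 0 else float("inf")


def toy_loader():
    """Hypothetical stand-in for a real dataset loader."""
    return ([0] * 1024 for _ in range(1_000))


rate = batches_per_second(toy_loader)
print(f"{rate:.0f} batches/s")
```

Passing a loader factory rather than a loader means each format under test starts from a fresh iterator, so one run cannot benefit from state left behind by another.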

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract claims would benefit from additional substantiation and will revise the manuscript to include more details, benchmarks, and examples as outlined below.

read point-by-point responses
  1. Referee: Abstract: the central claim that unifying the pipeline under swm 'dramatically reduces research overhead' is unsupported; the text describes the three components but provides neither timing measurements relative to prior fragmented codebases nor ablation results quantifying remaining custom engineering required by users.

    Authors: We acknowledge that the abstract asserts a reduction in overhead without direct quantitative comparisons in the current text. The manuscript's contribution centers on the integrated design of the data layer, baselines, and environments, which by construction eliminates the need for users to assemble disparate codebases. We will add a new subsection with preliminary timing measurements for data loading and setup effort relative to common prior practices, plus concrete usage traces showing the engineering steps required for a standard experiment. revision: yes

  2. Referee: Abstract: the assertions of 'high-performance' Lance data layer, 'clean, well-tested' baselines, and 'broad suite' of extended environments are presented without any implementation details, performance benchmarks, validation results, or concrete usage examples that would substantiate sufficiency for zero-custom-engineering systematic evaluation.

    Authors: We agree that the abstract would be strengthened by explicit support for these descriptors. The full manuscript already contains implementation descriptions of the Lance integration, baseline code structure, and environment factorizations, but we will expand the revised version with performance numbers for the data layer, test coverage statistics for the baselines, and step-by-step usage examples that illustrate end-to-end evaluation with controllable factors of variation. revision: yes

Circularity Check

0 steps flagged

No circularity: software platform paper with no derivations or fitted quantities

full rationale

The manuscript presents an open-source platform (data layer, baselines, extended environments) rather than any derivation chain, equations, or statistical predictions. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Claims about reduced research overhead are descriptive assertions about the delivered components, not results derived from the paper's own inputs by construction. This is a standard non-finding for infrastructure papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software platform contribution; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5772 in / 1027 out tokens · 37363 ms · 2026-05-22T08:53:53.353985+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 24 internal anchors

  1. [1]

    Use of linear programming methods for synthesizing sampled-data automatic systems

    AI Propoi. Use of linear programming methods for synthesizing sampled-data automatic systems. Automn. Remote Control, 24(7):837–844, 1963

  2. [2]

    Industrial applications of model based predictive control

    Jacques Richalet. Industrial applications of model based predictive control. Automatica, 29(5):1251–1274, 1993

  3. [3]

    Model predictive control

    Basil Kouvaritakis and Mark Cannon. Model predictive control. Switzerland: Springer International Publishing, 38(13-56):7, 2016

  4. [4]

    Model predictive control: theory, computation, and design

    James B Rawlings, David Q Mayne, and Moritz M Diehl. Model predictive control: theory, computation, and design. 2020

  5. [5]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  6. [6]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning, pages 2555–2565. PMLR, 2019

  7. [7]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  8. [8]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023

  9. [9]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023

  10. [10]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  11. [11]

    World models can leverage human videos for dexterous manipulation

    Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation. arXiv preprint arXiv:2512.13644, 2025

  12. [12]

    The mythical man-month: essays on software engineering

    Frederick P Brooks Jr. The mythical man-month: essays on software engineering. Pearson Education, 1995

  13. [13]

    A step toward quantifying independently reproducible machine learning research

    Edward Raff. A step toward quantifying independently reproducible machine learning research. Advances in Neural Information Processing Systems, 32, 2019

  14. [14]

    Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

    Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017

  15. [15]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  16. [16]

    Gymnasium: A Standard Interface for Reinforcement Learning Environments

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024

  17. [17]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025

  18. [18]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  19. [19]

    Learning from reward-free offline data: A case for planning with latent dynamics models

    Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim GJ Rudner, and Yann LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models. arXiv preprint arXiv:2502.14819, 2025

  20. [20]

    LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

    Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025

  21. [21]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

  22. [22]

    DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024

  23. [23]

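    The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning

    Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004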

  24. [24]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  25. [25]

    From kepler to newton: Inductive biases guide learned world models in transformers

    Ziming Liu, Sophia Sanborn, Surya Ganguli, and Andreas Tolias. From kepler to newton: Inductive biases guide learned world models in transformers, 2026. URL https://arxiv.org/abs/2602.06923

  26. [26]

    Lance: Efficient random access in columnar storage through adaptive structural encodings

    Weston Pace, Chang She, Lei Xu, Will Jones, Albert Lockett, Jun Wang, and Raunak Shah. Lance: Efficient random access in columnar storage through adaptive structural encodings. URL https://arxiv.org/abs/2504.15247

  28. [28]

    Lerobot: An open-source library for end-to-end robot learning

    Remi Cadene, Simon Alibert, Francesco Capuano, Michel Aractingi, Adil Zouitine, Pepijn Kooijmans, Jade Choghari, Martino Russi, Caroline Pascal, Steven Palma, et al. Lerobot: An open-source library for end-to-end robot learning. In The Fourteenth International Conference on Learning Representations, 2026

  29. [29]

    Predictive sampling: Real-time behaviour synthesis with mujoco

    Taylor Howell, Nimrod Gileadi, Saran Tunyasuvunakool, Kevin Zakka, Tom Erez, and Yuval Tassa. Predictive sampling: Real-time behaviour synthesis with mujoco. 2022

  30. [30]

    Sample-efficient cross-entropy method for real-time planning

    Cristina Pinneri, Shambhuraj Sawant, Sebastian Blaes, Jan Achterhold, Joerg Stueckler, Michal Rolinek, and Georg Martius. Sample-efficient cross-entropy method for real-time planning. In Conference on Robot Learning, pages 1049–1065. PMLR, 2021

  31. [31]

    Aggressive driving with model predictive path integral control

    Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE international conference on robotics and automation (ICRA), pages 1433–1440. IEEE, 2016

  32. [32]

    Model-Based Planning with Discrete and Continuous Actions

    Mikael Henaff, William F Whitney, and Yann LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177, 2017

  33. [33]

    Parallel stochastic gradient-based planning for world models

    Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar. Parallel stochastic gradient-based planning for world models. arXiv preprint arXiv:2602.00475, 2026

  34. [35]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021

  35. [36]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  36. [37]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  37. [38]

    Neuronlike adaptive elements that can solve difficult learning control problems

    Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, (5):834–846, 2012

  38. [39]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  39. [40]

    OGBench: Benchmarking offline goal-conditioned RL

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. In The Thirteenth International Conference on Learning Representations, 2025

  40. [41]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013

  41. [42]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  42. [43]

    Craftax: A lightning-fast benchmark for open-ended reinforcement learning

    Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 35104–35137, 2024. URL https://arxiv.org/abs/2402.16801

  43. [44]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  44. [45]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  45. [46]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

  46. [47]

    Learning and leveraging world models in visual representation learning

    Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning and leveraging world models in visual representation learning. arXiv preprint arXiv:2403.00504, 2024

  47. [48]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

  48. [49]

    WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models. arXiv preprint arXiv:2604.21686, 2026

  49. [50]

    Benchmarking World-Model Learning with Environment-Level Queries

    Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B Tenenbaum, Sebastian Vollmer, Kevin Ellis, et al. Benchmarking world-model learning. arXiv preprint arXiv:2510.19788, 2025

  50. [51]

    A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

    Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, et al. A lightweight library for energy-based joint-embedding predictive architectures. arXiv preprint arXiv:2602.03604, 2026

  51. [52]

    Mastering atari, go, chess and shogi by planning with a learned model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

  52. [53]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

  53. [54]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  54. [55]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  55. [56]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  56. [57]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  57. [58]

    Natural Environment Benchmarks for Reinforcement Learning

    Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018

  58. [59]

    The distracting control suite–a challenging benchmark for reinforcement learning from pixels

    Austin Stone, Oscar Ramirez, Kurt Konolige, and Rico Jonschkowski. The distracting control suite–a challenging benchmark for reinforcement learning from pixels. arXiv preprint arXiv:2101.02722, 2021

  59. [60]

    Stabilizing deep q-learning with convnets and vision transformers under data augmentation

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 34:3680–3693, 2021

  60. [61]

    Dmc-vb: A benchmark for representation learning for control with visual distractors

    Joseph Ortiz, Antoine Dedieu, Wolfgang Lehrach, J Swaroop Guntupalli, Carter Wendelken, Ahmad Humayun, Sivaramakrishnan Swaminathan, Guangyao Zhou, Miguel Lázaro-Gredilla, and Kevin P Murphy. Dmc-vb: A benchmark for representation learning for control with visual distractors. Advances in Neural Information Processing Systems, 37:6574–6602, 2024

  61. [62]

    Assessing adaptive world models in machines with novel games

    Lance Ying, Katherine M Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J Gershman, Jacob D Andreas, et al. Assessing adaptive world models in machines with novel games. arXiv preprint arXiv:2507.12821, 2025

  62. [63]

    Stable-Baselines3: Reliable reinforcement learning implementations

    Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021

  63. [64]

    CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms

    Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022

  64. [65]

    MBRL-Lib: A modular library for model-based reinforcement learning

    Luis Pineda, Brandon Amos, Amy Zhang, Nathan O Lambert, and Roberto Calandra. MBRL-Lib: A modular library for model-based reinforcement learning. arXiv preprint arXiv:2104.10159, 2021

  65. [66]

    RoboHive: A unified framework for robot learning

    Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Abhishek Gupta, and Aravind Rajeswaran. RoboHive: A unified framework for robot learning. Advances in Neural Information Processing Systems, 36:44323–44340, 2023

  66. [67]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Kevin Lin, Abhiram Maddukuri, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

  67. [68]

    RLBench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  68. [69]

    Learning to reach goals via iterated supervised learning

    Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. arXiv preprint arXiv:1912.06088, 2019

  69. [70]

    VICReg: Variance-invariance-covariance regularization for self-supervised learning

    Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. 2021

  70. [71]

    Efficient projections onto the l1-ball for learning in high dimensions

    John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 272–279, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605582054. doi: 10.1145/1390156.1390191. URL https...

  71. [72]

    Hydra - a framework for elegantly configuring complex applications

    Omry Yadan. Hydra - a framework for elegantly configuring complex applications. GitHub, 2019

  72. [73]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Appendix

Our Appendix complements the main paper with a walkthrough of the stable-worldmodel pla...