arxiv: 2502.11537 · v4 · submitted 2025-02-17 · 💻 cs.LG · cs.AI

Simulus: Combining Improvements in Sample-Efficient World Model Agents

Pith reviewed 2026-05-23 02:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords world models · sample-efficient reinforcement learning · intrinsic motivation · prioritized replay · tokenization · regression as classification · Atari · DMC

The pith

Simulus shows that four separate improvements to world-model agents combine without conflict to reach state-of-the-art sample efficiency among planning-free methods on visual, continuous-control, and symbolic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the Rainbow-style principle of combining known improvements applies to world-model agents. It builds Simulus as a single modular agent that adds four components: flexible tokenization of observations and actions, intrinsic motivation driven by epistemic uncertainty, prioritized replay of world-model experience, and regression treated as classification for rewards and returns. On the Atari 100K, DMC Proprioception 500K, and Craftax-1M benchmarks, Simulus reaches state-of-the-art sample efficiency among planning-free world-model methods. Ablations on Atari 100K and DMC Proprioception 500K (Figure 6) indicate that each addition improves results and that the four together produce larger gains than any subset.

Core claim

Simulus integrates a flexible tokenization framework, intrinsic motivation for epistemic uncertainty reduction, prioritized world-model replay, and regression-as-classification for reward and return prediction; the resulting agent achieves state-of-the-art sample efficiency for planning-free world models on visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M while each component contributes individually and their combination produces synergistic gains.
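
To make the regression-as-classification component concrete, here is a minimal sketch of one common variant: a scalar reward or return is encoded as a "two-hot" distribution over fixed bins, the model is trained with cross-entropy, and the prediction is read back as the expected bin center. The summary above does not specify Simulus's binning or target encoding, so the uniform bins, the two-hot choice, and all function names here are illustrative assumptions rather than the paper's implementation.

```python
import math

def make_bins(low, high, num_bins):
    """Evenly spaced bin centers on [low, high] (an illustrative assumption)."""
    step = (high - low) / (num_bins - 1)
    return [low + i * step for i in range(num_bins)]

def two_hot(value, bins):
    """Encode a scalar as weights on its two neighboring bin centers."""
    value = min(max(value, bins[0]), bins[-1])  # clip into the supported range
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)  # closer to hi -> more weight on hi
            target[i] = 1.0 - w_hi
            target[i + 1] = w_hi
            break
    return target

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Classification loss that replaces the usual squared-error loss."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

def decode(logits, bins):
    """Recover a scalar prediction as the expected bin center."""
    return sum(p * b for p, b in zip(softmax(logits), bins))

bins = make_bins(-2.0, 2.0, 5)   # centers: -2, -1, 0, 1, 2
target = two_hot(0.25, bins)     # weight 0.75 on bin 0.0 and 0.25 on bin 1.0
```

The expected value of the two-hot target equals the original scalar whenever it lies inside the bin range, so no information is lost by switching from regression to classification.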

What carries the argument

Simulus, a modular token-based world-model agent that supports arbitrary observation and action modalities and adds the four listed improvements on top of a shared base learner.

If this is right

  • Each of the four components improves performance when added alone.
  • The combination of all four yields larger gains than any subset.
  • Intrinsic motivation continues to help even when total environment steps are severely limited.
  • A single token-based architecture can accommodate visual, proprioceptive, and symbolic inputs without task-specific redesign.
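
As a rough illustration of intrinsic motivation driven by epistemic uncertainty, the sketch below adds a bonus proportional to the disagreement among an ensemble of predictors. The summary does not say how Simulus estimates uncertainty, so the ensemble-variance estimator, the `beta` weight, and the function names are assumptions made for this example only.

```python
from statistics import pvariance

def disagreement(ensemble_predictions):
    """Mean per-dimension variance across the ensemble members' predictions."""
    dims = len(ensemble_predictions[0])
    per_dim = [
        pvariance([pred[d] for pred in ensemble_predictions])
        for d in range(dims)
    ]
    return sum(per_dim) / dims

def total_reward(extrinsic, ensemble_predictions, beta=0.1):
    """Task reward plus a scaled uncertainty bonus (beta is a hypothetical weight)."""
    return extrinsic + beta * disagreement(ensemble_predictions)

# Three hypothetical ensemble members predicting a 2-dimensional next state.
familiar = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]  # members agree: no bonus
novel = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]     # members disagree: bonus
```

When the members agree, the bonus vanishes and the agent optimizes the task reward alone, which is why such a bonus should fade as the world model becomes more accurate.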

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Modular token interfaces may make it easier to test future improvements without rewriting the entire agent stack.
  • The success on three very different domains suggests the same four additions could be tried on planning-based world-model agents.
  • If prioritized replay of model rollouts remains useful, similar prioritization could be applied to other internal buffers such as value or policy targets.
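
The prioritization idea in the last bullet can be sketched with standard proportional sampling, where an item with priority p is drawn with probability proportional to p raised to a power alpha. This is a generic sketch: the priority signal (here a hypothetical per-segment world-model loss), the value of `alpha`, and the function names are assumptions, not details taken from the paper.

```python
import random

def sampling_probs(priorities, alpha=0.6):
    """Proportional prioritization: P(i) = p_i**alpha / sum_j p_j**alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6, rng=None):
    """Draw buffer indices with replacement according to their priorities."""
    rng = rng or random.Random(0)  # fixed seed keeps the example reproducible
    probs = sampling_probs(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)

# Hypothetical per-segment losses used as priorities.
losses = [0.1, 0.1, 2.0, 0.5]
probs = sampling_probs(losses)
batch = sample_indices(losses, batch_size=8)
```

Setting `alpha=0.0` recovers uniform sampling, so the same code path covers both the prioritized and the unprioritized case in an ablation.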

Load-bearing premise

That the four components complement each other without significant negative interactions and that intrinsic motivation remains beneficial even under the tight interaction budgets of sample-efficient RL.

What would settle it

An ablation on any of the three benchmarks in which the full Simulus agent underperforms a version that omits one or more of the four components.

Figures

Figures reproduced from arXiv: 2502.11537 by Bingyi Kang, Kaixin Wang, Lior Cohen, Shie Mannor, Uri Gadot.

Figure 1: Results overview. Simulus exhibits state-of-the-art sample-efficiency performance for …
Figure 2: An illustration of the independent processing of modalities for an observation with two modalities. Sequence Modeling: Given a sequence of observation-action blocks $X = X_1, \ldots, X_t$, the matching outputs $Y_1, \ldots, Y_t$ are computed auto-regressively as follows: $(S_t, Y_t) = f_\theta(S_{t-1}, X_t)$, where $S_t$ is a recurrent state that summarizes $X_{\le t}$ and $S_0 = 0$. However, the output $Y^u_{t+1}$, from which $\hat{z}_{t+1}$ is predicted…
Figure 3: World model training and imagination. To maintain visual clarity, we omitted token …
Figure 4: Results on the DeepMind Control Suite 500K Proprioception (top) and Atari 100K (bottom) …
Figure 5: Craftax-1M training curves with mean and 95% confidence intervals.
Figure 6: Ablations results on the Atari-100K and DeepMind Control Proprioception 500K bench…
Figure 7: Performance profile. For each human-normalized score value on the x-axis, the curve …
Figure 8: Reconstruction ($L_{pred}$) and dynamics ($L_{dyn}$) losses of PWM and PWM-decoupled on four Atari games (single seed). The first column uses a log-scaled y-axis. Decoupling the optimization objectives consistently reduces reconstruction loss while increasing dynamics loss, suggesting interference between the two objectives.
Figure 9: Agent episodic returns throughout training of PWM and PWM-decoupled on four Atari …
Figure 10: Ground truth (top) and reconstructed (bottom) frames from a training episode of PWM …
Figure 11: Ground truth (top) and reconstructed (bottom) frames from a training episode of PWM …
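
The recurrence quoted alongside Figure 2, in which each step maps the previous state and the current block to a new state and an output, starting from a zero state, can be sketched as a plain loop. The transition function `toy_f` below is a stand-in invented for this example; the paper's actual sequence model is a learned network and is not reproduced here.

```python
def rollout(f, blocks, initial_state=0.0):
    """Apply (S_t, Y_t) = f(S_{t-1}, X_t) over a sequence, starting from S_0 = 0."""
    state = initial_state
    outputs = []
    for x_t in blocks:
        state, y_t = f(state, x_t)  # the state summarizes every block seen so far
        outputs.append(y_t)
    return state, outputs

def toy_f(prev_state, x_t):
    """Hypothetical transition: a decayed running sum with a linear readout."""
    new_state = 0.5 * prev_state + x_t
    return new_state, 2.0 * new_state

final_state, ys = rollout(toy_f, [1.0, 2.0, 3.0])
```

Because each output depends only on the previous state and the current block, generation can continue one block at a time without reprocessing the whole history.
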
Original abstract

World models (WMs) represent the frontier of sample-efficient reinforcement learning, but their complexity leaves many promising improvements unrealized due to the significant expertise and effort required to identify and integrate them. Inspired by Rainbow, which showed that individually known improvements to DQN complement each other and can be effectively combined, we take on this challenge and ask whether the same principle applies to world model agents. We introduce Simulus, a modular token-based WM agent that integrates: (1) a flexible tokenization framework supporting arbitrary combinations of observation and action modalities; (2) intrinsic motivation for epistemic uncertainty reduction; (3) prioritized world model replay; and (4) regression-as-classification for reward and return prediction. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks: visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M. Notably, intrinsic motivation proves beneficial even under the tight interaction budgets of sample-efficient RL, despite the risk of wasting scarce interactions on task-irrelevant experience. Ablation studies reveal that each component contributes individually, and their combination yields synergistic gains. Our code and model weights are publicly available at https://github.com/leor-c/Simulus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces Simulus, a modular token-based world model agent integrating four components: flexible tokenization supporting arbitrary observation/action modalities, intrinsic motivation for epistemic uncertainty reduction, prioritized world model replay, and regression-as-classification for reward/return prediction. It claims state-of-the-art sample efficiency for planning-free world models on visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M benchmarks. Ablation studies indicate each component contributes positively on its own with synergistic gains from the full combination; intrinsic motivation remains beneficial at the tight 100K/500K/1M interaction budgets. Public code and model weights are released.

Significance. If the results hold, the work shows that the Rainbow-style combination of complementary improvements can be successfully applied to world model agents, potentially reducing the expertise barrier for building sample-efficient RL systems. The public code and weights directly support reproducibility of the SOTA claims across three diverse benchmarks and address concerns about experimental details.

minor comments (2)
  1. [Abstract] The abstract claims SOTA results but does not name the specific metrics (e.g., mean return, human-normalized score) or list the exact baselines against which superiority is measured.
  2. The manuscript would benefit from explicit reporting of the number of random seeds, confidence intervals, and any statistical tests used to support the ablation and benchmark comparisons.
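
For the second comment, one simple way to report uncertainty across random seeds is a percentile bootstrap confidence interval over per-seed scores. The sketch below is a generic illustration, not the paper's evaluation protocol; the scores, the number of resamples, and the function name are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(scores, num_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-seed scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the seeds with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(num_resamples))
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * num_resamples)]
    upper = means[min(num_resamples - 1, int((1.0 - tail) * num_resamples))]
    return lower, upper

seed_scores = [0.92, 1.05, 0.88, 1.10, 0.97]  # hypothetical per-seed scores
low, high = bootstrap_ci(seed_scores)
```

With only a handful of seeds the interval is wide, which is the motivation for reporting the seed count next to any headline number.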

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of Simulus, the recognition of its modular design and reproducibility contributions, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical study that combines four modular improvements to world-model agents and validates them via ablation experiments on three standard benchmarks (Atari 100K, DMC 500K, Craftax-1M). All performance claims rest on reported interaction counts, reward curves, and ablation tables rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. Public code and weights are supplied, making the results externally reproducible against the same benchmarks without reliance on self-citation chains or self-definitional steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based on the abstract, the paper builds on existing world model techniques and standard RL practices without introducing new free parameters beyond typical hyperparameter tuning or new entities. The central claim rests on the assumption that the chosen benchmarks appropriately test sample efficiency and that the components integrate synergistically.

free parameters (1)
  • Various agent and training hyperparameters
    Standard in deep RL implementations; specific values are tuned for each benchmark but not detailed in the abstract.
axioms (1)
  • domain assumption: Environments follow the standard Markov decision process formulation used in RL
    The paper operates within the conventional RL framework for world models and planning-free agents.

pith-pipeline@v0.9.0 · 5761 in / 1368 out tokens · 41189 ms · 2026-05-23T02:25:59.614188+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Deep reinforcement learning at the edge of the statistical precipice

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Belle- mare. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc....

  3. [3]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. arXiv preprint arXiv:2405.12399, 2024

  4. [4]

    Agent57: Outperforming the Atari human benchmark

    Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvit- skyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, p...

  5. [5]

    Never give up: Learning directed exploration strategies

    Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies. In International Con- ference on Learning Representations, 2020. URL https://openreview.net/forum?id= Sye57xStvB

  6. [6]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  8. [8]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

  10. [10]

    Exploration by random network distillation

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1lJJnR5Ym

  11. [11]

    Improving token-based world models with parallel observation prediction

    Lior Cohen, Kaixin Wang, Bingyi Kang, and Shie Mannor. Improving token-based world models with parallel observation prediction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=Lfp5Dk1xb6

  12. [12]

    Oasis: A universe in a transformer, 2024

    Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer, 2024. URL https://oasis-model.github.io/. 10

  13. [13]

    Improving transformer world models for data-efficient rl, 2025

    Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, and Kevin Patrick Murphy. Improving transformer world models for data-efficient rl, 2025. URL https://arxiv.org/abs/2502.01591

  14. [14]

    Genie 2: A large-scale foundation world model, 2024

    Google DeepMind. Genie 2: A large-scale foundation world model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation- world-model/

  15. [15]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  16. [16]

    Stop regressing: Training value functions via classification for scalable deep RL

    Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scalable deep RL. In Forty-first International Conference on Machine Learning, 2024. URL https://openr...

  17. [17]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper/7512-recurrent-world-models- facilitate-policy-evolution. https://worldmodels.github.io

  18. [18]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representa- tions, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS

  19. [19]

    Lillicrap, Mohammad Norouzi, and Jimmy Ba

    Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021. URL https: //openreview.net/forum?id=0oabwyZbOu

  20. [20]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  21. [21]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Oxh5CstDJU

  22. [22]

    Provably efficient maximum entropy exploration

    Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2681–2691. PMLR, 09–15 Jun 2019. URL https://proceedings. mlr.p...

  23. [23]

    Exploration via ellip- tical episodic bonuses

    Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via ellip- tical episodic bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems , 2022. URL https: //openreview.net/forum?id=Xg-yZos9qJQ

  24. [24]

    Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017

    Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017. URL https://openreview.net/forum?id=Bk0MRI5lg

  25. [25]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  26. [26]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997

  27. [27]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016. 11

  28. [28]

    Model based reinforcement learning for atari

    Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Bla˙zej Osi´nski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Conference on Learning Representations , 2020. URL https: /...

  29. [29]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 13–18 Jul 2...

  30. [30]

    Curious replay for model-based adaptation

    Isaac Kauvar, Chris Doyle, Linqi Zhou, and Nick Haber. Curious replay for model-based adaptation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  31. [31]

    Simple and scal- able predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scal- able predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https:/...

  32. [32]

    Autoencoding beyond pixels using a learned similarity metric

    Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1558–1566, New York, N...

  33. [33]

    UNIFIED-IO: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. UNIFIED-IO: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=E01k9048soZ

  34. [34]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26439–26455, June 2024

  35. [35]

    Craftax: A lightning-fast benchmark for open-ended reinforcement learning

    Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In International Conference on Machine Learning (ICML), 2024

  36. [36]

    Dis- covering and achieving goals via world models

    Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Dis- covering and achieving goals via world models. Advances in Neural Information Processing Systems, 34:24379–24391, 2021

  37. [37]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/ pdf?id=vhFu1Acb0xb

  38. [38]

    Efficient world models with context-aware tokenization

    Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient world models with context-aware tokenization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum? id=BiWIERWBFX

  39. [39]

    Human-level control through deep reinforcement learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015. 12

  40. [40]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

  41. [41]

    Efros, and Trevor Darrell

    Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven ex- ploration by self-supervised prediction. In Doina Precup and Yee Whye Teh, editors, Pro- ceedings of the 34th International Conference on Machine Learning , volume 70 of Pro- ceedings of Machine Learning Research, pages 2778–2787. PMLR, 06–11 Aug 2017. URL https://pro...

  42. [42]

    Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Searching for activation functions, 2018. URL https://openreview.net/forum?id=SkBYYyZRZ

  43. [43]

    A generalist agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on ...

  44. [44]

    Transformer-based world models are happy with 100k interactions

    Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023

  45. [45]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

  46. [46]

    doi: 10.1109/TAMD.2010.2056368

    Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development , 2, 2010. ISSN 19430604. doi: 10.1109/TAMD.2010.2056368

  47. [47]

    A generalist dynamics model for control

    Ingmar Schubert, Jingwei Zhang, Jake Bruce, Sarah Bechtle, Emilio Parisotto, Martin Ried- miller, Jost Tobias Springenberg, Arunkumar Byravan, Leonard Hasenclever, and Nicolas Heess. A generalist dynamics model for control. arXiv preprint arXiv:2305.10912, 2023

  48. [48]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  49. [49]

    Planning to explore via self-supervised world models

    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8583–8592. PMLR, 13–18 Jul 2020...

  50. [50]

    Model-based active exploration

    Pranav Shyam, Wojciech Ja´skowski, and Faustino Gomez. Model-based active exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5779–5788. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ shyam19a.html

  51. [51]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  52. [52]

    Policy gradi- ent methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradi- ent methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems , volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/ file/464d828b85b0b...

  53. [53]

    2020 , issn =

    Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https:// doi.org/10.1016/j.simpa.2020.100022. URL https://www.sciencedirect.com/science/ article/...

  54. [54]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. CoRR, abs/2408.14837, 2024. URL https://doi.org/10.48550/ arXiv.2408.14837

  55. [55]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural discrete representation learning. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/ 2017/...

  56. [56]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- itors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. UR...

  57. [57]

    Efficientzero v2: Mas- tering discrete and continuous control with limited data

    Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, and Yang Gao. Efficientzero v2: Mas- tering discrete and continuous control with limited data. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=LHGMXcr6zx

  58. [58]

    Parallelizing model-based reinforcement learning over the sequence length

    ZiRui Wang, Yue DENG, Junfeng Long, and Yin Zhang. Parallelizing model-based reinforcement learning over the sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=R6N9AGyz13

  59. [59]

    ivideoGPT: Interactive videoGPTs are scalable world models

    Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye HAO, and Mingsheng Long. ivideoGPT: Interactive videoGPTs are scalable world models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4TENzBftZR

  60. [60]

    Storm: Efficient stochastic transformer based world models for reinforcement learning

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.

    A Models and Hyperparameters

    A.1 Hyperparameters

    We detail shared hyperparameters in Table 1, training hyperparameters in Table 2, world model h...

  61. [61]

    Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: We provide extensive empirical evidence in Section 3, including ablation studies, which directly relate to our contributions and claims. The scope of our paper is sample-efficient, planning-free wor...

  62. [62]

    Limitations

    Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 5 explicitly discusses the limitations of our work. Additional limitations are discussed in Section 3 (e.g., the absence of ablations on Craftax due to computational limitations). Guidelines: • The answer NA means that the ...

  63. [63]

    Theory assumptions and proofs

    Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: Our paper does not include theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and proo...

  64. [64]

    Experimental result reproducibility

    Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: In Section 2 and in the ap...

  65. [65]

    Our code has a detailed readme file for easy usage, and we also provide Docker support, which enables an easy environment setup and enhances reproducibility on any operating system

    Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: In our abstract and appendix we provide a link to the code and trained model weights. Our code has a detail...

  66. [66]

    Experimental setting/details

    Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify all experimental details in Section 3 and in Appendix A and C. Guidelines: • The answer NA means t...

  67. [67]

    Figure 5 also includes error bars

    Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We utilize the rliable toolkit [2] to generate plots with appropriate error bars (Figure 4 bottom, Figure 6). Figure 5 also includ...

  68. [68]

    Experiments compute resources

    Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide this information in Appendix C. Guidelines: • The answer NA means that the paper does not in...

  69. [69]

    Code of ethics

    Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: Our work follows the NeurIPS Code of Ethics. No human subjects or participants were involved. We found no special concerns beyond those related to the genera...

  70. [70]

    Broader impacts

    Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This paper presents a foundational work in the field of Machine Learning. As such, there are no direct positive or negative societal impacts. Guidelines: • The answer NA means that th...

  71. [71]

    Hence, we do not introduce additional safeguards

    Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our work does not pose any additional risks beyond those of common deep reinforcement learning ...

  72. [72]

    We follow the licenses of all assets used in our work

    Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Our paper cites all relevant assets, and our open-sourced repository includes a credits section ...

  73. [73]

    New assets

    New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Our open-sourced repository includes all new assets and is well documented. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the detai...

  74. [74]

    Crowdsourcing and research with human subjects

    Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research...

  75. [75]

    Institutional review board (IRB) approvals or equivalent for research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

  76. [76]

    Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components

    Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...