Simulus: Combining Improvements in Sample-Efficient World Model Agents

Bingyi Kang; Kaixin Wang; Lior Cohen; Shie Mannor; Uri Gadot

arxiv: 2502.11537 · v4 · submitted 2025-02-17 · 💻 cs.LG · cs.AI

Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor

Pith reviewed 2026-05-23 02:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: world models · sample-efficient reinforcement learning · intrinsic motivation · prioritized replay · tokenization · regression as classification · Atari · DMC

The pith

Simulus shows that four separate improvements to world-model agents combine without conflict to set new sample-efficiency records on visual, continuous, and symbolic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the Rainbow-style principle of combining known improvements applies to world-model agents. It builds Simulus as a single modular agent that adds flexible tokenization of observations and actions, intrinsic motivation driven by epistemic uncertainty, prioritized replay of world-model experience, and regression treated as classification for rewards and returns. On the Atari 100K, DMC Proprioception 500K, and Craftax-1M benchmarks Simulus reaches state-of-the-art sample efficiency among planning-free world-model methods. Ablations confirm each addition improves results and that the four together produce larger gains than any subset.

Core claim

Simulus integrates a flexible tokenization framework, intrinsic motivation for epistemic uncertainty reduction, prioritized world-model replay, and regression-as-classification for reward and return prediction; the resulting agent achieves state-of-the-art sample efficiency for planning-free world models on visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M while each component contributes individually and their combination produces synergistic gains.
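To make "regression-as-classification" concrete, here is a minimal sketch of one common variant, two-hot encoding: a scalar reward or return is spread over the two nearest bins and the predictor is trained with cross-entropy instead of a squared error. The bin layout and the helper names (`two_hot`, `decode`, `cross_entropy`) are assumptions for illustration; the summary above does not say which encoding Simulus uses.

```python
import math


def two_hot(value, bins):
    """Encode a scalar as weights on the two bins that bracket it (assumed scheme)."""
    # Clamp so that out-of-range targets fall on an edge bin.
    v = min(max(value, bins[0]), bins[-1])
    probs = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= v <= hi:
            w_hi = (v - lo) / (hi - lo)  # share of mass on the upper bin
            probs[i] = 1.0 - w_hi
            probs[i + 1] = w_hi
            break
    return probs


def decode(probs, bins):
    """Recover a scalar as the expectation over bin centers."""
    return sum(p * b for p, b in zip(probs, bins))


def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target_probs, logits))


# Hypothetical bin centers; a real agent would pick these per benchmark.
bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = two_hot(0.25, bins)
loss = cross_entropy([0.0] * len(bins), target)
```

The encoding is lossless inside the bin range (decoding the target returns the original scalar), which is why the classification loss can stand in for regression.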

What carries the argument

Simulus, a modular token-based world-model agent that supports arbitrary observation and action modalities and adds the four listed improvements on top of a shared base learner.
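As a rough illustration of what a modular token interface could look like, the sketch below routes each observation modality through its own tokenizer and returns per-modality token lists. Everything here is hypothetical: the function names, the uniform binning of continuous values, and the modality names are assumptions, not the Simulus API.

```python
def tokenize_continuous(vector, low=-1.0, high=1.0, n_bins=8):
    """Quantize each continuous feature into one of n_bins uniform buckets."""
    tokens = []
    for v in vector:
        clipped = min(max(v, low), high)
        idx = int((clipped - low) / (high - low) * n_bins)
        tokens.append(min(idx, n_bins - 1))  # the upper edge maps to the last bin
    return tokens


def tokenize_observation(observation, tokenizers):
    """Apply a separate tokenizer to each named modality of an observation."""
    return {name: tokenizers[name](value) for name, value in observation.items()}


# Hypothetical observation with one continuous and one categorical modality.
observation = {"joints": [-1.0, 0.0, 1.0], "inventory": [3, 1]}
tokenizers = {"joints": tokenize_continuous, "inventory": list}
tokens = tokenize_observation(observation, tokenizers)
```

Supporting a new input type in this layout means registering one more tokenizer, which is the kind of change a modular design is meant to keep small.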

If this is right

Each of the four components improves performance when added alone.
The combination of all four yields larger gains than any subset.
Intrinsic motivation continues to help even when total environment steps are severely limited.
A single token-based architecture can accommodate visual, proprioceptive, and symbolic inputs without task-specific redesign.
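One widely used proxy for epistemic uncertainty is disagreement among an ensemble of predictors: where the members agree, the model has little left to learn; where they diverge, an exploration bonus is paid. The sketch below shows that idea under stated assumptions. The ensemble, the variance-based bonus, and the weight `beta` are illustrative choices, and this summary does not confirm that Simulus computes its bonus this way.

```python
from statistics import pvariance


def disagreement_bonus(ensemble_predictions):
    """Mean per-dimension variance across ensemble members (assumed estimator)."""
    per_dimension = zip(*ensemble_predictions)
    variances = [pvariance(values) for values in per_dimension]
    return sum(variances) / len(variances)


def total_reward(extrinsic, ensemble_predictions, beta=0.1):
    """Extrinsic reward plus a beta-weighted intrinsic bonus (beta is hypothetical)."""
    return extrinsic + beta * disagreement_bonus(ensemble_predictions)


# Hypothetical predictions from two ensemble members for one transition.
agreeing = [[1.0, 2.0], [1.0, 2.0]]
disagreeing = [[0.0, 0.0], [2.0, 2.0]]
```

With a tight interaction budget, `beta` controls how many scarce environment steps are spent on reducing uncertainty rather than on collecting task reward, which is the trade-off the paper's claim about intrinsic motivation turns on.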

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Modular token interfaces may make it easier to test future improvements without rewriting the entire agent stack.
The success on three very different domains suggests the same four additions could be tried on planning-based world-model agents.
If prioritized replay of model rollouts remains useful, similar prioritization could be applied to other internal buffers such as value or policy targets.
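For readers unfamiliar with prioritized replay, the following sketch shows proportional prioritization in its generic form: item i is drawn with probability p_i^alpha / sum_j p_j^alpha. The priority signal, the exponent `alpha`, and the function names are assumptions; the summary does not specify how Simulus assigns priorities to world-model experience.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """Turn raw priorities into sampling probabilities; alpha=0 gives uniform."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, k, alpha=0.6, rng=None):
    """Draw k buffer indices with replacement, favoring high-priority items."""
    rng = rng or random.Random()
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=k)


# Hypothetical priorities, e.g. a recent model loss per stored segment.
priorities = [1.0, 1.0, 2.0]
batch = sample_indices(priorities, k=5, alpha=1.0, rng=random.Random(0))
```

The same two functions would apply unchanged to any other buffer, provided a per-item priority is available.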

Load-bearing premise

That the four components complement each other without significant negative interactions and that intrinsic motivation remains beneficial even under the tight interaction budgets of sample-efficient RL.

What would settle it

An ablation on any of the three benchmarks in which the full Simulus agent underperforms a version that omits one or more of the four components.

Figures

Figures reproduced from arXiv: 2502.11537 by Bingyi Kang, Kaixin Wang, Lior Cohen, Shie Mannor, Uri Gadot.

**Figure 2.** An illustration of the independent processing of modalities for an observation with two modalities. Sequence modeling: given a sequence of observation-action blocks $X = X_1, \ldots, X_t$, the matching outputs $Y_1, \ldots, Y_t$ are computed auto-regressively as $(S_t, Y_t) = f_\theta(S_{t-1}, X_t)$, where $S_t$ is a recurrent state that summarizes $X_{\le t}$ and $S_0 = 0$. However, the output $Y^u_{t+1}$, from which $\hat{z}_{t+1}$ is predicted…
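The recurrence quoted in the Figure 2 text, in which each block updates a state and emits an output starting from a zero state, can be sketched as a simple loop. The step function `toy_f` below is a hypothetical scalar stand-in for the learned sequence model, used only to show the data flow.

```python
def run_sequence(f, blocks, initial_state=0.0):
    """Apply (state, output) = f(previous_state, block) over a sequence."""
    state = initial_state  # the initial state is zero
    outputs = []
    for x in blocks:
        state, y = f(state, x)
        outputs.append(y)
    return state, outputs


def toy_f(state, x):
    """Hypothetical stand-in for the learned model: decay the state, add the input."""
    new_state = 0.5 * state + x
    return new_state, 2.0 * new_state


final_state, outputs = run_sequence(toy_f, [1.0, 2.0])
```

Because the state after step t depends only on the previous state and the current block, it acts as a running summary of everything seen so far.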

**Figure 3.** World model training and imagination. To maintain visual clarity, we omitted token

**Figure 4.** Results on the DeepMind Control Suite 500K Proprioception (top) and Atari 100K (bottom)

**Figure 5.** Craftax-1M training curves with mean and 95% confidence intervals.

**Figure 6.** Ablations results on the Atari-100K and DeepMind Control Proprioception 500K bench

**Figure 7.** Performance profile. For each human-normalized score value on the x-axis, the curve

**Figure 8.** Reconstruction (Lpred) and dynamics (Ldyn) losses of PWM and PWM-decoupled on four Atari games (single seed). The first column uses a log-scaled y-axis. Decoupling the optimization objectives consistently reduces reconstruction loss while increasing dynamics loss, suggesting interference between the two objectives.

**Figure 9.** Agent episodic returns throughout training of PWM and PWM-decoupled on four Atari

**Figure 10.** Ground truth (top) and reconstructed (bottom) frames from a training episode of PWM

**Figure 11.** Ground truth (top) and reconstructed (bottom) frames from a training episode of PWM

Original abstract

World models (WMs) represent the frontier of sample-efficient reinforcement learning, but their complexity leaves many promising improvements unrealized due to the significant expertise and effort required to identify and integrate them. Inspired by Rainbow, which showed that individually known improvements to DQN complement each other and can be effectively combined, we take on this challenge and ask whether the same principle applies to world model agents. We introduce Simulus, a modular token-based WM agent that integrates: (1) a flexible tokenization framework supporting arbitrary combinations of observation and action modalities; (2) intrinsic motivation for epistemic uncertainty reduction; (3) prioritized world model replay; and (4) regression-as-classification for reward and return prediction. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks: visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M. Notably, intrinsic motivation proves beneficial even under the tight interaction budgets of sample-efficient RL, despite the risk of wasting scarce interactions on task-irrelevant experience. Ablation studies reveal that each component contributes individually, and their combination yields synergistic gains. Our code and model weights are publicly available at https://github.com/leor-c/Simulus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Simulus combines four known pieces in a token-based world model and shows they add up to SOTA sample efficiency on three benchmarks, with ablations and public code.

The main takeaway is that Simulus takes tokenization for mixed modalities, intrinsic motivation, prioritized replay, and regression-as-classification, puts them into one agent, and reports better sample efficiency than prior planning-free world models on visual Atari 100K, DMC proprioception 500K, and symbolic Craftax-1M. The modular token framework is the clearest engineering contribution, letting the same backbone handle different observation and action types without major rewrites. Ablations indicate each component helps on its own and the full set produces further gains, and the paper directly checks that intrinsic motivation still pays off at these tight budgets rather than wasting interactions. Releasing code and weights is useful here because it lets others test the SOTA numbers directly. The work stays empirical and does not claim new theory, which keeps the scope clear. The main soft spot is that gains in this area often depend on hyperparameter choices and baseline fairness; the ablations help, but the usual risk remains that some of the edge comes from tuning rather than the listed ideas. Citation patterns look standard for the subfield, with no obvious missing priors on the individual components. This paper is aimed at people already working on sample-efficient world models who want a practical recipe for what to try next. It is not a foundational shift but gives a concrete, reproducible data point on combinations. I would send it to referees because the results are specific, the code is available, and the ablations address the most obvious questions about interactions.

Referee Report

0 major / 2 minor

Summary. The paper introduces Simulus, a modular token-based world model agent integrating four components: flexible tokenization supporting arbitrary observation/action modalities, intrinsic motivation for epistemic uncertainty reduction, prioritized world model replay, and regression-as-classification for reward/return prediction. It claims state-of-the-art sample efficiency for planning-free world models on visual Atari 100K, continuous-control DMC Proprioception 500K, and symbolic Craftax-1M benchmarks. Ablation studies indicate each component contributes positively on its own with synergistic gains from the full combination; intrinsic motivation remains beneficial at the tight 100K/500K/1M interaction budgets. Public code and model weights are released.

Significance. If the results hold, the work shows that the Rainbow-style combination of complementary improvements can be successfully applied to world model agents, potentially reducing the expertise barrier for building sample-efficient RL systems. The public code and weights directly support reproducibility of the SOTA claims across three diverse benchmarks and address concerns about experimental details.

minor comments (2)

[Abstract] The abstract claims SOTA results but does not name the specific metrics (e.g., mean return, human-normalized score) or list the exact baselines against which superiority is measured.
The manuscript would benefit from explicit reporting of the number of random seeds, confidence intervals, and any statistical tests used to support the ablation and benchmark comparisons.
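The second comment asks for uncertainty estimates over random seeds. One simple way to produce them is a percentile bootstrap over per-seed scores, sketched below. The function name, the resample count, and the example scores are all hypothetical; the paper's own statistical protocol is not described in this summary.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-seed scores."""
    rng = random.Random(seed)
    # Resample the seeds with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed scores, for illustration only (not results from the paper).
example_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
low, high = bootstrap_ci(example_scores)
```

With only a handful of seeds the interval will be wide, which is itself useful information when comparing ablation variants.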

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of Simulus, the recognition of its modular design and reproducibility contributions, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study that combines four modular improvements to world-model agents and validates them via ablation experiments on three standard benchmarks (Atari 100K, DMC 500K, Craftax-1M). All performance claims rest on reported interaction counts, reward curves, and ablation tables rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. Public code and weights are supplied, making the results externally reproducible against the same benchmarks without reliance on self-citation chains or self-definitional steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on the abstract, the paper builds on existing world model techniques and standard RL practices without introducing new free parameters beyond typical hyperparameter tuning or new entities. The central claim rests on the assumption that the chosen benchmarks appropriately test sample efficiency and that the components integrate synergistically.

free parameters (1)

Various agent and training hyperparameters
Standard in deep RL implementations; specific values are tuned for each benchmark but not detailed in the abstract.

axioms (1)

domain assumption: Environments follow the standard Markov decision process formulation used in RL
The paper operates within the conventional RL framework for world models and planning-free agents.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Deep reinforcement learning at the edge of the statistical precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Belle- mare. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc....

work page 2021
[3]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. arXiv preprint arXiv:2405.12399, 2024

work page arXiv 2024
[4]

Agent57: Outperforming the Atari human benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvit- skyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, p...

work page 2020
[5]

Never give up: Learning directed exploration strategies

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies. In International Con- ference on Learning Representations, 2020. URL https://openreview.net/forum?id= Sye57xStvB

work page 2020
[6]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

work page 2024
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

work page 1901
[10]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1lJJnR5Ym

work page 2019
[11]

Improving token-based world models with parallel observation prediction

Lior Cohen, Kaixin Wang, Bingyi Kang, and Shie Mannor. Improving token-based world models with parallel observation prediction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=Lfp5Dk1xb6

work page 2024
[12]

Oasis: A universe in a transformer, 2024

Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer, 2024. URL https://oasis-model.github.io/. 10

work page 2024
[13]

Improving transformer world models for data-efficient rl, 2025

Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, and Kevin Patrick Murphy. Improving transformer world models for data-efficient rl, 2025. URL https://arxiv.org/abs/2502.01591

work page arXiv 2025
[14]

Genie 2: A large-scale foundation world model, 2024

Google DeepMind. Genie 2: A large-scale foundation world model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation- world-model/

work page 2024
[15]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[16]

Stop regressing: Training value functions via classification for scalable deep RL

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scalable deep RL. In Forty-first International Conference on Machine Learning, 2024. URL https://openr...

work page 2024
[17]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper/7512-recurrent-world-models- facilitate-policy-evolution. https://worldmodels.github.io

work page 2018
[18]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representa- tions, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS

work page 2020
[19]

Lillicrap, Mohammad Norouzi, and Jimmy Ba

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021. URL https: //openreview.net/forum?id=0oabwyZbOu

work page 2021
[20]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

TD-MPC2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Oxh5CstDJU

work page 2024
[22]

Provably efficient maximum entropy exploration

Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2681–2691. PMLR, 09–15 Jun 2019. URL https://proceedings. mlr.p...

work page 2019
[23]

Exploration via ellip- tical episodic bonuses

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via ellip- tical episodic bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems , 2022. URL https: //openreview.net/forum?id=Xg-yZos9qJQ

work page 2022
[24]

Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017

Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017. URL https://openreview.net/forum?id=Bk0MRI5lg

work page 2017
[25]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997

work page 1997
[27]

Perceptual losses for real-time style transfer and super-resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016. 11

work page 2016
[28]

Model based reinforcement learning for atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Bla˙zej Osi´nski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Conference on Learning Representations , 2020. URL https: /...

work page 2020
[29]

Transformers are RNNs: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 13–18 Jul 2...

work page 2020
[30]

Curious replay for model-based adaptation

Isaac Kauvar, Chris Doyle, Linqi Zhou, and Nick Haber. Curious replay for model-based adaptation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

work page 2023
[31]

Simple and scal- able predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scal- able predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https:/...

work page 2017
[32]

Autoencoding beyond pixels using a learned similarity metric

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1558–1566, New York, N...

work page 2016
[33]

UNIFIED-IO: A unified model for vision, language, and multi-modal tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. UNIFIED-IO: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=E01k9048soZ

work page 2023
[34]

Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26439–26455, June 2024

work page 2024
[35]

Craftax: A lightning-fast benchmark for open-ended reinforcement learning

Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In International Conference on Machine Learning (ICML), 2024

work page 2024
[36]

Dis- covering and achieving goals via world models

Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Dis- covering and achieving goals via world models. Advances in Neural Information Processing Systems, 34:24379–24391, 2021

work page 2021
[37]

Transformers are sample-efficient world models

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/ pdf?id=vhFu1Acb0xb

work page 2023
[38]

Efficient world models with context-aware tokenization

Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient world models with context-aware tokenization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum? id=BiWIERWBFX

work page 2024
[39]

Human-level control through deep reinforcement learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015. 12

work page 2015
[40]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

work page 2019
[41]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven ex- ploration by self-supervised prediction. In Doina Precup and Yee Whye Teh, editors, Pro- ceedings of the 34th International Conference on Machine Learning , volume 70 of Pro- ceedings of Machine Learning Research, pages 2778–2787. PMLR, 06–11 Aug 2017. URL https://pro...

work page 2017
[42]

Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Searching for activation functions, 2018. URL https://openreview.net/forum?id=SkBYYyZRZ

work page 2018
[43]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on ...

work page 2022
[44]

Transformer-based world models are happy with 100k interactions

Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023

work page arXiv 2023
[45]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

work page 2022
[46]

doi: 10.1109/TAMD.2010.2056368

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development , 2, 2010. ISSN 19430604. doi: 10.1109/TAMD.2010.2056368

work page doi:10.1109/tamd.2010.2056368 1990
[47]

A generalist dynamics model for control

Ingmar Schubert, Jingwei Zhang, Jake Bruce, Sarah Bechtle, Emilio Parisotto, Martin Ried- miller, Jost Tobias Springenberg, Arunkumar Byravan, Leonard Hasenclever, and Nicolas Heess. A generalist dynamics model for control. arXiv preprint arXiv:2305.10912, 2023

work page arXiv 2023
[48]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[49]

Planning to explore via self-supervised world models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8583–8592. PMLR, 13–18 Jul 2020...

work page 2020
[50]

Model-based active exploration

Pranav Shyam, Wojciech Ja´skowski, and Faustino Gomez. Model-based active exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5779–5788. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ shyam19a.html

work page 2019
[51]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0b...
[53]

dm_control: Software and tasks for continuous control

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https://doi.org/10.1016/j.simpa.2020.100022. URL https://www.sciencedirect.com/science/article/...
[54]

Diffusion Models Are Real-Time Game Engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. CoRR, abs/2408.14837, 2024. URL https://doi.org/10.48550/arXiv.2408.14837
[55]

Neural discrete representation learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/...
[56]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. UR...
[57]

Efficientzero v2: Mastering discrete and continuous control with limited data

Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, and Yang Gao. Efficientzero v2: Mastering discrete and continuous control with limited data. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=LHGMXcr6zx
[58]

Parallelizing model-based reinforcement learning over the sequence length

ZiRui Wang, Yue Deng, Junfeng Long, and Yin Zhang. Parallelizing model-based reinforcement learning over the sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=R6N9AGyz13
[59]

ivideoGPT: Interactive videoGPTs are scalable world models

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideoGPT: Interactive videoGPTs are scalable world models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4TENzBftZR
[60]

Storm: Efficient stochastic transformer based world models for reinforcement learning

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

A Models and Hyperparameters

A.1 Hyperparameters

We detail shared hyperparameters in Table 1, training hyperparameters in Table 2, world model h...
[61]

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: We provide extensive empirical evidence in Section 3, including ablation studies, which directly relate to our contributions and claims. The scope of our paper is sample-efficient, planning-free wor...

[62]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 5 explicitly discusses the limitations of our work. Additional limitations are discussed in Section 3 (e.g., the absence of ablations on Craftax due to computational limitations). Guidelines: • The answer NA means that the ...
[63]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: Our paper does not include theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and proo...

[64]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: In Section 2 and in the ap...
[65]

Our code has a detailed readme file for easy usage, and we also provide Docker support, which enables an easy environment setup and enhances reproducibility on any operating system

Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: In our abstract and appendix we provide a link to the code and trained model weights. Our code has a detail...
[66]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify all experimental details in Section 3 and in Appendix A and C. Guidelines: • The answer NA means t...
[67]

Figure 5 also includes error bars

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We utilize the rliable toolkit [2] to generate plots with appropriate error bars (Figure 4 bottom, Figure 6). Figure 5 also includ...

[68]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide this information in Appendix C. Guidelines: • The answer NA means that the paper does not in...
[69]

No human subjects or participants were involved

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: Our work follows the NeurIPS Code of Ethics. No human subjects or participants were involved. We found no special concerns beyond those related to the genera...
[70]

As such, there are no direct positive or negative societal impacts

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This paper presents a foundational work in the field of Machine Learning. As such, there are no direct positive or negative societal impacts. Guidelines: • The answer NA means that th...
[71]

Hence, we do not introduce additional safeguards

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our work does not pose any additional risks beyond those of common deep reinforcement learning ...

[72]

We follow the licenses of all assets used in our work

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Our paper cites all relevant assets, and our open-sourced repository includes a credits section ...

[73]

Guidelines: • The answer NA means that the paper does not release new assets

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Our open-sourced repository includes all new assets and is well documented. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the detai...

[74]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research...

[75]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

[76]

Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

[2]

Deep reinforcement learning at the edge of the statistical precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc....

[3]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. arXiv preprint arXiv:2405.12399, 2024

[4]

Agent57: Outperforming the Atari human benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, p...

[5]

Never give up: Learning directed exploration strategies

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Sye57xStvB

[6]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

[8]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

[10]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1lJJnR5Ym

[11]

Improving token-based world models with parallel observation prediction

Lior Cohen, Kaixin Wang, Bingyi Kang, and Shie Mannor. Improving token-based world models with parallel observation prediction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=Lfp5Dk1xb6

[12]

Oasis: A universe in a transformer, 2024

Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer, 2024. URL https://oasis-model.github.io/

[13]

Improving transformer world models for data-efficient rl, 2025

Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, and Kevin Patrick Murphy. Improving transformer world models for data-efficient rl, 2025. URL https://arxiv.org/abs/2502.01591

[14]

Genie 2: A large-scale foundation world model, 2024

Google DeepMind. Genie 2: A large-scale foundation world model, 2024. URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/

[15]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

[16]

Stop regressing: Training value functions via classification for scalable deep RL

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scalable deep RL. In Forty-first International Conference on Machine Learning, 2024. URL https://openr...

[17]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution. https://worldmodels.github.io

[18]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS

[19]

Mastering Atari with discrete world models

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=0oabwyZbOu

[20]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

[21]

TD-MPC2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Oxh5CstDJU

[22]

Provably efficient maximum entropy exploration

Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2681–2691. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.p...

[23]

Exploration via elliptical episodic bonuses

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via elliptical episodic bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Xg-yZos9qJQ

[24]

Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017

Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units, 2017. URL https://openreview.net/forum?id=Bk0MRI5lg

[25]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

[26]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

[27]

Perceptual losses for real-time style transfer and super-resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016

[28]

Model based reinforcement learning for atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Conference on Learning Representations, 2020. URL https: /...

[29]

Transformers are RNNs: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 13–18 Jul 2...

[30]

Curious replay for model-based adaptation

Isaac Kauvar, Chris Doyle, Linqi Zhou, and Nick Haber. Curious replay for model-based adaptation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

[31]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:/...

[32]

Autoencoding beyond pixels using a learned similarity metric

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1558–1566, New York, N...

[33]

UNIFIED-IO: A unified model for vision, language, and multi-modal tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. UNIFIED-IO: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=E01k9048soZ

[34]

Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26439–26455, June 2024

[35]

Craftax: A lightning-fast benchmark for open-ended reinforcement learning

Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, and Jakob Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In International Conference on Machine Learning (ICML), 2024

[36]

Discovering and achieving goals via world models

Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals via world models. Advances in Neural Information Processing Systems, 34:24379–24391, 2021

[37]

Transformers are sample-efficient world models

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=vhFu1Acb0xb

[38]

Efficient world models with context-aware tokenization

Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient world models with context-aware tokenization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=BiWIERWBFX

[39]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

[40]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

[41]

Curiosity-driven exploration by self-supervised prediction

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2778–2787. PMLR, 06–11 Aug 2017. URL https://pro...

[42]

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2018. URL https://openreview.net/forum?id=SkBYYyZRZ

[43]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on ...

[44]

Transformer-based world models are happy with 100k interactions

Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023

[45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

[46]

Formal theory of creativity, fun, and intrinsic motivation (1990-2010)

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2, 2010. ISSN 19430604. doi: 10.1109/TAMD.2010.2056368

[47] [47]

A generalist dynamics model for control

Ingmar Schubert, Jingwei Zhang, Jake Bruce, Sarah Bechtle, Emilio Parisotto, Martin Ried- miller, Jost Tobias Springenberg, Arunkumar Byravan, Leonard Hasenclever, and Nicolas Heess. A generalist dynamics model for control. arXiv preprint arXiv:2305.10912, 2023

work page arXiv 2023

[48] [48]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[49] [49]

Planning to explore via self-supervised world models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8583–8592. PMLR, 13–18 Jul 2020...

work page 2020

[50] [50]

Model-based active exploration

Pranav Shyam, Wojciech Ja´skowski, and Faustino Gomez. Model-based active exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5779–5788. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ shyam19a.html

work page 2019

[51] [51]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Policy gradi- ent methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradi- ent methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems , volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/ file/464d828b85b0b...

work page 1999

[53] [53]

2020 , issn =

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https:// doi.org/10.1016/j.simpa.2020.100022. URL https://www.sciencedirect.com/science/ article/...

work page doi:10.1016/j.simpa.2020.100022 2020

[54] [54]

Diffusion Models Are Real-Time Game Engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. CoRR, abs/2408.14837, 2024. URL https://doi.org/10.48550/ arXiv.2408.14837

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

Neural discrete representation learning

Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural discrete representation learning. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/ 2017/...

work page 2017

[56] [56]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- itors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. UR...

work page 2017

[57] [57]

Efficientzero v2: Mas- tering discrete and continuous control with limited data

Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, and Yang Gao. Efficientzero v2: Mas- tering discrete and continuous control with limited data. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=LHGMXcr6zx

work page 2024

[58] [58]

Parallelizing model-based rein- forcement learning over the sequence length

ZiRui Wang, Yue DENG, Junfeng Long, and Yin Zhang. Parallelizing model-based rein- forcement learning over the sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= R6N9AGyz13

work page 2024

[59] [59]

ivideoGPT: Interactive videoGPTs are scalable world models

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye HAO, and Mingsheng Long. ivideoGPT: Interactive videoGPTs are scalable world models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview. net/forum?id=4TENzBftZR

work page 2024

[60] [60]

Conv(a,b,c)

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. 14 A Models and Hyperparameters A.1 Hyperparameters We detail shared hyperparameters in Table 1, training hyperparameters in Table 2, world model h...

[61]

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: We provide extensive empirical evidence in Section 3, including ablation studies, which directly relate to our contributions and claims. The scope of our paper is sample-efficient, planning-free wor...

[62]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 5 explicitly discusses the limitations of our work. Additional limitations are discussed in Section 3 (e.g., the absence of ablations on Craftax due to computational limitations). Guidelines: • The answer NA means that the ...

[63]

Theory assumptions and proofs

Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: Our paper does not include theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and proo...

[64]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: In Section 2 and in the ap...

[65]

Our code has a detailed readme file for easy usage, and we also provide Docker support, which enables easy environment setup and enhances reproducibility on any operating system

Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: In our abstract and appendix we provide a link to the code and trained model weights. Our code has a detail...

[66]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify all experimental details in Section 3 and in Appendix A and C. Guidelines: • The answer NA means t...

[67]

Figure 5 also includes error bars

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We utilize the rliable toolkit [2] to generate plots with appropriate error bars (Figure 4 bottom, Figure 6). Figure 5 also includ...

[68]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide this information in Appendix C. Guidelines: • The answer NA means that the paper does not in...

[69]

Code of ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: Our work follows the NeurIPS Code of Ethics. No human subjects or participants were involved. We found no special concerns beyond those related to the genera...

[70]

Broader impacts

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This paper presents a foundational work in the field of Machine Learning. As such, there are no direct positive or negative societal impacts. Guidelines: • The answer NA means that th...

[71]

Hence, we do not introduce additional safeguards

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our work does not pose any additional risks beyond those of common deep reinforcement learning ...

[72]

We follow the licenses of all assets used in our work

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Our paper cites all relevant assets, and our open-sourced repository includes a credits section ...

[73]

New assets

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Our open-sourced repository includes all new assets and is well documented. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the detai...

[74]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research...

[75]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

[76]

Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...
