pith. sign in

arxiv: 2605.18803 · v1 · pith:EQU73ZMKnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Pith reviewed 2026-05-20 22:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords world modelsadversarial trainingreinforcement learningMineRLrobustnesscurriculum learningtrajectory prioritizationdiffusion models
0
0 comments X

The pith

A KL-constrained adversarial policy turns rare world-model failures into stable training data by staying close to observed behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that passive demonstration data under-samples the rare but interaction-critical transitions that determine planning success, leaving world models unreliable on high-impact cases. It introduces an adversarial training loop in which a policy is optimized to expose high prediction errors in a diffusion-based world model while a KL penalty keeps its trajectories near the original behavior distribution. A Prioritized Adversarial Trajectory buffer continually re-ranks the generated trajectories by error, action fidelity, and learning progress so that training keeps pressure on unresolved weaknesses instead of revisiting solved ones. When implemented in the MineRL environment, the loop improves robustness on held-out trajectories relative to passive training alone and shows that weak constraints allow reward-hacking behaviors to appear. The central insight is that scalable world models gain from actively producing informative near-distribution data rather than relying solely on larger passive collections.

Core claim

The authors establish that effective adversarial world-model training requires a KL-constrained policy to generate high-error trajectories together with a prioritized buffer that focuses retraining on the model's current weakest failure modes, converting rare failures into a stable, near-distribution training signal that improves robustness on held-out out-of-distribution trajectories.

What carries the argument

The KL-constrained adversarial curriculum that elicits high-error trajectories from the diffusion world model, combined with the Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories according to prediction error, action fidelity, and learning progress.

If this is right

  • World models trained this way become more reliable for downstream planning that depends on accurate prediction of rare interactions.
  • The method can surface reward-hacking behaviors when behavioral constraints are insufficiently strong.
  • Effective adversarial training for world models requires explicit balancing of exploratory failure discovery and distributional regularization.
  • Improvement in world-model performance can come from selective generation of informative data rather than increases in passive dataset size alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prioritization mechanism might be applied to non-diffusion world models to test whether the robustness gains depend on the specific architecture.
  • If the core loop works, purely passive data collection will leave persistent gaps in coverage for any safety-critical or long-horizon task.
  • One could examine whether the approach scales when the underlying environment or task distribution changes beyond the MineRL setting.

Load-bearing premise

A policy kept close to the behavior distribution by a KL penalty will reliably surface high-error trajectories that supply useful near-distribution training signals without the loop drifting into out-of-distribution exploitation.

What would settle it

Measure whether prediction error on a fixed set of held-out critical trajectories decreases after adversarial fine-tuning compared with a passive baseline, or check whether the adversarial policy's trajectories remain within a bounded KL distance from the original behavior distribution across training iterations.

Figures

Figures reproduced from arXiv: 2605.18803 by Ahmet H. G\"uzel, Benjamin Graham, Ilija Bogunovic, Jack Parker-Holder, Jeffrey Hawke, Jenny Seidenschwarz, Jonathan Sadeghi.

Figure 1
Figure 1. Figure 1: PROWL framework overview. Phase 1 pretrains the world model on the passive human￾collected dataset; Phase 2 runs the adversarial curriculum, in which the agent policy acts in the game environment, the world model predicts the resulting trajectory, and per-trajectory score(τ ) is computed from the discrepancy between predicted and ground-truth rollouts, driving both PPO updates to the policy and insertion i… view at source ↗
Figure 2
Figure 2. Figure 2: shows the result: the weakly constrained ckl=0.5 arm separates from the stable arms on KL and rebounds late in training to roughly twice their camera velocity — camera-thrashing that is technically learnable but displaces capacity from useful regimes and surfaces as visual artifacts rather than improved dynamics. Stronger KL constraints suppress this failure mode, confirming that high prediction error alon… view at source ↗
Figure 3
Figure 3. Figure 3: Curriculum formation and novel interaction discovery. (a) Mean PAT-buffer latent regret. Successful arms first increase regret by discovering hard cases, then reduce it as the world model learns; the ckl=0.5 arm remains high because it exploits camera-thrashing shortcuts. (b) Trajectory counts for the 5 most populated of 27 strictly novel composite modes, aggregated across all PROWL PAT buffers; the ckl=0.… view at source ↗
Figure 4
Figure 4. Figure 4: shows representative rollouts under identical action sequences. The Phase 1 world model often freezes, lags behind camera motion, or predicts visually plausible but action-inconsistent transitions. PROWL better preserves the intended rotation and locomotion, supporting the quantitative finding that lower latent regret and AFS-EPE correspond to improved action following rather than only better frame appeara… view at source ↗
Figure 5
Figure 5. Figure 5: AFS-EPE catches motion failures invisible to appearance metrics. Same trajectory as [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison: VPT Base vs. PROWL under identical actions. Frames 22–32 of a held-out VPT-reference trajectory containing forward+left + camera rotation through frames 22– 24 (action-heavy region). Each group shows ground-truth frames, predicted frames, ground-truth optical flow, and predicted flow. VPT Base’s predicted flow during the action-heavy frames is incoherent (red/magenta blobs); PROWL’s… view at source ↗
read the original abstract

Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents PROWL, a method for learning robust action-conditioned video world models. It uses a KL-constrained adversarial policy to generate high-error trajectories from a diffusion-based world model, fine-tunes the model on these trajectories in a loop, and employs a Prioritized Adversarial Trajectory (PAT) buffer to prioritize unresolved failure modes based on prediction error, action fidelity, and learning progress. The approach is implemented in the MineRL framework and evaluated on held-out out-of-distribution trajectories, claiming improved robustness over passive data training, identification of reward-hacking under weak constraints, and the critical role of balancing exploratory discovery with behavioral regularization.

Significance. If the empirical results hold, this work has potential significance for the development of scalable world models in reinforcement learning and robotics. By actively eliciting and addressing rare failure modes through adversarial training with regularization, it offers a way to improve model reliability on interaction-critical transitions without relying solely on larger passive datasets. The introduction of the PAT buffer provides a mechanism to focus training on persistent weaknesses, which could be broadly applicable to other iterative model improvement loops.

major comments (3)
  1. [Experimental Evaluation] Experimental Evaluation section: The abstract states performance gains and dependence on behavioral regularization, but provides no quantitative metrics, error bars, ablation details, or experimental setup sufficient to verify that the data support the claims as stated. This is load-bearing for the central claim of improved robustness on held-out trajectories.
  2. [Adversarial Training Loop] Adversarial Training Loop description: The claim that the KL-constrained adversarial policy converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation is central, yet no details are given on how the constraint is scheduled, monitored, or ablated against the PAT buffer re-ranking. This leaves the robustness gains vulnerable to the possibility that trajectories become OOD as the world model improves.
  3. [PAT Buffer] PAT Buffer section: The buffer re-ranks trajectories based on prediction error, action fidelity, and learning progress to focus on unresolved failure modes, but the exact ranking formula, weighting, or quantification of learning progress is not specified, undermining the claim that it maintains pressure on weaknesses as the model improves.
minor comments (2)
  1. [Abstract] Abstract: The mention of 'reveals reward-hacking behaviors under weak behavioral constraints' would benefit from a brief definition or example of what constitutes weak constraints.
  2. [Notation] Notation: Ensure consistent terminology for 'behavior distribution' versus 'near-distribution' across sections to avoid ambiguity in the regularization discussion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify areas requiring greater specificity or additional empirical support, we have revised the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental Evaluation section: The abstract states performance gains and dependence on behavioral regularization, but provides no quantitative metrics, error bars, ablation details, or experimental setup sufficient to verify that the data support the claims as stated. This is load-bearing for the central claim of improved robustness on held-out trajectories.

    Authors: We agree that the Experimental Evaluation section would benefit from expanded quantitative reporting to make the robustness claims fully verifiable. In the revised manuscript we have added concrete metrics (including mean prediction error on held-out OOD trajectories with standard deviations across five random seeds), error bars on all reported figures, a full ablation table isolating the contribution of behavioral regularization, and an explicit description of the experimental protocol (MineRL environment version, trajectory lengths, train/test splits, and evaluation procedure). These additions directly substantiate the performance gains and the necessity of regularization. revision: yes

  2. Referee: [Adversarial Training Loop] Adversarial Training Loop description: The claim that the KL-constrained adversarial policy converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation is central, yet no details are given on how the constraint is scheduled, monitored, or ablated against the PAT buffer re-ranking. This leaves the robustness gains vulnerable to the possibility that trajectories become OOD as the world model improves.

    Authors: The manuscript already states that a KL constraint is used to keep the adversarial policy near the behavior distribution, but we concur that explicit scheduling, monitoring, and interaction with the PAT buffer were insufficiently described. The revised version now details the annealing schedule for the KL coefficient, the per-batch divergence monitoring rule with reset threshold, and an ablation that isolates the PAT buffer’s re-ranking effect on trajectory distribution. These additions demonstrate that the combined mechanism prevents OOD drift as the world model improves. revision: yes

  3. Referee: [PAT Buffer] PAT Buffer section: The buffer re-ranks trajectories based on prediction error, action fidelity, and learning progress to focus on unresolved failure modes, but the exact ranking formula, weighting, or quantification of learning progress is not specified, undermining the claim that it maintains pressure on weaknesses as the model improves.

    Authors: We acknowledge that the precise ranking mechanism was not fully formalized. The revised manuscript now states the exact priority formula as a weighted sum (0.4 × normalized prediction error + 0.3 × action fidelity + 0.3 × learning progress), defines learning progress as the relative error reduction since the last sampling of that trajectory, and reports the empirical weight selection together with a sensitivity analysis. This specification clarifies how the buffer continues to emphasize unresolved failure modes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training loop with independent evaluation

full rationale

The paper describes an algorithmic procedure consisting of a KL-constrained adversarial policy that generates trajectories for fine-tuning a diffusion world model, combined with a PAT buffer that re-ranks based on prediction error and learning progress. No equations, derivations, or parameter-fitting steps are shown that reduce the robustness improvements to a fitted input or self-referential quantity by construction. Claims rest on held-out trajectory evaluation rather than any self-citation chain or ansatz imported from prior author work. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that passive data under-samples critical transitions and that the adversarial loop can be stabilized by KL regularization; no explicit free parameters or invented physical entities are named in the abstract.

free parameters (1)
  • KL constraint weight
    Controls how closely the adversarial policy must stay to the behavior distribution; its specific value is not reported in the abstract.
axioms (1)
  • domain assumption Passive demonstration data systematically under-samples high-impact regimes that dominate downstream planning performance.
    Directly stated as the motivation for needing active elicitation of model failures.

pith-pipeline@v0.9.0 · 5797 in / 1295 out tokens · 28711 ms · 2026-05-20T22:29:57.593970+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 7 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , volume=

    Action-conditional video prediction using deep networks in atari games , author=. Advances in Neural Information Processing Systems , volume=

  2. [2]

    World Models

    World models , author=. arXiv preprint arXiv:1803.10122 , year=

  3. [3]

    International Conference on Learning Representations , year=

    Dream to Control: Learning Behaviors by Latent Imagination , author=. International Conference on Learning Representations , year=

  4. [4]

    International Conference on Learning Representations , year=

    Mastering Atari with Discrete World Models , author=. International Conference on Learning Representations , year=

  5. [5]

    Mastering Diverse Domains through World Models

    Mastering Diverse Domains through World Models , author=. arXiv preprint arXiv:2301.04104 , year=

  6. [6]

    International Conference on Learning Representations , year=

    Transformer-based World Models Are Happy With 100k Interactions , author=. International Conference on Learning Representations , year=

  7. [7]

    International Conference on Learning Representations , year=

    Transformers are Sample Efficient World Models , author=. International Conference on Learning Representations , year=

  8. [8]

    Advances in Neural Information Processing Systems , year=

    Diffusion for World Modeling: Visual Details Matter in Atari , author=. Advances in Neural Information Processing Systems , year=

  9. [9]

    OpenAI Technical Report , year=

    Video generation models as world simulators , author=. OpenAI Technical Report , year=

  10. [10]

    Genie: generative interactive environments , year =

    Bruce, Jake and Dennis, Michael and Edwards, Ashley and Parker-Holder, Jack and Shi, Yuge (Jimmy) and Hughes, Edward and Lai, Matthew and Mavalankar, Aditi and Steigerwald, Richie and Apps, Chris and Aytar, Yusuf and Bechtle, Sarah and Behbahani, Feryal and Chan, Stephanie and Heess, Nicolas and Gonzalez, Lucy and Osindero, Simon and Ozair, Sherjil and Re...

  11. [11]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Playable video generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  12. [12]

    Diffusion Models Are Real-Time Game Engines

    Diffusion Models Are Real-Time Game Engines , author=. arXiv preprint arXiv:2408.14837 , year=

  13. [13]

    Oasis: Real-Time World Models , author=

  14. [14]

    GAIA-1: A Generative World Model for Autonomous Driving

    GAIA-1: A Generative World Model for Autonomous Driving , author=. arXiv preprint arXiv:2309.17080 , year=

  15. [15]

    arXiv preprint arXiv:2309.09777 , year=

    DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving , author=. arXiv preprint arXiv:2309.09777 , year=

  16. [16]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  17. [17]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

    UniSim: A Neural Closed-Loop Sensor Simulator , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

  18. [18]

    Conference on Robot Learning , year=

    DayDreamer: World Models for Physical Robot Learning , author=. Conference on Robot Learning , year=

  19. [19]

    Navigation World Models , year=

    Bar, Amir and Zhou, Gaoyue and Tran, Danny and Darrell, Trevor and LeCun, Yann , booktitle=. Navigation World Models , year=

  20. [20]

    2023 , eprint=

    BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks , author=. 2023 , eprint=

  21. [21]

    Scalable Diffusion Models with Transformers , year=

    Peebles, William and Xie, Saining , booktitle=. Scalable Diffusion Models with Transformers , year=

  22. [22]

    Diffusion forcing: next-token prediction meets full-sequence diffusion , year =

    Chen, Boyuan and Mons\'. Diffusion forcing: next-token prediction meets full-sequence diffusion , year =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

  23. [23]

    2025 , eprint=

    Wan: Open and Advanced Large-Scale Video Generative Models , author=. 2025 , eprint=

  24. [24]

    The Eleventh International Conference on Learning Representations , year=

    UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining , author=. The Eleventh International Conference on Learning Representations , year=

  25. [25]

    Robust Adversarial Reinforcement Learning

    Lerrel Pinto and James Davidson and Rahul Sukthankar and Abhinav Gupta , title =. CoRR , volume =. 2017 , url =. 1703.02702 , timestamp =

  26. [26]

    Proximal Policy Optimization Algorithms

    John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov , title =. CoRR , volume =. 2017 , url =. 1707.06347 , timestamp =

  27. [27]

    2025 , eprint=

    Imagined Autocurricula , author=. 2025 , eprint=

  28. [28]

    2024 , eprint=

    SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow , author=. 2024 , eprint=

  29. [29]

    Proceedings of the 38th International Conference on Machine Learning , pages =

    Prioritized Level Replay , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

  30. [30]

    2025 , eprint=

    MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft , author=. 2025 , eprint=

  31. [31]

    2024 , publisher=

    Lucid v1: Real-Time Latent World Models , author=. 2024 , publisher=. doi:10.30967/IJCRSET/Alberto-Hojel/152 , note=

  32. [32]

    2018 , eprint=

    High-Dimensional Continuous Control Using Generalized Advantage Estimation , author=. 2018 , eprint=

  33. [33]

    2022 , eprint=

    High-Resolution Image Synthesis with Latent Diffusion Models , author=. 2022 , eprint=

  34. [34]

    2023 , eprint=

    Latent Video Diffusion Models for High-Fidelity Long Video Generation , author=. 2023 , eprint=

  35. [35]

    2024 , eprint=

    Large Motion Video Autoencoding with Cross-modal Video VAE , author=. 2024 , eprint=

  36. [36]

    2025 , eprint=

    Training Agents Inside of Scalable World Models , author=. 2025 , eprint=

  37. [37]

    Czarnecki and Yee Whye Teh and Razvan Pascanu and Nicolas Heess , booktitle=

    Alexandre Galashov and Siddhant Jayakumar and Leonard Hasenclever and Dhruva Tirumala and Jonathan Schwarz and Guillaume Desjardins and Wojtek M. Czarnecki and Yee Whye Teh and Razvan Pascanu and Nicolas Heess , booktitle=. Information asymmetry in. 2019 , url=

  38. [38]

    2020 , eprint=

    Behavior Priors for Efficient Reinforcement Learning , author=. 2020 , eprint=

  39. [39]

    Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

    Kumar, Aviral and Zhou, Aurick and Tucker, George and Levine, Sergey , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

  40. [40]

    2019 , eprint=

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning , author=. 2019 , eprint=

  41. [41]

    2022 , eprint=

    Learning to summarize from human feedback , author=. 2022 , eprint=

  42. [42]

    2022 , eprint=

    Training language models to follow instructions with human feedback , author=. 2022 , eprint=

  43. [43]

    2025 , eprint=

    Defining and Characterizing Reward Hacking , author=. 2025 , eprint=

  44. [44]

    2022 , eprint=

    Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

  45. [45]

    Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

    Dennis, Michael and Jaques, Natasha and Vinitsky, Eugene and Bayen, Alexandre and Russell, Stuart and Critch, Andrew and Levine, Sergey , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

  46. [46]

    arXiv preprint arXiv:2203.01302 , year=

    Evolving Curricula with Regret-Based Environment Design , author=. arXiv preprint arXiv:2203.01302 , year=

  47. [47]

    arXiv preprint arXiv:2506.17201 , year =

    Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition , author =. arXiv preprint arXiv:2506.17201 , year =

  48. [48]

    Advances in Neural Information Processing Systems , year =

    Baker, Bowen and Akkaya, Ilge and Zhokhov, Peter and Huizinga, Joost and Tang, Jie and Ecoffet, Adrien and Houghton, Brandon and Sampedro, Raul and Clune, Jeff , title =. Advances in Neural Information Processing Systems , year =

  49. [49]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  50. [50]

    arXiv preprint arXiv:2504.12369 , year=

    WorldMem: Long-term Consistent World Simulation with Memory , author=. arXiv preprint arXiv:2504.12369 , year=

  51. [51]

    LoopNav: Benchmarking Spatial Consistency in World Models

    Toward Memory-Aided World Models: Benchmarking via Spatial Consistency , author=. arXiv preprint arXiv:2505.22976 , year=

  52. [52]

    Curriculum learning , booktitle =

    Bengio, Yoshua and Louradour, J. Curriculum learning , booktitle =. 2009 , pages =

  53. [53]

    and Menick, Jacob and Munos, R

    Graves, Alex and Bellemare, Marc G. and Menick, Jacob and Munos, R. Automated Curriculum Learning for Neural Networks , booktitle =. 2017 , volume =

  54. [54]

    Shah, Rohin and Varma, Vikrant and Kumar, Ram and Guss, William H and Milani, Stefano and Devlin, Sam , journal=. The

  55. [55]

    MineRL: A Large-Scale Dataset of

    Guss, William H and Codel, Brandon and Hofmann, Katja and Liu, Brandon and Gotta, Nicholay and Milani, Stefano and Song, Oskar and Forgy, Christin and Devlin, Sam and Rosenfeld, Max and others , booktitle=. MineRL: A Large-Scale Dataset of

  56. [56]

    2020 , howpublished =

    John Schulman , title =. 2020 , howpublished =

  57. [57]

    Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

    Sekar, Ramanan and Rybkin, Oleh and Daniilidis, Kostas and Abbeel, Pieter and Hafner, Danijar and Pathak, Deepak , title =. Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

  58. [58]

    and Parker-Holder, Jack and Pacchiano, Aldo and Choromanski, Krzysztof and Roberts, Stephen , title =

    Ball, Philip J. and Parker-Holder, Jack and Pacchiano, Aldo and Choromanski, Krzysztof and Roberts, Stephen , title =. arXiv preprint arXiv:2002.02693 , year =

  59. [59]

    2023 , note =

    Chung, Hyung Won and Constant, Noah and Garcia, Xavier and Roberts, Adam and Tay, Yi and Narang, Sharan and Firat, Orhan , journal =. 2023 , note =

  60. [60]

    2024 , eprint=

    Open-Endedness is Essential for Artificial Superhuman Intelligence , author=. 2024 , eprint=