pith. machine review for the scientific record.

arxiv: 2301.04104 · v2 · submitted 2023-01-10 · 💻 cs.AI · cs.LG · stat.ML

Mastering Diverse Domains through World Models

Pith reviewed 2026-05-11 09:00 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · stat.ML
keywords reinforcement learning · world models · DreamerV3 · Minecraft · model-based planning · sparse rewards · general agents

The pith

DreamerV3 learns a world model to imagine futures and masters over 150 tasks plus Minecraft diamond collection with one fixed setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a single reinforcement learning method can handle a broad range of control problems by building an internal model of the environment and using it to simulate possible future sequences. A reader would care because most current algorithms demand heavy human work to adapt to each new setting, and success here would reduce that barrier. The authors show the method reaching diamond collection in Minecraft from random starts using only pixel views and sparse rewards, an open-world task long viewed as difficult. They also report that the same configuration beats task-specific approaches across more than 150 varied problems. If the result holds, reinforcement learning could move from narrow lab experiments toward wider use in new domains without repeated retuning.

Core claim

DreamerV3 learns a model of the environment from interaction and improves its policy by imagining future scenarios inside that model. Techniques for normalization to keep signals in range, balancing to equalize different learning signals, and transformations to reshape inputs let the same algorithm run stably across domains. This produces the first from-scratch diamond collection in Minecraft and stronger results than specialized algorithms on more than 150 other tasks, all with an unchanged configuration.
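
One transformation of this kind is the symlog mapping, which compresses large positive and negative values while leaving small ones nearly unchanged. The sketch below is minimal and assumes the standard definition symlog(x) = sign(x) · ln(|x| + 1) with inverse symexp. The function names are illustrative, and this page does not say which quantities DreamerV3 applies the mapping to.

```python
import math


def symlog(x: float) -> float:
    """Symmetric log compression: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)
```

Because the mapping is odd and invertible, a value of 1000.0 is squeezed to roughly 6.91 and can be recovered with symexp, so targets of very different magnitudes end up on a comparable scale.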

What carries the argument

A learned world model that predicts future states, rewards, and continuation signals, allowing the agent to evaluate and improve actions by rolling out imagined trajectories rather than only real experience.
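
The idea of scoring behaviour inside the model can be sketched as follows. This is a toy illustration, not DreamerV3's actor-critic training: `ToyWorldModel`, its hand-written dynamics, and `imagined_return` are hypothetical stand-ins for a learned model that predicts the next state, the reward, and a continuation signal.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a learned model: 1-D state, goal at 5.0."""
    goal: float = 5.0

    def predict(self, state: float, action: float) -> Tuple[float, float, float]:
        """Return (next_state, reward, continuation) for one imagined step."""
        next_state = state + action
        reward = 1.0 if abs(next_state - self.goal) < 0.5 else 0.0
        cont = 0.0 if reward > 0.0 else 1.0  # imagined episode ends at the goal
        return next_state, reward, cont


def imagined_return(
    model: ToyWorldModel,
    policy: Callable[[float], float],
    state: float,
    horizon: int = 10,
    discount: float = 0.99,
) -> float:
    """Discounted return of one rollout that never touches the real environment."""
    total, weight = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward, cont = model.predict(state, action)
        total += weight * reward
        weight *= discount * cont  # continuation of 0 cuts off later rewards
        if weight == 0.0:
            break
    return total
```

Comparing `imagined_return` for two candidate policies (for example, always stepping +1 versus always stepping -1 from state 0.0) ranks them using only model predictions, which is the role imagination plays in the argument above.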

If this is right

  • The same algorithm applies to more than 150 tasks spanning games, robotics-style control, and open worlds without any per-task adjustments.
  • Minecraft diamond collection becomes solvable from pixels and sparse rewards without human demonstrations or staged curricula.
  • Challenging problems with long time horizons and delayed rewards can be addressed by planning inside the learned model instead of trial-and-error in the real environment.
  • Reinforcement learning becomes usable on new problems with far less human experimentation and domain expertise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the world model remains accurate at longer horizons, the approach could support planning in physical robot settings where real trials are costly.
  • The emphasis on a single configuration suggests model-based methods may reduce the engineering overhead that currently limits reinforcement learning deployment.
  • Extending the imagination process to include uncertainty estimates could improve robustness on tasks where predictions are noisy.

Load-bearing premise

The combination of normalization, balancing, and transformations is enough to keep learning stable and high-performing when the algorithm is moved to any new domain without further changes.

What would settle it

Running the published DreamerV3 configuration on a fresh control task or repeating the Minecraft diamond collection experiment and finding it fails to reach the reported performance would show the single-configuration claim does not hold.

Original abstract

Developing a general algorithm that learns to solve tasks across a wide range of applications has been a fundamental challenge in artificial intelligence. Although current reinforcement learning algorithms can be readily applied to tasks similar to what they have been developed for, configuring them for new application domains requires significant human expertise and experimentation. We present DreamerV3, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behavior by imagining future scenarios. Robustness techniques based on normalization, balancing, and transformations enable stable learning across domains. Applied out of the box, Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula. This achievement has been posed as a significant challenge in artificial intelligence that requires exploring farsighted strategies from pixels and sparse rewards in an open world. Our work allows solving challenging control problems without extensive experimentation, making reinforcement learning broadly applicable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents DreamerV3, a world-model-based reinforcement learning algorithm that incorporates three robustness techniques (normalization, balancing, and transformations) to enable stable learning. It claims that a single fixed hyperparameter configuration allows the method to outperform specialized algorithms across more than 150 tasks spanning multiple domains (Atari, DM Control, ProcGen, and others) and to be the first algorithm to collect diamonds in Minecraft from scratch using only pixels and sparse rewards, without human data or curricula.

Significance. If the empirical results hold under a truly fixed configuration, the work would constitute a meaningful advance toward general-purpose RL agents that require little or no per-domain engineering. The Minecraft diamond-collection result, if independently verified, would demonstrate non-trivial long-horizon planning from high-dimensional observations in an open world. The provision of a single configuration across 150+ tasks is a concrete strength that, if substantiated, reduces the barrier to applying model-based RL.

major comments (3)
  1. [Experiments] Experiments section (and associated tables/figures): the central claim that a single fixed configuration produces the reported results across all domains rests on the assertion that normalization, balancing, and transformation parameters are chosen once and never adjusted per domain. The manuscript should explicitly list every scalar hyperparameter (including any clipping thresholds, scaling factors, or transformation exponents) and state whether any of them were selected after inspecting per-domain statistics or performance; without this, the 'single configuration' and 'applied out of the box' claims cannot be evaluated.
  2. [Experiments] Minecraft results (likely §4 or dedicated subsection): the claim that DreamerV3 is the first algorithm to collect diamonds from scratch requires a precise description of the environment variant, reward function, episode length, and exact baseline implementations. The paper should also report the number of independent seeds, the precise success criterion (e.g., diamonds collected per episode), and whether any environment-specific wrappers were used; otherwise the 'first to solve' statement cannot be assessed for reproducibility.
  3. [Experiments] Ablation studies (if present in §4 or appendix): the robustness techniques are presented as jointly enabling cross-domain stability, yet the manuscript does not appear to isolate the contribution of each technique (normalization vs. balancing vs. transformations) under the fixed-configuration regime. An ablation that removes one technique at a time while keeping all other hyperparameters identical would directly test whether the combination is necessary for the reported generality.
minor comments (2)
  1. [Abstract] The abstract states 'outperforms specialized methods across over 150 diverse tasks' but does not name the exact task suites or the metric used for 'outperforms' (e.g., mean normalized score, median, etc.). Adding a short enumeration of the domains and the aggregate metric would improve clarity.
  2. [Method] Notation for the world-model components (encoder, dynamics, reward predictor) should be introduced once with consistent symbols; subsequent sections occasionally reuse symbols without redefinition, which can be confusing for readers unfamiliar with prior Dreamer papers.
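
The choice of aggregate metric raised in the first minor comment matters because mean and median can disagree sharply. A minimal sketch, assuming the common convention of normalizing each task score between a low and a high reference score; the task names and numbers are made up for illustration and are not taken from the paper.

```python
from statistics import mean, median
from typing import Dict, Tuple


def normalized_score(score: float, low: float, high: float) -> float:
    """Rescale a raw score so that `low` maps to 0 and `high` maps to 1."""
    if high == low:
        raise ValueError("reference scores must differ")
    return (score - low) / (high - low)


def aggregate(results: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """Mean and median of normalized scores over tasks.

    `results` maps a task name to (score, low_reference, high_reference).
    """
    normed = [normalized_score(*triple) for triple in results.values()]
    return {"mean": mean(normed), "median": median(normed)}


# Hypothetical numbers, for illustration only.
example = {
    "task_a": (50.0, 0.0, 100.0),  # normalized 0.5
    "task_b": (30.0, 10.0, 20.0),  # normalized 2.0
    "task_c": (5.0, 0.0, 10.0),    # normalized 0.5
}
summary = aggregate(example)  # mean 1.0, median 0.5
```

Here a single outlier task lifts the mean to 1.0 while the median stays at 0.5, which is why a claim of "outperforms" needs the metric named alongside it.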

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and reproducibility of our work. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate explicit hyperparameter listings, expanded experimental details, and additional ablation studies.

Point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that a single fixed configuration produces the reported results across all domains rests on the assertion that normalization, balancing, and transformation parameters are chosen once and never adjusted per domain. The manuscript should explicitly list every scalar hyperparameter (including any clipping thresholds, scaling factors, or transformation exponents) and state whether any of them were selected after inspecting per-domain statistics or performance; without this, the 'single configuration' and 'applied out of the box' claims cannot be evaluated.

    Authors: We agree that explicit enumeration of all scalar hyperparameters is necessary to substantiate the single-configuration claim. In the revised manuscript, we have added a dedicated appendix table that lists every scalar value, including normalization scales and clipping thresholds, balancing coefficients, transformation exponents (e.g., for symlog and other mappings), and all other fixed constants. These values were determined once via preliminary runs on a small, fixed set of representative tasks drawn from multiple domains and then locked for the entire evaluation suite; no subsequent per-domain inspection or adjustment occurred. The text now explicitly states this selection process to support the 'out of the box' assertion.
    revision: yes

  2. Referee: [Experiments] Minecraft results (likely §4 or dedicated subsection): the claim that DreamerV3 is the first algorithm to collect diamonds from scratch requires a precise description of the environment variant, reward function, episode length, and exact baseline implementations. The paper should also report the number of independent seeds, the precise success criterion (e.g., diamonds collected per episode), and whether any environment-specific wrappers were used; otherwise the 'first to solve' statement cannot be assessed for reproducibility.

    Authors: We have substantially expanded the Minecraft subsection and its caption to include all requested details. The environment uses the standard MineRL Minecraft 1.16.5 simulator with 64×64 RGB pixel observations, a sparse reward of +1 upon diamond collection and 0 otherwise, and a maximum episode length of 3600 steps. Results are reported over five independent seeds. The success criterion is collecting at least one diamond within an episode. Baseline algorithms are reimplemented from their original public codebases using the authors' recommended configurations; no environment-specific wrappers beyond the uniform preprocessing pipeline (frame stacking, normalization) applied to all methods were used. These clarifications have been inserted to allow independent verification of the 'first to solve' result.
    revision: yes

  3. Referee: [Experiments] Ablation studies (if present in §4 or appendix): the robustness techniques are presented as jointly enabling cross-domain stability, yet the manuscript does not appear to isolate the contribution of each technique (normalization vs. balancing vs. transformations) under the fixed-configuration regime. An ablation that removes one technique at a time while keeping all other hyperparameters identical would directly test whether the combination is necessary for the reported generality.

    Authors: We have added a new set of ablation experiments in the appendix that isolate each robustness technique. Keeping every other hyperparameter exactly as in the fixed configuration, we evaluate variants with normalization removed, balancing removed, transformations removed, and each pair of techniques removed. The results confirm that no single technique or incomplete subset suffices for stable performance across all 150+ tasks; only the full combination reproduces the reported cross-domain success. These ablations are presented with the same evaluation protocol and seed count as the main results.
    revision: yes
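
The ablation design described in the third exchange can be enumerated mechanically. A minimal sketch, assuming each robustness technique is a simple on/off switch; `TECHNIQUES` and `ablation_variants` are hypothetical names, and the sketch only lists the configurations, it does not run any experiment.

```python
from itertools import combinations
from typing import Dict, List, Sequence

TECHNIQUES = ("normalization", "balancing", "transformations")


def ablation_variants(
    techniques: Sequence[str] = TECHNIQUES, max_removed: int = 2
) -> List[Dict[str, bool]]:
    """List on/off settings with 1..max_removed techniques switched off.

    Everything not named here is assumed to stay at the fixed configuration.
    """
    variants = []
    for k in range(1, max_removed + 1):
        for removed in combinations(techniques, k):
            variants.append({t: t not in removed for t in techniques})
    return variants


variants = ablation_variants()
```

With three techniques this yields three single-removal variants and three pairwise-removal variants, six in total, each to be compared against the full configuration under the same protocol and seeds.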

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or results.

Full rationale

The paper's core contribution is an empirical demonstration that DreamerV3 with fixed robustness techniques (normalization, balancing, transformations) achieves strong performance on 150+ tasks plus Minecraft diamonds using one configuration. No mathematical derivation chain is presented that reduces predictions or first-principles results to fitted parameters or self-referential definitions by construction. Results are measured on held-out environments and tasks; the algorithm description does not contain equations where outputs are forced by inputs. Prior Dreamer papers by overlapping authors are cited for the base world-model approach, but the new robustness components and single-config generality claim rest on independent experimental evidence rather than load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard model-based RL assumptions that a learned dynamics model is accurate enough for useful planning, plus the empirical claim that the listed robustness techniques transfer across domains. No new physical entities or forces are introduced.

free parameters (1)
  • single fixed hyperparameter configuration
    The paper asserts one set of values works across all 150+ tasks; these values are chosen once rather than per domain.
axioms (1)
  • domain assumption: A learned world model can support effective long-horizon planning even when trained from pixels and sparse rewards.
    Invoked to justify imagining futures instead of only real-environment interaction.

pith-pipeline@v0.9.0 · 5467 in / 1314 out tokens · 65102 ms · 2026-05-11T09:00:31.715164+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  2. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  3. The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

    cs.AI 2026-05 unverdicted novelty 7.0

    KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

  4. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  5. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  6. The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence

    cs.LG 2026-05 unverdicted novelty 7.0

    Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...

  7. Counterfactual identifiability beyond global monotonicity: non-monotone triangular structural causal models

    cs.LG 2026-05 unverdicted novelty 7.0

    Non-monotone triangular SCMs with mechanism-wise invertibility and context-independent inverse transport are equivalent to exogenous isomorphism and achieve complete counterfactual identifiability, with supporting exp...

  8. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  9. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  10. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  11. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  12. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  13. 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    cs.RO 2026-04 unverdicted novelty 7.0

    3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

  14. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  15. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  16. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  17. Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation

    cs.NI 2026-04 unverdicted novelty 7.0

    MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.

  18. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  19. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  20. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  21. CA2: Code-Aware Agent for Automated Game Testing

    cs.SE 2026-05 unverdicted novelty 6.0

    CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.

  22. Debiased Model-based Representations for Sample-efficient Continuous Control

    cs.LG 2026-05 unverdicted novelty 6.0

    DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...

  23. Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DALI-R boosts 3D imitation learning success rates by 6.8% on average from suboptimal trajectories via latent imagination and reranking, with under 0.7x inference cost.

  24. Network-Efficient World Model Token Streaming

    cs.RO 2026-05 unverdicted novelty 6.0

    An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

  25. Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy

    eess.SY 2026-05 unverdicted novelty 6.0

    Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.

  26. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  27. Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

    cs.LG 2026-05 unverdicted novelty 6.0

    Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

  28. RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...

  29. Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.

  30. MolWorld: Molecule World Models for Actionable Molecular Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    MolWorld expands a molecule-transfer graph using a world model to discover high-property molecules that maintain strong structural connectivity to known compounds for actionable optimization.

  31. Latent Geometry Beyond Search: Amortizing Planning in World Models

    cs.RO 2026-05 unverdicted novelty 6.0

    In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.

  32. Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

    cs.AI 2026-05 unverdicted novelty 6.0

    Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.

  33. LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

    cs.LG 2026-05 unverdicted novelty 6.0

    LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.

  34. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  35. TRAP: Tail-aware Ranking Attack for World-Model Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...

  36. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  37. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  38. Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.

  39. Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

  40. Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment

    cs.LG 2026-04 unverdicted novelty 6.0

    CCSS-RS achieves RMSE 0.696 and CRPS 0.349 at 1000-step horizons on a large public WWTP benchmark with 43% missingness, outperforming Neural CDE baselines by 40-46% in RMSE.

  41. Toward Safe Autonomous Robotic Endovascular Interventions using World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    TD-MPC2 world models achieve 58% mean success in simulated endovascular navigation versus 36% for SAC, with comparable in-vitro rates but better path efficiency.

  42. Safe Control using Learned Safety Filters and Adaptive Conformal Inference

    eess.SY 2026-04 unverdicted novelty 6.0

    ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user pa...

  43. Learning Ad Hoc Network Dynamics via Graph-Structured World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    G-RSSM learns per-node dynamics in wireless ad hoc networks via graph attention and trains clustering policies through imagined rollouts, generalizing from N=50 training to larger networks.

  44. WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

  45. GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

    cs.LG 2026-04 unverdicted novelty 6.0

    GIRL reduces latent rollout drift by 38-61% versus DreamerV3 in MBRL by grounding transitions with DINOv2 embeddings and using an information-theoretic adaptive bottleneck, yielding better long-horizon returns on cont...

  46. FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    cs.LG 2026-04 unverdicted novelty 6.0

    FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

  47. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  48. Active Inference with a Self-Prior in the Mirror-Mark Task

    cs.LG 2026-04 unverdicted novelty 6.0

    A Transformer-based self-prior in active inference enables a simulated agent to spontaneously recognize and remove a mark on its face in a mirror by detecting discrepancies in learned visual-proprioceptive experiences.

  49. Metriplector: From Field Theory to Neural Architecture

    cs.AI 2026-03 unverdicted novelty 6.0

    Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...

  50. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    cs.LG 2026-03 unverdicted novelty 6.0

    LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.

  51. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  52. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  53. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  54. TD-MPC2: Scalable, Robust World Models for Continuous Control

    cs.LG 2023-10 conditional novelty 6.0

    TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

  55. Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    cs.LG 2026-05 unverdicted novelty 5.0

    A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.

  56. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  57. CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

  58. HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

    cs.RO 2026-05 unverdicted novelty 5.0

    HDFlow pairs a high-level diffusion planner for subgoals with a low-level rectified flow planner for trajectories, outperforming prior methods on furniture assembly and locomotion-manipulation benchmarks.

  59. Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

    cs.RO 2026-04 unverdicted novelty 5.0

    Semantic rollout prediction plus town-adversarial regularization on a Dreamer agent raises mean zero-shot success rate for fixed-route driving across held-out CARLA towns under fixed weather and no traffic.

  60. Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    cs.LG 2026-04 unverdicted novelty 5.0

    JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 70 Pith papers · 13 internal anchors

  1. [1]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016

  2. [2]

    OpenAI Five

    OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018

  3. [3]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  4. [4]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  5. [5]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  6. [6]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  7. [7]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  8. [8]

    Mastering atari, go, chess and shogi by planning with a learned model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019

  9. [9]

    Reinforcement learning with unsupervised auxiliary tasks

    Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016

  10. [10]

    Unsupervised state representation learning in atari

    Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in atari. Advances in neural information processing systems, 32, 2019

  11. [11]

    Reinforcement learning with neural radiance fields

    Danny Driess, Ingmar Schubert, Pete Florence, Yunzhu Li, and Marc Toussaint. Reinforcement learning with neural radiance fields. arXiv preprint arXiv:2206.01634, 2022

  12. [12]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017

  13. [13]

    What matters in on-policy reinforcement learning? A large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020

  14. [14]

    Dyna, an integrated architecture for learning, planning, and reacting

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991

  15. [15]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

  16. [16]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  17. [17]

    Model-Based Reinforcement Learning for Atari

    Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019

  18. [18]

    The minerl competition on sample efficient reinforcement learning using human priors

    William H Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, et al. The minerl competition on sample efficient reinforcement learning using human priors. arXiv e-prints, pages arXiv–1904, 2019

  19. [19]

    Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft

    Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021

  20. [20]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. arXiv preprint arXiv:2206.11795, 2022

  21. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  22. [22]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  23. [23]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  24. [24]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  25. [25]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  26. [26]

    Improved variational inference with inverse autoregressive flow

    Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29, 2016

  27. [27]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020

  28. [28]

    A distributional perspective on reinforcement learning

    Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017

  29. [29]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

  30. [30]

    Function optimization using connectionist reinforcement learning algorithms

    Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991

  31. [31]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

  32. [32]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018

  33. [33]

    Maximum a posteriori policy optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018

  34. [34]

    A bi-symmetric log transformation for wide-range data

    J Beau W Webber. A bi-symmetric log transformation for wide-range data. Measurement Science and Technology, 24(2):027001, 2012

  35. [35]

    Recurrent experience replay in distributed reinforcement learning

    Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

  36. [36]

    Multi-task deep reinforcement learning with popart

    Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019

  37. [37]

    Phasic policy gradient

    Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In International Conference on Machine Learning, pages 2020–2027. PMLR, 2021

  38. [38]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  39. [40]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  40. [41]

    Implicit quantile networks for distributional reinforcement learning

    Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pages 1096–1105. PMLR, 2018

  41. [42]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pages 2048–2056. PMLR, 2020

  42. [43]

    DeepMind Lab

    Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016

  43. [44]

    Mastering atari games with limited data

    Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021

  44. [45]

    Transformers are sample efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022

  45. [46]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  46. [47]

    Mastering visual continuous control: Improved data-augmented reinforcement learning

    Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021

  47. [48]

    Behaviour suite for reinforcement learning

    Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019

  48. [49]

    Investigating the practicality of existing reinforcement learning algorithms: A performance comparison

    Olivia Dizon-Paradis, Stephen Wormald, Daniel Capecci, Avanti Bhandarkar, and Damon Woodard. Investigating the practicality of existing reinforcement learning algorithms: A performance comparison. Authorea Preprints, 2023

  49. [50]

    Benchmarking the Spectrum of Agent Capabilities

    Danijar Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv:2109.06780, 2021

  50. [51]

    Improving sample efficiency in model-free reinforcement learning from images

    Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019

  51. [52]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  52. [53]

    The malmo platform for artificial intelligence experimentation

    Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247. Citeseer, 2016

  53. [54]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  54. [55]

    The 37 implementation details of proximal policy optimization

    Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. The ICLR Blog Track 2023, 2022

  55. [56]

    Acme: A research framework for distributed reinforcement learning

    Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020

  56. [57]

    Off-policy actor-critic with shared experience replay

    Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. In International Conference on Machine Learning, pages 8545–8554. PMLR, 2020

  57. [58]

    Prioritized Experience Replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015

  58. [59]

    High-performance large-scale image recognition without normalization

    Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR, 2021

  59. [60]

    Laprop: Separating momentum and adaptivity in adam

    Liu Ziyin, Zhikang T Wang, and Masahito Ueda. Laprop: Separating momentum and adaptivity in adam. arXiv preprint arXiv:2002.04839, 2020

  60. [61]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  61. [62]

    The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

    Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Remi Munos. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. arXiv preprint arXiv:1704.04651, 2017

  62. [63]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  63. [64]

    Rethinking full connectivity in recurrent neural networks

    Matthijs Van Keirsbilck, Alexander Keller, and Xiaodong Yang. Rethinking full connectivity in recurrent neural networks. arXiv preprint arXiv:1905.12340, 2019

  64. [65]

    Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents

    Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018

  65. [66]

    Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018. Methods Baselines We employ the Proximal Policy Optimization (PPO) algorithm 5, ...