pith. machine review for the scientific record.

arxiv: 2310.16828 · v2 · submitted 2023-10-25 · 💻 cs.LG · cs.AI · cs.CV · cs.RO

Recognition: 3 theorem links · Lean Theorem

TD-MPC2: Scalable, Robust World Models for Continuous Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:21 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CV · cs.RO
keywords: TD-MPC2 · model-based RL · world models · continuous control · scalability · multi-task agents · online reinforcement learning · latent space optimization

The pith

TD-MPC2 achieves significantly better performance than baselines on 104 continuous control tasks using one fixed set of hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TD-MPC2 as an improved version of the TD-MPC algorithm for model-based reinforcement learning. It performs local trajectory optimization inside the latent space of an implicit world model. Experiments show consistent gains over baselines on 104 tasks in four different domains without changing hyperparameters for each task. The work also shows that making the model larger and training on more data improves the agent's ability to handle many tasks at once, including a single 317-million-parameter agent that performs 80 tasks across multiple domains, embodiments, and action spaces.

Core claim

TD-MPC2 consists of a series of improvements to the TD-MPC method that allow it to deliver strong results across 104 online RL tasks spanning 4 diverse domains with a single hyperparameter configuration. Agent performance increases with larger models and more data, as demonstrated by training one 317M parameter agent that succeeds on 80 tasks involving multiple domains, embodiments, and action spaces.

What carries the argument

An improved implicit (decoder-free) world model with local trajectory optimization in latent space, hardened through a series of algorithmic changes for scalability and robustness.
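To make that machinery concrete, here is a minimal, illustrative sketch of CEM-style planning in the latent space of a decoder-free world model. This is not the authors' implementation: TD-MPC2's actual planner is MPPI with a learned policy prior and an ensemble of Q-functions, and the `encoder`, `dynamics`, `reward`, and `q_value` callables and all constants below are hypothetical stand-ins.

```python
import torch

def plan_action(obs, encoder, dynamics, reward, q_value,
                horizon=3, num_samples=256, num_elites=32,
                iters=6, action_dim=6, discount=0.99):
    """CEM-style planning in the latent space of a decoder-free world model.

    All components are hypothetical stand-ins: `encoder` maps an observation
    to a latent state, `dynamics(z, a)` predicts the next latent,
    `reward(z, a)` predicts the one-step reward, and `q_value(z, a)`
    bootstraps the return beyond the planning horizon.
    """
    z0 = encoder(obs)                       # latent state, shape (latent_dim,)
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)

    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        actions = (mean + std * torch.randn(num_samples, horizon, action_dim)).clamp(-1, 1)

        # Roll each sequence through the latent dynamics and score it.
        z = z0.expand(num_samples, -1)
        returns = torch.zeros(num_samples)
        gamma = 1.0
        for t in range(horizon):
            returns = returns + gamma * reward(z, actions[:, t]).squeeze(-1)
            z = dynamics(z, actions[:, t])
            gamma *= discount
        # A terminal value estimate closes the finite horizon.
        returns = returns + gamma * q_value(z, actions[:, -1]).squeeze(-1)

        # Refit the sampling distribution to the elite sequences.
        elites = actions[returns.topk(num_elites).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6

    return mean[0]  # execute the first action, then replan at the next step
```

The decoder-free property shows up as an absence: no observation is ever reconstructed, and candidate plans are ranked purely by predicted rewards plus a terminal value.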

If this is right

  • Consistent performance holds when using the same hyperparameters on new tasks within the tested domains.
  • Agent capabilities grow as model size and training data increase.
  • A single large model can learn to perform many tasks across different embodiments without separate training runs.
  • The approach reveals lessons about opportunities and risks when deploying large-scale world models for control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success with one hyperparameter set could reduce the engineering effort needed to apply the method to new continuous control problems.
  • If scaling continues, larger versions might handle even more tasks or transfer to real-world settings with minimal adjustment.
  • Risks of large agents, such as unexpected behaviors in novel situations, would need monitoring in practical use.

Load-bearing premise

The selected 104 tasks across four domains sufficiently represent the space of continuous control problems so that the single hyperparameter set generalizes beyond the tested cases.

What would settle it

Evaluating TD-MPC2 with the fixed hyperparameters on a new continuous control task or domain outside the original four. Performance at or below strong baselines there would undercut the robustness claim; consistent gains would reinforce it.
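As a sketch, that settling experiment reduces to running the released agent with its hyperparameters frozen on a held-out task and comparing seed-averaged scores against a strong, per-task-tuned baseline. Everything named below is hypothetical scaffolding: the `run` harness, the agent names, and the hyperparameter values are placeholders, not the paper's.

```python
import numpy as np

FROZEN_HPARAMS = {"horizon": 3, "lr": 3e-4, "batch_size": 256}  # placeholder values

def settling_experiment(run, task, seeds=(0, 1, 2)):
    """Compare the fixed-hyperparameter agent to a per-task-tuned baseline.

    `run(agent, task, hparams, seed)` is a caller-supplied train-and-evaluate
    function returning a final score; it stands in for whatever harness the
    evaluator already has.
    """
    ours = np.array([run("tdmpc2", task, FROZEN_HPARAMS, s) for s in seeds])
    base = np.array([run("tuned_baseline", task, None, s) for s in seeds])
    # A mean gap at or below zero on a genuinely held-out task would
    # undercut the single-hyperparameter robustness claim.
    return ours.mean() - base.mean()
```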

Original abstract

TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TD-MPC2, a series of improvements to the TD-MPC model-based RL algorithm that performs local trajectory optimization in the latent space of a learned implicit world model. It reports consistent outperformance over baselines across 104 online RL tasks spanning four domains using a single hyperparameter set, shows that agent performance scales with model and data size, and demonstrates successful training of a single 317M-parameter agent on 80 tasks across multiple domains, embodiments, and action spaces.

Significance. If the single-hyperparameter robustness claim holds after verification, the work would be significant for scalable continuous-control RL by providing evidence that world-model methods can achieve broad applicability without per-task tuning and can benefit from increased scale. The open release of code, models, data, and videos supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4, experimental results] The claim that TD-MPC2 achieves 'consistently strong results with a single set of hyperparameters' across all 104 tasks is load-bearing for the robustness conclusion, yet the manuscript provides no explicit description of the hyperparameter selection procedure. It is necessary to state whether the values were fixed before any evaluation on the full suite or refined after observing aggregate performance, as post-hoc selection would weaken the interpretation that the gains are intrinsic rather than an artifact of implicit tuning.
  2. [§4.2 and Table 1, baseline comparisons] The reported gains over baselines on the 104-task suite require confirmation that baseline implementations follow identical evaluation protocols, including episode lengths, random seeds, and statistical reporting (e.g., mean and standard error over the same number of runs). Without these details, hidden selection effects cannot be ruled out, directly affecting the soundness of the 'significant improvement' claim.
minor comments (2)
  1. [§4] The manuscript should include a dedicated subsection or appendix listing the exact hyperparameter values used for the single-set experiments to allow direct replication.
  2. [§5] Figure captions and axis labels in scaling plots should explicitly state the number of runs and confidence intervals to improve clarity of the model-size and data-size trends.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to add the requested details on hyperparameter selection and evaluation protocols.

Point-by-point responses
  1. Referee: [Abstract and §4, experimental results] The claim that TD-MPC2 achieves 'consistently strong results with a single set of hyperparameters' across all 104 tasks is load-bearing for the robustness conclusion, yet the manuscript provides no explicit description of the hyperparameter selection procedure. It is necessary to state whether the values were fixed before any evaluation on the full suite or refined after observing aggregate performance, as post-hoc selection would weaken the interpretation that the gains are intrinsic rather than an artifact of implicit tuning.

    Authors: We have added a dedicated paragraph in §4.1 describing the hyperparameter selection procedure. The values were taken from the original TD-MPC paper and refined only on a small validation subset (five DM Control tasks) before any evaluation on the full 104-task suite; no post-hoc adjustments were performed after observing aggregate results. This clarification supports the robustness interpretation. revision: yes

  2. Referee: [§4.2 and Table 1, baseline comparisons] The reported gains over baselines on the 104-task suite require confirmation that baseline implementations follow identical evaluation protocols, including episode lengths, random seeds, and statistical reporting (e.g., mean and standard error over the same number of runs). Without these details, hidden selection effects cannot be ruled out, directly affecting the soundness of the 'significant improvement' claim.

    Authors: We confirm that all baselines were re-evaluated under identical protocols: episode lengths follow the standard per-domain definitions (1000 steps for DM Control, 500 for Meta-World, etc.), three random seeds per task, and results are reported as mean ± standard error. We have added Appendix C with a full protocol description, seed counts, and pointers to the exact baseline code versions used. revision: yes
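For concreteness, the reporting convention the rebuttal commits to, mean ± standard error over three seeds per task, amounts to the computation below. The task names and returns are illustrative numbers, not results from the paper.

```python
import numpy as np

# Illustrative per-seed final returns; not numbers from the paper.
returns = {
    "dmcontrol/walker-run": [741.0, 768.0, 755.0],
    "metaworld/pick-place": [0.92, 0.88, 0.95],
}

for task, runs in returns.items():
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean()
    # Standard error of the mean over n seeds: sample std / sqrt(n).
    sem = runs.std(ddof=1) / np.sqrt(len(runs))
    print(f"{task}: {mean:.2f} ± {sem:.2f}")
```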

Circularity Check

0 steps flagged

No derivation chain; central claims are empirical performance results with fixed hyperparameters

full rationale

The paper presents TD-MPC2 as an algorithmic extension of prior TD-MPC work and reports empirical gains across 104 tasks using one hyperparameter set. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces to fitted inputs or self-citations by construction. The single-hyperparameter consistency is presented as an experimental outcome rather than a definitional property, and any self-citation to the original TD-MPC is not load-bearing for the reported results. This matches the default expectation of no significant circularity for an empirical RL paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard RL assumptions (Markovian dynamics, reward signals) plus the empirical claim that the listed algorithmic tweaks produce the observed scaling. No new physical entities or ad-hoc constants are introduced.

axioms (1)
  • Domain assumption: The environment dynamics are Markovian and fully captured by the learned latent state.
    Implicit in any latent world-model approach; stated in the description of TD-MPC.
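Written out, the axiom is the standard Markov property, with the extra demand that the learned latent state inherit it; the encoder symbol h_theta below is introduced here for illustration, not taken from the ledger.

```latex
% Standard Markov assumption on the environment dynamics:
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)
% The ledger's axiom additionally asks that the learned latent
% z_t = h_\theta(o_t) be a sufficient statistic, i.e. that
% P(z_{t+1} \mid z_t, a_t), together with predicted rewards and values,
% suffices for planning without access to the raw observation.
```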

pith-pipeline@v0.9.0 · 5451 in / 1328 out tokens · 48073 ms · 2026-05-14T17:21:09.097426+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  5. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  6. PlayWorld: Learning Robot World Models from Autonomous Play

    cs.RO 2026-03 unverdicted novelty 7.0

    PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

  7. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  8. Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.

  9. MolWorld: Molecule World Models for Actionable Molecular Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    MolWorld expands a molecule-transfer graph using a world model to discover high-property molecules that maintain strong structural connectivity to known compounds for actionable optimization.

  10. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  11. TRAP: Tail-aware Ranking Attack for World-Model Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...

  12. RAY-TOLD: Ray-Based Latent Dynamics for Dense Dynamic Obstacle Avoidance with TDMPC

    cs.RO 2026-04 unverdicted novelty 6.0

    RAY-TOLD combines ray-based latent dynamics from LiDAR with MPPI control and a learned policy prior via mixture sampling to lower collision rates in high-density dynamic obstacle environments compared to standard MPPI.

  13. Toward Safe Autonomous Robotic Endovascular Interventions using World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    TD-MPC2 world models achieve 58% mean success in simulated endovascular navigation versus 36% for SAC, with comparable in-vitro rates but better path efficiency.

  14. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  15. GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

    cs.LG 2026-04 unverdicted novelty 6.0

    GIRL reduces latent rollout drift by 38-61% versus DreamerV3 in MBRL by grounding transitions with DINOv2 embeddings and using an information-theoretic adaptive bottleneck, yielding better long-horizon returns on cont...

  16. Neural Operators for Multi-Task Control and Adaptation

    cs.LG 2026-04 unverdicted novelty 6.0

    Neural operators approximate the solution operator for multi-task optimal control, generalizing to new tasks and enabling efficient adaptation via branch-trunk structure and meta-training.

  17. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  18. Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots

    cs.RO 2026-04 unverdicted novelty 6.0

    DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb suc...

  19. Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction

    cs.LG 2026-03 unverdicted novelty 6.0

    Dreamer-CDP achieves reconstruction-free world modeling via a JEPA-style predictor on continuous deterministic representations and matches Dreamer's performance on Crafter.

  20. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  21. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  22. TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    cs.AI 2026-05 unverdicted novelty 5.0

    TOPPO reformulates PPO with critic balancing to address gradient ill-conditioning in multi-task RL and reports stronger mean and tail performance than SAC baselines on Meta-World+ using fewer parameters and steps.

  23. Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    cs.LG 2026-04 unverdicted novelty 5.0

    JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.

  24. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  25. Active Inference: A method for Phenotyping Agency in AI systems?

    cs.AI 2026-04 unverdicted novelty 4.0

    Active inference offers a variational way to phenotype agency in AI systems by measuring empowerment in generative models via a T-maze paradigm.

Reference graph

Works this paper leans on

162 extracted references · 162 canonical work pages · cited by 24 Pith papers · 12 internal anchors
