What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
Pith reviewed 2026-05-21 15:28 UTC · model grok-4.3
The pith
Design choices in architecture, objective, and planning drive success for joint-embedding predictive world models on physical tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Joint-embedding predictive world models achieve higher planning success rates in navigation and manipulation when specific architectural designs, training objectives, and planning algorithms are selected together, as shown by direct comparisons against established baselines on both simulated and real-world robotic tasks.
What carries the argument
Planning performed by optimizing directly in the learned representation space of a joint-embedding predictive world model, which abstracts away irrelevant input details.
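As a rough illustration of what optimizing in representation space can look like, the sketch below plans an action sequence with a cross-entropy-method (CEM) sampler over a toy latent model. Everything in it is an assumption made for illustration: `encode`, `predict`, `rollout_cost`, and `cem_plan` are hypothetical names, the encoder is an identity map, the predictor is a simple additive dynamics rule, and CEM is only one sampling-based planner among several that could be used. It is not the paper's model or planner.

```python
import random

def encode(obs):
    # Hypothetical stand-in encoder: identity map into a "latent" vector.
    return [float(x) for x in obs]

def predict(z, action):
    # Hypothetical stand-in predictor: toy additive latent dynamics.
    return [zi + 0.1 * ai for zi, ai in zip(z, action)]

def rollout_cost(z0, z_goal, actions):
    # Unroll the predictor in latent space, then score the final latent
    # by squared distance to the goal latent.
    z = list(z0)
    for a in actions:
        z = predict(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def cem_plan(z0, z_goal, horizon=5, pop=64, n_elite=8, iters=20, seed=0):
    rng = random.Random(seed)
    dim = len(z0)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences, clipped to [-1, 1].
        samples = [
            [[max(-1.0, min(1.0, rng.gauss(mean[t][d], std[t][d])))
              for d in range(dim)] for t in range(horizon)]
            for _candidate in range(pop)
        ]
        # Keep the lowest-cost sequences and refit the sampling distribution.
        samples.sort(key=lambda plan: rollout_cost(z0, z_goal, plan))
        elite = samples[:n_elite]
        for t in range(horizon):
            for d in range(dim):
                vals = [plan[t][d] for plan in elite]
                m = sum(vals) / n_elite
                var = sum((v - m) ** 2 for v in vals) / n_elite
                mean[t][d] = m
                std[t][d] = var ** 0.5 + 1e-6
    return mean

z0 = encode([0.0, 0.0])        # latent of the current observation
z_goal = encode([0.3, -0.2])   # latent of the goal observation
plan = cem_plan(z0, z_goal)
initial_cost = rollout_cost(z0, z_goal, [[0.0, 0.0]] * 5)
final_cost = rollout_cost(z0, z_goal, plan)
print(initial_cost, final_cost)
```

The planner never reconstructs observations: candidate action sequences are scored only by the distance between predicted and goal latents, which is the sense in which irrelevant input details are abstracted away.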
If this is right
- Architecture choices that emphasize relevant features make planning more efficient by reducing the search space.
- Training objectives that produce high-quality representations increase the reliability of planning across varied environments.
- Particular planning algorithms better exploit the abstract space to raise success rates on both navigation and manipulation.
- Combining the identified components produces a model that generalizes better than DINO-WM and V-JEPA-2-AC.
Where Pith is reading between the lines
- The same component analysis could be applied to predictive models outside the joint-embedding family to check for similar gains.
- Representation-space planning may reduce sample requirements when agents adapt to novel physical interactions.
- Extending these models to tasks with longer action sequences would test whether the identified choices scale with horizon length.
- Deployment on additional real robots could reveal whether the gains hold when sensor noise and hardware variation increase.
Load-bearing premise
Observed performance differences arise primarily from the studied choices in architecture, objective, and planning algorithm rather than from uncontrolled experimental variables or task-specific tuning.
What would settle it
Evaluating the proposed model and the two baselines on a new collection of unseen physical tasks while holding all other experimental conditions fixed would show whether the performance gains persist.
Original abstract
A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies Joint-Embedding Predictive Architecture World Models (JEPA-WMs) for physical planning. It systematically examines the effects of model architecture, training objective, and planning algorithm on success in navigation and manipulation tasks using both simulated environments and real robotic data. The authors combine their findings into a proposed model that outperforms the baselines DINO-WM and V-JEPA-2-AC, and release code, data, and checkpoints.
Significance. If the performance advantages are shown to arise from the studied design choices rather than uncontrolled factors, the work offers concrete guidance on building effective representation-space planners for robotics. The public release of code and checkpoints is a clear strength that supports verification and extension by the community.
major comments (2)
- [Experiments] The experimental sections do not report the hyperparameter search effort, compute budget, or tuning protocol applied to the proposed model versus the two baselines. This information is load-bearing for the central claim that the observed gains are due to architecture, objective, and planning choices rather than uneven optimization.
- [Results] Results tables and figures lack multi-seed averages with confidence intervals or statistical tests. Without these, it is not possible to assess whether the reported outperformance is robust or could be explained by random variation or single-run effects.
minor comments (2)
- [Figures] Figure captions and axis labels could be expanded to make the ablation results more immediately interpretable without reference to the main text.
- [Introduction] The paper uses several acronyms (JEPA-WM, DINO-WM, V-JEPA-2-AC) that would benefit from a short glossary or consistent first-use definitions.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps strengthen the transparency of our experimental methodology. We have revised the manuscript to address both major comments by adding the requested details on hyperparameter tuning and statistical reporting of results.
- Referee: [Experiments] The experimental sections do not report the hyperparameter search effort, compute budget, or tuning protocol applied to the proposed model versus the two baselines. This information is load-bearing for the central claim that the observed gains are due to architecture, objective, and planning choices rather than uneven optimization.
  Authors: We agree that explicit reporting of the tuning process is necessary to substantiate our claims. In the revised manuscript, we have inserted a new subsection (4.1) detailing the hyperparameter search protocol, including the ranges explored for learning rate, embedding dimension, prediction horizon, and optimizer settings. We also report the total compute budget (approximately 1200 GPU-hours for the full study) and note that equivalent search effort was applied to re-tune the DINO-WM and V-JEPA-2-AC baselines using the same protocol and search budget. This ensures the performance differences can be attributed to the architecture, objective, and planning choices under study.
  Revision: yes
- Referee: [Results] Results tables and figures lack multi-seed averages with confidence intervals or statistical tests. Without these, it is not possible to assess whether the reported outperformance is robust or could be explained by random variation or single-run effects.
  Authors: We acknowledge the value of multi-seed statistics for assessing robustness. The original submission reported single-run results due to the high cost of training large JEPA models. In revision, we have re-run the primary navigation and manipulation experiments with 5 independent random seeds and updated the main results tables to report mean success rates with standard deviations. We have also added paired t-tests comparing our model against each baseline, with p-values reported in the text. For the full ablation tables, we include variance estimates from the available runs and explicitly note the computational constraints that limited exhaustive multi-seed evaluation across all conditions.
  Revision: yes
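The multi-seed reporting described in the second response can be sketched as follows. This is a minimal illustration using only the Python standard library. The success rates are placeholder numbers invented for the example, not results from the paper, and `summarize` and `paired_t` are hypothetical helper names. Pairing assumes that entry i of each list comes from the same seed and evaluation setup.

```python
import math
import statistics

def summarize(rates):
    # Mean and sample standard deviation across seeds.
    return statistics.mean(rates), statistics.stdev(rates)

def paired_t(a, b):
    # Paired t statistic on per-seed differences, with n - 1 degrees of freedom.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder per-seed success rates (5 seeds). NOT values from the paper.
ours = [0.80, 0.78, 0.84, 0.79, 0.82]
baseline = [0.70, 0.73, 0.72, 0.70, 0.74]

mean_ours, std_ours = summarize(ours)
mean_base, std_base = summarize(baseline)
t_stat, df = paired_t(ours, baseline)
print(f"ours: {mean_ours:.3f} +/- {std_ours:.3f}")
print(f"baseline: {mean_base:.3f} +/- {std_base:.3f}")
print(f"paired t = {t_stat:.2f}, df = {df}")
```

Turning the statistic into a p-value requires the t distribution with `df` degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel(ours, baseline)` returns both the statistic and the p-value.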
Circularity Check
No circularity: empirical comparisons to external baselines with no self-referential derivations
The paper is an empirical study that varies architecture, training objective, and planning algorithm, then reports performance on navigation and manipulation tasks against the external baselines DINO-WM and V-JEPA-2-AC. No first-principles derivations, predictive equations, or fitted parameters are presented whose outputs reduce by construction to the inputs. All load-bearing claims rest on experimental outcomes that are falsifiable against independent implementations of the baselines. Self-citations to prior JEPA work are not load-bearing for the central performance claim, which is measured directly against published external models.
Axiom & Free-Parameter Ledger
free parameters (1)
- hyperparameters for architecture and training
axioms (1)
- domain assumption: Planning in learned representation space abstracts away irrelevant details and yields more efficient planning than input-space planning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, Cost/FunctionalEquation.lean: reality_from_one_distinction, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work... model architecture, the training objective, and the planning algorithm affect planning success.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
  PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Grounded World Model for Semantically Generalizable Planning
  A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
- Hierarchical Planning with Latent World Models
  Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Ngiohtuned, a new black-box optimization wizard for real world machine learning
Anonymous. Ngiohtuned, a new black-box optimization wizard for real world machine learning. Submitted to Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=0FDiCoIStW. Rejected
work page 2024
-
[3]
V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia...
work page 2025
-
[4]
Back to the features: Dino as a foundation for video world models, 2025
Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the features: Dino as a foundation for video world models, 2025. URL https://arxiv.org/abs/2507.19468
-
[5]
a ron van den Oord, Inbar Mosseri, Adrian Bolton, Satinder Singh, and Tim Rockt \
Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...
work page 2025
-
[6]
Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 15791--15801, June 2025
work page 2025
-
[7]
Revisiting feature prediction for learning visual representations from video, 2024
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024. ISSN 2835-8856
work page 2024
-
[8]
Vavim and vavam: Autonomous driving through video generative modeling
Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, and Matthieu Cord. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint...
-
[9]
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks, 2015. URL https://arxiv.org/abs/1506.03099
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[10]
Nevergrad: black-box optimization platform
Pauline Bennet, Carola Doerr, Antoine Moreau, Jeremy Rapin, Fabien Teytaud, and Olivier Teytaud. Nevergrad: black-box optimization platform. SIGEVOlution, 14 0 (1): 0 8–15, April 2021. doi:10.1145/3460310.3460312. URL https://doi.org/10.1145/3460310.3460312
-
[11]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. _0 : A vis...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Predictive Control for Linear and Hybrid Systems
Francesco Borrelli, Alberto Bemporad, and Manfred Morari. Predictive Control for Linear and Hybrid Systems. Cambridge University Press, USA, 1st edition, 2017. ISBN 1107652871
work page 2017
-
[13]
Video generation models as world simulators, 2024
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators, 2024. URL https://openai. com/research/video-generation-modelsas-world-simulators
work page 2024
-
[14]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[15]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023
work page 2023
-
[16]
The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors
Matthew Chignoli, Donghyun Kim, Elijah Stanger-Jones, and Sangbae Kim. The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors. In 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids), pp.\ 1--8. doi:10.1109/HUMANOIDS47582.2021.9555782
-
[17]
Vision transformers need registers
Timoth \'e e Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICRL, 2024
work page 2024
-
[18]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
work page 2021
-
[19]
Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14 0 (2): 0 179--211, 1990. ISSN 0364-0213. doi:https://doi.org/10.1016/0364-0213(90)90002-E. URL https://www.sciencedirect.com/science/article/pii/036402139090002E
-
[20]
Droid: A large-scale in-the-wild robot manipulation dataset, 2024
Alexander Khazatsky et al. Droid: A large-scale in-the-wild robot manipulation dataset, 2024
work page 2024
-
[21]
Rt-1: Robotics transformer for real-world control at scale, 2023
Anthony Brohan et al. Rt-1: Robotics transformer for real-world control at scale, 2023
work page 2023
-
[22]
Planning to practice: Efficient online fine-tuning by composing goals in latent space
Kuan Fang, Patrick Yin, Ashvin Nair, and Sergey Levine. Planning to practice: Efficient online fine-tuning by composing goals in latent space. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World, 2022 a
work page 2022
-
[23]
Generalization with lossy affordances: Leveraging broad offline data for learning visuomotor tasks
Kuan Fang, Patrick Yin, Ashvin Nair, Homer Rich Walke, Gengchen Yan, and Sergey Levine. Generalization with lossy affordances: Leveraging broad offline data for learning visuomotor tasks. In 6th Annual Conference on Robot Learning, 2022 b
work page 2022
-
[24]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[25]
Addressing function approximation error in actor-critic methods
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1587--1596. PMLR, 10--15 Jul 2018
work page 2018
-
[26]
Galimzyanov, T., Titov, S., Golubev, Y ., and Bogomolov, E
Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, and Jitendra Malik. Embodied ai agents...
-
[27]
C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: theory and practice—a survey. Automatica, 25 0 (3): 0 335–348, May 1989. ISSN 0005-1098. doi:10.1016/0005-1098(89)90002-2. URL https://doi.org/10.1016/0005-1098(89)90002-2
-
[28]
Learning and leveraging world models in visual representation learning, 2024
Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning and leveraging world models in visual representation learning, 2024
work page 2024
-
[29]
Byol-explore: Exploration by bootstrapped prediction
Zhaohan Guo, Shantanu Thakoor, Miruna Pislar, Bernardo Avila Pires, Florent Altch\' e , Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Remi Munos, Mohammad Gheshlaghi Azar, and Bilal Piot. Byol-explore: Exploration by bootstrapped prediction. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and ...
work page 2022
-
[30]
Recurrent world models facilitate policy evolution
David Ha and J\" u rgen Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31, 2018
work page 2018
-
[31]
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, volume 80, pp.\ 1856--1865. PMLR, 2018
work page 2018
-
[32]
Mastering diverse domains through world models, 2024
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024
work page 2024
-
[33]
N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proceedings of IEEE International Conference on Evolutionary Computation, pp.\ 312--317, 1996. doi:10.1109/ICEC.1996.542381
-
[34]
Td-mpc2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[35]
The CMA Evolution Strategy: A Tutorial
Nikolaus Hansen. The cma evolution strategy: A tutorial, 2023. URL https://arxiv.org/abs/1604.00772
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on G ithub. Zenodo, DOI:10.5281/zenodo.2559634, February 2019. URL https://doi.org/10.5281/zenodo.2559634
-
[37]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023. URL https://arxiv.org/abs/2309.17080
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Trans. on Robotics and Automation, 12 0 (5): 0 651--670, October 1996
work page 1996
-
[39]
Herbert Jaeger. Tutorial on training recurrent neural networks, covering bppt, rtrl, ekf and the echo state network approach. GMD-Forschungszentrum Informationstechnik, 2002., 5, 01 2002
work page 2002
-
[40]
Planning with diffusion for flexible behavior synthesis
Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022
work page 2022
-
[41]
Offline reinforcement learning with implicit q-learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022
work page 2022
-
[42]
A path towards autonomous machine intelligence
Yann LeCun. A path towards autonomous machine intelligence. Open Review, Jun 2022
work page 2022
-
[43]
Hierarchical planning through goal-conditioned offline reinforcement learning, 2022
Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning, 2022
work page 2022
-
[44]
Biconmp: A nonlinear model predictive control framework for whole body motion planning
Avadesh Meduri, Paarth Shah, Julian Viereck, Majid Khadiv, Ioannis Havoutis, and Ludovic Righetti. Biconmp: A nonlinear model predictive control framework for whole body motion planning. IEEE Transactions on Robotics, 39: 0 905--922, 2022. URL https://api.semanticscholar.org/CorpusID:246035621
work page 2022
-
[45]
Discovering and achieving goals via world models
Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals via world models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.\ 24379--24391, 2021
work page 2021
-
[46]
Structured world models from human videos, 2023
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos, 2023
work page 2023
-
[47]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...
work page 2015
-
[48]
Asynchronous methods for deep reinforcement learning
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.\ 1928--1937. PMLR, 20--22 Jun 2016
work page 1928
-
[49]
Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation
Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations, 2020
work page 2020
-
[50]
Pong, Steven Lin, and Sergey Levine
Soroush Nasiriany, Vitchyr H. Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In NeurIPS, 2019
work page 2019
-
[51]
Robocasa: Large-scale simulation of everyday tasks for generalist robots
Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024
work page 2024
-
[52]
Octo: An open-source generalist robot policy
Octo Model Team , Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science a...
work page 2024
-
[53]
Offline goal-conditioned RL with latent states as actions
Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. Offline goal-conditioned RL with latent states as actions. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023
work page 2023
-
[54]
Genie 2: A large-scale foundation world model
Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...
work page 2024
-
[55]
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp.\ 2778–2787. JMLR.org, 2017
work page 2017
-
[56]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
work page 2023
-
[57]
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588 0 (7839): 0 604–609, December 2020. ISSN 1476-4687. doi:10.1038/s41586-02...
work page internal anchor Pith review doi:10.1038/s41586-020-03051-4 2020
-
[58]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[59]
Planning to explore via self-supervised world models
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020
work page 2020
-
[60]
Masked world models for visual control
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In 6th Annual Conference on Robot Learning, 2022
work page 2022
-
[61]
Rapid exploration for open-world navigation with latent goal models
Dhruv Shah, Benjamin Eysenbach, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. In 5th Annual Conference on Robot Learning, 2021
work page 2021
-
[62]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018. doi:...
-
[63]
Oriane Sim \'e oni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha \"e l Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth \'e e Darcet, Th \'e o Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Learning from reward-free offline data: A case for planning with latent dynamics models, 02 2025
Vlad Sobal, Wancong Zhang, Kynghyun Cho, Randall Balestriero, Tim Rudner, and Yann Lecun. Learning from reward-free offline data: A case for planning with latent dynamics models, 02 2025
work page 2025
-
[65]
Universal planning networks: Learning generalizable representations for visuomotor control
Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In ICML, 2018
work page 2018
-
[66]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568, 2024
work page 2024
-
[67]
Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, Chelsea Finn, Max Du, Moo Jin Kim, Alexander Khazatsky, Jonathan Heewon Yang, Tony Z. Zhao, Ken Goldberg, Ryan Hoque, Lawrence Yunliang Chen, Simeon Adebola, Gaurav S. Sukhatme, Gautam Salhotra, Shivin Dass, Lerrel Pinto, Zic...
work page 2023
-
[68]
Embed to control: A locally linear latent dynamics model for control from raw images
Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NeurIPS, 2015
work page 2015
-
[69]
Model predictive path integral control using covariance variable importance sampling
Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling, 2015
work page 2015
-
[70]
Understanding and improving layer normalization
Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. Curran Associates Inc., Red Hook, NY, USA, 2019
work page 2019
-
[71]
IQL-TD-MPC: Implicit Q-learning for hierarchical model predictive control
Yingchen Xu, Rohan Chitnis, Bobak T Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, and Olivier Delalleau. IQL-TD-MPC: Implicit Q-learning for hierarchical model predictive control. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023
work page 2023
-
[72]
Learning interactive real-world simulators
Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In ICLR, 2023
work page 2023
-
[73]
Mastering visual continuous control: Improved data-augmented reinforcement learning
Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. In ICLR, 2022
work page 2022
-
[74]
Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[75]
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019
work page 2019
-
[76]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018
work page 2018
-
[77]
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning, 2024a. URL https://arxiv.org/abs/2411.04983
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[78]
Diffusion model predictive control
Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, and Kevin Murphy. Diffusion model predictive control. arXiv preprint arXiv:2410.05364, 2024b
-
[79]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, Yifeng Zhu, and Kevin Lin. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[80]
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...
work page 2023