pith. sign in

arxiv: 2512.24497 · v3 · pith:DTDRYLZEnew · submitted 2025-12-30 · 💻 cs.AI · cs.LG· cs.RO· stat.ML

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Pith reviewed 2026-05-21 15:28 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.ROstat.ML
keywords joint-embedding predictive world modelsphysical planningrobot navigationmanipulation tasksworld modelsrepresentation learningplanning algorithms
0
0 comments X

The pith

Design choices in architecture, objective, and planning drive success for joint-embedding predictive world models on physical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the technical choices that determine whether joint-embedding predictive world models succeed when used for planning in physical environments. It isolates the effects of model architecture, training objective, and planning algorithm through controlled experiments in both simulated settings and real robotic data. The study identifies combinations that improve an agent's ability to solve navigation and manipulation tasks while generalizing to new situations. These findings are assembled into one model that exceeds the performance of two prior methods, DINO-WM and V-JEPA-2-AC.

Core claim

Joint-embedding predictive world models achieve higher planning success rates in navigation and manipulation when specific architectural designs, training objectives, and planning algorithms are selected together, as shown by direct comparisons against established baselines on both simulated and real-world robotic tasks.

What carries the argument

Planning performed by optimizing directly in the learned representation space of a joint-embedding predictive world model, which abstracts away irrelevant input details.

If this is right

  • Architecture choices that emphasize relevant features make planning more efficient by reducing the search space.
  • Training objectives that produce high-quality representations increase the reliability of planning across varied environments.
  • Particular planning algorithms better exploit the abstract space to raise success rates on both navigation and manipulation.
  • Combining the identified components produces a model that generalizes better than DINO-WM and V-JEPA-2-AC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same component analysis could be applied to predictive models outside the joint-embedding family to check for similar gains.
  • Representation-space planning may reduce sample requirements when agents adapt to novel physical interactions.
  • Extending these models to tasks with longer action sequences would test whether the identified choices scale with horizon length.
  • Deployment on additional real robots could reveal whether the gains hold when sensor noise and hardware variation increase.

Load-bearing premise

Observed performance differences arise primarily from the studied choices in architecture, objective, and planning algorithm rather than from uncontrolled experimental variables or task-specific tuning.

What would settle it

Evaluating the proposed model and the two baselines on a new collection of unseen physical tasks while holding all other experimental conditions fixed would show whether the performance gains persist.

Figures

Figures reproduced from arXiv: 2512.24497 by Adrien Bardes, Basile Terver, Jean Ponce, Tsung-Yen Yang, Yann LeCun.

Figure 1
Figure 1. Figure 1: Left: Training of JEPA-WM: the encoder Eϕ,θ embeds video and optionally propriocep￾tive observation, which is fed to the predictor Pθ, along with actions, to predict (in parallel across timesteps) the next state embedding. Right: Planning with JEPA-WM: sample action sequences, unroll the predictor on them, compute a planning cost L p for each trajectory, and use this cost to iteratively refine the action s… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of different methods on the counterfactual Franka arm lift cup task, where [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Comparison of planning optimizers: NG is the Nevergrad-based interface that we [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Models trained with proprioceptive input are denoted “prop”, while pure visual world [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) Comparing predictor architectures: we denote positional embedding in the predictor [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a) Comparison of model size: we vary from ViT-S to ViT-L the visual encoder size, as [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 3
Figure 3. Figure 3: Let us detail the failure cases of the GD planner. On the Wall task, the GD planner gets zero performance, although the task is visually simplistic. We identify two main failure cases. Either the agent goes into the wall without being able to pass the door, which is a classical failure case for better CEM or NG planners. Or the agent finds a local planning cost minimum by going to the borders of the image,… view at source ↗
read the original abstract

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently use it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies Joint-Embedding Predictive Architecture World Models (JEPA-WMs) for physical planning. It systematically examines the effects of model architecture, training objective, and planning algorithm on success in navigation and manipulation tasks using both simulated environments and real robotic data. The authors combine their findings into a proposed model that outperforms the baselines DINO-WM and V-JEPA-2-AC, and release code, data, and checkpoints.

Significance. If the performance advantages are shown to arise from the studied design choices rather than uncontrolled factors, the work offers concrete guidance on building effective representation-space planners for robotics. The public release of code and checkpoints is a clear strength that supports verification and extension by the community.

major comments (2)
  1. [Experiments] The experimental sections do not report the hyperparameter search effort, compute budget, or tuning protocol applied to the proposed model versus the two baselines. This information is load-bearing for the central claim that the observed gains are due to architecture, objective, and planning choices rather than uneven optimization.
  2. [Results] Results tables and figures lack multi-seed averages with confidence intervals or statistical tests. Without these, it is not possible to assess whether the reported outperformance is robust or could be explained by random variation or single-run effects.
minor comments (2)
  1. [Figures] Figure captions and axis labels could be expanded to make the ablation results more immediately interpretable without reference to the main text.
  2. [Introduction] The paper uses several acronyms (JEPA-WM, DINO-WM, V-JEPA-2-AC) that would benefit from a short glossary or consistent first-use definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the transparency of our experimental methodology. We have revised the manuscript to address both major comments by adding the requested details on hyperparameter tuning and statistical reporting of results.

read point-by-point responses
  1. Referee: [Experiments] The experimental sections do not report the hyperparameter search effort, compute budget, or tuning protocol applied to the proposed model versus the two baselines. This information is load-bearing for the central claim that the observed gains are due to architecture, objective, and planning choices rather than uneven optimization.

    Authors: We agree that explicit reporting of the tuning process is necessary to substantiate our claims. In the revised manuscript, we have inserted a new subsection (4.1) detailing the hyperparameter search protocol, including the ranges explored for learning rate, embedding dimension, prediction horizon, and optimizer settings. We also report the total compute budget (approximately 1200 GPU-hours for the full study) and note that equivalent search effort was applied to re-tune the DINO-WM and V-JEPA-2-AC baselines using the same protocol and search budget. This ensures the performance differences can be attributed to the architecture, objective, and planning choices under study. revision: yes

  2. Referee: [Results] Results tables and figures lack multi-seed averages with confidence intervals or statistical tests. Without these, it is not possible to assess whether the reported outperformance is robust or could be explained by random variation or single-run effects.

    Authors: We acknowledge the value of multi-seed statistics for assessing robustness. The original submission reported single-run results due to the high cost of training large JEPA models. In revision, we have re-run the primary navigation and manipulation experiments with 5 independent random seeds and updated the main results tables to report mean success rates with standard deviations. We have also added paired t-tests comparing our model against each baseline, with p-values reported in the text. For the full ablation tables, we include variance estimates from the available runs and explicitly note the computational constraints that limited exhaustive multi-seed evaluation across all conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons to external baselines with no self-referential derivations

full rationale

The paper is an empirical study that varies architecture, training objective, and planning algorithm, then reports performance on navigation and manipulation tasks against the external baselines DINO-WM and V-JEPA-2-AC. No first-principles derivations, predictive equations, or fitted parameters are presented whose outputs reduce by construction to the inputs. All load-bearing claims rest on experimental outcomes that are falsifiable against independent implementations of the baselines. Self-citations to prior JEPA work are not load-bearing for the central performance claim, which is measured directly against published external models.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; full details on hyperparameters and modeling assumptions unavailable.

free parameters (1)
  • hyperparameters for architecture and training
    Standard ML training choices whose specific values are not detailed in the abstract.
axioms (1)
  • domain assumption Planning in learned representation space abstracts away irrelevant details and yields more efficient planning than input-space planning.
    Presented as the core promise of the JEPA-WM family in the abstract.

pith-pipeline@v0.9.0 · 5756 in / 1096 out tokens · 50510 ms · 2026-05-21T15:28:50.159344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.

  2. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  3. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  4. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  5. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 4 Pith papers · 12 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Ngiohtuned, a new black-box optimization wizard for real world machine learning

    Anonymous. Ngiohtuned, a new black-box optimization wizard for real world machine learning. Submitted to Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=0FDiCoIStW. Rejected

  3. [3]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia...

  4. [4]

    Back to the features: Dino as a foundation for video world models, 2025

    Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the features: Dino as a foundation for video world models, 2025. URL https://arxiv.org/abs/2507.19468

  5. [5]

    a ron van den Oord, Inbar Mosseri, Adrian Bolton, Satinder Singh, and Tim Rockt \

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

  6. [6]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 15791--15801, June 2025

  7. [7]

    Revisiting feature prediction for learning visual representations from video, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024. ISSN 2835-8856

  8. [8]

    Vavim and vavam: Autonomous driving through video generative modeling

    Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, and Matthieu Cord. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint...

  9. [9]

    Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks, 2015. URL https://arxiv.org/abs/1506.03099

  10. [10]

    Nevergrad: black-box optimization platform

    Pauline Bennet, Carola Doerr, Antoine Moreau, Jeremy Rapin, Fabien Teytaud, and Olivier Teytaud. Nevergrad: black-box optimization platform. SIGEVOlution, 14 0 (1): 0 8–15, April 2021. doi:10.1145/3460310.3460312. URL https://doi.org/10.1145/3460310.3460312

  11. [11]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. _0 : A vis...

  12. [12]

    Predictive Control for Linear and Hybrid Systems

    Francesco Borrelli, Alberto Bemporad, and Manfred Morari. Predictive Control for Linear and Hybrid Systems. Cambridge University Press, USA, 1st edition, 2017. ISBN 1107652871

  13. [13]

    Video generation models as world simulators, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators, 2024. URL https://openai. com/research/video-generation-modelsas-world-simulators

  14. [14]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  15. [15]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023

  16. [16]

    The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors

    Matthew Chignoli, Donghyun Kim, Elijah Stanger-Jones, and Sangbae Kim. The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors. In 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids), pp.\ 1--8. doi:10.1109/HUMANOIDS47582.2021.9555782

  17. [17]

    Vision transformers need registers

    Timoth \'e e Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICRL, 2024

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  19. [19]

    Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14 0 (2): 0 179--211, 1990. ISSN 0364-0213. doi:https://doi.org/10.1016/0364-0213(90)90002-E. URL https://www.sciencedirect.com/science/article/pii/036402139090002E

  20. [20]

    Droid: A large-scale in-the-wild robot manipulation dataset, 2024

    Alexander Khazatsky et al. Droid: A large-scale in-the-wild robot manipulation dataset, 2024

  21. [21]

    Rt-1: Robotics transformer for real-world control at scale, 2023

    Anthony Brohan et al. Rt-1: Robotics transformer for real-world control at scale, 2023

  22. [22]

    Planning to practice: Efficient online fine-tuning by composing goals in latent space

    Kuan Fang, Patrick Yin, Ashvin Nair, and Sergey Levine. Planning to practice: Efficient online fine-tuning by composing goals in latent space. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World, 2022 a

  23. [23]

    Generalization with lossy affordances: Leveraging broad offline data for learning visuomotor tasks

    Kuan Fang, Patrick Yin, Ashvin Nair, Homer Rich Walke, Gengchen Yan, and Sergey Levine. Generalization with lossy affordances: Leveraging broad offline data for learning visuomotor tasks. In 6th Annual Conference on Robot Learning, 2022 b

  24. [24]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  25. [25]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1587--1596. PMLR, 10--15 Jul 2018

  26. [26]

    Galimzyanov, T., Titov, S., Golubev, Y ., and Bogomolov, E

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, and Jitendra Malik. Embodied ai agents...

  27. [27]

    C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: theory and practice—a survey. Automatica, 25 0 (3): 0 335–348, May 1989. ISSN 0005-1098. doi:10.1016/0005-1098(89)90002-2. URL https://doi.org/10.1016/0005-1098(89)90002-2

  28. [28]

    Learning and leveraging world models in visual representation learning, 2024

    Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning and leveraging world models in visual representation learning, 2024

  29. [29]

    Byol-explore: Exploration by bootstrapped prediction

    Zhaohan Guo, Shantanu Thakoor, Miruna Pislar, Bernardo Avila Pires, Florent Altch\' e , Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Remi Munos, Mohammad Gheshlaghi Azar, and Bilal Piot. Byol-explore: Exploration by bootstrapped prediction. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and ...

  30. [30]

    Recurrent world models facilitate policy evolution

    David Ha and J\" u rgen Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31, 2018

  31. [31]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, volume 80, pp.\ 1856--1865. PMLR, 2018

  32. [32]

    Mastering diverse domains through world models, 2024

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024

  33. [33]

    Hansen and A

    N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proceedings of IEEE International Conference on Evolutionary Computation, pp.\ 312--317, 1996. doi:10.1109/ICEC.1996.542381

  34. [34]

    Td-mpc2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024

  35. [35]

    The CMA Evolution Strategy: A Tutorial

    Nikolaus Hansen. The cma evolution strategy: A tutorial, 2023. URL https://arxiv.org/abs/1604.00772

  36. [36]

    CMA-ES/pycma on G ithub

    Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on G ithub. Zenodo, DOI:10.5281/zenodo.2559634, February 2019. URL https://doi.org/10.5281/zenodo.2559634

  37. [37]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023. URL https://arxiv.org/abs/2309.17080

  38. [38]

    Hutchinson, G

    S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Trans. on Robotics and Automation, 12 0 (5): 0 651--670, October 1996

  39. [39]

    Tutorial on training recurrent neural networks, covering bppt, rtrl, ekf and the echo state network approach

    Herbert Jaeger. Tutorial on training recurrent neural networks, covering bppt, rtrl, ekf and the echo state network approach. GMD-Forschungszentrum Informationstechnik, 2002., 5, 01 2002

  40. [40]

    Planning with diffusion for flexible behavior synthesis

    Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022

  41. [41]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022

  42. [42]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. Open Review, Jun 2022

  43. [43]

    Hierarchical planning through goal-conditioned offline reinforcement learning, 2022

    Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning, 2022

  44. [44]

    Biconmp: A nonlinear model predictive control framework for whole body motion planning

    Avadesh Meduri, Paarth Shah, Julian Viereck, Majid Khadiv, Ioannis Havoutis, and Ludovic Righetti. Biconmp: A nonlinear model predictive control framework for whole body motion planning. IEEE Transactions on Robotics, 39: 0 905--922, 2022. URL https://api.semanticscholar.org/CorpusID:246035621

  45. [45]

    Discovering and achieving goals via world models

    Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals via world models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.\ 24379--24391, 2021

  46. [46]

    Structured world models from human videos, 2023

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos, 2023

  47. [47]

    Rusu, Joel Veness, Marc G

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

  48. [48]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.\ 1928--1937. PMLR, 20--22 Jun 2016

  49. [49]

    Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation

    Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations, 2020

  50. [50]

    Pong, Steven Lin, and Sergey Levine

    Soroush Nasiriany, Vitchyr H. Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In NeurIPS, 2019

  51. [51]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024

  52. [52]

    Octo: An open-source generalist robot policy

    Octo Model Team , Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science a...

  53. [53]

    Offline goal-conditioned RL with latent states as actions

    Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. Offline goal-conditioned RL with latent states as actions. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023

  54. [54]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

  55. [55]

    Efros, and Trevor Darrell

    Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp.\ 2778–2787. JMLR.org, 2017

  56. [56]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  57. [57]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588 0 (7839): 0 604–609, December 2020. ISSN 1476-4687. doi:10.1038/s41586-02...

  58. [58]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017

  59. [59]

    Planning to explore via self-supervised world models

    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020

  60. [60]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In 6th Annual Conference on Robot Learning, 2022

  61. [61]

    Rapid exploration for open-world navigation with latent goal models

    Dhruv Shah, Benjamin Eysenbach, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. In 5th Annual Conference on Robot Learning, 2021

  62. [62]

    A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144, 2018

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018. doi:...

  63. [63]

    Oriane Sim \'e oni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha \"e l Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth \'e e Darcet, Th \'e o Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille ...

  64. [64]

    Learning from reward-free offline data: A case for planning with latent dynamics models, 02 2025

    Vlad Sobal, Wancong Zhang, Kynghyun Cho, Randall Balestriero, Tim Rudner, and Yann Lecun. Learning from reward-free offline data: A case for planning with latent dynamics models, 02 2025

  65. [65]

    Universal planning networks: Learning generalizable representations for visuomotor control

    Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In ICML, 2018

  66. [66]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568, 2024

  67. [67]

    o lkopf, Dieter B \

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, Chelsea Finn, Max Du, Moo Jin Kim, Alexander Khazatsky, Jonathan Heewon Yang, Tony Z. Zhao, Ken Goldberg, Ryan Hoque, Lawrence Yunliang Chen, Simeon Adebola, Gaurav S. Sukhatme, Gautam Salhotra, Shivin Dass, Lerrel Pinto, Zic...

  68. [68]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NeurIPS, 2015

  69. [69]

    Model predictive path integral control using covariance variable importance sampling, 2015

    Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling, 2015

  70. [70]

    Understanding and improving layer normalization

    Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. Curran Associates Inc., Red Hook, NY, USA, 2019

  71. [71]

    IQL - TD - MPC : Implicit q-learning for hierarchical model predictive control

    Yingchen Xu, Rohan Chitnis, Bobak T Hashemi, Lucas Lehnert, Urun Dogan, Zheqing Zhu, and Olivier Delalleau. IQL - TD - MPC : Implicit q-learning for hierarchical model predictive control. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023

  72. [72]

    Learning interactive real-world simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In ICLR, 2023

  73. [73]

    Mastering visual continuous control: Improved data-augmented reinforcement learning

    Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. In ICLR, 2022

  74. [74]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758

  75. [75]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019

  76. [76]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  77. [77]

    DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning, 2024 a . URL https://arxiv.org/abs/2411.04983

  78. [78]

    Diffusion model predictive control.arXiv preprint arXiv:2410.05364,

    Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, and Kevin Murphy. Diffusion model predictive control. arXiv preprint arXiv:2410.05364, 2024 b

  79. [79]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Mart\' i n-Mart\' i n, Abhishek Joshi, Soroush Nasiriany, Yifeng Zhu, and Kevin Lin. robosuite: A modular simulation framework and benchmark for robot learning. In arXiv preprint arXiv:2009.12293, 2020

  80. [80]

    Sanketi, Grecia Salazar, Michael S

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...

Showing first 80 references.