pith. machine review for the scientific record.

arxiv: 2411.04983 · v2 · submitted 2024-11-07 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 16:00 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords world models · zero-shot planning · visual dynamics · DINOv2 features · offline learning · task-agnostic planning · robotic control · feature prediction

The pith

DINO-WM uses pre-trained DINOv2 patch features to build world models that support zero-shot planning from offline data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DINO World Model as a method for learning predictive models of visual dynamics directly from pre-collected trajectories. It works by training to forecast how spatial patch features from DINOv2 will change in response to actions, rather than attempting to reconstruct raw images. This learned predictor then supports test-time planning by optimizing sequences of actions to make the predicted features match those of a desired goal state. The resulting system solves tasks in six different environments spanning mazes, pushing, and multi-particle scenarios while using no expert demonstrations, reward functions, or pre-trained inverse models.
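To make the two-stage recipe concrete, here is a minimal sketch of the training step, assuming a frozen DINOv2-style encoder exposed as an `encode_patches` callable. The class and function names are illustrative, and the paper's actual predictor architecture, conditioning, and losses may differ.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts next-step patch features from current features and an action.
    Hypothetical stand-in for the paper's predictor: a small transformer over
    patch tokens with the action appended as one extra token."""
    def __init__(self, feat_dim=384, action_dim=2, hidden=512):
        # feat_dim=384 assumes a ViT-S/14 DINOv2 backbone; adjust for other variants
        super().__init__()
        self.action_proj = nn.Linear(action_dim, feat_dim)
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=6,
                                       dim_feedforward=hidden, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, patch_feats, action):
        # patch_feats: (B, num_patches, feat_dim); action: (B, action_dim)
        tokens = torch.cat([patch_feats, self.action_proj(action).unsqueeze(1)], dim=1)
        out = self.net(tokens)[:, :patch_feats.shape[1]]  # drop the action token
        return self.head(out)

def train_step(model, optimizer, obs_t, act_t, obs_tp1, encode_patches):
    """One supervised step: regress next-step patch features (single-step, teacher-forced)."""
    with torch.no_grad():  # the pre-trained encoder stays frozen
        z_t, z_tp1 = encode_patches(obs_t), encode_patches(obs_tp1)
    pred = model(z_t, act_t)
    loss = nn.functional.mse_loss(pred, z_tp1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```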

Core claim

DINO-WM learns visual dynamics by predicting future DINOv2 patch features from offline behavioral trajectories, which allows it to perform task-agnostic planning through optimization of action sequences aimed at matching goal features at test time.

What carries the argument

Prediction of future spatial patch features extracted by DINOv2, serving as the representation for modeling dynamics and enabling action optimization toward goal features.
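Read as pseudocode, the planning step amounts to sampling action sequences, rolling them forward with the learned feature predictor, and scoring them by distance to the goal features. The sketch below uses a cross-entropy-method style search with a predictor like the one sketched above; the paper's actual planner, horizon, and cost may differ, and the parameter values here are purely illustrative.

```python
import torch

def plan_actions(model, z_current, z_goal, horizon=10, n_samples=256, n_iters=5,
                 n_elite=32, action_dim=2, action_scale=1.0):
    """CEM-style search over action sequences, scored entirely in feature space.
    z_current: (1, num_patches, feat_dim) features of the current observation;
    z_goal: same shape, features of the goal image."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim) * action_scale
    for _ in range(n_iters):
        # Sample candidate action sequences: (n_samples, horizon, action_dim)
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Roll each sequence forward with the learned feature predictor
        z = z_current.expand(n_samples, -1, -1).clone()
        with torch.no_grad():
            for t in range(horizon):
                z = model(z, actions[:, t])
        # Cost: distance between predicted final features and goal features
        cost = (z - z_goal).pow(2).mean(dim=(1, 2))
        elite = actions[cost.topk(n_elite, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mean  # executed open-loop, or re-planned MPC-style after each step
```

Note that the cost lives entirely in feature space, which is exactly where the sensitivity concerns raised in the referee report below would bite.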

If this is right

  • Action sequences can be optimized at test time in feature space to reach arbitrary observational goals.
  • A single model trained on offline trajectories supports multiple tasks without retraining or reward engineering.
  • Zero-shot behavioral solutions become available across task families such as navigation, manipulation, and particle control.
  • Planning performance exceeds prior methods that require demonstrations or task-specific components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-trained vision features appear to encode sufficient state information for dynamics modeling in many control settings.
  • The same feature-prediction approach could be evaluated on physical robots to test whether the planning transfers beyond simulation.
  • General planning systems might require fewer custom perception modules if similar pre-trained features prove broadly useful.

Load-bearing premise

Predicting future DINOv2 patch features alone supplies enough information about environment dynamics to support reliable planning without any visual reconstruction or task-specific additions.

What would settle it

Observe whether action sequences optimized under DINO-WM reach the intended goals in a new environment whose dynamics depend on visual details absent from the DINOv2 patch features.
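One hedged way to run that check, assuming the new environment exposes ground-truth state and a renderer: measure how much the frozen patch features move under small, dynamically relevant state perturbations. A ratio near zero would flag details the features cannot distinguish, and hence details the planner cannot act on. All names below (`render`, `perturb`, `encode_patches`) are stand-ins, not the paper's API.

```python
import torch

def feature_sensitivity(encode_patches, render, sample_state, perturb,
                        n_trials=100, eps=0.05):
    """Average ratio of relative feature-space change to relative state-space change.
    `sample_state` draws a ground-truth state (array-like), `perturb` applies a small
    dynamically relevant change, `render` produces the observation, and
    `encode_patches` is the frozen encoder."""
    ratios = []
    for _ in range(n_trials):
        s0 = sample_state()
        s1 = perturb(s0, eps)
        with torch.no_grad():
            z0, z1 = encode_patches(render(s0)), encode_patches(render(s1))
        dz = (z0 - z1).norm() / (z0.norm() + 1e-8)                  # feature change
        ds = torch.as_tensor(s1 - s0).norm() / (torch.as_tensor(s0).norm() + 1e-8)
        ratios.append((dz / (ds + 1e-8)).item())
    return sum(ratios) / len(ratios)
```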

read the original abstract

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DINO-WM, a world model that predicts future DINOv2 patch features from offline trajectories to enable test-time action sequence optimization for achieving observational goals. It claims zero-shot behavioral solutions across six environments (mazes, push manipulation with varied shapes, multi-particle scenarios) without expert demonstrations, reward modeling, or inverse models, outperforming prior SOTA.

Significance. If the results hold, this could advance scalable, task-agnostic visual world models by leveraging pre-trained semantic features instead of reconstruction or task-specific components, with potential impact on model-based planning in robotics.

major comments (2)
  1. [Method (dynamics model)] The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.
  2. [Experiments] Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'arbitrarily configured mazes' would benefit from a brief clarification of how configurations are varied at test time and whether the same model is used without retraining.
  2. [Method] Notation: ensure consistent use of 'patch features' versus 'visual features' when describing the prediction target to avoid ambiguity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.

    Authors: We appreciate this insightful observation regarding the potential limitations of DINOv2 features in capturing fine-grained dynamics. Our approach relies on the empirical observation that these features enable effective planning, as evidenced by the successful zero-shot performance in the push manipulation and multi-particle environments. However, we acknowledge that a direct test of feature sensitivity to small changes was not included in the original manuscript. In the revision, we will add a discussion section and supporting visualizations to analyze how DINOv2 patch features respond to pose and contact variations in our tasks. This will help substantiate that the features provide sufficient sensitivity for the planning objectives. revision: partial

  2. Referee: Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.

    Authors: We agree that additional experimental details are necessary to fully support the claims of robustness and task-agnosticism. We will revise the manuscript to include ablations that isolate the role of pre-trained DINOv2 features (e.g., comparing to training from scratch or using other feature extractors), report error bars across multiple random seeds, and provide details on the hyperparameter tuning process to ensure it was not post-hoc. These changes will strengthen the evidence for the general applicability of DINO-WM across diverse environments. revision: yes
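A minimal sketch of the ablation protocol the rebuttal promises, assuming hypothetical `train_world_model` and `eval_planning_success` hooks: swap the frozen encoder while holding everything else fixed, run several seeds, and report mean and standard deviation of planning success, so the contribution of the pre-trained DINOv2 features can be separated from the prediction architecture.

```python
import statistics

def run_ablation(encoders, train_world_model, eval_planning_success,
                 seeds=(0, 1, 2, 3, 4)):
    """Compare frozen feature extractors under an otherwise identical pipeline.
    `encoders` maps a name (e.g. 'dinov2', 'scratch_cnn') to an encoder constructor;
    the two remaining arguments are stand-ins for the training and evaluation code."""
    results = {}
    for name, make_encoder in encoders.items():
        scores = []
        for seed in seeds:
            model = train_world_model(make_encoder(), seed=seed)
            scores.append(eval_planning_success(model, seed=seed))
        results[name] = (statistics.mean(scores), statistics.stdev(scores))
    return results  # name -> (mean success rate, std across seeds)
```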

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard supervised feature prediction from data

full rationale

The paper trains a dynamics model to predict future DINOv2 patch features from offline trajectories and then optimizes actions at test time to reach goal features. This is a conventional supervised prediction setup followed by planning, with no equations or steps that reduce by construction to fitted parameters, self-defined quantities, or load-bearing self-citations. The central claim (zero-shot planning via feature prediction) remains independent of the reported results and does not import uniqueness theorems or ansatzes from the authors' prior work in a circular manner. External pre-trained DINOv2 features and standard optimization provide the necessary separation from the target performance metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that DINOv2 features are rich enough for dynamics modeling and that offline trajectory data suffices for learning without additional signals. No new physical entities are postulated.

free parameters (1)
  • training hyperparameters
    Standard choices such as learning rate, sequence length, and optimization settings are required to train the feature predictor and are not derived from first principles.
axioms (1)
  • domain assumption Pre-trained DINOv2 spatial patch features contain sufficient information to model visual dynamics for planning.
    Invoked to justify skipping image reconstruction and using feature prediction as the learning target.

pith-pipeline@v0.9.0 · 5522 in / 1235 out tokens · 72405 ms · 2026-05-17T16:00:46.891348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

    cs.LG 2026-05 unverdicted novelty 7.0

    Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predica...

  2. Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

    cs.LG 2026-05 unverdicted novelty 7.0

    Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...

  3. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  4. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  5. 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    cs.RO 2026-04 unverdicted novelty 7.0

    3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

  6. TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance

    cs.RO 2026-01 unverdicted novelty 7.0

    TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.

  7. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  8. VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

    cs.RO 2026-04 conditional novelty 6.0

    VISION-SLS learns visual features with state-dependent error bounds and optimizes causal affine output-feedback policies via system level synthesis to achieve safe nonlinear control from RGB images.

  9. Safe Control using Learned Safety Filters and Adaptive Conformal Inference

    eess.SY 2026-04 unverdicted novelty 6.0

    ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user pa...

  10. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  11. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  12. Learning Long-term Motion Embeddings for Efficient Kinematics Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

  13. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  14. Metriplector: From Field Theory to Neural Architecture

    cs.AI 2026-03 unverdicted novelty 6.0

    Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...

  15. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  16. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  17. FLARE: Robot Learning with Implicit World Modeling

    cs.RO 2025-05 unverdicted novelty 6.0

    FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

  18. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    cs.RO 2025-05 unverdicted novelty 6.0

    UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

  19. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 17 Pith papers · 27 internal anchors

  1. [1]

    Legged locomotion in challenging terrains using egocentric vision, 2022

    Agarwal, A., Kumar, A., Malik, J., and Pathak, D. Legged locomotion in challenging terrains using egocentric vision, 2022. URL https://arxiv.org/abs/2211.07638

  2. [2]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15619--15629, 2023

  3. [3]

    Nonlinear and adaptive control with applications, volume 187

    Astolfi, A., Karagiannis, D., and Ortega, R. Nonlinear and adaptive control with applications, volume 187. Springer, 2008

  4. [4]

    V-JEPA: Latent Video Prediction for Visual Representation Learning, 2024

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., and Ballas, N. V-JEPA: Latent video prediction for visual representation learning, 2024. URL https://openreview.net/forum?id=WFYbBOEOtv

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W. E., Levine, S., Lu, Y., Michalew...

  6. [6]

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N. J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K.-H., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J...

  7. [7]

    Genie: Generative interactive environments, 2024

    Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., de Freitas, N., Singh, S., and Rocktäschel, T. Genie: Generative interactive environmen...

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers, 2021. URL https://arxiv.org/abs/2104.14294

  9. [9]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion, 2024. URL https://arxiv.org/abs/2303.04137

  10. [10]

    Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

    Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018. URL https://arxiv.org/abs/1805.12114

  11. [11]

    PILCO: A Model-Based and Data-Efficient Approach to Policy Search

    Deisenroth, M. P. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011. URL https://api.semanticscholar.org/CorpusID:14273320

  12. [12]

    Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning, 2024

    Ding, Z., Zhang, A., Tian, Y., and Zheng, Q. Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning, 2024. URL https://arxiv.org/abs/2402.03570

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929

  14. [14]

    Learning Universal Policies via Text-Guided Video Generation

    Du, Y., Yang, M., Dai, B., Dai, H., Nachum, O., Tenenbaum, J. B., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation, 2023. URL https://arxiv.org/abs/2302.00111

  15. [15]

    Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

    Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control, 2018. URL https://arxiv.org/abs/1812.00568

  16. [17]

    Deep Visual Foresight for Planning Robot Motion

    Finn, C. and Levine, S. Deep visual foresight for planning robot motion, 2017. URL https://arxiv.org/abs/1610.00696

  17. [18]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219

  18. [20]

    Learning Latent Dynamics for Planning from Pixels

    Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels, 2019. URL https://arxiv.org/abs/1811.04551

  19. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603

  20. [22]

    Mastering Atari with Discrete World Models

    Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering atari with discrete world models, 2022. URL https://arxiv.org/abs/2010.02193

  21. [23]

    Mastering Diverse Domains through World Models

    Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models, 2024. URL https://arxiv.org/abs/2301.04104

  22. [24]

    Baku: An efficient transformer for multi-task policy learning, 2024

    Haldar, S., Peng, Z., and Pinto, L. Baku: An efficient transformer for multi-task policy learning, 2024. URL https://arxiv.org/abs/2406.07539

  23. [25]

    Temporal difference learning for model predictive control, 2022

    Hansen, N., Wang, X., and Su, H. Temporal difference learning for model predictive control, 2022. URL https://arxiv.org/abs/2203.04955

  24. [26]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Hansen, N., Su, H., and Wang, X. Td-mpc2: Scalable, robust world models for continuous control, 2024. URL https://arxiv.org/abs/2310.16828

  25. [27]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  26. [28]

    An Overview of Model Predictive Control

    Holkar, K. and Waghmare, L. M. An overview of model predictive control. International Journal of Control and Automation, 3(4): 47--63, 2010

  27. [29]

    Gaia-1: A generative world model for autonomous driving, 2023

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. Gaia-1: A generative world model for autonomous driving, 2023

  28. [30]

    Chain-of-thought predictive control, 2024

    Jia, Z., Thumuluri, V., Liu, F., Chen, L., Huang, Z., and Su, H. Chain-of-thought predictive control, 2024. URL https://arxiv.org/abs/2304.00776

  29. [31]

    Learning to Act from Actionless Videos through Dense Correspondences

    Ko, P.-C., Mao, J., Du, Y., Sun, S.-H., and Tenenbaum, J. B. Learning to act from actionless videos through dense correspondences, 2023. URL https://arxiv.org/abs/2310.08576

  30. [32]

    Behavior Generation with Latent Actions

    Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., and Pinto, L. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181

  31. [33]

    DeepMPC: Learning Deep Latent Features for Model Predictive Control

    Lenz, I., Knepper, R. A., and Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015. URL https://api.semanticscholar.org/CorpusID:10130184

  32. [34]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., He, L., and Sun, L. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. URL https://arxiv.org/abs/2402.17177

  33. [35]

    Eureka: Human-Level Reward Design via Coding Large Language Models

    Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models, 2024. URL https://arxiv.org/abs/2310.12931

  34. [36]

    Discovering and achieving goals via world models, 2021

    Mendonca, R., Rybkin, O., Daniilidis, K., Hafner, D., and Pathak, D. Discovering and achieving goals via world models, 2021. URL https://arxiv.org/abs/2110.09514

  35. [37]

    Alan: Autonomously exploring robotic agents in the real world, 2023 a

    Mendonca, R., Bahl, S., and Pathak, D. Alan: Autonomously exploring robotic agents in the real world, 2023 a . URL https://arxiv.org/abs/2302.06604

  36. [38]

    Structured world models from human videos, 2023 b

    Mendonca, R., Bahl, S., and Pathak, D. Structured world models from human videos, 2023 b . URL https://arxiv.org/abs/2308.10901

  37. [39]

    Transformers are sample-efficient world models, 2023

    Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample-efficient world models, 2023. URL https://arxiv.org/abs/2209.00588

  38. [40]

    Deep dynamics models for learning dexterous manipulation, 2019

    Nagabandi, A., Konoglie, K., Levine, S., and Kumar, V. Deep dynamics models for learning dexterous manipulation, 2019. URL https://arxiv.org/abs/1909.11652

  39. [41]

    R3M: A Universal Visual Representation for Robot Manipulation

    Nair, S., Rajeswaran, A., Kumar, V., Finn, C., and Gupta, A. R3m: A universal visual representation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601

  40. [42]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. Dinov2: Learning robust visual f...

  41. [43]

    Zero-Shot Visual Imitation

    Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation, 2018. URL https://arxiv.org/abs/1804.08606

  42. [44]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2, 2019. URL https://arxiv.org/abs/1906.00446

  43. [45]

    A Generalist Agent

    Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., and de Freitas, N. A generalist agent, 2022. URL https://arxiv.org/abs/2205.06175

  44. [46]

    Transformer-based world models are happy with 100k interactions, 2023

    Robine, J., Höftmann, M., Uelwer, T., and Harmeling, S. Transformer-based world models are happy with 100k interactions, 2023. URL https://arxiv.org/abs/2303.07109

  45. [47]

    ImageNet Large Scale Visual Recognition Challenge

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge, 2015. URL https://arxiv.org/abs/1409.0575

  46. [48]

    Planning to explore via self-supervised world models, 2020

    Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and Pathak, D. Planning to explore via self-supervised world models, 2020. URL https://arxiv.org/abs/2005.05960

  47. [49]

    Dyna, an Integrated Architecture for Learning, Planning, and Reacting

    Sutton, R. S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4): 160--163, 1991

  48. [50]

    DeepMind Control Suite

    Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. Deepmind control suite, 2018. URL https://arxiv.org/abs/1801.00690

  49. [51]

    A Generalized Iterative LQG Method for Locally-Optimal Feedback Control of Constrained Nonlinear Stochastic Systems

    Todorov, E. and Li, W. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005, American Control Conference, 2005., pp.\ 300--306. IEEE, 2005

  50. [52]

    Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

    Wang, J., Dasari, S., Srirama, M. K., Tulsiani, S., and Gupta, A. Manipulate by seeing: Creating manipulation controllers from pre-trained representations, 2023. URL https://arxiv.org/abs/2303.08135

  51. [53]

    Image quality assessment: from error visibility to structural similarity

    Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600--612, 2004. doi:10.1109/TIP.2003.819861

  52. [54]

    Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

    Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images, 2015. URL https://arxiv.org/abs/1506.07365

  53. [55]

    Any-point Trajectory Modeling for Policy Learning

    Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y., and Abbeel, P. Any-point trajectory modeling for policy learning, 2024. URL https://arxiv.org/abs/2401.00025

  54. [56]

    Information Theoretic MPC for Model-Based Reinforcement Learning

    Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp.\ 1714--1721. IEEE, 2017

  55. [57]

    Learning to manipulate deformable objects without demonstrations, 2020

    Wu, Y., Yan, W., Kurutach, T., Pinto, L., and Abbeel, P. Learning to manipulate deformable objects without demonstrations, 2020. URL https://arxiv.org/abs/1910.13439

  56. [58]

    Masked visual pre-training for motor control, 2022

    Xiao, T., Radosavovic, I., Darrell, T., and Malik, J. Masked visual pre-training for motor control, 2022. URL https://arxiv.org/abs/2203.06173

  57. [59]

    Learning predictive representations for deformable objects using contrastive estimation

    Yan, W., Vangipuram, A., Abbeel, P., and Pinto, L. Learning predictive representations for deformable objects using contrastive estimation. In Conference on Robot Learning, pp.\ 564--574. PMLR, 2021

  58. [60]

    Learning interactive real-world simulators, 2023

    Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., and Abbeel, P. Learning interactive real-world simulators, 2023

  59. [61]

    Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation, 2024

    Zhang, K., Li, B., Hauser, K., and Li, Y. Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation, 2024. URL https://arxiv.org/abs/2407.07889

  60. [63]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

  61. [64]

    Train Offline, Test Online: A Real Robot Learning Benchmark

    Zhou, G., Dean, V., Srirama, M. K., Rajeswaran, A., Pari, J., Hatch, K., Jain, A., Yu, T., Abbeel, P., Pinto, L., Finn, C., and Gupta, A. Train offline, test online: A real robot learning benchmark, 2023. URL https://arxiv.org/abs/2306.00942

  62. [65]

    Scaling Learning Algorithms Towards AI

    Bengio, Yoshua and LeCun, Yann. Scaling learning algorithms towards AI

  63. [66]

    A Fast Learning Algorithm for Deep Belief Nets

    Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee Whye. A fast learning algorithm for deep belief nets. Neural Computation, 2006

  64. [67]

    Deep Learning, 2016

  65. [68]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, 2023

  66. [69]

    Behavior Generation with Latent Actions, 2024

  67. [70]

    BAKU: An Efficient Transformer for Multi-Task Policy Learning, 2024

  68. [71]

    Eureka: Human-Level Reward Design via Coding Large Language Models, 2024

  69. [72]

    Mastering Diverse Domains through World Models, 2024

  70. [73]

    Temporal Difference Learning for Model Predictive Control, 2022

  71. [74]

    TD-MPC2: Scalable, Robust World Models for Continuous Control, 2024

  72. [75]

    Legged Locomotion in Challenging Terrains using Egocentric Vision, 2022

  73. [76]

    A Tutorial on Energy-Based Learning

    LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu Jie. A tutorial on energy-based learning. In Predicting Structured Data, 2006

  74. [77]

    Train Offline, Test Online: A Real Robot Learning Benchmark, 2023

  75. [78]

    RT-1: Robotics Transformer for Real-World Control at Scale, 2023

  76. [79]

    World Models

    Ha, David and Schmidhuber, Jürgen. World Models, 2018. doi:10.5281/ZENODO.1207631

  77. [80]

    LeCun, Yann. A...

  78. [81]

    Learning Latent Dynamics for Planning from Pixels, 2019

  79. [82]

    Transformers are Sample-Efficient World Models, 2023

  80. [83]

    Transformer-based World Models Are Happy With 100k Interactions, 2023

Showing first 80 references.