pith. machine review for the scientific record.

arxiv: 2411.04983 · v2 · submitted 2024-11-07 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 16:00 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords world models · zero-shot planning · visual dynamics · DINOv2 features · offline learning · task-agnostic planning · robotic control · feature prediction

The pith

DINO-WM uses pre-trained DINOv2 patch features to build world models that support zero-shot planning from offline data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DINO World Model as a method for learning predictive models of visual dynamics directly from pre-collected trajectories. It works by training to forecast how spatial patch features from DINOv2 will change in response to actions, rather than attempting to reconstruct raw images. This learned predictor then supports test-time planning by optimizing sequences of actions to make the predicted features match those of a desired goal state. The resulting system solves tasks in six different environments spanning mazes, pushing, and multi-particle scenarios while using no expert demonstrations, reward functions, or pre-trained inverse models.
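To make the two-stage recipe concrete, here is a minimal sketch of the training step, assuming a frozen DINOv2-style encoder exposed as an `encode_patches` callable. The class and function names are illustrative, and the paper's actual predictor architecture, conditioning, and losses may differ.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts next-step patch features from current features and an action.
    Hypothetical stand-in for the paper's predictor: a small transformer over
    patch tokens with the action appended as one extra token."""
    def __init__(self, feat_dim=384, action_dim=2, hidden=512):
        # feat_dim=384 assumes a ViT-S/14 DINOv2 backbone; adjust for other variants
        super().__init__()
        self.action_proj = nn.Linear(action_dim, feat_dim)
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=6,
                                       dim_feedforward=hidden, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, patch_feats, action):
        # patch_feats: (B, num_patches, feat_dim); action: (B, action_dim)
        tokens = torch.cat([patch_feats, self.action_proj(action).unsqueeze(1)], dim=1)
        out = self.net(tokens)[:, :patch_feats.shape[1]]  # drop the action token
        return self.head(out)

def train_step(model, optimizer, obs_t, act_t, obs_tp1, encode_patches):
    """One supervised step: regress next-step patch features (single-step, teacher-forced)."""
    with torch.no_grad():  # the pre-trained encoder stays frozen
        z_t, z_tp1 = encode_patches(obs_t), encode_patches(obs_tp1)
    pred = model(z_t, act_t)
    loss = nn.functional.mse_loss(pred, z_tp1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```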

Core claim

DINO-WM learns visual dynamics by predicting future DINOv2 patch features from offline behavioral trajectories, which allows it to perform task-agnostic planning through optimization of action sequences aimed at matching goal features at test time.

What carries the argument

Prediction of future spatial patch features extracted by DINOv2, serving as the representation for modeling dynamics and enabling action optimization toward goal features.
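Read as pseudocode, the planning step amounts to sampling action sequences, rolling them forward with the learned feature predictor, and scoring them by distance to the goal features. The sketch below uses a cross-entropy-method style search with a predictor like the one sketched above; the paper's actual planner, horizon, and cost may differ, and the parameter values here are purely illustrative.

```python
import torch

def plan_actions(model, z_current, z_goal, horizon=10, n_samples=256, n_iters=5,
                 n_elite=32, action_dim=2, action_scale=1.0):
    """CEM-style search over action sequences, scored entirely in feature space.
    z_current: (1, num_patches, feat_dim) features of the current observation;
    z_goal: same shape, features of the goal image."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim) * action_scale
    for _ in range(n_iters):
        # Sample candidate action sequences: (n_samples, horizon, action_dim)
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Roll each sequence forward with the learned feature predictor
        z = z_current.expand(n_samples, -1, -1).clone()
        with torch.no_grad():
            for t in range(horizon):
                z = model(z, actions[:, t])
        # Cost: distance between predicted final features and goal features
        cost = (z - z_goal).pow(2).mean(dim=(1, 2))
        elite = actions[cost.topk(n_elite, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mean  # executed open-loop, or re-planned MPC-style after each step
```

Note that the cost lives entirely in feature space, which is exactly where the sensitivity concerns raised in the referee report below would bite.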

If this is right

  • Action sequences can be optimized at test time in feature space to reach arbitrary observational goals.
  • A single model trained on offline trajectories supports multiple tasks without retraining or reward engineering.
  • Zero-shot behavioral solutions become available across task families such as navigation, manipulation, and particle control.
  • Planning performance exceeds prior methods that require demonstrations or task-specific components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-trained vision features appear to encode sufficient state information for dynamics modeling in many control settings.
  • The same feature-prediction approach could be evaluated on physical robots to test whether the planning transfers beyond simulation.
  • General planning systems might require fewer custom perception modules if similar pre-trained features prove broadly useful.

Load-bearing premise

Predicting future DINOv2 patch features alone supplies enough information about environment dynamics to support reliable planning without any visual reconstruction or task-specific additions.

What would settle it

Observe whether action sequences optimized under DINO-WM reach the intended goals in a new environment whose dynamics depend on visual details absent from the DINOv2 patch features.
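One hedged way to run that check, assuming the new environment exposes ground-truth state and a renderer: measure how much the frozen patch features move under small, dynamically relevant state perturbations. A ratio near zero would flag details the features cannot distinguish, and hence details the planner cannot act on. All names below (`render`, `perturb`, `encode_patches`) are stand-ins, not the paper's API.

```python
import torch

def feature_sensitivity(encode_patches, render, sample_state, perturb,
                        n_trials=100, eps=0.05):
    """Average ratio of relative feature-space change to relative state-space change.
    `sample_state` draws a ground-truth state (array-like), `perturb` applies a small
    dynamically relevant change, `render` produces the observation, and
    `encode_patches` is the frozen encoder."""
    ratios = []
    for _ in range(n_trials):
        s0 = sample_state()
        s1 = perturb(s0, eps)
        with torch.no_grad():
            z0, z1 = encode_patches(render(s0)), encode_patches(render(s1))
        dz = (z0 - z1).norm() / (z0.norm() + 1e-8)                  # feature change
        ds = torch.as_tensor(s1 - s0).norm() / (torch.as_tensor(s0).norm() + 1e-8)
        ratios.append((dz / (ds + 1e-8)).item())
    return sum(ratios) / len(ratios)
```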

read the original abstract

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DINO-WM, a world model that predicts future DINOv2 patch features from offline trajectories to enable test-time action sequence optimization for achieving observational goals. It claims zero-shot behavioral solutions across six environments (mazes, push manipulation with varied shapes, multi-particle scenarios) without expert demonstrations, reward modeling, or inverse models, outperforming prior SOTA.

Significance. If the results hold, this could advance scalable, task-agnostic visual world models by leveraging pre-trained semantic features instead of reconstruction or task-specific components, with potential impact on model-based planning in robotics.

major comments (2)
  1. [Method (dynamics model)] The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.
  2. [Experiments] Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'arbitrarily configured mazes' would benefit from a brief clarification of how configurations are varied at test time and whether the same model is used without retraining.
  2. [Method] Notation: ensure consistent use of 'patch features' versus 'visual features' when describing the prediction target to avoid ambiguity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.

    Authors: We appreciate this insightful observation regarding the potential limitations of DINOv2 features in capturing fine-grained dynamics. Our approach relies on the empirical observation that these features enable effective planning, as evidenced by the successful zero-shot performance in the push manipulation and multi-particle environments. However, we acknowledge that a direct test of feature sensitivity to small changes was not included in the original manuscript. In the revision, we will add a discussion section and supporting visualizations to analyze how DINOv2 patch features respond to pose and contact variations in our tasks. This will help substantiate that the features provide sufficient sensitivity for the planning objectives. revision: partial

  2. Referee: Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.

    Authors: We agree that additional experimental details are necessary to fully support the claims of robustness and task-agnosticism. We will revise the manuscript to include ablations that isolate the role of pre-trained DINOv2 features (e.g., comparing to training from scratch or using other feature extractors), report error bars across multiple random seeds, and provide details on the hyperparameter tuning process to ensure it was not post-hoc. These changes will strengthen the evidence for the general applicability of DINO-WM across diverse environments. revision: yes
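A minimal sketch of the ablation protocol the rebuttal promises, assuming hypothetical `train_world_model` and `eval_planning_success` hooks: swap the frozen encoder while holding everything else fixed, run several seeds, and report mean and standard deviation of planning success, so the contribution of the pre-trained DINOv2 features can be separated from the prediction architecture.

```python
import statistics

def run_ablation(encoders, train_world_model, eval_planning_success,
                 seeds=(0, 1, 2, 3, 4)):
    """Compare frozen feature extractors under an otherwise identical pipeline.
    `encoders` maps a name (e.g. 'dinov2', 'scratch_cnn') to an encoder constructor;
    the two remaining arguments are stand-ins for the training and evaluation code."""
    results = {}
    for name, make_encoder in encoders.items():
        scores = []
        for seed in seeds:
            model = train_world_model(make_encoder(), seed=seed)
            scores.append(eval_planning_success(model, seed=seed))
        results[name] = (statistics.mean(scores), statistics.stdev(scores))
    return results  # name -> (mean success rate, std across seeds)
```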

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard supervised feature prediction from data

full rationale

The paper trains a dynamics model to predict future DINOv2 patch features from offline trajectories and then optimizes actions at test time to reach goal features. This is a conventional supervised prediction setup followed by planning, with no equations or steps that reduce by construction to fitted parameters, self-defined quantities, or load-bearing self-citations. The central claim (zero-shot planning via feature prediction) remains independent of the reported results and does not import uniqueness theorems or ansatzes from the authors' prior work in a circular manner. External pre-trained DINOv2 features and standard optimization provide the necessary separation from the target performance metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that DINOv2 features are rich enough for dynamics modeling and that offline trajectory data suffices for learning without additional signals. No new physical entities are postulated.

free parameters (1)
  • training hyperparameters
    Standard choices such as learning rate, sequence length, and optimization settings are required to train the feature predictor and are not derived from first principles.
axioms (1)
  • domain assumption Pre-trained DINOv2 spatial patch features contain sufficient information to model visual dynamics for planning.
    Invoked to justify skipping image reconstruction and using feature prediction as the learning target.

pith-pipeline@v0.9.0 · 5522 in / 1235 out tokens · 72405 ms · 2026-05-17T16:00:46.891348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

    cs.LG 2026-05 unverdicted novelty 7.0

    Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predica...

  2. Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

    cs.LG 2026-05 unverdicted novelty 7.0

    Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...

  3. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

  4. Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.

  5. 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    cs.RO 2026-04 unverdicted novelty 7.0

    3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

  6. TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance

    cs.RO 2026-01 unverdicted novelty 7.0

    TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.

  7. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  8. VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

    cs.RO 2026-04 conditional novelty 6.0

    VISION-SLS learns visual features with state-dependent error bounds and optimizes causal affine output-feedback policies via system level synthesis to achieve safe nonlinear control from RGB images.

  9. Safe Control using Learned Safety Filters and Adaptive Conformal Inference

    eess.SY 2026-04 unverdicted novelty 6.0

    ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user pa...

  10. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  11. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  12. Learning Long-term Motion Embeddings for Efficient Kinematics Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

  13. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  14. Metriplector: From Field Theory to Neural Architecture

    cs.AI 2026-03 unverdicted novelty 6.0

    Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...

  15. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  16. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  17. FLARE: Robot Learning with Implicit World Modeling

    cs.RO 2025-05 unverdicted novelty 6.0

    FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

  18. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    cs.RO 2025-05 unverdicted novelty 6.0

    UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

  19. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 17 Pith papers · 27 internal anchors

  1. [1]

    Legged locomotion in challenging terrains using egocentric vision, 2022

    Agarwal, A., Kumar, A., Malik, J., and Pathak, D. Legged locomotion in challenging terrains using egocentric vision, 2022. URL https://arxiv.org/abs/2211.07638

  2. [2]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15619--15629, 2023

  3. [3]

    Nonlinear and adaptive control with applications, volume 187

    Astolfi, A., Karagiannis, D., and Ortega, R. Nonlinear and adaptive control with applications, volume 187. Springer, 2008

  4. [4]

    V-JEPA: Latent Video Prediction for Visual Representation Learning, 2024

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., and Ballas, N. V-JEPA: Latent video prediction for visual representation learning, 2024. URL https://openreview.net/forum?id=WFYbBOEOtv

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W. E., Levine, S., Lu, Y., Michalew...

  6. [6]

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N. J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K.-H., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J...

  7. [7]

    Genie: Generative interactive environments, 2024

    Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., de Freitas, N., Singh, S., and Rocktäschel, T. Genie: Generative interactive environmen...

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers, 2021. URL https://arxiv.org/abs/2104.14294

  9. [9]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion, 2024. URL https://arxiv.org/abs/2303.04137

  10. [10]

    Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

    Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018. URL https://arxiv.org/abs/1805.12114

  11. [11]

    PILCO: A Model-Based and Data-Efficient Approach to Policy Search

    Deisenroth, M. P. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011. URL https://api.semanticscholar.org/CorpusID:14273320

  12. [12]

    Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning, 2024

    Ding, Z., Zhang, A., Tian, Y., and Zheng, Q. Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning, 2024. URL https://arxiv.org/abs/2402.03570

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929

  14. [14]

    Learning Universal Policies via Text-Guided Video Generation

    Du, Y., Yang, M., Dai, B., Dai, H., Nachum, O., Tenenbaum, J. B., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation, 2023. URL https://arxiv.org/abs/2302.00111

  15. [15]

    Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

    Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control, 2018. URL https://arxiv.org/abs/1812.00568

  16. [17]

    Deep Visual Foresight for Planning Robot Motion

    Finn, C. and Levine, S. Deep visual foresight for planning robot motion, 2017. URL https://arxiv.org/abs/1610.00696

  17. [18]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219

  18. [20]

    Learning Latent Dynamics for Planning from Pixels

    Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels, 2019. URL https://arxiv.org/abs/1811.04551

  19. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603

  20. [22]

    Mastering Atari with Discrete World Models

    Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mastering atari with discrete world models, 2022. URL https://arxiv.org/abs/2010.02193

  21. [23]

    Mastering Diverse Domains through World Models

    Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models, 2024. URL https://arxiv.org/abs/2301.04104

  22. [24]

    Baku: An efficient transformer for multi-task policy learning, 2024

    Haldar, S., Peng, Z., and Pinto, L. Baku: An efficient transformer for multi-task policy learning, 2024. URL https://arxiv.org/abs/2406.07539

  23. [25]

    Temporal difference learning for model predictive control, 2022

    Hansen, N., Wang, X., and Su, H. Temporal difference learning for model predictive control, 2022. URL https://arxiv.org/abs/2203.04955

  24. [26]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Hansen, N., Su, H., and Wang, X. Td-mpc2: Scalable, robust world models for continuous control, 2024. URL https://arxiv.org/abs/2310.16828

  25. [27]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  26. [28]

    An Overview of Model Predictive Control

    Holkar, K. and Waghmare, L. M. An overview of model predictive control. International Journal of Control and Automation, 3(4): 47--63, 2010

  27. [29]

    Gaia-1: A generative world model for autonomous driving, 2023

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. Gaia-1: A generative world model for autonomous driving, 2023

  28. [30]

    Chain-of-thought predictive control, 2024

    Jia, Z., Thumuluri, V., Liu, F., Chen, L., Huang, Z., and Su, H. Chain-of-thought predictive control, 2024. URL https://arxiv.org/abs/2304.00776

  29. [31]

    Learning to Act from Actionless Videos through Dense Correspondences

    Ko, P.-C., Mao, J., Du, Y., Sun, S.-H., and Tenenbaum, J. B. Learning to act from actionless videos through dense correspondences, 2023. URL https://arxiv.org/abs/2310.08576

  30. [32]

    Behavior Generation with Latent Actions

    Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., and Pinto, L. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181

  31. [33]

    DeepMPC: Learning Deep Latent Features for Model Predictive Control

    Lenz, I., Knepper, R. A., and Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015. URL https://api.semanticscholar.org/CorpusID:10130184

  32. [34]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., He, L., and Sun, L. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. URL https://arxiv.org/abs/2402.17177

  33. [35]

    Eureka: Human-Level Reward Design via Coding Large Language Models

    Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models, 2024. URL https://arxiv.org/abs/2310.12931

  34. [36]

    Discovering and achieving goals via world models, 2021

    Mendonca, R., Rybkin, O., Daniilidis, K., Hafner, D., and Pathak, D. Discovering and achieving goals via world models, 2021. URL https://arxiv.org/abs/2110.09514

  35. [37]

    Alan: Autonomously exploring robotic agents in the real world, 2023 a

    Mendonca, R., Bahl, S., and Pathak, D. Alan: Autonomously exploring robotic agents in the real world, 2023 a . URL https://arxiv.org/abs/2302.06604

  36. [38]

    Structured world models from human videos, 2023 b

    Mendonca, R., Bahl, S., and Pathak, D. Structured world models from human videos, 2023 b . URL https://arxiv.org/abs/2308.10901

  37. [39]

    Transformers are sample-efficient world models, 2023

    Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample-efficient world models, 2023. URL https://arxiv.org/abs/2209.00588

  38. [40]

    Deep dynamics models for learning dexterous manipulation, 2019

    Nagabandi, A., Konoglie, K., Levine, S., and Kumar, V. Deep dynamics models for learning dexterous manipulation, 2019. URL https://arxiv.org/abs/1909.11652

  39. [41]

    R3M: A Universal Visual Representation for Robot Manipulation

    Nair, S., Rajeswaran, A., Kumar, V., Finn, C., and Gupta, A. R3m: A universal visual representation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601

  40. [42]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. Dinov2: Learning robust visual f...

  41. [43]

    Zero-Shot Visual Imitation

    Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation, 2018. URL https://arxiv.org/abs/1804.08606

  42. [44]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2, 2019. URL https://arxiv.org/abs/1906.00446

  43. [45]

    A Generalist Agent

    Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., and de Freitas, N. A generalist agent, 2022. URL https://arxiv.org/abs/2205.06175

  44. [46]

    Transformer-based world models are happy with 100k interactions, 2023

    Robine, J., Höftmann, M., Uelwer, T., and Harmeling, S. Transformer-based world models are happy with 100k interactions, 2023. URL https://arxiv.org/abs/2303.07109

  45. [47]

    ImageNet Large Scale Visual Recognition Challenge

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge, 2015. URL https://arxiv.org/abs/1409.0575

  46. [48]

    Planning to explore via self-supervised world models, 2020

    Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and Pathak, D. Planning to explore via self-supervised world models, 2020. URL https://arxiv.org/abs/2005.05960

  47. [49]

    Dyna, an Integrated Architecture for Learning, Planning, and Reacting

    Sutton, R. S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4): 160--163, 1991

  48. [50]

    DeepMind Control Suite

    Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. Deepmind control suite, 2018. URL https://arxiv.org/abs/1801.00690

  49. [51]

    A Generalized Iterative LQG Method for Locally-Optimal Feedback Control of Constrained Nonlinear Stochastic Systems

    Todorov, E. and Li, W. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005, American Control Conference, 2005., pp.\ 300--306. IEEE, 2005

  50. [52]

    Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

    Wang, J., Dasari, S., Srirama, M. K., Tulsiani, S., and Gupta, A. Manipulate by seeing: Creating manipulation controllers from pre-trained representations, 2023. URL https://arxiv.org/abs/2303.08135

  51. [53]

    Image quality assessment: from error visibility to structural similarity

    Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600--612, 2004. doi:10.1109/TIP.2003.819861

  52. [54]

    Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

    Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images, 2015. URL https://arxiv.org/abs/1506.07365

  53. [55]

    Any-point Trajectory Modeling for Policy Learning

    Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y., and Abbeel, P. Any-point trajectory modeling for policy learning, 2024. URL https://arxiv.org/abs/2401.00025

  54. [56]

    Information Theoretic MPC for Model-Based Reinforcement Learning

    Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J. M., Boots, B., and Theodorou, E. A. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp.\ 1714--1721. IEEE, 2017

  55. [57]

    Learning to manipulate deformable objects without demonstrations, 2020

    Wu, Y., Yan, W., Kurutach, T., Pinto, L., and Abbeel, P. Learning to manipulate deformable objects without demonstrations, 2020. URL https://arxiv.org/abs/1910.13439

  56. [58]

    Masked visual pre-training for motor control, 2022

    Xiao, T., Radosavovic, I., Darrell, T., and Malik, J. Masked visual pre-training for motor control, 2022. URL https://arxiv.org/abs/2203.06173

  57. [59]

    Learning predictive representations for deformable objects using contrastive estimation

    Yan, W., Vangipuram, A., Abbeel, P., and Pinto, L. Learning predictive representations for deformable objects using contrastive estimation. In Conference on Robot Learning, pp.\ 564--574. PMLR, 2021

  58. [60]

    Learning interactive real-world simulators, 2023

    Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., and Abbeel, P. Learning interactive real-world simulators, 2023

  59. [61]

    Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation, 2024

    Zhang, K., Li, B., Hauser, K., and Li, Y. Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation, 2024. URL https://arxiv.org/abs/2407.07889

  60. [63]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

  61. [64]

    Train Offline, Test Online: A Real Robot Learning Benchmark

    Zhou, G., Dean, V., Srirama, M. K., Rajeswaran, A., Pari, J., Hatch, K., Jain, A., Yu, T., Abbeel, P., Pinto, L., Finn, C., and Gupta, A. Train offline, test online: A real robot learning benchmark, 2023. URL https://arxiv.org/abs/2306.00942

  62. [65]

    Scaling Learning Algorithms Towards AI

    Bengio, Yoshua and LeCun, Yann. Scaling learning algorithms towards AI

  63. [66]

    A Fast Learning Algorithm for Deep Belief Nets

    Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee Whye. A fast learning algorithm for deep belief nets. Neural Computation, 2006

  64. [67]

    Deep Learning, 2016

  65. [68]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, 2023

  66. [69]

    Behavior Generation with Latent Actions, 2024

  67. [70]

    BAKU: An Efficient Transformer for Multi-Task Policy Learning, 2024

  68. [71]

    Eureka: Human-Level Reward Design via Coding Large Language Models, 2024

  69. [72]

    Mastering Diverse Domains through World Models, 2024

  70. [73]

    Temporal Difference Learning for Model Predictive Control, 2022

  71. [74]

    TD-MPC2: Scalable, Robust World Models for Continuous Control, 2024

  72. [75]

    Legged Locomotion in Challenging Terrains using Egocentric Vision, 2022

  73. [76]

    A Tutorial on Energy-Based Learning

    LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu Jie. A tutorial on energy-based learning. In Predicting Structured Data, 2006

  74. [77]

    Train Offline, Test Online: A Real Robot Learning Benchmark, 2023

  75. [78]

    RT-1: Robotics Transformer for Real-World Control at Scale, 2023

  76. [79]

    World Models

    Ha, David and Schmidhuber, Jürgen. World Models, 2018. doi:10.5281/ZENODO.1207631

  77. [80]

    LeCun, Yann. A...

  78. [81]

    Learning Latent Dynamics for Planning from Pixels, 2019

  79. [82]

    Transformers are Sample-Efficient World Models, 2023

  80. [83]

    Transformer-based World Models Are Happy With 100k Interactions, 2023

Showing first 80 references.