DINO-WM: World Models on Pre-trained Visual Features Enable Zero-shot Planning
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-17 16:00 UTC · model grok-4.3
The pith
DINO-WM uses pre-trained DINOv2 patch features to build world models that support zero-shot planning from offline data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO-WM learns visual dynamics by predicting future DINOv2 patch features from offline behavioral trajectories, enabling task-agnostic planning at test time by optimizing action sequences to match goal features.
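Schematically (the symbols z_t, f_theta, o_t, and a_t below are editorial notation, not taken from the paper), the claim amounts to encoding observations into patch features, rolling a learned dynamics model forward under candidate actions, and choosing the action sequence whose predicted features end up closest to the goal's features:

```latex
% Schematic formulation; notation is editorial, not the paper's own.
z_t = \mathrm{DINOv2}(o_t), \qquad
\hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t), \qquad
a_{1:T}^{\star} = \arg\min_{a_{1:T}} \bigl\| \hat{z}_T - \mathrm{DINOv2}(o_{\mathrm{goal}}) \bigr\|^2 .
```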
What carries the argument
Prediction of future spatial patch features extracted by DINOv2, serving as the representation for modeling dynamics and enabling action optimization toward goal features.
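To make the mechanism concrete, here is a minimal sketch of the training loop this description implies: a frozen encoder produces patch features, and a small dynamics model is fit to predict the next step's features from offline (observation, action, next observation) tuples. The `encoder` callable, network sizes, and per-patch MLP are editorial simplifications under stated assumptions, not necessarily the authors' architecture.

```python
# Minimal sketch, assuming `encoder` is a frozen DINOv2-style patch-feature
# extractor returning (B, P, D) tensors; the MLP dynamics model is illustrative.
import torch
import torch.nn as nn

class PatchDynamics(nn.Module):
    """Predicts next-step patch features from current features and an action."""
    def __init__(self, feat_dim=384, act_dim=2, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, z, a):
        # z: (B, P, D) patch features; a: (B, A) action, broadcast to every patch.
        a = a.unsqueeze(1).expand(-1, z.shape[1], -1)
        return self.net(torch.cat([z, a], dim=-1))

def train_step(encoder, dynamics, optimizer, obs, act, obs_next):
    with torch.no_grad():                 # the pre-trained encoder stays frozen
        z, z_next = encoder(obs), encoder(obs_next)
    loss = nn.functional.mse_loss(dynamics(z, act), z_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```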
If this is right
- Action sequences can be optimized at test time in feature space to reach arbitrary observational goals.
- A single model trained on offline trajectories supports multiple tasks without retraining or reward engineering.
- Zero-shot behavioral solutions become available across task families such as navigation, manipulation, and particle control.
- Planning performance exceeds prior methods that require demonstrations or task-specific components.
Where Pith is reading between the lines
- Pre-trained vision features appear to encode sufficient state information for dynamics modeling in many control settings.
- The same feature-prediction approach could be evaluated on physical robots to test whether the planning transfers beyond simulation.
- General planning systems might require fewer custom perception modules if similar pre-trained features prove broadly useful.
Load-bearing premise
Predicting future DINOv2 patch features alone supplies enough information about environment dynamics to support reliable planning without any visual reconstruction or task-specific additions.
What would settle it
Observe whether action sequences optimized under DINO-WM reach the intended goals in a new environment whose dynamics depend on visual details absent from the DINOv2 patch features.
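One way to run that test is the feature-space shooting optimization the claim implies: sample candidate action sequences, roll them through the learned dynamics model, score them by distance to the goal features, and refit the sampling distribution. The cross-entropy method below is one common choice shown purely as an illustration; the paper's exact planner may differ. `encoder` and `dynamics` are the components sketched earlier.

```python
# Illustrative cross-entropy-method planner over learned feature dynamics.
# Whether the resulting action sequences reach goals in a new environment is
# exactly the empirical question posed above.
import torch

def plan_cem(encoder, dynamics, obs_0, obs_goal, horizon=10, act_dim=2,
             pop=256, n_elite=32, iters=5):
    z0 = encoder(obs_0)                          # (1, P, D) current features
    zg = encoder(obs_goal)                       # (1, P, D) goal features
    mu = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, act_dim)   # candidate plans
        z = z0.expand(pop, -1, -1).clone()
        for t in range(horizon):                 # roll out entirely in feature space
            z = dynamics(z, acts[:, t])
        cost = ((z - zg.expand_as(z)) ** 2).mean(dim=(1, 2))   # goal-feature distance
        elite = acts[cost.argsort()[:n_elite]]
        mu, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4   # refit the sampler
    return mu                                    # planned action sequence (horizon, act_dim)
```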
Original abstract
The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DINO-WM, a world model that predicts future DINOv2 patch features from offline trajectories to enable test-time action sequence optimization for achieving observational goals. It claims zero-shot behavioral solutions across six environments (mazes, push manipulation with varied shapes, multi-particle scenarios) without expert demonstrations, reward modeling, or inverse models, outperforming prior SOTA.
Significance. If the results hold, this could advance scalable, task-agnostic visual world models by leveraging pre-trained semantic features instead of reconstruction or task-specific components, with potential impact on model-based planning in robotics.
major comments (2)
- [Method (dynamics model)] The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.
- [Experiments] Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.
minor comments (2)
- [Abstract] Abstract: the phrase 'arbitrarily configured mazes' would benefit from a brief clarification of how configurations are varied at test time and whether the same model is used without retraining.
- [Method] Notation: ensure consistent use of 'patch features' versus 'visual features' when describing the prediction target to avoid ambiguity in the method description.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments.
Point-by-point responses
Referee: The central claim that DINOv2 patch feature prediction alone yields rollouts accurate enough for reliable planning hinges on an untested assumption about feature sensitivity. DINOv2 is pretrained for semantic correspondence; in push-manipulation and multi-particle settings this risks invariance to small pose or contact changes, so optimized actions may match features yet fail in pixel/state space. This assumption is load-bearing for the zero-shot results on contact-rich tasks.
Authors: We appreciate this insightful observation regarding the potential limitations of DINOv2 features in capturing fine-grained dynamics. Our approach relies on the empirical observation that these features enable effective planning, as evidenced by the successful zero-shot performance in the push manipulation and multi-particle environments. However, we acknowledge that a direct test of feature sensitivity to small changes was not included in the original manuscript. In the revision, we will add a discussion section and supporting visualizations to analyze how DINOv2 patch features respond to pose and contact variations in our tasks. This will help substantiate that the features provide sufficient sensitivity for the planning objectives. (Revision: partial.)
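A probe of the kind the authors promise could be as simple as measuring how much the patch features move under small image perturbations. The sketch below is an editorial illustration, not the authors' analysis; `encoder` is any callable returning DINOv2 patch features for an image batch, and the pixel-shift perturbation is a crude stand-in for pose or contact changes.

```python
# Hypothetical sensitivity probe (editorial sketch, not from the paper):
# how far do patch features move when the image is shifted by a few pixels?
import torch

def feature_shift_sensitivity(encoder, obs, shifts=(1, 2, 4, 8)):
    """Return 1 - cosine similarity between features of `obs` and shifted copies."""
    z_ref = encoder(obs).flatten(start_dim=1)          # (B, P*D)
    sensitivity = {}
    for s in shifts:
        shifted = torch.roll(obs, shifts=s, dims=-1)   # horizontal pixel shift
        z_s = encoder(shifted).flatten(start_dim=1)
        cos = torch.nn.functional.cosine_similarity(z_ref, z_s, dim=-1)
        sensitivity[s] = float(1.0 - cos.mean())       # larger = more sensitive
    return sensitivity
```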
Referee: Experimental results: the abstract reports strong outperformance on six environments, but without visible ablations isolating the contribution of pre-trained DINOv2 features versus the prediction architecture, error bars, or controls for post-hoc hyperparameter choices, it is unclear whether the gains are robust or environment-specific. This directly affects verification of the task-agnostic planning claim.
Authors: We agree that additional experimental details are necessary to fully support the claims of robustness and task-agnosticism. We will revise the manuscript to include ablations that isolate the role of pre-trained DINOv2 features (e.g., comparing to training from scratch or using other feature extractors), report error bars across multiple random seeds, and provide details on the hyperparameter tuning process to ensure it was not post-hoc. These changes will strengthen the evidence for the general applicability of DINO-WM across diverse environments. (Revision: yes.)
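The promised ablations and error bars amount to standard multi-seed comparison bookkeeping. A hedged sketch is below; `run_planner` and the encoder names are hypothetical placeholders for whatever evaluation pipeline and feature extractors are actually compared, and nothing here reflects the authors' real experimental protocol.

```python
# Editorial sketch of ablation bookkeeping: compare feature extractors across
# random seeds and report mean/std success rates. All names are hypothetical.
import statistics

def evaluate(encoders, envs, seeds, run_planner):
    results = {}
    for name, encoder in encoders.items():
        for env in envs:
            scores = [run_planner(encoder, env, seed=s) for s in seeds]  # per-seed success rate
            results[(name, env)] = (
                statistics.mean(scores),
                statistics.stdev(scores) if len(scores) > 1 else 0.0,
            )
    return results

# Example usage (hypothetical components):
# results = evaluate(
#     encoders={"dinov2": dino_encoder, "scratch": scratch_encoder},
#     envs=["maze", "push", "particles"],
#     seeds=range(5),
#     run_planner=run_planner,
# )
```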
Circularity Check
No significant circularity; derivation relies on standard supervised feature prediction from data
full rationale
The paper trains a dynamics model to predict future DINOv2 patch features from offline trajectories and then optimizes actions at test time to reach goal features. This is a conventional supervised prediction setup followed by planning, with no equations or steps that reduce by construction to fitted parameters, self-defined quantities, or load-bearing self-citations. The central claim (zero-shot planning via feature prediction) remains independent of the reported results and does not import uniqueness theorems or ansatzes from the authors' prior work in a circular manner. External pre-trained DINOv2 features and standard optimization provide the necessary separation from the target performance metrics.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters
axioms (1)
- domain assumption: Pre-trained DINOv2 spatial patch features contain sufficient information to model visual dynamics for planning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic · Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predica...
- Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic · Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...
- Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations · ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
- Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations · ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
- 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS · 3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.
- TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance · TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
- Predictive but Not Plannable: RC-aux for Latent World Models · RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
- VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis · VISION-SLS learns visual features with state-dependent error bounds and optimizes causal affine output-feedback policies via system level synthesis to achieve safe nonlinear control from RGB images.
- Safe Control using Learned Safety Filters and Adaptive Conformal Inference · ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user pa...
- A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies · Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
- Grounded World Model for Semantically Generalizable Planning · A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation · A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
- Hierarchical Planning with Latent World Models · Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
- Metriplector: From Field Theory to Neural Architecture · Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...
- RISE: Self-Improving Robot Policy with Compositional World Model · RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning · V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
- FLARE: Robot Learning with Implicit World Modeling · FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions · UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
- Agentic Reasoning for Large Language Models · The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...