pith. machine review for the scientific record.

arxiv: 2412.14803 · v2 · submitted 2024-12-19 · 💻 cs.CV · cs.RO

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Pith reviewed 2026-05-12 18:31 UTC · model grok-4.3

keywords video diffusion models · robot policies · predictive visual representations · inverse dynamics · dexterous manipulation · generalist policies · Calvin benchmark

The pith

Robot policies that condition actions on future visual representations predicted by video diffusion models achieve an 18.6 percent relative improvement over the prior state of the art on the Calvin ABC-D generalization benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video diffusion models generate visual features containing both current scene details and predicted motion, which supply better guidance for learning robot actions than static image encoders. The authors introduce the Video Prediction Policy that trains an implicit inverse dynamics model by conditioning it on these future representations produced inside the diffusion process. They adapt a pre-trained video model through fine-tuning on robot datasets and internet human manipulation videos to make the forecasts more accurate for physical environments. This yields measurable gains in task success and generalization. A reader would care because embodied agents must anticipate change rather than merely observe the present state.

Core claim

The Video Prediction Policy learns an implicit inverse dynamics model conditioned on predicted future representations inside video diffusion models that have been fine-tuned on robot datasets together with internet-sourced human manipulation videos, producing higher success rates than previous vision-based policies.

What carries the argument

The Video Prediction Policy (VPP) conditions action outputs on future visual representations generated by fine-tuned video diffusion models.

If this is right

  • 18.6 percent relative improvement on the Calvin ABC-D generalization benchmark over prior state-of-the-art.
  • 31.6 percent higher success rates on complex real-world dexterous manipulation tasks.
  • Fine-tuning the video model on robot and human data produces more precise future predictions that better support policy learning.
  • Implicit inverse dynamics modeling gains from access to dynamic rather than static visual features.
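The first two figures above are stated as percentages, and the abstract calls the Calvin number a relative improvement. As a minimal sketch of the arithmetic, the snippet below contrasts a relative gain with an absolute gain in percentage points. The scores are hypothetical placeholders chosen for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Gain of `new` over `baseline`, expressed as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (new - baseline) / baseline


def absolute_improvement(baseline: float, new: float) -> float:
    """Gain of `new` over `baseline` in the same units as the scores."""
    return new - baseline


# Hypothetical scores on a 0-100 scale, NOT values reported in the paper.
baseline_score = 50.0
new_score = 59.3

rel = relative_improvement(baseline_score, new_score)
pts = absolute_improvement(baseline_score, new_score)
print(f"relative gain: {rel:.1f}% of baseline, absolute gain: {pts:.1f} points")
```

The same pair of scores yields two different headline numbers, which is why a reader needs the baseline value to interpret a relative figure.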

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same predictive representations could replace conventional vision encoders in other sequential control settings such as navigation or assembly.
  • Extending the video prediction horizon might allow the policy to plan over longer action sequences without an explicit world model.
  • Mixing human demonstration data during fine-tuning may improve zero-shot transfer from human videos to robot execution.
  • The approach could be combined with language-conditioned policies to handle open-ended instructions while retaining the temporal advantage.

Load-bearing premise

Video diffusion models inherently produce representations that capture future dynamics useful for guiding robot action selection.

What would settle it

Training a policy with the same fine-tuned diffusion encoder but using only current-frame representations instead of predicted future ones and measuring whether performance drops on the Calvin ABC-D benchmark and real dexterous tasks.
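The logic of this test can be illustrated with a toy example. The sketch below is not the paper's architecture: it assumes hypothetical one-dimensional dynamics (next state = state + action) and fits two linear inverse dynamics models, one that sees only the current state and one that also sees the future state. All function names and the dynamics are assumptions made for illustration.

```python
import random


def make_transitions(n, seed):
    """Toy 1-D transitions with hypothetical dynamics: next_state = state + action."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        state = rng.uniform(-1.0, 1.0)
        action = rng.uniform(-1.0, 1.0)
        rows.append((state, state + action, action))
    return rows


def fit_current_only(rows):
    """Least-squares fit of action ~ w * state (no future information)."""
    num = sum(s * a for s, _, a in rows)
    den = sum(s * s for s, _, _ in rows)
    return (num / den,)


def fit_with_future(rows):
    """Least-squares fit of action ~ w0 * state + w1 * next_state (2x2 normal equations)."""
    a00 = sum(s * s for s, _, _ in rows)
    a01 = sum(s * f for s, f, _ in rows)
    a11 = sum(f * f for _, f, _ in rows)
    b0 = sum(s * a for s, _, a in rows)
    b1 = sum(f * a for _, f, a in rows)
    det = a00 * a11 - a01 * a01
    return ((b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)


def mse(weights, rows):
    """Mean squared error of a linear model that uses the first len(weights) features."""
    total = 0.0
    for s, f, a in rows:
        feats = (s, f)[:len(weights)]
        pred = sum(w * x for w, x in zip(weights, feats))
        total += (pred - a) ** 2
    return total / len(rows)


train = make_transitions(500, seed=0)
held_out = make_transitions(200, seed=1)
w_current = fit_current_only(train)
w_future = fit_with_future(train)
mse_current = mse(w_current, held_out)
mse_future = mse(w_future, held_out)
print(f"current-only MSE: {mse_current:.4f}")
print(f"with-future MSE:  {mse_future:.2e}")
```

In this toy setting the action is recoverable only when the future state is available, so the current-only model cannot do better than guessing. Whether the same gap appears with real diffusion features is exactly what the proposed ablation would measure.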

read the original abstract

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) demonstrate the ability to predict future frames and showcase a strong understanding of physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise future, we fine-tune pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page at https://video-prediction-policy.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Video Prediction Policy (VPP), a generalist robot policy that fine-tunes pre-trained video diffusion models (VDMs) on robot datasets and internet human-manipulation videos to produce predictive visual representations. These representations are hypothesized to encode both static scene information and future dynamics, which are then used to condition an implicit inverse-dynamics model for action prediction. The authors report an 18.6% relative improvement over prior state-of-the-art on the Calvin ABC-D generalization benchmark and a 31.6% increase in success rate on complex real-world dexterous manipulation tasks.

Significance. If the reported gains are reproducible and the predictive conditioning is shown to be the operative factor, the work would offer a concrete route to injecting future-dynamics awareness into robot policies via existing video foundation models. This could meaningfully advance generalist embodied agents by moving beyond static image encoders, with direct implications for sim-to-real transfer and long-horizon manipulation.

major comments (2)
  1. [Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.
  2. [Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.
minor comments (1)
  1. [Abstract] The phrase 'implicit inverse dynamics model conditioned on predicted future representations' is introduced without a brief definition or pointer to the relevant equation, which may hinder readers outside the immediate subfield.
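The first major comment asks for episode counts and seed variance. As a rough sketch of what such reporting could look like, the snippet below computes a Wilson score interval for a success rate and the mean and standard deviation across seeds. The episode counts and per-seed rates are hypothetical placeholders, not results from the paper, and the choice of the Wilson interval is an assumption rather than anything the referee or authors specify.

```python
import math
import statistics


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical evaluation: 15 successes in 20 real-world episodes.
low, high = wilson_interval(15, 20)
print(f"success rate 0.75, interval [{low:.3f}, {high:.3f}]")

# Hypothetical per-seed success rates for the same method.
seed_rates = [0.72, 0.78, 0.75]
print(f"mean {statistics.mean(seed_rates):.3f}, std {statistics.stdev(seed_rates):.3f}")
```

With only 20 episodes the interval spans well over 30 percentage points, which shows why a headline gain is hard to assess without the episode count.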

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental reporting and hypothesis validation, and we have revised the paper to strengthen these elements while preserving the core contribution.

read point-by-point responses
  1. Referee: [Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.

    Authors: We agree that the abstract would benefit from greater specificity to improve verifiability. In the revised manuscript, we have updated the abstract to note that the reported gains are obtained under standard benchmark protocols (with full details on episode counts, seed variance, and baselines provided in the Experiments section). This maintains abstract length while directing readers to the supporting evidence.

    revision: yes

  2. Referee: [Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.

    Authors: The referee correctly notes that such an ablation is necessary to isolate the predictive component. The original manuscript compares VPP against multiple baselines with different encoders, but to directly address this point we have added a new ablation study. We fine-tune the identical VDM backbone and then train the policy using only its current-frame features (no future prediction or conditioning). Results in the revised Experiments section show that the predictive representations contribute additional gains beyond fine-tuning alone, supporting the hypothesis that future dynamics are a key factor.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent experimental validation

full rationale

The paper advances an empirical method: it states a hypothesis about VDM representations, fine-tunes a pre-trained video model on robot and human-manipulation videos to improve future-frame prediction, then trains a policy that conditions on the resulting representations for inverse-dynamics learning. All performance numbers (Calvin benchmark, real-world dexterous tasks) are reported as measured outcomes on held-out evaluation sets. No equations, definitions, or self-citations reduce the claimed gains to a fitted parameter renamed as prediction or to a self-referential premise; the central result remains falsifiable by ablation or external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no identifiable free parameters, axioms, or invented entities; a full-text audit would be required to enumerate any fitted scales, domain assumptions, or new postulated constructs.

pith-pipeline@v0.9.0 · 5526 in / 1108 out tokens · 56328 ms · 2026-05-12T18:31:41.928814+00:00 · methodology

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  6. Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    cs.AI 2026-05 unverdicted novelty 7.0

    A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

  7. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  8. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  9. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  10. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  11. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  12. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  13. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  14. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  15. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  16. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  17. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  18. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  19. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  20. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  21. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  22. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  23. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  24. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  25. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  26. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  27. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  28. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  29. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  30. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  31. CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

  32. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  33. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  34. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  35. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  36. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  37. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    cs.RO 2026-04 unverdicted novelty 4.0

    A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 33 Pith papers · 18 internal anchors

  1. [1]

    FirstName LastName , title =

  2. [2]

    IEEE Robotics and Automation Letters (RA-L) , volume=

    Oier Mees and Lukas Hermann and Erick Rosete-Beas and Wolfram Burgard , title =. IEEE Robotics and Automation Letters (RA-L) , volume=

  3. [3]

    Conference on robot learning , pages=

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning , author=. Conference on robot learning , pages=. 2020 , organization=

  4. [4]

    2024 , eprint=

    Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals , author=. 2024 , eprint=

  5. [6]

    2023 , eprint=

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation , author=. 2023 , eprint=

  6. [7]

    FirstName Alpher , title =

  7. [8]

    Journal of Foo , volume = 13, number = 1, pages =

    FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

  8. [9]

    Journal of Foo , volume = 14, number = 1, pages =

    FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

  9. [10]

    FirstName Alpher and FirstName Gamow , title =

  10. [11]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  11. [13]

    Advances in neural information processing systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

  12. [14]

    2022 , eprint=

    Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

  13. [15]

    International conference on machine learning , pages=

    A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

  14. [16]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    An empirical study of training self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  15. [17]

    Advances in neural information processing systems , volume=

    Unsupervised learning of visual features by contrasting cluster assignments , author=. Advances in neural information processing systems , volume=

  16. [18]

    International Conference on Machine Learning , pages=

    Data2vec: A general framework for self-supervised learning in speech, vision and language , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  17. [20]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  18. [21]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Simple but effective: Clip embeddings for embodied ai , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  19. [22]

    Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=

    Offline visual representation learning for embodied navigation , author=. Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=

  20. [24]

    international conference on machine learning , pages=

    The unsurprising effectiveness of pre-trained vision models for control , author=. international conference on machine learning , pages=. 2022 , organization=

  21. [25]

    Conference on Robot Learning , pages=

    Real-world robot learning with masked visual pre-training , author=. Conference on Robot Learning , pages=. 2023 , organization=

  22. [29]

    Advances in Neural Information Processing Systems , volume=

    Where are we in the search for an artificial visual cortex for embodied intelligence? , author=. Advances in Neural Information Processing Systems , volume=

  23. [30]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  24. [31]

    something something

    The" something something" video database for learning and evaluating visual common sense , author=. Proceedings of the IEEE international conference on computer vision , pages=

  25. [32]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  26. [34]

    Advances in Neural Information Processing Systems , volume=

    Learning universal policies via text-guided video generation , author=. Advances in Neural Information Processing Systems , volume=

  27. [36]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  28. [37]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Denoising diffusion autoencoders are unified self-supervised learners , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  29. [38]

    Advances in Neural Information Processing Systems , volume=

    Diffusion hyperfeatures: Searching through time and space for semantic correspondence , author=. Advances in Neural Information Processing Systems , volume=

  30. [42]

    OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA: An Open-Source Vision-Language-Action Model , author=. arXiv preprint arXiv:2406.09246 , year=

  31. [43]

    Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273, 2024

    HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers , author=. arXiv preprint arXiv:2410.05273 , year=

  32. [44]

    Conference on Robot Learning , pages=

    Bc-z: Zero-shot task generalization with robotic imitation learning , author=. Conference on Robot Learning , pages=. 2022 , organization=

  33. [46]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation , author=. arXiv preprint arXiv:2410.07864 , year=

  34. [47]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  35. [48]

    2009 IEEE conference on computer vision and pattern recognition , pages=

    Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

  36. [49]

    A Survey on Vision-Language-Action Models for Embodied AI

    A Survey on Vision-Language-Action Models for Embodied AI , author=. arXiv preprint arXiv:2405.14093 , year=

  37. [50]

    Advances in Neural Information Processing Systems , volume=

    Video diffusion models , author=. Advances in Neural Information Processing Systems , volume=

  38. [56]

    2024 , url=

    Video generation models as world simulators , author=. 2024 , url=

  39. [57]

    3d diffuser actor: Policy diffusion with 3d scene representations, 2024

    3d diffuser actor: Policy diffusion with 3d scene representations , author=. arXiv preprint arXiv:2402.10885 , year=

  40. [59]

    Advances in neural information processing systems , volume=

    Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

  41. [60]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456

  42. [62]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision

  43. [66]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  44. [67]

    Grounding language with visual affordances over unstructured data

    Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023

  45. [68]

    Robonet: Large-scale multi-robot learning

    Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215

  46. [71]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  47. [72]

    Perceiver: General perception with iterative attention

    Perceiver: General perception with iterative attention. In International conference on machine learning, 2021

  48. [74]

    Prediction with action: Visual policy learning via joint denoising process

    Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  49. [75]

    Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation

    Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation. Advances in Neural Information Processing Systems

  50. [77]

    Flamingo: a visual language model for few-shot learning

    Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  51. [78]

    Data2vec: A general framework for self-supervised learning in speech, vision and language

    Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp. 1298–1312. PMLR, 2022

  52. [79]

    BEiT: BERT Pre-Training of Image Transformers

    Bao, H., Dong, L., Piao, S., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  53. [80]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation

    Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., and Kirmani, S. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024

  54. [81]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Black, K., Nakamoto, M., Atreya, P., Walke, H., Finn, C., Kumar, A., and Levine, S. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  55. [82]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a

  56. [83]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b

  57. [84]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  58. [85]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  59. [86]

    Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023

  60. [87]

    Video generation models as world simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

  61. [88]

    Unsupervised learning of visual features by contrasting cluster assignments

    Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020

  62. [89]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., and Xia, F. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455–14465, 2024a

  63. [90]

    A simple framework for contrastive learning of visual representations

    Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020

  64. [91]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., and Lin, L. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023

  65. [92]

    An empirical study of training self-supervised vision transformers

    Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9640–9649, 2021

  66. [93]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai

    Chen, X., Guo, J., He, T., Zhang, C., Zhang, P., Yang, D. C., Zhao, L., and Bian, J. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai. arXiv preprint arXiv:2411.00785, 2024b

  67. [94]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  68. [95]

    Learning universal policies via text-guided video generation

    Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

  69. [96]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., and Levine, S. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021

  70. [97]

    The "something something" video database for learning and evaluating visual common sense

    Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017

  71. [98]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022

  72. [99]

    Seer: Language Instructed Video Prediction with Latent Diffusion Models

    Gu, X., Wen, C., Ye, W., Song, J., and Gao, Y. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023

  73. [100]

    Prediction with action: Visual policy learning via joint denoising process

    Guo, Y., Hu, Y., Zhang, J., Wang, Y.-J., Chen, X., Lu, C., and Chen, J. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  74. [101]

    Improving vision-language-action model with online reinforcement learning

    Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.-J., Hu, Y., and Chen, J. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

  75. [102]

    Gupta, G., Yadav, K., Gal, Y., Batra, D., Kira, Z., Lu, C., and Rudner, T. G. Pre-trained text-to-image diffusion models are versatile representation learners for control. arXiv preprint arXiv:2405.05852, 2024

  76. [103]

    Masked autoencoders are scalable vision learners

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022

  77. [104]

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  78. [105]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

  79. [106]

    Perceiver: General perception with iterative attention

    Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International conference on machine learning, pp. 4651–4664. PMLR, 2021

  80. [107]

    Language-driven representation learning for robotics

    Karamcheti, S., Nair, S., Chen, A. S., Kollar, T., Finn, C., Sadigh, D., and Liang, P. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023
