Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Pith reviewed 2026-05-12 18:31 UTC · model grok-4.3
The pith
Robot policies that condition actions on future video predictions from diffusion models report an 18.6 percent relative improvement over the prior state of the art on the Calvin ABC-D benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Video Prediction Policy learns an implicit inverse dynamics model conditioned on predicted future representations inside video diffusion models that have been fine-tuned on robot datasets together with internet-sourced human manipulation videos, producing higher success rates than previous vision-based policies.
What carries the argument
The Video Prediction Policy (VPP), which conditions action outputs on future visual representations generated by fine-tuned video diffusion models.
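As a rough illustration of what conditioning actions on predicted future representations means, the toy sketch below infers an action from a current feature vector and a predicted future one. Everything in it is an assumption made for illustration: `toy_future_predictor` is a hypothetical stand-in for the fine-tuned video diffusion model, features are plain lists of floats, and the inverse dynamics step is an explicit difference rather than the learned, implicit model the paper describes.

```python
from typing import List, Sequence

Vector = List[float]


def toy_future_predictor(current: Sequence[float],
                         velocity: Sequence[float],
                         horizon: int) -> List[Vector]:
    """Hypothetical stand-in for a video model.

    Rolls the current features forward linearly for `horizon` steps.
    A real video diffusion model would produce learned latent features.
    """
    return [[c + (k + 1) * v for c, v in zip(current, velocity)]
            for k in range(horizon)]


def toy_inverse_dynamics(current: Sequence[float],
                         predicted_future: Sequence[Sequence[float]]) -> Vector:
    """Toy inverse dynamics: the action is the feature change needed to
    reach the first predicted future step from the current features."""
    if not predicted_future:
        raise ValueError("need at least one predicted future step")
    next_step = predicted_future[0]
    return [n - c for n, c in zip(next_step, current)]


# Illustrative values only; they are not drawn from the paper.
current_features = [0.0, 1.0]
future = toy_future_predictor(current_features, velocity=[0.5, -0.25], horizon=3)
action = toy_inverse_dynamics(current_features, future)
print(action)
```

The point of the sketch is the data flow: the action depends on where the scene is predicted to go, not only on where it is now. A policy that sees only `current_features` has no such signal.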
If this is right
- 18.6 percent relative improvement on the Calvin ABC-D generalization benchmark over prior state-of-the-art.
- 31.6 percent higher success rates on complex real-world dexterous manipulation tasks.
- Fine-tuning the video model on robot and human data produces more precise future predictions that better support policy learning.
- Implicit inverse dynamics modeling gains from access to dynamic rather than static visual features.
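The Calvin ABC-D figure above is a relative improvement, which is easy to confuse with a gain in percentage points. The short sketch below shows the arithmetic for both readings. The scores used are placeholders chosen for round numbers and are not taken from the paper; the function names are hypothetical.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional gain of `new` over `baseline` (0.25 means 25 percent)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


def absolute_improvement(baseline: float, new: float) -> float:
    """Plain difference between the two scores, in the score's own units."""
    return new - baseline


# Placeholder scores, not results from the paper.
baseline_score, new_score = 4.0, 5.0
print(relative_improvement(baseline_score, new_score))  # 0.25
print(absolute_improvement(baseline_score, new_score))  # 1.0
```

The same absolute gain yields a larger relative improvement when the baseline is low, which is one reason a reader needs the underlying baseline scores to interpret a relative figure.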
Where Pith is reading between the lines
- The same predictive representations could replace conventional vision encoders in other sequential control settings such as navigation or assembly.
- Extending the video prediction horizon might allow the policy to plan over longer action sequences without an explicit world model.
- Mixing human demonstration data during fine-tuning may improve zero-shot transfer from human videos to robot execution.
- The approach could be combined with language-conditioned policies to handle open-ended instructions while retaining the temporal advantage.
Load-bearing premise
Video diffusion models inherently produce representations that capture future dynamics useful for guiding robot action selection.
What would settle it
Training a policy with the same fine-tuned diffusion encoder but using only current-frame representations instead of predicted future ones and measuring whether performance drops on the Calvin ABC-D benchmark and real dexterous tasks.
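One way to summarize such an ablation, assuming each variant is evaluated with the same set of random seeds, is a paired comparison of per-seed success rates. The sketch below is a minimal version of that bookkeeping. The function name and the numbers are hypothetical placeholders, not measurements from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_ablation_summary(with_future: Sequence[float],
                            current_only: Sequence[float]) -> Dict[str, float]:
    """Compare two policy variants seed by seed.

    with_future:  success rate per seed when conditioning on predicted futures.
    current_only: success rate per seed with current-frame features only.
    """
    if len(with_future) != len(current_only):
        raise ValueError("both variants need the same seeds")
    if len(with_future) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    diffs = [a - b for a, b in zip(with_future, current_only)]
    return {
        "mean_diff": mean(diffs),   # average gain from future conditioning
        "std_diff": stdev(diffs),   # sample spread of that gain across seeds
        "num_seeds": float(len(diffs)),
    }


# Placeholder success rates for three seeds; not data from the paper.
summary = paired_ablation_summary([0.80, 0.90, 0.70], [0.70, 0.85, 0.65])
print(summary)
```

A mean difference that is small relative to its spread across seeds would suggest the gain comes mostly from fine-tuning the encoder rather than from the predicted future representations.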
Original abstract
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) demonstrate the ability to predict future frames and showcase a strong understanding of physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise future, we fine-tune pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page at https://video-prediction-policy.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Video Prediction Policy (VPP), a generalist robot policy that fine-tunes pre-trained video diffusion models (VDMs) on robot datasets and internet human-manipulation videos to produce predictive visual representations. These representations are hypothesized to encode both static scene information and future dynamics, which are then used to condition an implicit inverse-dynamics model for action prediction. The authors report an 18.6% relative improvement over prior state-of-the-art on the Calvin ABC-D generalization benchmark and a 31.6% increase in success rate on complex real-world dexterous manipulation tasks.
Significance. If the reported gains are reproducible and the predictive conditioning is shown to be the operative factor, the work would offer a concrete route to injecting future-dynamics awareness into robot policies via existing video foundation models. This could meaningfully advance generalist embodied agents by moving beyond static image encoders, with direct implications for sim-to-real transfer and long-horizon manipulation.
major comments (2)
- [Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.
- [Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.
minor comments (1)
- [Abstract] The phrase 'implicit inverse dynamics model conditioned on predicted future representations' is introduced without a brief definition or pointer to the relevant equation, which may hinder readers outside the immediate subfield.
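The first major comment turns on how much a success rate can be trusted without an episode count. As a hedged illustration, the sketch below computes a Wilson score interval for a success rate from a number of successes and a number of episodes. This is one common choice of interval, not a method the paper or the referee names, and the counts in the example are placeholders.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, episodes: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95 percent coverage.
    """
    if episodes <= 0 or not 0 <= successes <= episodes:
        raise ValueError("need 0 <= successes <= episodes and episodes > 0")
    p = successes / episodes
    z2 = z * z
    denom = 1.0 + z2 / episodes
    center = (p + z2 / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes
                         + z2 / (4 * episodes ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts, not taken from the paper.
low, high = wilson_interval(successes=50, episodes=100)
print(round(low, 3), round(high, 3))
```

With few episodes the interval is wide, so two policies whose reported rates differ by several points may not be distinguishable; this is why the episode count matters for reading the headline numbers.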
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental reporting and hypothesis validation, and we have revised the paper to strengthen these elements while preserving the core contribution.
Point-by-point responses
- Referee: [Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.
  Authors: We agree that the abstract would benefit from greater specificity to improve verifiability. In the revised manuscript, we have updated the abstract to note that the reported gains are obtained under standard benchmark protocols (with full details on episode counts, seed variance, and baselines provided in the Experiments section). This maintains abstract length while directing readers to the supporting evidence.
  Revision: yes
- Referee: [Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.
  Authors: The referee correctly notes that such an ablation is necessary to isolate the predictive component. The original manuscript compares VPP against multiple baselines with different encoders, but to directly address this point we have added a new ablation study. We fine-tune the identical VDM backbone and then train the policy using only its current-frame features (no future prediction or conditioning). Results in the revised Experiments section show that the predictive representations contribute additional gains beyond fine-tuning alone, supporting the hypothesis that future dynamics are a key factor.
  Revision: yes
Circularity Check
No circularity: empirical pipeline with independent experimental validation
The paper advances an empirical method: it states a hypothesis about VDM representations, fine-tunes a pre-trained video model on robot and human-manipulation videos to improve future-frame prediction, then trains a policy that conditions on the resulting representations for inverse-dynamics learning. All performance numbers (Calvin benchmark, real-world dexterous tasks) are reported as measured outcomes on held-out evaluation sets. No equations, definitions, or self-citations reduce the claimed gains to a fitted parameter renamed as prediction or to a self-referential premise; the central result remains falsifiable by ablation or external replication.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 37 Pith papers
- CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
  CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
- NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
  NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
  A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models
  Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
- Mask World Model: Predicting What Matters for Robust Robot Policy Learning
  Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
- Action Images: End-to-End Policy Learning via Multiview Video Generation
  Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
- JailWAM: Jailbreaking World Action Models in Robot Control
  JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
- ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
  ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
- From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
  A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
- Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
  A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
- MotuBrain: An Advanced World Action Model for Robot Control
  MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
  Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
- Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
  Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
  AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
- Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
  Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
  MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
- Fast-WAM: Do World Action Models Need Test-time Future Imagination?
  Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
- World Action Models are Zero-shot Policies
  DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
  Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
  Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
- Unified Video Action Model
  UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
  DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
- CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
  CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
  StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
  A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
- Motus: A Unified Latent Action World Model
  Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
  A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
Reference graph
Works this paper leans on
-
[1]
FirstName LastName , title =
-
[2]
IEEE Robotics and Automation Letters (RA-L) , volume=
Oier Mees and Lukas Hermann and Erick Rosete-Beas and Wolfram Burgard , title =. IEEE Robotics and Automation Letters (RA-L) , volume=
-
[3]
Conference on robot learning , pages=
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning , author=. Conference on robot learning , pages=. 2020 , organization=
work page 2020
-
[4]
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals , author=. 2024 , eprint=
work page 2024
-
[6]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation , author=. 2023 , eprint=
work page 2023
-
[7]
FirstName Alpher , title =
-
[8]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[9]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[10]
FirstName Alpher and FirstName Gamow , title =
-
[11]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[13]
Advances in neural information processing systems , volume=
Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=
- [14]
-
[15]
International conference on machine learning , pages=
A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=
work page 2020
-
[16]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
An empirical study of training self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[17]
Advances in neural information processing systems , volume=
Unsupervised learning of visual features by contrasting cluster assignments , author=. Advances in neural information processing systems , volume=
-
[18]
International Conference on Machine Learning , pages=
Data2vec: A general framework for self-supervised learning in speech, vision and language , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[20]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[21]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Simple but effective: Clip embeddings for embodied ai , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[22]
Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=
Offline visual representation learning for embodied navigation , author=. Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=
work page 2023
-
[24]
international conference on machine learning , pages=
The unsurprising effectiveness of pre-trained vision models for control , author=. international conference on machine learning , pages=. 2022 , organization=
work page 2022
-
[25]
Conference on Robot Learning , pages=
Real-world robot learning with masked visual pre-training , author=. Conference on Robot Learning , pages=. 2023 , organization=
work page 2023
-
[29]
Advances in Neural Information Processing Systems , volume=
Where are we in the search for an artificial visual cortex for embodied intelligence? , author=. Advances in Neural Information Processing Systems , volume=
-
[30]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[31]
The" something something" video database for learning and evaluating visual common sense , author=. Proceedings of the IEEE international conference on computer vision , pages=
-
[32]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[34]
Advances in Neural Information Processing Systems , volume=
Learning universal policies via text-guided video generation , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[37]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Denoising diffusion autoencoders are unified self-supervised learners , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[38]
Advances in Neural Information Processing Systems , volume=
Diffusion hyperfeatures: Searching through time and space for semantic correspondence , author=. Advances in Neural Information Processing Systems , volume=
-
[42]
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA: An Open-Source Vision-Language-Action Model , author=. arXiv preprint arXiv:2406.09246 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers , author=. arXiv preprint arXiv:2410.05273 , year=
-
[44]
Conference on Robot Learning , pages=
Bc-z: Zero-shot task generalization with robotic imitation learning , author=. Conference on Robot Learning , pages=. 2022 , organization=
work page 2022
-
[46]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation , author=. arXiv preprint arXiv:2410.07864 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[48]
2009 IEEE conference on computer vision and pattern recognition , pages=
Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=
work page 2009
-
[49]
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI , author=. arXiv preprint arXiv:2405.14093 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
Advances in Neural Information Processing Systems , volume=
Video diffusion models , author=. Advances in Neural Information Processing Systems , volume=
- [56]
-
[57]
3d diffuser actor: Policy diffusion with 3d scene representations, 2024
3d diffuser actor: Policy diffusion with 3d scene representations , author=. arXiv preprint arXiv:2402.10885 , year=
-
[59]
Advances in neural information processing systems , volume=
Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=
-
[60]
Score-Based Generative Modeling through Stochastic Differential Equations
Score-based generative modeling through stochastic differential equations , author=. arXiv preprint arXiv:2011.13456 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[62]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[66]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[67]
2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=
Grounding language with visual affordances over unstructured data , author=. 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2023 , organization=
work page 2023
-
[68]
Robonet: Large-scale multi-robot learning,
Robonet: Large-scale multi-robot learning , author=. arXiv preprint arXiv:1910.11215 , year=
-
[71]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Align your latents: High-resolution video synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[72]
International conference on machine learning , pages=
Perceiver: General perception with iterative attention , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[74]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Prediction with action: Visual policy learning via joint denoising process , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[75]
Advances in Neural Information Processing Systems , volume=
Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation , author=. Advances in Neural Information Processing Systems , volume=
-
[77]
Flamingo: a visual language model for few-shot learning
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022
work page 2022
-
[78]
Data2vec: A general framework for self-supervised learning in speech, vision and language
Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp.\ 1298--1312. PMLR, 2022
work page 2022
-
[79]
BEiT: BERT Pre-Training of Image Transformers
Bao, H., Dong, L., Piao, S., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021
work page internal anchor Pith review arXiv 2021
-
[80]
arXiv preprint arXiv:2409.16283 (2024)
Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., and Kirmani, S. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024
- [81]
-
[82]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023 a
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[83]
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 22563--22575, 2023 b
work page 2023
-
[84]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[85]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
-
[86]
Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023
-
[87]
Video generation models as world simulators
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators
-
[88]
Unsupervised learning of visual features by contrasting cluster assignments
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020
-
[89]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., and Xia, F. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455–14465, 2024a
-
[90]
A simple framework for contrastive learning of visual representations
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020
-
[91]
Control-a-video: Controllable text-to-video generation with diffusion models
Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., and Lin, L. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023
-
[92]
An empirical study of training self-supervised vision transformers
Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649, 2021
-
[93]
Chen, X., Guo, J., He, T., Zhang, C., Zhang, P., Yang, D. C., Zhao, L., and Bian, J. Igor: Image-goal representations are the atomic control units for foundation models in embodied AI. arXiv preprint arXiv:2411.00785, 2024b
-
[94]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023
-
[95]
Learning universal policies via text-guided video generation
Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024
-
[96]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., and Levine, S. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021
-
[97]
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850, 2017
-
[98]
Ego4d: Around the world in 3,000 hours of egocentric video
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022
-
[99]
Seer: Language Instructed Video Prediction with Latent Diffusion Models
Gu, X., Wen, C., Ye, W., Song, J., and Gao, Y. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023
-
[100]
Prediction with action: Visual policy learning via joint denoising process
Guo, Y., Hu, Y., Zhang, J., Wang, Y.-J., Chen, X., Lu, C., and Chen, J. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
-
[101]
Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.-J., Hu, Y., and Chen, J. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025
- [102]
-
[103]
Masked autoencoders are scalable vision learners
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022
-
[104]
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022
-
[105]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022
-
[106]
Perceiver: General perception with iterative attention
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pp. 4651–4664. PMLR, 2021
-
[107]
Language-driven representation learning for robotics
Karamcheti, S., Nair, S., Chen, A. S., Kollar, T., Finn, C., Sadigh, D., and Liang, P. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023