arxiv: 2412.14803 · v2 · submitted 2024-12-19 · 💻 cs.CV · cs.RO


Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen


Pith reviewed 2026-05-12 18:31 UTC · model grok-4.3


keywords: video diffusion models · robot policies · predictive visual representations · inverse dynamics · dexterous manipulation · generalist policies · Calvin benchmark


The pith

Robot policies that condition actions on predicted future representations from video diffusion models report an 18.6 percent relative improvement over the prior state of the art on the Calvin ABC-D benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video diffusion models generate visual features containing both current scene details and predicted motion, which supply better guidance for learning robot actions than static image encoders. The authors introduce the Video Prediction Policy that trains an implicit inverse dynamics model by conditioning it on these future representations produced inside the diffusion process. They adapt a pre-trained video model through fine-tuning on robot datasets and internet human manipulation videos to make the forecasts more accurate for physical environments. This yields measurable gains in task success and generalization. A reader would care because embodied agents must anticipate change rather than merely observe the present state.

Core claim

The Video Prediction Policy learns an implicit inverse dynamics model conditioned on predicted future representations inside video diffusion models that have been fine-tuned on robot datasets together with internet-sourced human manipulation videos, producing higher success rates than previous vision-based policies.

What carries the argument

Video Prediction Policy (VPP) that conditions action outputs on future visual representations generated by fine-tuned video diffusion models.

If this is right

18.6 percent relative improvement on the Calvin ABC-D generalization benchmark over prior state-of-the-art.
31.6 percent higher success rates on complex real-world dexterous manipulation tasks.
Fine-tuning the video model on robot and human data produces more precise future predictions that better support policy learning.
Implicit inverse dynamics modeling gains from access to dynamic rather than static visual features.
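
The first two figures above are improvements, and a relative gain reads very differently from an absolute one. The sketch below shows the arithmetic only. The helper names and the example success rates (0.50 and 0.593) are hypothetical values chosen so that the relative gain comes out at 18.6 percent; they are not numbers taken from the paper.

```python
def absolute_improvement(baseline: float, new: float) -> float:
    """Difference in the metric itself (e.g. success-rate points)."""
    return new - baseline


def relative_improvement(baseline: float, new: float) -> float:
    """Gain expressed as a fraction of the baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical scores for illustration only, NOT results from the paper.
baseline_score = 0.50
new_score = 0.593

print(f"absolute: {absolute_improvement(baseline_score, new_score):.3f}")
print(f"relative: {relative_improvement(baseline_score, new_score):.1%}")
```

With these assumed values the same change is about 9.3 points in absolute terms and 18.6 percent in relative terms, which is why the wording of each reported figure matters when comparing against other work.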

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same predictive representations could replace conventional vision encoders in other sequential control settings such as navigation or assembly.
Extending the video prediction horizon might allow the policy to plan over longer action sequences without an explicit world model.
Mixing human demonstration data during fine-tuning may improve zero-shot transfer from human videos to robot execution.
The approach could be combined with language-conditioned policies to handle open-ended instructions while retaining the temporal advantage.

Load-bearing premise

Video diffusion models inherently produce representations that capture future dynamics useful for guiding robot action selection.

What would settle it

Training a policy with the same fine-tuned diffusion encoder but using only current-frame representations instead of predicted future ones and measuring whether performance drops on the Calvin ABC-D benchmark and real dexterous tasks.
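
The logic of that test can be illustrated on a toy problem. The sketch below is not the paper's method or code: it assumes a one-dimensional system in which the next state equals the current state plus the action, then fits a linear action predictor once with only the current state and once with the current and future state. All function names and the dynamics are assumptions made for illustration.

```python
import random


def make_dataset(n, seed=0):
    """Toy 1-D dynamics (assumed): next_state = state + action."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        state = rng.uniform(-1.0, 1.0)
        action = rng.uniform(-1.0, 1.0)
        rows.append((state, state + action, action))
    return rows


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_and_score(features, targets):
    """Least squares via the normal equations; returns training MSE."""
    k = len(features[0])
    A = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    b = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(k)]
    w = solve(A, b)
    errors = [(sum(wi * fi for wi, fi in zip(w, f)) - t) ** 2
              for f, t in zip(features, targets)]
    return sum(errors) / len(errors)


rows = make_dataset(500)
actions = [a for _, _, a in rows]
# Ablated input: current state plus a bias term.
mse_current_only = fit_and_score([[s, 1.0] for s, _, _ in rows], actions)
# Full input: current state, future state, and a bias term.
mse_with_future = fit_and_score([[s, s_next, 1.0] for s, s_next, _ in rows], actions)
print(mse_current_only, mse_with_future)
```

In this toy setting the action is recoverable only when the future state is available, so the current-only error stays large while the error with the future state is near zero. Whether the same gap appears with learned diffusion features on real tasks is exactly what the proposed ablation would measure.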

Original abstract

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) demonstrate the ability to predict future frames and showcase a strong understanding of physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise future, we fine-tune pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page at https://video-prediction-policy.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VPP conditions robot policies on future representations from fine-tuned video diffusion models and reports solid gains, but those gains may trace to extra training data rather than the predictive mechanism.


The main takeaway is that this work fine-tunes a video diffusion model on robot and human manipulation videos, then feeds the model's internal future-prediction features into a policy that learns implicit inverse dynamics. That produces the reported 18.6% relative lift on Calvin ABC-D and the 31.6% success-rate jump on real dexterous tasks. The approach is a direct attempt to move beyond static vision encoders by borrowing the temporal priors that diffusion models already learn during generation. Prior robot vision work mostly used single-image reconstruction or contrastive losses, so the shift to conditioning on predicted futures is the concrete novelty here.

The numbers are the part that stands out: the Calvin result beats the previous state of the art by a noticeable margin, and the real-world dexterous gains are large enough to matter for practical manipulation. The paper also ships a project page with what looks like reproducible setup details.

The clearest limitation is the missing isolation of the predictive component. The method explicitly adds supervised fine-tuning data to improve future prediction quality, yet the abstract and reported experiments do not include an ablation that keeps the fine-tuned encoder but drops the future-representation conditioning, or that compares against an equivalently trained non-predictive backbone. Without that control, the performance deltas cannot be cleanly attributed to the dynamic priors rather than simply stronger static features or more data. The central hypothesis therefore rests on an assumption that still needs direct testing.

This paper is aimed at people already working on generalist policies and large vision models for robotics. A reader who cares about scaling embodied agents with video priors will find the pipeline and benchmarks worth examining. The thinking is coherent and the empirical claims are stated plainly, so the work is ready for serious refereeing. I would send it to peer review and ask specifically for the ablations that separate the future-conditioning effect from the fine-tuning step.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Video Prediction Policy (VPP), a generalist robot policy that fine-tunes pre-trained video diffusion models (VDMs) on robot datasets and internet human-manipulation videos to produce predictive visual representations. These representations are hypothesized to encode both static scene information and future dynamics, which are then used to condition an implicit inverse-dynamics model for action prediction. The authors report an 18.6% relative improvement over prior state-of-the-art on the Calvin ABC-D generalization benchmark and a 31.6% increase in success rate on complex real-world dexterous manipulation tasks.

Significance. If the reported gains are reproducible and the predictive conditioning is shown to be the operative factor, the work would offer a concrete route to injecting future-dynamics awareness into robot policies via existing video foundation models. This could meaningfully advance generalist embodied agents by moving beyond static image encoders, with direct implications for sim-to-real transfer and long-horizon manipulation.

major comments (2)

[Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.
[Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.

minor comments (1)

[Abstract] The phrase 'implicit inverse dynamics model conditioned on predicted future representations' is introduced without a brief definition or pointer to the relevant equation, which may hinder readers outside the immediate subfield.
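
The reporting asked for in the first major comment (episode counts, variance across seeds, an interval on the success rate) can be sketched as follows. This is a generic illustration using a Wilson score interval and the sample standard deviation; the function names and all input numbers are hypothetical, and nothing here reflects the paper's actual evaluation protocol.

```python
import math
import statistics


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


def summarize_seeds(rates):
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(rates), statistics.stdev(rates)


# Hypothetical evaluation: 50 successes in 100 episodes, three seeds.
low, high = wilson_interval(50, 100)
mean_rate, spread = summarize_seeds([0.5, 0.6, 0.7])
print(f"success-rate interval: [{low:.3f}, {high:.3f}]")
print(f"across seeds: mean={mean_rate:.2f}, std={spread:.2f}")
```

Reporting an interval alongside the point estimate lets a reader judge whether a gap between two methods exceeds what the episode count alone could explain.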

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental reporting and hypothesis validation, and we have revised the paper to strengthen these elements while preserving the core contribution.


Referee: [Abstract] The performance claims (18.6% relative improvement on Calvin ABC-D and 31.6% real-world success-rate increase) are stated without any reference to the number of evaluation episodes, variance across seeds, statistical tests, or the precise baselines used, rendering the numerical results unverifiable from the provided information.

Authors: We agree that the abstract would benefit from greater specificity to improve verifiability. In the revised manuscript, we have updated the abstract to note that the reported gains are obtained under standard benchmark protocols (with full details on episode counts, seed variance, and baselines provided in the Experiments section). This maintains abstract length while directing readers to the supporting evidence.

revision: yes

Referee: [Method / Experiments] The central hypothesis that VDMs supply future-dynamic guidance is load-bearing for the contribution, yet no ablation isolates the effect of conditioning on predicted future representations while holding the fine-tuned encoder fixed. A comparison against an equivalently fine-tuned but non-predictive backbone (e.g., current-frame features only) is required to rule out gains arising solely from additional supervised data or increased model capacity.

Authors: The referee correctly notes that such an ablation is necessary to isolate the predictive component. The original manuscript compares VPP against multiple baselines with different encoders, but to directly address this point we have added a new ablation study. We fine-tune the identical VDM backbone and then train the policy using only its current-frame features (no future prediction or conditioning). Results in the revised Experiments section show that the predictive representations contribute additional gains beyond fine-tuning alone, supporting the hypothesis that future dynamics are a key factor.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent experimental validation


The paper advances an empirical method: it states a hypothesis about VDM representations, fine-tunes a pre-trained video model on robot and human-manipulation videos to improve future-frame prediction, then trains a policy that conditions on the resulting representations for inverse-dynamics learning. All performance numbers (Calvin benchmark, real-world dexterous tasks) are reported as measured outcomes on held-out evaluation sets. No equations, definitions, or self-citations reduce the claimed gains to a fitted parameter renamed as prediction or to a self-referential premise; the central result remains falsifiable by ablation or external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no identifiable free parameters, axioms, or invented entities; a full-text audit would be required to enumerate any fitted scales, domain assumptions, or new postulated constructs.



Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
cs.AI 2026-05 unverdicted novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models
cs.RO 2026-04 unverdicted novelty 7.0

Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
JailWAM: Jailbreaking World Action Models in Robot Control
cs.RO 2026-04 unverdicted novelty 7.0

JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO 2026-05 unverdicted novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
MotuBrain: An Advanced World Action Model for Robot Control
cs.RO 2026-04 unverdicted novelty 6.0

MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
cs.RO 2026-04 unverdicted novelty 6.0

AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
cs.RO 2026-04 unverdicted novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
cs.AI 2026-01 conditional novelty 6.0

Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
Unified Video Action Model
cs.RO 2025-02 unverdicted novelty 6.0

UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO 2025-02 unverdicted novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
cs.RO 2026-05 unverdicted novelty 5.0

CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO 2026-04 unverdicted novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
Motus: A Unified Latent Action World Model
cs.CV 2025-12 unverdicted novelty 5.0

Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
cs.RO 2026-04 unverdicted novelty 4.0

A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 33 Pith papers · 18 internal anchors

[1]

FirstName LastName , title =

work page
[2]

IEEE Robotics and Automation Letters (RA-L) , volume=

Oier Mees and Lukas Hermann and Erick Rosete-Beas and Wolfram Burgard , title =. IEEE Robotics and Automation Letters (RA-L) , volume=

work page
[3]

Conference on robot learning , pages=

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning , author=. Conference on robot learning , pages=. 2020 , organization=

work page 2020
[4]

2024 , eprint=

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals , author=. 2024 , eprint=

work page 2024
[6]

2023 , eprint=

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation , author=. 2023 , eprint=

work page 2023
[7]

FirstName Alpher , title =

work page
[8]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

work page
[9]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

work page
[10]

FirstName Alpher and FirstName Gamow , title =

work page
[11]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[13]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[14]

2022 , eprint=

Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

work page 2022
[15]

International conference on machine learning , pages=

A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[16]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

An empirical study of training self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[17]

Advances in neural information processing systems , volume=

Unsupervised learning of visual features by contrasting cluster assignments , author=. Advances in neural information processing systems , volume=

work page
[18]

International Conference on Machine Learning , pages=

Data2vec: A general framework for self-supervised learning in speech, vision and language , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[20]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[21]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Simple but effective: Clip embeddings for embodied ai , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[22]

Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=

Offline visual representation learning for embodied navigation , author=. Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=

work page 2023
[24]

international conference on machine learning , pages=

The unsurprising effectiveness of pre-trained vision models for control , author=. international conference on machine learning , pages=. 2022 , organization=

work page 2022
[25]

Conference on Robot Learning , pages=

Real-world robot learning with masked visual pre-training , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023
[29]

Advances in Neural Information Processing Systems , volume=

Where are we in the search for an artificial visual cortex for embodied intelligence? , author=. Advances in Neural Information Processing Systems , volume=

work page
[30]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[31]

something something

The" something something" video database for learning and evaluating visual common sense , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[32]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[34]

Advances in Neural Information Processing Systems , volume=

Learning universal policies via text-guided video generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[36]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[37]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Denoising diffusion autoencoders are unified self-supervised learners , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[38]

Advances in Neural Information Processing Systems , volume=

Diffusion hyperfeatures: Searching through time and space for semantic correspondence , author=. Advances in Neural Information Processing Systems , volume=

work page
[42]

OpenVLA: An Open-Source Vision-Language-Action Model

OpenVLA: An Open-Source Vision-Language-Action Model , author=. arXiv preprint arXiv:2406.09246 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273, 2024

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers , author=. arXiv preprint arXiv:2410.05273 , year=

work page arXiv
[44]

Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022
[46]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. arXiv preprint arXiv:2410.07864
[47]

Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021
[48]

Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 2009
[49]

A Survey on Vision-Language-Action Models for Embodied AI. arXiv preprint arXiv:2405.14093
[50]

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022
[56]

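Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators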
[57]

3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024
[59]

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022
[60]

Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456
[62]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision
[66]

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., and Xia, F. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455–14465, 2024a
[67]

Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023
[68]

Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215
[71]

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b
[72]

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International conference on machine learning, pp. 4651–4664. PMLR, 2021
[74]

Guo, Y., Hu, Y., Zhang, J., Wang, Y.-J., Chen, X., Lu, C., and Chen, J. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
[75]

Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation. Advances in Neural Information Processing Systems
[77]

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022
[78]

Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp. 1298–1312. PMLR, 2022
[79]

Bao, H., Dong, L., Piao, S., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021
[80]

Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., and Kirmani, S. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024
[81]

Black, K., Nakamoto, M., Atreya, P., Walke, H., Finn, C., Kumar, A., and Levine, S. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023
[82]

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a
[83]

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b
[84]

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
[85]

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
[86]

Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023
[87]

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators
[88]

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020
[89]

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., and Xia, F. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455–14465, 2024a
[90]

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020
[91]

Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., and Lin, L. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023
[92]

Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9640–9649, 2021
[93]

Chen, X., Guo, J., He, T., Zhang, C., Zhang, P., Yang, D. C., Zhao, L., and Bian, J. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai. arXiv preprint arXiv:2411.00785, 2024b
[94]

Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023
[95]

Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., and Abbeel, P. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024
[96]

Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., and Levine, S. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021
[97]

Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017
[98]

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022
[99]

Gu, X., Wen, C., Ye, W., Song, J., and Gao, Y. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023
[100]

Guo, Y., Hu, Y., Zhang, J., Wang, Y.-J., Chen, X., Lu, C., and Chen, J. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
[101]

Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.-J., Hu, Y., and Chen, J. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025
[102]

Gupta, G., Yadav, K., Gal, Y., Batra, D., Kira, Z., Lu, C., and Rudner, T. G. Pre-trained text-to-image diffusion models are versatile representation learners for control. arXiv preprint arXiv:2405.05852, 2024
[103]

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022
[104]

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022
[105]

Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022
[106]

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International conference on machine learning, pp. 4651–4664. PMLR, 2021
[107]

Karamcheti, S., Nair, S., Chen, A. S., Kollar, T., Finn, C., Sadigh, D., and Liang, P. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023
