Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation

Daichi Yashima; Kento Tokura; Koki Seno; Komei Sugiura; Yusuke Takagi

arxiv: 2606.23090 · v2 · pith:GLG3JQ7Tnew · submitted 2026-06-22 · 💻 cs.RO

Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation

Koki Seno , Daichi Yashima , Yusuke Takagi , Kento Tokura , Komei Sugiura This is my paper

Pith reviewed 2026-06-26 08:38 UTC · model grok-4.3

classification 💻 cs.RO

keywords flow matchingvelocity fieldsrobot manipulationprobability flowscross-embodimentobject manipulationflow-based generation

0 comments

The pith

Modeling robot velocity fields as probability flows via flow matching enables faster generation and higher success in object manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes modeling robot flows directly as probability flows using a flow matching formulation rather than as displacements of sparse keypoints. This choice is presented as better aligned with the continuous-time character of robot motions and cross-embodiment data. The resulting method is shown to produce dense velocity fields more efficiently. On standard benchmarks the approach beats baselines on conventional metrics while running about 33 times faster; real-robot trials across 13 tasks and 260 runs per method report higher average success rates.

Core claim

By formulating robot flows as probability flows based on a flow matching formulation, the method achieves efficient and high-quality robot flow generation, outperforming representative baseline methods on standard metrics with approximately 33× faster generation and higher average success rates in real-world experiments across 13 manipulation tasks.

What carries the argument

Flow matching formulation applied to dense robot velocity fields treated as probability flows, which directly models continuous-time motion representations instead of sparse keypoint displacements.

If this is right

Outperforms baselines on standard metrics for flow generation.
Achieves approximately 33 times faster generation than compared methods.
Delivers higher average success rates across 13 manipulation tasks in 260 trials per method.
Supports use of heterogeneous cross-embodiment data for robotic foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same probability-flow treatment could be tested on continuous trajectory tasks outside manipulation, such as locomotion or navigation.
Integration into larger foundation-model training pipelines may reduce the need for explicit keypoint extraction steps.
Performance gains might scale differently when the amount of cross-embodiment data increases substantially beyond the current experiments.

Load-bearing premise

The premise that dense velocity fields formulated as probability flows within flow matching inherently better capture the continuous-time nature of motions and yield superior performance compared with sparse keypoint displacements.

What would settle it

A head-to-head evaluation on the same 13 real-world tasks with identical training data and compute budget in which any baseline method matches or exceeds both the reported generation speed and success rate.

Figures

Figures reproduced from arXiv: 2606.23090 by Daichi Yashima, Kento Tokura, Koki Seno, Komei Sugiura, Yusuke Takagi.

**Figure 1.** Figure 1: Overview of our framework. We use cross-embodiment data for training, which include human demonstrations as an additional embodiment. Our framework, Flow as Flow, models robot flows (robot velocity fields) as probability flows using flow matching. At test time, it generates a robot flow conditioned on an initial image and a goal image. The robot then executes object manipulation based on the generated flow… view at source ↗

**Figure 2.** Figure 2: Architecture of our method. The Flow Generation module generates X0:H−1, and the Action Generation module predicts at based on it. Xh and X<h denote the point coordinates at timestep h and their history, respectively. NDiT and N′ DiT are the numbers of the DiT blocks in the Flow Generation module and the Action Generation module, respectively. formulated as velocity fields. Thus, Flow as Flow formulates ro… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of robot flow generation between our method and a baseline [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Representative examples from our real-world experiments. We conducted experiments [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results of our method in real-world experiments, showing three successful [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of end-effector masks. The left image shows I, and the other images present generated masks of SAM3 [82], Robot-SAM [83] without fine-tuning, and RobotSAM with fine-tuning. The generated segmentation masks are overlaid in purple. 2 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: shows representative cases from Fractal [11] and Bridge V2 [12] in which evaluating the full keypoint set was problematic. In both cases, the predicted robot flows qualitatively represented appropriate end-effector motions. Specifically, the generated flow in [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: A failure case for Fanuc Manipulation. The four left columns show frames from [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Conceptual difference between conventional diffusion-based generation and Flow as Flow. Flow as Flow substantially accelerates diffusion-based flow generation methods by unifying generation steps with the flow horizon. Bussing table: Grasp the specified object on a table and place it into the designated receptacle, either a bin or a cardboard box, after rotating the mobile base. Compared with “bin picking,… view at source ↗

**Figure 10.** Figure 10: Additional qualitative results in the real-world experiments. The left column represents [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

read the original abstract

Cross-embodiment data have become central to training robotic foundation models. To leverage such heterogeneous data, we focus on flow-based object manipulation, where robot flows (robot velocity fields) serve as embodiment-agnostic motion representations. Previous studies do not formulate robot flows as dense velocity fields, but as displacements of sparse keypoints, while such velocity fields better match the continuous-time nature of motions. We propose Flow as Flow, a framework that models robot flows as probability flows based on a flow matching formulation. By naturally modeling such velocity fields within this formulation, our method achieves efficient and high-quality robot flow generation. Across standard benchmarks, our method outperforms representative baseline methods on standard metrics, while achieving approximately 33$\times$ faster generation. Furthermore, through real-world experiments evaluating 9 methods with 260 trials per method across 13 manipulation tasks, we show that our method achieves a higher average success rate than the baseline methods. Our project page is available at https://flow-as-flow-u0n5y.kinsta.page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reframes robot velocity fields as probability flows via flow matching and reports 33x speedups plus higher real-world success, but supplies no ablation or derivation showing the modeling choice itself produces those gains.

read the letter

The main point is that this work models robot flows as dense probability velocity fields inside a flow-matching setup instead of using sparse keypoint displacements. The abstract presents this as a better match for continuous-time motion and claims clear wins on speed and task success.

The real-world evaluation is the stronger part. Running nine methods with 260 trials each across thirteen manipulation tasks gives a decent volume of data for this area, and the reported higher average success rate plus the benchmark speedup are concrete numbers that can be checked.

The soft spot is the missing link between the probability-flow formulation and the measured improvements. Nothing in the abstract isolates the modeling decision with an ablation that holds architecture, training, and data fixed while swapping only the flow representation. The stress-test note is correct on this; performance differences could easily come from other implementation details. Baseline descriptions, metric definitions, and error analysis are also absent, so the claims cannot be evaluated from the given text.

This is aimed at researchers working on flow-based or cross-embodiment manipulation. Someone already following flow-matching methods in robotics could extract the core modeling idea and the experiment scale.

I would send it to peer review. The experimental effort is substantial enough to justify referee time, and the modeling angle is distinct; the full methods section would need to supply the controls and details that are missing here.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'Flow as Flow', a framework modeling robot velocity fields as probability flows via flow matching to serve as embodiment-agnostic representations for object manipulation from cross-embodiment data. It claims this dense formulation better matches continuous-time motions than prior sparse keypoint displacements, yielding ~33x faster generation on benchmarks and higher average success rates than 9 baselines across 260 trials per method on 13 real-world tasks.

Significance. If the central modeling claim holds after proper isolation of the probability-flow choice, the work could strengthen flow-based manipulation by supplying a dense, continuous-time velocity-field representation that improves efficiency and real-world reliability when training on heterogeneous robot data.

major comments (2)

[Abstract, §3] Abstract and §3 (method): the central claim that formulating robot flows as probability flows 'naturally' produces the reported 33× speedup and higher success rates is not supported by any derivation showing why the flow-matching ODE is required rather than optional, nor by an ablation that holds network architecture, training data, and optimization fixed while switching only between dense probability velocity fields and sparse keypoint displacements.
[§4] §4 (experiments): performance numbers are reported without baseline descriptions, metric definitions, error bars, or statistical tests; the 33× speedup and success-rate gains cannot be attributed to the probability-flow modeling choice on the evidence supplied.

minor comments (2)

[Abstract] Abstract: the phrase 'standard benchmarks' and 'standard metrics' should be replaced by explicit citations to the datasets and metrics used.
[Abstract] Project page URL should be checked for permanence and content matching the manuscript claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (method): the central claim that formulating robot flows as probability flows 'naturally' produces the reported 33× speedup and higher success rates is not supported by any derivation showing why the flow-matching ODE is required rather than optional, nor by an ablation that holds network architecture, training data, and optimization fixed while switching only between dense probability velocity fields and sparse keypoint displacements.

Authors: The flow-matching formulation is selected because it directly parameterizes dense, continuous-time velocity fields as probability flows, aligning with the continuous nature of robot motions in a way that sparse keypoint displacements do not. While this motivation is stated conceptually in the manuscript, we acknowledge that an explicit derivation isolating the ODE and a controlled ablation would strengthen the attribution. In the revised manuscript we will add a derivation in §3 explaining why the flow-matching ODE is required for this dense representation and include an ablation that holds network architecture, data, and optimization fixed while varying only the dense probability velocity field versus sparse keypoint formulation. revision: yes
Referee: [§4] §4 (experiments): performance numbers are reported without baseline descriptions, metric definitions, error bars, or statistical tests; the 33× speedup and success-rate gains cannot be attributed to the probability-flow modeling choice on the evidence supplied.

Authors: We agree that the experimental reporting requires additional detail for reproducibility and to support attribution of gains to the modeling choice. In the revised §4 we will expand all baseline descriptions, provide precise metric definitions, add error bars to all quantitative results, and include statistical tests (e.g., paired significance tests) on the reported speedups and success rates. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation chain self-contained against external benchmarks

full rationale

The abstract and method description introduce a modeling choice (robot velocity fields as probability flows via flow matching) and report empirical gains (33x speedup, higher success rates) without any equations, fitted parameters, or self-citations that reduce the claimed advantage to a quantity defined by the inputs themselves. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear. The central premise is presented as an empirical modeling decision whose optimality is asserted via performance on benchmarks and real-world trials rather than by algebraic reduction or prior-author uniqueness theorems. This is the most common honest finding for papers whose contributions rest on implementation and evaluation rather than closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the modeling choice is described at the level of a formulation rather than explicit assumptions or new entities.

pith-pipeline@v0.9.1-grok · 5728 in / 1039 out tokens · 17543 ms · 2026-06-26T08:38:53.943617+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, et al.π 0: A Vision-Language-Action Flow Model for General Robot Control. InRSS, 2025

2025
[2]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, et al.π 0.5: a Vision- Language-Action Model with Open-World Generalization. InCoRL, pages 17–40, 2025

2025
[3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Zhu, L. Feng, P. Li, Q. Deng, R. Ouyang, W. Qin, X. Chen, X. Wang, Y . Wang, Y . Li, Y . Li, et al. GigaBrain-0: A World Model-Powered Vision-Language-Action Model.arXiv preprint arXiv:2510.19430, 2025

work page arXiv 2025
[5]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, et al. Open X- Embodiment: Robotic Learning Datasets and RT-X Models. InICRA, pages 6892–6903, 2024

2024
[6]

F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao. Data Scaling Laws in Imitation Learning for Robotic Manipulation. InICLR, 2025

2025
[7]

G. Yang, T. Zhang, H. Hao, W. Wang, Y . Liu, D. Wang, G. Chen, Z. Cai, J. Chen, et al. Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning. InICLR, 2026

2026
[8]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, et al. Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation. InECCV, pages 306–324, 2024

2024
[9]

M. Xu, Z. Xu, Y . Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the Cross-Domain Manipulation Interface. InCoRL, pages 2475–2499, 2024

2024
[10]

C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point Trajectory Modeling for Policy Learning. InRSS, 2024

2024
[11]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, et al. RT-1: Robotics Transformer for Real-World Control at Scale. InRSS, 2023

2023
[12]

Walke, K

H. Walke, K. Black, T. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. He, V . Myers, et al. BridgeData V2: A Dataset for Robot Learning at Scale. InCoRL, pages 1723–1736, 2023

2023
[13]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. InRSS, 2024

2024
[14]

X. Zhu, R. Tian, C. Xu, M. Huo, W. Zhan, M. Tomizuka, and M. Ding. Fanuc Manipulation: A Dataset for Learning-based Manipulation with FANUC Mate 200iD Robot.https:// sites.google.com/berkeley.edu/fanuc-manipulation, 2023

2023
[15]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. InCVPR, pages 10684–10695, 2022

2022
[16]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. InICML, pages 12606–12633, 2024

2024
[17]

Peebles and S

W. Peebles and S. Xie. Scalable Diffusion Models with Transformers. InICCV, pages 4172– 4182, 2023

2023
[18]

Z. Li, Q. Zhou, X. Zhang, Y . Zhang, Y . Wang, and W. Xie. Open-vocabulary Object Segmen- tation with Diffusion Models. InICCV, pages 7667–7676, 2023

2023
[19]

Iioka, Y

Y . Iioka, Y . Yoshida, Y . Wada, et al. Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions. InIROS, pages 7590–7597, 2023

2023
[20]

Zhang et al

Q. Zhang et al. FlowPolicy: Enabling Fast and Robust 3D Flow-based Policy via Consistency Flow Matching for Robot Manipulation. InAAAI, volume 39, pages 14754–14762, 2025. 9

2025
[21]

Zhang, C

T. Zhang, C. Yu, S. Su, and Y . Wang. ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning. InNeurIPS, 2025

2025
[22]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. InRSS, 2023

2023
[23]

M. Sun, W. Wang, G. Li, J. Liu, J. Sun, W. Feng, et al. AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion. InCVPR, pages 7364–7373, 2025

2025
[24]

K. Tian, Y . Jiang, Z. Yuan, B. Peng, et al. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. InNeurIPS, volume 37, pages 84839–84865, 2024

2024
[25]

S. Ren, Q. Yu, J. He, X. Shen, A. Yuille, and L.-C. Chen. FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching. InICML, pages 51489–51502, 2025

2025
[26]

Yashima, K

D. Yashima, K. Seno, S. Kurita, Y . Oda, et al. HiFlow: Tokenization-Free Scale-Wise Autore- gressive Policy Learning via Flow Matching.arXiv preprint arXiv:2603.27281, 2026

work page arXiv 2026
[27]

Block et al

A. Block et al. Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior. InNeurIPS, volume 36, pages 48534–48547, 2023

2023
[28]

Jiang, X

S. Jiang, X. Fang, N. Roy, T. Lozano-Pérez, et al. Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories. InCoRL, 2025

2025
[29]

C. Yuan, C. Wen, T. Zhang, and Y . Gao. General Flow as Foundation Affordance for Scalable Robot Learning. InCoRL, pages 1541–1566, 2024

2024
[30]

H. Chen, B. Sun, et al. VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation. InCVPR, pages 27661–27672, 2025

2025
[31]

L.-H. Lin, Y . Cui, A. Xie, T. Hua, and D. Sadigh. FlowRetrieval: Flow-Guided Data Retrieval for Few-Shot Imitation Learning. InCoRL, pages 4084–4099, 2024

2024
[32]

S. Wang, J. You, Y . Hu, J. Li, and Y . Gao. SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation. InRSS, 2025

2025
[33]

Vecerik, C

M. Vecerik, C. Doersch, Y . Yang, T. Davchev, Y . Aytar, G. Zhou, R. Hadsell, et al. RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation. InICRA, pages 5397–5403, 2024

2024
[34]

C. Gao, H. Zhang, Z. Xu, Z. Cai, and L. Shao. FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model. InICLR, 2025

2025
[35]

Zheng, Z

Y . Zheng, Z. Ye, W. Dong, S. Wang, Y . Liu, C. Zhang, C. Wen, and Y . Gao. Translating Flow to Policy via Hindsight Online Imitation. InICLR, 2026

2026
[36]

Eisner, H

B. Eisner, H. Zhang, and D. Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects. InRSS, 2022

2022
[37]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, et al. Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision. InCVPR, pages 17370–17382, 2025

2025
[38]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, and S. Mori. Developing Vision-Language-Action Model from Egocentric Videos. InICRA, 2026

2026
[39]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. InRSS, 2025

2025
[40]

Y . Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. VIP: Towards Univer- sal Visual Reward and Representation via Value-Implicit Pre-Training. InICLR, 2023

2023
[41]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A Universal Visual Repre- sentation for Robot Manipulation. InCoRL, pages 892–909, 2023. 10

2023
[42]

In- The-Wild

A. Chen, S. Nair, and C. Finn. Learning Generalizable Robotic Reward Functions from "In- The-Wild" Human Videos. InRSS, 2021

2021
[43]

Zakka, A

K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. XIRL: Cross- embodiment Inverse Reinforcement Learning. InCoRL, pages 537–546, 2022

2022
[44]

K. Shaw, S. Bahl, and D. Pathak. VideoDex: Learning Dexterity from Internet Videos. In CoRL, pages 654–665, 2023

2023
[45]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from Human Videos as a Versatile Representation for Robotics . InCVPR, pages 13778–13790, 2023

2023
[46]

Goyal, S

M. Goyal, S. Modi, R. Goyal, and S. Gupta. Human Hands as Probes for Interactive Object Understanding. InCVPR, pages 3293–3303, 2022

2022
[47]

S. Liu, S. Tripathi, S. Majumdar, and X. Wang. Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos. InCVPR, pages 3282–3292, 2022

2022
[48]

Y . Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation. InICRA, pages 1118–1125, 2018

2018
[49]

L. Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg. Mirage: Cross- Embodiment Zero-Shot Policy Transfer with Cross-Painting. InRSS, 2024

2024
[50]

Schmeckpeper, O

K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, et al. Reinforcement Learning with Videos: Combining Offline Observations with Interaction. InCoRL, pages 339–354, 2021

2021
[51]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, et al. Ego4D: Around The World in 3,000 Hours of Egocentric Video. InCVPR, pages 18995–19012, 2022

2022
[52]

Damen, H

D. Damen, H. Doughty, G. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. InECCV, pages 720–736, 2018

2018
[53]

something something

R. Goyal, S. Kahou, V . Michalski, J. Materzynska, et al. The “something something” video database for learning and evaluating visual common sense. InICCV, pages 5842–5850, 2017

2017
[54]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Lin, L. Liden, K. Lee, J. Gao, et al. Latent Action Pretraining from Videos. InICLR, 2025

2025
[55]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, et al. Learning Universal Policies via Text-Guided Video Generation. InNeurIPS, volume 36, pages 9156–9172, 2023

2023
[56]

Lipman, R

Y . Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow Matching for Generative Modeling. InICLR, 2023

2023
[57]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, et al. TidyBot: Personalized Robot Assistance with Large Language Models.Autonomous Robots, 47(8):1087–1102, 2023

2023
[58]

J. Liu, J. Han, B. Yan, H. Wu, F. Zhu, X. Wang, Y . Jiang, B. Peng, and Z. Yuan. InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation. InNeurIPS, 2025

2025
[59]

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016

2016
[60]

Dasari, O

S. Dasari, O. Mees, S. Zhao, M. Srirama, and S. Levine. The Ingredients for Robotic Diffusion Transformers . InICRA, pages 15617–15625, 2025

2025
[61]

Reuss, O

M. Reuss, O. Yagmurlu, F. Wenzel, and R. Lioutikov. Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. InRSS, 2024

2024
[62]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, et al. Open- VLA: An Open-Source Vision-Language-Action Model. InCoRL, pages 2679–2713, 2024. 11

2024
[63]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Zhu, et al. GigaWorld-0: World Models as Data Engine to Empower Embodied AI.arXiv preprint arXiv:2511.19861, 2025

work page arXiv 2025
[64]

Karaev, Y

N. Karaev, Y . Makarov, J. Wang, N. Neverova, A. Vedaldi, et al. CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. InICCV, pages 6013–6022, 2025

2025
[65]

Karaev, I

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker: It is Better to Track Together. InECCV, pages 18–35, 2024

2024
[66]

Yamamoto, K

T. Yamamoto, K. Terada, A. Ochiai, F. Saito, et al. Development of Human Support Robot as the research platform of a domestic mobile manipulator.ROBOMECH Journal, 6(1):4, 2019

2019
[67]

J. Ma, Y . Qin, Y . Li, X. Liao, Y . Guo, and R. Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion. InCoRL, pages 4190–4205, 2025

2025
[68]

Kawaharazuka, T

K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng. Real- World Robot Applications of Foundation Models: A Review.AR, 38(18):1232–1254, 2024

2024
[69]

Firoozi, J

R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, et al. Foundation Models in Robotics: Applications, Challenges, and the Future.IJRR, 44(5):701–739, 2025

2025
[70]

Urain, A

J. Urain, A. Mandlekar, Y . Du, N. Shafiullah, D. Xu, et al. A Survey on Deep Generative Models for Robot Learning From Multimodal Demonstrations.T-RO, 42:60–79, 2026

2026
[71]

J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. InNeurIPS, vol- ume 33, pages 6840–6851, 2020

2020
[72]

Y . Wang, Y . He, Y . Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, et al. InternVid: A Large- scale Video-Text Dataset for Multimodal Understanding and Generation. InICLR, 2024

2024
[73]

M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. InICCV, pages 1728–1738, 2021

2021
[74]

D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding Human Hands in Contact at Internet Scale. InCVPR, pages 9869–9878, 2020

2020
[75]

H. Xue, T. Hang, Y . Zeng, Y . Sun, B. Liu, et al. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. InCVPR, pages 5036–5045, 2022

2022
[76]

Zellers, X

R. Zellers, X. Lu, J. Hessel, Y . Yu, J. Park, J. Cao, A. Farhadi, et al. MERLOT: Multimodal Neural Script Knowledge Models. InNeurIPS, volume 34, pages 23634–23651, 2021

2021
[77]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, E. Byrne, et al. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. InCVPR, pages 19383–19400, 2024

2024
[78]

Sivakumar, K

A. Sivakumar, K. Shaw, and D. Pathak. Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on YouTube. InRSS, 2023

2023
[79]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from Human Videos as a Versatile Representation for Robotics. InCVPR, pages 13778–13790, 2023

2023
[80]

Y . Yang, M. Chen, Q. Qiu, J. Wu, et al. Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts. InECCV, pages 163–180, 2024

2024

Showing first 80 references.

[1] [1]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, et al.π 0: A Vision-Language-Action Flow Model for General Robot Control. InRSS, 2025

2025

[2] [2]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, et al.π 0.5: a Vision- Language-Action Model with Open-World Generalization. InCoRL, pages 17–40, 2025

2025

[3] [3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Zhu, L. Feng, P. Li, Q. Deng, R. Ouyang, W. Qin, X. Chen, X. Wang, Y . Wang, Y . Li, Y . Li, et al. GigaBrain-0: A World Model-Powered Vision-Language-Action Model.arXiv preprint arXiv:2510.19430, 2025

work page arXiv 2025

[5] [5]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, et al. Open X- Embodiment: Robotic Learning Datasets and RT-X Models. InICRA, pages 6892–6903, 2024

2024

[6] [6]

F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao. Data Scaling Laws in Imitation Learning for Robotic Manipulation. InICLR, 2025

2025

[7] [7]

G. Yang, T. Zhang, H. Hao, W. Wang, Y . Liu, D. Wang, G. Chen, Z. Cai, J. Chen, et al. Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning. InICLR, 2026

2026

[8] [8]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, et al. Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation. InECCV, pages 306–324, 2024

2024

[9] [9]

M. Xu, Z. Xu, Y . Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the Cross-Domain Manipulation Interface. InCoRL, pages 2475–2499, 2024

2024

[10] [10]

C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point Trajectory Modeling for Policy Learning. InRSS, 2024

2024

[11] [11]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, et al. RT-1: Robotics Transformer for Real-World Control at Scale. InRSS, 2023

2023

[12] [12]

Walke, K

H. Walke, K. Black, T. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. He, V . Myers, et al. BridgeData V2: A Dataset for Robot Learning at Scale. InCoRL, pages 1723–1736, 2023

2023

[13] [13]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. InRSS, 2024

2024

[14] [14]

X. Zhu, R. Tian, C. Xu, M. Huo, W. Zhan, M. Tomizuka, and M. Ding. Fanuc Manipulation: A Dataset for Learning-based Manipulation with FANUC Mate 200iD Robot.https:// sites.google.com/berkeley.edu/fanuc-manipulation, 2023

2023

[15] [15]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. InCVPR, pages 10684–10695, 2022

2022

[16] [16]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. InICML, pages 12606–12633, 2024

2024

[17] [17]

Peebles and S

W. Peebles and S. Xie. Scalable Diffusion Models with Transformers. InICCV, pages 4172– 4182, 2023

2023

[18] [18]

Z. Li, Q. Zhou, X. Zhang, Y . Zhang, Y . Wang, and W. Xie. Open-vocabulary Object Segmen- tation with Diffusion Models. InICCV, pages 7667–7676, 2023

2023

[19] [19]

Iioka, Y

Y . Iioka, Y . Yoshida, Y . Wada, et al. Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions. InIROS, pages 7590–7597, 2023

2023

[20] [20]

Zhang et al

Q. Zhang et al. FlowPolicy: Enabling Fast and Robust 3D Flow-based Policy via Consistency Flow Matching for Robot Manipulation. InAAAI, volume 39, pages 14754–14762, 2025. 9

2025

[21] [21]

Zhang, C

T. Zhang, C. Yu, S. Su, and Y . Wang. ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning. InNeurIPS, 2025

2025

[22] [22]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. InRSS, 2023

2023

[23] [23]

M. Sun, W. Wang, G. Li, J. Liu, J. Sun, W. Feng, et al. AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion. InCVPR, pages 7364–7373, 2025

2025

[24] [24]

K. Tian, Y . Jiang, Z. Yuan, B. Peng, et al. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. InNeurIPS, volume 37, pages 84839–84865, 2024

2024

[25] [25]

S. Ren, Q. Yu, J. He, X. Shen, A. Yuille, and L.-C. Chen. FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching. InICML, pages 51489–51502, 2025

2025

[26] [26]

Yashima, K

D. Yashima, K. Seno, S. Kurita, Y . Oda, et al. HiFlow: Tokenization-Free Scale-Wise Autore- gressive Policy Learning via Flow Matching.arXiv preprint arXiv:2603.27281, 2026

work page arXiv 2026

[27] [27]

Block et al

A. Block et al. Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior. InNeurIPS, volume 36, pages 48534–48547, 2023

2023

[28] [28]

Jiang, X

S. Jiang, X. Fang, N. Roy, T. Lozano-Pérez, et al. Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories. InCoRL, 2025

2025

[29] [29]

C. Yuan, C. Wen, T. Zhang, and Y . Gao. General Flow as Foundation Affordance for Scalable Robot Learning. InCoRL, pages 1541–1566, 2024

2024

[30] [30]

H. Chen, B. Sun, et al. VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation. InCVPR, pages 27661–27672, 2025

2025

[31] [31]

L.-H. Lin, Y . Cui, A. Xie, T. Hua, and D. Sadigh. FlowRetrieval: Flow-Guided Data Retrieval for Few-Shot Imitation Learning. InCoRL, pages 4084–4099, 2024

2024

[32] [32]

S. Wang, J. You, Y . Hu, J. Li, and Y . Gao. SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation. InRSS, 2025

2025

[33] [33]

Vecerik, C

M. Vecerik, C. Doersch, Y . Yang, T. Davchev, Y . Aytar, G. Zhou, R. Hadsell, et al. RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation. InICRA, pages 5397–5403, 2024

2024

[34] [34]

C. Gao, H. Zhang, Z. Xu, Z. Cai, and L. Shao. FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model. InICLR, 2025

2025

[35] [35]

Zheng, Z

Y . Zheng, Z. Ye, W. Dong, S. Wang, Y . Liu, C. Zhang, C. Wen, and Y . Gao. Translating Flow to Policy via Hindsight Online Imitation. InICLR, 2026

2026

[36] [36]

Eisner, H

B. Eisner, H. Zhang, and D. Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects. InRSS, 2022

2022

[37] [37]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, et al. Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision. InCVPR, pages 17370–17382, 2025

2025

[38] [38]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, and S. Mori. Developing Vision-Language-Action Model from Egocentric Videos. InICRA, 2026

2026

[39] [39]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. InRSS, 2025

2025

[40] [40]

Y . Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. VIP: Towards Univer- sal Visual Reward and Representation via Value-Implicit Pre-Training. InICLR, 2023

2023

[41] [41]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A Universal Visual Repre- sentation for Robot Manipulation. InCoRL, pages 892–909, 2023. 10

2023

[42] [42]

In- The-Wild

A. Chen, S. Nair, and C. Finn. Learning Generalizable Robotic Reward Functions from "In- The-Wild" Human Videos. InRSS, 2021

2021

[43] [43]

Zakka, A

K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. XIRL: Cross- embodiment Inverse Reinforcement Learning. InCoRL, pages 537–546, 2022

2022

[44] [44]

K. Shaw, S. Bahl, and D. Pathak. VideoDex: Learning Dexterity from Internet Videos. In CoRL, pages 654–665, 2023

2023

[45] [45]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from Human Videos as a Versatile Representation for Robotics . InCVPR, pages 13778–13790, 2023

2023

[46] [46]

Goyal, S

M. Goyal, S. Modi, R. Goyal, and S. Gupta. Human Hands as Probes for Interactive Object Understanding. InCVPR, pages 3293–3303, 2022

2022

[47] [47]

S. Liu, S. Tripathi, S. Majumdar, and X. Wang. Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos. InCVPR, pages 3282–3292, 2022

2022

[48] [48]

Y . Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation. InICRA, pages 1118–1125, 2018

2018

[49] [49]

L. Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg. Mirage: Cross- Embodiment Zero-Shot Policy Transfer with Cross-Painting. InRSS, 2024

2024

[50] [50]

Schmeckpeper, O

K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, et al. Reinforcement Learning with Videos: Combining Offline Observations with Interaction. InCoRL, pages 339–354, 2021

2021

[51] [51]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, et al. Ego4D: Around The World in 3,000 Hours of Egocentric Video. InCVPR, pages 18995–19012, 2022

2022

[52] [52]

Damen, H

D. Damen, H. Doughty, G. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. InECCV, pages 720–736, 2018

2018

[53] [53]

something something

R. Goyal, S. Kahou, V . Michalski, J. Materzynska, et al. The “something something” video database for learning and evaluating visual common sense. InICCV, pages 5842–5850, 2017

2017

[54] [54]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Lin, L. Liden, K. Lee, J. Gao, et al. Latent Action Pretraining from Videos. InICLR, 2025

2025

[55] [55]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, et al. Learning Universal Policies via Text-Guided Video Generation. InNeurIPS, volume 36, pages 9156–9172, 2023

2023

[56] [56]

Lipman, R

Y . Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow Matching for Generative Modeling. InICLR, 2023

2023

[57] [57]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, et al. TidyBot: Personalized Robot Assistance with Large Language Models.Autonomous Robots, 47(8):1087–1102, 2023

2023

[58] [58]

J. Liu, J. Han, B. Yan, H. Wu, F. Zhu, X. Wang, Y . Jiang, B. Peng, and Z. Yuan. InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation. InNeurIPS, 2025

2025

[59] [59]

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016

2016

[60] [60]

Dasari, O

S. Dasari, O. Mees, S. Zhao, M. Srirama, and S. Levine. The Ingredients for Robotic Diffusion Transformers . InICRA, pages 15617–15625, 2025

2025

[61] [61]

Reuss, O

M. Reuss, O. Yagmurlu, F. Wenzel, and R. Lioutikov. Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. InRSS, 2024

2024

[62] [62]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, et al. Open- VLA: An Open-Source Vision-Language-Action Model. InCoRL, pages 2679–2713, 2024. 11

2024

[63] [63]

A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Zhu, et al. GigaWorld-0: World Models as Data Engine to Empower Embodied AI.arXiv preprint arXiv:2511.19861, 2025

work page arXiv 2025

[64] [64]

Karaev, Y

N. Karaev, Y . Makarov, J. Wang, N. Neverova, A. Vedaldi, et al. CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. InICCV, pages 6013–6022, 2025

2025

[65] [65]

Karaev, I

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker: It is Better to Track Together. InECCV, pages 18–35, 2024

2024

[66] [66]

Yamamoto, K

T. Yamamoto, K. Terada, A. Ochiai, F. Saito, et al. Development of Human Support Robot as the research platform of a domestic mobile manipulator.ROBOMECH Journal, 6(1):4, 2019

2019

[67] [67]

J. Ma, Y . Qin, Y . Li, X. Liao, Y . Guo, and R. Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion. InCoRL, pages 4190–4205, 2025

2025

[68] [68]

Kawaharazuka, T

K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng. Real- World Robot Applications of Foundation Models: A Review.AR, 38(18):1232–1254, 2024

2024

[69] [69]

Firoozi, J

R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, et al. Foundation Models in Robotics: Applications, Challenges, and the Future.IJRR, 44(5):701–739, 2025

2025

[70] [70]

Urain, A

J. Urain, A. Mandlekar, Y . Du, N. Shafiullah, D. Xu, et al. A Survey on Deep Generative Models for Robot Learning From Multimodal Demonstrations.T-RO, 42:60–79, 2026

2026

[71] [71]

J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. InNeurIPS, vol- ume 33, pages 6840–6851, 2020

2020

[72] [72]

Y . Wang, Y . He, Y . Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, et al. InternVid: A Large- scale Video-Text Dataset for Multimodal Understanding and Generation. InICLR, 2024

2024

[73] [73]

M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. InICCV, pages 1728–1738, 2021

2021

[74] [74]

D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding Human Hands in Contact at Internet Scale. InCVPR, pages 9869–9878, 2020

2020

[75] [75]

H. Xue, T. Hang, Y . Zeng, Y . Sun, B. Liu, et al. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. InCVPR, pages 5036–5045, 2022

2022

[76] [76]

Zellers, X

R. Zellers, X. Lu, J. Hessel, Y . Yu, J. Park, J. Cao, A. Farhadi, et al. MERLOT: Multimodal Neural Script Knowledge Models. InNeurIPS, volume 34, pages 23634–23651, 2021

2021

[77] [77]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, E. Byrne, et al. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. InCVPR, pages 19383–19400, 2024

2024

[78] [78]

Sivakumar, K

A. Sivakumar, K. Shaw, and D. Pathak. Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on YouTube. InRSS, 2023

2023

[79] [79]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from Human Videos as a Versatile Representation for Robotics. InCVPR, pages 13778–13790, 2023

2023

[80] [80]

Y . Yang, M. Chen, Q. Qiu, J. Wu, et al. Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts. InECCV, pages 163–180, 2024

2024