WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation

Baiqi Li; Ce Zhang; Gedas Bertasius; Mingyu Ding; Shangzhe Li; Yue Yang; Yu Fang

arxiv: 2606.26443 · v1 · pith:OWVDESD6new · submitted 2026-06-24 · 💻 cs.RO · cs.AI · cs.CV

Pith reviewed 2026-06-26 01:08 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV

keywords robot manipulation · video-based reasoning · behavior grounding · benchmark evaluation · vision language models · human intent inference · procedural reasoning · LIBERO simulator

The pith

The WatchAct benchmark shows that current robot systems achieve only 16.3% success when manipulation tasks require reasoning from human action videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WatchAct as a benchmark for robot manipulation that uses real human action videos paired with simulator tasks to test four types of reasoning: event grounding, procedural reasoning, implicit intent inference, and episodic reasoning. It uses a disentangled protocol to measure video-to-plan reasoning by vision-language models, policy execution with perfect plans, and end-to-end performance of combined systems. Results indicate that even the best current pipeline reaches just 16.3% success in simulation and 14% on a real Franka robot, while humans achieve near-perfect plan success. This setup matters because robots in human environments must interpret observed actions rather than follow direct instructions.

Core claim

WatchAct provides 3000 long-horizon instances across 14 tasks in four cognitive domains drawn from watching another agent. The benchmark aligns videos with simulator scenes and executable tasks, allowing separate measurement of planning from video, execution under oracle plans, and full task completion. Current systems remain far from solving it, with the best pipeline at 16.3% success rate in simulation and 14.0% on the real robot.

What carries the argument

The WatchAct benchmark consisting of paired human-action videos and aligned LIBERO simulator tasks, together with the disentangled evaluation protocol that isolates video-to-plan reasoning, oracle-plan execution, and integrated planner-policy performance.
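The three-way split described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's released code: the `InstanceResult` record, its field names, and the helper functions are all hypothetical, and the only thing taken from the paper is the idea of reporting plan success, oracle-plan task success, and end-to-end success separately.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark instance under the three modes (hypothetical schema)."""
    plan_correct: bool    # the VLM's plan, inferred from the video, matches the reference
    oracle_exec_ok: bool  # the policy finishes the task when handed the oracle plan
    end_to_end_ok: bool   # the policy finishes the task when handed the VLM's own plan


def success_rate(flags):
    """Return the percentage of True values, or 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def disentangled_report(results):
    """Report the three success rates side by side instead of one aggregate score."""
    return {
        "plan_sr": success_rate(r.plan_correct for r in results),
        "oracle_task_sr": success_rate(r.oracle_exec_ok for r in results),
        "end_to_end_sr": success_rate(r.end_to_end_ok for r in results),
    }


if __name__ == "__main__":
    toy = [
        InstanceResult(True, True, True),
        InstanceResult(True, False, False),
        InstanceResult(False, True, False),
        InstanceResult(False, False, False),
    ]
    print(disentangled_report(toy))
```

Keeping the three flags per instance, rather than only the final outcome, is what lets a reader tell a planning failure from an execution failure on the same episode.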

If this is right

Vision-language models attain 36.8% plan success rate on the benchmark compared to 97.1% for humans.
Policy execution reaches 21.5% task success rate under oracle plans but falls to 10.6% in out-of-domain scenarios.
The benchmark supports evaluation in both simulation and on physical robots using the same tasks.
Four distinct capability domains test parsing events, recovering procedural structure, inferring unstated intent, and tracking scene changes.
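The size of the gaps quoted in this list can be worked out directly. The sketch below uses only the percentages reported above; the helper names are illustrative, and treating 21.5% as the reference point for the out-of-domain comparison is an assumption about how the two figures relate.

```python
# Percentages quoted in the summary above.
HUMAN_PLAN_SR = 97.1
VLM_PLAN_SR = 36.8
ORACLE_TASK_SR = 21.5
OOD_TASK_SR = 10.6


def absolute_gap(a, b):
    """Difference between two rates, in percentage points."""
    return a - b


def relative_drop(before, after):
    """Drop from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Human vs. VLM planning gap, in percentage points.
    print(round(absolute_gap(HUMAN_PLAN_SR, VLM_PLAN_SR), 1))
    # Relative loss of oracle-plan task success on out-of-domain scenarios.
    print(round(relative_drop(ORACLE_TASK_SR, OOD_TASK_SR), 1))
```

Under these assumptions the planning gap is about 60 percentage points, and the out-of-domain figure is roughly half of the 21.5% reference.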

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If video-to-plan reasoning improves, it could raise overall task success when paired with better policies.
Out-of-domain drops suggest that generalization from observed behavior remains a key challenge beyond the benchmark tasks.
Alignment of video and simulator scenes may allow testing whether video understanding transfers directly to robot control.
The focus on long-horizon tasks highlights the need for models that maintain memory of past actions across multiple steps.

Load-bearing premise

The 3,000 instances and four capability domains accurately capture the cognitive demands of watching another agent, and the evaluation protocol isolates those capabilities without bias from task selection or video-simulator alignment.

What would settle it

Demonstration of a pipeline that achieves over 50% success rate on the full task completion metric across both simulation and real-robot evaluations would indicate that current systems are closer to solving the benchmark than claimed.

Figures

Figures reproduced from arXiv: 2606.26443 by Baiqi Li, Ce Zhang, Gedas Bertasius, Mingyu Ding, Shangzhe Li, Yue Yang, Yu Fang.

**Figure 1.** WatchAct is a behavior-grounded benchmark for robotic manipulation, where the robot reasons over observed human behavior and a language instruction to perform the corresponding task. Despite the importance of such reasoning, existing manipulation benchmarks rarely test for it. Standard benchmarks evaluate execution from a static scene observation [7, 8, 9, 10, 11] and do not provide temporal context about …

**Figure 2.** WatchAct data construction pipeline. Step 1 (Schema Generation): GPT-5.1 generates video recording instructions and the corresponding simulation task definition. Step 2 (Data Construction): we record the human video following the instructions and instantiate the matching LIBERO task. Step 3 (Quality Verification): each instance passes four checks: (i) Task Validity, (ii) Simulator Executability, (iii) Ora…

**Figure 3.** Spatial reasoning evaluation for VLMs. VLM performance across camera viewpoints and reference frames (human-centered vs. camera-centered), defined in Section 3.4. Camera-based references do not apply under Oblique View, so only human-based references are reported there.

**Figure 4.** Real-world task example: Fine-Grained Action. In the human video, the person vigorously …

**Figure 5.** Real-world task example: Restore Previous State. In the human video, the person removes …

**Figure 6.** Real-world task example: Imitation. In the human video, the person picks up the cup on the …

**Figure 7.** Real-world task example: Imitation. In the human video, the person moves a cup and pours …

Original abstract

A robot working alongside people must reason about what they have done, in what order, and with what intent. Video carries the spatial layouts, object histories, and gestures that language leaves underspecified, yet today's manipulation benchmarks pair an instruction with a single current image, offering no way to evaluate reasoning over observed human behavior. We introduce WatchAct, a benchmark for robot manipulation grounded in observed human behavior. Each instance pairs a real-world human-action video and a language instruction with an aligned simulator scene and an executable LIBERO task, enabling scalable and reproducible evaluation. WatchAct comprises 3,000 long-horizon instances across 14 tasks in four capability domains drawn from the cognitive demands of watching another agent: parsing events (Event Grounding), recovering procedural structure (Procedural Reasoning), inferring unstated intent (Implicit Intent Inference), and tracking how the scene was changed (Episodic Reasoning). We further propose a disentangled evaluation protocol that separately measures (i) video-to-plan reasoning by vision-language models, (ii) policy execution under oracle plans, and (iii) full task completion by integrated planner-policy pipelines. In both simulation and on a Franka Research 3 robot, current systems remain far from solving WatchAct. The best pipeline, Gemini-3.1-Pro with $\pi_{0.5}$, reaches only 16.3% Success Rate (SR) in simulation and 14.0% on the real robot. Gemini-3.1-Pro attains just 36.8% Plan SR (vs. 97.1% for humans), while $\pi_{0.5}$ reaches only 21.5% Task SR under oracle plans and drops to 10.6% on out-of-domain scenarios. Dataset and code are available at https://baiqi-li.github.io/watchact_page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WatchAct creates a new benchmark pairing human videos with aligned simulator tasks across four cognitive domains, but the low reported success rates rest on unverified alignment quality between videos and scenes.

The core of this paper is a benchmark with 3000 instances that takes real human manipulation videos, adds language instructions, and maps them to executable LIBERO simulator tasks. It organizes the tasks into four domains—Event Grounding, Procedural Reasoning, Implicit Intent Inference, and Episodic Reasoning—and uses a three-part evaluation that separates video-to-plan reasoning, oracle-plan policy execution, and full end-to-end performance.

What stands out is the shift from language-only goals to observed human behavior. Most manipulation benchmarks skip the video history, so this fills a practical gap for collaborative settings. The disentangled protocol is a clear improvement over single-number scores because it shows where the pipeline breaks: Gemini-3.1-Pro gets 36.8% on plans while the policy under oracle plans reaches only 21.5%, and the combined system sits at 16.3% in simulation and 14% on the real Franka.

The construction itself is the main soft spot. The claim that current systems are far from solving the benchmark assumes the video-simulator pairs accurately test the targeted cognitive skills without systematic mismatches in object states, timing, or intent labels. The abstract gives no alignment error rates, inter-annotator agreement on domain labels, or checks on task difficulty balance. If those alignments have hidden biases, the performance gaps partly measure data artifacts rather than model capability. The drop to 10.6% Task SR on out-of-domain (OOD) scenarios is reported but does not address the core alignment question.

This work is aimed at groups building vision-language models or policies for human-robot interaction and assistive robotics. Readers who need a dataset and protocol for testing behavior understanding will get direct value from the released code and data.

It deserves peer review because the artifact is new and the evaluation structure is useful, even though the construction details require more evidence to support the headline numbers.

Referee Report

2 major / 1 minor

Summary. The paper introduces WatchAct, a benchmark of 3,000 long-horizon robot manipulation instances that pair real human-action videos and language instructions with aligned LIBERO simulator scenes. Instances span 14 tasks across four domains (Event Grounding, Procedural Reasoning, Implicit Intent Inference, Episodic Reasoning) drawn from the cognitive demands of observing another agent. A disentangled protocol separately evaluates VLM video-to-plan reasoning, policy execution under oracle plans, and end-to-end planner-policy pipelines. Experiments on simulation and a Franka Research 3 robot show current systems remain far from solving the benchmark, with the best pipeline (Gemini-3.1-Pro + π0.5) reaching 16.3% SR in simulation and 14.0% on the real robot; Gemini-3.1-Pro achieves 36.8% Plan SR (vs. 97.1% human) and π0.5 reaches 21.5% Task SR under oracle plans, dropping to 10.6% OOD. Dataset and code are released.

Significance. If the video-simulator alignments and task selection are valid, the benchmark and disentangled protocol would provide a reproducible, scalable testbed that isolates planning versus execution failures in behavior-grounded manipulation, a clear advance over instruction-only benchmarks. The public release of the 3,000 instances, code, and real-robot results strengthens potential impact for the field.

major comments (2)

[§3 (dataset construction) and abstract] The central claim that 'current systems remain far from solving WatchAct' (16.3% sim SR / 14.0% real SR) is load-bearing on the assumption that the 3,000 instances and four domains accurately capture the targeted cognitive demands without systematic bias from video-simulator misalignment, action ordering errors, or intent labeling. The manuscript reports no quantitative alignment error statistics, inter-annotator agreement on domain labels, or verification metrics for the 3,000 instances beyond the drop to 10.6% Task SR on OOD scenarios.
[Evaluation protocol description and results tables] The disentangled protocol (VLM Plan SR, oracle-plan policy SR, end-to-end SR) is presented as cleanly isolating capabilities, yet without reported checks on whether simulator state mismatches or video alignment failures differentially affect plan SR (36.8%) versus policy SR (21.5%), the attribution of low end-to-end performance to specific capability gaps remains unverified.

minor comments (1)

[§3] Clarify the exact mapping between the 14 tasks and the four capability domains in the main text or a table for reader traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on dataset construction and the evaluation protocol. We respond to each major comment below and indicate where revisions will be made.

Referee: [§3 (dataset construction) and abstract] The central claim that 'current systems remain far from solving WatchAct' (16.3% sim SR / 14.0% real SR) is load-bearing on the assumption that the 3,000 instances and four domains accurately capture the targeted cognitive demands without systematic bias from video-simulator misalignment, action ordering errors, or intent labeling. The manuscript reports no quantitative alignment error statistics, inter-annotator agreement on domain labels, or verification metrics for the 3,000 instances beyond the drop to 10.6% Task SR on OOD scenarios.

Authors: We agree that explicit quantitative verification metrics would strengthen the central claim. The reported human Plan SR of 97.1% and the OOD result (Task SR falling to 10.6%) already provide evidence that the instances are well-defined and test the intended cognitive demands rather than alignment artifacts. However, to directly address the concern about potential biases, we will add to the revised manuscript: inter-annotator agreement on domain labels for a sampled subset and quantitative alignment error statistics from manual inspection of 100 random instances. These additions will be reported in §3.
revision: yes
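For readers unfamiliar with how inter-annotator agreement on the four domain labels might be quantified, the sketch below computes Cohen's kappa for two annotators. The rebuttal does not say which statistic would be used, so the choice of kappa is an assumption, and the label abbreviations (`EG`, `PR`, `III`, `ER`) are hypothetical shorthand for the four domains.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["EG", "PR", "III", "ER", "EG", "PR"]
    annotator_2 = ["EG", "PR", "III", "ER", "PR", "PR"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, which matters when one domain holds most of the instances and two annotators would agree often by default.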
Referee: [Evaluation protocol description and results tables] The disentangled protocol (VLM Plan SR, oracle-plan policy SR, end-to-end SR) is presented as cleanly isolating capabilities, yet without reported checks on whether simulator state mismatches or video alignment failures differentially affect plan SR (36.8%) versus policy SR (21.5%), the attribution of low end-to-end performance to specific capability gaps remains unverified.

Authors: Oracle plans are generated from the aligned simulator states and manually verified for executability, which limits differential impact from alignment failures on policy SR. The observed drops (plan SR 36.8% → oracle-policy SR 21.5% → end-to-end 16.3%) are consistent with separate gaps in reasoning and control. We nevertheless acknowledge the value of explicit verification. In the revision we will add a failure-mode analysis that categorizes errors by source (plan, execution, or alignment) to confirm the protocol's isolation holds.
revision: yes
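A failure-mode analysis of the kind promised here could be tallied as follows. This is a sketch under stated assumptions: the three categories come from the response above, but the classification rule (alignment problems take priority, then plan errors, with everything else attributed to execution) and the tuple layout of each episode are hypothetical choices made for illustration.

```python
from collections import Counter
from enum import Enum


class FailureSource(Enum):
    PLAN = "plan"
    EXECUTION = "execution"
    ALIGNMENT = "alignment"


def classify_failure(alignment_ok, plan_correct):
    """Assign one source to a failed end-to-end episode (hypothetical priority rule)."""
    if not alignment_ok:
        # The video and simulator scene disagree, so the episode is not a fair test.
        return FailureSource.ALIGNMENT
    if not plan_correct:
        return FailureSource.PLAN
    # Alignment and plan were both fine, so the policy failed to execute.
    return FailureSource.EXECUTION


def tally(failed_episodes):
    """Return the percentage of failures per source; empty dict if there are none."""
    counts = Counter(classify_failure(*ep) for ep in failed_episodes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {src.value: 100.0 * counts[src] / total for src in FailureSource}


if __name__ == "__main__":
    # Each tuple is (alignment_ok, plan_correct) for one failed episode.
    failures = [(False, True), (True, False), (True, True), (True, True)]
    print(tally(failures))
```

A large alignment share in such a tally would support the referee's concern, while a small one would support the authors' reading of the three reported rates.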

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

The paper presents a new benchmark (WatchAct) consisting of 3,000 video-simulator paired instances and reports empirical success rates for existing VLMs and policies. No equations, fitted parameters, predictions, or uniqueness theorems are claimed; the central result (16.3% sim / 14.0% real SR) is a direct measurement rather than a reduction to self-referential inputs. The disentangled protocol (plan SR, oracle-plan SR, end-to-end SR) is defined by construction from the benchmark instances, with no load-bearing self-citation and no smuggled-in ansatz. This is a standard empirical benchmark paper whose claims rest on external model performance, not internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark paper; no physical constants, fitted parameters, or new theoretical entities are introduced. The central contribution is the dataset and protocol itself.

Reference graph

Works this paper leans on

52 extracted references · 2 canonical work pages

[1] M. Ciocarlie, K. Hsiao, A. Leeper, and D. Gossow. Mobile manipulation through an assistive home robot. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5313–5320. IEEE, 2012.

[2] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: Personalized robot assistance with large language models. Autonomous Robots, 47(8):1087–1102, 2023.

[3] E. A. Sisbot, L. F. Marin-Urias, R. Alami, and T. Simeon. A human aware mobile robot motion planner. IEEE Transactions on Robotics, 23(5):874–883, 2007.

[4] J. M. Zacks and K. M. Swallow. Event segmentation. Current directions in psychological science, 16(2):80–84, 2007.

[5] G. Csibra and G. Gergely. Natural pedagogy. Trends in Cognitive Sciences, 13(4):148–153, 2009. doi:10.1016/j.tics.2009.01.005

[6] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009. doi:10.1016/j.cognition.2009.07.005

[7] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

[8] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.

[9] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.

[10] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025.

[11] X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025.

[12] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan. Vima: General robot manipulation with multimodal prompts. In Fortieth International Conference on Machine Learning, 2023.

[13] B. Wang, J. Zhang, S. Dong, I. Fang, and C. Feng. Vlm see, robot do: Human demo video to robot action plan via vision language model. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 17215–17222. IEEE, 2025.

[14] J. Son, J. Kim, K. Kam, J. Coholich, S. J. Kim, J. Kim, C. D. Kim, J. Cho, D. Fox, and Z. Kira. Seetraceact: Visibility-aware latent planning from cross-embodiment demonstration videos. arXiv preprint arXiv:2606.02745, 2026.

[15] Y. Dai, H. Fu, J. Lee, Y. Liu, H. Zhang, J. Yang, C. Finn, N. Fazeli, and J. Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies. arXiv preprint arXiv:2603.04639, 2026.

[16] Y. Hong, J. Liu, H. Yin, M. Li, L. Guibas, L. Fei-Fei, J. Wu, and Y. Choi. ESI-Bench: Towards embodied spatial intelligence that closes the perception–action loop. arXiv preprint arXiv:2605.18746, 2026.

[17] Google. Gemini 3.1 pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-29.

[18] Franka Robotics. Franka research 3. https://franka.de/franka-research-3, 2026. Accessed: 2026-05-29.

[19] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martin-Martin, C. Wang, G. Levine, W. Ai, B. Martinez, H. Yin, M. Lingelbach, M. Hwang, A. Hiranaka, S. Garlanka, A. Aydin, S. Lee, J. Sun, M. Anvari, M. Sharma, D. Bansal, S. Hunter, K.-Y. Kim, A. Lou, C. R. Matthews, I. Villa-Renteria, J. H. Tang, C. Tang, F. Xia, Y. Li, S. Savarese, H. Gweon, C. ... (arXiv, 2024)

[20] S. Zhang, Z. Xu, P. Liu, X. Yu, Y. Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y.-G. Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025.

[21] E. Cherepanov, N. Kachaev, A. K. Kovalev, and A. I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning. arXiv preprint arXiv:2502.10550, 2025.

[22] S. Han, B. Qiu, Y. Liao, S. Huang, C. Gao, S. Yan, and S. Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation. arXiv preprint arXiv:2506.06677, 2025.

[23] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.

[24] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.

[25] T. Chen, Y. Wang, M. Li, Y. Qin, H. Shi, Z. Li, Y. Hu, Y. Zhang, K. Wang, Y. Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design. arXiv preprint arXiv:2603.01229, 2026.

[26] H. Fang, M. Grotz, W. Pumacay, Y. R. Wang, D. Fox, R. Krishna, and J. Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025.

[27] J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. arXiv preprint arXiv:2410.11792, 2024.

[28] N. Heppert, M. Argus, T. Welschehold, T. Brox, and A. Valada. Ditto: Demonstration imitation by trajectory transformation. In 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 7565–7572. IEEE, 2024.

[29] S. Park, H. Bharadhwaj, and S. Tulsiani. Demodiffusion: One-shot human imitation using pre-trained diffusion policy. arXiv preprint arXiv:2506.20668, 2025.

[30] C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar. MimicPlay: Long-horizon imitation learning by watching human play. In Conference on Robot Learning (CoRL), 2023.

[31] R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Martín-Martín, and Y. Zhu. Mimicdroid: In-context learning for humanoid manipulation from human play videos. 2026.

[32] Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[33] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[34] L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025.

[35] Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. HAMSTER: Hierarchical action models for open-world robot manipulation. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[36] J. Zhang, Y. Guo, X. Chen, Y.-J. Wang, Y. Hu, C. Shi, and J. Chen. HiRT: Enhancing robotic control with hierarchical robot transformers. In Proceedings of the 8th Conference on Robot Learning (CoRL), volume 270 of PMLR, pages 933–946, 2025.

[37] Y. Yang, S. Gao, Q. Bu, L. Chen, and D. N. Metaxas. Seeing farther and smarter: Value-guided multi-path reflection for vlm policy optimization. arXiv preprint arXiv:2602.19372, 2026.

[38] C. A. Kurby and J. M. Zacks. Segmentation in the perception and memory of events. Trends in cognitive sciences, 12(2):72–79, 2008.

[39] G. D. Abowd, A. K. Dey, P. J. Brown, N. Davies, M. Smith, and P. Steggles. Towards a better understanding of context and context-awareness. In International symposium on handheld and ubiquitous computing, pages 304–307. Springer, 1999.

[40] N. Sebanz, H. Bekkering, and G. Knoblich. Joint action: bodies and minds moving together. Trends in cognitive sciences, 10(2):70–76, 2006.

[41] R. A. Zwaan, M. C. Langston, and A. C. Graesser. The construction of situation models in narrative comprehension: An event-indexing model. Psychological science, 6(5):292–297, 1995.

[42] R. C. Schank and R. P. Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology press, 2013.

[43] OpenAI. Gpt-5.1 model. https://developers.openai.com/api/docs/models/gpt-5.1, 2026. Accessed: 2026-05-29.

[44] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[45] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[46] L. Graesser and P. Xu. Gemini robotics-er 1.6: Powering real-world robotics tasks through enhanced embodied reasoning. https://deepmind.google/blog/gemini-robotics-er-1-6/, Apr. Google DeepMind blog.

[47] M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.

[48] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.

[49] L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

[50] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

Appendix 1 Benchmark Details

Benchmark Statistics. WatchAct comprises 3,000 instances across 14 tasks organized into fo...

[1] [1]

Ciocarlie, K

M. Ciocarlie, K. Hsiao, A. Leeper, and D. Gossow. Mobile manipulation through an assistive home robot. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5313–5320. IEEE, 2012

2012

[2] [2]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: Personalized robot assistance with large language models.Autonomous Robots, 47(8):1087–1102, 2023

2023

[3] [3]

E. A. Sisbot, L. F. Marin-Urias, R. Alami, and T. Simeon. A human aware mobile robot motion planner.IEEE Transactions on Robotics, 23(5):874–883, 2007

2007

[4] [4]

J. M. Zacks and K. M. Swallow. Event segmentation.Current directions in psychological science, 16(2):80–84, 2007

2007

[5] [5]

Csibra and G

G. Csibra and G. Gergely. Natural pedagogy.Trends in Cognitive Sciences, 13(4):148–153,

[6] [6]

doi:10.1016/j.tics.2009.01.005

work page doi:10.1016/j.tics.2009.01.005 2009

[7] [7]

C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning.Cognition, 113(3):329–349, 2009. doi:10.1016/j.cognition.2009.07.005

work page doi:10.1016/j.cognition.2009.07.005 2009

[8] [8]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

2023

[9] [9]

Nasiriany, A

S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

Pith/arXiv arXiv 2024

[10] [10]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

2022

[11] [11]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In- depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

Pith/arXiv arXiv 2025

[12] [12]

X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization.arXiv preprint arXiv:2510.03827, 2025

Pith/arXiv arXiv 2025

[13] [13]

Jiang, A

Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan. Vima: General robot manipulation with multimodal prompts. InFortieth International Conference on Machine Learning, 2023

2023

[14] [14]

B. Wang, J. Zhang, S. Dong, I. Fang, and C. Feng. Vlm see, robot do: Human demo video to robot action plan via vision language model. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 17215–17222. IEEE, 2025

2025

[15] [15]

J. Son, J. Kim, K. Kam, J. Coholich, S. J. Kim, J. Kim, C. D. Kim, J. Cho, D. Fox, and Z. Kira. Seetraceact: Visibility-aware latent planning from cross-embodiment demonstration videos. arXiv preprint arXiv:2606.02745, 2026. 9

Pith/arXiv arXiv 2026

[16] [16]

Y . Dai, H. Fu, J. Lee, Y . Liu, H. Zhang, J. Yang, C. Finn, N. Fazeli, and J. Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies.arXiv preprint arXiv:2603.04639, 2026

Pith/arXiv arXiv 2026

[17] [17]

Y . Hong, J. Liu, H. Yin, M. Li, L. Guibas, L. Fei-Fei, J. Wu, and Y . Choi. ESI-Bench: Towards embodied spatial intelligence that closes the perception–action loop.arXiv preprint arXiv:2605.18746, 2026

Pith/arXiv arXiv 2026

[18] [18]

Gemini 3.1 pro

Google. Gemini 3.1 pro. https://blog.google/innovation-and-ai/ models-and-research/gemini-models/gemini-3-1-pro/ , 2026. Accessed: 2026-05- 29

2026

[19] [19]

Franka research 3

Franka Robotics. Franka research 3. https://franka.de/franka-research-3, 2026. Accessed: 2026-05-29

2026

[20] [20]

C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martin-Martin, C. Wang, G. Levine, W. Ai, B. Martinez, H. Yin, M. Lingelbach, M. Hwang, A. Hiranaka, S. Garlanka, A. Aydin, S. Lee, J. Sun, M. Anvari, M. Sharma, D. Bansal, S. Hunter, K.-Y . Kim, A. Lou, C. R. Matthews, I. Villa-Renteria, J. H. Tang, C. Tang, F. Xia, Y . Li, S. Savarese, H. Gweon, C. ...

Pith/arXiv arXiv 2024

[21]

S. Zhang, Z. Xu, P. Liu, X. Yu, Y. Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y.-G. Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025.

[22]

E. Cherepanov, N. Kachaev, A. K. Kovalev, and A. I. Panov. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning. arXiv preprint arXiv:2502.10550, 2025.

[23]

S. Han, B. Qiu, Y. Liao, S. Huang, C. Gao, S. Yan, and S. Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation. arXiv preprint arXiv:2506.06677, 2025.

[24]

M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.

[25]

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.

[26]

T. Chen, Y. Wang, M. Li, Y. Qin, H. Shi, Z. Li, Y. Hu, Y. Zhang, K. Wang, Y. Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design. arXiv preprint arXiv:2603.01229, 2026.

[27]

H. Fang, M. Grotz, W. Pumacay, Y. R. Wang, D. Fox, R. Krishna, and J. Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025.

[28]

J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. arXiv preprint arXiv:2410.11792, 2024.

[29]

N. Heppert, M. Argus, T. Welschehold, T. Brox, and A. Valada. Ditto: Demonstration imitation by trajectory transformation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7565–7572. IEEE, 2024.

[30]

S. Park, H. Bharadhwaj, and S. Tulsiani. Demodiffusion: One-shot human imitation using pre-trained diffusion policy. arXiv preprint arXiv:2506.20668, 2025.

[31]

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar. MimicPlay: Long-horizon imitation learning by watching human play. In Conference on Robot Learning (CoRL), 2023.

[32]

R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Martín-Martín, and Y. Zhu. Mimicdroid: In-context learning for humanoid manipulation from human play videos. 2026.

[33]

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[34]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[35]

L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025.

[36]

Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. HAMSTER: Hierarchical action models for open-world robot manipulation. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[37]

J. Zhang, Y. Guo, X. Chen, Y.-J. Wang, Y. Hu, C. Shi, and J. Chen. HiRT: Enhancing robotic control with hierarchical robot transformers. In Proceedings of the 8th Conference on Robot Learning (CoRL), volume 270 of PMLR, pages 933–946, 2025.

[38]

Y. Yang, S. Gao, Q. Bu, L. Chen, and D. N. Metaxas. Seeing farther and smarter: Value-guided multi-path reflection for vlm policy optimization. arXiv preprint arXiv:2602.19372, 2026.

[39]

C. A. Kurby and J. M. Zacks. Segmentation in the perception and memory of events. Trends in cognitive sciences, 12(2):72–79, 2008.

[40]

G. D. Abowd, A. K. Dey, P. J. Brown, N. Davies, M. Smith, and P. Steggles. Towards a better understanding of context and context-awareness. In International symposium on handheld and ubiquitous computing, pages 304–307. Springer, 1999.

[41]

N. Sebanz, H. Bekkering, and G. Knoblich. Joint action: bodies and minds moving together. Trends in cognitive sciences, 10(2):70–76, 2006.

[42]

R. A. Zwaan, M. C. Langston, and A. C. Graesser. The construction of situation models in narrative comprehension: An event-indexing model. Psychological science, 6(5):292–297, 1995.

[43]

R. C. Schank and R. P. Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology press, 2013.

[44]

OpenAI. Gpt-5.1 model. https://developers.openai.com/api/docs/models/gpt-5.1, 2026. Accessed: 2026-05-29.

[45]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[46]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[47]

L. Graesser and P. Xu. Gemini robotics-er 1.6: Powering real-world robotics tasks through enhanced embodied reasoning. https://deepmind.google/blog/gemini-robotics-er-1-6/, Apr

[48]

Google DeepMind blog

[49]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.

[50]

Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.

[51]

L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

[52]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

Appendix 1 Benchmark Details

Benchmark Statistics. WatchAct comprises 3,000 instances across 14 tasks organized into fo...