ActiveMimic: Egocentric Video Pretraining with Active Perception

Guojin Zhong; Tianyi Lu; Xingyao Lin; Yichen Zhu; Yu-Gang Jiang; Ziyi Ye; Zuxuan Wu

arxiv: 2606.06194 · v1 · pith:RCZB4WOXnew · submitted 2026-06-04 · 💻 cs.RO · cs.CV

ActiveMimic: Egocentric Video Pretraining with Active Perception

Xingyao Lin , Guojin Zhong , Tianyi Lu , Ziyi Ye , Yichen Zhu , Zuxuan Wu , Yu-Gang Jiang This is my paper

Pith reviewed 2026-06-28 01:16 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords egocentric videorobot pretrainingactive perceptioncamera motionviewpoint actionmanipulation learningtrajectory recovery

0 comments

The pith

Recovering synchronized camera and wrist trajectories from egocentric human video enables pretraining that matches robot-data models on manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the performance gap between human-video and robot-data pretraining arises because standard methods ignore active perception—the intentional camera movements humans make while manipulating objects. ActiveMimic extracts these movements as viewpoint actions alongside wrist actions from ordinary body-worn RGB footage, then pretrains a joint model of perception and manipulation before adapting it to a robot. Real-world tests on tasks with different viewpoint demands show the resulting policies beat other human-video methods and reach parity with state-of-the-art robot-data methods. Analysis indicates the active-perception skill comes from the human-video stage rather than later robot fine-tuning.

Core claim

The central claim is that active perception signals latent in egocentric human videos can be recovered as synchronized camera-wrist trajectories from a single RGB camera, modeled explicitly as viewpoint actions, and used to pretrain policies that learn both perception and manipulation jointly; when adapted to robots, these policies close the gap with robot-data pretraining across tasks that vary in active-perception demands.

What carries the argument

The ActiveMimic framework that recovers synchronized camera and wrist trajectories from single RGB video and treats camera motion as an explicit viewpoint action during joint pretraining of perception and manipulation.

If this is right

Pretraining can now draw on far larger pools of everyday human video rather than scarce robot interaction data.
The active-perception component transfers from human-video pretraining and does not require robot-specific fine-tuning to appear in the final policy.
Policies become effective on tasks whose success depends on deliberate viewpoint adjustment during manipulation.
The same trajectory-recovery step can be applied to new egocentric datasets without additional instrumentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the trajectory extraction works on internet-scale egocentric video, pretraining corpora could grow by orders of magnitude beyond current robot datasets.
The same viewpoint-action modeling might apply to non-manipulation domains such as navigation or inspection where camera motion is also goal-directed.
Embodiment differences between human and robot hands may still limit transfer even after active perception is aligned.

Load-bearing premise

That the performance difference between human-video and robot-data pretraining is caused by the lack of an explicit active-perception signal that can be accurately recovered from unsynchronized single-camera footage without extra sensors or viewpoint labels.

What would settle it

A controlled test in which the camera-motion modeling component is removed or replaced with random viewpoint noise yet performance still matches robot-data baselines, or in which the full method is applied to videos containing no recoverable active-perception signal yet still matches those baselines.

Figures

Figures reproduced from arXiv: 2606.06194 by Guojin Zhong, Tianyi Lu, Xingyao Lin, Yichen Zhu, Yu-Gang Jiang, Ziyi Ye, Zuxuan Wu.

**Figure 1.** Figure 1: ActiveMimic acquires active perception from in-the-wild egocentric human video and transfers it to real-world humanoid robots. Left to center: egocentric camera motion and wrist action together form a 27-dimensional unified action representation that enables the model to jointly learn active perception and manipulation. Center to right: active perception is transferred to a humanoid robot, which reposition… view at source ↗

**Figure 2.** Figure 2: Overview of ActiveMimic. Left: recovering synchronized camera and wrist trajectories from a single body-worn RGB camera. Middle: resolving camera-wrist coupling and encoding as a unified 27D action. Right: pretraining on the 27D action to jointly model active perception and manipulation, then adapting to the target robot. 2 Related Work Learning from human videos Human videos offer a cheaper, more scalable… view at source ↗

**Figure 3.** Figure 3: Real-world tasks. (a) Restocking: the robot crouches to pick up a water bottle from the table, then stands and looks up to scan the shelf for an empty slot and places it. (b) Reaching: the robot stands up and leans over an obstacle to reach the target object behind it. (c) Finding: the robot turns its head left or right to locate a yogurt and grasps it with the corresponding arm. (d) Pouring: the robot use… view at source ↗

**Figure 4.** Figure 4: Real-world results. (a) Success rate: end-to-end success rate (%) on the four real-world tasks. (b) Restocking points: average points per trial on Restocking, with one point awarded for picking up the bottle and one for placing it on the shelf. (a) Recovery Rate (b) Frames and the Corresponding 3D Trajectories [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Dataset characterization. Left: recovery rates of predicted head and wrist poses on HOT3D at three tolerance tiers. Right: for two HOT3D videos, predicted wrist projections on a sampled frame and 3D chunk trajectories starting from that frame. across the board, confirming that camera motion supervision during egocentric pretraining is the key differentiating factor. MotoVLA, which leverages a large mixed c… view at source ↗

**Figure 6.** Figure 6: Analysis experiments. (a) Scores on Restocking for crouching to grasp the bottle (Pts1) and looking up to place it (Pts2). (b) Per-layer overlap (%) of the top-10% activated units under head-view vs. full-view inputs for ActiveMimic and ActiveMimicwrist-only. trajectories closely follow ground-truth trends on sampled HOT3D episodes (Fig. 5b). Together, these results ( [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗

**Figure 7.** Figure 7: Pretraining corpus statistics. Word cloud of (a) action verbs and (b) manipulated objects in the final pretraining corpus after filtering, showing broad coverage of manipulation actions and object categories. a reasonable operating point before joint optimization. The full training phase then unfreezes all parameters and trains the entire model end-to-end. The robot-specific training stage initializes from… view at source ↗

**Figure 8.** Figure 8: Robustness evaluation setup. (a) Restocking under alternating red, green, and blue flashing light. (b) Finding with two unseen yogurt variants (different packaging, identical shape and size) not present in training demonstrations. The training yogurt is shown at the top; the two unseen variants are shown below. Finding with unseen objects. We replace the training yogurt with two unseen yogurt variants (dif… view at source ↗

**Figure 9.** Figure 9: Robustness evaluation. (a) Restocking under alternating red/green/blue flashing light. (b) Finding with unseen yogurt objects (different packaging, identical shape and size). Solid bars denote in-domain (normal) conditions; hatched bars denote out-of-domain conditions. ActiveMimic achieves the highest success rate under both perturbations and exhibits the smallest absolute drop among all models [PITH_FULL… view at source ↗

**Figure 10.** Figure 10: Representative failure cases of ActiveMimic without the head camera on Restocking. All three failures occur at the placement point. From left to right: (1) correct shelf tier and lateral position, but the placement motion is imprecise and knocks over the shelf; (2) correct tier, wrong lateral position; (3) wrong tier entirely. All three stem from severing the visual loop that the pretrained model relies … view at source ↗

**Figure 11.** Figure 11: K sensitivity analysis for representational transfer. Top-K% activation overlap between full-view and head-view inference conditions for ActiveMimic and ActiveMimicwrist-only across all action-expert layers, evaluated at K = 5, 10, 15, 20. The shaded area indicates the advantage of ActiveMimic over ActiveMimicwrist-only. ActiveMimic maintains consistently higher overlap across all K values, confirming th… view at source ↗

**Figure 12.** Figure 12: Prompt used for VLM-based temporal segmentation. The model identifies manipulation segments from egocentric video and outputs structured annotations including a natural-language task instruction that serves as the language prompt during pretraining. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt used for LLM-based semantic filtering. The model retains only segments involving indoor hand-object manipulation of artificial objects. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

read the original abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ActiveMimic claims to close the human-video vs robot-data gap by recovering camera-wrist trajectories and treating motion as viewpoint actions, but the abstract supplies no numbers or recovery details to support it.

read the letter

The central idea is to recover synchronized camera and wrist trajectories from single RGB egocentric video, model the camera motion as a viewpoint action, and pretrain jointly on active perception and manipulation before robot adaptation.

This framing of recovered motion as an explicit learnable signal is the main novelty relative to prior human-video baselines. The paper correctly identifies that standard pipelines discard the very camera motion that comes from humans actively repositioning their viewpoint during tasks.

The abstract states that real-world experiments across tasks show ActiveMimic beating human-video baselines and matching robot-data models, plus an analysis attributing the capability to the pretraining rather than fine-tuning. That is a clean empirical claim if the numbers hold.

The soft spot is exactly the one in the stress-test note: the abstract gives no description of the trajectory estimation method, no error metrics against ground truth, and no ablation separating trajectory quality from other choices. Monocular recovery under manipulation blur and occlusion is known to be noisy; if the recovered actions contain systematic errors, the reported gains could come from regularization or data effects instead of active perception. The origin analysis inherits the same dependency.

The work is aimed at researchers scaling imitation learning or video world models with abundant human data rather than scarce robot demonstrations. Anyone already running egocentric video pretraining would want to see the full methods and results.

It deserves peer review so the trajectory recovery procedure and the quantitative comparisons can be checked directly. The idea is worth testing even if the current presentation leaves the key assumption unverified.

Referee Report

2 major / 2 minor

Summary. The paper introduces ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from monocular body-worn RGB egocentric human videos, treats camera motion as explicit viewpoint actions, and jointly pretrains active perception with manipulation before robot adaptation. It claims this closes the performance gap with robot-data pretraining, with real-world experiments showing consistent superiority over human-video baselines and parity with SOTA robot-data models, plus analysis attributing the active-perception capability to the human-video stage rather than robot fine-tuning.

Significance. If the trajectory recovery is verifiably accurate and the gains are causally tied to the active-perception signal, the result would be significant for scalable robot learning: it would demonstrate how abundant in-the-wild egocentric video can substitute for scarce robot data by explicitly modeling viewpoint control, with direct implications for manipulation tasks requiring active sensing.

major comments (2)

[§3] §3 (trajectory recovery subsection): the method for extracting synchronized camera-wrist trajectories from a single monocular RGB stream is presented without any reported pose-estimation error metrics, ground-truth comparisons, or robustness analysis under manipulation blur/occlusion; this is load-bearing because the central claim attributes the human-vs-robot gap specifically to the absence of an accurate active-perception signal that is now recovered.
[§5.2] §5.2 (origin analysis): the evidence that active-perception capability 'originates from egocentric human video pretraining rather than robot-specific fine-tuning' lacks an ablation that varies trajectory estimation noise or substitutes noisy vs. clean viewpoint actions; without it the causal attribution cannot be isolated from incidental regularization or data-filtering effects.

minor comments (2)

[Abstract] Abstract: quantitative results, dataset sizes, task counts, and error bars are omitted, making it difficult for readers to gauge the scale of the reported improvements.
[§3] Notation in §3: the distinction between recovered 'viewpoint action' and raw optical flow is not made explicit in the equations, risking confusion with standard video-pretraining pipelines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [§3] §3 (trajectory recovery subsection): the method for extracting synchronized camera-wrist trajectories from a single monocular RGB stream is presented without any reported pose-estimation error metrics, ground-truth comparisons, or robustness analysis under manipulation blur/occlusion; this is load-bearing because the central claim attributes the human-vs-robot gap specifically to the absence of an accurate active-perception signal that is now recovered.

Authors: We agree that quantitative validation of trajectory recovery would strengthen the presentation. Ground-truth comparisons are infeasible for the in-the-wild egocentric videos, which lack synchronized motion-capture data. We will add (i) accuracy metrics on synthetic sequences with known ground truth and (ii) a robustness analysis under simulated blur and occlusion in the revised manuscript. revision: yes
Referee: [§5.2] §5.2 (origin analysis): the evidence that active-perception capability 'originates from egocentric human video pretraining rather than robot-specific fine-tuning' lacks an ablation that varies trajectory estimation noise or substitutes noisy vs. clean viewpoint actions; without it the causal attribution cannot be isolated from incidental regularization or data-filtering effects.

Authors: The §5.2 analysis isolates the contribution of active-perception modeling via controlled pretraining ablations. We acknowledge that an explicit noise-level ablation on the recovered trajectories would provide stronger causal isolation. We will add this ablation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons

full rationale

The paper presents an empirical pretraining framework and reports real-world robot task results comparing human-video pretraining (with recovered trajectories) against baselines. No equations, fitted parameters, or derivation steps are described that reduce by construction to the inputs (e.g., no self-definitional recovery of trajectories or predictions forced by prior fits). Attribution of gains to active perception is supported by ablation-style analysis rather than a closed mathematical loop. Self-citations, if present, are not load-bearing for the central empirical result. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes that collapse to prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities used in the method.

pith-pipeline@v0.9.1-grok · 5730 in / 1131 out tokens · 37304 ms · 2026-06-28T01:16:59.391003+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

Reference graph

Works this paper leans on

64 extracted references · 10 linked inside Pith · cited by 1 Pith paper

[1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023

2023
[2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

2025
[3]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[4]

Bjorck, F

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[5]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InICLR, 2025

2025
[6]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[7]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InICCV, 2023

2023
[8]

Lipman, R

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InICLR, 2023

2023
[9]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InICRA, 2024

2024
[10]

Y . Liu, W. C. Shin, Y . Han, Z. Chen, H. Ravichandar, and D. Xu. Immimic: Cross-domain imitation from human videos via mapping and interpolation. InCoRL, 2025

2025
[11]

Cai, R.-Z

X. Cai, R.-Z. Qiu, G. Chen, L. Wei, I. Liu, T. Huang, X. Cheng, and X. Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

arXiv 2025
[12]

Spiridonov, J.-N

A. Spiridonov, J.-N. Zaech, N. Nikolov, L. Van Gool, and D. P. Paudel. Generalist robot manipulation beyond action labeled data. InCoRL, 2025

2025
[13]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, and S. Mori. Developing vision-language-action model from egocentric videos.arXiv preprint arXiv:2509.21986, 2025

arXiv 2025
[14]

R. Bajcsy. Active perception.Proceedings of the IEEE, 1988

1988
[15]

Bajcsy, Y

R. Bajcsy, Y . Aloimonos, and J. K. Tsotsos. Revisiting active perception.Autonomous Robots, 2018

2018
[16]

Aloimonos, I

J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision.IJCV, 1988

1988
[17]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

Pith/arXiv arXiv 2025
[18]

L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. Emma: Scaling mobile manipulation via egocentric human data.RAL, 2026. 9

2026
[19]

M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

Pith/arXiv arXiv 2026
[20]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy∼human policy. InCoRL, 2025

2025
[21]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

2022
[22]

H. Luo, Y . Wang, W. Zhang, S. Zheng, Z. Xi, C. Xu, H. Xu, H. Yuan, C. Zhang, Y . Wang, et al. Being-h0. 5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

arXiv 2026
[23]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. InICRA, 2025

2025
[24]

Bircher, M

A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart. Receding horizon” next- best-view” planner for 3d exploration. InICRA, 2016

2016
[25]

Breyer, L

M. Breyer, L. Ott, R. Siegwart, and J. J. Chung. Closed-loop next-best-view planning for target-driven grasping. InIROS, 2022

2022
[26]

Connolly

C. Connolly. The determination of next best views. InICRA, 1985

1985
[27]

Krainin, B

M. Krainin, B. Curless, and D. Fox. Autonomous generation of complete 3d object models using next best view manipulation planning. InICRA, 2011

2011
[28]

Naazare, F

M. Naazare, F. G. Rosas, and D. Schulz. Online next-best-view planner for 3d-exploration and inspection with a mobile manipulator robot.RAL, 2022

2022
[29]

Zhang, D

X. Zhang, D. Wang, S. Han, W. Li, B. Zhao, Z. Wang, X. Duan, C. Fang, X. Li, and J. He. Affordance-driven next-best-view planning for robotic grasping. InCoRL, 2023

2023
[30]

J. Yu, Y . Shentu, D. Wu, P. Abbeel, K. Goldberg, and P. Wu. Egomi: Learning active vi- sion and whole-body manipulation from egocentric human demonstrations.arXiv preprint arXiv:2511.00153, 2025

arXiv 2025
[31]

Xiong, X

H. Xiong, X. Xu, J. Wu, Y . Hou, J. Bohg, and S. Song. Vision in action: Learning active perception from human demonstrations. InCoRL, 2025

2025
[32]

Q. Zeng, C. Li, J. S. John, Z. Zhou, J. Wen, G. Feng, Y . Zhu, and Y . Xu. Activeumi: Robotic manipulation with active perception from robot-free human demonstrations.arXiv preprint arXiv:2510.01607, 2025

arXiv 2025
[33]

Chuang, A

I. Chuang, A. Lee, D. Gao, M.-M. Naddaf-Sh, and I. Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. InICRA, 2025

2025
[34]

J. Kerr, K. Hari, E. Weber, C. M. Kim, B. Yi, K. Goldberg, A. Kanazawa, et al. Eye, robot: Learning to look to act with a bc-rl perception-action loop. InCoRL, 2025

2025
[35]

Cheng, J

X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback. InCoRL, 2025

2025
[36]

M. Liu, E. Zhou, C. Chi, Y . Han, S. Rong, L. Chen, P. Wang, Z. Wang, and S. Zhang. Sapave: Towards active perception and manipulation in vision-language-action models for robotics. arXiv preprint arXiv:2603.12193, 2026. 10

arXiv 2026
[37]

Kareer, K

S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414, 2025

arXiv 2025
[38]

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life hu- man activity videos.arXiv preprint arXiv:2510.21571, 2025

arXiv 2025
[39]

X. Yang, D. Kukreja, D. Pinkus, A. Sagar, T. Fan, J. Park, S. Shin, J. Cao, J. Liu, N. Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026

arXiv 2026
[40]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025
[41]

Piccinelli, Y .-H

L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Segu, S. Li, L. Van Gool, and F. Yu. Unidepth: Universal monocular metric depth estimation. InCVPR, 2024

2024
[42]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InCVPR, 2019

2019
[43]

Liang, L

W. Liang, L. Yu, L. Luo, S. Iyer, N. Dong, C. Zhou, G. Ghosh, M. Lewis, W.-t. Yih, L. Zettle- moyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Pith/arXiv arXiv 2024
[44]

S. Zhao, X. Zhang, J. Guo, J. Hu, L. Duan, M. Fu, Y . X. Chng, G.-H. Wang, Q.-G. Chen, Z. Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025

arXiv 2025
[45]

Beyer, A

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024
[46]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

arXiv 2023
[47]

Banerjee, S

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InCVPR, 2025

2025
[48]

Y . Luo, H. Chen, Z. Wu, B. Sui, J. Liu, C. Gu, Z. Liu, Q. Feng, J. Yu, S. Gu, et al. Look before acting: Enhancing vision foundation representations for vision-language-action models.arXiv preprint arXiv:2603.15618, 2026

arXiv 2026
[49]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[50]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[51]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InCVPR, 2024

2024
[52]

S. Yin, Y . Ze, H.-X. Yu, C. K. Liu, and J. Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 11

arXiv 2025
[53]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026

arXiv 2026
[54]

Jiang, J

H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y . Zhang, D. Li, C. Suo, C. Wang, Z. Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control.arXiv preprint arXiv:2512.11047, 2025

arXiv 2025
[55]

X. Wang, T. Kwon, M. Rad, B. Pan, I. Chakraborty, S. Andrist, D. Bohus, A. Feniello, B. Tekin, F. V . Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assis- tants in the real world. InICCV, 2023

2023
[56]

L. Xu, C. Yang, Z. Lin, F. Xu, Y . Liu, C. Xu, Y . Zhang, J. Qin, X. Sheng, Y . Liu, et al. Perceiving and acting in first-person: A dataset and benchmark for egocentric human-object- human interactions. InICCV, 2025

2025
[57]

X. Lin, X. Zhu, T. Lu, S. Xie, H. Zhang, X. Qiu, Z. Wu, and Y .-G. Jiang. Ask-to-clarify: Re- solving instruction ambiguity through multi-turn dialogue.arXiv preprint arXiv:2509.15061, 2025

Pith/arXiv arXiv 2025
[58]

washing a dish

I. Rodin, A. Furnari, D. Mavroeidis, and G. M. Farinella. Predicting the future from first person (egocentric) vision: A survey.CVIU, 2021. 12 A From Egocentric Video to Unified Action Space A.1 Metric Scale Recovery The camera trajectory recovered by VGGT is a scale-normalized path ˜T camk cam1 whose translational component is determined only up to a glo...

2021
[59]

The start time (in seconds, integer only)
[60]

The end time (in seconds, integer only)
[61]

pick up",

A concise description of the specific task being performed Each description must include: - The main manipulation action (a verb like "pick up", "place", "insert", "open", etc.) - A list of one or more objects that are being manipulated - A short natural language instruction generated from the action and objects The segments may overlap in time if multipl...
[62]

The action involves hand-object manipulation (e.g., pick up, cut, fold, assemble, insert, tighten, wipe, pour, etc.)
[63]

Exclude: body parts (leg, hand, arm), people (man, woman, person), natural materials (plant, soil, mud, grass, tree)

The object(s) must be artificial, physical items (tools, containers, utensils, electronics, furniture, fabric, household goods). Exclude: body parts (leg, hand, arm), people (man, woman, person), natural materials (plant, soil, mud, grass, tree)
[64]

clip_uid

The scene is likely indoors. Exclude: gardening, farming, outdoor repair, digging, planting, handling mud/branches/natural terrain. Return a JSON object: {"clip_uid": "...", "status": "success", "filtered_segments": [...]} Figure 13:Prompt used for LLM-based semantic filtering.The model retains only segments involving indoor hand-object manipulation of ar...

[1] [1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023

2023

[2] [2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. InCoRL, 2025

2025

[3] [3]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[4] [4]

Bjorck, F

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[5] [5]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InICLR, 2025

2025

[6] [6]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[7] [7]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InICCV, 2023

2023

[8] [8]

Lipman, R

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InICLR, 2023

2023

[9] [9]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InICRA, 2024

2024

[10] [10]

Y . Liu, W. C. Shin, Y . Han, Z. Chen, H. Ravichandar, and D. Xu. Immimic: Cross-domain imitation from human videos via mapping and interpolation. InCoRL, 2025

2025

[11] [11]

Cai, R.-Z

X. Cai, R.-Z. Qiu, G. Chen, L. Wei, I. Liu, T. Huang, X. Cheng, and X. Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

arXiv 2025

[12] [12]

Spiridonov, J.-N

A. Spiridonov, J.-N. Zaech, N. Nikolov, L. Van Gool, and D. P. Paudel. Generalist robot manipulation beyond action labeled data. InCoRL, 2025

2025

[13] [13]

Yoshida, S

T. Yoshida, S. Kurita, T. Nishimura, and S. Mori. Developing vision-language-action model from egocentric videos.arXiv preprint arXiv:2509.21986, 2025

arXiv 2025

[14] [14]

R. Bajcsy. Active perception.Proceedings of the IEEE, 1988

1988

[15] [15]

Bajcsy, Y

R. Bajcsy, Y . Aloimonos, and J. K. Tsotsos. Revisiting active perception.Autonomous Robots, 2018

2018

[16] [16]

Aloimonos, I

J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision.IJCV, 1988

1988

[17] [17]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

Pith/arXiv arXiv 2025

[18] [18]

L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. Emma: Scaling mobile manipulation via egocentric human data.RAL, 2026. 9

2026

[19] [19]

M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

Pith/arXiv arXiv 2026

[20] [20]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy∼human policy. InCoRL, 2025

2025

[21] [21]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

2022

[22] [22]

H. Luo, Y . Wang, W. Zhang, S. Zheng, Z. Xi, C. Xu, H. Xu, H. Yuan, C. Zhang, Y . Wang, et al. Being-h0. 5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

arXiv 2026

[23] [23]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. InICRA, 2025

2025

[24] [24]

Bircher, M

A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart. Receding horizon” next- best-view” planner for 3d exploration. InICRA, 2016

2016

[25] [25]

Breyer, L

M. Breyer, L. Ott, R. Siegwart, and J. J. Chung. Closed-loop next-best-view planning for target-driven grasping. InIROS, 2022

2022

[26] [26]

Connolly

C. Connolly. The determination of next best views. InICRA, 1985

1985

[27] [27]

Krainin, B

M. Krainin, B. Curless, and D. Fox. Autonomous generation of complete 3d object models using next best view manipulation planning. InICRA, 2011

2011

[28] [28]

Naazare, F

M. Naazare, F. G. Rosas, and D. Schulz. Online next-best-view planner for 3d-exploration and inspection with a mobile manipulator robot.RAL, 2022

2022

[29] [29]

Zhang, D

X. Zhang, D. Wang, S. Han, W. Li, B. Zhao, Z. Wang, X. Duan, C. Fang, X. Li, and J. He. Affordance-driven next-best-view planning for robotic grasping. InCoRL, 2023

2023

[30] [30]

J. Yu, Y . Shentu, D. Wu, P. Abbeel, K. Goldberg, and P. Wu. Egomi: Learning active vi- sion and whole-body manipulation from egocentric human demonstrations.arXiv preprint arXiv:2511.00153, 2025

arXiv 2025

[31] [31]

Xiong, X

H. Xiong, X. Xu, J. Wu, Y . Hou, J. Bohg, and S. Song. Vision in action: Learning active perception from human demonstrations. InCoRL, 2025

2025

[32] [32]

Q. Zeng, C. Li, J. S. John, Z. Zhou, J. Wen, G. Feng, Y . Zhu, and Y . Xu. Activeumi: Robotic manipulation with active perception from robot-free human demonstrations.arXiv preprint arXiv:2510.01607, 2025

arXiv 2025

[33] [33]

Chuang, A

I. Chuang, A. Lee, D. Gao, M.-M. Naddaf-Sh, and I. Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. InICRA, 2025

2025

[34] [34]

J. Kerr, K. Hari, E. Weber, C. M. Kim, B. Yi, K. Goldberg, A. Kanazawa, et al. Eye, robot: Learning to look to act with a bc-rl perception-action loop. InCoRL, 2025

2025

[35] [35]

Cheng, J

X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback. InCoRL, 2025

2025

[36] [36]

M. Liu, E. Zhou, C. Chi, Y . Han, S. Rong, L. Chen, P. Wang, Z. Wang, and S. Zhang. Sapave: Towards active perception and manipulation in vision-language-action models for robotics. arXiv preprint arXiv:2603.12193, 2026. 10

arXiv 2026

[37] [37]

Kareer, K

S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414, 2025

arXiv 2025

[38] [38]

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life hu- man activity videos.arXiv preprint arXiv:2510.21571, 2025

arXiv 2025

[39] [39]

X. Yang, D. Kukreja, D. Pinkus, A. Sagar, T. Fan, J. Park, S. Shin, J. Cao, J. Liu, N. Ugrinovic, et al. Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026

arXiv 2026

[40] [40]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025

[41] [41]

Piccinelli, Y .-H

L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Segu, S. Li, L. Van Gool, and F. Yu. Unidepth: Universal monocular metric depth estimation. InCVPR, 2024

2024

[42] [42]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InCVPR, 2019

2019

[43] [43]

Liang, L

W. Liang, L. Yu, L. Luo, S. Iyer, N. Dong, C. Zhou, G. Ghosh, M. Lewis, W.-t. Yih, L. Zettle- moyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Pith/arXiv arXiv 2024

[44] [44]

S. Zhao, X. Zhang, J. Guo, J. Hu, L. Duan, M. Fu, Y . X. Chng, G.-H. Wang, Q.-G. Chen, Z. Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025

arXiv 2025

[45] [45]

Beyer, A

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[46] [46]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

arXiv 2023

[47] [47]

Banerjee, S

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InCVPR, 2025

2025

[48] [48]

Y . Luo, H. Chen, Z. Wu, B. Sui, J. Liu, C. Gu, Z. Liu, Q. Feng, J. Yu, S. Gu, et al. Look before acting: Enhancing vision foundation representations for vision-language-action models.arXiv preprint arXiv:2603.15618, 2026

arXiv 2026

[49] [49]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[50] [50]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[51] [51]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InCVPR, 2024

2024

[52] [52]

S. Yin, Y . Ze, H.-X. Yu, C. K. Liu, and J. Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 11

arXiv 2025

[53] [53]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026

arXiv 2026

[54] [54]

Jiang, J

H. Jiang, J. Chen, Q. Bu, L. Chen, M. Shi, Y . Zhang, D. Li, C. Suo, C. Wang, Z. Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control.arXiv preprint arXiv:2512.11047, 2025

arXiv 2025

[55] [55]

X. Wang, T. Kwon, M. Rad, B. Pan, I. Chakraborty, S. Andrist, D. Bohus, A. Feniello, B. Tekin, F. V . Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assis- tants in the real world. InICCV, 2023

2023

[56] [56]

L. Xu, C. Yang, Z. Lin, F. Xu, Y . Liu, C. Xu, Y . Zhang, J. Qin, X. Sheng, Y . Liu, et al. Perceiving and acting in first-person: A dataset and benchmark for egocentric human-object- human interactions. InICCV, 2025

2025

[57] [57]

X. Lin, X. Zhu, T. Lu, S. Xie, H. Zhang, X. Qiu, Z. Wu, and Y .-G. Jiang. Ask-to-clarify: Re- solving instruction ambiguity through multi-turn dialogue.arXiv preprint arXiv:2509.15061, 2025

Pith/arXiv arXiv 2025

[58] [58]

washing a dish

I. Rodin, A. Furnari, D. Mavroeidis, and G. M. Farinella. Predicting the future from first person (egocentric) vision: A survey.CVIU, 2021. 12 A From Egocentric Video to Unified Action Space A.1 Metric Scale Recovery The camera trajectory recovered by VGGT is a scale-normalized path ˜T camk cam1 whose translational component is determined only up to a glo...

2021

[59] [59]

The start time (in seconds, integer only)

[60] [60]

The end time (in seconds, integer only)

[61] [61]

pick up",

A concise description of the specific task being performed Each description must include: - The main manipulation action (a verb like "pick up", "place", "insert", "open", etc.) - A list of one or more objects that are being manipulated - A short natural language instruction generated from the action and objects The segments may overlap in time if multipl...

[62] [62]

The action involves hand-object manipulation (e.g., pick up, cut, fold, assemble, insert, tighten, wipe, pour, etc.)

[63] [63]

Exclude: body parts (leg, hand, arm), people (man, woman, person), natural materials (plant, soil, mud, grass, tree)

The object(s) must be artificial, physical items (tools, containers, utensils, electronics, furniture, fabric, household goods). Exclude: body parts (leg, hand, arm), people (man, woman, person), natural materials (plant, soil, mud, grass, tree)

[64] [64]

clip_uid

The scene is likely indoors. Exclude: gardening, farming, outdoor repair, digging, planting, handling mud/branches/natural terrain. Return a JSON object: {"clip_uid": "...", "status": "success", "filtered_segments": [...]} Figure 13:Prompt used for LLM-based semantic filtering.The model retains only segments involving indoor hand-object manipulation of ar...