EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

Beichen Wang; Erdem Murat; Huining Feng; Huizhen Zhou; Ke Jing; Liuchuan Yu; Ning Yang; Ruoya Sheng; Shanghao Li; Tingting Luo

arxiv: 2605.16797 · v1 · pith:MFB7YLMInew · submitted 2026-05-16 · 💻 cs.CV · cs.RO

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo, Huizhen Zhou, Shanghao Li, Huining Feng, Zhigen Zhao, Ning Yang, Ke Jing, Yunhao Liu, Ruoya Sheng

Pith reviewed 2026-05-19 21:17 UTC · model grok-4.3

keywords egocentric video · data collection · heterogeneous devices · XR headsets · robot learning · activity understanding · unified toolkit · wrist camera

The pith

EgoKit delivers the same recording workflow and uniform video logs for egocentric data across six different device types including phones, glasses, and XR headsets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Collecting egocentric video for robot learning and activity understanding is currently fragmented because each device type has its own SDK, camera access rules, and hardware limits. The paper introduces EgoKit as a single toolkit that applies one consistent recording process on six heterogeneous hosts and stores results in a shared format. On XR devices it also records aligned head pose and 26-joint hand tracking. Low-cost accessories such as wrist cameras and a USB-C hub extend the system to capture wrist views without requiring custom fabrication for each platform. If this approach holds, researchers could gather comparable data from many more devices and scale datasets without locking into one vendor's ecosystem.

Core claim

EgoKit exposes the same egocentric recording workflow across six heterogeneous host devices and produces locally stored video with a uniform log format; on XR headsets it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication.
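
To make the "26-joint" and "aligned to the video streams" parts of the claim concrete, here is a minimal Python sketch. The joint names and order follow the OpenXR `XR_EXT_hand_tracking` extension. Everything else is an assumption for illustration: the paper summary does not say how EgoKit stores timestamps or performs alignment, so `nearest_sample`, `align`, and the nanosecond timestamp lists are hypothetical, and nearest-timestamp matching is only one plausible alignment strategy.

```python
from bisect import bisect_left

# Joint order from the OpenXR XR_EXT_hand_tracking extension (XrHandJointEXT).
OPENXR_HAND_JOINTS = ["PALM", "WRIST"]
for finger in ("THUMB", "INDEX", "MIDDLE", "RING", "LITTLE"):
    # The thumb has four joints; every other finger adds an INTERMEDIATE joint.
    parts = ["METACARPAL", "PROXIMAL", "DISTAL", "TIP"]
    if finger != "THUMB":
        parts.insert(2, "INTERMEDIATE")
    OPENXR_HAND_JOINTS += [f"{finger}_{part}" for part in parts]


def nearest_sample(sample_times_ns, frame_time_ns):
    """Index of the tracking sample closest in time to one video frame.

    Hypothetical helper: assumes sample_times_ns is sorted ascending and
    non-empty. The source does not describe EgoKit's actual alignment method.
    """
    i = bisect_left(sample_times_ns, frame_time_ns)
    if i == 0:
        return 0
    if i == len(sample_times_ns):
        return len(sample_times_ns) - 1
    before, after = sample_times_ns[i - 1], sample_times_ns[i]
    # Ties go to the earlier sample.
    return i if after - frame_time_ns < frame_time_ns - before else i - 1


def align(frame_times_ns, sample_times_ns):
    """Pair every frame index with the index of its nearest tracking sample."""
    return [(f, nearest_sample(sample_times_ns, t))
            for f, t in enumerate(frame_times_ns)]
```

With tracking typically sampled faster than video, a lookup of this kind gives each frame one head-pose or hand-joint record; interpolating between the two neighbouring samples would be a natural refinement.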

What carries the argument

A single software layer that presents identical recording controls and output logs while abstracting over device-specific SDKs and camera policies, together with simple off-the-shelf accessories that attach wrist cameras to any host.

If this is right

Researchers can collect synchronized ego and wrist views from multiple device classes using one workflow and compare results directly because of the shared log format.
XR headsets gain head pose and standardized hand tracking aligned to video without separate setup steps.
Data collection for embodied AI no longer requires committing to a single proprietary platform or building one-off rigs.
Wrist-view capture becomes available on any supported host by attaching the provided low-cost accessories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform format could make it easier to pool datasets from different labs or devices for larger training corpora.
Future work might test whether the same abstraction layer extends to additional consumer electronics without new hardware.
Standardized logs could support community benchmarks that mix recordings from phones and headsets.

Load-bearing premise

A single software layer plus low-cost accessories can reliably overcome the differing SDKs, raw camera access policies, and external-camera limitations of Android, iOS, smart glasses, and XR platforms without device-specific custom code or hardware.

What would settle it

Running the same EgoKit installation on a new device model outside the six tested hosts and checking whether the user interface, video storage, and log format remain identical without any code modifications.
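
One way to make the "log format remains identical" check mechanical is to compare the field names and value types found in logs from two devices. The sketch below is hypothetical throughout: it assumes logs are stored as JSON Lines, which the source does not state, and `log_schema`, `schema_diff`, and the example field names are invented for illustration.

```python
import json


def log_schema(lines):
    """Collect (field name, value type name) pairs across JSON Lines records.

    Assumption: one JSON object per line. EgoKit's real log format may differ.
    """
    schema = set()
    for line in lines:
        record = json.loads(line)
        for key, value in record.items():
            schema.add((key, type(value).__name__))
    return schema


def schema_diff(reference, candidate):
    """Report fields the candidate log lacks or adds relative to the reference."""
    return {
        "missing": sorted(reference - candidate),
        "extra": sorted(candidate - reference),
    }


# Hypothetical logs from a tested host and from a new, untested device.
reference_log = ['{"t_ns": 0, "frame": 0}', '{"t_ns": 33, "frame": 1}']
new_device_log = ['{"t_ns": 5.0, "frame_id": 0}']

report = schema_diff(log_schema(reference_log), log_schema(new_device_log))
```

An empty `missing` and `extra` list would support the uniformity claim for that device; any entry points at a field or type that diverged.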

Figures

Figures reproduced from arXiv: 2605.16797 by Beichen Wang, Erdem Murat, Huining Feng, Huizhen Zhou, Ke Jing, Liuchuan Yu, Ning Yang, Ruoya Sheng, Shanghao Li, Tingting Luo, Yan Zeng, Yunhao Liu, Zhigen Zhao.

**Figure 1.** EgoKit-PICO use case. EgoKit provides a unified egocentric data collection workflow across six types of devices: PICO 4 Ultra (EgoKit-PICO), Apple Vision Pro (EgoKit-AVP), Project Aria (EgoKit-Aria), iPhone (EgoKit-iOS), Android (EgoKit-Android), and Meta Quest 3 (EgoKit-Quest), to capture ego-view and wrist-view video with off-the-shelf low-cost accessories. On headsets, such as PICO 4 Ultra and Meta Quest…

**Figure 2.** Devices supported by EgoKit. From left to right: Samsung Galaxy S23, iPhone 16 Pro, iPad Pro 2018, Project Aria…

**Figure 3.** Examples of industry products for egocentric and/or wrist video data collection. Images are courtesy of the official…

**Figure 4.** User Interface of the EgoKit family. Please refer to Section 3.1 for their explanations.

**Figure 5.** Off-the-shelf consumer-grade accessories. From left…

**Figure 6.** Various setups of EgoKit. Please refer to Section 4.1 for their explanations.

**Figure 7.** Frame examples of wrist view recordings. (a-1) and (a-2) are from the setup where a USB-C hub is connected to an…

**Figure 8.** Frame examples of egocentric video recordings using different devices. The label indicates the host device that…

Abstract

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EgoKit is a practical toolkit that unifies egocentric recording and wrist-view capture across six device classes with one workflow and cheap accessories, but it presents no measurements or tests to show the unification actually works reliably.

EgoKit gives users the same recording steps and a single log format for egocentric video on Android phones, iPhones, iPads, smart glasses, and XR headsets, plus aligned head pose and 26-joint hand tracking on the XR side. The low-cost wrist cameras and mounts let you add a second view without building custom rigs each time. That is the core offering: one software layer and commodity add-ons instead of per-platform hacks.

Referee Report

1 major / 2 minor

Summary. The paper presents EgoKit, a toolkit and set of low-cost accessories that expose an identical egocentric recording workflow and produce locally stored video with a uniform log format across six heterogeneous host devices (Android phones, iPhones, iPads, smart glasses, and XR headsets). On XR headsets the toolkit additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The work is framed as an engineering solution to the fragmentation caused by differing SDKs, camera-access policies, and external-camera limitations.

Significance. If the described abstraction and accessories function as claimed, EgoKit could meaningfully lower the barrier to collecting synchronized ego- and wrist-view data at scale for robot learning, activity understanding, and embodied AI. The open availability of the toolkit and the emphasis on commodity hardware are concrete strengths that support reproducibility and community adoption.

major comments (1)

[Abstract] The central claim that a single software layer plus low-cost accessories reliably overcomes differing SDKs, raw-camera-access policies, and external-camera limitations across all six device classes is presented without any compatibility tests, error rates, latency measurements, or failure-mode analysis. This absence is load-bearing because the manuscript's value rests on the assertion that the uniform workflow succeeds in practice.

minor comments (2)

The manuscript would benefit from a concise table listing each supported device class together with the specific features (video resolution, hand tracking, head pose) that EgoKit exposes on that platform.
The availability URL should be accompanied by a permanent archive link or GitHub repository to ensure long-term access.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of EgoKit to reduce fragmentation in egocentric data collection. We address the major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] The central claim that a single software layer plus low-cost accessories reliably overcomes differing SDKs, raw-camera-access policies, and external-camera limitations across all six device classes is presented without any compatibility tests, error rates, latency measurements, or failure-mode analysis. This absence is load-bearing because the manuscript's value rests on the assertion that the uniform workflow succeeds in practice.

Authors: We agree that the abstract does not include quantitative supporting data and that this weakens the presentation of the core claim. The full manuscript provides implementation details for each device class along with qualitative validation via the released toolkit and accessories. To address the concern directly, we will revise the abstract for accuracy and add a new Evaluation section reporting compatibility tests across the six devices, measured recording latencies and synchronization errors, and a failure-mode analysis covering issues such as camera permission denials and tracking drift.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents EgoKit as an engineering toolkit and set of commodity accessories that provide a uniform recording workflow and log format across six device classes, with additional XR-specific logging of head pose and hand tracking. No derivation chain, equations, fitted parameters, or predictions appear in the provided text. The central claim is a descriptive statement of existence and basic functionality of the software artifact rather than a result obtained by reducing to prior self-citations or by construction from inputs. The work is self-contained as an implementation contribution with no load-bearing logical steps that equate outputs to their own definitions or fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering toolkit paper whose central contribution is software and accessory design rather than a mathematical or empirical derivation. No free parameters, domain axioms, or invented physical entities are introduced in the abstract.


[1] [1]

EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Liconti, L. Y . Zhuet al., “Egoverse: An egocentric human dataset for robot learning from around the world,” arXiv preprint arXiv:2604.07607, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredithet al., “Project aria: A new tool for egocentric multi-modal ai research,”arXiv preprint arXiv:2308.13561, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Das ego - genrobot ai - genrobot ai platform for robotics,

G. AI, “Das ego - genrobot ai - genrobot ai platform for robotics,” [Online; accessed 2026-05-13]. [Online]. Available: https: //www.genrobot.ai/products/ego

work page 2026

[4] [4]

Sunday robotics — the helpful robotics company,

“Sunday robotics — the helpful robotics company,” [Online; accessed 2026-05-13]. [Online]. Available: https://www.sunday.ai/

work page 2026

[5] [5]

[Online]

“Pika,” [Online; accessed 2026-05-13]. [Online]. Available: https: //global.agilex.ai/products/pika

work page 2026

[6] [6]

Lumosumi pro,

“Lumosumi pro,” [Online; accessed 2026-05-13]. [Online]. Available: https://lumosumi.lumosbot.tech/pro/

work page 2026

[7] [7]

Xrobotoolkit: A cross-platform framework for robot teleoperation,

Z. Zhao, L. Yu, K. Jing, and N. Yang, “Xrobotoolkit: A cross-platform framework for robot teleoperation,” in2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026, pp. 15–20

work page 2026

[8] [8]

Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

Y . Ravan, A. Rashid, A. Yu, K. McClennen, G. Huh, K. Yang, Z. Yang, Q. Yu, X. Wang, P. Isola, and G. Yang, “Lucid-xr: An extended-reality data engine for robotic manipulation,” 2026, site: https://lucidxr.github.io. [Online]. Available: https://arxiv.org/abs/ 2605.00244

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Demobot: Efficient learning of bimanual manipulation with dexterous hands from third-person human videos,

Y . Xu, X. Mao, E. Miller, X. Yi, Y . Li, Z. Li, and R. B. Fisher, “Demobot: Efficient learning of bimanual manipulation with dexterous hands from third-person human videos,” 2026. [Online]. Available: https://arxiv.org/abs/2601.01651

work page arXiv 2026

[10] [10]

Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang, “Egovla: Learning vision-language-action models from egocentric human videos,” 2025. [Online]. Available: https://arxiv.org/abs/2507.12440

work page arXiv 2025

[11] [11]

Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y . Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo, “Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos,” 2025, project: https://microsoft.github.io/VITRA/. [Online]. Available: https://arxiv.o...

work page arXiv 2025

[12] [12]

Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos,

C. Zhang, J. Wang, Z. Gao, Y . Su, T. Dai, C. Zhou, J. Lu, and Y . Tang, “Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos,” 2026. [Online]. Available: https://arxiv.org/abs/2601.04061

work page arXiv 2026

[13] [13]

Internvla-a1: Unifying understanding, generation and action for robotic manipulation,

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liuet al., “Internvla-a1: Unifying understanding, generation and action for robotic manipulation,”arXiv preprint arXiv:2601.02456, 2026

work page arXiv 2026

[14]

R. Zheng, D. Niu, Y. Xie, J. Wang, M. Xu, Y. Jiang, F. Castañeda, F. Hu, Y. L. Tan, L. Fu, T. Darrell, F. Huang, Y. Zhu, D. Xu, and L. Fan, “Egoscale: Scaling dexterous manipulation with diverse egocentric human data,” 2026. [Online]. Available: https://arxiv.org/abs/2602.16710

[15]

M. Shi, S. Peng, J. Chen, H. Jiang, Y. Li, D. Huang, P. Luo, H. Li, and L. Chen, “Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10106

[16]

J. Pai, L. Achenbach, V. Montesinos, B. Forrai, O. Mees, and E. Nava, “mimic-video: Video-action models for generalizable robot control beyond vlas,” 2025. [Online]. Available: https://arxiv.org/abs/2512.15692

[17]

R. G. Goswami, A. Bar, D. Fan, T.-Y. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y. LeCun, “World models for learning dexterous hand-object interactions from human videos,” 2026. [Online]. Available: https://arxiv.org/abs/2512.13644

[18]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. R. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. J. Fan, “Dreamdojo: A generalist robot world model from large-sc...

[19]

H. Xia, X. Li, Z. Li, Q. Ma, J. Xu, M.-Y. Liu, Y. Cui, T.-Y. Lin, W.-C. Ma, S. Wang, S. Song, and F. Wei, “Sage: Scalable agentic 3d scene generation for embodied ai,” 2026, project: https://research.nvidia.com/labs/dir/sage/. [Online]. Available: https://arxiv.org/abs/2602.10116

[20]

X. Xu, J. Park, H. Zhang, E. Cousineau, A. Bhat, J. Barreiros, D. Wang, and S. Song, “Hommi: Learning whole-body mobile manipulation from human demonstrations,” 2026, project: https://hommi-robot.github.io. [Online]. Available: https://arxiv.org/abs/2603.03243

[21]

L. Heng, Y. Tang, J. Xu, H. Bao, D. Huang, and Y. Wang, “Humdex: Humanoid dexterous manipulation made easy,” 2026, code: https://github.com/physical-superintelligence-lab/HumDex. [Online]. Available: https://arxiv.org/abs/2603.12260

[22]

Y. Zou, C. Shi, W. Yu, H. Xue, J. Lv, Y. Pan, C. Wen, and C. Lu, “Activeglasses: Learning manipulation with active vision from ego-centric human demonstration,” 2026. [Online]. Available: https://arxiv.org/abs/2604.08534

[23]

B. Wang, Y. Lu, L. Wang, L. Yu, and X. Xiao, “Moving through clutter: Scaling data collection and benchmarking for 3d scene-aware humanoid locomotion via virtual reality,” 2026. [Online]. Available: https://arxiv.org/abs/2603.05993

[24]

Z. Wei, Y. Yao, and M. Ding, “One hand to rule them all: Canonical representations for unified dexterous manipulation,” 2026, project: https://zhenyuwei2003.github.io/OHRA/. [Online]. Available: https://arxiv.org/abs/2602.16712

[25]

Y. Wang, S. Zhu, P. Zhi, Y. Li, J. Li, Y.-L. Li, Y. Xiao, X. Wang, B. Jia, and S. Huang, “Omnixtreme: Breaking the generality barrier in high-dynamic humanoid control,” 2026. [Online]. Available: https://arxiv.org/abs/2602.23843

[26]

S. Yang, Y. Xie, Z. Liang, Y. Tian, J. Zeng, D. Lin, and J. Pang, “Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data,” 2026, project: https://yangsizhe.github.io/ultradexgrasp/. [Online]. Available: https://arxiv.org/abs/2603.05312

[27]

Y. Yan, J. Xu, S. Di, H. Wu, and W. Xie, “Omnistream: Mastering perception, reconstruction and action in continuous streams,” 2026, project: https://go2heart.github.io/omnistream. [Online]. Available: https://arxiv.org/abs/2603.12265

[28]

J. Wang, Z. Cao, D. Luvizon, L. Liu, K. Sarkar, D. Tang, T. Beeler, and C. Theobalt, “Egocentric whole-body motion capture with fisheyevit and diffusion-based motion refinement,” 2023. [Online]. Available: https://arxiv.org/abs/2311.16495

[29]

E. Nava, V. Montesinos, E. Bauer, B. Forrai, J. Pai, S. Weirich, S.-D. Gravert, P. Wand, S. Polinski, B. F. Grewe, and R. K. Katzschmann, “mimic-one: a scalable model recipe for general purpose robot dexterity,” 2025. [Online]. Available: https://arxiv.org/abs/2506.11916

[30]

J. Wang, Q. Zhang, Y.-W. Chao, B. Wen, X. Guo, and Y. Xiang, “HO-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

[31]

Y. Li, X. Wei, J. Luo, Y. Xiao, Y. Bai, G. Zhou, T. Zou, C. Gui, J. Wen, H. Zhang, K. Chen, X. Pan, S. Liu, D. Wang, T. An, J. Li, S. Jin, W. Zhang, T. Wang, B. Wei, Z. Huang, F. Liu, R. Li, H. Zhang, A. Li, Y. Gong, P. Cao, J. Liang, and L. Lin, “Egolive: A large-scale egocentric dataset from real-world human tasks,” 2026. [Online]. Available: https:...

[32]

Y. Deng and D. Zhou, “Humannet: Scaling human-centric video learning to one million hours,” 2026. [Online]. Available: https://arxiv.org/abs/2605.06747

[33]

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, F. Zhang, J. Fountain, E. Miller, S. Basol, R. Newcombe, R. Wang, J. J. Engel, and T. Hodan, “Introducing hot3d: An egocentric dataset for 3d hand and object tracking,” 2024, dataset: https://www.projectaria.com/datasets/hot3D/. [Online]. Available: https://arxiv.org/abs/2406.09598

[34]

F. Sener, D. Chatterjee, D. Shelepov, K. He, D. Singhania, R. Wang, and A. Yao, “Assembly101: A large-scale multi-view video dataset for understanding procedural activities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, project: https://assembly-101.github.io. [Online]. Available: https://arxiv.org/abs/...

[35]

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11709

[36]

C. Kong, J. Fort, A. Kang, J. Wittmer, S. Green, T. Shen, Y. Zhao, C. Peng, G. Solaira, A. Berkovich, N. Raina, V. Baiyya, E. Oleinik, E. Huang, F. Zhang, J. Straub, M. Schwesinger, L. Pesqueira, X. Pan, J. J. Engel, C. Ren, M. Yan, and R. Newcombe, “Aria gen 2 pilot dataset,” 2025. [Online]. Available: https://arxiv.org/abs/2510.16134

[37]

Open X-Embodiment Collaboration, “Open x-embodiment: Robotic learning datasets and rt-x models,” 2025, large alphabetical author list on arXiv; site https://robotics-transformer-x.github.io. [Online]. Available: https://arxiv.org/abs/2310.08864

[38]

R. Khirodkar, J.-T. Song, J. Cao, Z. Luo, and K. Kitani, “Harmony4d: A video dataset for in-the-wild close human interactions,” 2024

[39]

S. Han, P. Wu, Y. Zhang, B. Liu, L. Zhang, Z. Wang, W. Si, P. Zhang, Y. Cai, T. Hodan, R. Cabezas, L. Tran, M. Akbay, T. Yu, C. Keskin, and R. Wang, “Umetrack: Unified multi-view end-to-end hand tracking for VR,” in SIGGRAPH Asia 2022 Conference Papers, SA 2022, Daegu, Republic of Korea, December 6-9, 2022, 2022

[40]

J. Ma, E. Zhang, H. Yang, D. Li, C. Xu, G. Wang, and H. Wang, “Robot learning from human videos: A survey,” 2026, resource list: https://github.com/IRMVLab/awesome-robot-learning-from-human-videos. [Online]. Available: https://arxiv.org/abs/2604.27621

[41]

danylokos, “Working with usb through iokit on a jailbroken ios,” 2 2022, [Online; accessed 2026-05-11]. [Online]. Available: https://danylokos.github.io/0x05/

[42]

Y. Park and P. Agrawal, “Using apple vision pro to train and control robots,” 2024. [Online]. Available: https://github.com/Improbable-AI/VisionProTeleop