arxiv: 2605.18214 · v2 · pith:U67BHLJ2new · submitted 2026-05-18 · 💻 cs.CV

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

Pith reviewed 2026-05-25 06:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric video · synthetic data generation · human-object interaction · action anticipation · temporal action segmentation · simulator · hand-object detection · domain transfer

The pith

A controllable simulator produces synthetic egocentric videos whose annotations let models trained on them improve on real interaction benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Collecting real egocentric video with dense labels for interactions is expensive and limited by privacy and coverage. The paper presents EgoInteract as a simulator that lets users control camera motion, body and hand movements, object handling, and scene layout to produce videos with exact spatial and temporal labels. These synthetic videos are used to train models for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. When the resulting models are tested on multiple real-world egocentric datasets, they outperform strong baselines trained without the synthetic data. The approach therefore supplies a scalable source of interaction examples that transfers across environments and tasks.

Core claim

Training on the synthetic dataset generated by the EgoInteract simulator yields consistent gains over strong baselines on several real egocentric benchmarks that cover different environments, object sets, and interaction types, for the four tasks of temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection.

What carries the argument

The EgoInteract simulator, which supplies precise parametric control over camera pose, full-body and hand kinematics, object manipulation sequences, and scene composition to output temporally coherent egocentric videos together with dense spatial and temporal ground truth.
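One way to picture what "dense temporal ground truth" means in practice is an episode description from which frame-wise action labels fall out automatically. The sketch below is illustrative only: `EpisodeSpec`, `InteractionEvent`, `framewise_labels`, the field names, and the action names are all hypothetical and are not taken from the paper's simulator, whose actual interface is not shown on this page.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical structures: none of these names come from the EgoInteract code.
@dataclass
class InteractionEvent:
    action: str        # illustrative action name, e.g. "take" or "release"
    start_frame: int   # first frame of the action (inclusive)
    end_frame: int     # frame after the last frame of the action (exclusive)


@dataclass
class EpisodeSpec:
    scene_id: str                  # which environment to load
    target_object: str             # the single object the episode centers on
    num_frames: int                # length of the rendered video
    camera_height_m: float = 1.6   # assumed head-mounted camera height
    events: List[InteractionEvent] = field(default_factory=list)


def framewise_labels(spec: EpisodeSpec, background: str = "background") -> List[str]:
    """Expand scripted events into one action label per video frame."""
    labels = [background] * spec.num_frames
    for ev in spec.events:
        if not (0 <= ev.start_frame < ev.end_frame <= spec.num_frames):
            raise ValueError(f"event outside episode bounds: {ev}")
        for t in range(ev.start_frame, ev.end_frame):
            labels[t] = ev.action
    return labels


episode = EpisodeSpec(
    scene_id="scene_0001",
    target_object="mug",
    num_frames=10,
    events=[InteractionEvent("take", 2, 5), InteractionEvent("release", 7, 9)],
)
print(framewise_labels(episode))
```

Because the events are scripted before rendering, the labels are exact by construction, which is the property the summary above attributes to simulator-generated annotations.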

If this is right

  • Synthetic data can be generated at arbitrary scale and with perfect, automatic labels for any chosen interaction pattern.
  • The same simulator can supply training examples for multiple downstream tasks without separate data collection efforts.
  • Performance gains hold across datasets that differ in camera type, environment, and object categories.
  • No domain-specific adaptation step is required for the observed positive transfer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the simulator parameters can be tuned to match a target real domain more closely, the transfer gap could shrink further.
  • The method opens a route to pre-train large models on synthetic interactions before any real video is seen.
  • Extending the simulator to include longer temporal horizons or multi-person scenes would address anticipation tasks that currently remain data-limited.

Load-bearing premise

The generated synthetic videos must reproduce enough of the visual appearance, motion statistics, and interaction variety found in real egocentric recordings for models to improve when transferred without any real-data fine-tuning.

What would settle it

Training the same model architectures on the synthetic dataset and evaluating them on the real benchmarks yields no improvement or a clear drop relative to the identical architectures trained only on the available real data.
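The falsification test above amounts to a paired comparison per benchmark. A minimal sketch of that bookkeeping follows, assuming one scalar score per benchmark where higher is better. The function names, benchmark names, and scores are placeholders invented for illustration and are not results from the paper.

```python
from typing import Dict


def transfer_deltas(real_only: Dict[str, float],
                    with_synth: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark score change when synthetic data is added to training."""
    mismatched = set(real_only) ^ set(with_synth)
    if mismatched:
        raise ValueError(f"benchmarks missing from one run: {sorted(mismatched)}")
    return {name: with_synth[name] - real_only[name] for name in sorted(real_only)}


def claim_survives(deltas: Dict[str, float], min_gain: float = 0.0) -> bool:
    """True only if every benchmark improves by more than min_gain."""
    return all(delta > min_gain for delta in deltas.values())


# Placeholder scores, NOT numbers reported in the paper.
real_only = {"benchmark_a": 50.0, "benchmark_b": 40.0}
with_synth = {"benchmark_a": 52.0, "benchmark_b": 39.0}

deltas = transfer_deltas(real_only, with_synth)
print(deltas)
print(claim_survives(deltas))
```

A stricter version would compare against a `min_gain` set from run-to-run variance, so that differences within seed noise do not count as improvements.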

Figures

Figures reproduced from arXiv: 2605.18214 by Alessandro Passanisi, Daniele Materia, Francesco Ragusa, Giovanni Maria Farinella, Jakob Engel, James Fort, Rosario Leonardi.

Figure 1: EgoInteract generates temporally coherent videos of humans interacting with diverse objects, enabling the study of egocentric interaction understanding at multiple levels.
Figure 2: The EgoInteract simulator. The generated egocentric videos are automatically labeled with several spatial and temporal annotations.
Figure 3: Examples of HM3D environments used in EgoInteract. Blue regions denote the navigable surfaces used for agent placement and motion, while purple regions indicate support surfaces where objects can be placed.
Figure 4: Examples of procedurally generated tabletop scenes in …
Figure 5: Examples of avatar appearance randomization in …
Figure 6: Examples of full-body inverse kinematics in …
Figure 7: Examples of the collision hand proxy used for grasp generation in …
Figure 8: Sequence of full-body poses generated during interaction execution in …
Figure 9: Examples of interaction episodes generated by …
Figure 10: Example of interaction episodes generated by EgoInteract with the set of temporal and spatial annotations obtained automatically.
Figure 11: Qualitative TAS predictions on representative test videos from EPIC-KITCHENS (top) and Ego-Exo4D (bottom). Red segments indicate Take actions, while blue segments indicate Release actions.
Figure 12: Qualitative comparison between models trained with real data only ( …
Figure 13: Qualitative comparison between models trained with real data only ( …
Figure 14: Qualitative examples of synthetic finetuning improving interaction anticipation. For each …

Original abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EgoInteract, a controllable simulator for generating synthetic egocentric videos that model fine-grained human-object interactions with precise control over camera, body/hand motion, object manipulation, and scene composition. It produces a synthetic dataset with dense spatial/temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection, then reports that models trained on this data yield consistent improvements over strong baselines when evaluated on multiple real-world egocentric benchmarks.

Significance. If the transfer results hold under scrutiny, the work offers a practical route to scalable, controllable, and privacy-preserving data for egocentric interaction tasks, directly addressing collection costs, environmental biases, and annotation density limitations in real datasets. The emphasis on temporal coherence and interaction variability is a notable strength relative to prior synthetic efforts in vision.

major comments (2)
  1. [Abstract] The central claim of 'consistent improvements over strong baselines across tasks and datasets' is stated without quantitative results, named baselines, dataset sizes, or a sim-to-real gap analysis; the omission is load-bearing because the transferability argument cannot be evaluated from the evidence given.
  2. [Abstract] The load-bearing assumption that the simulator reproduces real visual statistics, hand-object contact physics, temporal coherence, and interaction variability (so that positive transfer occurs without domain adaptation) receives no supporting verification such as distribution-distance metrics, failure-case analysis, or ablation on motion fidelity; this directly affects whether the reported gains generalize beyond simulation artifacts.
minor comments (1)
  1. [Abstract] The abstract refers to 'strong baselines' and 'multiple real-world egocentric benchmarks' without naming either the baselines or the specific datasets used for evaluation.
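
The distribution-distance check requested in major comment 2 could start from something very simple. The sketch below computes the 1-D Wasserstein-1 distance between two equal-sized samples of a scalar statistic, for example a per-clip motion statistic measured on synthetic and on real videos. The function name and the choice of statistic are assumptions made for illustration; the paper does not state which metric, if any, it would use.

```python
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-sized 1-D empirical samples.

    With uniform weights and equal sample counts, the optimal transport plan
    matches the i-th smallest value of one sample to the i-th smallest value
    of the other, so the distance is the mean absolute gap after sorting.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up values standing in for a per-clip statistic; not data from the paper.
synthetic_stat = [0.0, 0.0, 0.0]
real_stat = [1.0, 1.0, 1.0]
print(wasserstein_1d(synthetic_stat, real_stat))
```

A scalar statistic only probes one aspect of the sim-to-real gap; comparing learned feature distributions would need a multivariate distance, which this sketch does not attempt.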

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that strengthening the abstract with quantitative details and additional verification will improve clarity and allow better evaluation of the transfer claims. We will revise the manuscript accordingly.

  1. Referee: [Abstract] The central claim of 'consistent improvements over strong baselines across tasks and datasets' is stated without quantitative results, named baselines, dataset sizes, or a sim-to-real gap analysis; the omission is load-bearing because the transferability argument cannot be evaluated from the evidence given.

    Authors: We agree that the abstract would benefit from including quantitative highlights to make the claims more concrete. In the revised version, we will update the abstract to report specific improvement percentages across tasks, identify the baselines, note dataset sizes, and reference the sim-to-real analysis sections in the paper. revision: yes

  2. Referee: [Abstract] The load-bearing assumption that the simulator reproduces real visual statistics, hand-object contact physics, temporal coherence, and interaction variability (so that positive transfer occurs without domain adaptation) receives no supporting verification such as distribution-distance metrics, failure-case analysis, or ablation on motion fidelity; this directly affects whether the reported gains generalize beyond simulation artifacts.

    Authors: The positive transfer results on real benchmarks provide the primary empirical support for the simulator's fidelity. However, we acknowledge that explicit verification metrics would further strengthen the manuscript. We will add an ablation study on motion fidelity along with discussion of failure cases and, where feasible, distribution comparisons in the revised paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical transfer to external real benchmarks is independent of simulator construction

full rationale

The paper introduces a controllable simulator to generate synthetic egocentric videos with annotations, then trains models on the synthetic data and evaluates transfer performance on multiple real-world egocentric benchmarks. This is a standard empirical pipeline with externally falsifiable results on held-out real datasets; no equations, fitted parameters, or self-citations are presented that would reduce the reported improvements to the input data or simulator design by construction. The central claim rests on observed positive transfer rather than any definitional or self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no details on internal simulator parameters, mathematical assumptions, or new entities beyond the high-level description of the framework itself.

invented entities (1)
  • EgoInteract simulator (no independent evidence)
    purpose: Generate controllable synthetic egocentric videos modeling fine-grained human-object interactions and temporal dynamics
    Core new artifact introduced to address data collection limitations; no independent evidence of correctness outside the paper's claims is provided in the abstract.

pith-pipeline@v0.9.0 · 5733 in / 1182 out tokens · 45528 ms · 2026-05-25T06:19:18.458164+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 4 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training,...

  2. [2]

    Scenescript: Reconstructing scenes with an autoregressive structured language model

    Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scenescript: Reconstructing scenes with an autoregressive structured language model. arXiv preprint arXiv:2403.13064, 2024

  3. [3]

    Towards a richer 2d understanding of hands at scale

    Tianyi Cheng, Dandan Shan, Ayda Sultan Hassen, Richard Ely Locke Higgins, and David Fouhey. Towards a richer 2d understanding of hands at scale. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  4. [4]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, pages 720–736, 2018

  5. [5]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. IJCV, pages 1–23, 2021

  6. [6]

    Epic-kitchens visor benchmark: Video segmentations and object relations

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmentations and object relations. In NeurIPS, pages 13745–13758, 2022

  7. [7]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

  8. [8]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017

  9. [9]

    The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. International Journal of Computer Vision, 88(2):303–338, 2010. URL https://www.robots.ox.ac.uk/~vgg/projects/pascal/VOC/pubs/everingham10.html.

  10. [10]

    Motsynth: How can synthetic data help pedestrian detection and tracking?

    Matteo Fabbri, Guillem Brasó, Gianluca Maugeri, Aljoša Ošep, Riccardo Gasparini, Orcun Cetintas, Simone Calderara, Laura Leal-Taixé, and Rita Cucchiara. Motsynth: How can synthetic data help pedestrian detection and tracking? In ICCV, 2021

  11. [11]

    Next-active-object prediction from egocentric videos

    Antonino Furnari, Sebastiano Battiato, Kristen Grauman, and Giovanni Maria Farinella. Next-active-object prediction from egocentric videos. Journal of Visual Communication and Image Representation, 49:401–411, November 2017. ISSN 1047-3203

  12. [12]

    Personal-location-based temporal segmentation of egocentric video for lifelogging applications

    Antonino Furnari, Sebastiano Battiato, and Giovanni Maria Farinella. Personal-location-based temporal segmentation of egocentric video for lifelogging applications. Journal of Visual Communication and Image Representation, 52:1–12, 2018. ISSN 1047-3203. doi: https://doi.org/10.1016/j.jvcir.2018.01.019

  13. [13]

    Domain-adversarial training of neural networks

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016. URL http://jmlr.org/papers/v17/15-239.html

  14. [14]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Q. Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, ...

  15. [15]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...

  16. [16]

    Maisi: Medical ai for synthetic imaging

    Pengfei Guo, Can Zhao, Dong Yang, Ziyue Xu, Vishwesh Nath, Yucheng Tang, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, and Daguang Xu. Maisi: Medical ai for synthetic imaging. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 4430–4441, February 2025.

  17. [17]

    Egosim: Egocentric world simulator for embodied interaction generation

    Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, and Xudong Xu. Egosim: Egocentric world simulator for embodied interaction generation. arXiv preprint arXiv:2604.01001, 2026

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685. Accessed 24 April 2026

  19. [19]

    Pointrend: Image segmentation as rendering

    Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In CVPR, pages 9799–9808, 2020

  20. [20]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017

  21. [21]

    Temporal convolutional networks for action segmentation and detection

    Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 156–165, 2017

  22. [22]

    Egocentric human-object interaction detection exploiting synthetic data

    Rosario Leonardi, Francesco Ragusa, Antonino Furnari, and Giovanni Maria Farinella. Egocentric human-object interaction detection exploiting synthetic data. In International Conference on Image Analysis and Processing, pages 237–248. Springer, 2022

  23. [23]

    Exploiting multimodal synthetic data for egocentric human-object interaction detection in an industrial scenario

    Rosario Leonardi, Francesco Ragusa, Antonino Furnari, and Giovanni Maria Farinella. Exploiting multimodal synthetic data for egocentric human-object interaction detection in an industrial scenario. Computer Vision and Image Understanding, 242:103984, 2024

  24. [24]

    Are synthetic data useful for egocentric hand-object interaction detection?

    Rosario Leonardi, Antonino Furnari, Francesco Ragusa, and Giovanni Maria Farinella. Are synthetic data useful for egocentric hand-object interaction detection? In European Conference on Computer Vision, pages 36–54. Springer, 2025

  25. [25]

    Llava-onevision: Easy visual task transfer,

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer,

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    URL https://arxiv.org/abs/2408.03326. Accessed: 24 April 2026

  27. [27]

    Egogen: An egocentric synthetic data generator

    Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, and Siyu Tang. Egogen: An egocentric synthetic data generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14497–14509, 2024

  28. [28]

    Ms-tcn++: Multi-stage temporal convolutional network for action segmentation

    Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE transactions on pattern analysis and machine intelligence, 2020

  29. [29]

    Cross-domain adaptive teacher for object detection

    Yu-Jhe Li, Xiaoliang Dai, Chih-Yao Ma, Yen-Cheng Liu, Kan Chen, Bichen Wu, Zijian He, Kris Kitani, and Peter Vajda. Cross-domain adaptive teacher for object detection. In CVPR, pages 7581–7590, 2022

  30. [30]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014

  31. [31]

    Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video

    Miao Liu, Siyu Tang, Yin Li, et al. Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video. In ECCV, pages 704–721, 2020

  32. [32]

    Fact: Frame-action cross-attention temporal modeling for efficient action segmentation

    Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross-attention temporal modeling for efficient action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18175–18185, 2024

  33. [33]

    Volumetric hierarchical approximate convex decomposition

    Khaled Mamou, E Lengyel, and A Peters. Volumetric hierarchical approximate convex decomposition. Game engine gems, 3:141–158, 2016

  34. [34]

    Habitat: A Platform for Embodied AI Research

    Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In ICCV, 2019.

  35. [35]

    Leveraging gaze and set-of-mark in vllms for human-object interaction anticipation from egocentric videos

    Daniele Materia, Francesco Ragusa, and Giovanni Maria Farinella. Leveraging gaze and set-of-mark in vllms for human-object interaction anticipation from egocentric videos. In ICPR, 2026

  36. [36]

    Grounded human-object interaction hotspots from video

    Tushar Nagarajan, Christoph Feichtenhofer, and Kristen Grauman. Grounded human-object interaction hotspots from video. In ICCV, pages 8687–8696, 2019

  37. [37]

    Nvidia isaac sim

    NVIDIA. Nvidia isaac sim, 2021. https://developer.nvidia.com/isaac-sim

  38. [38]

    Egocontrol: Controllable egocentric video generation via 3d full-body poses

    Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, and Juergen Gall. Egocontrol: Controllable egocentric video generation via 3d full-body poses. arXiv preprint arXiv:2511.18173, 2025

  39. [39]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019

  40. [40]

    Hd-epic: A highly-detailed egocentric video dataset

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. Hd-epic: A highly-detailed egocentric video dataset. In CVPR, pages 23901– 239...

  41. [41]

    The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain

    Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Winter Conference on Applications of Computer Vision, pages 1569–1578, 2021

  42. [42]

    Meccano: A multimodal egocentric dataset for humans behavior understanding in the industrial-like domain

    Francesco Ragusa, Antonino Furnari, and Giovanni Maria Farinella. Meccano: A multimodal egocentric dataset for humans behavior understanding in the industrial-like domain. Comput. Vis. Image Underst., 235(C), October 2023. ISSN 1077-3142. doi: 10.1016/j.cviu.2023.103764. URL https://doi.org/10.1016/j.cviu.2023.103764

  43. [43]

    Enigma-51: Towards a fine-grained understanding of human behavior in industrial scenarios

    Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Claudia Bonanno, Rosario Scavo, Antonino Furnari, and Giovanni Maria Farinella. Enigma-51: Towards a fine-grained understanding of human behavior in industrial scenarios. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4549–4559, 2024

  44. [44]

    Enigma-360: An ego-exo dataset for human behavior understanding in industrial scenarios

    Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D’Ambra, Antonino Furnari, and Giovanni Maria Farinella. Enigma-360: An ego-exo dataset for human behavior understanding in industrial scenarios. arXiv preprint arXiv:2603.09741, 2026

  45. [45]

    Rockstar advanced game engine (rage)

    Rockstar Games. Rockstar advanced game engine (rage). https://www.rockstargames.com/, Accessed 2024. Proprietary game engine developed by Rockstar Games

  46. [46]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In CVPR, pages 21096–21106, 2022

  47. [47]

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David F. Fouhey. Understanding human hands in contact at internet scale. In CVPR, pages 9869–9878, 2020

  48. [48]

    Coarse to fine multi-resolution temporal convolutional network

    Dipika Singhania, Rahul Rahaman, and Angela Yao. Coarse to fine multi-resolution temporal convolutional network. arXiv preprint arXiv:2105.10859, 2021

  49. [49]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems, 34:251–266, 2021.

  50. [50]

    Action recognition in rgb-d egocentric videos

    Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Action recognition in rgb-d egocentric videos. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3410–3414. IEEE, 2017

  51. [51]

    Gibson env: real-world perception for embodied agents

    Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: real-world perception for embodied agents. In CVPR, 2018

  52. [52]

    Asformer: Transformer for action segmentation

    Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. In The British Machine Vision Conference (BMVC), 2021

  53. [53]

    Fine-grained egocentric hand-object segmentation: Dataset, model, and applications

    Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In ECCV, pages 127–145, 2022

  54. [54]

    Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks

    Mengmi Zhang, Keng Teck Ma, et al. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In CVPR, pages 3539–3548, 2017.
