
arxiv: 2605.17262 · v1 · pith:FRGXOF3Snew · submitted 2026-05-17 · 💻 cs.CV

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

Pith reviewed 2026-05-20 14:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric dataset · internal state reasoning · multimodal LLMs · user intent · affective experience · wearable AI · benchmark

The pith

The EgoIntrospect dataset and benchmark reveal that multimodal large language models struggle to infer users' internal states from egocentric multimodal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoIntrospect, the first egocentric dataset designed to capture users' internal states during interactions with AI assistants. Collected from 60 subjects over 180 hours using synchronized video, audio, gaze, motion, and physiological signals, it includes self-annotations for affective experience, interactive intent, and cognitive memory. The authors build benchmarks to evaluate how well current multimodal large language models can reason about these subjective states from the observations. Their experiments indicate that existing models fail to effectively combine the multimodal signals for accurate inference of internal states. By making the dataset public, the work supports progress in creating more responsive wearable AI systems.

Core claim

EgoIntrospect provides 180 hours of egocentric recordings from 60 users in user-driven scenarios, equipped with self-annotations that directly reveal interactive intentions with AI assistants, along with synchronized multimodal data including video, audio, gaze, motion, and physiological signals. This enables a set of tasks for reasoning about affective experience, interactive intent, and cognitive memory, and benchmarks demonstrate that multimodal large language models do not yet leverage these signals well to understand users' subjective internal states.

What carries the argument

The EgoIntrospect dataset featuring cross-device synchronized multimodal recordings and explicit self-annotations for user internal states.

If this is right

  • Improved multimodal models could enable AI assistants that better understand user intent and emotions in real time.
  • The benchmark tasks highlight specific weaknesses in current models' ability to process egocentric data for subjective reasoning.
  • Public release of the dataset and annotations will facilitate further research in egocentric vision and human-AI interaction.
  • Models that succeed on these tasks may lead to more natural and personalized wearable AI experiences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If self-annotations prove reliable, similar datasets could be collected for other contexts like health monitoring or education.
  • Success on this benchmark might require new architectures specifically designed for integrating physiological and gaze data with video and audio.
  • Future extensions could include longitudinal studies to see how internal states change over repeated interactions.

Load-bearing premise

User-provided self-annotations accurately and reliably reflect true internal states such as affective experience, interactive intent, and cognitive memory.

What would settle it

Development of a multimodal model that achieves significantly higher accuracy than current baselines on the EgoIntrospect benchmark tasks by better utilizing the combined signals.
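
One way to make "significantly higher accuracy" concrete is a paired bootstrap over per-item correctness on the same benchmark items. The sketch below is illustrative only: the function name `paired_bootstrap`, the 0/1 correctness lists, and the resample count are assumptions for this example, not part of the paper or its evaluation code.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Compare two models scored on the same items.

    correct_a / correct_b: per-item 0/1 correctness for baseline A and candidate B.
    Returns (observed accuracy gain of B over A,
             fraction of resamples in which B does not beat A).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long correctness lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item difference: +1 if only B is right, -1 if only A is right, 0 otherwise.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample items with replacement, keeping the A/B pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Hypothetical data: a baseline at 50% accuracy vs. a candidate that is always right.
baseline = [1, 0] * 50
candidate = [1] * 100
gain, p_value = paired_bootstrap(baseline, candidate)
```

A small second value suggests the gain is unlikely to be a resampling accident; the threshold for "significant" remains a choice the evaluator has to state.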

Figures

Figures reproduced from arXiv: 2605.17262 by Borislav Pavlov, Chang Liu, Dai Shi, Eduardus Tjitrahardja, Fangfei Gou, Guocai Yao, Jiacheng Hua, Jia Jia, Jiayi Tan, Jingwei Sun, Jinzhao Li, Jose Manuel Davila, Liuxin Zhang, Miao Liu, Qianying Wang, Qi Wang, Ran Xu, Shuting Chang, Yifei Huang, Yin Li, Yuanchun Shi, Yue Pan, Yuntao Wang, Yu Zhang, Zeyu Wang.

Figure 1: Visual examples of EgoIntrospect for understanding user internal states. We illustrate the daily usage of smart-glass AI assistants through three core dimensions: (1) Affective Experience: recognizing salient moments and inferring the user’s emotional states (e.g., joy or stress); (2) Request Intent: tackling complex, context-grounded queries and initiating proactive assistance (e.g., gym guidance or shopp…
Figure 2: Capture and Annotation Overview of the EgoIntrospect. Our dataset record users’ natural daily routines with a multimodal wearable sensor suite. We synchronize exteroceptive (video, audio) and interoceptive (physiological, gaze, motion) signals for a rich multimodal capture. Annotations are obtained through two stages: an In-situ Labeling stage, where participants verbally mark key moments or send instant m…
Figure 3: Illustrative examples of Affective Experience tasks. (a) MoR-Video/-Photo determine whether a video segment or photo reflects the user’s capture intent. (b) Emotion Identification (EI) classifies the user’s affective state, and Emotion Intensity Recognition (EIR) selects the scenario with the highest emotional intensity. Icons indicate the specific experimental settings: (ICL) refers to the use of In-Conte…
Figure 4: Illustrative examples of Interactive Intent tasks. (a) Tool Selection (TS) identifies the tools needed for addressing the request, and Request Prediction (RP) predicts the user’s possible request given the context. (b) Proactive Timing Judgment (PTJ) evaluates the user’s openness to interaction, and Valuable Interaction Identification (VII) selects helpful proactive assistance from candidates. task focuses…
Figure 5: Illustrative examples of Cognitive Memory tasks. (a) Memory Recall Prediction evaluates the model’s ability to identify specific, vivid details that are most reachable in a user’s memory. (b) Memory Assistance Recognition (MAR) identifies specific items the user explicitly intended to preserve (MAR-Event is shown; see MAR-Object in text), and Memory Lifespan Identification predicts the temporal utility and…
Figure 6: Annotation interface for Task1.1 (Moment of Recording).
Figure 7: Annotation interface for Task1.2 (Emotion Analysis).
Figure 8: Screenshot for task2.1’s annotation UI. could be clearly identified (e.g., if a participant felt joy upon seeing a friend, they would draw the bounding box on the frame where the friend became visible). Throughout the annotation session, an experimenter was present to explain the annotation rules, help participants refine ambiguous temporal boundaries, remind them of potentially missed emotional moments, a…
Figure 9: Task2.2 annotation platform Video availability selection module. Participants reviewed the egocentric recording with timeline-based playback and marked any interval during which they would not want to receive a proactive request and provide the reason. Each marked non-interruptible interval was visualized as a red bar on the timeline. The interface also displayed the corresponding start and end timestamps …
Figure 10: Task3.1 annotation platform settings, locations, participants, primary activities, vivid details, and overall impressions. We accept accounts in both audio and text formats. These original files are processed through a standardized pipeline: audio files (e.g., .m4a) are first transcribed into text using the Xunfei or ElevenLabs API. Subsequently, a Large Language Model (LLM) is employed to decompose these…
Figure 11: Task3.2 annotation platform A.7 Task3.2 - Memory Intent Modeling B Data Processing Algorithms for Annotation. B.1 Transcript and Command Extraction The transcription pipeline processes continuous audio, extracts word-level transcriptions, identifies user commands directed at an AI assistant, and classifies them into categories. Stage 1: Audio Loading & Chunking Input: Audio file (WAV format) The audio is …
Figure 12: Overview of the Task2.2 proactive request recommendation generation pipeline. The …
Figure 13: Demographic statistics of the 60 participants.
Figure 14: Distribution of scenario categories across the 192 annotated activity segments. Each …
Figure 15: Examples of gaze-state overlays in video-based benchmark inputs. The shared renderer …
Figure 16: PTJ confusion heatmaps for the two representative models. Each cell reports count and …
Figure 17: VII error diagnostics for kimi-k2.5 and Qwen3-VL-8B-Instruct. The left heatmap shows …
Figure 18: Representative human-correct/model-wrong Task2.2 examples. Each panel shows a …
Original abstract

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations explicitly revealing users' interactive intentions with AI assistants. Collected via a cross-device setup from 60 subjects yielding 180 hours of synchronized video, audio, gaze, motion, and physiological signals (average 3 hours per subject), the work formalizes tasks on affective experience, interactive intent, and cognitive memory. It constructs benchmarks to evaluate multimodal large language models' reasoning about users' internal states from egocentric observations and reports that existing MLLMs struggle to effectively leverage multimodal signals for inferring subjective internal states. The dataset and annotations are to be released publicly.

Significance. If the self-annotations can be shown to reliably reflect internal states and the benchmarks are constructed without circularity or label noise, the work would be significant for egocentric vision and wearable AI research by addressing the overlooked area of user-centric internal state reasoning. The multimodal synchronization and public release are strengths that could enable reproducible follow-up studies. However, the significance is limited by the absence of validation for the self-annotations serving as ground truth.

major comments (2)
  1. [Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.
  2. [Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.
minor comments (2)
  1. [Dataset statistics] The average recording duration of 3 hours per subject is stated but no breakdown of task distribution or scenario diversity across the 60 subjects is given, which would help assess generalizability.
  2. [Conclusion] The project page URL is provided but the manuscript should include a brief description of what supplementary materials (e.g., annotation guidelines) will be released alongside the dataset.
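
The synchronization validation requested in major comment 2 could be reported as the residual gap between each video frame and its nearest sensor sample after alignment. The following is a minimal sketch under stated assumptions: timestamps are in seconds on a shared clock, `sensor_ts` is sorted, and the function name `nearest_sample_offsets` is hypothetical. It does not describe the paper's actual cross-device procedure.

```python
from bisect import bisect_left


def nearest_sample_offsets(frame_ts, sensor_ts):
    """For each frame timestamp, return the absolute gap (in seconds)
    to the nearest sensor timestamp. sensor_ts must be sorted ascending."""
    if not sensor_ts:
        raise ValueError("sensor_ts must not be empty")
    offsets = []
    for t in frame_ts:
        i = bisect_left(sensor_ts, t)
        # Only the neighbours just before and just after t can be nearest.
        candidates = sensor_ts[max(i - 1, 0):i + 1]
        offsets.append(min(abs(t - s) for s in candidates))
    return offsets


# Hypothetical timestamps: three video frames and a sparser sensor stream.
frames = [0.0, 0.5, 1.0]
sensor = [0.0, 0.25, 1.25]
gaps = nearest_sample_offsets(frames, sensor)
worst_gap = max(gaps)
```

Reporting the maximum and the distribution of such gaps per device pair would let readers judge whether label inconsistency or misalignment is the more plausible confound.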

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback. We address each major comment below, indicating where revisions will be made to improve clarity and transparency.

Point-by-point responses
  1. Referee: [Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.

    Authors: We agree that greater detail on the self-annotation process is warranted. The annotations were obtained directly from each participant immediately following the recorded sessions via a structured digital questionnaire that asked subjects to report their affective experience, interactive intentions toward an AI assistant, and recall of cognitive events. We will revise the dataset construction section to describe the exact questionnaire items, the interface used, and the timing relative to the activities. Because the annotations are self-reports by the individuals who experienced the internal states, standard inter-annotator agreement statistics do not apply; we will instead note any consistency checks (e.g., re-annotation by a subset of participants) that were performed. Explicit quantitative cross-validation against the physiological signals was not conducted in the present study, as the primary focus was on multimodal reasoning benchmarks; we will add an explicit discussion of this limitation and its implications for interpreting model performance. revision: yes

  2. Referee: [Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.

    Authors: The benchmark tasks are defined directly from the self-annotations to evaluate whether MLLMs can reason about subjective internal states given synchronized multimodal observations. We will expand the benchmark construction and evaluation sections to provide quantitative details on the synchronization procedure used in the cross-device capture (including hardware timestamps and alignment verification steps) and any reliability measures obtained for the annotations. While we maintain that the observed model shortcomings reflect genuine difficulties in leveraging multimodal cues rather than pervasive label noise—given the consistent performance patterns across modalities—we will add a dedicated paragraph discussing potential sources of annotation variability and their possible influence on the reported results. revision: partial
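
The consistency check mentioned in the first response (re-annotation by a subset of participants) could be summarized with a chance-corrected agreement score between a participant's first and second labeling pass. The sketch below uses Cohen's kappa as one possible choice; the function name, the label strings, and the example data are assumptions for illustration, and nothing here is taken from the authors' pipeline.

```python
from collections import Counter


def cohen_kappa(first_pass, second_pass):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("need two non-empty, equally long label lists")
    n = len(first_pass)
    # Observed agreement: share of items given the same label in both passes.
    observed = sum(a == b for a, b in zip(first_pass, second_pass)) / n
    # Expected agreement if the two passes were independent with the same marginals.
    c1, c2 = Counter(first_pass), Counter(second_pass)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1.0:
        # Both passes use one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical emotion labels from one participant's two annotation passes.
first = ["joy", "joy", "stress", "stress"]
second = ["joy", "joy", "stress", "joy"]
kappa = cohen_kappa(first, second)
```

Because the labels are self-reports, this measures test-retest stability within a participant, not agreement between independent annotators.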

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark with external model evaluations

full rationale

This is a data-collection paper that records egocentric multimodal signals, collects user self-annotations for internal states, and runs external MLLM evaluations on the resulting benchmark tasks. No equations, parameter fits, or derivations appear in the provided text. Claims about model struggles rest on direct experimental comparisons to the collected annotations rather than any self-referential reduction or self-citation chain. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests primarily on the domain assumption that self-annotations serve as valid ground truth for subjective internal states; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Self-annotations by users accurately reflect their internal states including affective experience, interactive intent, and cognitive memory.
    The benchmark tasks and model evaluations treat these annotations as reliable labels for measuring inference performance.

pith-pipeline@v0.9.0 · 5831 in / 1208 out tokens · 57266 ms · 2026-05-20T14:44:04.808843+00:00 · methodology


Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 5 internal anchors

  1. [1]

    Introducing the new ray-ban meta smart glasses

    Meta. Introducing the new ray-ban meta smart glasses. https://about.fb.com/news/ 2023/09/new-ray-ban-meta-smart-glasses/ , September 2023. Accessed: 2026-04-10

  2. [2]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research.arXiv preprint arXiv:2308.13561, 2023

  3. [3]

    A survey on multimodal large language models.National Science Review, 11(12), November 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12), November 2024

  4. [4]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  5. [5]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  7. [7]

    Egocentric video-language pretraining.Advances in Neural Information Processing Systems, 35:7575–7586, 2022

    Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining.Advances in Neural Information Processing Systems, 35:7575–7586, 2022

  8. [8]

    Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

    Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

  9. [9]

    Ego-humans: An ego-centric 3d multi-human benchmark

    Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh V o, and Kris Kitani. Ego-humans: An ego-centric 3d multi-human benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19807–19819, 2023

  10. [10]

    Mm-ego: Towards building ego- centric multimodal llms

    Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, et al. Mm-ego: Towards building egocentric multimodal llms for video qa.arXiv preprint arXiv:2410.07177, 2024

  11. [11]

    Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment.Advances in Neural Information Processing Systems, 36:53688–53710, 2023

    Zihui Sherry Xue and Kristen Grauman. Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment.Advances in Neural Information Processing Systems, 36:53688–53710, 2023

  12. [12]

    Helping hands: An object-aware ego- centric video recognition model

    Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego- centric video recognition model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13901–13912, October 2023

  13. [13]

    Retrieval-augmented egocentric video captioning

    Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13525–13536, June 2024. 10

  14. [14]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  15. [15]

    Jawahar, Richard Newcombe, Hyun Soo Park, James M

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...

  16. [16]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

  17. [17]

    Egolife: Towards egocentric life assistant

    Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28885–28900, 2025

  18. [18]

    E 3: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset.Advances in Neural Information Processing Systems, 37:118182–118197, 2024

    Wang Lin, Yueying Feng, WenKang Han, Tao Jin, Zhou Zhao, Fei Wu, Chang Yao, and Jingyuan Chen. E 3: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset.Advances in Neural Information Processing Systems, 37:118182–118197, 2024

  19. [19]

    Neon accuracy test report

    Chris Baumann and Kai Dierkes. Neon accuracy test report. 2023

  20. [20]

    Ralovi S2 Wristwatch: EDA Skin Con- ductance and HRV Heart Rate Variability

    Hangzhou Raloway Health Technology Co., Ltd. Ralovi S2 Wristwatch: EDA Skin Con- ductance and HRV Heart Rate Variability. https://raloway.com/index.php?m=home& c=Lists&a=index&tid=23, n.d. Accessed: 2026-05-07

  21. [21]

    A dataset and toolkit for multiparameter cardiovascular physiology sensing on rings.arXiv preprint arXiv:2505.04172, 2025

    Jiankai Tang, Kegang Wang, Yingke Ding, Jiatong Ji, Zeyu Wang, Xiyuxing Zhang, Ping Chen, Yuanchun Shi, and Yuntao Wang. A dataset and toolkit for multiparameter cardiovascular physiology sensing on rings.arXiv preprint arXiv:2505.04172, 2025

  22. [22]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kaza- kos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

  23. [23]

    Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

    Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos.arXiv preprint arXiv:1804.09626, 2018

  24. [24]

    Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities

    Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities. InEuropean Conference on Computer Vision, pages 767–786. Springer, 2020. 11

  25. [25]

    Egoexolearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world

    Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, and Yu Qiao. Egoexolearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22072–22086,...

  26. [26]

    Aria digital twin: A new benchmark dataset for egocentric 3d machine perception

    Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023

  27. [27]

    egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

    Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, and Christian Holz. egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

  28. [28]

    K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

    Cheul Young Park, Narae Cha, Soowon Kang, Auk Kim, Ahsan Habib Khandoker, Leontios Hadjileontiadis, Alice Oh, Yong Jeong, and Uichin Lee. K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

  29. [29]

    Jointly learning energy expenditures and activities using egocentric multimodal signals

    Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, and Li Fei-Fei. Jointly learning energy expenditures and activities using egocentric multimodal signals. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1868–1877, 2017

  30. [30]

    The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

    Kevin Doherty and Gavin Doherty. The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

  31. [31]

    Measuring digital intervention user experience with a novel ecological momentary assessment (ema) method, corto.Internet Interventions, 35:100706, 2024

    Lauri Lukka, Veli-Matti Karhulahti, Vilma-Reetta Bergman, and J Matias Palva. Measuring digital intervention user experience with a novel ecological momentary assessment (ema) method, corto.Internet Interventions, 35:100706, 2024

  32. [32]

    Online episodic memory visual query localization with egocentric streaming object memory

    Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finoc- chiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, and Christian Micheloni. Online episodic memory visual query localization with egocentric streaming object memory. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), ...

  33. [33]

    Episodic memory question answering

    Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, and Devi Parikh. Episodic memory question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19119–19128, June 2022

  34. [34]

    Where did i leave my keys?-episodic-memory-based question answering on egocentric videos

    Leonard Bärmann and Alex Waibel. Where did i leave my keys?-episodic-memory-based question answering on egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1560–1568, 2022

  35. [35]

    Egotaskqa: Understanding human tasks in egocentric videos, 2022

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos, 2022

  36. [36]

    arXiv preprint arXiv:2506.05287 , year=

    Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, et al. Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world?arXiv preprint arXiv:2506.05287, 2025

  37. [37]

    Building a mind palace: Structuring environment-grounded semantic graphs for effective long video analysis with llms

    Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, and Miao Liu. Building a mind palace: Structuring environment-grounded semantic graphs for effective long video analysis with llms. InProceedings of the IEEE/CVF Conference on Computer Vision a...

  38. [38]

    Assistq: Affordance-centric question-driven task completion for egocentric assistant

    Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. Assistq: Affordance-centric question-driven task completion for egocentric assistant. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 485–501, Cham, 2022. Springer Nature Sw...

  39. [39]

    Hourvideo: 1-hour video-language understanding

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems, 37:53168–53197, 2024

  40. [40]

    Egothink: Evaluating first-person perspective thinking capability of vision-language models

    Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14291–14302, 2024

  41. [41]

    Egoplan-bench: Benchmarking multimodal large language models for human-level planning

    Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. International Journal of Computer Vision, 134(3):118, 2026

  42. [42]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

  43. [43]

    In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting

    Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025

  44. [44]

    Egotextvqa: Towards egocentric scene-text aware video question answering

    Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. Egotextvqa: Towards egocentric scene-text aware video question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3363–3373, 2025

  45. [45]

    Sensecam: A retrospective memory aid

    Steve Hodges, Lyndsay Williams, Emma Berry, Shahram Izadi, James Srinivasan, Alex Butler, Gavin Smyth, Narinder Kapur, and Ken Wood. Sensecam: A retrospective memory aid. In International conference on ubiquitous computing, pages 177–193. Springer, 2006

  46. [46]

    Highlight detection with pairwise deep ranking for first-person video summarization

    Ting Yao, Tao Mei, and Yong Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 982–990, 2016

  47. [47]

    Impact of video summary viewing on episodic memory recall: Design guidelines for video summarizations

    Huy Viet Le, Sarah Clinch, Corina Sas, Tilman Dingler, Niels Henze, and Nigel Davies. Impact of video summary viewing on episodic memory recall: Design guidelines for video summarizations. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 4793–4805, 2016

  48. [48]

    Experiencing the affective diary

    Anna Ståhl, Kristina Höök, Martin Svensson, Alex S Taylor, and Marco Combetto. Experiencing the affective diary. Personal and Ubiquitous Computing, 13(5):365–378, 2009

  49. [49]

    Affectcam: arousal-augmented sensecam for richer recall of episodic memories

    Corina Sas, Tomasz Fratczak, Matthew Rees, Hans Gellersen, Vaiva Kalnikaite, Alina Coman, and Kristina Höök. Affectcam: arousal-augmented sensecam for richer recall of episodic memories. In CHI’13 extended abstracts on human factors in computing systems, pages 1041–1046, 2013

  50. [50]

    Yuhu Chang, Yingying Zhao, Mingzhi Dong, Yujiang Wang, Yutian Lu, Qin Lv, Robert P Dick, Tun Lu, Ning Gu, and Li Shang. Memx: An attention-aware smart eyewear system for personalized moment auto-capture. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(2):1–23, 2021

  51. [51]

    In the eye of the beholder: Gaze and actions in first person video

    Yin Li, Miao Liu, and James M Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE transactions on pattern analysis and machine intelligence, 45(6):6731–6747, 2021

  52. [52]

    Katsutoshi Masai, Kai Kunze, Yuta Sugiura, Masa Ogata, Masahiko Inami, and Maki Sugimoto. Evaluation of facial expression recognition by a smart eyewear for facial direction changes, repeatability, and positional drift. ACM Transactions on Interactive Intelligent Systems (TiiS), 7(4):1–23, 2017

  53. [53]

    Eyeecho: Continuous and low-power facial expression tracking on glasses

    Ke Li, Ruidong Zhang, Siyuan Chen, Boao Chen, Mose Sakashita, François Guimbretière, and Cheng Zhang. Eyeecho: Continuous and low-power facial expression tracking on glasses. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–24, 2024

  54. [54]

    Emotion recognition using a glasses-type wearable device via multi-channel facial responses

    Jangho Kwon, Jihyeon Ha, Da-Hye Kim, Jun Won Choi, and Laehyun Kim. Emotion recognition using a glasses-type wearable device via multi-channel facial responses. IEEE Access, 9:146392–146403, 2021

  55. [55]

    Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices

    Hao Wu, Jinghao Feng, Xuejin Tian, Edward Sun, Yunxin Liu, Bo Dong, Fengyuan Xu, and Sheng Zhong. Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 448–461, 2020

  56. [56]

    Yingying Zhao, Yuhu Chang, Yutian Lu, Yujiang Wang, Mingzhi Dong, Qin Lv, Robert P Dick, Fan Yang, Tun Lu, Ning Gu, et al. Do smart glasses dream of sentimental visions? deep emotionship analysis for eyewear devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(1):1–29, 2022

  57. [57]

    Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning

    Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander G Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. Advances in Neural Information Processing Systems, 37:110805–110853, 2024

  58. [58]

    EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

    He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, and Laizhong Cui. Emobench-m: Benchmarking emotional intelligence for multimodal large language models. arXiv preprint arXiv:2502.04424, 2025

  59. [59]

    G-voila: gaze-facilitated information querying in daily scenarios

    Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, Yuhan Wang, Lei Ji, Xuhai Xu, and Chun Yu. G-voila: gaze-facilitated information querying in daily scenarios. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2):1–33, 2024

  60. [60]

    Gazepointar: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality

    Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S Rodriguez, and Jon E Froehlich. Gazepointar: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2024

  61. [61]

    Voila-a: Aligning vision-language models with user’s gaze attention

    Kun Yan, Zeyu Wang, Lei Ji, Yuntao Wang, Nan Duan, and Shuai Ma. Voila-a: Aligning vision-language models with user’s gaze attention. Advances in neural information processing systems, 37:1890–1918, 2024

  62. [62]

    Visual intention grounding for egocentric assistants

    Pengzhan Sun, Junbin Xiao, Tze Ho Elden Tse, Yicong Li, Arjun Akula, and Angela Yao. Visual intention grounding for egocentric assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2512–2522, 2025

  63. [63]

    Sensible agent: A framework for unobtrusive interaction with proactive ar agents

    Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, et al. Sensible agent: A framework for unobtrusive interaction with proactive ar agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2025

  64. [64]

    Generating natural questions about an image

    Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813, 2016

  65. [65]

    Egointent: An egocentric step-level benchmark for understanding what, why, and next

    Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, and Xuming Hu. Egointent: An egocentric step-level benchmark for understanding what, why, and next. arXiv preprint arXiv:2603.12147, 2026

  66. [66]

    Benchmarking egocentric multimodal goal inference for assistive wearable agents

    Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, et al. Benchmarking egocentric multimodal goal inference for assistive wearable agents. arXiv preprint arXiv:2510.22443, 2025

  67. [67]

    Aiget: Transforming everyday moments into hidden knowledge discovery with ai assistance on smart glasses

    Runze Cai, Nuwan Janaka, Hyeongcheol Kim, Yang Chen, Shengdong Zhao, Yun Huang, and David Hsu. Aiget: Transforming everyday moments into hidden knowledge discovery with ai assistance on smart glasses. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26, 2025

  68. [68]

    Proactive assistant dialogue generation from streaming egocentric videos

    Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, and Seungwhan Moon. Proactive assistant dialogue generation from streaming egocentric videos. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12055–12079, 2025

  69. [69]

    Acceptability of a lifelogging wearable camera in older adults with mild cognitive impairment: a mixed-method study

    Olga Gelonch, Mireia Ribera, Núria Codern-Bové, Sílvia Ramos, Maria Quintana, Gloria Chico, Noemí Cerulla, Paula Lafarga, Petia Radeva, and Maite Garolera. Acceptability of a lifelogging wearable camera in older adults with mild cognitive impairment: a mixed-method study. BMC geriatrics, 19(1):110, 2019

  70. [70]

    Do life-logging technologies support memory for the past? an experimental study using sensecam

    Abigail J Sellen, Andrew Fogg, Mike Aitken, Steve Hodges, Carsten Rother, and Ken Wood. Do life-logging technologies support memory for the past? an experimental study using sensecam. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 81–90, 2007

  71. [71]

    Evangelos Niforatos, Caterina Cinel, Cathleen Cortis Mack, Marc Langheinrich, and Geoff Ward. Can less be more? contrasting limited, unlimited, and automatic picture capture for augmenting memory recall. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, 1(2):1–22, 2017

  72. [72]

    Beyond total capture: a constructive critique of lifelogging

    Abigail J Sellen and Steve Whittaker. Beyond total capture: a constructive critique of lifelogging. Communications of the ACM, 53(5):70–77, 2010

  73. [73]

    Smart assistive glasses for alzheimer’s patients

    Mohamed Ait Gacem, Saifeddin Alghlayini, Wessam Shehieb, Muaid Saeed, Ahmed Ghazal, and Mustahsan Mir. Smart assistive glasses for alzheimer’s patients. In 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 1–5. IEEE, 2019

  74. [74]

    Fmt: A wearable camera-based object tracking memory aid for older adults

    Franklin Mingzhe Li, Di Laura Chen, Mingming Fan, and Khai N Truong. Fmt: A wearable camera-based object tracking memory aid for older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3):1–25, 2019

  75. [75]

    Navmarkar: A landmark-based augmented reality (ar) wayfinding system for enhancing older adults’ spatial learning

    Zhiwen Qiu, Mojtaba Ashour, Xiaohe Zhou, and Saleh Kalantari. Navmarkar: A landmark-based augmented reality (ar) wayfinding system for enhancing older adults’ spatial learning. Advanced Engineering Informatics, 62:102635, 2024

  76. [76]

    Memoro: Using large language models to realize a concise interface for real-time memory augmentation

    Wazeer Deen Zulfikar, Samantha Chan, and Pattie Maes. Memoro: Using large language models to realize a concise interface for real-time memory augmentation. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2024

  77. [77]

    Ar secretary agent: Real-time memory augmentation via llm-powered augmented reality glasses

    Raphaël A El Haddad, Zeyu Wang, Yeonsu Shin, Ranyi Liu, Yuntao Wang, and Chun Yu. Ar secretary agent: Real-time memory augmentation via llm-powered augmented reality glasses. arXiv preprint arXiv:2505.11888, 2025

  78. [78]

    Speech to text

    ElevenLabs. Speech to text. https://elevenlabs.io/docs/eleven-creative/playground/speech-to-text. ElevenLabs Documentation. Accessed: 2026-05-06

  79. [79]

    Large Model for Audio File Transcription

    iFLYTEK. Large Model for Audio File Transcription. https://www.xfyun.cn/doc/spark/asr_llm/Ifasr_llm.html, 2026. iFLYTEK Open Platform Documentation Center. Accessed: 2026-05-06

  80. [80]

    OpenAI GPT-4o

    Microsoft. OpenAI GPT-4o. https://ai.azure.com/catalog/models/gpt-4o, 2026. Microsoft Foundry Models catalog. Version 2024-11-20; last updated April 2026. Accessed: 2026-05-06
