EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

Borislav Pavlov; Chang Liu; Dai Shi; Eduardus Tjitrahardja; Fangfei Gou; Guocai Yao; Jiacheng Hua; Jia Jia; Jiayi Tan; Jingwei Sun

arxiv: 2605.17262 · v1 · pith:FRGXOF3Snew · submitted 2026-05-17 · 💻 cs.CV


Zeyu Wang, Chang Liu, Eduardus Tjitrahardja, Yuntao Wang, Borislav Pavlov, Fangfei Gou, Jose Manuel Davila, Dai Shi, Ran Xu, Yue Pan, Jiayi Tan, Shuting Chang, Qi Wang, Jinzhao Li, Jiacheng Hua, Yifei Huang, Jingwei Sun, Yu Zhang, Liuxin Zhang, Guocai Yao, Jia Jia, Yin Li, Qianying Wang, Yuanchun Shi, Miao Liu


Pith reviewed 2026-05-20 14:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords egocentric dataset · internal state reasoning · multimodal LLMs · user intent · affective experience · wearable AI · benchmark


The pith

The EgoIntrospect dataset and benchmark reveal that multimodal large language models struggle to infer users' internal states from egocentric multimodal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoIntrospect, the first egocentric dataset designed to capture users' internal states during interactions with AI assistants. Collected from 60 subjects over 180 hours using synchronized video, audio, gaze, motion, and physiological signals, it includes self-annotations for affective experience, interactive intent, and cognitive memory. The authors build benchmarks to evaluate how well current multimodal large language models can reason about these subjective states from the observations. Their experiments indicate that existing models fail to effectively combine the multimodal signals for accurate inference of internal states. By making the dataset public, the work supports progress in creating more responsive wearable AI systems.

Core claim

EgoIntrospect provides 180 hours of egocentric recordings from 60 users in user-driven scenarios, equipped with self-annotations that directly reveal interactive intentions with AI assistants, along with synchronized multimodal data including video, audio, gaze, motion, and physiological signals. This enables a set of tasks for reasoning about affective experience, interactive intent, and cognitive memory, and benchmarks demonstrate that multimodal large language models do not yet leverage these signals well to understand users' subjective internal states.

What carries the argument

The EgoIntrospect dataset featuring cross-device synchronized multimodal recordings and explicit self-annotations for user internal states.

If this is right

Improved multimodal models could enable AI assistants that better understand user intent and emotions in real time.
The benchmark tasks highlight specific weaknesses in current models' ability to process egocentric data for subjective reasoning.
Public release of the dataset and annotations will facilitate further research in egocentric vision and human-AI interaction.
Models that succeed on these tasks may lead to more natural and personalized wearable AI experiences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If self-annotations prove reliable, similar datasets could be collected for other contexts like health monitoring or education.
Success on this benchmark might require new architectures specifically designed for integrating physiological and gaze data with video and audio.
Future extensions could include longitudinal studies to see how internal states change over repeated interactions.

Load-bearing premise

User-provided self-annotations accurately and reliably reflect true internal states such as affective experience, interactive intent, and cognitive memory.

What would settle it

Development of a multimodal model that achieves significantly higher accuracy than current baselines on the EgoIntrospect benchmark tasks by better utilizing the combined signals.

Figures

Figures reproduced from arXiv: 2605.17262 by Borislav Pavlov, Chang Liu, Dai Shi, Eduardus Tjitrahardja, Fangfei Gou, Guocai Yao, Jiacheng Hua, Jia Jia, Jiayi Tan, Jingwei Sun, Jinzhao Li, Jose Manuel Davila, Liuxin Zhang, Miao Liu, Qianying Wang, Qi Wang, Ran Xu, Shuting Chang, Yifei Huang, Yin Li, Yuanchun Shi, Yue Pan, Yuntao Wang, Yu Zhang, Zeyu Wang.

**Figure 1.** Visual examples of EgoIntrospect for understanding user internal states. We illustrate the daily usage of smart-glass AI assistants through three core dimensions: (1) Affective Experience: recognizing salient moments and inferring the user’s emotional states (e.g., joy or stress); (2) Request Intent: tackling complex, context-grounded queries and initiating proactive assistance (e.g., gym guidance or shopp…

**Figure 2.** Capture and Annotation Overview of the EgoIntrospect. Our dataset records users’ natural daily routines with a multimodal wearable sensor suite. We synchronize exteroceptive (video, audio) and interoceptive (physiological, gaze, motion) signals for a rich multimodal capture. Annotations are obtained through two stages: an In-situ Labeling stage, where participants verbally mark key moments or send instant m…

**Figure 3.** Illustrative examples of Affective Experience tasks. (a) MoR-Video/-Photo determine whether a video segment or photo reflects the user’s capture intent. (b) Emotion Identification (EI) classifies the user’s affective state, and Emotion Intensity Recognition (EIR) selects the scenario with the highest emotional intensity. Icons indicate the specific experimental settings: (ICL) refers to the use of In-Conte…

**Figure 4.** Illustrative examples of Interactive Intent tasks. (a) Tool Selection (TS) identifies the tools needed for addressing the request, and Request Prediction (RP) predicts the user’s possible request given the context. (b) Proactive Timing Judgment (PTJ) evaluates the user’s openness to interaction, and Valuable Interaction Identification (VII) selects helpful proactive assistance from candidates. task focuses…

**Figure 5.** Illustrative examples of Cognitive Memory tasks. (a) Memory Recall Prediction evaluates the model’s ability to identify specific, vivid details that are most reachable in a user’s memory. (b) Memory Assistance Recognition (MAR) identifies specific items the user explicitly intended to preserve (MAR-Event is shown; see MAR-Object in text), and Memory Lifespan Identification predicts the temporal utility and…

**Figure 6.** Annotation interface for Task1.1 (Moment of Recording).

**Figure 7.** Annotation interface for Task1.2 (Emotion Analysis).

**Figure 8.** Screenshot for task2.1’s annotation UI. could be clearly identified (e.g., if a participant felt joy upon seeing a friend, they would draw the bounding box on the frame where the friend became visible). Throughout the annotation session, an experimenter was present to explain the annotation rules, help participants refine ambiguous temporal boundaries, remind them of potentially missed emotional moments, a…

**Figure 9.** Task2.2 annotation platform Video availability selection module. Participants reviewed the egocentric recording with timeline-based playback and marked any interval during which they would not want to receive a proactive request and provide the reason. Each marked non-interruptible interval was visualized as a red bar on the timeline. The interface also displayed the corresponding start and end timestamps …

**Figure 10.** Task3.1 annotation platform settings, locations, participants, primary activities, vivid details, and overall impressions. We accept accounts in both audio and text formats. These original files are processed through a standardized pipeline: audio files (e.g., .m4a) are first transcribed into text using the Xunfei or ElevenLabs API. Subsequently, a Large Language Model (LLM) is employed to decompose these…

**Figure 11.** Task3.2 annotation platform A.7 Task3.2 - Memory Intent Modeling B Data Processing Algorithms for Annotation. B.1 Transcript and Command Extraction The transcription pipeline processes continuous audio, extracts word-level transcriptions, identifies user commands directed at an AI assistant, and classifies them into categories. Stage 1: Audio Loading & Chunking Input: Audio file (WAV format) The audio is …

**Figure 12.** Overview of the Task2.2 proactive request recommendation generation pipeline. The…

**Figure 13.** Demographic statistics of the 60 participants.

**Figure 14.** Distribution of scenario categories across the 192 annotated activity segments. Each…

**Figure 15.** Examples of gaze-state overlays in video-based benchmark inputs. The shared renderer…

**Figure 16.** PTJ confusion heatmaps for the two representative models. Each cell reports count and…

**Figure 17.** VII error diagnostics for kimi-k2.5 and Qwen3-VL-8B-Instruct. The left heatmap shows…

**Figure 18.** Representative human-correct/model-wrong Task2.2 examples. Each panel shows a…

Original abstract

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EgoIntrospect adds a useful new dataset for internal state reasoning but its benchmarks need stronger validation of the self-annotations.


This paper introduces EgoIntrospect, a new dataset of 180 hours of egocentric recordings from 60 subjects with self-annotations for internal states such as affective experience, interactive intent, and cognitive memory. The authors also benchmark multimodal LLMs on these tasks and report that the models struggle to use the signals effectively.

What is new here is the focus on user-driven scenarios with explicit self-annotations for subjective states, going beyond action- or object-focused egocentric datasets. The synchronized collection of video, audio, gaze, motion, and physiological signals in a cross-device setup is a solid technical achievement, and releasing the data publicly is helpful for the community. The evaluations on modern MLLMs provide some initial evidence that these models have room to improve on internal state inference. That part is useful as a starting point for future work.

The soft spot is the ground truth. Self-annotations for internal states can be affected by reporting biases and limited accuracy, and nothing in the paper description indicates that the authors cross-checked them against the physiological signals or other measures. If the labels are inconsistent, the benchmark results could overstate the models' shortcomings. More details on annotation protocols would strengthen the claims.

This work is for researchers in egocentric vision, wearable computing, and human-AI interaction who want data on subjective user states. Readers looking for new benchmarks or datasets in multimodal reasoning will find value in the release. It deserves peer review because the dataset is novel and the scale is substantial. I recommend sending it out, with attention to validating the annotations in revisions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations explicitly revealing users' interactive intentions with AI assistants. Collected via a cross-device setup from 60 subjects yielding 180 hours of synchronized video, audio, gaze, motion, and physiological signals (average 3 hours per subject), the work formalizes tasks on affective experience, interactive intent, and cognitive memory. It constructs benchmarks to evaluate multimodal large language models' reasoning about users' internal states from egocentric observations and reports that existing MLLMs struggle to effectively leverage multimodal signals for inferring subjective internal states. The dataset and annotations are to be released publicly.

Significance. If the self-annotations can be shown to reliably reflect internal states and the benchmarks are constructed without circularity or label noise, the work would be significant for egocentric vision and wearable AI research by addressing the overlooked area of user-centric internal state reasoning. The multimodal synchronization and public release are strengths that could enable reproducible follow-up studies. However, the significance is limited by the absence of validation for the self-annotations serving as ground truth.

major comments (2)

[Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.
[Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.
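One way to picture the cross-validation the referee asks for is a rank correlation between self-reported emotion intensity and a physiological feature computed over the same segments. The following is a minimal sketch, not the authors' method: the feature (mean skin conductance per segment), the five data points, and every function name are hypothetical, and Spearman's rho is only one of several reasonable agreement measures.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Hypothetical per-segment values, invented for illustration only.
self_reported_intensity = [1, 2, 2, 4, 5]          # participant rating
mean_eda_microsiemens = [0.8, 1.1, 1.0, 2.3, 2.9]  # assumed EDA feature

rho = spearman(self_reported_intensity, mean_eda_microsiemens)
print(f"Spearman rho = {rho:.3f}")
```

A high rho on real data would not prove the labels correct, and a low one would not prove them wrong, since physiological arousal and reported affect can legitimately diverge. It would, however, give readers a quantitative handle on how the two sources relate.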

minor comments (2)

[Dataset statistics] The average recording duration of 3 hours per subject is stated but no breakdown of task distribution or scenario diversity across the 60 subjects is given, which would help assess generalizability.
[Conclusion] The project page URL is provided but the manuscript should include a brief description of what supplementary materials (e.g., annotation guidelines) will be released alongside the dataset.
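The breakdown requested in the first minor comment could be produced from a per-segment table. The sketch below assumes a hypothetical record layout of (subject ID, scenario label, duration in hours); the rows are invented and do not come from the dataset. For reference, the reported totals imply 180 hours / 60 subjects = 3 hours per subject on average.

```python
from collections import Counter, defaultdict

# Hypothetical segment records: (subject_id, scenario, duration_hours).
segments = [
    ("S01", "gym", 1.0),
    ("S01", "shopping", 2.0),
    ("S02", "cooking", 1.5),
    ("S02", "gym", 1.5),
]


def summarize(records):
    """Return hours per subject, segment counts per scenario, and mean hours."""
    hours = defaultdict(float)
    scenarios = Counter()
    for subject, scenario, duration in records:
        hours[subject] += duration
        scenarios[scenario] += 1
    if not hours:
        raise ValueError("no segments to summarize")
    mean_hours = sum(hours.values()) / len(hours)
    return dict(hours), scenarios, mean_hours


hours_per_subject, scenario_counts, mean_hours = summarize(segments)
print(hours_per_subject)
print(scenario_counts.most_common())
print(f"mean hours per subject: {mean_hours:.1f}")
```

Reporting the spread of `hours_per_subject` alongside the mean, and the scenario counts per subject rather than only in aggregate, would show whether a few participants or scenarios dominate the benchmark.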

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback. We address each major comment below, indicating where revisions will be made to improve clarity and transparency.


Referee: [Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.

Authors: We agree that greater detail on the self-annotation process is warranted. The annotations were obtained directly from each participant immediately following the recorded sessions via a structured digital questionnaire that asked subjects to report their affective experience, interactive intentions toward an AI assistant, and recall of cognitive events. We will revise the dataset construction section to describe the exact questionnaire items, the interface used, and the timing relative to the activities. Because the annotations are self-reports by the individuals who experienced the internal states, standard inter-annotator agreement statistics do not apply; we will instead note any consistency checks (e.g., re-annotation by a subset of participants) that were performed. Explicit quantitative cross-validation against the physiological signals was not conducted in the present study, as the primary focus was on multimodal reasoning benchmarks; we will add an explicit discussion of this limitation and its implications for interpreting model performance.

revision: yes

Referee: [Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.

Authors: The benchmark tasks are defined directly from the self-annotations to evaluate whether MLLMs can reason about subjective internal states given synchronized multimodal observations. We will expand the benchmark construction and evaluation sections to provide quantitative details on the synchronization procedure used in the cross-device capture (including hardware timestamps and alignment verification steps) and any reliability measures obtained for the annotations. While we maintain that the observed model shortcomings reflect genuine difficulties in leveraging multimodal cues rather than pervasive label noise, given the consistent performance patterns across modalities, we will add a dedicated paragraph discussing potential sources of annotation variability and their possible influence on the reported results.

revision: partial
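The rebuttal mentions hardware timestamps and alignment verification without describing the procedure. As a generic illustration only, a nearest-timestamp match with a tolerance is one common way to check whether two device streams line up. The function name, the sample timestamps, and the 50 ms tolerance below are all assumptions, not details from the paper.

```python
from bisect import bisect_left


def align_nearest(reference_ts, other_ts, tolerance_s):
    """Match each reference timestamp to the nearest sample in other_ts.

    other_ts must be sorted ascending. Returns a list holding, for each
    reference timestamp, the index of the nearest sample in other_ts, or
    None when the nearest sample is further away than tolerance_s.
    """
    matches = []
    for t in reference_ts:
        pos = bisect_left(other_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(other_ts)]
        best = min(candidates, key=lambda i: abs(other_ts[i] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tolerance_s:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Hypothetical timestamps in seconds on a shared clock.
video_ts = [0.00, 0.50, 1.00, 1.50]
eda_ts = [0.02, 0.27, 0.52, 0.77, 1.02]

matched = align_nearest(video_ts, eda_ts, tolerance_s=0.05)
unmatched_fraction = matched.count(None) / len(matched)
print(matched, unmatched_fraction)
```

Reporting the fraction of unmatched samples and the distribution of residual offsets per device pair would be one concrete form the promised synchronization details could take.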

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark with external model evaluations


This is a data-collection paper that records egocentric multimodal signals, collects user self-annotations for internal states, and runs external MLLM evaluations on the resulting benchmark tasks. No equations, parameter fits, or derivations appear in the provided text. Claims about model struggles rest on direct experimental comparisons to the collected annotations rather than any self-referential reduction or self-citation chain. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests primarily on the domain assumption that self-annotations serve as valid ground truth for subjective internal states; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Self-annotations by users accurately reflect their internal states including affective experience, interactive intent, and cognitive memory.
The benchmark tasks and model evaluations treat these annotations as reliable labels for measuring inference performance.
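The reliability of this assumption can be probed with the re-annotation check mentioned in the rebuttal: if a participant labels the same segments twice, chance-corrected agreement between the two passes gives a test-retest estimate. The sketch below uses Cohen's kappa for that purpose. The labels and both annotation passes are invented for illustration, and nothing in the page indicates which statistic, if any, the authors used.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa between two label sequences over the same items."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(counts_first[label] * counts_second[label]
                   for label in counts_first) / (n * n)
    if expected == 1.0:
        # Both passes used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical emotion labels from one participant on two separate passes.
first_pass = ["joy", "stress", "joy", "neutral", "stress", "joy"]
second_pass = ["joy", "stress", "neutral", "neutral", "stress", "joy"]

kappa = cohens_kappa(first_pass, second_pass)
print(f"test-retest kappa = {kappa:.2f}")
```

Kappa measures consistency, not accuracy: a participant can be consistently mistaken about their own state, so this check bounds label noise without validating the axiom itself.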


Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 5 internal anchors

[1]

Introducing the new ray-ban meta smart glasses

Meta. Introducing the new ray-ban meta smart glasses. https://about.fb.com/news/ 2023/09/new-ray-ban-meta-smart-glasses/ , September 2023. Accessed: 2026-04-10

work page 2023
[2]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research.arXiv preprint arXiv:2308.13561, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

A survey on multimodal large language models.National Science Review, 11(12), November 2024

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12), November 2024

work page 2024
[4]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

work page 2023
[5]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Egocentric video-language pretraining.Advances in Neural Information Processing Systems, 35:7575–7586, 2022

Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining.Advances in Neural Information Processing Systems, 35:7575–7586, 2022

work page 2022
[8]

Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

work page arXiv 2023
[9]

Ego-humans: An ego-centric 3d multi-human benchmark

Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh V o, and Kris Kitani. Ego-humans: An ego-centric 3d multi-human benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19807–19819, 2023

work page 2023
[10]

Mm-ego: Towards building ego- centric multimodal llms

Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, et al. Mm-ego: Towards building egocentric multimodal llms for video qa.arXiv preprint arXiv:2410.07177, 2024

work page arXiv 2024
[11]

Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment.Advances in Neural Information Processing Systems, 36:53688–53710, 2023

Zihui Sherry Xue and Kristen Grauman. Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment.Advances in Neural Information Processing Systems, 36:53688–53710, 2023

work page 2023
[12]

Helping hands: An object-aware ego- centric video recognition model

Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego- centric video recognition model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13901–13912, October 2023

work page 2023
[13]

Retrieval-augmented egocentric video captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13525–13536, June 2024. 10

work page 2024
[14]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022
[15]

Jawahar, Richard Newcombe, Hyun Soo Park, James M

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...

work page 2024
[16]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018

work page 2018
[17]

Egolife: Towards egocentric life assistant

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28885–28900, 2025

work page 2025
[18]

E 3: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset.Advances in Neural Information Processing Systems, 37:118182–118197, 2024

Wang Lin, Yueying Feng, WenKang Han, Tao Jin, Zhou Zhao, Fei Wu, Chang Yao, and Jingyuan Chen. E 3: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset.Advances in Neural Information Processing Systems, 37:118182–118197, 2024

work page 2024
[19]

Neon accuracy test report

Chris Baumann and Kai Dierkes. Neon accuracy test report. 2023

work page 2023
[20]

Ralovi S2 Wristwatch: EDA Skin Con- ductance and HRV Heart Rate Variability

Hangzhou Raloway Health Technology Co., Ltd. Ralovi S2 Wristwatch: EDA Skin Con- ductance and HRV Heart Rate Variability. https://raloway.com/index.php?m=home& c=Lists&a=index&tid=23, n.d. Accessed: 2026-05-07

work page 2026
[21]

A dataset and toolkit for multiparameter cardiovascular physiology sensing on rings.arXiv preprint arXiv:2505.04172, 2025

Jiankai Tang, Kegang Wang, Yingke Ding, Jiatong Ji, Zeyu Wang, Xiyuxing Zhang, Ping Chen, Yuanchun Shi, and Yuntao Wang. A dataset and toolkit for multiparameter cardiovascular physiology sensing on rings.arXiv preprint arXiv:2505.04172, 2025

work page arXiv 2025
[22]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kaza- kos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

work page 2022
[23]

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos.arXiv preprint arXiv:1804.09626, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities

Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. Lemma: A multi-view dataset for le arning m ulti-agent m ulti-task a ctivities. InEuropean Conference on Computer Vision, pages 767–786. Springer, 2020. 11

work page 2020
[25]

Egoexolearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, and Yu Qiao. Egoexolearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22072–22086,...

work page 2024
[26]

Aria digital twin: A new benchmark dataset for egocentric 3d machine perception

Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023

work page 2023
[27]

egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, and Christian Holz. egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

work page arXiv 2025
[28]

K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

Cheul Young Park, Narae Cha, Soowon Kang, Auk Kim, Ahsan Habib Khandoker, Leontios Hadjileontiadis, Alice Oh, Yong Jeong, and Uichin Lee. K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

work page 2020
[29]

Jointly learning energy expenditures and activities using egocentric multimodal signals

Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, and Li Fei-Fei. Jointly learning energy expenditures and activities using egocentric multimodal signals. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1868–1877, 2017

work page 2017
[30]

The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

Kevin Doherty and Gavin Doherty. The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

work page 2018
[31] Lauri Lukka, Veli-Matti Karhulahti, Vilma-Reetta Bergman, and J Matias Palva. Measuring digital intervention user experience with a novel ecological momentary assessment (ema) method, corto. Internet Interventions, 35:100706, 2024.
[32] Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, and Christian Micheloni. Online episodic memory visual query localization with egocentric streaming object memory. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), ...
[33] Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, and Devi Parikh. Episodic memory question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19119–19128, June 2022.
[34] Leonard Bärmann and Alex Waibel. Where did I leave my keys? - episodic-memory-based question answering on egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1560–1568, 2022.
[35] Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos, 2022.
[36] Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, et al. Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world? arXiv preprint arXiv:2506.05287, 2025.
[37] Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, and Miao Liu. Building a mind palace: Structuring environment-grounded semantic graphs for effective long video analysis with llms. In Proceedings of the IEEE/CVF Conference on Computer Vision a...
[38] Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. Assistq: Affordance-centric question-driven task completion for egocentric assistant. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 485–501, Cham, 2022. Springer Nature Sw...
[39] Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems, 37:53168–53197, 2024.
[40] Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14291–14302, 2024.
[41] Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. International Journal of Computer Vision, 134(3):118, 2026.
[42] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
[43] Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025.
[44] Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. Egotextvqa: Towards egocentric scene-text aware video question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3363–3373, 2025.
[45] Steve Hodges, Lyndsay Williams, Emma Berry, Shahram Izadi, James Srinivasan, Alex Butler, Gavin Smyth, Narinder Kapur, and Ken Wood. Sensecam: A retrospective memory aid. In International conference on ubiquitous computing, pages 177–193. Springer, 2006.
[46] Ting Yao, Tao Mei, and Yong Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 982–990, 2016.
[47] Huy Viet Le, Sarah Clinch, Corina Sas, Tilman Dingler, Niels Henze, and Nigel Davies. Impact of video summary viewing on episodic memory recall: Design guidelines for video summarizations. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 4793–4805, 2016.
[48] Anna Ståhl, Kristina Höök, Martin Svensson, Alex S Taylor, and Marco Combetto. Experiencing the affective diary. Personal and Ubiquitous Computing, 13(5):365–378, 2009.
[49] Corina Sas, Tomasz Fratczak, Matthew Rees, Hans Gellersen, Vaiva Kalnikaite, Alina Coman, and Kristina Höök. Affectcam: arousal-augmented sensecam for richer recall of episodic memories. In CHI’13 extended abstracts on human factors in computing systems, pages 1041–1046, 2013.
[50] Yuhu Chang, Yingying Zhao, Mingzhi Dong, Yujiang Wang, Yutian Lu, Qin Lv, Robert P Dick, Tun Lu, Ning Gu, and Li Shang. Memx: An attention-aware smart eyewear system for personalized moment auto-capture. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(2):1–23, 2021.
[51] Yin Li, Miao Liu, and James M Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE transactions on pattern analysis and machine intelligence, 45(6):6731–6747, 2021.
[52] Katsutoshi Masai, Kai Kunze, Yuta Sugiura, Masa Ogata, Masahiko Inami, and Maki Sugimoto. Evaluation of facial expression recognition by a smart eyewear for facial direction changes, repeatability, and positional drift. ACM Transactions on Interactive Intelligent Systems (TiiS), 7(4):1–23, 2017.
[53] Ke Li, Ruidong Zhang, Siyuan Chen, Boao Chen, Mose Sakashita, François Guimbretière, and Cheng Zhang. Eyeecho: Continuous and low-power facial expression tracking on glasses. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–24, 2024.
[54] Jangho Kwon, Jihyeon Ha, Da-Hye Kim, Jun Won Choi, and Laehyun Kim. Emotion recognition using a glasses-type wearable device via multi-channel facial responses. IEEE Access, 9:146392–146403, 2021.
[55] Hao Wu, Jinghao Feng, Xuejin Tian, Edward Sun, Yunxin Liu, Bo Dong, Fengyuan Xu, and Sheng Zhong. Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 448–461, 2020.
[56] Yingying Zhao, Yuhu Chang, Yutian Lu, Yujiang Wang, Mingzhi Dong, Qin Lv, Robert P Dick, Fan Yang, Tun Lu, Ning Gu, et al. Do smart glasses dream of sentimental visions? deep emotionship analysis for eyewear devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(1):1–29, 2022.
[57] Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander G Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. Advances in Neural Information Processing Systems, 37:110805–110853, 2024.
[58] He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, and Laizhong Cui. Emobench-m: Benchmarking emotional intelligence for multimodal large language models. arXiv preprint arXiv:2502.04424, 2025.
[59] Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, Yuhan Wang, Lei Ji, Xuhai Xu, and Chun Yu. G-voila: gaze-facilitated information querying in daily scenarios. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2):1–33, 2024.
[60] Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S Rodriguez, and Jon E Froehlich. Gazepointar: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2024.
[61] Kun Yan, Zeyu Wang, Lei Ji, Yuntao Wang, Nan Duan, and Shuai Ma. Voila-a: Aligning vision-language models with user’s gaze attention. Advances in neural information processing systems, 37:1890–1918, 2024.
[62] Pengzhan Sun, Junbin Xiao, Tze Ho Elden Tse, Yicong Li, Arjun Akula, and Angela Yao. Visual intention grounding for egocentric assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2512–2522, 2025.
[63] Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, et al. Sensible agent: A framework for unobtrusive interaction with proactive ar agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2025.
[64] Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813, 2016.
[65] Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, and Xuming Hu. Egointent: An egocentric step-level benchmark for understanding what, why, and next. arXiv preprint arXiv:2603.12147, 2026.
[66] Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, et al. Benchmarking egocentric multimodal goal inference for assistive wearable agents. arXiv preprint arXiv:2510.22443, 2025.
[67] Runze Cai, Nuwan Janaka, Hyeongcheol Kim, Yang Chen, Shengdong Zhao, Yun Huang, and David Hsu. Aiget: Transforming everyday moments into hidden knowledge discovery with ai assistance on smart glasses. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26, 2025.
[68] Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, and Seungwhan Moon. Proactive assistant dialogue generation from streaming egocentric videos. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12055–12079, 2025.
[69] Olga Gelonch, Mireia Ribera, Núria Codern-Bové, Sílvia Ramos, Maria Quintana, Gloria Chico, Noemí Cerulla, Paula Lafarga, Petia Radeva, and Maite Garolera. Acceptability of a lifelogging wearable camera in older adults with mild cognitive impairment: a mixed-method study. BMC geriatrics, 19(1):110, 2019.
[70] Abigail J Sellen, Andrew Fogg, Mike Aitken, Steve Hodges, Carsten Rother, and Ken Wood. Do life-logging technologies support memory for the past? an experimental study using sensecam. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 81–90, 2007.
[71] Evangelos Niforatos, Caterina Cinel, Cathleen Cortis Mack, Marc Langheinrich, and Geoff Ward. Can less be more? contrasting limited, unlimited, and automatic picture capture for augmenting memory recall. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, 1(2):1–22, 2017.
[72] Abigail J Sellen and Steve Whittaker. Beyond total capture: a constructive critique of lifelogging. Communications of the ACM, 53(5):70–77, 2010.
[73] Mohamed Ait Gacem, Saifeddin Alghlayini, Wessam Shehieb, Muaid Saeed, Ahmed Ghazal, and Mustahsan Mir. Smart assistive glasses for alzheimer’s patients. In 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 1–5. IEEE, 2019.
[74] Franklin Mingzhe Li, Di Laura Chen, Mingming Fan, and Khai N Truong. Fmt: A wearable camera-based object tracking memory aid for older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3):1–25, 2019.
[75] Zhiwen Qiu, Mojtaba Ashour, Xiaohe Zhou, and Saleh Kalantari. Navmarkar: A landmark-based augmented reality (ar) wayfinding system for enhancing older adults’ spatial learning. Advanced Engineering Informatics, 62:102635, 2024.
[76] Wazeer Deen Zulfikar, Samantha Chan, and Pattie Maes. Memoro: Using large language models to realize a concise interface for real-time memory augmentation. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2024.
[77] Raphaël A El Haddad, Zeyu Wang, Yeonsu Shin, Ranyi Liu, Yuntao Wang, and Chun Yu. Ar secretary agent: Real-time memory augmentation via llm-powered augmented reality glasses. arXiv preprint arXiv:2505.11888, 2025.
[78] ElevenLabs. Speech to text. https://elevenlabs.io/docs/eleven-creative/playground/speech-to-text. ElevenLabs Documentation. Accessed: 2026-05-06.
[79] iFLYTEK. Large Model for Audio File Transcription. https://www.xfyun.cn/doc/spark/asr_llm/Ifasr_llm.html, 2026. iFLYTEK Open Platform Documentation Center. Accessed: 2026-05-06.
[80] Microsoft. OpenAI GPT-4o. https://ai.azure.com/catalog/models/gpt-4o, 2026. Microsoft Foundry Models catalog. Version 2024-11-20; last updated April 2026. Accessed: 2026-05-06.

[1] Meta. Introducing the new ray-ban meta smart glasses. https://about.fb.com/news/2023/09/new-ray-ban-meta-smart-glasses/, September 2023. Accessed: 2026-04-10.
[2] Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023.
[3] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12), November 2024.
[4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[7] Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575–7586, 2022.
[8] Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos. arXiv preprint arXiv:2312.05269, 2023.
[9] Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, and Kris Kitani. Ego-humans: An ego-centric 3d multi-human benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19807–19819, 2023.
[10] Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, et al. Mm-ego: Towards building egocentric multimodal llms for video qa. arXiv preprint arXiv:2410.07177, 2024.
[11] Zihui Sherry Xue and Kristen Grauman. Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment. Advances in Neural Information Processing Systems, 36:53688–53710, 2023.
[12] Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. Helping hands: An object-aware ego-centric video recognition model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13901–13912, October 2023.
[13] Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13525–13536, June 2024.
[14] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.
[15] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...
[16] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018.
[17] Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28885–28900, 2025.
[18] Wang Lin, Yueying Feng, WenKang Han, Tao Jin, Zhou Zhao, Fei Wu, Chang Yao, and Jingyuan Chen. E^3: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset. Advances in Neural Information Processing Systems, 37:118182–118197, 2024.
[19] Chris Baumann and Kai Dierkes. Neon accuracy test report. 2023.
[20] Hangzhou Raloway Health Technology Co., Ltd. Ralovi S2 Wristwatch: EDA Skin Conductance and HRV Heart Rate Variability. https://raloway.com/index.php?m=home&c=Lists&a=index&tid=23, n.d. Accessed: 2026-05-07.
[21] Jiankai Tang, Kegang Wang, Yingke Ding, Jiatong Ji, Zeyu Wang, Xiyuxing Zhang, Ping Chen, Yuanchun Shi, and Yuntao Wang. A dataset and toolkit for multiparameter cardiovascular physiology sensing on rings. arXiv preprint arXiv:2505.04172, 2025.
[22] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022.
[23] Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018.
[24] Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. Lemma: A multi-view dataset for learning multi-agent multi-task activities. In European Conference on Computer Vision, pages 767–786. Springer, 2020.
[25] Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, and Yu Qiao. Egoexolearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22072–22086,...
[26] [26]

Aria digital twin: A new benchmark dataset for egocentric 3d machine perception

Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023

work page 2023

[27] [27]

egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, and Christian Holz. egoemotion: Egocentric vision and physiological signals for emotion and personality recognition in real- world tasks.arXiv preprint arXiv:2510.22129, 2025

work page arXiv 2025

[28] [28]

K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

Cheul Young Park, Narae Cha, Soowon Kang, Auk Kim, Ahsan Habib Khandoker, Leontios Hadjileontiadis, Alice Oh, Yong Jeong, and Uichin Lee. K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations.Scientific Data, 7(1):293, 2020

work page 2020

[29] [29]

Jointly learning energy expenditures and activities using egocentric multimodal signals

Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, and Li Fei-Fei. Jointly learning energy expenditures and activities using egocentric multimodal signals. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1868–1877, 2017

work page 2017

[30] [30]

The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

Kevin Doherty and Gavin Doherty. The construal of experience in hci: Understanding self- reports.International Journal of Human-Computer Studies, 110:63–74, 2018

work page 2018

[31] [31]

Measuring digital intervention user experience with a novel ecological momentary assessment (ema) method, corto.Internet Interventions, 35:100706, 2024

Lauri Lukka, Veli-Matti Karhulahti, Vilma-Reetta Bergman, and J Matias Palva. Measuring digital intervention user experience with a novel ecological momentary assessment (ema) method, corto.Internet Interventions, 35:100706, 2024

work page 2024

[32] [32]

Online episodic memory visual query localization with egocentric streaming object memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finoc- chiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, and Christian Micheloni. Online episodic memory visual query localization with egocentric streaming object memory. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), ...

work page 2026

[33] [33]

Episodic memory question answering

Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, and Devi Parikh. Episodic memory question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19119–19128, June 2022

work page 2022

[34] [34]

Where did i leave my keys?-episodic-memory-based question answering on egocentric videos

Leonard Bärmann and Alex Waibel. Where did i leave my keys?-episodic-memory-based question answering on egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1560–1568, 2022

work page 2022

[35] [35]

Egotaskqa: Understanding human tasks in egocentric videos, 2022

Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos, 2022

work page 2022

[36] [36]

arXiv preprint arXiv:2506.05287 , year=

Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, et al. Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world?arXiv preprint arXiv:2506.05287, 2025

work page arXiv 2025

[37] [37]

Building a mind palace: Structuring environment-grounded semantic graphs for effective long video analysis with llms

Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, and Miao Liu. Building a mind palace: Structuring environment-grounded semantic graphs for effective long video analysis with llms. InProceedings of the IEEE/CVF Conference on Computer Vision a...

work page 2025

[38] [38]

Assistq: Affordance-centric question-driven task completion for egocentric assistant

Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. Assistq: Affordance-centric question-driven task completion for egocentric assistant. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 485–501, Cham, 2022. Springer Nature Sw...

work page 2022

[39] [39]

Hourvideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems, 37:53168–53197, 2024

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems, 37:53168–53197, 2024

work page 2024

[40] [40]

Egothink: Evaluating first-person perspective thinking capability of vision-language models

Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14291–14302, 2024

work page 2024

[41] [41]

Egoplan-bench: Benchmarking multimodal large language models for human-level planning. International Journal of Computer Vision, 134(3):118, 2026

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. International Journal of Computer Vision, 134(3):118, 2026

work page 2026

[42] [42]

Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

work page 2023

[43] [43]

In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025

Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025

work page arXiv 2025

[44] [44]

Egotextvqa: Towards egocentric scene-text aware video question answering

Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. Egotextvqa: Towards egocentric scene-text aware video question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3363–3373, 2025

work page 2025

[45] [45]

Sensecam: A retrospective memory aid

Steve Hodges, Lyndsay Williams, Emma Berry, Shahram Izadi, James Srinivasan, Alex Butler, Gavin Smyth, Narinder Kapur, and Ken Wood. Sensecam: A retrospective memory aid. In International conference on ubiquitous computing, pages 177–193. Springer, 2006

work page 2006

[46] [46]

Highlight detection with pairwise deep ranking for first-person video summarization

Ting Yao, Tao Mei, and Yong Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 982–990, 2016

work page 2016

[47] [47]

Impact of video summary viewing on episodic memory recall: Design guidelines for video summarizations

Huy Viet Le, Sarah Clinch, Corina Sas, Tilman Dingler, Niels Henze, and Nigel Davies. Impact of video summary viewing on episodic memory recall: Design guidelines for video summarizations. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 4793–4805, 2016

work page 2016

[48] [48]

Experiencing the affective diary. Personal and Ubiquitous Computing, 13(5):365–378, 2009

Anna Ståhl, Kristina Höök, Martin Svensson, Alex S Taylor, and Marco Combetto. Experiencing the affective diary. Personal and Ubiquitous Computing, 13(5):365–378, 2009

work page 2009

[49] [49]

Affectcam: arousal-augmented sensecam for richer recall of episodic memories

Corina Sas, Tomasz Fratczak, Matthew Rees, Hans Gellersen, Vaiva Kalnikaite, Alina Coman, and Kristina Höök. Affectcam: arousal-augmented sensecam for richer recall of episodic memories. In CHI ’13 extended abstracts on human factors in computing systems, pages 1041–1046, 2013

work page 2013

[50] [50]

Yuhu Chang, Yingying Zhao, Mingzhi Dong, Yujiang Wang, Yutian Lu, Qin Lv, Robert P Dick, Tun Lu, Ning Gu, and Li Shang. Memx: An attention-aware smart eyewear system for personalized moment auto-capture. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(2):1–23, 2021

work page 2021

[51] [51]

In the eye of the beholder: Gaze and actions in first person video. IEEE transactions on pattern analysis and machine intelligence, 45(6):6731–6747, 2021

Yin Li, Miao Liu, and James M Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE transactions on pattern analysis and machine intelligence, 45(6):6731–6747, 2021

work page 2021

[52] [52]

Katsutoshi Masai, Kai Kunze, Yuta Sugiura, Masa Ogata, Masahiko Inami, and Maki Sugimoto. Evaluation of facial expression recognition by a smart eyewear for facial direction changes, repeatability, and positional drift. ACM Transactions on Interactive Intelligent Systems (TiiS), 7(4):1–23, 2017

work page 2017

[53] [53]

Eyeecho: Continuous and low-power facial expression tracking on glasses

Ke Li, Ruidong Zhang, Siyuan Chen, Boao Chen, Mose Sakashita, François Guimbretière, and Cheng Zhang. Eyeecho: Continuous and low-power facial expression tracking on glasses. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–24, 2024

work page 2024

[54] [54]

Emotion recognition using a glasses-type wearable device via multi-channel facial responses. IEEE Access, 9:146392–146403, 2021

Jangho Kwon, Jihyeon Ha, Da-Hye Kim, Jun Won Choi, and Laehyun Kim. Emotion recognition using a glasses-type wearable device via multi-channel facial responses. IEEE Access, 9:146392–146403, 2021

work page 2021

[55] [55]

Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices

Hao Wu, Jinghao Feng, Xuejin Tian, Edward Sun, Yunxin Liu, Bo Dong, Fengyuan Xu, and Sheng Zhong. Emo: Real-time emotion recognition from single-eye images for resource-constrained eyewear devices. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 448–461, 2020

work page 2020

[56] [56]

Yingying Zhao, Yuhu Chang, Yutian Lu, Yujiang Wang, Mingzhi Dong, Qin Lv, Robert P Dick, Fan Yang, Tun Lu, Ning Gu, et al. Do smart glasses dream of sentimental visions? deep emotionship analysis for eyewear devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(1):1–29, 2022

work page 2022

[57] [57]

Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. Advances in Neural Information Processing Systems, 37:110805–110853, 2024

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander G Hauptmann. Emotion-llama: Multimodal emotion recognition and reasoning with instruction tuning. Advances in Neural Information Processing Systems, 37:110805–110853, 2024

work page 2024

[58] [58]

EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, and Laizhong Cui. Emobench-m: Benchmarking emotional intelligence for multimodal large language models. arXiv preprint arXiv:2502.04424, 2025

work page arXiv 2025

[59] [59]

G-voila: gaze-facilitated information querying in daily scenarios

Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, Yuhan Wang, Lei Ji, Xuhai Xu, and Chun Yu. G-voila: gaze-facilitated information querying in daily scenarios. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2):1–33, 2024

work page 2024

[60] [60]

Gazepointar: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality

Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S Rodriguez, and Jon E Froehlich. Gazepointar: A context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2024

work page 2024

[61] [61]

Voila-a: Aligning vision-language models with user’s gaze attention. Advances in neural information processing systems, 37:1890–1918, 2024

Kun Yan, Zeyu Wang, Lei Ji, Yuntao Wang, Nan Duan, and Shuai Ma. Voila-a: Aligning vision-language models with user’s gaze attention. Advances in neural information processing systems, 37:1890–1918, 2024

work page 2024

[62] [62]

Visual intention grounding for egocentric assistants

Pengzhan Sun, Junbin Xiao, Tze Ho Elden Tse, Yicong Li, Arjun Akula, and Angela Yao. Visual intention grounding for egocentric assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2512–2522, 2025

work page 2025

[63] [63]

Sensible agent: A framework for unobtrusive interaction with proactive ar agents

Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, et al. Sensible agent: A framework for unobtrusive interaction with proactive ar agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2025

work page 2025

[64] [64]

Generating natural questions about an image

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1802–1813, 2016

work page 2016

[65] [65]

Egointent: An egocentric step-level benchmark for understanding what, why, and next. arXiv preprint arXiv:2603.12147, 2026

Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, and Xuming Hu. Egointent: An egocentric step-level benchmark for understanding what, why, and next. arXiv preprint arXiv:2603.12147, 2026

work page arXiv 2026

[66] [66]

Benchmarking egocentric multimodal goal inference for assistive wearable agents. arXiv preprint arXiv:2510.22443, 2025

Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, et al. Benchmarking egocentric multimodal goal inference for assistive wearable agents. arXiv preprint arXiv:2510.22443, 2025

work page arXiv 2025

[67] [67]

Aiget: Transforming everyday moments into hidden knowledge discovery with ai assistance on smart glasses

Runze Cai, Nuwan Janaka, Hyeongcheol Kim, Yang Chen, Shengdong Zhao, Yun Huang, and David Hsu. Aiget: Transforming everyday moments into hidden knowledge discovery with ai assistance on smart glasses. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26, 2025

work page 2025

[68] [68]

Proactive assistant dialogue generation from streaming egocentric videos

Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, and Seungwhan Moon. Proactive assistant dialogue generation from streaming egocentric videos. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12055–12079, 2025

work page 2025

[69] [69]

Acceptability of a lifelogging wearable camera in older adults with mild cognitive impairment: a mixed-method study. BMC geriatrics, 19(1):110, 2019

Olga Gelonch, Mireia Ribera, Núria Codern-Bové, Sílvia Ramos, Maria Quintana, Gloria Chico, Noemí Cerulla, Paula Lafarga, Petia Radeva, and Maite Garolera. Acceptability of a lifelogging wearable camera in older adults with mild cognitive impairment: a mixed-method study. BMC geriatrics, 19(1):110, 2019

work page 2019

[70] [70]

Do life-logging technologies support memory for the past? an experimental study using sensecam

Abigail J Sellen, Andrew Fogg, Mike Aitken, Steve Hodges, Carsten Rother, and Ken Wood. Do life-logging technologies support memory for the past? an experimental study using sensecam. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 81–90, 2007

work page 2007

[71] [71]

Evangelos Niforatos, Caterina Cinel, Cathleen Cortis Mack, Marc Langheinrich, and Geoff Ward. Can less be more? contrasting limited, unlimited, and automatic picture capture for augmenting memory recall. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, 1(2):1–22, 2017

work page 2017

[72] [72]

Beyond total capture: a constructive critique of lifelogging. Communications of the ACM, 53(5):70–77, 2010

Abigail J Sellen and Steve Whittaker. Beyond total capture: a constructive critique of lifelogging. Communications of the ACM, 53(5):70–77, 2010

work page 2010

[73] [73]

Smart assistive glasses for alzheimer’s patients

Mohamed Ait Gacem, Saifeddin Alghlayini, Wessam Shehieb, Muaid Saeed, Ahmed Ghazal, and Mustahsan Mir. Smart assistive glasses for alzheimer’s patients. In 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 1–5. IEEE, 2019

work page 2019

[74] [74]

Fmt: A wearable camera-based object tracking memory aid for older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3):1–25, 2019

Franklin Mingzhe Li, Di Laura Chen, Mingming Fan, and Khai N Truong. Fmt: A wearable camera-based object tracking memory aid for older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3):1–25, 2019

work page 2019

[75] [75]

Navmarkar: A landmark-based augmented reality (ar) wayfinding system for enhancing older adults’ spatial learning

Zhiwen Qiu, Mojtaba Ashour, Xiaohe Zhou, and Saleh Kalantari. Navmarkar: A landmark-based augmented reality (ar) wayfinding system for enhancing older adults’ spatial learning. Advanced Engineering Informatics, 62:102635, 2024

work page 2024

[76] [76]

Memoro: Using large language models to realize a concise interface for real-time memory augmentation

Wazeer Deen Zulfikar, Samantha Chan, and Pattie Maes. Memoro: Using large language models to realize a concise interface for real-time memory augmentation. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2024

work page 2024

[77] [77]

Ar secretary agent: Real-time memory augmentation via llm-powered augmented reality glasses

Raphaël A El Haddad, Zeyu Wang, Yeonsu Shin, Ranyi Liu, Yuntao Wang, and Chun Yu. Ar secretary agent: Real-time memory augmentation via llm-powered augmented reality glasses. arXiv preprint arXiv:2505.11888, 2025

work page arXiv 2025

[78] [78]

Speech to text

ElevenLabs. Speech to text. https://elevenlabs.io/docs/eleven-creative/playground/speech-to-text. ElevenLabs Documentation. Accessed: 2026-05-06

work page 2026

[79] [79]

Large Model for Audio File Transcription

iFLYTEK. Large Model for Audio File Transcription. https://www.xfyun.cn/doc/spark/asr_llm/Ifasr_llm.html, 2026. iFLYTEK Open Platform Documentation Center. Accessed: 2026-05-06

work page 2026

[80] [80]

OpenAI GPT-4o

Microsoft. OpenAI GPT-4o. https://ai.azure.com/catalog/models/gpt-4o, 2026. Microsoft Foundry Models catalog. Version 2024-11-20; last updated April 2026. Accessed: 2026-05-06

work page 2026