EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
Pith reviewed 2026-05-20 14:44 UTC · model grok-4.3
The pith
The EgoIntrospect dataset and benchmark reveal that multimodal large language models struggle to infer users' internal states from egocentric multimodal data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EgoIntrospect provides 180 hours of egocentric recordings from 60 users in user-driven scenarios, equipped with self-annotations that directly reveal interactive intentions with AI assistants, along with synchronized multimodal data including video, audio, gaze, motion, and physiological signals. This enables a set of tasks for reasoning about affective experience, interactive intent, and cognitive memory, and benchmarks demonstrate that multimodal large language models do not yet leverage these signals well to understand users' subjective internal states.
What carries the argument
The EgoIntrospect dataset featuring cross-device synchronized multimodal recordings and explicit self-annotations for user internal states.
If this is right
- Improved multimodal models could enable AI assistants that better understand user intent and emotions in real time.
- The benchmark tasks highlight specific weaknesses in current models' ability to process egocentric data for subjective reasoning.
- Public release of the dataset and annotations will facilitate further research in egocentric vision and human-AI interaction.
- Models that succeed on these tasks may lead to more natural and personalized wearable AI experiences.
Where Pith is reading between the lines
- If self-annotations prove reliable, similar datasets could be collected for other contexts like health monitoring or education.
- Success on this benchmark might require new architectures specifically designed for integrating physiological and gaze data with video and audio.
- Future extensions could include longitudinal studies to see how internal states change over repeated interactions.
Load-bearing premise
User-provided self-annotations accurately and reliably reflect true internal states such as affective experience, interactive intent, and cognitive memory.
What would settle it
Development of a multimodal model that achieves significantly higher accuracy than current baselines on the EgoIntrospect benchmark tasks by better utilizing the combined signals.
Original abstract
Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations explicitly revealing users' interactive intentions with AI assistants. Collected via a cross-device setup from 60 subjects yielding 180 hours of synchronized video, audio, gaze, motion, and physiological signals (average 3 hours per subject), the work formalizes tasks on affective experience, interactive intent, and cognitive memory. It constructs benchmarks to evaluate multimodal large language models' reasoning about users' internal states from egocentric observations and reports that existing MLLMs struggle to effectively leverage multimodal signals for inferring subjective internal states. The dataset and annotations are to be released publicly.
Significance. If the self-annotations can be shown to reliably reflect internal states and the benchmarks are constructed without circularity or label noise, the work would be significant for egocentric vision and wearable AI research by addressing the overlooked area of user-centric internal state reasoning. The multimodal synchronization and public release are strengths that could enable reproducible follow-up studies. However, the significance is limited by the absence of validation for the self-annotations serving as ground truth.
major comments (2)
- [Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.
- [Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.
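One way to quantify the annotation reliability raised in the major comments is a test-retest agreement score between a participant's original self-annotations and a later re-annotation of the same segments. The sketch below computes Cohen's kappa for categorical labels. It is only an illustration: the label names and the two label lists are hypothetical, and the source does not say which reliability measure, if any, was used.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(first)
    # Fraction of segments on which the two annotation passes agree.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each pass's label frequencies.
    counts_a, counts_b = Counter(first), Counter(second)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical affect labels for six segments (not taken from the dataset).
original = ["calm", "stressed", "calm", "happy", "stressed", "calm"]
repeat = ["calm", "stressed", "happy", "happy", "calm", "calm"]
kappa = cohens_kappa(original, repeat)
print(f"test-retest kappa: {kappa:.3f}")
```

A kappa near 1 would suggest stable self-reports, while a value near 0 would suggest agreement no better than chance. Either way, it measures consistency, not whether the labels match the true internal state.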
minor comments (2)
- [Dataset statistics] The average recording duration of 3 hours per subject is stated but no breakdown of task distribution or scenario diversity across the 60 subjects is given, which would help assess generalizability.
- [Conclusion] The project page URL is provided but the manuscript should include a brief description of what supplementary materials (e.g., annotation guidelines) will be released alongside the dataset.
Simulated Author's Rebuttal
We thank the referee for the careful review and valuable feedback. We address each major comment below, indicating where revisions will be made to improve clarity and transparency.
Point-by-point responses
-
Referee: [Abstract and dataset construction section] Abstract and dataset description: The central claim that MLLMs struggle to infer subjective internal states rests on self-annotations for affective experience, interactive intent, and cognitive memory being treated as ground truth. No details are provided on annotation protocols, inter-annotator agreement, or cross-validation against the synchronized physiological signals, raising the risk that observed model failures reflect annotation noise rather than limitations in multimodal reasoning.
Authors: We agree that greater detail on the self-annotation process is warranted. The annotations were obtained directly from each participant immediately following the recorded sessions via a structured digital questionnaire that asked subjects to report their affective experience, interactive intentions toward an AI assistant, and recall of cognitive events. We will revise the dataset construction section to describe the exact questionnaire items, the interface used, and the timing relative to the activities. Because the annotations are self-reports by the individuals who experienced the internal states, standard inter-annotator agreement statistics do not apply; we will instead note any consistency checks (e.g., re-annotation by a subset of participants) that were performed. Explicit quantitative cross-validation against the physiological signals was not conducted in the present study, as the primary focus was on multimodal reasoning benchmarks; we will add an explicit discussion of this limitation and its implications for interpreting model performance. revision: yes
-
Referee: [Benchmark and evaluation section] Benchmark construction: The experiments conclude that models fail to leverage multimodal signals, yet without quantitative results on annotation reliability or synchronization validation (as noted in the abstract's description of the cross-device setup), it is unclear whether the benchmark isolates the intended reasoning challenge or is confounded by label inconsistency.
Authors: The benchmark tasks are defined directly from the self-annotations to evaluate whether MLLMs can reason about subjective internal states given synchronized multimodal observations. We will expand the benchmark construction and evaluation sections to provide quantitative details on the synchronization procedure used in the cross-device capture (including hardware timestamps and alignment verification steps) and any reliability measures obtained for the annotations. Given the consistent performance patterns across modalities, we maintain that the observed model shortcomings reflect genuine difficulties in leveraging multimodal cues rather than pervasive label noise. Nevertheless, we will add a dedicated paragraph discussing potential sources of annotation variability and their possible influence on the reported results. revision: partial
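The alignment verification mentioned in the second response could, under simple assumptions, take the form of a nearest-timestamp match between two device streams with a tolerance. The sketch below is hypothetical throughout: the function name, the stream names, the timestamps, and the 50 ms tolerance are illustrative and do not come from the paper, whose actual synchronization procedure is not described in the text above.

```python
from bisect import bisect_left


def align_nearest(ref_ts, other_ts, tolerance):
    """For each reference timestamp, return the index of the nearest sample
    in other_ts (which must be sorted), or None if it lies beyond tolerance."""
    matches = []
    for t in ref_ts:
        i = bisect_left(other_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Hypothetical timestamps in seconds on a shared clock.
video_ts = [0.00, 0.50, 1.00, 1.50]
eda_ts = [0.02, 0.27, 0.52, 0.77, 1.02, 2.40]
matches = align_nearest(video_ts, eda_ts, tolerance=0.05)
print(matches)
```

Reporting the fraction of unmatched reference samples, and the distribution of residual offsets for matched ones, would be one way to give the quantitative synchronization details the referee asks for.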
Circularity Check
No circularity: empirical dataset and benchmark with external model evaluations
full rationale
This is a data-collection paper that records egocentric multimodal signals, collects user self-annotations for internal states, and runs external MLLM evaluations on the resulting benchmark tasks. No equations, parameter fits, or derivations appear in the provided text. Claims about model struggles rest on direct experimental comparisons to the collected annotations rather than any self-referential reduction or self-citation chain. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-annotations by users accurately reflect their internal states, including affective experience, interactive intent, and cognitive memory.
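This assumption could be probed, though not proven, by checking whether self-reported ratings covary with a physiological signal recorded at the same time. The sketch below computes a Spearman rank correlation between hypothetical self-reported arousal ratings and hypothetical mean skin-conductance values per segment. All variable names and numbers are illustrative assumptions; the authors state that no such cross-validation was conducted in the present study.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = shared
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-segment values (not taken from the dataset).
self_reported_arousal = [1, 2, 3, 4, 5]
mean_eda = [0.8, 1.1, 1.0, 1.9, 2.4]
rho = spearman(self_reported_arousal, mean_eda)
print(f"Spearman rho: {rho:.2f}")
```

A weak correlation would not by itself show that the self-annotations are wrong, since physiological arousal and reported experience can legitimately diverge. It would only flag segments worth closer inspection.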