arxiv: 2604.08062 · v1 · submitted 2026-04-09 · 💻 cs.HC · cs.AI

From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords gaze-aware AI · multimodal LLM · egocentric video · cognitive assistance · reading behavior · personalized interaction · user modeling · retrospective feedback

The pith

Gaze-aware multimodal LLM assistants detect reading difficulties from egocentric video and improve recall and efficiency over text-only versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding egocentric video with gaze overlays lets an LLM identify likely points of cognitive difficulty during reading and deliver targeted retrospective assistance. A sympathetic reader would care because current LLM assistants lack access to behavioral signals of struggle, often giving generic responses instead of personalized guidance that actually helps comprehension. In a study of 36 participants, the gaze-aware assistant produced significantly higher ratings for accuracy and personalization in assessing reading behavior, better information recall, and significantly fewer words spoken by users, which the authors read as more efficient exchanges. Qualitative results showed perceived benefits alongside occasional misreadings of gaze data.

Core claim

By processing egocentric video with gaze overlays, a multimodal LLM can identify likely points of difficulty in users' cognitive processes and provide follow-up assistance that leads to more accurate and personalized assessments of reading behavior, improved recall of information, and more efficient interactions than a conventional text-only LLM assistant.

What carries the argument

Gaze-grounded multimodal LLM that interprets egocentric video with gaze overlays to detect cognitive difficulty points and generate retrospective guidance.

If this is right

  • Users recall more information from reading material when the assistant targets help based on observed gaze patterns.
  • Ratings of accuracy and personalization rise because the assistant grounds its responses in visible behavioral context.
  • Users speak fewer words overall, as the system infers needs without requiring explicit verbal descriptions.
  • Assistance becomes proactive and retrospective rather than reactive to user queries alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gaze-interpretation approach could apply to tasks like web navigation or code review where eye movements signal confusion.
  • Adding user correction mechanisms would address the noted inaccuracies and make the system more robust over repeated sessions.
  • This points toward AI assistants that maintain ongoing models of individual cognitive states across multiple interactions.
  • Educational platforms could incorporate such assistants to deliver post-reading support tailored to where learners actually lingered.

Load-bearing premise

That gaze overlays on egocentric video supply reliable enough signals for the LLM to correctly identify genuine points of cognitive difficulty despite some inaccurate interpretations.

What would settle it

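A follow-up experiment that tests two things: whether gaze-identified difficulty points correlate with users' self-reported struggles, and whether the recall and efficiency gains disappear when gaze data is removed. Finding no correlation would undercut the claim; gains that vanish without gaze data would support it.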

Figures

Figures reproduced from arXiv: 2604.08062 by Andrew Wilson, Javier Hernandez, Judith Amores, Pattie Maes, Valdemar Danry.

Figure 1: Overview of the gaze-aware cognitive AI assistant. (a) A user wearing wearable AI glasses which streams eye gaze …
Figure 2: System overview of the gaze-aware AI assistant.
Figure 3: Real examples of gaze-reading behavior, the object list generated from LLM object recognition, the LLM eye tracking …
Figure 4: Bar plots of results for main variables across conditions for (a) learning performance, (b) LLM analysis ratings, and (c) …
Figure 5: Exploratory results showing NASA TLX across conditions, with standard errors.
Figure 6: Exploratory results for sanity check questions and HLMIQ responses.
Figure 7: Rated accuracy of AI analysis in phase 3 between control and experimental differed based on text difficulty, with the …
Figure 8: Conceptual overview of system user experience with AI interpretations of gaze behavior. Top row: What is captured …
Figure 9: Overview of study procedure and timing across phases.
Original abstract

Current LLM assistants are powerful at answering questions, but they have limited access to the behavioral context that reveals when and where a user is struggling. We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance. We instantiate this vision in a controlled study (n=36) comparing the gaze-aware AI assistant to a text-only LLM assistant. Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate and personalized in its assessments of users' reading behavior and significantly improved people's ability to recall information. Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions. Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate. Our findings suggest that gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a gaze-grounded multimodal LLM assistant that processes egocentric video with gaze overlays to detect likely points of user difficulty during reading tasks and deliver targeted retrospective assistance. It reports results from a controlled study (n=36) comparing this system to a conventional text-only LLM assistant, claiming statistically significant improvements in rated accuracy and personalization of assessments of reading behavior, better information recall, and more efficient interactions (fewer words spoken). Qualitative results highlight perceived comprehension benefits alongside challenges from inaccurate gaze interpretations.

Significance. If the empirical claims hold after addressing verification gaps, this work would advance HCI and multimodal AI by showing how behavioral context from gaze can enable more adaptive, cognitively-aware assistants that improve user outcomes in reading and learning scenarios. The direct comparison to a text-only baseline provides a useful benchmark for the added value of multimodal inputs, with potential implications for designing more efficient and personalized human-AI systems.

major comments (2)
  1. [Abstract and Results] Abstract and Results section: The manuscript claims statistically significant benefits (accuracy, personalization, recall, and reduced spoken words) from the n=36 controlled study, yet provides no details on experimental design, controls, statistical tests used, effect sizes, p-values, or exclusion criteria. This omission is load-bearing for the central empirical claim, as it prevents verification that the reported gains are attributable to gaze-grounded reasoning rather than other factors.
  2. [Qualitative Results and Discussion] Qualitative Results and Discussion: The paper acknowledges 'challenges when interpretations of gaze behaviors were inaccurate' but reports no quantitative metrics for the accuracy of difficulty-point detection (e.g., precision/recall against user self-reports, post-hoc annotations, or eye-tracking validation). This is critical because the core mechanism—mapping egocentric gaze overlays to 'likely points of difficulty'—underpins the claimed benefits; without such validation, improvements could arise from video context alone.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction could more explicitly distinguish the contributions of gaze overlays versus the egocentric video feed to clarify the unique role of gaze data.
  2. [Methods and Figures] Figure captions and method descriptions would benefit from additional detail on how gaze overlays are rendered and fed into the multimodal LLM to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive comments. We agree that greater transparency on statistical details and component validation would strengthen the paper. Below we respond point-by-point and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results section: The manuscript claims statistically significant benefits (accuracy, personalization, recall, and reduced spoken words) from the n=36 controlled study, yet provides no details on experimental design, controls, statistical tests used, effect sizes, p-values, or exclusion criteria. This omission is load-bearing for the central empirical claim, as it prevents verification that the reported gains are attributable to gaze-grounded reasoning rather than other factors.

    Authors: We agree that the Results section should contain these details for verifiability. The Methods section already describes the within-subjects design, counterbalancing, task instructions, and data collection procedure, but we will expand the Results section to report the specific statistical tests (paired t-tests or Wilcoxon signed-rank tests as appropriate after normality checks), exact p-values, effect sizes (Cohen’s d or r), confidence intervals, and any exclusion criteria applied (e.g., incomplete sessions or technical failures). This will make clear that the reported gains are tied to the gaze-grounded condition rather than other factors. revision: yes

  2. Referee: [Qualitative Results and Discussion] Qualitative Results and Discussion: The paper acknowledges 'challenges when interpretations of gaze behaviors were inaccurate' but reports no quantitative metrics for the accuracy of difficulty-point detection (e.g., precision/recall against user self-reports, post-hoc annotations, or eye-tracking validation). This is critical because the core mechanism—mapping egocentric gaze overlays to 'likely points of difficulty'—underpins the claimed benefits; without such validation, improvements could arise from video context alone.

    Authors: We accept this point. Our primary outcome measures were end-to-end user ratings, recall performance, and interaction efficiency; we did not pre-register or collect explicit per-difficulty-point self-reports or annotations for quantitative validation of the detection step. In revision we will add a new subsection that (a) reports any post-hoc agreement that can be computed from existing session logs and video, (b) quantifies the frequency of acknowledged inaccurate interpretations from the qualitative data, and (c) explicitly states this as a limitation while arguing that the statistically significant user-outcome improvements still demonstrate value beyond video alone. We cannot, however, generate new ground-truth labels without additional annotation effort or a follow-up study. revision: partial
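The statistical reporting promised in the first response above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' analysis: the function name `paired_summary` and the six recall scores are hypothetical, and the sketch covers only the paired t statistic and the paired-samples effect size (Cohen's d_z). It omits the normality check, the Wilcoxon alternative, and the p-value, which would need a t-distribution CDF from a statistics library.

```python
import math
import statistics


def paired_summary(control, treatment):
    """Summarize a within-subjects comparison of two conditions.

    Returns the mean difference, the paired t statistic, its degrees of
    freedom, and Cohen's d_z (mean difference / SD of the differences).
    """
    if len(control) != len(treatment):
        raise ValueError("paired samples must have equal length")
    diffs = [t - c for c, t in zip(control, treatment)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff
    return {"n": n, "mean_diff": mean_diff, "t": t_stat, "df": n - 1, "d_z": d_z}


# Synthetic recall scores for six hypothetical participants (NOT study data).
text_only = [4, 5, 3, 6, 5, 4]
gaze_aware = [6, 6, 5, 7, 5, 6]

result = paired_summary(text_only, gaze_aware)
print(result)
```

Reporting d_z alongside the test statistic matters in a within-subjects design because it is scaled by the variability of each participant's own difference, not by between-participant spread.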

standing simulated objections not resolved
  • Quantitative precision/recall metrics for difficulty-point detection cannot be added without new post-hoc annotation or additional data collection, as these were not part of the original study protocol.
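If per-passage ground truth were collected in a follow-up, the detection step could be scored as in the sketch below. Everything here is an assumption for illustration: the function name `detection_metrics`, the idea of indexing passages by sentence number, and the example sets are hypothetical, and self-reports are only one of the ground-truth sources the referee mentions.

```python
def detection_metrics(flagged, reported):
    """Score difficulty-point detection against ground-truth labels.

    flagged:  passage indices the assistant marked as difficult
    reported: passage indices the reader marked as difficult
    Returns (precision, recall, f1); each is 0.0 when undefined.
    """
    flagged, reported = set(flagged), set(reported)
    true_pos = len(flagged & reported)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(reported) if reported else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical sentence indices for one reading session (NOT study data).
flagged_by_assistant = {2, 5, 7, 11}
self_reported = {2, 7, 9}

precision, recall, f1 = detection_metrics(flagged_by_assistant, self_reported)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low precision here would mean the assistant often flags passages the reader did not find hard, which is the failure mode the qualitative results describe as inaccurate gaze interpretations.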

Circularity Check

0 steps flagged

No circularity: empirical user study with direct measurements

full rationale

This is a controlled empirical comparison study (n=36) evaluating a gaze-aware multimodal LLM assistant against a text-only baseline through user ratings, recall tests, word counts, and qualitative feedback. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text or abstract. Claims rest on observed differences in the study data rather than on any self-referential construction, load-bearing self-citation, or renamed known result. The outcomes are measured directly, and none reduces to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that gaze behavior serves as a valid proxy for identifying cognitive difficulty points, drawn from HCI but not independently tested here.

axioms (1)
  • domain assumption: Gaze patterns overlaid on egocentric video can be used by an LLM to identify likely points of user difficulty during reading or other tasks.
    This premise underpins the design of the gaze-grounded assistant and the interpretation of study results.

pith-pipeline@v0.9.0 · 5467 in / 1220 out tokens · 48698 ms · 2026-05-10T17:38:22.650432+00:00 · methodology

