Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pith reviewed 2026-05-08 17:11 UTC · model grok-4.3
The pith
Pro²Assist uses continuous multimodal reasoning from AR glasses to deliver timely proactive assistance throughout long-horizon procedural tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pro²Assist is a step-aware proactive assistant that continuously tracks fine-grained task progress and reasons over the user's evolving state to provide timely assistance throughout tasks. It leverages multimodal data from augmented reality glasses for motion-based perception, extracts step-oriented procedural context from multi-scale temporal dynamics and task-specific expert knowledge, and performs continuous reasoning to infer user needs and display timely assistance on the glasses themselves.
What carries the argument
The continuous reasoning process that fuses motion-based egocentric perception from AR glasses with step-oriented procedural context drawn from multi-scale temporal dynamics and task-specific expert knowledge.
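To make this machinery concrete, the sketch below shows one way such a continuous perception-context-reasoning loop could be organized. Every name in it (perceive_motion, extract_step_context, infer_assistance) is a hypothetical placeholder for illustration, not Pro²Assist's actual interface.

```python
# Illustrative sketch only: a continuous step-aware assistance loop.
# Function names and data shapes are assumptions, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class TaskState:
    step_index: int = 0                              # which procedural step the user is on
    history: list = field(default_factory=list)      # recent motion/step observations

def perceive_motion(frame):
    """Placeholder for motion-based egocentric perception (video, IMU, gaze)."""
    return {"action": "stir", "confidence": 0.8}

def extract_step_context(state, expert_steps):
    """Placeholder for step-oriented context from temporal dynamics + expert knowledge."""
    nxt = min(state.step_index + 1, len(expert_steps) - 1)
    return {"current_step": expert_steps[state.step_index], "next_step": expert_steps[nxt]}

def infer_assistance(observation, context):
    """Placeholder reasoning: decide whether timely help is warranted right now."""
    if observation["confidence"] < 0.5:
        return f"Reminder: you are on '{context['current_step']}'."
    return None

def assist_loop(frames, expert_steps):
    state = TaskState()
    for frame in frames:
        obs = perceive_motion(frame)
        state.history.append(obs)
        ctx = extract_step_context(state, expert_steps)
        hint = infer_assistance(obs, ctx)
        if hint is not None:
            yield hint   # in the real system this would be rendered on the AR display
```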
If this is right
- The system outperforms the strongest prior baselines by more than 21 percent in procedural action understanding accuracy.
- It reaches up to 2.29 times the proactive timing accuracy of existing methods.
- A user study with 20 participants indicates that 90 percent find the assistance useful in practice.
- The approach supports continuous help across entire long-horizon sequences rather than isolated short events.
Where Pith is reading between the lines
- The same continuous-reasoning pattern could be adapted to non-glasses wearables if equivalent motion and context streams are available.
- Extending the context extraction step to new task domains would require only updated expert knowledge rather than retraining the entire perception pipeline.
- If timing accuracy gains hold, the method could reduce user errors in safety-critical procedures by surfacing reminders before mistakes occur.
Load-bearing premise
Multimodal data from AR glasses together with extracted step-oriented procedural context can reliably and continuously infer the user's evolving state and needs across diverse real-world long-horizon tasks.
What would settle it
A side-by-side trial in which participants complete varied long procedural tasks while wearing the glasses, with independent observers measuring whether the system's assistance appears at moments that actually match the user's demonstrated needs and whether action-understanding accuracy exceeds the best baseline by the reported margin.
Original abstract
Procedural tasks with multiple ordered steps are ubiquitous in daily life. Recent advances in multimodal large language models (MLLMs) have enabled personal assistants that support daily activities. However, existing systems primarily provide reactive guidance triggered by user queries, or limited proactive assistance for isolated short-term events rather than long-horizon procedural tasks. In this work, we introduce Pro²Assist, a step-aware proactive assistant that continuously tracks fine-grained task progress and reasons over the user's evolving state to provide timely assistance throughout tasks. Pro²Assist leverages multimodal data from augmented reality (AR) glasses to achieve motion-based perception. It then extracts step-oriented procedural context from multi-scale temporal dynamics and task-specific expert knowledge. Based on both sensory input and procedural context, Pro²Assist performs continuous reasoning to infer user needs and display timely assistance on AR glasses. We evaluate Pro²Assist using a dataset curated from public sources and a real-world dataset collected on our testbed with AR glasses. Extensive evaluations show that Pro²Assist outperforms the best-performing baselines by over 21% in procedural action understanding accuracy, and it achieves up to 2.29× the proactive timing accuracy of baselines. A user study with 20 participants further shows that 90% find Pro²Assist useful, indicating its effectiveness for real-world procedural assistance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pro²Assist, a system for continuous step-aware proactive assistance in long-horizon procedural tasks. It uses multimodal egocentric perception from AR glasses for motion-based tracking of user actions and progress. The core pipeline extracts step-oriented procedural context from multi-scale temporal dynamics combined with task-specific expert knowledge, then performs continuous reasoning over sensory input and this context to infer evolving user needs and deliver timely assistance displayed on the AR glasses. Evaluations on a dataset curated from public sources plus a real-world testbed dataset show Pro²Assist outperforming best baselines by over 21% in procedural action understanding accuracy and up to 2.29× in proactive timing accuracy; a user study with 20 participants reports 90% finding the system useful.
Significance. If the central claims hold under more rigorous generality testing, the work would be significant for advancing proactive multimodal AI assistants beyond reactive or short-horizon systems, particularly through the integration of AR-based egocentric perception with procedural reasoning. Credit is due for the real-world data collection on an AR testbed, the quantitative gains over baselines, and the inclusion of a user study providing practical validation. The approach addresses a relevant gap in continuous assistance for everyday procedural activities.
major comments (3)
- [§3.2] §3.2 (Procedural Context Extraction): The extraction of step-oriented procedural context explicitly incorporates 'task-specific expert knowledge' alongside multi-scale temporal dynamics. No ablation study isolates the contribution of this expert-knowledge component, nor is there evidence (e.g., zero-shot transfer results or description of how the knowledge is obtained) that it generalizes beyond the evaluated task families. This is load-bearing for the headline claims of reliable continuous inference of user state across diverse long-horizon procedures.
- [§4.1] §4.1 (Datasets and Evaluation Setup): The real-world testbed dataset is described only at high level ('curated from public sources and real-world testbed'). Missing are concrete details on the number of distinct long-horizon tasks, total procedure instances, diversity metrics, and exactly how task-specific expert knowledge was supplied or tuned for each task. Without these, the reported 21% action-understanding and 2.29× timing gains cannot be assessed for robustness outside the training distribution.
- [§4.3] §4.3 (User Study): The claim that 90% of 20 participants found Pro²Assist useful is presented without statistical analysis, task breakdown, or comparison of subjective metrics against the same baselines used in the quantitative experiments. This weakens the supporting evidence for real-world effectiveness.
minor comments (2)
- [Abstract] The term 'proactive timing accuracy' is used in the abstract and results but is not formally defined until later; an early definition or equation would improve readability (one plausible formulation is sketched after this list).
- [Figures] Figure captions for the system overview and qualitative examples could include more detail on what each panel illustrates (e.g., specific steps or failure modes of baselines).
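Because the metric is not defined in the excerpted text, the following is only one plausible formulation, offered for readability: score proactive timing as the fraction of ground-truth moments of need for which an assistance event arrives within a small tolerance window. Both the tolerance value and the metric itself are assumptions, not the paper's definition.

```python
# Assumed formulation, for illustration only: fraction of ground-truth "need"
# moments covered by an assistance event within ±tol seconds.
def timing_accuracy(assist_times, need_times, tol=3.0):
    """assist_times, need_times: timestamps in seconds."""
    if not need_times:
        return 0.0
    hits = sum(1 for need in need_times
               if any(abs(a - need) <= tol for a in assist_times))
    return hits / len(need_times)

# Example: 3 of 4 needs are met within the window -> 0.75
print(timing_accuracy([10.0, 32.5, 61.0], [9.0, 30.0, 45.0, 60.0]))
```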
Simulated Author's Rebuttal
Thank you for the detailed and constructive feedback on our manuscript. We appreciate the referee's recognition of the significance of our work on proactive assistance for long-horizon tasks. Below, we provide point-by-point responses to the major comments and indicate the revisions we will make to address them.
Point-by-point responses
Referee: [§3.2] §3.2 (Procedural Context Extraction): The extraction of step-oriented procedural context explicitly incorporates 'task-specific expert knowledge' alongside multi-scale temporal dynamics. No ablation study isolates the contribution of this expert-knowledge component, nor is there evidence (e.g., zero-shot transfer results or description of how the knowledge is obtained) that it generalizes beyond the evaluated task families. This is load-bearing for the headline claims of reliable continuous inference of user state across diverse long-horizon procedures.
Authors: We agree that isolating the contribution of the task-specific expert knowledge is important for validating its role in the system. In the revised manuscript, we will include an ablation study that compares the full Pro²Assist pipeline against a variant without the expert knowledge component, reporting the resulting drops in action understanding and timing accuracy. Additionally, we will provide a detailed description of how the expert knowledge is obtained—specifically, by structuring standard procedural instructions from task documentation into a step-oriented format—and demonstrate its application across the evaluated task families. While we do not claim zero-shot generalization to entirely novel task domains outside our testbed, we will include results on held-out procedures within the same families to support the generalization claims within the scope of our evaluation. revision: yes
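A minimal sketch of the kind of ablation described above, assuming a hypothetical evaluate() harness that returns action-understanding and timing accuracy for a given pipeline configuration; the numbers are placeholders, not reported results.

```python
# Hypothetical ablation harness; evaluate() and its outputs are stand-ins.
def evaluate(config, dataset):
    """Return (action_understanding_acc, proactive_timing_acc) for a pipeline config."""
    # Placeholder values for illustration only.
    return (0.85, 0.70) if config["expert_knowledge"] else (0.64, 0.52)

def ablate_expert_knowledge(dataset):
    full = evaluate({"expert_knowledge": True}, dataset)
    ablated = evaluate({"expert_knowledge": False}, dataset)
    return {"action_acc_drop": round(full[0] - ablated[0], 3),
            "timing_acc_drop": round(full[1] - ablated[1], 3)}
```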
Referee: [§4.1] §4.1 (Datasets and Evaluation Setup): The real-world testbed dataset is described only at high level ('curated from public sources and real-world testbed'). Missing are concrete details on the number of distinct long-horizon tasks, total procedure instances, diversity metrics, and exactly how task-specific expert knowledge was supplied or tuned for each task. Without these, the reported 21% action-understanding and 2.29× timing gains cannot be assessed for robustness outside the training distribution.
Authors: We acknowledge that more granular dataset statistics are necessary for assessing the robustness of our results. In the revised version, we will expand Section 4.1 to include: the exact number of distinct long-horizon tasks (broken down by category, e.g., cooking, assembly, maintenance), the total number of procedure instances, diversity metrics such as average number of steps per task and variance in task duration, and a precise description of how task-specific expert knowledge was supplied (via manual structuring of public task manuals into our procedural context format) and tuned (using a consistent template across tasks with task-specific step names and dependencies). This will allow readers to better evaluate the generalizability of the reported performance gains. revision: yes
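A short sketch of the promised dataset summary, assuming purely for illustration that each procedure instance is a record with a task name, category, step list, and duration in seconds.

```python
# Assumed record format, for illustration:
#   {"task": "brew_coffee", "category": "cooking", "steps": [...], "duration_s": 412}
from collections import Counter
from statistics import mean, pvariance

def dataset_summary(instances):
    tasks = {i["task"] for i in instances}
    per_category = Counter(i["category"] for i in instances)
    steps = [len(i["steps"]) for i in instances]
    durations = [i["duration_s"] for i in instances]
    return {
        "distinct_tasks": len(tasks),
        "procedure_instances": len(instances),
        "instances_per_category": dict(per_category),
        "avg_steps_per_instance": mean(steps),
        "duration_variance_s2": pvariance(durations),
    }
```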
Referee: [§4.3] §4.3 (User Study): The claim that 90% of 20 participants found Pro²Assist useful is presented without statistical analysis, task breakdown, or comparison of subjective metrics against the same baselines used in the quantitative experiments. This weakens the supporting evidence for real-world effectiveness.
Authors: We agree that additional analysis would strengthen the user study section. In the revision, we will add statistical analysis to the 90% usefulness finding, such as confidence intervals or binomial tests for significance. We will also provide a task breakdown showing usefulness ratings per task category and include comparisons of subjective metrics (e.g., perceived timeliness and helpfulness) against the baseline systems from the quantitative experiments. These additions will be based on the existing participant responses and will be presented in a new table or figure. revision: yes
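A sketch of the proposed statistical analysis, assuming the reported 90% corresponds to 18 of 20 participants; the one-sided null of 50% and the exact (Clopper-Pearson) interval are illustrative choices, not the authors'.

```python
# Illustrative analysis of the usefulness rating, assuming 18/20 positive responses.
from scipy.stats import binomtest

k, n = 18, 20
result = binomtest(k, n, p=0.5, alternative="greater")          # null: no better than chance
ci = result.proportion_ci(confidence_level=0.95, method="exact")

print(f"usefulness rate = {k/n:.0%}")
print(f"one-sided binomial p-value = {result.pvalue:.4f}")
print(f"95% Clopper-Pearson CI = [{ci.low:.2f}, {ci.high:.2f}]")
```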
Circularity Check
No significant circularity; claims rest on empirical evaluation
Full rationale
The paper presents a system architecture for proactive assistance using AR glasses, multimodal perception, and procedural context extraction, then reports performance gains via comparisons to baselines on public and real-world datasets plus a user study. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce any central claim to its own inputs by construction. The evaluation methodology is described as independent of the system outputs being measured.