pith. machine review for the scientific record.

arxiv: 2604.10108 · v2 · submitted 2026-04-11 · 💻 cs.HC

Recognition: unknown

JARVIS: A Just-in-Time AR Visual Instruction System for Cross-Reality Task Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:16 UTC · model grok-4.3

classification 💻 cs.HC
keywords augmented reality · vision-language models · cross-reality tasks · task guidance · user study · procedural instructions · mixed reality · real-time verification

The pith

JARVIS uses one vision-language model prompt to deliver real-time AR step-by-step guidance for tasks that combine physical objects and virtual elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JARVIS to reduce the constant switching between instructions and actions that disrupts many everyday tasks involving both real and virtual components. A formative study identifies four categories of cross-reality tasks and the need for state awareness and coordination. The system generates contextual guidance and verifies progress on the fly using a single prompt to a vision-language model. In a within-subjects study with 14 participants across four domains, JARVIS produced better usability scores, lower workload, higher success rates, and more effective visualizations than baseline approaches.

Core claim

JARVIS is a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. Formative study of cross-reality tasks identifies requirements for state awareness and cross-reality coordination across four task types: real-to-real, real-to-virtual, virtual-to-real, and virtual-to-virtual. A within-subjects study with 14 participants across four domains shows improvements in usability, workload, success rate, and visualization effectiveness over baselines.

What carries the argument

Single-prompt vision-language model that generates step-by-step AR instructions while performing real-time state verification and adaptive visual feedback for cross-reality tasks.
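To make the mechanism concrete, the loop is: capture the current camera frame, send it with the task context in one structured prompt, parse the model's reply, and update the AR overlay. The sketch below is an illustration under assumptions, not the authors' implementation: `call_vlm`, `capture_frame`, and `render_overlay` are hypothetical placeholders, and the reply fields (`next`, `success`, `check`, `objectViz`, `actionViz`) echo prompt fragments from the paper, while the surrounding control flow is a guess.

```python
# Minimal sketch of a single-prompt guidance/verification loop.
# Hypothetical throughout: call_vlm wraps some vision-language model API
# (the paper mentions the Gemini API), capture_frame returns the current
# AR camera image, render_overlay draws the adaptive visual feedback.
import json

def call_vlm(prompt: str, image_bytes: bytes) -> str:
    """Placeholder for a real VLM client call returning a JSON string."""
    raise NotImplementedError

def guidance_loop(task_description: str, capture_frame, render_overlay, max_iters: int = 100):
    """One structured prompt per iteration drives both instruction
    generation and real-time state verification."""
    for _ in range(max_iters):
        frame = capture_frame()  # current camera image as bytes
        prompt = (
            f"Task: {task_description}\n"
            "Based on the current photo, fill in as JSON: "
            "next (sub-goal if the current step is not yet reached), "
            "success (true/false: is the current step reached?), "
            "check (what to inspect further if unsure), "
            "objectViz ('Outline' or 'ShapePreview'), "
            "actionViz ('Arrow' or 'Gesture')."
        )
        step = json.loads(call_vlm(prompt, frame))
        render_overlay(step)                      # adaptive AR visual feedback
        if step.get("success") and not step.get("next"):
            break                                 # task verified complete
```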

If this is right

  • Users complete hybrid physical-virtual tasks with reduced context switching and cognitive load.
  • Instructions adapt automatically as the system checks real-time task progress.
  • The approach supports all four identified cross-reality task transitions without separate handling.
  • Success rates rise and workload falls compared with manual tutorials or non-adaptive AR systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer or more open-ended tasks could require additional verification steps beyond single-prompt operation.
  • The same mechanism might apply to collaborative settings where multiple users share the same mixed workspace.
  • Integration with other sensing modalities could further improve state verification accuracy.

Load-bearing premise

A single-prompt vision-language model can reliably create accurate, context-aware step-by-step guidance and verify task states in real time without frequent errors or hallucinations across varied cross-reality scenarios.

What would settle it

Observing that the vision-language model produces incorrect instructions or misses state changes in multiple real-world cross-reality test cases, resulting in user performance no better than or worse than the baseline systems.
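One way to produce that evidence is a per-step audit of session logs, comparing the VLM's verdicts against human annotation. The sketch below shows the kind of metrics involved; the log format and field names are hypothetical, since the paper does not specify one.

```python
# Hypothetical per-step audit: rates of incorrect instructions, missed
# state changes, and false completions, given annotated session logs.
from dataclasses import dataclass

@dataclass
class StepLog:
    instruction_correct: bool   # annotator: was the generated instruction valid?
    vlm_said_done: bool         # VLM's real-time "success" verdict
    actually_done: bool         # annotator: was the step actually completed?

def failure_rates(logs: list[StepLog]) -> dict[str, float]:
    n = len(logs)
    return {
        "incorrect_instruction_rate": sum(not s.instruction_correct for s in logs) / n,
        "missed_state_change_rate": sum(s.actually_done and not s.vlm_said_done for s in logs) / n,
        "false_completion_rate": sum(s.vlm_said_done and not s.actually_done for s in logs) / n,
    }
```

High rates here across the four task types, together with user performance no better than the baselines, would falsify the load-bearing premise.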

Figures

Figures reproduced from arXiv: 2604.10108 by Chenfanfu Jiang, Jiayin Lu, Ying Jiang, Yin Yang, Yong-Hong Kuo, Yusi Sun.

Figure 1. JARVIS instructs a user to make a latte with a coffee machine. Tasks are categorized by step type, Reality-to-Reality (R2R), … view at source ↗
Figure 2. Examples from the formative study comparing video, … view at source ↗
Figure 3. Comparison of error rate, completion time, and user … view at source ↗
Figure 4. Cross-Reality Step Types. Steps are categorized into … view at source ↗
Figure 5. Guidance components: (a) state cues provide feedback … view at source ↗
Figure 6. System architecture of JARVIS, illustrated with an origami boat-folding task, comprising four components: (1) … view at source ↗
Figure 7. JARVIS guidance examples across the four user study tasks: (1) coffee machine latte-making with target configuration … view at source ↗
Figure 8. Comparison of task outcomes across three guidance … view at source ↗
Figure 9. The statistical analysis of different evaluation results. view at source ↗
Figure 10. Step-level completion heatmap across three systems (JARVIS, … view at source ↗
Figure 11. Distribution of participants’ ratings on the perceived … view at source ↗
read the original abstract

Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. To inform the system design, we conducted a formative study to understand guidance needs across cross-reality tasks, which we categorize into four types, real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). A within-subjects study (N=14) across four domains shows JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JARVIS, a VLM-driven AR instruction system for cross-reality task guidance. It begins with a formative study identifying requirements for state awareness and cross-reality coordination, categorizing tasks into R2R, R2V, V2R, and V2V types. The system uses a single prompt to generate contextual step-by-step guidance with real-time state verification and adaptive visual feedback. A within-subjects user study (N=14) across four domains evaluates the system, claiming improvements in usability, workload, success rate, and visualization effectiveness over baselines.

Significance. If the results hold with stronger quantification of the VLM component, the work could meaningfully advance HCI research on AI-augmented AR tutorials by extending support to hybrid physical-virtual workspaces, a gap in prior systems focused mainly on physical tasks. The cross-reality taxonomy and integration of VLM for just-in-time adaptive feedback represent useful contributions, supported by the empirical evaluation design.

major comments (2)
  1. [§5 (User Study)] The within-subjects study (N=14) reports improvements in usability, workload, success rate, and visualization effectiveness, but provides no details on exact baselines, statistical tests performed, task definitions, or quantitative error rates/hallucination counts for the VLM's guidance generation and state verification. Without these, the results cannot isolate whether gains are attributable to reliable VLM automation or to the AR visualization layer alone.
  2. [§4 (System Design)] The central design claim that a single-prompt VLM reliably generates accurate, context-aware step-by-step guidance and performs real-time state verification across R2R/R2V/V2R/V2V scenarios lacks supporting metrics on VLM accuracy, failure modes, or error rates. This unquantified reliability directly undercuts attribution of the reported success-rate and usability benefits to the VLM component.
minor comments (2)
  1. [Abstract] The abstract and study summary could more explicitly list the baselines compared against and the primary dependent variables with their measurement methods.
  2. [Figures/Tables] Figure captions and table legends would benefit from clearer indication of which conditions correspond to JARVIS versus baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important opportunities to improve the clarity and rigor of our evaluation and system description. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5 (User Study)] The within-subjects study (N=14) reports improvements in usability, workload, success rate, and visualization effectiveness, but provides no details on exact baselines, statistical tests performed, task definitions, or quantitative error rates/hallucination counts for the VLM's guidance generation and state verification. Without these, the results cannot isolate whether gains are attributable to reliable VLM automation or to the AR visualization layer alone.

    Authors: We agree that additional detail is warranted. In the revised manuscript we will expand §5 to explicitly describe the two baseline conditions (a non-adaptive AR overlay system and a static video tutorial system), report the statistical tests performed (paired t-tests for SUS and NASA-TLX scores, McNemar’s test for binary success rates, with exact p-values and effect sizes), and provide concrete task definitions with examples for each of the four cross-reality categories. We will also add a post-hoc analysis of VLM behavior drawn from the study session logs, reporting the observed rate of guidance steps that required user correction or were factually incorrect. These additions will make the attribution of benefits to the VLM component versus the AR layer more transparent. revision: yes

  2. Referee: [§4 (System Design)] The central design claim that a single-prompt VLM reliably generates accurate, context-aware step-by-step guidance and performs real-time state verification across R2R/R2V/V2R/V2V scenarios lacks supporting metrics on VLM accuracy, failure modes, or error rates. This unquantified reliability directly undercuts attribution of the reported success-rate and usability benefits to the VLM component.

    Authors: We acknowledge that §4 currently emphasizes design rationale over quantitative VLM evaluation. We will revise this section to include metrics on VLM performance obtained from the user-study interaction logs, specifically the percentage of correctly generated steps, the rate of successful real-time state verifications, and a categorized summary of observed failure modes (e.g., occasional misrecognition of virtual objects in V2V tasks). These data will directly support the reliability claims and help readers assess the VLM’s contribution to the reported outcomes. revision: yes
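Response 1 names paired t-tests for SUS and NASA-TLX and McNemar's test for binary success rates. A minimal sketch of how such tests are typically run with SciPy and statsmodels follows; the scores and counts are illustrative placeholders, not the paper's data.

```python
# Paired t-test (SUS) with a paired effect size, and McNemar's test for
# paired binary success, for N=14 participants. All numbers are placeholders.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

sus_jarvis = np.array([82, 75, 90, 78, 85, 88, 72, 80, 77, 84, 91, 79, 83, 86], float)
sus_baseline = np.array([65, 70, 72, 60, 68, 74, 58, 66, 71, 63, 69, 62, 67, 73], float)

t_stat, p_val = ttest_rel(sus_jarvis, sus_baseline)   # paired t-test
diff = sus_jarvis - sus_baseline
cohens_d = diff.mean() / diff.std(ddof=1)              # effect size on paired differences
print(f"SUS: t={t_stat:.2f}, p={p_val:.4f}, d={cohens_d:.2f}")

# 2x2 table of paired success outcomes: rows = baseline (success, fail),
# columns = JARVIS (success, fail); counts sum to the 14 participants.
table = np.array([[6, 1],
                  [5, 2]])
res = mcnemar(table, exact=True)                       # exact binomial form
print(f"Success: McNemar statistic={res.statistic}, p={res.pvalue:.4f}")
```

With N=14 the exact (binomial) form of McNemar's test is generally preferred over the chi-square approximation, which is why `exact=True` is set.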

Circularity Check

0 steps flagged

No circularity: claims rest on empirical user study and formative requirements gathering, not derivations or self-referential fits.

full rationale

The paper describes a VLM-driven AR system for cross-reality tasks, informed by a formative study identifying four task types (R2R/R2V/V2R/V2V) and evaluated in a within-subjects user study (N=14) measuring usability, workload, success rate, and visualization effectiveness. No equations, parameter fits, predictions, or uniqueness theorems appear in the provided text. Central claims derive directly from study outcomes rather than reducing to inputs by construction, self-citation chains, or renamed empirical patterns. The VLM reliability assumption is noted as unquantified but does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions from HCI and AI without introducing new free parameters or invented entities beyond the described system.

axioms (2)
  • domain assumption VLMs can generate reliable step-by-step guidance and perform real-time state verification from visual input in AR settings
    Required for the core functionality of contextual guidance and adaptive feedback
  • domain assumption Reducing context switching improves usability and success rates in hybrid physical-virtual tasks
    Basis for the formative study and claimed benefits

pith-pipeline@v0.9.0 · 5540 in / 1233 out tokens · 38948 ms · 2026-05-10T16:16:09.925678+00:00 · methodology

discussion (0)

