pith. machine review for the scientific record.

arxiv: 2604.10108 · v2 · submitted 2026-04-11 · 💻 cs.HC

Recognition: unknown

JARVIS: A Just-in-Time AR Visual Instruction System for Cross-Reality Task Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:16 UTC · model grok-4.3

classification 💻 cs.HC
keywords augmented reality · vision-language models · cross-reality tasks · task guidance · user study · procedural instructions · mixed reality · real-time verification

The pith

JARVIS uses one vision-language model prompt to deliver real-time AR step-by-step guidance for tasks that combine physical objects and virtual elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JARVIS to reduce the constant switching between instructions and actions that disrupts many everyday tasks involving both real and virtual components. A formative study identifies four categories of cross-reality tasks and the need for state awareness and coordination. The system generates contextual guidance and verifies progress on the fly using a single prompt to a vision-language model. In a within-subjects study with 14 participants across four domains, JARVIS produced better usability scores, lower workload, higher success rates, and more effective visualizations than baseline approaches.

Core claim

JARVIS is a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. Formative study of cross-reality tasks identifies requirements for state awareness and cross-reality coordination across four task types: real-to-real, real-to-virtual, virtual-to-real, and virtual-to-virtual. A within-subjects study with 14 participants across four domains shows improvements in usability, workload, success rate, and visualization effectiveness over baselines.

What carries the argument

Single-prompt vision-language model that generates step-by-step AR instructions while performing real-time state verification and adaptive visual feedback for cross-reality tasks.
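To make the mechanism concrete, the loop is: capture the current camera frame, send it with the task context in one structured prompt, parse the model's reply, and update the AR overlay. The sketch below is an illustration under assumptions, not the authors' implementation: `call_vlm`, `capture_frame`, and `render_overlay` are hypothetical placeholders, and the reply fields (`next`, `success`, `check`, `objectViz`, `actionViz`) echo prompt fragments from the paper, while the surrounding control flow is a guess.

```python
# Minimal sketch of a single-prompt guidance/verification loop.
# Hypothetical throughout: call_vlm wraps some vision-language model API
# (the paper mentions the Gemini API), capture_frame returns the current
# AR camera image, render_overlay draws the adaptive visual feedback.
import json

def call_vlm(prompt: str, image_bytes: bytes) -> str:
    """Placeholder for a real VLM client call returning a JSON string."""
    raise NotImplementedError

def guidance_loop(task_description: str, capture_frame, render_overlay, max_iters: int = 100):
    """One structured prompt per iteration drives both instruction
    generation and real-time state verification."""
    for _ in range(max_iters):
        frame = capture_frame()  # current camera image as bytes
        prompt = (
            f"Task: {task_description}\n"
            "Based on the current photo, fill in as JSON: "
            "next (sub-goal if the current step is not yet reached), "
            "success (true/false: is the current step reached?), "
            "check (what to inspect further if unsure), "
            "objectViz ('Outline' or 'ShapePreview'), "
            "actionViz ('Arrow' or 'Gesture')."
        )
        step = json.loads(call_vlm(prompt, frame))
        render_overlay(step)                      # adaptive AR visual feedback
        if step.get("success") and not step.get("next"):
            break                                 # task verified complete
```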

If this is right

  • Users complete hybrid physical-virtual tasks with reduced context switching and cognitive load.
  • Instructions adapt automatically as the system checks real-time task progress.
  • The approach supports all four identified cross-reality task transitions without separate handling.
  • Success rates rise and workload falls compared with manual tutorials or non-adaptive AR systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer or more open-ended tasks could require additional verification steps beyond single-prompt operation.
  • The same mechanism might apply to collaborative settings where multiple users share the same mixed workspace.
  • Integration with other sensing modalities could further improve state verification accuracy.

Load-bearing premise

A single-prompt vision-language model can reliably create accurate, context-aware step-by-step guidance and verify task states in real time without frequent errors or hallucinations across varied cross-reality scenarios.

What would settle it

Observing that the vision-language model produces incorrect instructions or misses state changes in multiple real-world cross-reality test cases, resulting in user performance no better than or worse than the baseline systems.
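One way to produce that evidence is a per-step audit of session logs, comparing the VLM's verdicts against human annotation. The sketch below shows the kind of metrics involved; the log format and field names are hypothetical, since the paper does not specify one.

```python
# Hypothetical per-step audit: rates of incorrect instructions, missed
# state changes, and false completions, given annotated session logs.
from dataclasses import dataclass

@dataclass
class StepLog:
    instruction_correct: bool   # annotator: was the generated instruction valid?
    vlm_said_done: bool         # VLM's real-time "success" verdict
    actually_done: bool         # annotator: was the step actually completed?

def failure_rates(logs: list[StepLog]) -> dict[str, float]:
    n = len(logs)
    return {
        "incorrect_instruction_rate": sum(not s.instruction_correct for s in logs) / n,
        "missed_state_change_rate": sum(s.actually_done and not s.vlm_said_done for s in logs) / n,
        "false_completion_rate": sum(s.vlm_said_done and not s.actually_done for s in logs) / n,
    }
```

High rates here across the four task types, together with user performance no better than the baselines, would falsify the load-bearing premise.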

Figures

Figures reproduced from arXiv: 2604.10108 by Chenfanfu Jiang, Jiayin Lu, Ying Jiang, Yin Yang, Yong-Hong Kuo, Yusi Sun.

Figure 1. JARVIS instructs a user to make a latte with a coffee machine. Tasks are categorized by step type, Reality-to-Reality (R2R), … view at source ↗
Figure 2. Examples from the formative study comparing video, … view at source ↗
Figure 3. Comparison of error rate, completion time, and user … view at source ↗
Figure 4. Cross-Reality Step Types. Steps are categorized into … view at source ↗
Figure 5. Guidance components: (a) state cues provide feedback … view at source ↗
Figure 6. System architecture of JARVIS, illustrated with an origami boat-folding task, comprising four components: (1) … view at source ↗
Figure 7. JARVIS guidance examples across the four user study tasks: (1) coffee machine latte-making with target configuration … view at source ↗
Figure 8. Comparison of task outcomes across three guidance … view at source ↗
Figure 9. The statistical analysis of different evaluation results. view at source ↗
Figure 10. Step-level completion heatmap across three systems (JARVIS, … view at source ↗
Figure 11. Distribution of participants’ ratings on the perceived … view at source ↗
read the original abstract

Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. To inform the system design, we conducted a formative study to understand guidance needs across cross-reality tasks, which we categorize into four types, real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). A within-subjects study (N=14) across four domains shows JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JARVIS, a VLM-driven AR instruction system for cross-reality task guidance. It begins with a formative study identifying requirements for state awareness and cross-reality coordination, categorizing tasks into R2R, R2V, V2R, and V2V types. The system uses a single prompt to generate contextual step-by-step guidance with real-time state verification and adaptive visual feedback. A within-subjects user study (N=14) across four domains evaluates the system, claiming improvements in usability, workload, success rate, and visualization effectiveness over baselines.

Significance. If the results hold with stronger quantification of the VLM component, the work could meaningfully advance HCI research on AI-augmented AR tutorials by extending support to hybrid physical-virtual workspaces, a gap in prior systems focused mainly on physical tasks. The cross-reality taxonomy and integration of VLM for just-in-time adaptive feedback represent useful contributions, supported by the empirical evaluation design.

major comments (2)
  1. [§5 (User Study)] The within-subjects study (N=14) reports improvements in usability, workload, success rate, and visualization effectiveness, but provides no details on exact baselines, statistical tests performed, task definitions, or quantitative error rates/hallucination counts for the VLM's guidance generation and state verification. Without these, the results cannot isolate whether gains are attributable to reliable VLM automation or to the AR visualization layer alone.
  2. [§4 (System Design)] The central design claim that a single-prompt VLM reliably generates accurate, context-aware step-by-step guidance and performs real-time state verification across R2R/R2V/V2R/V2V scenarios lacks supporting metrics on VLM accuracy, failure modes, or error rates. This unquantified reliability directly undercuts attribution of the reported success-rate and usability benefits to the VLM component.
minor comments (2)
  1. [Abstract] The abstract and study summary could more explicitly list the baselines compared against and the primary dependent variables with their measurement methods.
  2. [Figures/Tables] Figure captions and table legends would benefit from clearer indication of which conditions correspond to JARVIS versus baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important opportunities to improve the clarity and rigor of our evaluation and system description. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5 (User Study)] The within-subjects study (N=14) reports improvements in usability, workload, success rate, and visualization effectiveness, but provides no details on exact baselines, statistical tests performed, task definitions, or quantitative error rates/hallucination counts for the VLM's guidance generation and state verification. Without these, the results cannot isolate whether gains are attributable to reliable VLM automation or to the AR visualization layer alone.

    Authors: We agree that additional detail is warranted. In the revised manuscript we will expand §5 to explicitly describe the two baseline conditions (a non-adaptive AR overlay system and a static video tutorial system), report the statistical tests performed (paired t-tests for SUS and NASA-TLX scores, McNemar’s test for binary success rates, with exact p-values and effect sizes), and provide concrete task definitions with examples for each of the four cross-reality categories. We will also add a post-hoc analysis of VLM behavior drawn from the study session logs, reporting the observed rate of guidance steps that required user correction or were factually incorrect. These additions will make the attribution of benefits to the VLM component versus the AR layer more transparent. revision: yes

  2. Referee: [§4 (System Design)] The central design claim that a single-prompt VLM reliably generates accurate, context-aware step-by-step guidance and performs real-time state verification across R2R/R2V/V2R/V2V scenarios lacks supporting metrics on VLM accuracy, failure modes, or error rates. This unquantified reliability directly undercuts attribution of the reported success-rate and usability benefits to the VLM component.

    Authors: We acknowledge that §4 currently emphasizes design rationale over quantitative VLM evaluation. We will revise this section to include metrics on VLM performance obtained from the user-study interaction logs, specifically the percentage of correctly generated steps, the rate of successful real-time state verifications, and a categorized summary of observed failure modes (e.g., occasional misrecognition of virtual objects in V2V tasks). These data will directly support the reliability claims and help readers assess the VLM’s contribution to the reported outcomes. revision: yes
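Response 1 names paired t-tests for SUS and NASA-TLX and McNemar's test for binary success rates. A minimal sketch of how such tests are typically run with SciPy and statsmodels follows; the scores and counts are illustrative placeholders, not the paper's data.

```python
# Paired t-test (SUS) with a paired effect size, and McNemar's test for
# paired binary success, for N=14 participants. All numbers are placeholders.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

sus_jarvis = np.array([82, 75, 90, 78, 85, 88, 72, 80, 77, 84, 91, 79, 83, 86], float)
sus_baseline = np.array([65, 70, 72, 60, 68, 74, 58, 66, 71, 63, 69, 62, 67, 73], float)

t_stat, p_val = ttest_rel(sus_jarvis, sus_baseline)   # paired t-test
diff = sus_jarvis - sus_baseline
cohens_d = diff.mean() / diff.std(ddof=1)              # effect size on paired differences
print(f"SUS: t={t_stat:.2f}, p={p_val:.4f}, d={cohens_d:.2f}")

# 2x2 table of paired success outcomes: rows = baseline (success, fail),
# columns = JARVIS (success, fail); counts sum to the 14 participants.
table = np.array([[6, 1],
                  [5, 2]])
res = mcnemar(table, exact=True)                       # exact binomial form
print(f"Success: McNemar statistic={res.statistic}, p={res.pvalue:.4f}")
```

With N=14 the exact (binomial) form of McNemar's test is generally preferred over the chi-square approximation, which is why `exact=True` is set.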

Circularity Check

0 steps flagged

No circularity: claims rest on empirical user study and formative requirements gathering, not derivations or self-referential fits.

full rationale

The paper describes a VLM-driven AR system for cross-reality tasks, informed by a formative study identifying four task types (R2R/R2V/V2R/V2V) and evaluated in a within-subjects user study (N=14) measuring usability, workload, success rate, and visualization effectiveness. No equations, parameter fits, predictions, or uniqueness theorems appear in the provided text. Central claims derive directly from study outcomes rather than reducing to inputs by construction, self-citation chains, or renamed empirical patterns. The VLM reliability assumption is noted as unquantified but does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions from HCI and AI without introducing new free parameters or invented entities beyond the described system.

axioms (2)
  • domain assumption VLMs can generate reliable step-by-step guidance and perform real-time state verification from visual input in AR settings
    Required for the core functionality of contextual guidance and adaptive feedback
  • domain assumption Reducing context switching improves usability and success rates in hybrid physical-virtual tasks
    Basis for the formative study and claimed benefits

pith-pipeline@v0.9.0 · 5540 in / 1233 out tokens · 38948 ms · 2026-05-10T16:16:09.925678+00:00 · methodology

discussion (0)

