
arxiv: 2602.03059 · v2 · submitted 2026-02-03 · 💻 cs.HC · cs.CL · cs.ET · cs.IR

From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

Pith reviewed 2026-05-16 08:14 UTC · model grok-4.3

classification 💻 cs.HC · cs.CL · cs.ET · cs.IR
keywords Speech-to-Spatial · augmented reality · remote assistance · referent disambiguation · voice interfaces · spatial grounding · human-computer interaction · AR guidance

The pith

Speech-to-Spatial converts spoken remote-assistance instructions into persistent AR visual guidance using speech alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Speech-to-Spatial, a framework that turns verbal remote-assistance instructions into spatially grounded AR overlays on a live shared view. It does so by first identifying one of four recurring speech patterns for referring to objects—Direct Attribute, Relational, Remembrance, or Chained—then mapping those cues to an object-centric relational graph. The resulting AR visuals stay visible in place, cutting down on repeated verbal corrections like “a bit more to the right.” A user study found the approach faster and less mentally demanding than voice-only guidance. This matters because many remote help tasks still rely on imprecise back-and-forth talk that the system aims to replace with direct visual cues.

Core claim

Speech-to-Spatial infers the intended target solely from spoken references by parsing four characterized patterns (Direct Attribute, Relational, Remembrance, and Chained), grounds them to an object-centric relational graph, and renders persistent in-situ AR visual guidance that improves task efficiency, reduces cognitive load, and enhances usability compared with a conventional voice-only baseline.

What carries the argument

An object-centric relational graph: referent cues are parsed from spoken utterances according to the four speech patterns, grounded to the graph, and rendered as persistent AR visuals.
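To make the grounding step concrete, here is a minimal Python sketch of how a parsed reference might be resolved against an object-centric relational graph. It is an illustration under stated assumptions, not the paper's implementation: the class names, the relation vocabulary (`left_of`, `right_of`, `next_to`), the query format, and the demo scene are all hypothetical. The reading of Direct Attribute as attribute filtering, Relational as a single edge lookup, and Chained as nested lookups is inferred from the pattern names only. Remembrance, which would presumably need prior memories, is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A node in the hypothetical object-centric graph."""
    obj_id: str
    label: str
    attributes: set = field(default_factory=set)
    position: tuple = (0.0, 0.0, 0.0)  # where an AR marker would be anchored


class RelationalGraph:
    """Objects as nodes, directed spatial relations as edges (assumed design)."""

    def __init__(self):
        self.objects = {}
        self.edges = set()  # (subject_id, relation, object_id)

    def add_object(self, obj):
        self.objects[obj.obj_id] = obj

    def relate(self, subject_id, relation, object_id):
        self.edges.add((subject_id, relation, object_id))

    def resolve(self, label, attributes=(), relation=None, anchor=None):
        """Return every object that satisfies the parsed reference.

        label/attributes  -> Direct Attribute ("the red mug")
        relation + anchor -> Relational ("the mug left of the laptop")
        nested anchor     -> Chained (the anchor is itself a query dict)
        """
        candidates = [o for o in self.objects.values()
                      if o.label == label and set(attributes) <= o.attributes]
        if relation is not None and anchor is not None:
            anchor_ids = {o.obj_id for o in self.resolve(**anchor)}
            candidates = [c for c in candidates
                          if any((c.obj_id, relation, a) in self.edges
                                 for a in anchor_ids)]
        return candidates


def build_demo_graph():
    """A made-up desk scene, used only to exercise the sketch."""
    g = RelationalGraph()
    g.add_object(SceneObject("mug_1", "mug", {"red"}, (0.2, 0.0, 0.5)))
    g.add_object(SceneObject("mug_2", "mug", {"blue"}, (0.8, 0.0, 0.5)))
    g.add_object(SceneObject("laptop_1", "laptop", {"silver"}, (0.5, 0.0, 0.5)))
    g.add_object(SceneObject("lamp_1", "lamp", {"black"}, (0.5, 0.0, 0.9)))
    g.relate("mug_1", "left_of", "laptop_1")
    g.relate("mug_2", "right_of", "laptop_1")
    g.relate("laptop_1", "next_to", "lamp_1")
    return g


graph = build_demo_graph()
ambiguous = graph.resolve("mug")                       # two mugs match
direct = graph.resolve("mug", {"red"})                 # Direct Attribute
relational = graph.resolve("mug", relation="left_of",
                           anchor={"label": "laptop"})  # Relational
chained = graph.resolve("mug", relation="right_of",
                        anchor={"label": "laptop", "relation": "next_to",
                                "anchor": {"label": "lamp"}})  # Chained
```

In this sketch a reference is treated as grounded only when `resolve` returns exactly one object. An empty or multi-object result is the case where a real system would have to ask for clarification or decline to place a marker.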

If this is right

  • Reduces the volume of iterative micro-guidance phrases during remote assistance.
  • Transforms disembodied verbal instructions into visually explainable, actionable AR guidance on a shared view.
  • Supports both remote guided assistance and intent disambiguation scenarios.
  • Delivers measurable gains in task completion time and lower cognitive load versus voice-only systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-pattern parser could be tested in non-remote AR settings such as in-person collaborative design or training.
  • If speech patterns prove incomplete in noisy environments, the graph could later accept supplemental visual data without redesigning the core grounding step.
  • Success in reducing verbal clarifications suggests the approach may lower error rates in time-critical tasks like equipment repair or medical guidance.
  • The object graph could be extended to track dynamic objects whose positions change during the session.

Load-bearing premise

Referent cues can be reliably parsed and grounded solely from spoken references using the four characterized patterns without additional cues such as gesture or gaze.

What would settle it

An experiment in which participants issue references that fall outside the four patterns, resulting in frequent incorrect AR placements and no measured gains in speed or reduced cognitive load over voice-only instructions.
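One way such an experiment could be scored is sketched below in Python: tally AR placement accuracy per reference pattern and pool everything outside the four patterns into a separate bucket. The trial format, the function name, and the sample trials are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict

PATTERNS = ("Direct Attribute", "Relational", "Remembrance", "Chained")


def placement_accuracy(trials):
    """Compute the fraction of correct AR placements per pattern bucket.

    trials: iterable of (pattern_label, placed_correctly) pairs.
    Any label not in PATTERNS is pooled under "Out-of-pattern".
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for pattern, correct in trials:
        bucket = pattern if pattern in PATTERNS else "Out-of-pattern"
        counts[bucket][0] += int(correct)
        counts[bucket][1] += 1
    return {bucket: c / n for bucket, (c, n) in counts.items()}


# Placeholder trials for illustration only; these are not study results.
sample_trials = [
    ("Direct Attribute", True), ("Direct Attribute", True),
    ("Relational", True), ("Relational", False),
    ("Chained", True),
    ("Freeform", False), ("Freeform", False), ("Deictic", True),
]
accuracy = placement_accuracy(sample_trials)
```

A markedly lower score in the out-of-pattern bucket than in the four named buckets would be the kind of evidence the paragraph above describes.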

Figures

Figures reproduced from arXiv: 2602.03059 by Arie Kaufman, Devshree Jadeja, Divyansh Pradhan, Yoonsang Kim.

Figure 1: Concept illustration of Speech-to-Spatial, disambiguating verbal descriptions of a referent and situating AR visual guiders
Figure 2: End-to-end pipeline of Speech-to-Spatial: From speech with visual inputs and prior memories (if present), Speech-to-Spatial extracts
Figure 3: Attribute parsing: Transcribed text of verbal instructions is
Figure 5: Three use case scenarios of Speech-to-Spatial.
Figure 6: Comparison of median task completion time per referenc
Original abstract

We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance using only spoken references. Based on a formative study, it identifies four recurring speech referencing patterns—Direct Attribute, Relational, Remembrance, and Chained—and grounds them to an object-centric relational graph. The system renders persistent in-situ AR visual guidance to reduce iterative verbal micro-guidance. Demonstrations include remote guided assistance and intent disambiguation, with an evaluation claiming superior task efficiency, reduced cognitive load, and better usability compared to a voice-only baseline.

Significance. Should the evaluation prove robust, the work offers a meaningful contribution to human-computer interaction in augmented reality by enabling speech-only spatial grounding for remote collaboration. This could lower barriers in scenarios where gestures or gaze are impractical, transforming verbal instructions into actionable visual overlays. The characterization of speech patterns and the object-centric graph provide a structured approach that may generalize beyond the demonstrated use cases, though its practical significance hinges on the unquantified reliability of the parsing component.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability rests on an evaluation whose details are absent from the abstract and not sufficiently elaborated. No information is provided on participant numbers, study design, specific metrics (e.g., task completion time, NASA-TLX), statistical tests, or error analysis. Without these, it is impossible to verify that the reported benefits are attributable to the speech-to-spatial grounding pipeline rather than the AR rendering step alone.
  2. [System Description] System / Referent Parsing: The headline result depends on the assumption that referent cues can be reliably parsed and grounded solely from the four speech patterns without additional cues. The manuscript does not report parsing success rates, accuracy of the object-centric relational graph resolution, or breakdown of failures in live AR settings. If resolution accuracy is low, the measured gains would be artifacts of only the easy cases.
minor comments (2)
  1. [Abstract] Abstract: Typo in 'Speechto-Spatial' (missing hyphen); should be consistent with the title 'Speech-to-Spatial'.
  2. [Abstract] Abstract: The evaluation claim is stated without any preview of quantitative results or key metrics, which weakens the summary for an HCI audience.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional detail will strengthen the presentation of our evaluation and system components. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Evaluation] Evaluation section: The central claim that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability rests on an evaluation whose details are absent from the abstract and not sufficiently elaborated. No information is provided on participant numbers, study design, specific metrics (e.g., task completion time, NASA-TLX), statistical tests, or error analysis. Without these, it is impossible to verify that the reported benefits are attributable to the speech-to-spatial grounding pipeline rather than the AR rendering step alone.

    Authors: We agree that the evaluation section requires more elaboration to allow readers to fully assess the claims. In the revised manuscript we will expand this section with the number of participants, a description of the study design (including the within-subjects comparison to the voice-only baseline), the specific metrics collected (task completion time, NASA-TLX scores, and usability ratings), the statistical tests applied, and an error analysis. These additions will clarify how the measured improvements are attributable to the referent disambiguation and grounding pipeline rather than AR rendering in isolation. revision: yes

  2. Referee: [System Description] System / Referent Parsing: The headline result depends on the assumption that referent cues can be reliably parsed and grounded solely from the four speech patterns without additional cues. The manuscript does not report parsing success rates, accuracy of the object-centric relational graph resolution, or breakdown of failures in live AR settings. If resolution accuracy is low, the measured gains would be artifacts of only the easy cases.

    Authors: We acknowledge that quantitative reporting on parsing performance is necessary to substantiate the system’s reliability. Although the current manuscript emphasizes end-to-end task outcomes, we will add a dedicated subsection reporting parsing success rates, the accuracy of object-centric graph resolution, and a breakdown of observed failures from the live demonstrations. This will allow readers to evaluate whether the reported gains generalize beyond easy cases. revision: yes
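The within-subjects comparison mentioned in the first response could be analysed along the following lines. This Python sketch runs an exact paired sign-flip permutation test on per-participant task completion times. The choice of test is an assumption made for illustration, since the source does not say which statistical tests the authors applied, and the timing values are placeholders rather than study data.

```python
from itertools import product


def paired_permutation_test(baseline, treatment):
    """Exact two-sided sign-flip permutation test for paired samples.

    Returns (mean paired difference, p-value). Under the null hypothesis the
    sign of each paired difference is arbitrary, so all 2**n sign assignments
    are enumerated. This is only practical for small n.
    """
    diffs = [b - t for b, t in zip(baseline, treatment)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
    return sum(diffs) / n, extreme / 2 ** n


# Placeholder completion times in seconds, one pair per participant.
voice_only = [52.0, 61.0, 48.0, 70.0, 55.0, 66.0]
with_ar_guidance = [41.0, 50.0, 45.0, 58.0, 47.0, 60.0]

mean_diff, p_value = paired_permutation_test(voice_only, with_ar_guidance)
```

A positive `mean_diff` means the baseline was slower on average. Because every placeholder participant is faster with AR guidance, only the all-positive and all-negative sign assignments are as extreme as the observed one, which gives the smallest p-value six pairs can produce (2 out of 64).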

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on empirical comparison to baseline


The paper describes a new Speech-to-Spatial framework whose design is informed by a formative study of four speech patterns, then reports user-study results on task efficiency, cognitive load, and usability versus a voice-only baseline. No equations, fitted parameters, or derivations appear in the provided text. The central claims do not reduce by construction to self-definitions, renamed inputs, or self-citation chains; the formative study supplies design motivation while the measured gains are obtained from independent participant data. This is the expected non-circular outcome for an HCI system paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the validity of speech referencing patterns identified in the formative study and the effectiveness of the object-centric relational graph in accurately mapping utterances to spatial targets in real time.

axioms (1)
  • [domain assumption] People use recurring speech referencing patterns that can be categorized into Direct Attribute, Relational, Remembrance, and Chained.
    Derived from the formative study of speech referencing patterns mentioned in the abstract.
invented entities (1)
  • object-centric relational graph [no independent evidence]
    purpose: To ground parsed referent cues from utterances to spatial locations for AR visual guidance.
    Core component introduced to map speech input to persistent in-situ overlays.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VisionClaw: Always-On AI Agents through Smart Glasses

    cs.HC 2026-04 unverdicted novelty 5.0

    VisionClaw couples continuous egocentric vision on smart glasses with speech-driven AI agents to enable hands-free real-world tasks, with lab and field studies showing faster completion and a shift toward opportunisti...

© 2026 IEEE. This is the author's version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR). The final version of this record is available at: 10.1109/VR67842.2026.00045