From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

Arie Kaufman; Devshree Jadeja; Divyansh Pradhan; Yoonsang Kim

arXiv: 2602.03059 · v2 · submitted 2026-02-03 · cs.HC · cs.CL · cs.ET · cs.IR

Yoonsang Kim, Divyansh Pradhan, Devshree Jadeja, Arie Kaufman

Pith reviewed 2026-05-16 08:14 UTC · model grok-4.3

classification cs.HC · cs.CL · cs.ET · cs.IR

keywords Speech-to-Spatial · augmented reality · remote assistance · referent disambiguation · voice interface · spatial grounding · human-computer interaction · AR guidance

The pith

Speech-to-Spatial converts spoken remote-assistance instructions into persistent AR visual guidance using speech alone, with no gesture or gaze input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Speech-to-Spatial, a framework that turns verbal remote-assistance instructions into spatially grounded AR overlays on a live shared view. It does so by first identifying one of four recurring speech patterns for referring to objects—Direct Attribute, Relational, Remembrance, or Chained—then mapping those cues to an object-centric relational graph. The resulting AR visuals stay visible in place, cutting down on repeated verbal corrections like “a bit more to the right.” A user study found the approach faster and less mentally demanding than voice-only guidance. This matters because many remote help tasks still rely on imprecise back-and-forth talk that the system aims to replace with direct visual cues.

Core claim

Speech-to-Spatial infers the intended target solely from spoken references by parsing four characterized patterns (Direct Attribute, Relational, Remembrance, and Chained), grounds them to an object-centric relational graph, and renders persistent in-situ AR visual guidance that improves task efficiency, reduces cognitive load, and enhances usability compared with a conventional voice-only baseline.

What carries the argument

Object-centric relational graph that parses referent cues from spoken utterances according to the four speech patterns and renders them as persistent AR visuals.
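
To make the idea concrete, here is a minimal Python sketch of how an object-centric relational graph might resolve a Direct Attribute reference ("the red mug") and a Relational reference ("the mug to the right of the laptop"). Everything in it is an assumption for illustration: the `SceneObject` and `RelationalGraph` names, the attribute sets, the x-axis rule for `left_of` / `right_of`, and the sample scene are hypothetical and are not taken from the paper, whose graph construction is not described in this summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class SceneObject:
    """One graph node: a detected object with attributes and a 3D position."""
    obj_id: str
    label: str
    attributes: set[str] = field(default_factory=set)
    # Assumed convention: x grows to the right in the shared view.
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)


class RelationalGraph:
    """Hypothetical graph: nodes are objects, edges are spatial relations."""

    def __init__(self) -> None:
        self.nodes: dict[str, SceneObject] = {}
        # Each edge is (subject_id, relation, anchor_id).
        self.edges: set[tuple[str, str, str]] = set()

    def add_object(self, obj: SceneObject) -> None:
        self.nodes[obj.obj_id] = obj

    def build_edges(self) -> None:
        """Derive coarse left_of / right_of edges from x coordinates only."""
        self.edges.clear()
        for a in self.nodes.values():
            for b in self.nodes.values():
                if a.obj_id == b.obj_id:
                    continue
                if a.position[0] < b.position[0]:
                    self.edges.add((a.obj_id, "left_of", b.obj_id))
                elif a.position[0] > b.position[0]:
                    self.edges.add((a.obj_id, "right_of", b.obj_id))

    def by_attributes(self, label: str, attributes: Iterable[str] = ()) -> list[SceneObject]:
        """Direct Attribute lookup: match the label and every requested attribute."""
        wanted = set(attributes)
        matches = [o for o in self.nodes.values()
                   if o.label == label and wanted <= o.attributes]
        return sorted(matches, key=lambda o: o.obj_id)

    def by_relation(self, label: str, relation: str, anchor_id: str) -> list[SceneObject]:
        """Relational lookup: objects with `label` that hold `relation` to the anchor."""
        matches = [self.nodes[s] for (s, r, a) in self.edges
                   if r == relation and a == anchor_id and self.nodes[s].label == label]
        return sorted(matches, key=lambda o: o.obj_id)


# Made-up scene used only to exercise the sketch.
graph = RelationalGraph()
graph.add_object(SceneObject("mug_1", "mug", {"red"}, (-0.4, 0.0, 1.0)))
graph.add_object(SceneObject("mug_2", "mug", {"blue"}, (0.5, 0.0, 1.0)))
graph.add_object(SceneObject("laptop_1", "laptop", {"silver"}, (0.0, 0.0, 1.0)))
graph.build_edges()

direct = graph.by_attributes("mug", {"red"})                    # "the red mug"
relational = graph.by_relation("mug", "right_of", "laptop_1")   # "the mug to the right of the laptop"
print([o.obj_id for o in direct], [o.obj_id for o in relational])
```

In a design like this, a lookup that returns exactly one candidate could anchor an overlay directly, while zero or several candidates mark the cases where some form of clarification would still be needed.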

If this is right

Reduces the volume of iterative micro-guidance phrases during remote assistance.
Transforms disembodied verbal instructions into visually explainable, actionable AR guidance on a shared view.
Supports both remote guided assistance and intent disambiguation scenarios.
Delivers measurable gains in task completion time and lower cognitive load versus voice-only systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same four-pattern parser could be tested in non-remote AR settings such as in-person collaborative design or training.
If speech patterns prove incomplete in noisy environments, the graph could later accept supplemental visual data without redesigning the core grounding step.
Success in reducing verbal clarifications suggests the approach may lower error rates in time-critical tasks like equipment repair or medical guidance.
The object graph could be extended to track dynamic objects whose positions change during the session.

Load-bearing premise

Referent cues can be reliably parsed and grounded solely from spoken references using the four characterized patterns without additional cues such as gesture or gaze.

What would settle it

An experiment in which participants issue references that fall outside the four patterns, resulting in frequent incorrect AR placements and no measured gains in speed or reduced cognitive load over voice-only instructions.

Figures

Figures reproduced from arXiv: 2602.03059 by Arie Kaufman, Devshree Jadeja, Divyansh Pradhan, Yoonsang Kim.

**Figure 1.** Concept illustration of Speech-to-Spatial, disambiguating verbal descriptions of a referent and situating AR visual guiders

**Figure 2.** End-to-end pipeline of Speech-to-Spatial: From speech with visual inputs and prior memories (if present), Speech-to-Spatial extracts...

**Figure 3.** Attribute parsing: Transcribed text of verbal instructions is...

**Figure 5.** Three use case scenarios of Speech-to-Spatial.

**Figure 6.** Comparison of median task completion time per referenc...

Original abstract

We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Speech-to-Spatial gives a workable speech-only AR grounding pipeline built on four reference patterns from a formative study, but the evaluation provides no numbers on parsing accuracy, which undercuts the efficiency claims.

The paper's main contribution is a concrete way to map spoken references to an object-centric graph for live AR overlays in remote assistance. It draws four recurring patterns (Direct Attribute, Relational, Remembrance, and Chained) from the authors' formative study and uses them to turn verbal instructions into persistent visual guidance without needing gestures or manual labels. That setup directly targets the back-and-forth clarifications common in voice-only remote help, which is a practical pain point in technical domains.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance using only spoken references. Based on a formative study, it identifies four recurring speech referencing patterns—Direct Attribute, Relational, Remembrance, and Chained—and grounds them to an object-centric relational graph. The system renders persistent in-situ AR visual guidance to reduce iterative verbal micro-guidance. Demonstrations include remote guided assistance and intent disambiguation, with an evaluation claiming superior task efficiency, reduced cognitive load, and better usability compared to a voice-only baseline.

Significance. Should the evaluation prove robust, the work offers a meaningful contribution to human-computer interaction in augmented reality by enabling speech-only spatial grounding for remote collaboration. This could lower barriers in scenarios where gestures or gaze are impractical, transforming verbal instructions into actionable visual overlays. The characterization of speech patterns and the object-centric graph provide a structured approach that may generalize beyond the demonstrated use cases, though its practical significance hinges on the unquantified reliability of the parsing component.

major comments (2)

[Evaluation] Evaluation section: The central claim that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability rests on an evaluation whose details are absent from the abstract and not sufficiently elaborated. No information is provided on participant numbers, study design, specific metrics (e.g., task completion time, NASA-TLX), statistical tests, or error analysis. Without these, it is impossible to verify that the reported benefits are attributable to the speech-to-spatial grounding pipeline rather than the AR rendering step alone.
[System Description] System / Referent Parsing: The headline result depends on the assumption that referent cues can be reliably parsed and grounded solely from the four speech patterns without additional cues. The manuscript does not report parsing success rates, accuracy of the object-centric relational graph resolution, or breakdown of failures in live AR settings. If resolution accuracy is low, the measured gains would be artifacts of only the easy cases.

minor comments (2)

[Abstract] Abstract: Typo in 'Speechto-Spatial' (missing hyphen); should be consistent with the title 'Speech-to-Spatial'.
[Abstract] Abstract: The evaluation claim is stated without any preview of quantitative results or key metrics, which weakens the summary for an HCI audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional detail will strengthen the presentation of our evaluation and system components. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses

Referee: [Evaluation] Evaluation section: The central claim that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability rests on an evaluation whose details are absent from the abstract and not sufficiently elaborated. No information is provided on participant numbers, study design, specific metrics (e.g., task completion time, NASA-TLX), statistical tests, or error analysis. Without these, it is impossible to verify that the reported benefits are attributable to the speech-to-spatial grounding pipeline rather than the AR rendering step alone.

Authors: We agree that the evaluation section requires more elaboration to allow readers to fully assess the claims. In the revised manuscript we will expand this section with the number of participants, a description of the study design (including the within-subjects comparison to the voice-only baseline), the specific metrics collected (task completion time, NASA-TLX scores, and usability ratings), the statistical tests applied, and an error analysis. These additions will clarify how the measured improvements are attributable to the referent disambiguation and grounding pipeline rather than AR rendering in isolation.

Revision: yes

Referee: [System Description] System / Referent Parsing: The headline result depends on the assumption that referent cues can be reliably parsed and grounded solely from the four speech patterns without additional cues. The manuscript does not report parsing success rates, accuracy of the object-centric relational graph resolution, or breakdown of failures in live AR settings. If resolution accuracy is low, the measured gains would be artifacts of only the easy cases.

Authors: We acknowledge that quantitative reporting on parsing performance is necessary to substantiate the system’s reliability. Although the current manuscript emphasizes end-to-end task outcomes, we will add a dedicated subsection reporting parsing success rates, the accuracy of object-centric graph resolution, and a breakdown of observed failures from the live demonstrations. This will allow readers to evaluate whether the reported gains generalize beyond easy cases.

Revision: yes
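
The kind of per-pattern breakdown requested in the second comment could be tallied along the following lines. This is a minimal Python sketch under stated assumptions: the `accuracy_by_pattern` helper, the trial tuple layout, and the object identifiers are hypothetical, and the trial log is placeholder data invented purely to exercise the function. None of it reflects results from the paper.

```python
from collections import defaultdict


def accuracy_by_pattern(trials):
    """Tally grounding accuracy per speech pattern.

    trials: iterable of (pattern, predicted_object_id, true_object_id).
    Returns {pattern: (correct, total, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [correct, total]
    for pattern, predicted, truth in trials:
        counts[pattern][1] += 1
        if predicted == truth:
            counts[pattern][0] += 1
    return {p: (c, n, c / n) for p, (c, n) in counts.items()}


# Placeholder log (NOT data from the paper); None marks a failed resolution.
placeholder_trials = [
    ("Direct Attribute", "mug_1", "mug_1"),
    ("Direct Attribute", "mug_2", "mug_2"),
    ("Relational", "mug_2", "mug_2"),
    ("Relational", "mug_1", "mug_2"),
    ("Chained", None, "screw_3"),
]

report = accuracy_by_pattern(placeholder_trials)
for pattern, (correct, total, rate) in report.items():
    print(f"{pattern}: {correct}/{total} ({rate:.0%})")
```

Reporting the rate separately for each pattern, rather than as one pooled figure, is what would show whether any measured gains come mostly from the easier reference types.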

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on empirical comparison to baseline

full rationale

The paper describes a new Speech-to-Spatial framework whose design is informed by a formative study of four speech patterns, then reports user-study results on task efficiency, cognitive load, and usability versus a voice-only baseline. No equations, fitted parameters, or derivations appear in the provided text. The central claims do not reduce by construction to self-definitions, renamed inputs, or self-citation chains; the formative study supplies design motivation while the measured gains are obtained from independent participant data. This is the expected non-circular outcome for an HCI system paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the validity of speech referencing patterns identified in the formative study and the effectiveness of the object-centric relational graph in accurately mapping utterances to spatial targets in real time.

axioms (1)

domain assumption: People use recurring speech referencing patterns that can be categorized into Direct Attribute, Relational, Remembrance, and Chained.
Derived from the formative study of speech referencing patterns mentioned in the abstract.
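
As a rough illustration of what sorting an utterance into these four patterns could look like, the following Python sketch uses keyword cues. It is a toy heuristic under stated assumptions, not the paper's parser: the cue lists, the `classify` function, and the reading of each pattern (Remembrance as a reference to something seen or used earlier, Chained as several relations strung together) are hypothetical interpretations of the pattern names.

```python
import re
from enum import Enum


class Pattern(Enum):
    DIRECT_ATTRIBUTE = "Direct Attribute"
    RELATIONAL = "Relational"
    REMEMBRANCE = "Remembrance"
    CHAINED = "Chained"


# Hypothetical cue lists chosen for illustration; not taken from the paper.
RELATION_CUES = re.compile(
    r"\b(?:left of|right of|next to|behind|in front of|on top of|under|above|below)\b"
)
MEMORY_CUES = re.compile(r"\b(?:earlier|previously|last time|just now)\b")


def classify(utterance: str) -> Pattern:
    """Assign one of the four patterns using simple keyword counts."""
    text = utterance.lower()
    if MEMORY_CUES.search(text):
        return Pattern.REMEMBRANCE          # refers back to a prior moment
    relation_hits = len(RELATION_CUES.findall(text))
    if relation_hits >= 2:
        return Pattern.CHAINED              # several relations in sequence
    if relation_hits == 1:
        return Pattern.RELATIONAL           # one anchor object
    return Pattern.DIRECT_ATTRIBUTE         # attributes only


examples = [
    "the red mug",
    "the mug next to the laptop",
    "the cable you used earlier",
    "the screw under the panel next to the fan",
]
for sentence in examples:
    print(f"{sentence!r} -> {classify(sentence).value}")
```

A heuristic like this always returns one of the four labels, so it cannot signal an utterance that fits none of them; that gap is exactly what the domain assumption above takes for granted.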

invented entities (1)

object-centric relational graph (no independent evidence)
purpose: To ground parsed referent cues from utterances to spatial locations for AR visual guidance.
Core component introduced to map speech input to persistent in-situ overlays.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VisionClaw: Always-On AI Agents through Smart Glasses
cs.HC · 2026-04 · unverdicted · novelty 5.0

VisionClaw couples continuous egocentric vision on smart glasses with speech-driven AI agents to enable hands-free real-world tasks, with lab and field studies showing faster completion and a shift toward opportunisti...

© 2026 IEEE. This is the author’s version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR). The final version of this record is available at: 10.1109/VR67842.2026.00045

[1] [1]

H. Bai, P. Sasikumar, J. Yang, and M. Billinghurst. A user study on mixed reality remote collaboration with eye gaze and hand gesture sharing. InProc. of ACM CHI, pp. 1–13, 2020. 1

work page 2020

[2] [2]

put-that-there

R. A. Bolt. “put-that-there” voice and gesture at the graphics interface. InProc. of SIGGRAPH, pp. 262–270, 1980. 1, 2, 4

work page 1980

[3] [3]

R. Bovo, D. Giunchi, P. Cascarano, E. J. Gonzalez, and M. Gonzalez- Franco. Revisiting put-that-there, context aware window interactions via llms.arXiv preprint arXiv:2511.02378, 2025. 2

work page arXiv 2025

[4] [4]

M. Brehmer. Video-conferencing beyond screen-sharing and thumb- nail webcam videos: Gesture-aware augmented reality video for data- rich remote presentations.arXiv preprint arXiv:2501.05345, 2025. 2

work page arXiv 2025

[5] [5]

S. E. Brennan and H. H. Clark. Conceptual pacts and lexical choice in conversation.Journal of experimental psychology: Learning, memory, and cognition, 22(6):1482, 1996. 2, 3

work page 1996

[6] [6]

Bressa, J

N. Bressa, J. Vermeulen, and W. Willett. Data every day: Designing and living with personal situated visualizations. InProc. of ACM CHI, pp. 1–18, 2022. 3

work page 2022

[7] [7]

Carbonell and S

N. Carbonell and S. Kieffer. Do oral messages help visual search. Advances in natural multimodal dialogue systems, 30:131–157, 2005. 2

work page 2005

[8] [8]

R. S. M. Chan, A. Marx, A. Kim, and M. El-Assady. A design space for intelligent dialogue augmentation. InProc. of IUI, pp. 18–36,

work page

[9] [9]

Chen, P.-L

S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev. Lan- guage conditioned spatial relation reasoning for 3d object grounding. Neurips, 35:20522–20535, 2022. 2

work page 2022

[10] [10]

H. H. Clark and S. E. Brennan. Grounding in communication. In L. Resnick, L. B., M. John, S. Teasley, and D., eds.,Perspectives on Socially Shared Cognition, pp. 13–1991. APA, 1991. 3

work page 1991

[11] [11]

F. I. Do ˘gan, S. Kalkan, and I. Leite. Learning to generate unambiguous spatial referring expressions for real-world environments. InProc. of IEEE/RSJ IROS, pp. 4992–4999, 2019. 2

work page 2019

[12] [12]

M. D. Dogan, E. J. Gonzalez, K. Ahuja, R. Du, A. Colac ¸o, J. Lee, M. Gonzalez-Franco, and D. Kim. Augmented object intelligence with XR-Objects. InProc. of ACM UIST, pp. 1–15, 2024. 2

work page 2024

[13] [13]

dos Santos Silva and I

D. dos Santos Silva and I. Paraboni. Generating spatial referring ex- pressions in interactive 3d worlds.Spatial Cognition & Computation, 15(3):186–225, 2015. 2

work page 2015

[14] R. Druta, C. Druta, P. Negirla, and I. Silea. A review on methods and systems for remote collaboration. Applied Sciences, 11(21):10035,

[15] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson. From local to global: A graph rag approach to query-focused summarization. arXiv:2404.16130, 2024. 2, 5

[16] A. Evgrashin. Whisper for unity. https://github.com/Macoron/whisper.unity/tree/master, 2024. Aug. 31. 2024. 4

[17] C. G. Fidalgo, Y. Yan, H. Cho, M. Sousa, D. Lindlbauer, and J. Jorge. A survey on remote assistance and training in mixed reality environments. IEEE TVCG, 29(5):2291–2303, 2023. 2

[18] D. I. Fink, J. Zagermann, H. Reiterer, and H.-C. Jetter. Re-locations: Augmenting personal and shared workspaces to support remote collaboration in incongruent spaces. Proc. of ACM HCI, 6(ISS):1–30,

[19] A. Garnham. A unified theory of the meaning of some spatial relational terms. Cognition, 31(1):45–60, 1989. 3

[20] J. E. S. Grønbæk, K. Pfeuffer, E. Velloso, M. Astrup, M. I. S. Pedersen, M. Kjær, G. Leiva, and H. Gellersen. Partially blended realities: Aligning dissimilar spaces for distributed mixed reality meetings. In Proc. of ACM CHI, pp. 1–16, 2023. 2

[21] J. Grubert, T. Langlotz, S. Zollmann, and H. Regenbrecht. Towards pervasive augmented reality: Context-awareness in augmented reality. IEEE TVCG, 23(6):1706–1724, 2016. 3

[22] Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In Proc. of IEEE ICRA, pp. 5021–5028, 2024. 2, 5

[23] P. Gurevich, J. Lanir, B. Cohen, and R. Stone. Teleadvisor: a versatile augmented reality tool for remote assistance. In Proc. of ACM CHI, pp. 619–622, 2012. 2

[24] C. Han and K. E. Isaacs. A deixis-centered approach for documenting remote synchronous communication around data visualizations. IEEE TVCG, 2024. 2

[25] D. Hepperle, Y. Weiß, A. Siess, and M. Wölfel. 2d, 3d or speech? a case study on which user interface is preferable for what kind of object interaction in immersive virtual reality. Computers & Graphics, 82:321–331, 2019. 9

[26] P. Howlader, H. Nguyen-Canh, S. Das, J. Xu, H. Le, and D. Samaras. Cora: Consistency-guided semi-supervised framework for reasoning segmentation. In Proc. of IEEE/CVF WACV, 2026. 2

[27] X. Hu, D. Ma, F. He, Z. Zhu, S.-K. Hsia, C. Zhu, Z. Liu, and K. Ramani. Gesprompt: Leveraging co-speech gestures to augment llm-based interaction in virtual reality. In Proc. of ACM DIS, pp. 59–80,

[28] S. Jadon, M. Faridan, E. Mah, R. Vaish, W. Willett, and R. Suzuki. Augmented conversation with embedded speech-driven on-the-fly referencing in ar. arXiv preprint arXiv:2405.18537, 2024. 1, 2

[29] S. Jang, E.-J. Ko, and W. Woo. Unified user-centric context: Who, where, when, what, how and why. In Proc. of UbiPCMM, 2005. 4, 5

[30] K. Johannsen and J. P. D. Ruiter. Reference frame selection in dialog: priming or preference? Frontiers in Human Neuroscience, 7:667,

[31] R. Kartmann and T. Asfour. Interactive and incremental learning of spatial object relations from human demonstrations. Frontiers in Robotics and AI, 10:1151303, 2023. 2

[32] D. Kim, T. Ha, J. Hong, S. Kim, S. Choi, H. Ko, and W. Woo. Meta-objects: Interactive and multisensory virtual objects learned from the real world for use in augmented reality. IEEE CG&A, 45(3):134–143,

[33] H. Kim, E. Hu, and S. Heo. Spaceshare: Leveraging multimodal context for fluid sharing of spaces in video meetings. In Proc. of ACM UIST-Adjunct, pp. 1–3, 2025. 2

[34] H. Kim, T. Matuszka, J.-I. Kim, J. Kim, and W. Woo. Ontology-based mobile augmented reality in cultural heritage sites: information modeling and user study. Multimedia Tools and Applications, 76(24):26001–26029, 2017. 1, 5

[35] Y. Kim, Z. Aamir, M. Singh, S. Boorboor, K. Mueller, and A. E. Kaufman. Explainable xr: Understanding user behaviors of xr environments using llm-assisted analytics framework. IEEE TVCG, 2025. 4, 5

[36] B. Lee, M. Sedlmair, and D. Schmalstieg. Design patterns for situated visualization in augmented reality. IEEE TVCG, 30(1):1324–1335,

[37] G. Lee, M. Xia, N. Numan, X. Qian, D. Li, Y. Chen, A. Kulshrestha, I. Chatterjee, Y. Zhang, D. Manocha, et al. Sensible agent: A framework for unobtrusive interaction with proactive ar agents. In Proc. of ACM UIST, pp. 1–22, 2025. 9

[38] J. Lee, F. Aleotti, D. Mazala, G. Garcia-Hernando, S. Vicente, O. J. Johnston, I. Kraus-Liang, J. Powierza, D. Shin, J. E. Froehlich, et al. Imaginatear: Ai-assisted in-situ authoring in augmented reality. In Proc. of ACM UIST, pp. 1–21, 2025. 2

[39] J. Lee, J. Kim, J. Ahn, and W. Woo. Remote diagnosis of architectural heritage based on 5w1h model-based metadata in virtual reality. ISPRS IJGI, 8(8):339, 2019. 4

[40] J. Lee, J. Wang, E. Brown, L. Chu, S. S. Rodriguez, and J. E. Froehlich. GazePointAR: a context-aware multimodal voice assistant for pronoun disambiguation in wearable augmented reality. In Proc. of ACM CHI, pp. 1–20, 2024. 1, 2, 4, 7, 9

[41] J. Lee, T. Wang, J. Fashimpaur, N. Sendhilnathan, and T. R. Jonker. Walkie-talkie: Exploring longitudinal natural gaze, llms, and vlms for query disambiguation in xr. In Proc. of ACM CHI EA, pp. 1–9, 2025. 1, 2, 9

[42] W. J. Levelt. Cognitive styles in the use of spatial direction terms. Psychology, 1982. 2, 3

[43] W. J. Levelt. Speaking: From intention to articulation. MIT press,

[44] S. C. Levinson. Frames of reference and Molyneux’s question: Crosslinguistic evidence. Language and space, 109:169, 1996. 3

[45] C. Li, G. Wu, G. Y.-Y. Chan, D. G. Turakhia, S. Castelo Quispe, D. Li, L. Welch, C. Silva, and J. Qian. Satori: Towards proac...

[46] J. N. Li, Z. Zhang, and J. Ma. Omniquery: Contextually augmenting captured multimodal memories to enable personal question answering. In Proc. of ACM CHI, pp. 1–20, 2025. 3

[47] X. Liu, D. Jia, X. C. Liu, M. Gonzalez-Franco, and C. Zhu-Tian. Reality proxy: fluid interactions with real-world objects in mr via abstract representations. In Proc. of ACM UIST, pp. 1–16, 2025. 2

[48] E. Lukianova, J.-Y. Jeong, and J.-W. Jeong. A picture is worth a thousand words? investigating the impact of image aids in ar on memory recall for everyday tasks. In Proc. of IUI, pp. 106–126, 2025. 3

[49] M. N. Lystbæk, K. Pfeuffer, T. Langlotz, J. E. S. Grønbæk, and H. Gellersen. Spatial gaze markers: Supporting effective task switching in augmented reality. In Proc. of ACM CHI, pp. 1–11, 2024. 3

[50] D. Markov-Vetter, M. Luboschik, A. T. Islam, P. Gauger, and O. Staadt. The effect of spatial reference on visual attention and workload during viewpoint guidance in augmented reality. In Proc. of ACM SUI, pp. 1–10, 2020. 2

[51] Microsoft. Dynamics 365 remote assist. https://learn.microsoft.com/en-us/dynamics365/mixed-reality/remote-assist/ra-overview, 2025. Sep. 3. 2025. 6

[52] G. A. Miller and P. N. Johnson-Laird. Language and perception. Harvard University Press, 1976. 3

[53] R. Murai, E. Dexheimer, and A. J. Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proc. of CVPR, pp. 16695–16705, 2025. 1, 2, 5

[54] M. Rebol, C. Hood, C. Ranniger, A. Rutenberg, N. Sikka, E. M. Horan, C. Gütl, and K. Pietroszek. Remote assistance with mixed reality for procedural tasks. In Proc. of IEEE VRW, pp. 653–654, 2021. 2

[55] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. arXiv preprint arXiv:2002.06289, 2020. 2

[56] K. A. Satriadi, B. Tag, and T. Dwyer. Context-dependent memory in situated visualization. arXiv:2311.12288, 2023. 3

[57] M. F. Schober. Addressee- and object-centered frames of reference in spatial descriptions. In American Association for Artificial Intelligence, Working Notes of the 1996 AAAI Spring Symposium on Cognitive and Computational Models of Spatial Representation, vol. 47, pp. 92–100, 1996. 2, 3

[58] S. Schüz, A. Gatt, and S. Zarrieß. Rethinking symbolic and visual context in referring expression generation. Frontiers in Artificial Intelligence, 6:1067125, 2023. 2

[59] J. Seo, I. Avellino, D. P. Rajasagi, A. Komlodi, and H. M. Mentis. Holomentor: Enabling remote instruction through augmented reality mobile views. Proc. of ACM HCI, 7(GROUP):1–29, 2023. 2

[60] M. Shakeri, H. Park, I. Jeon, A. Sadeghi-Niaraki, and W. Woo. User behavior modeling for ar personalized recommendations in spatial transitions. VR, 27(4):3033–3050, 2023. 5

[61] J. Shen, J. J. Dudley, and P. O. Kristensson. Encode-store-retrieve: Augmenting human memory through language-encoded egocentric perception. In Proc. of IEEE ISMAR, pp. 923–931, 2024. 3

[62] A. Shusterman and P. Li. Frames of reference in spatial language acquisition. Cognitive psychology, 88:115–161, 2016. 2, 3

[63] J. G. R. d. Souza, J. J. Ferreira, and V. Segura. A taxonomy of methods, tools, and approaches for enabling collaborative annotation. In Proc. of IHC, pp. 1–12, 2023. 2

[64] D. Stover and D. Bowman. Taggar: General-purpose task guidance from natural language in augmented reality using vision-language models. In Proc. of ACM SUI, pp. 1–12, 2024. 1, 2

[65] H. A. Taylor and B. Tversky. Descriptions and depictions of environments. Memory & cognition, 20(5):483–496, 1992. 3

[66] H. A. Taylor and B. Tversky. Perspective in spatial descriptions. Journal of memory and language, 35(3):371–391, 1996. 2, 3

[67] TeamViewer. Teamviewer assist ar. https://www.teamviewer.com/en-us/products/frontline/solutions/remote-assistance, 2025. Sep. 3. 2025. 6

[68] P. Wang, Y. Wang, Y. Wang, M. Billinghurst, D. Yang, H. Yang, R. Luo, and X. Zhang. Extended reality remote collaboration supporting visual annotation cues for industry: A literature review. Engineered Science, 37:1802, 2025. 2

[69] F. Zaman, C. Anslow, and T. J. Rhee. Vicarious: Context-aware viewpoints selection for mixed reality collaboration. In Proc. of ACM VRST, pp. 1–11, 2023. 2

[70] A. Y. Zhao, A. Gunturu, E. Y.-L. Do, and R. Suzuki. Guided reality: Generating visually-enriched ar task guidance with llms and vision models. arXiv preprint arXiv:2508.03547, 2025. 2, 9

[71] Zoom. Zoom. https://www.zoom.com/, 2025. Sep. 3. 2025. 3

[72] W. D. Zulfikar, S. Chan, and P. Maes. Memoro: Using large language models to realize a concise interface for real-time memory augmentation. In Proc. of ACM CHI, pp. 1–18, 2024. 3