pith. machine review for the scientific record.

arxiv: 2602.10154 · v1 · submitted 2026-02-09 · 💻 cs.CR · cs.AI · cs.MM

Recognition: 1 theorem link · Lean Theorem

PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 05:02 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.MM
keywords privacy preservation · extended reality · multimodal LLMs · edge computing · XR collaboration · sensitive data filtering · spatial registration
0 comments

The pith

PRISM-XR uses edge-server preprocessing to filter sensitive data from XR frames before querying cloud MLLMs for collaborative content creation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PRISM-XR to let multiple users in extended reality environments collaborate through multimodal large language models without exposing private real-world information captured by headsets. It does this by running intelligent frame preprocessing on an edge server to remove sensitive objects and irrelevant context. A lightweight registration process and a customizable sharing mechanism handle synchronization efficiently. Numerical tests show nearly 90 percent accuracy in fulfilling user requests, registration in under 0.27 seconds, and spatial errors below 3.5 centimeters. A user study with 28 participants confirms filtering of sensitive objects in over 90 percent of scenarios, with good usability.
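The filtering step lends itself to a concrete sketch. Below is a minimal, hypothetical version of such an edge-side filter in Python, assuming an Ultralytics YOLO detector (the reference list cites Ultralytics YOLO [27]); the sensitive-label set, weights file, and blackout redaction policy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical edge-side filter: detect sensitive objects, redact them,
# and only then let the frame leave the edge for the cloud MLLM.
# Label set, weights, and blackout policy are assumptions for illustration.
import numpy as np
from ultralytics import YOLO

SENSITIVE_LABELS = {"credit card", "person", "laptop"}  # assumed policy
detector = YOLO("yolov8n.pt")  # stand-in weights, not the authors' model

def filter_frame(frame: np.ndarray) -> np.ndarray:
    """Return a copy of `frame` with detected sensitive regions blacked out."""
    safe = frame.copy()
    result = detector(frame, verbose=False)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] in SENSITIVE_LABELS:
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            safe[y1:y2, x1:x2] = 0  # redact the region before any upload
    return safe
```

Whatever survives a function like this is the only pixel data the cloud MLLM ever sees, which is the property the paper's privacy claim rests on.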

Core claim

PRISM-XR provides privacy-aware integration of MLLMs into XR by preprocessing frames on the edge to filter sensitive data, using a lightweight registration process and fully customizable content-sharing to support efficient multi-user collaboration.

What carries the argument

Edge-server preprocessing that detects and removes sensitive content from XR frames prior to transmission to cloud MLLMs, paired with a lightweight registration process for spatial synchronization.
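The registration internals are not spelled out in this summary. As a hedged stand-in, the sketch below shows the classic rigid-alignment core such a lightweight registration could reduce to: estimating the rotation and translation between two users' coordinate frames from matched anchor points (Kabsch/Procrustes; the reference list includes coordinate-alignment work [37, 61]). It illustrates the technique, not the authors' algorithm.

```python
# Hypothetical registration core: given N matched 3D anchor points observed
# in user A's and user B's coordinate frames, recover the rigid transform
# (R, t) mapping A's frame onto B's. Classic Kabsch, not the authors' code.
import numpy as np

def rigid_align(pts_a: np.ndarray, pts_b: np.ndarray):
    """pts_a, pts_b: (N, 3) matched points. Returns (R, t) with pts_b ~ R @ p + t."""
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    h = (pts_a - ca).T @ (pts_b - cb)        # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflection solutions
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = cb - r @ ca
    return r, t
```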

If this is right

  • Multi-user XR sessions can incorporate natural language and visual inputs for object creation without privacy violations from background scenes.
  • Registration and synchronization occur in under 0.27 seconds with spatial inconsistencies below 3.5 cm.
  • The system automatically filters highly sensitive objects in over 90 percent of tested scenarios.
  • Nearly 90 percent accuracy is maintained in fulfilling user requests during collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If edge preprocessing proves reliable, similar techniques could apply to other camera-based AI systems like smart glasses or autonomous vehicles.
  • Customizable sharing might allow fine-grained control over what collaborators see, reducing the need for full scene uploads.
  • Future work could test the framework with more diverse environments to confirm filtering effectiveness.

Load-bearing premise

Edge-server preprocessing can reliably detect sensitive content in XR frames and remove it without missing real privacy risks or removing context needed for correct MLLM responses.

What would settle it

Observe a scenario where the system misses filtering a credit card or user face in more than 10 percent of cases, or where filtered frames cause MLLM accuracy to drop below 80 percent on user tasks.
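Read operationally, that criterion is a pair of thresholds. A minimal sketch of the check, assuming per-scenario records with hypothetical contains_sensitive, was_filtered, and mllm_response_correct fields:

```python
# Hypothetical falsification check over labeled test scenarios. Field names
# are assumptions; the thresholds (10% miss rate, 80% task accuracy) come
# from the criterion stated above.
def claim_is_undermined(cases) -> bool:
    sensitive = [c for c in cases if c["contains_sensitive"]]
    misses = sum(1 for c in sensitive if not c["was_filtered"])
    correct = sum(1 for c in cases if c["mllm_response_correct"])

    miss_rate = misses / max(len(sensitive), 1)
    task_accuracy = correct / max(len(cases), 1)
    return miss_rate > 0.10 or task_accuracy < 0.80
```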

Figures

Figures reproduced from arXiv: 2602.10154 by Bin Li, Jiangong Chen, Mingyu Zhu.

Figure 1
Figure 1: An example of multi-user collaboration in PRISM-XR. In this example, two users collaborate with each other in the same… view at source ↗
Figure 2
Figure 2: System architecture of PRISM-XR. Unlike existing systems that rely on cumbersome manual inputs [13], gesture-based audio triggers [9], or controller-based audio activation [18], PRISM-XR employs a more flexible keyword-activated approach. Users initiate their requests by starting with a predefined keyword, which triggers automatic audio recording when the keyword is detected and stops when silence is… view at source ↗
Figure 3
Figure 3: Workflow of privacy-aware frame processing. view at source ↗
Figure 5
Figure 5: Successful user registration with a wire-frame cube. view at source ↗
Figure 6
Figure 6: Registration evaluation. view at source ↗
Figure 7
Figure 7: NASA-TLX scores for all tasks. view at source ↗
Figure 8
Figure 8: Fulfillment levels. view at source ↗
Figure 9
Figure 9: Results of questionnaire. view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) enhance collaboration in Extended Reality (XR) environments by enabling flexible object and animation creation through the combination of natural language and visual inputs. However, visual data captured by XR headsets includes real-world backgrounds that may contain irrelevant or sensitive user information, such as credit cards left on the table or facial identities of other users. Uploading those frames to cloud-based MLLMs poses serious privacy risks, particularly when such data is processed without explicit user consent. Additionally, existing colocation and synchronization mechanisms in commercial XR APIs rely on time-consuming, privacy-invasive environment scanning and struggle to adapt to the highly dynamic nature of MLLM-integrated XR environments. In this paper, we propose PRISM-XR, a novel framework that facilitates multi-user collaboration in XR by providing privacy-aware MLLM integration. PRISM-XR employs intelligent frame preprocessing on the edge server to filter sensitive data and remove irrelevant context before communicating with cloud generative AI models. Additionally, we introduce a lightweight registration process and a fully customizable content-sharing mechanism to enable efficient, accurate, and privacy-preserving content synchronization among users. Our numerical evaluation results indicate that the proposed platform achieves nearly 90% accuracy in fulfilling user requests and less than 0.27 seconds registration time while maintaining spatial inconsistencies of less than 3.5 cm. Furthermore, we conducted an IRB-approved user study with 28 participants, demonstrating that our system could automatically filter highly sensitive objects in over 90% of scenarios while maintaining strong overall usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PRISM-XR, a framework for multi-user XR collaboration that integrates multimodal LLMs while addressing privacy risks. It uses edge-server frame preprocessing to filter sensitive content (e.g., credit cards, faces) before cloud MLLM queries, plus a lightweight registration process and customizable content-sharing for synchronization. The authors report ~90% request-fulfillment accuracy, <0.27 s registration time, <3.5 cm spatial inconsistency, and >90% sensitive-object filtering success in an IRB-approved study with 28 participants.

Significance. If the empirical claims hold under rigorous validation, the work would provide a concrete, deployable approach to privacy-preserving MLLM use in dynamic XR settings, filling a gap between commercial XR APIs and generative AI. The edge-preprocessing plus lightweight sync design is practically relevant for collaborative XR applications.

major comments (3)
  1. [§4] §4 (Numerical Evaluation): The abstract and evaluation claim nearly 90% accuracy in fulfilling user requests and <3.5 cm spatial inconsistency, yet no baselines, error bars, test-scenario definitions, or measurement protocol (e.g., how spatial error was computed across frames) are provided; without these the quantitative results cannot be interpreted or reproduced.
  2. [User study] User-study section: The >90% sensitive-object filtering rate is presented as a central result, but the manuscript supplies no description of the detection method, model architecture, training data, false-negative rates on occluded or novel items, or downstream effect on MLLM response correctness; this directly undermines the privacy guarantee that is load-bearing for the framework.
  3. [§3.2] §3.2 (Edge Preprocessing): The assumption that edge filtering reliably removes sensitive content without stripping necessary context for the MLLM is stated but never tested with controlled ablation (e.g., MLLM accuracy with vs. without filtering on the same queries); the 28-participant study does not isolate this variable.
minor comments (2)
  1. [Abstract] The abstract mixes system-level metrics with user-study outcomes without clear separation; a short table summarizing all reported numbers would improve readability.
  2. [Discussion] No discussion of failure modes (e.g., what happens when the edge filter misses a partially occluded card) or computational overhead of the preprocessing step on typical XR hardware.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our manuscript. We address each major comment below and will revise the paper accordingly to strengthen the presentation of our results and methods.

read point-by-point responses
  1. Referee: [§4] §4 (Numerical Evaluation): The abstract and evaluation claim nearly 90% accuracy in fulfilling user requests and <3.5 cm spatial inconsistency, yet no baselines, error bars, test-scenario definitions, or measurement protocol (e.g., how spatial error was computed across frames) are provided; without these the quantitative results cannot be interpreted or reproduced.

    Authors: We agree that the current §4 would benefit from additional detail to support interpretability and reproducibility. In the revised manuscript we will add relevant baselines (e.g., commercial XR synchronization APIs without our lightweight registration), include error bars on all reported metrics, explicitly define the test scenarios (including request types and environmental conditions), and provide a precise measurement protocol for spatial inconsistency that describes how ground-truth tracking data was used across frames. revision: yes

  2. Referee: [User study] User-study section: The >90% sensitive-object filtering rate is presented as a central result, but the manuscript supplies no description of the detection method, model architecture, training data, false-negative rates on occluded or novel items, or downstream effect on MLLM response correctness; this directly undermines the privacy guarantee that is load-bearing for the framework.

    Authors: We acknowledge that the user-study section currently lacks sufficient technical detail on the sensitive-object filtering component. In the revision we will expand this section to describe the detection method, model architecture, training data characteristics, false-negative rates (including performance on occluded and novel items), and an analysis of the downstream impact on MLLM response correctness. These additions will directly support the privacy claims. revision: yes

  3. Referee: [§3.2] §3.2 (Edge Preprocessing): The assumption that edge filtering reliably removes sensitive content without stripping necessary context for the MLLM is stated but never tested with controlled ablation (e.g., MLLM accuracy with vs. without filtering on the same queries); the 28-participant study does not isolate this variable.

    Authors: We agree that a controlled ablation would provide stronger evidence for the edge-preprocessing design choice. While the 28-participant study evaluates the integrated system, it does not isolate the filtering variable. In the revised version we will add a dedicated ablation experiment that measures MLLM response accuracy on identical queries with and without edge filtering, thereby directly testing the assumption. revision: yes
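A minimal sketch of the ablation the authors commit to here, with query_mllm, edge_filter, and fulfilled as hypothetical placeholders for the system's components and scoring; none of these are the authors' APIs:

```python
# Hypothetical ablation harness: same queries, same frames, with vs. without
# the edge filter, so the filtering variable is isolated. All callables are
# placeholders standing in for the system's components.
def ablate_filtering(pairs, query_mllm, edge_filter, fulfilled):
    raw_hits = filt_hits = 0
    for query, frame in pairs:
        raw_hits += fulfilled(query_mllm(query, frame))
        filt_hits += fulfilled(query_mllm(query, edge_filter(frame)))
    n = len(pairs)
    return raw_hits / n, filt_hits / n  # (accuracy_raw, accuracy_filtered)
```

Reporting both numbers side by side on identical (query, frame) pairs is what isolates the variable the referee flags.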

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements

full rationale

The paper reports performance metrics (90% request accuracy, <0.27 s registration, <3.5 cm spatial error, >90% sensitive-object filtering) from numerical evaluations and an IRB-approved 28-participant user study. No equations, fitted parameters, derivations, or self-citations are invoked to support these outcomes. The central claims are presented as direct experimental results rather than any chain that reduces to its own inputs by construction. This is the expected non-finding for a systems paper whose load-bearing evidence is external measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on the unproven assumption that edge hardware can run accurate real-time sensitive-object detection and that MLLM responses remain useful after context removal; a rough way to test the first half is sketched after this ledger.

axioms (2)
  • domain assumption Edge servers have sufficient compute and low enough latency to perform real-time sensitive-data filtering before cloud MLLM calls.
    Required for the preprocessing step to preserve both privacy and task utility.
  • domain assumption MLLM performance after filtering is comparable to performance on raw frames for the targeted XR tasks.
    Implicit in the claim that the system fulfills user requests at 90% accuracy.
invented entities (1)
  • PRISM-XR framework · no independent evidence
    purpose: Privacy-aware MLLM integration for multi-user XR
    The paper introduces the named system and its components.
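As promised above, a rough test of the first axiom's compute-and-latency half, assuming a hypothetical target frame interval (1/72 s, a common XR refresh rate, not a figure from the paper) and any per-frame filter function:

```python
# Hypothetical real-time budget check for the edge filter. The 72 Hz frame
# rate is an assumed target, not a number taken from the paper.
import time

def meets_budget(filter_fn, frames, budget_s: float = 1 / 72):
    """Average the filter's per-frame wall time and compare to the budget."""
    start = time.perf_counter()
    for frame in frames:
        filter_fn(frame)
    per_frame = (time.perf_counter() - start) / len(frames)
    return per_frame <= budget_s, per_frame
```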

pith-pipeline@v0.9.0 · 5583 in / 1387 out tokens · 62304 ms · 2026-05-16T05:02:42.976872+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 3 internal anchors

  1. [1] M. Abraham, P. Saeghe, M. McGill, and M. Khamis. Implications of XR on privacy, security and behaviour: Insights from experts. In Nordic Human-Computer Interaction Conference, pp. 1–12, 2022.
  2. [2] M. Alkaeed, A. Qayyum, and J. Qadir. Privacy preservation in artificial intelligence and extended reality (AI-XR) metaverses: A survey. Journal of Network and Computer Applications, p. 103989, 2024.
  3. [3] N. Bevan, C. Barnum, G. Cockton, J. Nielsen, J. Spool, and D. Wixon. The "magic number 5": Is it enough for web testing? In CHI '03 Extended Abstracts on Human Factors in Computing Systems, pp. 698–699, 2003.
  4. [4] D. A. Boiko, R. MacKnight, and G. Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332, 2023.
  5. [5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  6. [6] J. Chen, T. Lan, and B. Li. GPT-VR Nexus: ChatGPT-powered immersive virtual reality experience. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pp. 01–02. IEEE, 2024.
  7. [7] J. Chen, F. Qian, and B. Li. Enhancing quality of experience for collaborative virtual reality with commodity mobile devices. In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pp. 1018–1028. IEEE, 2022.
  8. [8] J. Chen, X. Qin, G. Zhu, B. Ji, and B. Li. Motion-prediction-based wireless scheduling for multi-user panoramic video streaming. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pp. 1–10. IEEE, 2021.
  9. [9] J. Chen, X. Wu, T. Lan, and B. Li. LLMER: Crafting interactive extended reality worlds with JSON data generated by large language models. IEEE Transactions on Visualization and Computer Graphics,
  10. [10] V. Clarke and V. Braun. Thematic analysis. The Journal of Positive Psychology, 12(3):297–298, 2017. doi: 10.1080/17439760.2016.1262613
  11. [11] M. Corbett, B. David-John, J. Shang, Y. C. Hu, and B. Ji. BystandAR: Protecting bystander visual data in augmented reality systems. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, pp. 370–382, 2023.
  12. [12] H. Davies and L. Hjorth. Roblox in lockdown: Understanding young people's digital social play in the pandemic. Gaming and Gamers in Times of Pandemic, p. 15, 2024.
  13. [13] F. De La Torre, C. M. Fang, H. Huang, A. Banburski-Fahey, J. Amores Fernandez, and J. Lanier. LLMR: Real-time prompting of interactive worlds using large language models. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–22,
  14. [14] A. Dhakal, X. Ran, Y. Wang, J. Chen, and K. Ramakrishnan. SLAM-Share: Visual simultaneous localization and mapping for real-time multi-user augmented reality. In Proceedings of the 18th International Conference on Emerging Networking EXperiments and Technologies, pp. 293–306, 2022.
  15. [15] Z. Dong, J. Chen, and B. Li. Collaborative mixed-reality-based firefighter training. In IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–2. IEEE, 2023.
  16. [16] S. Earle, F. Kokkinos, Y. Nie, J. Togelius, and R. Raileanu. Dreamcraft: Text-guided generation of functional 3D environments in Minecraft. In Proceedings of the 19th International Conference on the Foundations of Digital Games, pp. 1–15, 2024.
  17. [17] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014.
  18. [18] D. Giunchi, N. Numan, E. Gatti, and A. Steed. DreamCodeVR: Towards democratizing behavior design in virtual reality with speech-driven programming. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 579–589. IEEE, 2024.
  19. [19] Google. Cloud Anchors allow different users to share AR experiences | ARCore | Google for Developers.
  20. [20] H. Hadan, D. M. Wang, L. E. Nacke, and L. Zhang-Kennedy. Privacy in immersive extended reality: Exploring user perceptions, concerns, and coping strategies. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–24, 2024.
  21. [21] S. Hart. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Human Mental Workload / Elsevier, 1988.
  22. [22] T. Hirzle, F. Müller, F. Draxler, M. Schmitz, P. Knierim, and K. Hornbæk. When XR and AI meet - a scoping review on extended reality and artificial intelligence. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–45, 2023.
  23. [23] J. Hu, A. Iosifescu, and R. LiKamWa. LensCap: Split-process framework for fine-grained visual privacy control for augmented reality apps. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pp. 14–27, 2021.
  24. [24] T. Hu, F. Yang, T. Scargill, and M. Gorlatova. Apple vs Meta: A comparative study on spatial tracking in SOTA XR headsets. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pp. 2120–2127, 2024.
  25. [25] Y. Hu, M. Zhu, Q. Jin, F. Qian, and B. Li. MagicCloth: Protect user privacy in AR streaming. In Proceedings of the 1st ACM Workshop on Mobile Immersive Computing, Networking, and Systems, pp. 222–228, 2023.
  26. [26] Y. Huang, Y. Wang, Z. Xu, C. Gao, S. Wu, J. Ye, X. Chen, P.-Y. Chen, and X. Zhang. Breaking focus: Contextual distraction curse in large language models. arXiv preprint arXiv:2502.01609, 2025.
  27. [27] G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO, Jan. 2023.
  28. [28] A. Kobenova, C. DeVeaux, S. Parajuli, A. Banburski-Fahey, J. A. Fernandez, and J. Lanier. Social conjuring: Multi-user runtime collaboration with AI in building virtual 3D worlds. arXiv preprint arXiv:2410.00274, 2024.
  29. [29] L. Lammerding, T. Hilken, D. Mahr, and J. Heller. Too real for comfort: Measuring consumers' augmented reality information privacy concerns. In Augmented Reality and Virtual Reality: New Trends in Immersive Technology, pp. 95–108. Springer, 2021.
  30. [30] S. M. Lehman, A. S. Alrumayh, K. Kolhe, H. Ling, and C. C. Tan. Hidden in plain sight: Exploring privacy risks of mobile augmented reality applications. ACM Transactions on Privacy and Security, 25(4):1–35, 2022.
  31. [31] J. R. Lewis. IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7(1):57–78, 1995.
  32. [32] F. Li, S. Yang, X. Yi, and X. Yang. CORB-SLAM: A collaborative visual SLAM system for multiple robots. In International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 480–490. Springer, 2017.
  33. [33] R. Li, T. Patel, Q. Wang, and X. Du. MLR-Copilot: Autonomous machine learning research based on large language models agents. arXiv preprint arXiv:2408.14033, 2024.
  34. [34] T. Li, N. S. Nguyen, X. Zhang, T. Wang, and B. Sheng. Promar: Practical reference object-based multi-user augmented reality. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pp. 1359–1368. IEEE, 2020.
  35. [35] L. Liu and M. Gruteser. EdgeSharing: Edge assisted real-time localization and object sharing in urban streets. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pp. 1–10. IEEE,
  36. [36] A. M. Lund. Measuring usability with the USE questionnaire. Usability Interface, 8(2):3–6, 2001.
  37. [37] H. Mecheri, X. Robert-Lachaine, C. Larue, and A. Plamondon. Evaluation of eight methods for aligning orientation of two coordinate systems. Journal of Biomechanical Engineering, 138(8):084501, 2016.
  38. [38] T. Merino, M. Charity, and J. Togelius. Interactive latent variable evolution for the generation of Minecraft structures. In Proceedings of the 18th International Conference on the Foundations of Digital Games, pp. 1–8, 2023.
  39. [39] Meta. Shared Spatial Anchors | Meta Horizon OS Developers.
  40. [40] Microsoft. Spatial Anchor Sharing.
  41. [41] D. L. Mills. Computer Network Time Synchronization: The Network Time Protocol. CRC Press, 2006.
  42. [42] J. O'Hagan, P. Saeghe, J. Gugenheimer, D. Medeiros, K. Marky, M. Khamis, and M. McGill. Privacy-enhancing technology and everyday augmented reality: Understanding bystanders' varying needs for awareness and consent. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4):1–35, 2023.
  43. [43] E. Olson. AprilTag: A robust and flexible visual fiducial system. In 2011 IEEE International Conference on Robotics and Automation, pp. 3400–3407. IEEE, 2011.
  44. [44] J. S. Park, J. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.
  45. [45] F. Qian and B. Li. Boosting remote multi-user AR privacy through a magic rope. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services, pp. 583–584,
  46. [46] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.
  47. [47] S. Rajaram, C. Chen, F. Roesner, and M. Nebeling. Eliciting security & privacy-informed sharing techniques for multi-user augmented reality. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–17, 2023.
  48. [48] X. Ran, C. Slocum, Y.-Z. Tsai, K. Apicharttrisorn, M. Gorlatova, and J. Chen. Multi-user augmented reality with communication efficient and spatially consistent virtual objects. In Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies, pp. 386–398, 2020.
  49. [49] S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, Z. Liu, and E. Barsoum. Agent Laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.
  50. [50] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.
  51. [51] S. Srinidhi, E. Lu, and A. Rowe. XaiR: An XR platform that integrates large language models with the physical world. In 2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 759–767. IEEE, 2024.
  52. [52] S. Srinidhi, E. Lu, A. Singh, S. Kartik, A. Lin, T. Laroia, and A. Rowe. An XR platform that integrates large language models with the physical world. In Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems, pp. 700–701, 2025.
  53. [53] Y. Tang, J. Situ, A. Y. Cui, M. Wu, and Y. Huang. LLM integration in extended reality: A comprehensive review of current trends, challenges, and future perspectives. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–24, 2025.
  54. [54] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  55. [55] Y.-J. Wang, B. Zhang, J. Chen, and K. Sreenath. Prompt a robot to walk with large language models. arXiv preprint arXiv:2309.09969,
  56. [56] Y. Xiu, T. Scargill, and M. Gorlatova. LOBSTAR: Language model-based obstruction detection for augmented reality. In 2024 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 335–336. IEEE, 2024.
  57. [57] Y. Xiu, T. Scargill, and M. Gorlatova. ViDDAR: Vision language model-based task-detrimental content detection for augmented reality. arXiv preprint arXiv:2501.12553, 2025.
  58. [58] T. Yamakami. A privacy threat model in XR applications. In Advances in Internet, Data and Web Technologies: The 8th International Conference on Emerging Internet, Data and Web Technologies (EIDWT-2020), pp. 384–394. Springer, 2020.
  59. [59] X. Yao, J. Chen, T. He, J. Yang, and B. Li. A scalable mixed reality platform for remote collaborative LEGO design. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–2. IEEE, 2022.
  60. [60] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang. A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, p. 100211, 2024.
  61. [61] K. You, Q. Chen, P. Xie, and S. Song. Range-based coordinate alignment for cooperative mobile sensor network localization. IEEE Transactions on Control of Network Systems, 7(3):1379–1390, 2020.
  62. [62] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik, et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647,
  63. [63] M. Zhu, J. Chen, and B. Li. When generative AI meets extended reality: Enabling scalable and natural interactions. IEEE Internet Computing, pp. 1–10, 2026. doi: 10.1109/MIC.2025.3619462