pith. machine review for the scientific record.

arxiv: 2604.05510 · v2 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords augmented reality · vision-language models · adversarial attacks · benchmark · contradictory content · robustness evaluation · virtual content manipulation

The pith

Vision-language models show reasonable grasp of contradictory virtual content in augmented reality but still need better detection and faster reasoning under attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates ContrAR, a benchmark of 312 real-world augmented reality videos, to test how vision-language models handle contradictory or malicious virtual elements overlaid on physical scenes. The videos were validated by ten human participants to ensure they represent inconsistent or harmful AR content. Eleven commercial and open-source VLMs were evaluated on tasks involving understanding, detecting, and reasoning about these contradictions. Results indicate the models perform adequately on basic comprehension yet fall short on reliable attack detection and on keeping both accuracy and response speed high. If the findings hold, AR systems that rely on these models could leave users exposed to misleading information in everyday settings.

Core claim

The paper introduces ContrAR as a benchmark containing 312 validated real-world AR videos for evaluating vision-language model robustness to contradictory virtual content attacks. Benchmarking eleven VLMs shows they exhibit reasonable understanding of such content but leaves clear room for improvement in detecting and reasoning about adversarial manipulations, while also revealing persistent challenges in balancing detection accuracy with low latency.

What carries the argument

ContrAR, a dataset of 312 human-validated real-world augmented reality videos that systematically model contradictory virtual content attacks for testing VLM detection and reasoning capabilities.
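The evaluation pipeline itself (Figure 6) is not reproduced here, so the following is a minimal sketch of what such a harness plausibly looks like: sample frames from each ContrAR video, query a VLM with a detection prompt, and record the answer together with wall-clock latency. `query_vlm`, `sample_frames`, and the `ARVideo` fields are hypothetical placeholders, not names from the paper or the dataset release.

```python
import time
from dataclasses import dataclass

@dataclass
class ARVideo:
    # Illustrative fields; the actual ContrAR annotation schema is not
    # specified in this review.
    path: str
    label: bool  # True if the video contains contradictory virtual content

def query_vlm(frames, prompt: str) -> str:
    """Placeholder for a real VLM call (commercial API or local model)."""
    raise NotImplementedError

def evaluate(videos: list[ARVideo], sample_frames) -> list[dict]:
    """Run one model over the benchmark, logging detection and latency."""
    prompt = ("Does the virtual content overlaid on this scene contradict "
              "the physical environment? Answer yes or no.")
    results = []
    for video in videos:
        frames = sample_frames(video.path)  # e.g. one frame per second
        start = time.perf_counter()
        answer = query_vlm(frames, prompt)
        latency_ms = (time.perf_counter() - start) * 1000.0
        results.append({
            "predicted": answer.strip().lower().startswith("yes"),
            "label": video.label,
            "latency_ms": latency_ms,
        })
    return results
```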

If this is right

  • Current VLMs exhibit reasonable understanding of contradictory virtual content in AR environments.
  • There remains clear room for improvement in detecting and reasoning about adversarial content manipulations.
  • Balancing detection accuracy and latency remains a difficult trade-off for existing models.
  • AR applications using these models may still be vulnerable to misleading or harmful virtual overlays.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Developers could add AR-specific fine-tuning or auxiliary detectors to reduce the observed weaknesses without retraining entire models.
  • The benchmark could be expanded to interactive or multi-user AR scenarios to check whether dynamic contradictions expose further gaps.
  • Results point toward the value of combining VLMs with classical computer-vision checks for real-time AR security.

Load-bearing premise

The chosen 312 videos, together with validation by only ten human participants, capture enough of the space of real-world contradictory AR attacks for the performance patterns seen across the eleven tested models to apply more broadly.

What would settle it

A new evaluation of the same eleven VLMs on a fresh collection of several hundred AR videos with similar contradictions that shows consistently high detection accuracy paired with low latency would directly contradict the reported need for improvement.
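Concretely, such a replication reduces to two aggregates per model. A minimal sketch, assuming per-video records of the kind a harness like the one sketched above would log:

```python
def summarize(results: list[dict]) -> dict:
    """Collapse per-video records into the two numbers the trade-off
    claim is about: detection accuracy and mean response latency."""
    n = len(results)
    accuracy = sum(r["predicted"] == r["label"] for r in results) / n
    mean_latency_ms = sum(r["latency_ms"] for r in results) / n
    return {"accuracy": accuracy, "mean_latency_ms": mean_latency_ms}
```

A model landing in the high-accuracy, low-latency corner on a fresh collection, rather than only on the original 312 videos, would be the contradicting evidence.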

Figures

Figures reproduced from arXiv: 2604.05510 by Maria Gorlatova, Neil Zhenqiang Gong, Yanming Xiu, Zhengyuan Jiang.

Figure 1. Examples of contradictory virtual content in augmented … [figure image]
Figure 2. Demonstration of the dataset collection process. (a) AR … [figure image]
Figure 3. Sample frames from the ContrAR dataset. Each column shows an AR application. The top row shows the contradictory … [figure image]
Figure 5. User agreement with contradictory virtual content at … [figure image]
Figure 6. Pipeline for evaluating VLMs on the ContrAR. [figure image]
Figure 7. Three typical failure cases. Subfigure (a) and (b) are … [figure image]
Original abstract

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ContrAR, a benchmark consisting of 312 real-world AR videos (validated by 10 human participants) for evaluating the robustness of 11 vision-language models against contradictory virtual content attacks in augmented reality. It systematically models such attacks and reports that current VLMs exhibit reasonable understanding of contradictory content but still have room for improvement in detection and reasoning, while also facing challenges in balancing detection accuracy and latency.

Significance. If the benchmark proves representative, this work is significant for identifying vulnerabilities in VLMs as AR integrates into daily applications, providing empirical data to guide improvements in model robustness for security-critical AR scenarios. The empirical benchmarking of both commercial and open-source models offers concrete performance insights, though the impact depends on the dataset's coverage of attack diversity.

major comments (2)
  1. [Benchmark construction] (dataset section) The 312-video scale, validated by only 10 human participants, lacks reported inter-rater reliability, scenario-diversity metrics, and explicit coverage of attack types (semantic vs. visual contradictions); this gap is load-bearing for the central claim that the results reflect general VLM limitations rather than sample-specific observations.
  2. [Experimental evaluation] The claims of 'reasonable understanding' and 'room for improvement' rest on model outputs without sufficient detail on exact metrics, prompt formulations, attack-modeling parameters, or statistical significance tests, limiting the verifiability of the reported performance gaps and latency-accuracy trade-offs.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., average accuracy or latency figures) to better ground the high-level conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have carefully reviewed the major comments and provide point-by-point responses below. We will revise the manuscript to incorporate additional details that address the concerns raised, thereby improving the clarity, rigor, and verifiability of our work.

Point-by-point responses
  1. Referee: [Benchmark construction] (dataset section) The 312-video scale, validated by only 10 human participants, lacks reported inter-rater reliability, scenario-diversity metrics, and explicit coverage of attack types (semantic vs. visual contradictions); this gap is load-bearing for the central claim that the results reflect general VLM limitations rather than sample-specific observations.

    Authors: We agree that reporting inter-rater reliability and diversity metrics will strengthen the benchmark's credibility. In the revised manuscript, we will add Fleiss' kappa scores computed from the 10 human validators' annotations to quantify agreement. We will also include scenario diversity metrics, such as the distribution of videos across AR environments (indoor/outdoor), lighting conditions, and object categories. Regarding attack types, Section 3 of the manuscript already systematically models both semantic contradictions (e.g., inconsistent object semantics or labels) and visual contradictions (e.g., mismatched visual appearances or placements), with explicit categorization; we will add a summary table showing the proportion of each type to demonstrate balanced coverage. These additions will better substantiate that our observations reflect broader VLM limitations. revision: yes
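For readers unfamiliar with the statistic the authors commit to, here is a minimal sketch of computing Fleiss' kappa over ten validators' binary labels using statsmodels. The data layout is an assumption, since the ContrAR annotation format is not given here, and the random labels are stand-in data only.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = label from validator j for video i
# (0 = consistent, 1 = contradictory); shape (312 videos, 10 raters).
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(312, 10))  # stand-in data only

# Convert per-rater labels to per-video category counts, then score.
counts, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.3f}")  # near 0 for random stand-in labels
```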

  2. Referee: [Experimental evaluation] The claims of 'reasonable understanding' and 'room for improvement' rest on model outputs without sufficient detail on exact metrics, prompt formulations, attack-modeling parameters, or statistical significance tests, limiting the verifiability of the reported performance gaps and latency-accuracy trade-offs.

    Authors: We concur that greater specificity is needed for reproducibility and to support the performance claims. In the revised Experimental Evaluation section, we will provide: (1) complete metric definitions and tabulated results including accuracy, precision, recall, F1-score, and latency (in ms) for all 11 models; (2) the full prompt templates and system instructions used for each VLM; (3) precise attack modeling parameters, such as virtual overlay methods, contradiction generation rules, and intensity levels; and (4) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing model performances and accuracy-latency trade-offs. These enhancements will make the evidence for 'reasonable understanding' with 'room for improvement' fully verifiable. revision: yes
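As an illustration of the promised significance testing, a paired Wilcoxon signed-rank test on per-video correctness for two models can be run with scipy, assuming both models were scored on the same 312 videos. The data below are stand-ins, not the paper's results; for strictly binary outcomes, a McNemar test would be the more conventional choice.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-video correctness (1 = correct detection) for two models on the
# same 312 ContrAR videos; stand-in data, not results from the paper.
rng = np.random.default_rng(1)
model_a = rng.integers(0, 2, size=312)
model_b = rng.integers(0, 2, size=312)

# Paired test on per-video differences; videos where both models agree
# give a zero difference and are dropped by the default zero handling.
stat, p_value = wilcoxon(model_a, model_b)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
```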

Circularity Check

0 steps flagged

No circularity in this empirical benchmarking study

Full rationale

This is a purely empirical paper that introduces the ContrAR benchmark consisting of 312 AR videos validated by 10 human participants and then directly evaluates 11 VLMs on it. There are no equations, derivations, fitted parameters, predictions, or self-citation chains that reduce any claimed result to the inputs by construction. All reported findings on VLM understanding, detection accuracy, and latency trade-offs are obtained from straightforward model inference and human validation on the described dataset, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and relies on the assumption that the selected videos and human validators capture relevant attack scenarios; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5465 in / 1022 out tokens · 56315 ms · 2026-05-10T20:12:18.022557+00:00 · methodology

