arxiv: 2606.24498 · v1 · pith:IDK45OM6new · submitted 2026-06-23 · 💻 cs.CV

VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection

Pith reviewed 2026-06-26 00:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords pointing-to-object detection · deictic gestures · spatial orientation awareness · hand pose embedding · geometric ray modeling · transformer attention · grounding accuracy · AR human-robot interaction

The pith

VistaRef improves pointing-to-object detection accuracy by explicitly modeling hand poses and geometric rays inside Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix a specific weakness in Transformer visual models when they must ground a human pointing gesture to a target object in a photo. Global attention spreads focus too broadly and misses the precise angle implied by finger position, which produces drift especially when targets are far away or crowded. VistaRef counters this with three additions: a Local Hand Entity Modeling module that injects hand-pose embeddings, a Geometric Ray Modeling module that turns the implied direction into explicit geometric features, and an Orientation-Consistent Alignment Loss that keeps hand and ray predictions consistent. The authors report that these changes raise grounding accuracy by 14 absolute points over the baseline and produce visibly better ray-to-target alignment in difficult scenes.

Core claim

VistaRef is a framework that augments Transformer-based object detectors for deictic gesture grounding. It adds Local Hand Entity Modeling to embed subtle finger deviations, Geometric Ray Modeling to convert implicit orientation into explicit spatial geometric features that guide attention-based fusion, and Orientation-Consistent Alignment Loss to supervise both hand presence and pointing consistency. Together these components close the spatial-perception gap that standard global attention leaves in pointing tasks, yielding a 14-point absolute gain in grounding accuracy and clearer hand-to-target geometric correlation.
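
The abstract does not give the OCAL formula, so the following is only a minimal sketch of what a loss that supervises "hand presence and pointing consistency" could look like. It assumes a binary cross-entropy term for hand presence plus a cosine misalignment term between the predicted ray direction and the fingertip-to-target vector. The function names, the `weight` parameter, and the gating by the hand label are illustrative assumptions, not the authors' definition.

```python
import math

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one prediction (prob in [0, 1], label in {0, 1})."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def cosine(u, v):
    """Cosine similarity of two 2-D vectors; 0.0 if either has zero length."""
    nu, nv = math.hypot(u[0], u[1]), math.hypot(v[0], v[1])
    if nu == 0 or nv == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def orientation_consistency_loss(hand_prob, hand_label, ray_dir,
                                 fingertip, pred_center, weight=1.0):
    """Hypothetical OCAL-style loss: presence term + gated misalignment term.

    The misalignment term is 1 - cos(angle) between the predicted ray and the
    vector from the fingertip to the predicted box center. It only counts
    when a hand is labeled as present (hand_label == 1).
    """
    presence = bce(hand_prob, hand_label)
    to_target = (pred_center[0] - fingertip[0], pred_center[1] - fingertip[1])
    misalignment = 1.0 - cosine(ray_dir, to_target)
    return presence + weight * hand_label * misalignment
```

Under these assumptions, a prediction lying on the ray pays only the presence term, while a prediction perpendicular to the ray pays an extra penalty of `weight`.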

What carries the argument

The combination of Local Hand Entity Modeling (LHEM), Geometric Ray Modeling (GRM), and Orientation-Consistent Alignment Loss (OCAL) that converts finger pose into an explicit pointing ray and uses that ray to steer feature aggregation.
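
To make the idea of "converting finger pose into an explicit pointing ray" concrete, here is a small 2-D sketch. It assumes two hand keypoints (a knuckle and a fingertip) are available in image coordinates and scores candidate objects by their angular deviation from the ray. The keypoint choice and the helper names are hypothetical; the paper's GRM module is a learned component and is not reproduced here.

```python
import math

def pointing_ray(knuckle, fingertip):
    """Return (origin, unit_direction) of the ray that starts at the fingertip
    and continues along the knuckle -> fingertip direction."""
    dx, dy = fingertip[0] - knuckle[0], fingertip[1] - knuckle[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("knuckle and fingertip coincide; direction undefined")
    return fingertip, (dx / norm, dy / norm)

def angular_deviation(origin, direction, point):
    """Angle in radians between the ray and the origin -> point vector."""
    vx, vy = point[0] - origin[0], point[1] - origin[1]
    norm = math.hypot(vx, vy)
    if norm == 0:
        return 0.0
    cos_theta = (vx * direction[0] + vy * direction[1]) / norm
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def pick_target(knuckle, fingertip, box_centers):
    """Index of the candidate center with the smallest angular deviation."""
    origin, direction = pointing_ray(knuckle, fingertip)
    return min(range(len(box_centers)),
               key=lambda i: angular_deviation(origin, direction, box_centers[i]))
```

A purely angular score like this is exactly where small finger-pose errors matter most: a fixed angular error sweeps a larger image area at long range, which is consistent with the drift on distant targets that the paper describes.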

If this is right

  • Pointing accuracy improves most for distant or densely packed objects where ray drift was previously severe.
  • The model now produces explicit geometric features that link hand pose directly to target location.
  • All three components must act together to realize the full gain; isolated changes yield smaller benefits.
  • Qualitative results show reduced localization ambiguity in AR-style interaction scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ray-modeling approach could be tested on other directional gestures such as head or gaze pointing.
  • If micro-geometric relations matter in other detection subtasks, similar local-entity modules might help beyond pointing.
  • The explicit ray representation offers a possible interface for downstream planning modules that need a 3-D direction rather than a 2-D box.

Load-bearing premise

Global attention in Transformers is the main source of orientation errors in pointing tasks, and the three added modules will correct those errors without creating offsetting new failure modes.

What would settle it

A controlled ablation in which any one of LHEM, GRM, or OCAL is removed and the 14-point accuracy gain disappears or reverses on the same pointing-to-object test set.
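
The ablation described above can be organized as a small grid. The sketch below enumerates every on/off combination of the three modules and computes leave-one-out drops from a table of measured accuracies. It contains no results: `accuracy_by_config` is a hypothetical mapping that would have to be filled from actual training runs.

```python
from itertools import combinations

MODULES = ("LHEM", "GRM", "OCAL")

def ablation_configs(modules=MODULES):
    """All subsets of modules, ordered from the baseline (none enabled)
    to the full model (all enabled)."""
    configs = []
    for k in range(len(modules) + 1):
        for subset in combinations(modules, k):
            configs.append(frozenset(subset))
    return configs

def leave_one_out_drops(accuracy_by_config, modules=MODULES):
    """Accuracy lost when each module is removed from the full model.

    accuracy_by_config maps frozenset-of-enabled-modules -> measured accuracy.
    """
    full = frozenset(modules)
    return {m: accuracy_by_config[full] - accuracy_by_config[full - {m}]
            for m in modules}
```

With three modules the grid has eight configurations. A drop near zero for some module would suggest the reported gain does not depend on it.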

Figures

Figures reproduced from arXiv: 2606.24498 by Bowen Liu, Jiaqing Lyu, Ling Li, Xinkun Wu, Zhidong Deng, Zhizhen Cai, Ziyu Zhu.

Figure 1. Examples of pointing-to-object detection. The tar…
Figure 2. Overview of the VistaRef framework. Our model consists of four hierarchical modules: (1) Text-Guided Visual Aggre…
Figure 3. Illustration of the Geometric Ray Modeling (GRM) and Local Hand Entity Modeling (LHEM).
Figure 4. Visualization of grounding results. We visualize the results of VistaRef (top row) against several state-of-the-art…
Figure 5. More detailed visualization of fingertip localization.
Figure 6. Comparison results in salient-object scenarios. Our model remains robust to salient distractors and accurately localizes…
Figure 7. Comparison results in cluttered multi-object scenarios. Our model maintains reliable and accurate localization…
Figure 8. Visualization of attention maps for representative samples. VistaRef produces spatially aligned and direction-aware…
Figure 9. More visualization of attention maps for representative samples.
Abstract

Grounding deictic gestures in natural images is fundamental to AR and human-robot collaboration, providing a basis for seamless spatial interaction. While Transformer-based visual models have achieved significant progress in general object detection, their global attention mechanisms often neglect micro-geometric relationships, degrading orientation accuracy. In pointing tasks, this deficiency manifests as an inability to accurately capture the pointing ray implied by finger poses, which results in pointing drift and localization ambiguity when dealing with distant or densely packed objects. To address this, we propose VistaRef, a framework designed to explicitly enhance spatial orientation awareness. First, we develop the Local Hand Entity Modeling (LHEM) module, which incorporates hand-pose embeddings to strengthen the model's capability to capture subtle finger deviations. Second, drawing inspiration from multi-view geometry, we construct the Geometric Ray Modeling (GRM) module to transform implicit orientation information into explicit spatial geometric features, guiding feature aggregation and deep fusion via attention mechanisms. Furthermore, we introduce a novel Orientation-Consistent Alignment Loss (OCAL) to synergistically supervise hand presence and pointing consistency, ensuring that all architectural improvements collectively serve the core objective of spatial localization. Experimental results demonstrate that VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy. Qualitative analysis further confirms that VistaRef effectively models the geometric correlation from hand to target, bridging the spatial perception gap inherent in traditional Transformers for complex scenarios. Code: https://github.com/lingli1724/VistaRef.
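
One way to picture explicit ray geometry "guiding feature aggregation" through attention is an additive bias on the attention logits. The sketch below is an assumption-laden illustration, not the GRM module. Each image token has a 2-D position, and its logit is reduced in proportion to its distance from the pointing ray before the softmax. The `sharpness` parameter and all function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def distance_to_ray(origin, direction, point):
    """Euclidean distance from point to the ray origin + t * direction, t >= 0.
    direction is assumed to be unit length; points behind the origin are
    measured to the origin itself."""
    vx, vy = point[0] - origin[0], point[1] - origin[1]
    t = max(0.0, vx * direction[0] + vy * direction[1])
    cx, cy = origin[0] + t * direction[0], origin[1] + t * direction[1]
    return math.hypot(point[0] - cx, point[1] - cy)

def ray_biased_attention(logits, positions, origin, direction, sharpness=0.5):
    """Attention weights after subtracting a ray-distance penalty from each logit."""
    biased = [logit - sharpness * distance_to_ray(origin, direction, pos)
              for logit, pos in zip(logits, positions)]
    return softmax(biased)
```

With equal logits, tokens on the ray receive the most weight and tokens off the ray or behind the hand are suppressed, which is the qualitative behavior the attention-map figures are said to show.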

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VistaRef, a Transformer-based framework for pointing-to-object detection that augments the model with three components: the Local Hand Entity Modeling (LHEM) module to incorporate hand-pose embeddings, the Geometric Ray Modeling (GRM) module, inspired by multi-view geometry, to convert implicit orientation into explicit spatial geometric features, and the Orientation-Consistent Alignment Loss (OCAL) to supervise hand presence and pointing consistency. The central claim is that these additions address the neglect of micro-geometric relationships by global attention, yielding a 14-point absolute gain in grounding accuracy over the baseline, with qualitative confirmation of improved hand-to-target geometric correlation.

Significance. If the reported gain is robustly supported by ablations and protocol details, the work would be significant for AR and human-robot collaboration applications by improving spatial localization in deictic gesture grounding. The explicit modeling of hand-ray geometry via LHEM/GRM/OCAL is a targeted response to a known Transformer limitation, but the abstract supplies no experimental verification, which prevents assessment of whether the result holds or generalizes.

major comments (2)
  1. [Abstract] The claim that 'VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy' is presented without any experimental protocol, baseline details, dataset statistics, error bars, ablation results, or secondary metrics. This renders the central performance claim unverifiable and prevents attribution of the gain to LHEM, GRM, or OCAL rather than incidental capacity or training changes.
  2. [Abstract] The motivation assumes global attention is the primary source of orientation inaccuracy and that LHEM/GRM/OCAL will close this gap without new failure modes (e.g., sensitivity to hand-pose noise or degraded performance on non-pointing cases), yet no failure-case analysis, robustness tests, or cross-task evaluation is supplied to support this.
minor comments (1)
  1. [Abstract] The sentence 'Code: https://github.com/lingli1724/VistaRef' appears without indicating whether the repository contains the full experimental setup, training scripts, or evaluation code needed to reproduce the claimed results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying the experimental details available in the full paper while agreeing to strengthen the abstract for better verifiability.

  1. Referee: [Abstract] The claim that 'VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy' is presented without any experimental protocol, baseline details, dataset statistics, error bars, ablation results, or secondary metrics. This renders the central performance claim unverifiable and prevents attribution of the gain to LHEM, GRM, or OCAL rather than incidental capacity or training changes.

    Authors: We agree that the abstract's brevity omits key protocol details, which limits immediate verifiability. The full manuscript (Section 4) specifies the evaluation protocol on a standard pointing-to-object detection benchmark, the exact baseline architecture and training settings, dataset statistics, ablation studies isolating LHEM/GRM/OCAL contributions (with the 14-point gain in Acc@0.5 attributable to these modules rather than capacity changes), error bars from multiple runs, and secondary metrics such as precision-recall curves. We will revise the abstract to include a concise reference to the benchmark and baseline, and to note that full ablations and protocol appear in the Experiments section.

    revision: yes
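
For readers unfamiliar with the metric named in this response, the sketch below shows how Acc@0.5 and run-to-run error bars are typically computed. It assumes the usual visual-grounding convention that a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. The box format `(x1, y1, x2, y2)` and the function names are assumptions, and the paper's own evaluation code may differ.

```python
from statistics import mean, stdev

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(preds, targets, threshold=0.5):
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)

def summarize_runs(accuracies):
    """Mean and sample standard deviation over repeated training runs."""
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread
```

A "14-point absolute gain" in this metric means the difference between two such percentages, not a ratio. Whether it is meaningful depends on how it compares with the spread returned by `summarize_runs`.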

  2. Referee: [Abstract] The motivation assumes global attention is the primary source of orientation inaccuracy and that LHEM/GRM/OCAL will close this gap without new failure modes (e.g., sensitivity to hand-pose noise or degraded performance on non-pointing cases), yet no failure-case analysis, robustness tests, or cross-task evaluation is supplied to support this.

    Authors: The motivation is grounded in the Introduction's discussion of Transformer global attention limitations for micro-geometric relations, supported by citations to prior spatial reasoning work. The manuscript provides qualitative evidence of improved hand-to-target correlation and quantitative ablations showing gains without reported degradation on the evaluated pointing cases. However, explicit failure-case analysis for hand-pose noise, non-pointing scenarios, or cross-task generalization is not included in the current version. We can add a dedicated limitations discussion and robustness experiments during revision if required.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; VistaRef claims are empirical and self-contained


The paper describes an empirical architecture (LHEM hand-pose embeddings, GRM ray features, OCAL loss) motivated by Transformer limitations in pointing tasks, then reports a 14-point grounding accuracy gain from experiments. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The performance result is presented as an outcome of the added modules rather than reducing by construction to the input data or prior self-work. This is the normal case for an ML systems paper whose central claim rests on external benchmarks and ablations rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The abstract supplies no explicit free parameters, mathematical axioms, or new physical entities; the three modules are presented as engineering additions whose internal parameters and training details are not disclosed.

axioms (1)
  • domain assumption: Transformer global attention neglects micro-geometric relationships in hand poses
    Stated directly in the abstract as the motivation for the work.
invented entities (3)
  • Local Hand Entity Modeling (LHEM) module (no independent evidence)
    purpose: Incorporate hand-pose embeddings to capture finger deviations
    New architectural component introduced in the abstract; no independent evidence supplied.
  • Geometric Ray Modeling (GRM) module (no independent evidence)
    purpose: Transform implicit orientation into explicit spatial geometric features
    New architectural component introduced in the abstract; no independent evidence supplied.
  • Orientation-Consistent Alignment Loss (OCAL) (no independent evidence)
    purpose: Supervise hand presence and pointing consistency
    New loss function introduced in the abstract; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5816 in / 1391 out tokens · 30084 ms · 2026-06-26T00:57:19.967609+00:00 · methodology

