RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images
Pith reviewed 2026-05-10 00:53 UTC · model grok-4.3
The pith
A new dataset and SCS framework address scale variety so referring detection works in aerial images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the RefAerial dataset exposes the scale variety problem that causes existing ground referring detection approaches to degrade on aerial images. To address it, the paper introduces the SCS framework, which combines mixture-of-granularity attention for scale-comprehensive understanding with two-stage comprehensive-to-sensitive decoding for fine target localization, yielding strong performance on the aerial dataset and improvements on ground datasets as well.
What carries the argument
The SCS framework uses mixture-of-granularity attention to capture multi-scale information and a two-stage comprehensive-to-sensitive decoding strategy to progressively refine the referring target.
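The paper's text here gives no equations for the mixture-of-granularity attention, but the idea of attending over features pooled at several granularities can be pictured with a minimal sketch. Everything below — the pooling choice, the function names, and the single-query formulation — is an illustrative assumption, not the authors' implementation:

```python
import math

def avg_pool(grid, win):
    """Average-pool an HxW grid of feature vectors with a square window of size `win`."""
    H, W, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, H, win):
        row = []
        for j in range(0, W, win):
            block = [grid[y][x] for y in range(i, min(i + win, H))
                                for x in range(j, min(j + win, W))]
            row.append([sum(v[k] for v in block) / len(block) for k in range(d)])
        pooled.append(row)
    return pooled

def mog_attention(query, grid, granularities=(1, 2, 4)):
    """Attend from one text-query vector over keys pooled at several granularities.

    Returns a weighted sum of the multi-granularity keys — a crude stand-in for
    a 'scale-comprehensive' target representation.
    """
    keys = []
    for g in granularities:
        for row in avg_pool(grid, g):
            keys.extend(row)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * key[k] for w, key in zip(weights, keys)) for k in range(d)]
```

The point of the sketch is only that coarse pooled keys let large objects compete with fine-grained keys for small ones; the actual MoG module presumably learns its granularity mixture rather than fixing it.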
Load-bearing premise
The serious performance degradation of ground referring detection approaches on aerial images is caused by the intrinsic scale variety issue within or across the images.
What would settle it
An experiment where the proposed SCS framework fails to outperform standard referring detection methods adapted to the RefAerial dataset would falsify the effectiveness of the scale-comprehensive approach.
Original abstract
Referring detection aims to locate the target referred to by a natural-language expression, a task that has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It is distinguished from conventional ground referring detection datasets by four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad aerial scenes. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated annotation of referring pairs. In addition, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset because of the intrinsic scale variety within and across aerial images. We therefore propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. The mixture-of-granularity attention provides scale-comprehensive target understanding, while the two-stage comprehensive-to-sensitive decoding strategy performs coarse-to-fine decoding of the referring target. The proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even yields promising gains on conventional ground referring detection datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RefAerial, a new large-scale benchmark dataset for referring detection in aerial images that differs from ground-based datasets in four ways: low/diverse object-to-scene ratios, numerous targets and distractors, complex/fine-grained descriptions, and diverse aerial scenes. It also describes a human-in-the-loop REA-Engine for semi-automated annotation and proposes the SCS framework, which uses mixture-of-granularity (MoG) attention for scale-comprehensive target understanding and a two-stage comprehensive-to-sensitive (CtS) decoding strategy for coarse-to-fine localization. The central claims are that existing ground referring detectors suffer serious degradation on RefAerial due to intrinsic scale variety, and that SCS delivers remarkable gains on RefAerial while also improving performance on conventional ground datasets.
Significance. If the empirical claims are substantiated, the work supplies a needed aerial-specific benchmark and a scale-aware architecture that could improve referring detection for remote-sensing applications. The dataset's four distinguishing characteristics and the explicit design of MoG attention plus CtS decoding constitute a targeted response to a domain shift that has received little prior attention.
Major comments (3)
- [Abstract, §1] Abstract and §1: the assertion that ground-based methods exhibit 'serious performance degradation ... since the intrinsic scale variety issue' is not accompanied by any ablation or controlled comparison that isolates scale variety from the other three dataset differences (numerous distractors, complex descriptions, diverse scenes). Without such isolation, the motivation for MoG attention and CtS decoding as the necessary or optimal remedies remains under-supported.
- [§4, §5] §4 (method) and §5 (experiments): the paper must report quantitative baselines, error bars, and per-factor ablations on RefAerial that measure the contribution of MoG attention versus CtS decoding, as well as the effect of each when distractors or description complexity are controlled. The current central claim that SCS 'achieves remarkable performance' cannot be evaluated without these numbers.
- [§3] §3 (dataset): the four characteristics are presented as equally important, yet the framework is motivated almost exclusively by scale. A quantitative characterization (e.g., histograms of object-to-scene ratios, target counts, description length statistics) is required to justify why scale is treated as the dominant factor.
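The dataset characterization requested in the last comment is cheap to produce from box annotations. As a minimal sketch — the `(box_w, box_h, img_w, img_h)` tuple schema and bin edges are illustrative assumptions, not the RefAerial annotation format:

```python
def object_to_scene_ratios(annotations):
    """Box-area / image-area for (box_w, box_h, img_w, img_h) tuples (toy schema)."""
    return [(bw * bh) / (iw * ih) for bw, bh, iw, ih in annotations]

def ratio_histogram(ratios, edges=(0.001, 0.01, 0.1, 1.0)):
    """Count ratios falling at or below each successive bin edge (coarse, roughly log-spaced)."""
    counts = dict.fromkeys(edges, 0)
    for r in ratios:
        for e in edges:
            if r <= e:
                counts[e] += 1
                break
    return counts
```

The same pattern extends to target/distractor counts per image and referring-expression lengths, which together would make the case that scale is (or is not) the dominant factor.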
Minor comments (2)
- [Abstract] Abstract: the phrase 'promising performance boost on conventional ground referring detection datasets' should be replaced by concrete metrics and dataset names.
- [§4] Notation: define the granularity levels used inside MoG attention and the precise inputs/outputs of the two CtS stages before they are referenced in equations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where additional evidence is needed and outlining specific revisions to strengthen the empirical support and dataset characterization.
Point-by-point responses
-
Referee: [Abstract, §1] Abstract and §1: the assertion that ground-based methods exhibit 'serious performance degradation ... since the intrinsic scale variety issue' is not accompanied by any ablation or controlled comparison that isolates scale variety from the other three dataset differences (numerous distractors, complex descriptions, diverse scenes). Without such isolation, the motivation for MoG attention and CtS decoding as the necessary or optimal remedies remains under-supported.
Authors: We agree that a controlled isolation of scale variety would strengthen the motivation. Our initial experiments demonstrate substantial degradation of ground-based methods on RefAerial relative to their reported ground-image performance, and qualitative analysis of failure cases points to scale as a primary driver given the low and diverse object-to-scene ratios in aerial views. Nevertheless, we did not perform explicit ablations that hold distractor count and description complexity fixed while varying scale. In the revision we will add such controlled comparisons on subsets of RefAerial to better isolate the scale factor and justify the design choices of MoG attention and CtS decoding. revision: yes
-
Referee: [§4, §5] §4 (method) and §5 (experiments): the paper must report quantitative baselines, error bars, and per-factor ablations on RefAerial that measure the contribution of MoG attention versus CtS decoding, as well as the effect of each when distractors or description complexity are controlled. The current central claim that SCS 'achieves remarkable performance' cannot be evaluated without these numbers.
Authors: We accept that the current experimental section lacks the requested granularity. The manuscript reports aggregate results for the full SCS framework but does not break down the individual contributions of MoG attention and CtS decoding, nor does it include error bars or controlled ablations that vary distractor density or description complexity. We will revise §5 to include (i) error bars computed over multiple random seeds, (ii) component-wise ablations on RefAerial, and (iii) additional experiments that control for the number of distractors and the complexity of referring expressions while measuring the incremental gains from each module. revision: yes
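The seed-level aggregation promised in (i) amounts to a mean with an uncertainty interval. A minimal sketch, assuming a normal approximation over independent seeds (the 1.96 multiplier and the function name are illustrative choices, not the paper's protocol):

```python
from statistics import mean, stdev

def seed_summary(scores):
    """Mean and a 95% normal-approximation interval over per-seed metric values."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / len(scores) ** 0.5  # half-width of the interval
    return m, (m - half, m + half)
```

With few seeds a Student-t multiplier would be the more defensible choice; the point is only that each reported number carries an interval.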
-
Referee: [§3] §3 (dataset): the four characteristics are presented as equally important, yet the framework is motivated almost exclusively by scale. A quantitative characterization (e.g., histograms of object-to-scene ratios, target counts, description length statistics) is required to justify why scale is treated as the dominant factor.
Authors: All four characteristics are indeed distinctive of RefAerial, yet our architectural focus on scale follows from the observation that scale diversity directly affects both visual feature extraction and the coarse-to-fine localization process in ways that are less prevalent in ground-level datasets. To make this emphasis quantitative rather than qualitative, we will augment §3 with histograms and summary statistics for object-to-scene ratios, target and distractor counts per image, and referring-expression length distributions, thereby providing empirical grounding for treating scale variety as the dominant design driver. revision: yes
Circularity Check
No circularity: empirical benchmark and model design
Full rationale
The paper introduces a new dataset (RefAerial) with four listed distinguishing characteristics and proposes an SCS framework consisting of MoG attention plus CtS decoding. No equations, fitted parameters, or derivations are presented that reduce to prior inputs by construction. Performance claims are empirical results on the introduced benchmark and on existing ground datasets; the attribution of degradation to scale variety is an interpretive assumption rather than a self-referential prediction. No load-bearing self-citations or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard assumptions in deep learning for object detection and attention mechanisms hold for aerial imagery.