RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images
Pith reviewed 2026-05-10 00:53 UTC · model grok-4.3
The pith
A new dataset and SCS framework address scale variety so referring detection works in aerial images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the RefAerial dataset exposes the scale variety problem that causes existing ground referring detection approaches to degrade on aerial images. To address it, the paper introduces the SCS framework, which combines mixture-of-granularity attention for scale-comprehensive understanding with two-stage comprehensive-to-sensitive decoding for fine target localization, yielding strong performance on the aerial dataset and improvements on ground datasets as well.
What carries the argument
The SCS framework uses mixture-of-granularity attention to capture multi-scale information and a two-stage comprehensive-to-sensitive decoding strategy to progressively refine the referring target.
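The paper's text here gives no equations for the mixture-of-granularity attention, but the idea of attending over features pooled at several granularities can be pictured with a minimal sketch. Everything below — the pooling choice, the function names, and the single-query formulation — is an illustrative assumption, not the authors' implementation:

```python
import math

def avg_pool(grid, win):
    """Average-pool an HxW grid of feature vectors with a square window of size `win`."""
    H, W, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, H, win):
        row = []
        for j in range(0, W, win):
            block = [grid[y][x] for y in range(i, min(i + win, H))
                                for x in range(j, min(j + win, W))]
            row.append([sum(v[k] for v in block) / len(block) for k in range(d)])
        pooled.append(row)
    return pooled

def mog_attention(query, grid, granularities=(1, 2, 4)):
    """Attend from one text-query vector over keys pooled at several granularities.

    Returns a weighted sum of the multi-granularity keys — a crude stand-in for
    a 'scale-comprehensive' target representation.
    """
    keys = []
    for g in granularities:
        for row in avg_pool(grid, g):
            keys.extend(row)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * key[k] for w, key in zip(weights, keys)) for k in range(d)]
```

The point of the sketch is only that coarse pooled keys let large objects compete with fine-grained keys for small ones; the actual MoG module presumably learns its granularity mixture rather than fixing it.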
Load-bearing premise
The serious performance degradation of ground referring detection approaches on aerial images is caused by the intrinsic scale variety issue within or across the images.
What would settle it
An experiment where the proposed SCS framework fails to outperform standard referring detection methods adapted to the RefAerial dataset would falsify the effectiveness of the scale-comprehensive approach.
Original abstract
Referring detection aims to locate the target referred to by a natural-language expression, a task that has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It is distinguished from conventional ground referring detection datasets by four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad aerial scenes. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated annotation of referring pairs. In addition, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset because of the intrinsic scale variety within and across aerial images. We therefore propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. The mixture-of-granularity attention provides scale-comprehensive target understanding, while the two-stage comprehensive-to-sensitive decoding strategy performs coarse-to-fine decoding of the referring target. The proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even yields promising gains on conventional ground referring detection datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RefAerial, a new large-scale benchmark dataset for referring detection in aerial images that differs from ground-based datasets in four ways: low/diverse object-to-scene ratios, numerous targets and distractors, complex/fine-grained descriptions, and diverse aerial scenes. It also describes a human-in-the-loop REA-Engine for semi-automated annotation and proposes the SCS framework, which uses mixture-of-granularity (MoG) attention for scale-comprehensive target understanding and a two-stage comprehensive-to-sensitive (CtS) decoding strategy for coarse-to-fine localization. The central claims are that existing ground referring detectors suffer serious degradation on RefAerial due to intrinsic scale variety, and that SCS delivers remarkable gains on RefAerial while also improving performance on conventional ground datasets.
Significance. If the empirical claims are substantiated, the work supplies a needed aerial-specific benchmark and a scale-aware architecture that could improve referring detection for remote-sensing applications. The dataset's four distinguishing characteristics and the explicit design of MoG attention plus CtS decoding constitute a targeted response to a domain shift that has received little prior attention.
Major comments (3)
- [Abstract, §1] Abstract and §1: the assertion that ground-based methods exhibit 'serious performance degradation ... since the intrinsic scale variety issue' is not accompanied by any ablation or controlled comparison that isolates scale variety from the other three dataset differences (numerous distractors, complex descriptions, diverse scenes). Without such isolation, the motivation for MoG attention and CtS decoding as the necessary or optimal remedies remains under-supported.
- [§4, §5] §4 (method) and §5 (experiments): the paper must report quantitative baselines, error bars, and per-factor ablations on RefAerial that measure the contribution of MoG attention versus CtS decoding, as well as the effect of each when distractors or description complexity are controlled. The current central claim that SCS 'achieves remarkable performance' cannot be evaluated without these numbers.
- [§3] §3 (dataset): the four characteristics are presented as equally important, yet the framework is motivated almost exclusively by scale. A quantitative characterization (e.g., histograms of object-to-scene ratios, target counts, description length statistics) is required to justify why scale is treated as the dominant factor.
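The dataset characterization requested in the last comment is cheap to produce from box annotations. As a minimal sketch — the `(box_w, box_h, img_w, img_h)` tuple schema and bin edges are illustrative assumptions, not the RefAerial annotation format:

```python
def object_to_scene_ratios(annotations):
    """Box-area / image-area for (box_w, box_h, img_w, img_h) tuples (toy schema)."""
    return [(bw * bh) / (iw * ih) for bw, bh, iw, ih in annotations]

def ratio_histogram(ratios, edges=(0.001, 0.01, 0.1, 1.0)):
    """Count ratios falling at or below each successive bin edge (coarse, roughly log-spaced)."""
    counts = dict.fromkeys(edges, 0)
    for r in ratios:
        for e in edges:
            if r <= e:
                counts[e] += 1
                break
    return counts
```

The same pattern extends to target/distractor counts per image and referring-expression lengths, which together would make the case that scale is (or is not) the dominant factor.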
Minor comments (2)
- [Abstract] Abstract: the phrase 'promising performance boost on conventional ground referring detection datasets' should be replaced by concrete metrics and dataset names.
- [§4] Notation: define the granularity levels used inside MoG attention and the precise inputs/outputs of the two CtS stages before they are referenced in equations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where additional evidence is needed and outlining specific revisions to strengthen the empirical support and dataset characterization.
Point-by-point responses
-
Referee: [Abstract, §1] Abstract and §1: the assertion that ground-based methods exhibit 'serious performance degradation ... since the intrinsic scale variety issue' is not accompanied by any ablation or controlled comparison that isolates scale variety from the other three dataset differences (numerous distractors, complex descriptions, diverse scenes). Without such isolation, the motivation for MoG attention and CtS decoding as the necessary or optimal remedies remains under-supported.
Authors: We agree that a controlled isolation of scale variety would strengthen the motivation. Our initial experiments demonstrate substantial degradation of ground-based methods on RefAerial relative to their reported ground-image performance, and qualitative analysis of failure cases points to scale as a primary driver given the low and diverse object-to-scene ratios in aerial views. Nevertheless, we did not perform explicit ablations that hold distractor count and description complexity fixed while varying scale. In the revision we will add such controlled comparisons on subsets of RefAerial to better isolate the scale factor and justify the design choices of MoG attention and CtS decoding. revision: yes
-
Referee: [§4, §5] §4 (method) and §5 (experiments): the paper must report quantitative baselines, error bars, and per-factor ablations on RefAerial that measure the contribution of MoG attention versus CtS decoding, as well as the effect of each when distractors or description complexity are controlled. The current central claim that SCS 'achieves remarkable performance' cannot be evaluated without these numbers.
Authors: We accept that the current experimental section lacks the requested granularity. The manuscript reports aggregate results for the full SCS framework but does not break down the individual contributions of MoG attention and CtS decoding, nor does it include error bars or controlled ablations that vary distractor density or description complexity. We will revise §5 to include (i) error bars computed over multiple random seeds, (ii) component-wise ablations on RefAerial, and (iii) additional experiments that control for the number of distractors and the complexity of referring expressions while measuring the incremental gains from each module. revision: yes
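The seed-level aggregation promised in (i) amounts to a mean with an uncertainty interval. A minimal sketch, assuming a normal approximation over independent seeds (the 1.96 multiplier and the function name are illustrative choices, not the paper's protocol):

```python
from statistics import mean, stdev

def seed_summary(scores):
    """Mean and a 95% normal-approximation interval over per-seed metric values."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / len(scores) ** 0.5  # half-width of the interval
    return m, (m - half, m + half)
```

With few seeds a Student-t multiplier would be the more defensible choice; the point is only that each reported number carries an interval.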
-
Referee: [§3] §3 (dataset): the four characteristics are presented as equally important, yet the framework is motivated almost exclusively by scale. A quantitative characterization (e.g., histograms of object-to-scene ratios, target counts, description length statistics) is required to justify why scale is treated as the dominant factor.
Authors: All four characteristics are indeed distinctive of RefAerial, yet our architectural focus on scale follows from the observation that scale diversity directly affects both visual feature extraction and the coarse-to-fine localization process in ways that are less prevalent in ground-level datasets. To make this emphasis quantitative rather than qualitative, we will augment §3 with histograms and summary statistics for object-to-scene ratios, target and distractor counts per image, and referring-expression length distributions, thereby providing empirical grounding for treating scale variety as the dominant design driver. revision: yes
Circularity Check
No circularity: empirical benchmark and model design
Full rationale
The paper introduces a new dataset (RefAerial) with four listed distinguishing characteristics and proposes an SCS framework consisting of MoG attention plus CtS decoding. No equations, fitted parameters, or derivations are presented that reduce to prior inputs by construction. Performance claims are empirical results on the introduced benchmark and on existing ground datasets; the attribution of degradation to scale variety is an interpretive assumption rather than a self-referential prediction. No load-bearing self-citations or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard assumptions in deep learning for object detection and attention mechanisms hold for aerial imagery.