DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images

Chenyu Zhou; Di Huang; Jiaxin Chen; Ke Wu; Wenhao Li; Xinzhu Ma; Yanan Zhang; Yingjie Gao

arxiv: 2607.00338 · v1 · pith:Q3K3TFZUnew · submitted 2026-07-01 · 💻 cs.CV

Ke Wu, Yanan Zhang, Yingjie Gao, Wenhao Li, Chenyu Zhou, Xinzhu Ma, Jiaxin Chen, Di Huang

Pith reviewed 2026-07-02 15:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords drone object detection · parameter-efficient fine-tuning · vision-language models · UAV imagery · domain adaptation · HyperAdapter · SemanticGate

The pith

DroneFINE approaches full fine-tuning performance on drone images while training far fewer parameters

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that vision-language models can be adapted to aerial drone imagery through a specialized parameter-efficient fine-tuning method rather than full retraining. It identifies a core mismatch: pre-trained models expect ground-level scenes with prominent foreground objects, while drone views feature overhead perspectives, small targets, and background-heavy scenes. Standard PEFT approaches fall short because their fixed structures cannot accommodate this shift. DroneFINE introduces two targeted modules to close the gap, claiming results on standard UAV datasets that rival complete fine-tuning at a fraction of the parameter cost.

Core claim

DroneFINE introduces HyperAdapter, a data-dependent foreground-aware multi-path adaptation mechanism that relaxes the static constraints of conventional PEFT, together with SemanticGate, a text-conditioned background suppression strategy that uses background vocabulary to steer the model away from irrelevant regions. Together these modules enable VLMs to handle the domain shift in UAV imagery, delivering detection performance on VisDrone and UAVDT that surpasses existing PEFT baselines and approaches the accuracy of full fine-tuning while training far fewer parameters.
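
To make the idea of text-conditioned background suppression more concrete, here is a minimal Python sketch. It is not the paper's SemanticGate algorithm, whose details are not given on this page. It assumes, purely for illustration, that each image region and each vocabulary term is a small embedding vector, and it suppresses regions that match background terms at least as well as foreground terms. The function names (`cosine`, `suppress_background`), the `margin` parameter, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def suppress_background(regions, fg_terms, bg_terms, margin=0.0):
    """Return a binary gate per region (hypothetical rule, not the paper's).

    A region gets gate 1.0 (keep) when its best foreground similarity exceeds
    its best background similarity by more than `margin`, else 0.0 (suppress).
    """
    gates = []
    for r in regions:
        fg = max(cosine(r, t) for t in fg_terms)
        bg = max(cosine(r, t) for t in bg_terms)
        gates.append(1.0 if fg - bg > margin else 0.0)
    return gates

# Toy 2-D embeddings (assumed): axis 0 ~ "vehicle-like", axis 1 ~ "road-like".
fg_terms = [[1.0, 0.0]]             # stands in for a class name such as "car"
bg_terms = [[0.0, 1.0]]             # stands in for a background word such as "road"
regions = [[0.9, 0.1], [0.2, 0.8]]  # one vehicle-like region, one road-like region
gates = suppress_background(regions, fg_terms, bg_terms)
```

In this toy setup the first region is kept and the second is suppressed. A real detector would more plausibly use soft, learned gating over query or token features rather than a hard threshold.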

What carries the argument

HyperAdapter and SemanticGate modules that supply dynamic, foreground-aware adaptation and text-guided background suppression inside VLM-based detectors.

If this is right

UAV object detection becomes practical without retraining every parameter in a large VLM.
Text-conditioned background suppression can reduce false positives in open aerial scenes.
Data-dependent multi-path adaptation can extend the reach of PEFT beyond its usual limits.
Detection systems for dynamic environments can maintain high accuracy at lower training cost.
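
To give a rough sense of what "lower training cost" can mean in parameter terms, the sketch below counts the trainable parameters of a standard bottleneck adapter (down-projection followed by up-projection) and compares them with a single dense layer of the same width. The hidden width of 768 and bottleneck of 64 are illustrative assumptions, not values from the paper, and HyperAdapter's actual parameter count is not reported on this page.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

def bottleneck_adapter_params(d_model, r):
    """Down-projection (d_model -> r) plus up-projection (r -> d_model)."""
    return linear_params(d_model, r) + linear_params(r, d_model)

# Illustrative sizes only (assumed, not taken from the paper).
d_model, r = 768, 64
adapter = bottleneck_adapter_params(d_model, r)  # 99,136 parameters
dense = linear_params(d_model, d_model)          # 590,592 parameters
ratio = adapter / dense                          # roughly 0.17
```

Under these assumed sizes, one adapter trains about a sixth of the parameters of a single dense layer, and the frozen backbone contributes none.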

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same domain-aware modules could be tested on other overhead imaging tasks, such as satellite imagery.
Reduced parameter counts may allow on-device adaptation of detectors directly aboard UAVs.
Combining the modules with quantization or pruning could push efficiency gains further in resource-limited settings.

Load-bearing premise

The natural-scene priors in VLMs create a mismatch with drone imagery that static PEFT structures cannot resolve, but the proposed modules can.

What would settle it

If side-by-side experiments on VisDrone and UAVDT show DroneFINE failing to exceed other PEFT methods or to reach parity with full fine-tuning, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2607.00338 by Chenyu Zhou, Di Huang, Jiaxin Chen, Ke Wu, Wenhao Li, Xinzhu Ma, Yanan Zhang, Yingjie Gao.

**Figure 2.** Framework of DroneFINE. The model, based on GroundingDINO, comprises five parts: (i) frozen visual and text backbones for feature extraction, (ii) projectors for cross-modal alignment, (iii) a detection decoder and head, (iv) HyperAdapter as a serial adapter for the visual backbone, and (v) SemanticGate for enhancing GroundingDINO’s language-guided query selection.

**Figure 3.** (L) Architecture of HyperAdapter: Similar to a conventional visual adapter, the HyperAdapter is positioned after the attention and MLP layers. It introduces a foreground-aware multi-path convolution generation mechanism (highlighted in red in the figure) into the adapter. This mechanism utilizes queries to aggregate foreground features, which are then used to generate convolutions. These convolutions, in …

**Figure 4.** Detection Results and Attention Map on VisDrone.

read the original abstract

Object detection for Unmanned Aerial Vehicles (UAVs) working in open and dynamic environments is a highly challenging task. While Vision-Language Models (VLMs) have offered a powerful solution for universal object detection, adapting them to UAV scenarios remains non-trivial due to a substantial domain gap between VLM pre-training data and aerial imagery. The prevailing Parameter-Efficient Fine-Tuning (PEFT) methods prove ineffective in bridging this gap, as VLMs' "natural-scene, foreground-dominant" visual priors misalign with the "bird's-eye-view, background-dominant, small-object" characteristics of UAV data. To address this issue, we propose DroneFINE, a novel PEFT paradigm comprising two domain-aware complementary modules tailored for VLM-based drone image detectors. Specifically, a data-dependent, foreground-aware, and multi-path adaptation mechanism named HyperAdapter is designed, which overcomes the static structural constraints of PEFT. In addition, a background suppression algorithm named SemanticGate is developed. It is a text-conditioned guidance strategy that employs background vocabulary to actively guide the model in suppressing responses from irrelevant regions. Extensive experiments on VisDrone and UAVDT demonstrate that DroneFINE significantly outperforms existing PEFT methods and achieves performance comparable to full fine-tuning while substantially reducing the number of trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DroneFINE adds two targeted modules to PEFT for drone detection but the abstract supplies no numbers so the performance claims stay unverified here.

read the letter

The paper's main move is to treat the domain gap between natural-scene VLMs and bird's-eye drone images as a structural problem rather than a generic adaptation issue. It introduces HyperAdapter, a data-dependent multi-path mechanism meant to handle foreground awareness, and SemanticGate, a text-conditioned way to suppress background responses using background vocabulary. These are presented as complementary fixes that overcome the static limits of standard PEFT.

What is actually new is the specific pairing of these two modules for UAV detection. The motivation section does a clean job explaining why natural-scene priors clash with background-dominant, small-object aerial data, and the modules are designed to address that mismatch directly. The claim that the approach reaches near full fine-tuning performance with far fewer parameters is the central result, resting on experiments on VisDrone and UAVDT.

The soft spot is the complete absence of any metrics, baselines, or ablation numbers in the abstract. Without those, it is impossible to judge whether the modules deliver the stated gains or whether the baselines were implemented fairly. The stress-test note indicates the full paper contains the comparisons and that the design rationale aligns with the results, so there is no obvious internal contradiction or circularity. Still, the strength of the paper hinges on whether those tables hold up under scrutiny for controls and reproducibility.

This is incremental but practical work aimed at people doing efficient adaptation of VLMs for aerial or similar domain-shifted detection tasks. A reader already working on PEFT for detection would get concrete design ideas from the module descriptions. It is coherent on its own terms and shows honest engagement with the literature on domain gaps, so it deserves a serious referee to check the experiments rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes DroneFINE, a domain-aware PEFT method for adapting vision-language models to UAV object detection. It argues that standard PEFT fails due to misalignment between natural-scene foreground-dominant priors and UAV bird's-eye-view, background-dominant, small-object characteristics. The method introduces HyperAdapter (data-dependent multi-path adaptation) and SemanticGate (text-conditioned background suppression using background vocabulary). Experiments on VisDrone and UAVDT are claimed to show that DroneFINE outperforms existing PEFT baselines, approaches full fine-tuning performance, and uses far fewer trainable parameters.

Significance. If the experimental comparisons hold under fair and reproducible conditions, the work provides a practical, parameter-efficient route for specializing large VLMs to aerial imagery, which is valuable for UAV applications where compute is limited. The explicit targeting of domain-specific visual priors via HyperAdapter and SemanticGate offers a concrete design pattern that could generalize to other domain shifts in detection tasks.

minor comments (2)

[Abstract] The claims of 'significantly outperforms existing PEFT methods' and 'performance comparable to full fine-tuning' are stated without any numerical values, mAP deltas, or parameter counts; adding one or two headline metrics would make the abstract self-contained.
The description of HyperAdapter as overcoming 'static structural constraints of PEFT' would be strengthened by an explicit side-by-side comparison (text or diagram) of its data-dependent multi-path routing versus a standard adapter block.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of DroneFINE and for recommending minor revision. The provided report contains no enumerated major comments, so we have no specific points to rebut or revise at this stage. We remain available to address any minor issues or clarifications the editor may request.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on empirical comparisons of the proposed DroneFINE (with HyperAdapter and SemanticGate) against PEFT baselines and full fine-tuning on the external public datasets VisDrone and UAVDT. No derivation chain, mathematical prediction, or fitted parameter is presented that reduces to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The method design is motivated by stated domain gaps but validated externally through benchmark results rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no mathematical derivations, fitted parameters, background axioms, or new postulated entities are described.

[1] [1]

MMDetection: Open MMLab Detection Toolbox and Benchmark

Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906

[2] [2]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Chen, Z., Zeng, Y., Chen, Z., Gao, H., Chen, L., Liu, J., Zhao, F.: Vfm-adapter: Adapting visual foundation models for dense prediction with dynamic hybrid oper- ation mapping. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2385–2393 (2025)

2025

[3] [3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16901–16911 (2024)

2024

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

Du, B., Huang, Y., Chen, J., Huang, D.: Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 13435–13444 (2023)

2023

[5] [5]

In: Proceedings of the European conference on computer vision (ECCV)

Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., Tian, Q.: The unmanned aerial vehicle benchmark: Object detection and tracking. In: Proceedings of the European conference on computer vision (ECCV). pp. 370–386 (2018)

2018

[6] [6]

In: Proceedings of the IEEE/CVF international conference on computer vision workshops

Du, D., Zhu, P., Wen, L., Bian, X., Lin, H., Hu, Q., Peng, T., Zheng, J., Wang, X., Zhang, Y., et al.: Visdrone-det2019: The vision meets drone object detection in image challenge results. In: Proceedings of the IEEE/CVF international conference on computer vision workshops. pp. 0–0 (2019)

2019

[7] [7]

Advances in Neural Information Processing Systems38, 95228– 95251 (2026)

Gao, Y., Zhang, Y., Cai, Z., Huang, D.: Test-time adaptive object detection with foundation model. Advances in Neural Information Processing Systems38, 95228– 95251 (2026)

2026

[8] [8]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Gao, Y., Zhang, Y., Huang, Z., Liu, N., Huang, D.: Ps-ttl: Prototype-based soft- labels and test-time learning for few-shot object detection. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 8691–8700 (2024)

2024

[9] [9]

arXiv preprint arXiv:2302.07937 (2023)

Giannou, A., Rajput, S., Papailiopoulos, D.: The expressive power of tuning only the normalization layers. arXiv preprint arXiv:2302.07937 (2023)

work page arXiv 2023

[10] [10]

arXiv preprint arXiv:2204.13653 (2022)

Gupta, T., Marten, R., Kembhavi, A., Hoiem, D.: Grit: General robust image task benchmark. arXiv preprint arXiv:2204.13653 (2022)

work page arXiv 2022

[11] [11]

HyperNetworks

Ha, D., Dai, A., Le, Q.V.: Hypernetworks. arXiv preprint arXiv:1609.09106 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

Neural computation 9(8), 1735–1780 (1997)

Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

1997

[13] [13]

In: International conference on machine learning

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Ges- mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International conference on machine learning. pp. 2790–2799. PMLR (2019)

2019

[14] [14]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

2022

[15] [15]

arXiv preprint arXiv:2505.05741 (2025)

Hu, Z., Wu, P., Chen, J., Zhu, H., Wang, Y., Peng, Y., Li, H., Sun, X.: Dome- detr: Detr with density-oriented feature-query manipulation for efficient tiny object detection. arXiv preprint arXiv:2505.05741 (2025)

work page arXiv 2025

[16] [16]

In: Proceedings of the AAAI conference on artificial intelligence

Huang, Y., Chen, J., Huang, D.: Ufpmp-det: Toward accurate and efficient object detection on drone imagery. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 1026–1033 (2022) DroneFINE 17

2022

[17] [17]

In: European conference on computer vision

Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European conference on computer vision. pp. 709–727. Springer (2022)

2022

[18] [18]

arXiv preprint arXiv:2207.07039 (2022)

Jie, S., Deng, Z.H.: Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039 (2022)

work page arXiv 2022

[19] [19]

com/ultralytics/ultralytics

Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics yolov8 (2023),https://github. com/ultralytics/ultralytics

2023

[20] [20]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, C., Zhao, R., Wang, Z., Xu, H., Zhu, X.: Remdet: Rethinking efficient model de- sign for uav object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4643–4651 (2025)

2025

[21] [21]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)

2022

[22] [22]

Li, M., Chen, J., Feng, W., Li, B., Dai, F., Zhao, S., He, Q.: Hyperlora: Parameter- efficientadaptivegenerationforportraitsynthesis.In:ProceedingsoftheComputer Vision and Pattern Recognition Conference. pp. 13114–13123 (2025)

2025

[23] Li, N., Ye, M., Zhou, L., Tang, S., Gan, Y., Liang, Z., Zhu, X.: Self-prompting analogical reasoning for uav object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 18412–18420 (2025)

[24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)

[25] Liu, K., Fu, Z., Jin, S., Chen, Z., Zhou, F., Jiang, R., Chen, Y., Ye, J.: Esod: efficient small object detection on high-resolution images. IEEE Transactions on Image Processing (2024)

[26] Liu, L., Wang, N., Chen, C., Liu, D., Yang, X., Gao, X., Liu, T.: Frequency-based comprehensive prompt learning for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

[27] Liu, L., Wang, N., Yang, X., Gao, X., Liu, T.: Surrogate prompt learning: Towards efficient and diverse prompt learning for vision-language models. In: Forty-second International Conference on Machine Learning (2025)

[28] Liu, M., Hayes, T.L., Ricci, E., Csurka, G., Volpi, R.: Shine: Semantic hierarchy nexus for open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16634–16644 (2024)

[29] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

[30] Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., Wu, Q.: Aerialvln: Vision-and-language navigation for uavs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15384–15394 (2023)

[31] Lv, C., Li, L., Zhang, S., Chen, G., Qi, F., Zhang, N., Zheng, H.T.: Hyperlora: Efficient cross-task generalization via constrained low-rank adapters generation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 16376–16393 (2024)

[32] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

[33] Pan, J., Liu, Y., Fu, Y., Ma, M., Li, J., Paudel, D.P., Van Gool, L., Huang, X.: Locate anything on earth: Advancing open-vocabulary object detection for remote sensing community. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 6281–6289 (2025)

[34] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[35] Ren, T., Chen, Y., Jiang, Q., Zeng, Z., Xiong, Y., Liu, W., Ma, Z., Shen, J., Gao, Y., Jiang, X., et al.: Dino-x: A unified vision model for open-world object detection and understanding. arXiv preprint arXiv:2411.14347 (2024)

[36] Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8430–8439 (2019)

[37] Tian, Y., Lin, F., Li, Y., Zhang, T., Zhang, Q., Fu, X., Huang, J., Dai, X., Wang, Y., Tian, C., et al.: Uavs meet llms: Overviews and perspectives towards agentic low-altitude mobility. Information Fusion 122, 103158 (2025)

[38] Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D.: V3det: Vast vocabulary visual detection dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19844–19854 (2023)

[39] Wang, X., Yang, D., Wang, Z., Kwan, H., Chen, J., Wu, W., Li, H., Liao, Y., Liu, S.: Towards realistic uav vision-language navigation: Platform, benchmark, and methodology. arXiv preprint arXiv:2410.07087 (2024)

[40] Wei, G., Yuan, X., Liu, Y., Shang, Z., Xue, X., Wang, P., Yao, K., Zhao, C., Zhang, H., Xiao, R.: Rt-ovad: Real-time open-vocabulary aerial object detection via image-text collaboration. arXiv preprint arXiv:2408.12246 (2024)

[41] Yang, C., Huang, Z., Wang, N.: Querydet: Cascaded sparse query for accelerating high-resolution small object detection. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 13668–13677 (2022)

[42] Yang, F., Fan, H., Chu, P., Blasch, E., Ling, H.: Clustered object detection in aerial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8311–8320 (2019)

[43] Yin, D., Hu, L., Li, B., Zhang, Y., Yang, X.: 5%>100%: Breaking performance shackles of full fine-tuning on visual recognition tasks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20071–20081 (2025)

[44] Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations (2023)

[45] Zhang, W., Gao, C., Yu, S., Peng, R., Zhao, B., Zhang, Q., Cui, J., Chen, X., Li, Y.: Citynavagent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory. arXiv preprint arXiv:2505.05622 (2025)

[46] Zhao, X., Chen, Y., Xu, S., Li, X., Wang, X., Li, Y., Huang, H.: An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361 (2024)

[47] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International Journal of Computer Vision 130(9), 2337–2348 (2022)

[48] Zhou, Y., Li, J., Ou, C., Yan, D., Zhang, H., Xue, X.: Open-vocabulary object detection in uav imagery: A review and future perspectives. Drones 9(8), 557 (2025)