pith. sign in

arxiv: 2607.00338 · v1 · pith:Q3K3TFZUnew · submitted 2026-07-01 · 💻 cs.CV

DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images

Pith reviewed 2026-07-02 15:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords drone object detectionparameter-efficient fine-tuningvision-language modelsUAV imagerydomain adaptationHyperAdapterSemanticGate
0
0 comments X

The pith

DroneFINE matches full fine-tuning performance on drone images using far fewer trainable parameters

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that vision-language models can be adapted to aerial drone imagery through a specialized parameter-efficient fine-tuning method rather than full retraining. It identifies a core mismatch: pre-trained models expect ground-level scenes with prominent foreground objects, while drone views feature overhead perspectives, small targets, and background-heavy scenes. Standard PEFT approaches fall short because their fixed structures cannot accommodate this shift. DroneFINE introduces two targeted modules to close the gap, claiming results on standard UAV datasets that rival complete fine-tuning at a fraction of the parameter cost.

Core claim

DroneFINE introduces HyperAdapter, a data-dependent foreground-aware multi-path adaptation mechanism that relaxes the static constraints of conventional PEFT, together with SemanticGate, a text-conditioned background suppression strategy that uses background vocabulary to steer the model away from irrelevant regions. Together these modules enable VLMs to handle the domain shift in UAV imagery, delivering detection performance on VisDrone and UAVDT that surpasses existing PEFT baselines and approaches the accuracy of full fine-tuning while training far fewer parameters.

What carries the argument

HyperAdapter and SemanticGate modules that supply dynamic, foreground-aware adaptation and text-guided background suppression inside VLM-based detectors.

If this is right

  • UAV object detection becomes practical without retraining every parameter in a large VLM.
  • Text-conditioned background suppression can reduce false positives in open aerial scenes.
  • Data-dependent multi-path adaptation can extend the reach of PEFT beyond its usual limits.
  • Detection systems for dynamic environments can maintain high accuracy at lower training cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same domain-aware modules could be tested on other overhead imaging tasks such as satellite or medical aerial views.
  • Reduced parameter counts may allow on-device adaptation of detectors directly aboard UAVs.
  • Combining the modules with quantization or pruning could push efficiency gains further in resource-limited settings.

Load-bearing premise

The natural-scene priors in VLMs create a mismatch with drone imagery that static PEFT structures cannot resolve, but the proposed modules can.

What would settle it

If side-by-side experiments on VisDrone and UAVDT show DroneFINE failing to exceed other PEFT methods or to reach parity with full fine-tuning, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2607.00338 by Chenyu Zhou, Di Huang, Jiaxin Chen, Ke Wu, Wenhao Li, Xinzhu Ma, Yanan Zhang, Yingjie Gao.

Figure 1
Figure 1. Figure 1: Analysis of Domain Gap and Adaptation Bottlenecks for UAVs. (a) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework of DroneFINE. The model based on GroundingDINO comprises five parts: (i) frozen visual and text backbones for feature extraction, (ii) projectors for cross-modal alignment, (iii) a detection decoder and head, (iv) HyperAdapter as a serial adapter for the visual backbone, and (v) SemanticGate for enhanced GroundingDINO’s language-guided query selection. 3 Methodology 3.1 Framework Overview As illu… view at source ↗
Figure 3
Figure 3. Figure 3: (L)Architecture of HyperAdapter: Similar to a conventional visual adapter, the HyperAdapter is positioned after the attention and MLP layers. It intro￾duces a foreground-aware multi-path convolution generation mechanism (highlighted in red in the figure) into the adapter. This mechanism utilizes queries to aggregate foreground features, which are then used to generate convolutions. These convolu￾tions, in … view at source ↗
Figure 4
Figure 4. Figure 4: Detection Results and Attention Map on VisDrone. (Left) [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Object detection for Unmanned Aerial Vehicles (UAVs) working in open and dynamic environments is a highly challenging task. While Vision-Language Models (VLMs) have offered a powerful solution for universal object detection, adapting them to UAV scenarios remains non-trivial due to a substantial domain gap between VLM pre-training data and aerial imagery. The prevailing Parameter-Efficient Fine-Tuning (PEFT) methods prove ineffective in bridging this gap, as VLMs' "natural-scene, foreground-dominant" visual priors misalign with the "bird's-eye-view, background-dominant, small-object" characteristics of UAV data. To address this issue, we propose DroneFINE, a novel PEFT paradigm comprising two domain-aware complementary modules tailored for VLM-based drone image detectors. Specifically, a data-dependent, foreground-aware, and multi-path adaptation mechanism named HyperAdapter is designed, which overcomes the static structural constraints of PEFT. In addition, a background suppression algorithm named SemanticGate is developed. It is a text-conditioned guidance strategy that employs background vocabulary to actively guide the model in suppressing responses from irrelevant regions. Extensive experiments on VisDrone and UAVDT demonstrate that DroneFINE significantly outperforms existing PEFT methods and achieves performance comparable to full fine-tuning while substantially reducing the number of trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes DroneFINE, a domain-aware PEFT method for adapting vision-language models to UAV object detection. It argues that standard PEFT fails due to misalignment between natural-scene foreground-dominant priors and UAV bird's-eye-view, background-dominant, small-object characteristics. The method introduces HyperAdapter (data-dependent multi-path adaptation) and SemanticGate (text-conditioned background suppression using background vocabulary). Experiments on VisDrone and UAVDT are claimed to show that DroneFINE outperforms existing PEFT baselines, approaches full fine-tuning performance, and uses far fewer trainable parameters.

Significance. If the experimental comparisons hold under fair and reproducible conditions, the work provides a practical, parameter-efficient route for specializing large VLMs to aerial imagery, which is valuable for UAV applications where compute is limited. The explicit targeting of domain-specific visual priors via HyperAdapter and SemanticGate offers a concrete design pattern that could generalize to other domain shifts in detection tasks.

minor comments (2)
  1. [Abstract] Abstract: the claim of 'significantly outperforms existing PEFT methods' and 'performance comparable to full fine-tuning' is stated without any numerical values, mAP deltas, or parameter counts; adding one or two headline metrics would make the abstract self-contained.
  2. The description of HyperAdapter as overcoming 'static structural constraints of PEFT' would be strengthened by an explicit side-by-side comparison (text or diagram) of its data-dependent multi-path routing versus a standard adapter block.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of DroneFINE and for recommending minor revision. The provided report contains no enumerated major comments, so we have no specific points to rebut or revise at this stage. We remain available to address any minor issues or clarifications the editor may request.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on empirical comparisons of the proposed DroneFINE (with HyperAdapter and SemanticGate) against PEFT baselines and full fine-tuning on the external public datasets VisDrone and UAVDT. No derivation chain, mathematical prediction, or fitted parameter is presented that reduces to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The method design is motivated by stated domain gaps but validated externally through benchmark results rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no mathematical derivations, fitted parameters, background axioms, or new postulated entities are described.

pith-pipeline@v0.9.1-grok · 5786 in / 1150 out tokens · 50506 ms · 2026-07-02T15:21:20.145520+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

  2. [2]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Chen, Z., Zeng, Y., Chen, Z., Gao, H., Chen, L., Liu, J., Zhao, F.: Vfm-adapter: Adapting visual foundation models for dense prediction with dynamic hybrid oper- ation mapping. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2385–2393 (2025)

  3. [3]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16901–16911 (2024)

  4. [4]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Du, B., Huang, Y., Chen, J., Huang, D.: Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 13435–13444 (2023)

  5. [5]

    In: Proceedings of the European conference on computer vision (ECCV)

    Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., Tian, Q.: The unmanned aerial vehicle benchmark: Object detection and tracking. In: Proceedings of the European conference on computer vision (ECCV). pp. 370–386 (2018)

  6. [6]

    In: Proceedings of the IEEE/CVF international conference on computer vision workshops

    Du, D., Zhu, P., Wen, L., Bian, X., Lin, H., Hu, Q., Peng, T., Zheng, J., Wang, X., Zhang, Y., et al.: Visdrone-det2019: The vision meets drone object detection in image challenge results. In: Proceedings of the IEEE/CVF international conference on computer vision workshops. pp. 0–0 (2019)

  7. [7]

    Advances in Neural Information Processing Systems38, 95228– 95251 (2026)

    Gao, Y., Zhang, Y., Cai, Z., Huang, D.: Test-time adaptive object detection with foundation model. Advances in Neural Information Processing Systems38, 95228– 95251 (2026)

  8. [8]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Gao, Y., Zhang, Y., Huang, Z., Liu, N., Huang, D.: Ps-ttl: Prototype-based soft- labels and test-time learning for few-shot object detection. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 8691–8700 (2024)

  9. [9]

    arXiv preprint arXiv:2302.07937 (2023)

    Giannou, A., Rajput, S., Papailiopoulos, D.: The expressive power of tuning only the normalization layers. arXiv preprint arXiv:2302.07937 (2023)

  10. [10]

    arXiv preprint arXiv:2204.13653 (2022)

    Gupta, T., Marten, R., Kembhavi, A., Hoiem, D.: Grit: General robust image task benchmark. arXiv preprint arXiv:2204.13653 (2022)

  11. [11]

    HyperNetworks

    Ha, D., Dai, A., Le, Q.V.: Hypernetworks. arXiv preprint arXiv:1609.09106 (2016)

  12. [12]

    Neural computation 9(8), 1735–1780 (1997)

    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

  13. [13]

    In: International conference on machine learning

    Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Ges- mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International conference on machine learning. pp. 2790–2799. PMLR (2019)

  14. [14]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  15. [15]

    arXiv preprint arXiv:2505.05741 (2025)

    Hu, Z., Wu, P., Chen, J., Zhu, H., Wang, Y., Peng, Y., Li, H., Sun, X.: Dome- detr: Detr with density-oriented feature-query manipulation for efficient tiny object detection. arXiv preprint arXiv:2505.05741 (2025)

  16. [16]

    In: Proceedings of the AAAI conference on artificial intelligence

    Huang, Y., Chen, J., Huang, D.: Ufpmp-det: Toward accurate and efficient object detection on drone imagery. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 1026–1033 (2022) DroneFINE 17

  17. [17]

    In: European conference on computer vision

    Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European conference on computer vision. pp. 709–727. Springer (2022)

  18. [18]

    arXiv preprint arXiv:2207.07039 (2022)

    Jie, S., Deng, Z.H.: Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039 (2022)

  19. [19]

    com/ultralytics/ultralytics

    Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics yolov8 (2023),https://github. com/ultralytics/ultralytics

  20. [20]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Li, C., Zhao, R., Wang, Z., Xu, H., Zhu, X.: Remdet: Rethinking efficient model de- sign for uav object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4643–4651 (2025)

  21. [21]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)

  22. [22]

    Li, M., Chen, J., Feng, W., Li, B., Dai, F., Zhao, S., He, Q.: Hyperlora: Parameter- efficientadaptivegenerationforportraitsynthesis.In:ProceedingsoftheComputer Vision and Pattern Recognition Conference. pp. 13114–13123 (2025)

  23. [23]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Li, N., Ye, M., Zhou, L., Tang, S., Gan, Y., Liang, Z., Zhu, X.: Self-prompting ana- logical reasoning for uav object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 18412–18420 (2025)

  24. [24]

    In: Proceedings of the European Conference on Computer Vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)

  25. [25]

    IEEE Transactions on Image Processing (2024)

    Liu, K., Fu, Z., Jin, S., Chen, Z., Zhou, F., Jiang, R., Chen, Y., Ye, J.: Esod: efficient small object detection on high-resolution images. IEEE Transactions on Image Processing (2024)

  26. [26]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Liu, L., Wang, N., Chen, C., Liu, D., Yang, X., Gao, X., Liu, T.: Frequency-based comprehensive prompt learning for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  27. [27]

    In: Forty-second International Conference on Machine Learning (2025)

    Liu, L., Wang, N., Yang, X., Gao, X., Liu, T.: Surrogate prompt learning: Towards efficient and diverse prompt learning for vision-language models. In: Forty-second International Conference on Machine Learning (2025)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu, M., Hayes, T.L., Ricci, E., Csurka, G., Volpi, R.: Shine: Semantic hierarchy nexus for open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16634–16644 (2024)

  29. [29]

    In: European conference on computer vision

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

  30. [30]

    In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

    Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., Wu, Q.: Aerialvln: Vision-and- language navigation for uavs. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 15384–15394 (2023)

  31. [31]

    In: Findings of the Association for Computational Linguistics: EMNLP 2024

    Lv, C., Li, L., Zhang, S., Chen, G., Qi, F., Zhang, N., Zheng, H.T.: Hyperlora: Efficient cross-task generalization via constrained low-rank adapters generation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 16376–16393 (2024)

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predic- tive coding. arXiv preprint arXiv:1807.03748 (2018)

  33. [33]

    sensing community

    Pan, J., Liu, Y., Fu, Y., Ma, M., Li, J., Paudel, D.P., Van Gool, L., Huang, X.: Locate anything on earth: Advancing open-vocabulary object detection for remote 18 Wu et al. sensing community. In: Proceedings of the AAAI Conference on Artificial Intelli- gence. vol. 39, pp. 6281–6289 (2025)

  34. [34]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  35. [35]

    Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

    Ren, T., Chen, Y., Jiang, Q., Zeng, Z., Xiong, Y., Liu, W., Ma, Z., Shen, J., Gao, Y., Jiang, X., et al.: Dino-x: A unified vision model for open-world object detection and understanding. arXiv preprint arXiv:2411.14347 (2024)

  36. [36]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8430–8439 (2019)

  37. [37]

    Information Fusion122, 103158 (2025)

    Tian, Y., Lin, F., Li, Y., Zhang, T., Zhang, Q., Fu, X., Huang, J., Dai, X., Wang, Y., Tian, C., et al.: Uavs meet llms: Overviews and perspectives towards agentic low-altitude mobility. Information Fusion122, 103158 (2025)

  38. [38]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D.: V3det: Vast vocabulary visual detection dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19844–19854 (2023)

  39. [39]

    arXiv preprint arXiv:2410.07087 (2024)

    Wang, X., Yang, D., Wang, Z., Kwan, H., Chen, J., Wu, W., Li, H., Liao, Y., Liu, S.: Towards realistic uav vision-language navigation: Platform, benchmark, and methodology. arXiv preprint arXiv:2410.07087 (2024)

  40. [40]

    arXiv preprint arXiv:2408.12246 (2024)

    Wei, G., Yuan, X., Liu, Y., Shang, Z., Xue, X., Wang, P., Yao, K., Zhao, C., Zhang, H., Xiao, R.: Rt-ovad: Real-time open-vocabulary aerial object detection via image-text collaboration. arXiv preprint arXiv:2408.12246 (2024)

  41. [41]

    In: Proceedings of the IEEE/CVF Confer- ence on computer vision and pattern recognition

    Yang, C., Huang, Z., Wang, N.: Querydet: Cascaded sparse query for accelerating high-resolution small object detection. In: Proceedings of the IEEE/CVF Confer- ence on computer vision and pattern recognition. pp. 13668–13677 (2022)

  42. [42]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Yang, F., Fan, H., Chu, P., Blasch, E., Ling, H.: Clustered object detection in aerial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8311–8320 (2019)

  43. [43]

    In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

    Yin, D., Hu, L., Li, B., Zhang, Y., Yang, X.: 5%> 100%: Breaking performance shackles of full fine-tuning on visual recognition tasks. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 20071–20081 (2025)

  44. [44]

    In: The Eleventh International Conference on Learning Representations

    Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In: The Eleventh International Conference on Learning Representations

  45. [45]

    arXiv preprint arXiv:2505.05622 (2025)

    Zhang, W., Gao, C., Yu, S., Peng, R., Zhao, B., Zhang, Q., Cui, J., Chen, X., Li, Y.: Citynavagent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory. arXiv preprint arXiv:2505.05622 (2025)

  46. [46]

    arXiv preprint arXiv:2401.02361 (2024)

    Zhao, X., Chen, Y., Xu, S., Li, X., Wang, X., Li, Y., Huang, H.: An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361 (2024)

  47. [47]

    International Journal of Computer Vision130(9), 2337–2348 (2022)

    Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International Journal of Computer Vision130(9), 2337–2348 (2022)

  48. [48]

    Drones9(8), 557 (2025)

    Zhou, Y., Li, J., Ou, C., Yan, D., Zhang, H., Xue, X.: Open-vocabulary object detection in uav imagery: A review and future perspectives. Drones9(8), 557 (2025)