DroneIQA-VLE: Multi-Task Drone Image Quality Assessment via Vision-Language Ensemble

Guangtao Zhai; Hongjian Zhan; Mingkai Lu; Wei Sun; Weixia Zhang; Yixuan Gao

arxiv: 2607.00416 · v1 · pith:QTD4GAF4new · submitted 2026-07-01 · 💻 cs.CV

Wei Sun, Weixia Zhang, Hongjian Zhan, Mingkai Lu, Yixuan Gao, Guangtao Zhai

Pith reviewed 2026-07-02 15:18 UTC · model grok-4.3

keywords drone image quality assessment · vision-language ensemble · multi-task regression · target-aware IQA · UAV images · LoRA adaptation · quality score prediction · ensemble averaging

The pith

An ensemble of a vision encoder pipeline and a language model pipeline predicts quality scores for drone images by averaging their outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system that combines two different approaches to assess the quality of images taken by drones. One approach uses vision models to directly predict scores for the whole image, the target area, and the background. The other uses a large language model adapted with additional training to do the same. By taking the average of the global scores from both, the method aims to improve accuracy. This matters because good quality assessment can help in selecting or improving drone-captured images for various applications like surveillance or mapping.

Core claim

The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: SigLIP2 vision encoders with multi-task regression heads, and a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. The method achieves 2nd place in the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images.

What carries the argument

Arithmetic averaging of global quality outputs from a SigLIP2 multi-task regression pipeline and a LoRA-adapted multimodal LLM pipeline, which fuses the two to produce the final global score while also generating target and background scores.
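The fusion step described here is small enough to write out. The following is a minimal sketch, not the authors' released code: the function name `fuse_predictions` and the dictionary keys are hypothetical, and because the abstract only states that the global score is averaged, passing the target and background scores through from the vision pipeline is an assumption made purely for illustration.

```python
from typing import Dict

# Hypothetical container: one quality score per task.
Scores = Dict[str, float]  # assumed keys: "global", "target", "background"


def fuse_predictions(vision: Scores, mllm: Scores) -> Scores:
    """Average the global score of the two pipelines.

    Assumption: the source only says the *global* score is averaged.
    Taking target/background from the vision pipeline is an illustrative
    choice, not documented behaviour of the paper's system.
    """
    return {
        "global": 0.5 * (vision["global"] + mllm["global"]),
        "target": vision["target"],
        "background": vision["background"],
    }


# Made-up scores for a single image.
vision_out = {"global": 3.0, "target": 4.0, "background": 2.5}
mllm_out = {"global": 4.0, "target": 3.5, "background": 3.0}

fused = fuse_predictions(vision_out, mllm_out)
print(fused["global"])  # 3.5
```

The average carries no tunable parameter, which is why the report later asks for evidence that it beats either pipeline alone.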

If this is right

The system produces separate quality scores for the main target object and the surrounding background in addition to the overall image score.
Both a pure vision regression approach and a multimodal language model approach can be adapted to handle target-aware quality assessment on low-altitude UAV imagery.
Simple arithmetic averaging of the two pipelines is sufficient to reach competitive standing in the Drone-IQA challenge.
The joint multi-task setup allows a single model to output the three related quality scores without separate passes.
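The last point, one forward pass yielding three scores, can be illustrated with a toy structure. This is a sketch under stated assumptions: the class `MultiTaskHeads`, the plain linear heads, and the two-dimensional feature vector are all hypothetical stand-ins. The paper's actual head design on top of SigLIP2 is not given in the abstract.

```python
from typing import Dict, List, Sequence

TASKS = ("global", "target", "background")


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class MultiTaskHeads:
    """Three linear regression heads that share one feature vector.

    Hypothetical stand-in for multi-task regression heads; the shared
    features would come from an image encoder in a real system.
    """

    def __init__(self, weights: Dict[str, List[float]], biases: Dict[str, float]):
        self.weights = weights
        self.biases = biases

    def predict(self, features: Sequence[float]) -> Dict[str, float]:
        # The encoder output is computed once and reused by every head.
        return {t: dot(self.weights[t], features) + self.biases[t] for t in TASKS}


# Made-up parameters and features, chosen only to keep the arithmetic simple.
heads = MultiTaskHeads(
    weights={"global": [1.0, 0.5], "target": [0.0, 1.0], "background": [2.0, 0.0]},
    biases={"global": 0.0, "target": 0.5, "background": -1.0},
)
scores = heads.predict([1.0, 2.0])
print(scores)  # {'global': 2.0, 'target': 2.5, 'background': 1.0}
```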

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the vision pipeline and language-model pipeline tend to err on different kinds of drone images, their average could reduce variance even if neither is individually superior.
The same two-pipeline structure might transfer to quality assessment in other aerial or remote-sensing domains where both local objects and overall scene matter.
Replacing the fixed average with a small learned fusion layer could be tested to see whether it captures any systematic difference in the two pipelines' strengths.
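The variance argument in the first point above can be made explicit. The derivation below is a sketch that assumes each pipeline's error on the global score is zero-mean; the symbols $e_v$, $e_m$ (errors of the vision and language-model pipelines), $\sigma_v^2$, $\sigma_m^2$ (their variances), and $\rho$ (their correlation) are introduced here for illustration and do not appear in the paper.

```latex
\begin{align}
  e &= \tfrac{1}{2}\,(e_v + e_m), \\
  \operatorname{Var}(e)
    &= \tfrac{1}{4}\left(\sigma_v^2 + \sigma_m^2 + 2\rho\,\sigma_v\sigma_m\right).
\end{align}
% Special case of equal variances, \sigma_v = \sigma_m = \sigma:
\begin{equation}
  \operatorname{Var}(e) = \tfrac{1}{2}\,\sigma^2\,(1 + \rho) \;\le\; \sigma^2 .
\end{equation}
```

With equal variances the average is never worse than either pipeline and improves as $\rho$ falls below 1. With unequal variances the average beats the better pipeline (say $\sigma_v \le \sigma_m$) only when $\tfrac{1}{4}(\sigma_v^2 + \sigma_m^2 + 2\rho\,\sigma_v\sigma_m) < \sigma_v^2$, so a much weaker second pipeline can make the fixed average worse than the stronger one alone.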

Load-bearing premise

That arithmetic averaging of the two independent pipelines yields a reliably better global quality score. The paper gives no evidence of complementary error patterns and no validation that this fusion outperforms either pipeline alone or alternative fusion methods.

What would settle it

A direct comparison on the challenge test set in which the averaged global score does not exceed the score of the stronger individual pipeline, or in which a different fusion rule such as a weighted sum or learned combiner yields higher accuracy.
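One way such a comparison could be run is a sweep over the weight of a weighted sum, with the fixed average as the special case w = 0.5. The sketch below is hypothetical: the helper names, the use of mean absolute error as the criterion, and all numbers are assumptions for illustration, and the toy data are deliberately built so the two pipelines err in opposite directions.

```python
from typing import List, Sequence, Tuple


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predictions and reference scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def blend(a: Sequence[float], b: Sequence[float], w: float) -> List[float]:
    """Weighted sum w*a + (1-w)*b; w = 0.5 is the plain arithmetic average."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]


def sweep_weights(
    a: Sequence[float], b: Sequence[float], target: Sequence[float], steps: int = 10
) -> List[Tuple[float, float]]:
    """Return (weight, MAE) pairs for weights 0, 1/steps, ..., 1."""
    return [(k / steps, mae(blend(a, b, k / steps), target)) for k in range(steps + 1)]


# Made-up validation data: pipe_a overestimates, pipe_b underestimates.
target = [1.0, 2.0, 3.0, 4.0]
pipe_a = [1.5, 2.5, 3.5, 4.5]
pipe_b = [0.5, 1.5, 2.5, 3.5]

results = sweep_weights(pipe_a, pipe_b, target)
best_w, best_mae = min(results, key=lambda r: r[1])
print(best_w, best_mae)  # 0.5 0.0 on this constructed example
```

On real challenge data the best weight need not be 0.5. If the minimum sits at 0 or 1, the stronger pipeline alone matches or beats the ensemble, which is the outcome that would undercut the premise. The weight should be chosen on held-out data rather than the test set.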

Original abstract

We present DroneIQA-VLE, our solution to the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images. The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: (1) SigLIP2 vision encoders with multi-task regression heads, and (2) a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. Our method achieves 2nd place in the challenge, demonstrating its effectiveness. The code is available at https://github.com/sunwei925/DroneIQA-VLE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A standard competition entry that ensembles SigLIP2 and LoRA-Qwen for drone IQA gets 2nd place but shows no evidence the averaging step improves results.

This paper presents DroneIQA-VLE as a solution for the ICME 2026 Drone-IQA Grand Challenge. It uses two pipelines: one with SigLIP2 vision encoders and multi-task regression heads to predict global, target, and background quality scores, and another with a LoRA-adapted Qwen3.5-9B multimodal LLM for regression. The global score comes from averaging the two.

Nothing fundamentally new is introduced. Both components are established models, and the approach is standard fine-tuning plus simple averaging. The novelty is limited to applying them to target-aware quality assessment on low-altitude UAV images.

The paper does well by releasing the code on GitHub, which allows others to check the implementation. Achieving 2nd place shows the pipeline works in practice for the challenge.

The soft spots are clear. There are no ablations showing that the arithmetic average outperforms either pipeline alone or alternative fusion methods. No error correlation analysis is mentioned, so we can't tell if the models complement each other. The abstract lacks training details, metrics, or error analysis, making it hard to assess the claims fully. The stress-test note is accurate here; the averaging step's value isn't demonstrated.

This work is mainly for participants in image quality assessment challenges or researchers focused on drone vision applications. A general reader in computer vision won't get new insights or methods from it.

I would not bring this to a reading group unless the group is covering the specific challenge. I would not cite it in my own work. It might deserve peer review if the full manuscript includes the missing experiments and details, but on the current evidence it's a borderline case for a serious referee.

Referee Report

2 major / 1 minor

Summary. The paper presents DroneIQA-VLE, an ensemble framework for target-aware image quality assessment on low-altitude UAV images. It combines (1) SigLIP2 vision encoders with multi-task regression heads predicting global, target, and background scores and (2) a LoRA-adapted Qwen3.5-9B multimodal LLM for the same regression tasks. The final global score is produced by arithmetic averaging of the two pipelines. The work reports achieving 2nd place in the ICME 2026 Drone-IQA Grand Challenge and releases code at the cited GitHub repository.

Significance. If the performance claims and the value of the averaging step can be substantiated, the approach would illustrate a practical way to combine pure-vision and vision-language models for multi-task drone IQA. The public code release is a concrete strength that enables direct reproducibility and further analysis by the community.

major comments (2)

[Abstract] The claim that the ensemble 'achieves 2nd place' is stated without any quantitative metrics (PLCC, SRCC, MAE, or challenge-specific scores), leaderboard details, training protocols, or validation splits. This absence makes the central performance assertion unverifiable from the manuscript.
[Abstract] The final global prediction is obtained by arithmetic averaging, yet the manuscript supplies no ablation results comparing the averaged output against either pipeline run independently, no error-correlation analysis between the SigLIP2 and Qwen3.5-9B components, and no comparison to alternative fusion strategies. Consequently the specific contribution of the averaging step to the reported ranking cannot be isolated.
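For readers unfamiliar with the metrics named in the first comment, the following sketch computes PLCC (Pearson linear correlation), SRCC (Spearman rank correlation, here as Pearson correlation of average ranks), and MAE in plain Python. The function names and toy scores are hypothetical, and the challenge's official scoring script may differ, for example in how the metrics are combined.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def plcc(pred: Sequence[float], mos: Sequence[float]) -> float:
    return pearson(pred, mos)


def srcc(pred: Sequence[float], mos: Sequence[float]) -> float:
    return pearson(average_ranks(pred), average_ranks(mos))


def mae(pred: Sequence[float], mos: Sequence[float]) -> float:
    return sum(abs(p - m) for p, m in zip(pred, mos)) / len(mos)


# Made-up opinion scores and predictions for five images.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.2, 1.9, 3.5, 3.4, 4.8]
print(round(srcc(pred, mos), 3), round(mae(pred, mos), 3))  # 0.9 0.32
```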

minor comments (1)

[Abstract] The manuscript provides no implementation hyperparameters (learning rates, LoRA rank, batch sizes, or loss weights) for either pipeline, which would be needed for exact replication even with the released code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to improve verifiability and substantiate the ensemble contribution.

Referee: [Abstract] The claim that the ensemble 'achieves 2nd place' is stated without any quantitative metrics (PLCC, SRCC, MAE, or challenge-specific scores), leaderboard details, training protocols, or validation splits. This absence makes the central performance assertion unverifiable from the manuscript.

Authors: We agree that the abstract lacks the supporting quantitative details. In the revised manuscript we will expand the abstract (and add a corresponding results section or table) to report the challenge-specific PLCC, SRCC, MAE values, the exact leaderboard position with reference to the official ranking, the training/validation splits used, and the training protocols for both pipelines.

revision: yes

Referee: [Abstract] The final global prediction is obtained by arithmetic averaging, yet the manuscript supplies no ablation results comparing the averaged output against either pipeline run independently, no error-correlation analysis between the SigLIP2 and Qwen3.5-9B components, and no comparison to alternative fusion strategies. Consequently the specific contribution of the averaging step to the reported ranking cannot be isolated.

Authors: We acknowledge the absence of these ablations. In the revision we will add an ablation study that reports the performance of each pipeline in isolation, the arithmetic average, and at least one alternative fusion strategy (e.g., learned weighting). We will also include a brief error-correlation analysis between the two components to quantify complementarity.

revision: yes
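The error-correlation analysis promised in the second response could take the following form: compute each pipeline's residuals against the reference scores and correlate them. This is a sketch under assumptions, with hypothetical helper names and made-up numbers; it is not drawn from the authors' code or results.

```python
import math
from typing import List, Sequence


def residuals(pred: Sequence[float], mos: Sequence[float]) -> List[float]:
    """Signed prediction errors (prediction minus reference score)."""
    return [p - m for p, m in zip(pred, mos)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant residuals")
    return cov / math.sqrt(vx * vy)


# Made-up global scores for four images; the two pipelines are built to
# err in opposite directions on every image.
mos = [2.0, 3.0, 4.0, 5.0]
vision_pred = [2.5, 2.5, 4.5, 4.5]  # residuals +0.5, -0.5, +0.5, -0.5
mllm_pred = [1.5, 3.5, 3.5, 5.5]    # residuals -0.5, +0.5, -0.5, +0.5

rho = pearson(residuals(vision_pred, mos), residuals(mllm_pred, mos))
print(rho)  # -1.0 for this constructed case
```

A residual correlation near +1 would mean both pipelines make the same mistakes and averaging gains little; values near 0 or below would support the complementarity that the ensemble design presumes.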

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivations or self-referential predictions

The manuscript describes an applied ensemble for a competition task: two separate models (SigLIP2 regression heads and LoRA-adapted LLM) whose outputs are averaged for the global score. No equations, no parameter fitting presented as a 'prediction,' and no self-citation chains are invoked to justify any step. The averaging is a fixed post-hoc fusion rule, not a derived quantity that reduces to its own inputs. The work is therefore self-contained as an empirical engineering report and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no theoretical derivations, new axioms, free parameters, or invented entities; it is a description of an empirical ML system.
