DroneIQA-VLE: Multi-Task Drone Image Quality Assessment via Vision-Language Ensemble
Pith reviewed 2026-07-02 15:18 UTC · model grok-4.3
The pith
An ensemble of a vision encoder pipeline and a language model pipeline predicts quality scores for drone images by averaging their outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: SigLIP2 vision encoders with multi-task regression heads, and a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. The method achieves 2nd place in the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images.
What carries the argument
Arithmetic averaging of the global quality outputs from two pipelines: a SigLIP2 multi-task regression pipeline and a LoRA-adapted multimodal LLM pipeline. The average is taken as the final global score; the framework also outputs target and background scores.
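As a minimal sketch of that fusion step, assuming each pipeline has already produced one global score per image: the function name and the sample scores below are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def average_global_scores(
    vision_scores: Sequence[float], mllm_scores: Sequence[float]
) -> List[float]:
    """Arithmetic mean of the two pipelines' global scores, image by image."""
    if len(vision_scores) != len(mllm_scores):
        raise ValueError("both pipelines must score the same images")
    return [(v + m) / 2.0 for v, m in zip(vision_scores, mllm_scores)]


# Hypothetical global scores for three images (illustrative values only).
siglip2_global = [3.0, 4.5, 2.0]
qwen_global = [4.0, 3.5, 2.0]
fused_global = average_global_scores(siglip2_global, qwen_global)
```

The rule has no trainable parameters, which is why its benefit depends entirely on how the two pipelines' errors relate to each other.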
If this is right
- The system produces separate quality scores for the main target object and the surrounding background in addition to the overall image score.
- Both a pure vision regression approach and a multimodal language model approach can be adapted to handle target-aware quality assessment on low-altitude UAV imagery.
- Simple arithmetic averaging of the two pipelines is sufficient to reach competitive standing in the Drone-IQA challenge.
- The joint multi-task setup allows a single model to output the three related quality scores without separate passes.
Where Pith is reading between the lines
- If the vision pipeline and language-model pipeline tend to err on different kinds of drone images, their average could reduce variance even if neither is individually superior.
- The same two-pipeline structure might transfer to quality assessment in other aerial or remote-sensing domains where both local objects and overall scene matter.
- Replacing the fixed average with a small learned fusion layer could be tested to see whether it captures any systematic difference in the two pipelines' strengths.
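The variance argument in the first bullet can be made explicit under assumptions the paper does not verify. Write the two pipelines' global-score errors as $e_1$ and $e_2$, and take them to be zero-mean with standard deviations $\sigma_1$, $\sigma_2$ and correlation $\rho$. These symbols are illustrative and do not come from the paper.

```latex
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
  = \frac{\sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2}{4}

% Equal error levels, \sigma_1 = \sigma_2 = \sigma:
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
  = \frac{(1 + \rho)\,\sigma^2}{2} \;\le\; \sigma^2

% Unequal error levels, \rho = 0 and \sigma_2 = 2\sigma_1:
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
  = \frac{\sigma_1^2 + 4\sigma_1^2}{4}
  = 1.25\,\sigma_1^2 \;>\; \sigma_1^2
```

With equal error levels the plain average helps whenever $\rho < 1$. When one pipeline is much noisier, the plain average can be worse than the stronger pipeline alone, which is the case where a weighted or learned fusion would matter.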
Load-bearing premise
The premise is that arithmetic averaging of the two independent pipelines yields a reliably better global quality score. The abstract gives no evidence of complementary error patterns and no validation that this fusion outperforms either pipeline alone or alternative fusion methods.
What would settle it
A direct comparison on the challenge test set in which the averaged global score does not exceed the score of the stronger individual pipeline, or in which a different fusion rule such as a weighted sum or learned combiner yields higher accuracy.
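Such a comparison could be sketched as follows, assuming per-image ground-truth scores and per-pipeline predictions are available. PLCC and SRCC are used because the referee report names them; the challenge's official metric is not stated in the abstract. All function names and sample values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(ranks(x), ranks(y))


def weighted_fusion(a: Sequence[float], b: Sequence[float], w: float) -> List[float]:
    """w * a + (1 - w) * b, element by element."""
    return [w * p + (1.0 - w) * q for p, q in zip(a, b)]


def compare_fusions(
    mos: Sequence[float],
    vision: Sequence[float],
    mllm: Sequence[float],
    weights: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, Tuple[float, float]]:
    """Map each weight to (PLCC, SRCC).

    w = 1.0 is the vision pipeline alone, w = 0.0 is the MLLM pipeline
    alone, and w = 0.5 is the plain arithmetic average.
    """
    report = {}
    for w in weights:
        fused = weighted_fusion(vision, mllm, w)
        report[w] = (pearson(fused, mos), spearman(fused, mos))
    return report


# Hypothetical scores for five images (illustrative values only).
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
vision = [1.2, 2.5, 2.4, 4.1, 4.6]
mllm = [0.8, 1.9, 3.6, 3.5, 5.2]
report = compare_fusions(mos, vision, mllm)
```

Any weight other than 0.5 would need to be chosen on a held-out validation split rather than on the test set, otherwise the comparison favors the tuned rule by construction.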
Original abstract
We present DroneIQA-VLE, our solution to the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images. The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: (1) SigLIP2 vision encoders with multi-task regression heads, and (2) a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. Our method achieves 2nd place in the challenge, demonstrating its effectiveness. The code is available at https://github.com/sunwei925/DroneIQA-VLE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DroneIQA-VLE, an ensemble framework for target-aware image quality assessment on low-altitude UAV images. It combines (1) SigLIP2 vision encoders with multi-task regression heads predicting global, target, and background scores and (2) a LoRA-adapted Qwen3.5-9B multimodal LLM for the same regression tasks. The final global score is produced by arithmetic averaging of the two pipelines. The work reports achieving 2nd place in the ICME 2026 Drone-IQA Grand Challenge and releases code at the cited GitHub repository.
Significance. If the performance claims and the value of the averaging step can be substantiated, the approach would illustrate a practical way to combine pure-vision and vision-language models for multi-task drone IQA. The public code release is a concrete strength that enables direct reproducibility and further analysis by the community.
major comments (2)
- [Abstract] The claim that the ensemble 'achieves 2nd place' is stated without any quantitative metrics (PLCC, SRCC, MAE, or challenge-specific scores), leaderboard details, training protocols, or validation splits. This absence makes the central performance assertion unverifiable from the manuscript.
- [Abstract] The final global prediction is obtained by arithmetic averaging, yet the manuscript supplies no ablation results comparing the averaged output against either pipeline run independently, no error-correlation analysis between the SigLIP2 and Qwen3.5-9B components, and no comparison to alternative fusion strategies. Consequently the specific contribution of the averaging step to the reported ranking cannot be isolated.
minor comments (1)
- [Abstract] The manuscript provides no implementation hyperparameters (learning rates, LoRA rank, batch sizes, or loss weights) for either pipeline, which would be needed for exact replication even with the released code.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to improve verifiability and substantiate the ensemble contribution.
Point-by-point responses
- Referee: [Abstract] The claim that the ensemble 'achieves 2nd place' is stated without any quantitative metrics (PLCC, SRCC, MAE, or challenge-specific scores), leaderboard details, training protocols, or validation splits. This absence makes the central performance assertion unverifiable from the manuscript.
  Authors: We agree that the abstract lacks the supporting quantitative details. In the revised manuscript we will expand the abstract (and add a corresponding results section or table) to report the challenge-specific PLCC, SRCC, and MAE values, the exact leaderboard position with reference to the official ranking, the training/validation splits used, and the training protocols for both pipelines.
  Revision: yes
- Referee: [Abstract] The final global prediction is obtained by arithmetic averaging, yet the manuscript supplies no ablation results comparing the averaged output against either pipeline run independently, no error-correlation analysis between the SigLIP2 and Qwen3.5-9B components, and no comparison to alternative fusion strategies. Consequently the specific contribution of the averaging step to the reported ranking cannot be isolated.
  Authors: We acknowledge the absence of these ablations. In the revision we will add an ablation study that reports the performance of each pipeline in isolation, the arithmetic average, and at least one alternative fusion strategy (e.g., learned weighting). We will also include a brief error-correlation analysis between the two components to quantify complementarity.
  Revision: yes
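One simple form such an error-correlation analysis could take is the Pearson correlation between the two pipelines' residuals against the ground-truth scores. This is a sketch under that assumption, not the authors' procedure; the function names and sample values are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def residuals(pred: Sequence[float], mos: Sequence[float]) -> List[float]:
    """Per-image signed error: prediction minus ground-truth score."""
    return [p - m for p, m in zip(pred, mos)]


def error_correlation(
    vision: Sequence[float], mllm: Sequence[float], mos: Sequence[float]
) -> float:
    """Pearson correlation between the two pipelines' residuals.

    Values near +1 mean the pipelines fail on the same images in the same
    direction, so averaging gains little; values near 0 or below mean the
    errors partly cancel when averaged.
    """
    e1, e2 = residuals(vision, mos), residuals(mllm, mos)
    n = len(mos)
    m1, m2 = sum(e1) / n, sum(e2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(e1, e2))
    v1 = sum((a - m1) ** 2 for a in e1)
    v2 = sum((b - m2) ** 2 for b in e2)
    if v1 == 0 or v2 == 0:
        raise ValueError("correlation is undefined for constant residuals")
    return cov / sqrt(v1 * v2)


# Hypothetical extreme case: the two pipelines err in opposite directions.
mos = [3.0, 4.0, 2.0, 5.0]
vision = [3.5, 3.5, 2.5, 4.5]
mllm = [2.5, 4.5, 1.5, 5.5]
rho = error_correlation(vision, mllm, mos)
```

In this constructed example the residuals are exact mirror images, so the plain average reproduces the ground truth; real pipelines would fall somewhere between this case and fully shared errors.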
Circularity Check
No circularity: empirical pipeline with no derivations or self-referential predictions
The manuscript describes an applied ensemble for a competition task: two separate models (SigLIP2 regression heads and LoRA-adapted LLM) whose outputs are averaged for the global score. No equations, no parameter fitting presented as a 'prediction,' and no self-citation chains are invoked to justify any step. The averaging is a fixed post-hoc fusion rule, not a derived quantity that reduces to its own inputs. The work is therefore self-contained as an empirical engineering report and receives the default non-circularity finding.