Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Junlin Li; Kanglong Fan; Li Zhang; Tianwu Zhi; Wen Wen; Xinge Peng; Yabin Zhang; Yang Li; Yiting Liao

arxiv: 2509.25787 · v5 · pith:5F2MVLYBnew · submitted 2025-09-30 · 💻 cs.CV

classification 💻 cs.CV

keywords: evoquality, models, quality, assessment, benchmarks, capabilities, framework, image

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Code and checkpoints will be available at https://github.com/bytedance/EvoQuality.
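
The voting-then-reward loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and the abstract does not give the exact formulas, so the following are assumptions made for illustration: the VLM is sampled several times per image to produce scalar quality scores; votes are counted over all cross-sample comparisons; the reward uses the standard fidelity measure from learning-to-rank, sqrt(p·q) + sqrt((1−p)(1−q)); and the GRPO step is reduced to its group-relative reward normalization. All function names (`majority_vote`, `fidelity_reward`, `group_relative_advantages`) are hypothetical.

```python
import math
from itertools import product
from statistics import mean, pstdev


def majority_vote(scores_a, scores_b):
    """Pseudo-label for 'image A is better than image B'.

    Assumption: each list holds scalar quality scores sampled from the
    same VLM. Returns 1.0 (A wins), 0.0 (B wins), or 0.5 (tie).
    """
    wins_a = sum(1 for sa, sb in product(scores_a, scores_b) if sa > sb)
    wins_b = sum(1 for sa, sb in product(scores_a, scores_b) if sa < sb)
    if wins_a > wins_b:
        return 1.0
    if wins_b > wins_a:
        return 0.0
    return 0.5


def fidelity_reward(p_pred, p_pseudo):
    """Fidelity between a predicted preference probability and a pseudo-label.

    Assumed form: sqrt(p*q) + sqrt((1-p)*(1-q)); equals 1.0 when both agree.
    """
    p = min(max(p_pred, 0.0), 1.0)    # clamp to guard against rounding drift
    q = min(max(p_pseudo, 0.0), 1.0)
    return math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical sampled scores for two images (no ground-truth labels).
    scores_a = [4.0, 4.2, 3.9]
    scores_b = [3.0, 3.5, 4.1]
    label = majority_vote(scores_a, scores_b)

    # Hypothetical per-rollout probabilities that A is better than B.
    rollout_probs = [0.9, 0.6, 0.2]
    rewards = [fidelity_reward(p, label) for p in rollout_probs]
    print(label, rewards, group_relative_advantages(rewards))
```

In this sketch, rollouts that agree with the voted consensus receive a higher reward and therefore a positive group-relative advantage, which is the signal a policy-gradient update would reinforce.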

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

A new dual-protocol expert benchmark for image aesthetics is fused into ground truth and used to self-distill a VLM, raising SRCC from 0.504 to 0.709 across categories while matching closed-source performance.

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 conditional novelty 7.0

PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer...

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
cs.LG 2026-03 unverdicted novelty 7.0

SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.