arxiv: 2602.03558 · v2 · submitted 2026-02-03 · 💻 cs.CV · cs.AI· cs.MM

ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

Xinyue Li , Zhiming Xu , Min Tang , Zhaolin Cai , Sijing Wu , Xiongkuo Min , Yitong Chen , Guangtao Zhai This is my paper

Pith reviewed 2026-05-16 08:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.MM

keywords label-free quality assessmentAI-generated imagesprompt-image alignmentinstruction tuningAIGC distortionsevolving generative modelsmultimodal criticgeneralization to UGC

0 comments

The pith

ELIQ assesses quality of evolving AI-generated images without human labels by automatically building positive and aspect-specific negative pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative image models improve so fast that fixed human quality labels quickly become outdated. ELIQ solves this by automatically creating positive image pairs and negative pairs tuned to specific quality aspects like distortions or alignment issues. These pairs train a multimodal model through instruction tuning to judge new images on two axes without any new annotations. The result is a critic that works across different benchmarks and even on regular user photos. This approach keeps quality assessment viable as the field advances.

Core claim

ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer.

What carries the argument

Automatic construction of positive and aspect-specific negative pairs that enable instruction tuning of a pre-trained multimodal model into a quality critic, combined with a Quality Query Transformer for two-dimensional scoring.

If this is right

ELIQ outperforms existing label-free methods across multiple benchmarks.
The framework generalizes directly from AI-generated content to user-generated content scenarios with no modifications.
It supports scalable, ongoing quality assessment as generative models continue to evolve.
It produces separate scores for visual quality and prompt-image alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pair-construction approach could be reused for quality checks on generated video or audio without new labels.
Continuous re-application of the method on outputs from updated generators would maintain relevance over time.
The two-dimensional scores could serve as direct signals to refine prompts or training objectives in generative systems.
Similar self-supervised pairing might reduce annotation needs in other multimodal perception tasks.

Load-bearing premise

Automatically constructed positive and aspect-specific negative pairs provide transferable supervision without human annotations for quality assessment.

What would settle it

A new test set of images from the latest text-to-image models where ELIQ scores show low or no correlation with fresh human ratings would show the pairs fail to supply valid supervision.

read the original abstract

Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ELIQ gives a workable route to label-free AIGC quality scoring by auto-building pairs and instruction-tuning, but the claimed zero-shot transfer to UGC rests on an untested assumption about pair construction.

read the letter

Hi, the main point is that ELIQ shows how to keep quality assessment current for generative images without waiting for new human labels. It auto-constructs positive pairs and aspect-specific negatives that hit both ordinary distortions and AIGC artifacts like prompt misalignment, then tunes a multimodal model with a Quality Query Transformer and gated fusion to output two-dimensional scores. That setup is a direct response to the obsolescence problem and uses existing backbones efficiently. The experiments reportedly beat other label-free baselines and extend to UGC without changes, which would be useful if it holds. The soft spot is exactly the transfer claim. Because the negatives are built around AIGC-specific modes, the learned critic could latch onto generation statistics rather than general quality cues. The abstract gives no ablations on how the pairs are made or any cross-domain metrics that would confirm invariance, so the no-modification generalization stays under-supported. This is aimed at computer-vision groups working on scalable IQA for generative content. It deserves a serious referee because the problem is real, the method is concrete, and the results can be checked once the pair-construction details and code are available. I'd send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces ELIQ, a label-free framework for quality assessment of evolving AI-generated images that focuses on visual quality and prompt-image alignment. It automatically constructs positive and aspect-specific negative pairs covering conventional and AIGC-specific distortions to provide transferable supervision without human annotations, adapts a pre-trained multimodal model via instruction tuning, and employs lightweight gated fusion together with a Quality Query Transformer to predict two-dimensional quality scores. Experiments across benchmarks are reported to show consistent outperformance over existing label-free methods and unmodified generalization from AIGC to UGC scenarios.

Significance. If the central claims hold, the work would be significant for enabling scalable, annotation-free quality assessment in a domain where generative models evolve rapidly and render static labeled datasets obsolete. The use of synthetic pairs for supervision and the reported zero-shot transfer to UGC represent a practical advance over prior label-free approaches that often require domain-specific retraining or human labels.

major comments (2)

[Method overview and pair-construction subsection] The description of automatic pair construction (positive pairs and aspect-specific negatives targeting AIGC distortions such as prompt misalignment and generative artifacts) remains high-level. Without explicit details on the heuristics, perturbation models, or selection criteria used to generate these pairs, it is impossible to verify that the resulting supervision signal is invariant enough to support the claimed unmodified generalization from AIGC to UGC.
[Experiments and ablation studies] The experiments claim consistent outperformance and cross-domain generalization, yet no ablation isolating the contribution of the pair-construction choices (versus the instruction-tuning or gated-fusion components) is presented. This leaves the weakest assumption—that the automatically generated pairs supply transferable quality supervision—unsecured.

minor comments (2)

[Method] The notation for the two-dimensional quality output (visual quality and prompt-image alignment) should be defined explicitly with symbols before its first use in the method section.
[Experiments] Figure captions for the qualitative results should include the exact benchmark and distortion type for each example to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and experiments as indicated.

read point-by-point responses

Referee: [Method overview and pair-construction subsection] The description of automatic pair construction (positive pairs and aspect-specific negatives targeting AIGC distortions such as prompt misalignment and generative artifacts) remains high-level. Without explicit details on the heuristics, perturbation models, or selection criteria used to generate these pairs, it is impossible to verify that the resulting supervision signal is invariant enough to support the claimed unmodified generalization from AIGC to UGC.

Authors: We agree that the pair-construction process requires more explicit description to allow verification of invariance. In the revised manuscript we will expand the relevant subsection (currently 3.2) with concrete heuristics: positive pairs are formed by pairing each generated image with its exact original prompt; negative pairs are generated via two families of perturbations—conventional (Gaussian blur with sigma in [1,3], additive Gaussian noise with sigma in [0.01,0.05], JPEG compression at quality 30–70) and AIGC-specific (prompt misalignment obtained by replacing 20–40 % of prompt tokens with semantically distant alternatives while keeping the image fixed, plus artifact simulation via controlled diffusion-step truncation). Selection criteria enforce aspect specificity by computing CLIP similarity thresholds (>0.85 for positives, <0.65 for negatives) and ensuring each negative targets only one distortion axis. These additions will make the supervision signal reproducible and will directly support the claimed AIGC-to-UGC transfer. revision: yes
Referee: [Experiments and ablation studies] The experiments claim consistent outperformance and cross-domain generalization, yet no ablation isolating the contribution of the pair-construction choices (versus the instruction-tuning or gated-fusion components) is presented. This leaves the weakest assumption—that the automatically generated pairs supply transferable quality supervision—unsecured.

Authors: We acknowledge that an explicit ablation isolating pair construction is missing. In the revised manuscript we will add a dedicated ablation subsection (4.4) that reports three controlled variants on the same backbone and tuning procedure: (i) full ELIQ pairs, (ii) random positive/negative pairs drawn without aspect-specific perturbations, and (iii) pairs using only conventional distortions. Performance deltas on both AIGC and UGC benchmarks will be reported, thereby quantifying the contribution of our pair-construction choices to transferable supervision. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes automatic construction of positive and aspect-specific negative pairs from AIGC distortions to provide supervision for instruction tuning a pre-trained multimodal model, followed by gated fusion and Quality Query Transformer for 2D quality prediction. No equations, derivations, or steps are presented that reduce any prediction or result to fitted inputs, self-definitions, or self-citation chains by construction. The approach relies on external pre-trained models and synthetic pair generation as independent components, with the central generalization claim remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Only abstract available so ledger is incomplete; relies on transferability of pre-trained multimodal models and effectiveness of synthetic negative pairs.

free parameters (1)

Gated fusion weights
Lightweight gated fusion likely involves tunable parameters during instruction tuning.

axioms (1)

domain assumption Pre-trained multimodal models can be effectively adapted for quality assessment via instruction tuning on synthetic pairs
Core assumption enabling the label-free approach.

invented entities (1)

Quality Query Transformer no independent evidence
purpose: To predict two-dimensional quality scores
New component introduced for the quality prediction head.

pith-pipeline@v0.9.0 · 5501 in / 1216 out tokens · 28128 ms · 2026-05-16T08:09:06.498374+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

automatically constructs positive and aspect-specific negative pairs ... margin-based ranking losses ... Lvis = E[max(0,m−ŝvis(I+)+ŝvis(I−tec))+...]
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lightweight gated fusion and Quality Query Transformer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.