Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li · 2023 · cs.CV · arXiv 2312.17090

Mixed citation behavior. Most common role is baseline (33%).

45 Pith papers citing it

Baseline 33% of classified citations

abstract

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
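The idea of training on text-defined levels instead of scores can be illustrated with a minimal sketch. The abstract does not give the details, so everything here is an assumption made for illustration: the five level names, the equal-width binning of scores into levels, and the recovery of a scalar score as a softmax-weighted average over per-level logits. The function names `score_to_level` and `levels_to_score` are hypothetical and are not taken from the released code.

```python
import math

# Assumed level names, ordered from worst to best (not stated in the abstract).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, lo, hi):
    """Map a scalar opinion score in [lo, hi] to one of five equal-width bins."""
    if not lo <= score <= hi:
        raise ValueError("score outside the declared range")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that score == hi falls into the top bin.
    idx = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[idx]


def levels_to_score(level_logits):
    """Turn per-level logits into a scalar on a 1..5 scale.

    Applies a softmax restricted to the five levels, then takes the
    expected value with weights 1 (bad) through 5 (excellent).
    """
    peak = max(level_logits[name] for name in LEVELS)
    exps = [math.exp(level_logits[name] - peak) for name in LEVELS]
    total = sum(exps)
    return sum((i + 1) * e / total for i, e in enumerate(exps))


if __name__ == "__main__":
    # Training side: a score becomes a text label the model learns to emit.
    print(score_to_level(4.6, lo=1.0, hi=5.0))
    # Inference side: hypothetical logits for the five level tokens.
    logits = {"bad": -2.0, "poor": -1.0, "fair": 0.5, "good": 2.0, "excellent": 1.0}
    print(round(levels_to_score(logits), 3))
```

Under these assumptions, the training target is a word rather than a number, and a continuous score is only reconstructed at inference time from the model's distribution over the level words.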

citation-role summary

baseline 3 · method 3 · background 2 · dataset 1

citation-polarity summary

baseline 3 · use method 3 · background 2 · use dataset 1

representative citing papers

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.

Towards Characterizing Scientific Image Utility and Upgradability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LL-Bench supplies a human-annotated dataset exposing generative model weaknesses in low-level restoration and introduces LL-Score as an MLLM evaluator that outperforms existing quality metrics and can serve as a training reward.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Accelerating Rectified Flow Models via Trajectory-Aware Caching

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historical orthogonal directions.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences than prior metrics.

GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.

Personalizing Text-to-Image Generation to Individual Taste

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

cs.CV · 2025-09-26 · unverdicted · novelty 7.0

LucidFlux is a caption-free image restoration method that conditions a Flux.1 diffusion transformer with a dual-branch module from the degraded input and a proxy restoration plus SigLIP semantic features to outperform baselines on synthetic and real-world data.

MR-IQA: A Unified Margin View of Regression and Ranking for Blind Image Quality Assessment

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

MR-IQA unifies regression and ranking in BIQA via a quality-margin optimization framework in RL, showing competitive performance on six benchmarks.

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

cs.CV · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

DIRECT decomposes insertion conditions into appearance, 3D geometry proxy, and background context guidances injected separately to achieve pose-controllable high-fidelity object insertion.

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

TriPS reformulates diffusion posterior sampling as a time-varying control problem and optimizes triadic schedules (decreasing DC and stochasticity, increasing CFG) via template search and GRPO reinforcement learning, outperforming baselines in fidelity and realism.

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

4KLSDB supplies 129k+ curated 4K images plus validation/test splits to support training of super-resolution and text-to-image diffusion models.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.

SR-Ground: Image Quality Grounding for Super-Resolved Content

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.

FGSVQA: Frequency-Guided Short-form Video Quality Assessment

eess.IV · 2026-05-19 · unverdicted · novelty 6.0

FGSVQA combines CLIP visual encoding with frequency priors and adaptive branch fusion to predict short-form video quality, reporting SRCC 0.736 and PLCC 0.787 on relevant datasets.

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.

GeoR-Bench: Evaluating Geoscience Visual Reasoning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

citing papers explorer

Generalizable Video Quality Assessment via Weak-to-Strong Learning

cs.CV · 2025-05-06 · unreviewed · ref 13 · internal anchor
