Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li · 2023 · cs.CV · arXiv 2312.17090

Mixed citation behavior. The most common role is baseline (33% of classified citations), tied with method.

33 Pith papers citing it

Baseline 33% of classified citations

abstract

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
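
The abstract's core idea, training on text-defined levels rather than numeric scores, can be illustrated with a minimal sketch. The abstract does not specify the details, so the five level names, the equal-width binning, and the softmax-weighted decoding below are assumptions made for illustration, and the helper names `score_to_level` and `levels_to_score` are hypothetical.

```python
import math
from typing import Mapping

# Five text-defined levels, ordered worst to best (assumed wording).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score: float, lo: float, hi: float) -> str:
    """Map a numeric opinion score in [lo, hi] to one of five equal-width bins."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (score - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range scores
    index = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]


def levels_to_score(level_logits: Mapping[str, float]) -> float:
    """Softmax over the level logits, then a probability-weighted mean on a 1-5 scale."""
    peak = max(level_logits[name] for name in LEVELS)  # subtract max for stability
    exps = {name: math.exp(level_logits[name] - peak) for name in LEVELS}
    total = sum(exps.values())
    return sum((i + 1) * exps[name] / total for i, name in enumerate(LEVELS))
```

Under these assumptions, `score_to_level(72.0, 0.0, 100.0)` returns `"good"`, and a model that assigns equal logits to all five level words decodes to the midpoint of the 1-5 scale. The training target is a word, while a continuous score can still be recovered at inference time.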

citation-role summary

baseline 3 · method 3 · background 2 · dataset 1
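
The 33% baseline share reported on this page appears to be computed over classified citations only, not over all 33 citing papers. A quick check with the counts listed above, assuming each classified citation carries exactly one role:

```python
# Counts copied from the citation-role summary above.
role_counts = {"baseline": 3, "method": 3, "background": 2, "dataset": 1}

classified = sum(role_counts.values())  # total classified citations
shares = {role: round(100 * n / classified) for role, n in role_counts.items()}

# Roles that share the highest percentage (there can be a tie).
top_share = max(shares.values())
top_roles = [role for role, share in shares.items() if share == top_share]
```

With these counts, 9 citations are classified, and baseline and method each account for 3 of them.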

citation-polarity summary

baseline 3 · use method 3 · background 2 · use dataset 1

representative citing papers

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Accelerating Rectified Flow Models via Trajectory-Aware Caching

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historical orthogonal directions.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences than prior metrics.

GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.

Personalizing Text-to-Image Generation to Individual Taste

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

cs.CV · 2025-09-26 · unverdicted · novelty 7.0

LucidFlux is a caption-free image restoration method that conditions a Flux.1 diffusion transformer with a dual-branch module from the degraded input and a proxy restoration plus SigLIP semantic features to outperform baselines on synthetic and real-world data.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.

SR-Ground: Image Quality Grounding for Super-Resolved Content

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.

FGSVQA: Frequency-Guided Short-form Video Quality Assessment

eess.IV · 2026-05-19 · unverdicted · novelty 6.0

FGSVQA combines CLIP visual encoding with frequency priors and adaptive branch fusion to predict short-form video quality, reporting SRCC 0.736 and PLCC 0.787 on relevant datasets.

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-free KV cache rescheduling.

GeoR-Bench: Evaluating Geoscience Visual Reasoning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.

You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

YOGO reformulates stochastic 3D Gaussian Splatting into a deterministic budget-aware system and supplies an ultra-dense dataset to enforce physical fidelity over viewpoint interpolation.

Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

DS-IEQA jointly learns evaluation criteria via feedback-driven prompt optimization and continuous score modeling via token-decoupled distance regression, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without extra training data.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

On the Global Photometric Alignment for Low-Level Vision

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

PAL uses closed-form affine color alignment on prediction-target pairs to discount global photometric discrepancies from the supervision signal, improving restoration across low-level vision tasks.

LumiVideo: An Intelligent Agentic System for Video Color Grading

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution

cs.CV · 2026-03-06 · unverdicted · novelty 6.0

LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

cs.CV · 2026-03-02 · unverdicted · novelty 6.0

HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.

citing papers explorer

Showing 33 of 33 citing papers.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 34 · internal anchor
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
Seeing Through Fog: Towards Fog-Invariant Action Recognition cs.CV · 2026-05-20 · unverdicted · none · ref 42 · internal anchor
Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.
Accelerating Rectified Flow Models via Trajectory-Aware Caching cs.CV · 2026-05-16 · unverdicted · none · ref 24 · internal anchor
TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historical orthogonal directions.
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 83 · internal anchor
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CV · 2026-05-08 · unverdicted · none · ref 54 · internal anchor
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment cs.CV · 2026-05-07 · unverdicted · none · ref 25 · 2 links · internal anchor
FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences than prior metrics.
GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment cs.CV · 2026-05-02 · unverdicted · none · ref 24 · internal anchor
GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.
Personalizing Text-to-Image Generation to Individual Taste cs.CV · 2026-04-08 · unverdicted · none · ref 55 · internal anchor
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos cs.CV · 2026-03-03 · unverdicted · none · ref 12 · internal anchor
EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.
LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer cs.CV · 2025-09-26 · unverdicted · none · ref 14 · internal anchor
LucidFlux is a caption-free image restoration method that conditions a Flux.1 diffusion transformer with a dual-branch module from the degraded input and a proxy restoration plus SigLIP semantic features to outperform baselines on synthetic and real-world data.
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion cs.CV · 2026-05-22 · unverdicted · none · ref 45 · internal anchor
PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.
SR-Ground: Image Quality Grounding for Super-Resolved Content cs.CV · 2026-05-20 · unverdicted · none · ref 36 · internal anchor
The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.
FGSVQA: Frequency-Guided Short-form Video Quality Assessment eess.IV · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
FGSVQA combines CLIP visual encoding with frequency priors and adaptive branch fusion to predict short-form video quality, reporting SRCC 0.736 and PLCC 0.787 on relevant datasets.
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization cs.CV · 2026-05-15 · unverdicted · none · ref 54 · internal anchor
FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-free KV cache rescheduling.
GeoR-Bench: Evaluating Geoscience Visual Reasoning cs.CV · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning cs.CV · 2026-05-08 · unverdicted · none · ref 68 · internal anchor
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy cs.CV · 2026-05-01 · unverdicted · none · ref 73 · internal anchor
RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.
You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes cs.CV · 2026-04-23 · unverdicted · none · ref 31 · internal anchor
YOGO reformulates stochastic 3D Gaussian Splatting into a deterministic budget-aware system and supplies an ultra-dense dataset to enforce physical fidelity over viewpoint interpolation.
Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment cs.CV · 2026-04-14 · unverdicted · none · ref 27 · internal anchor
DS-IEQA jointly learns evaluation criteria via feedback-driven prompt optimization and continuous score modeling via token-decoupled distance regression, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without extra training data.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 50 · internal anchor
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
On the Global Photometric Alignment for Low-Level Vision cs.CV · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
PAL uses closed-form affine color alignment on prediction-target pairs to discount global photometric discrepancies from the supervision signal, improving restoration across low-level vision tasks.
LumiVideo: An Intelligent Agentic System for Video Color Grading cs.CV · 2026-04-02 · unverdicted · none · ref 14 · internal anchor
LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.
LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution cs.CV · 2026-03-06 · unverdicted · none · ref 38 · internal anchor
LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images cs.CV · 2026-03-02 · unverdicted · none · ref 50 · internal anchor
HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length cs.CV · 2025-12-04 · conditional · none · ref 44 · internal anchor
Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models cs.CV · 2025-11-01 · unverdicted · none · ref 87 · internal anchor
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
Embody4D: A Generalist 4D World Model for Embodied AI cs.CV · 2026-05-03 · unverdicted · none · ref 54 · internal anchor
Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs cs.CV · 2026-04-27 · unverdicted · none · ref 38 · internal anchor
FDIM is a new hybrid feature-distance video quality metric trained on over 16k sequences that shows strong generalization and correlation with human judgments across ten unseen SDR/HDR datasets and diverse codecs.
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement cs.CV · 2026-04-18 · unverdicted · none · ref 47 · internal anchor
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement in image generation.
LongCat-Image Technical Report cs.CV · 2025-12-08 · unverdicted · none · ref 8 · internal anchor
LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation cs.CV · 2024-11-14 · unverdicted · none · ref 43 · internal anchor
JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds cs.CV · 2026-04-15 · unverdicted · none · ref 73 · internal anchor
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.
Generalizable Video Quality Assessment via Weak-to-Strong Learning cs.CV · 2025-05-06 · unreviewed · ref 13 · internal anchor