pith. machine review for the scientific record.

arxiv: 2312.17090 · v1 · submitted 2023-12-28 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 3 Lean theorem links

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords large multi-modality models · visual quality assessment · image aesthetic assessment · discrete text levels · subjective judgment · OneAlign · video quality assessment

The pith

LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large multi-modality models can be trained to rate images and videos by learning the discrete text-defined levels that human raters use in subjective studies. This replaces direct numerical score regression and produces stronger results on image quality assessment, image aesthetic assessment, and video quality assessment under the unchanged model structure. The same discrete-level syllabus then unifies the three tasks into a single model, OneAlign. The approach shows a clear edge over score-based training variants while requiring no extra data or architecture changes.
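To make the syllabus concrete: a minimal sketch of how a continuous mean opinion score (MOS) could be bucketed into five text-defined levels and folded into an ordinary next-token-prediction target. The level names, equal-width binning, and prompt wording here are illustrative assumptions, not lifted from the paper's released code.

```python
# Five text-defined levels in ascending order, mirroring the
# ITU-style categories used in subjective studies (assumed names).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def mos_to_level(mos: float, mos_min: float, mos_max: float) -> str:
    """Bucket a continuous MOS into one of five equal-width bins
    over the dataset's score range and return its text label."""
    frac = (mos - mos_min) / (mos_max - mos_min)   # normalize to [0, 1]
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]

# The level word is appended to a rating prompt, so the LMM learns it
# through standard next-token prediction -- no new head, no new data.
prompt = "Rate the quality of the image."
target = f"The quality of the image is {mos_to_level(3.8, 1.0, 5.0)}."
# -> "The quality of the image is good."
```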

Core claim

Observing that human raters learn and judge discrete text-defined levels in subjective studies, the work teaches LMMs to output these levels for visual rating. The resulting Q-Align model reaches state-of-the-art performance on image quality assessment, image aesthetic assessment, and video quality assessment tasks under the original LMM structure. A syllabus built from the same discrete levels further unifies the three tasks into a single model termed OneAlign.

What carries the argument

The discrete text-defined rating levels syllabus that replaces numerical score regression to train LMMs for visual assessment.
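Benchmarks still report continuous correlations against human scores, so at inference the discrete levels must be collapsed back into a scalar. One natural readout, sketched below, is a probability-weighted expectation over the five level tokens; the 1-5 level weights and the softmax readout are assumptions for illustration, not a description of the released implementation.

```python
import numpy as np

LEVELS = ["bad", "poor", "fair", "good", "excellent"]
LEVEL_VALUES = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed ordinal weights

def logits_to_score(level_logits: np.ndarray) -> float:
    """Softmax over the five level-token logits, then take the
    expectation of the level values to recover a scalar score."""
    z = level_logits - level_logits.max()   # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs @ LEVEL_VALUES)

# The model mostly believes "good", with some mass on "excellent".
score = logits_to_score(np.array([0.1, 0.3, 1.0, 3.0, 2.0]))
print(round(score, 2))  # ~3.96, finer-grained than the argmax level "good"
```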

If this is right

  • The discrete-level syllabus outperforms direct-score training on IQA, IAA, and VQA benchmarks.
  • Three separate visual assessment tasks can be unified into one model without architectural changes.
  • State-of-the-art results are obtained while keeping the original LMM structure and data requirements unchanged.
  • The same training approach extends across image and video content types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar discrete-level training could apply to other subjective rating domains where humans use categorical language.
  • Different choices of text phrasing for the levels might further improve alignment with specific human populations.
  • The method suggests that discrete outputs could reduce calibration issues in other LMM alignment tasks.

Load-bearing premise

Human subjective judgment in visual scoring relies primarily on discrete text-defined levels rather than continuous numerical values.

What would settle it

If an LMM trained with direct numerical score regression matches or exceeds the discrete-level version on the same image quality, aesthetic, and video quality benchmarks, the claimed advantage would not hold.
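Concretely, settling this would come down to rank and linear correlation against human MOS on shared test splits, the field's standard SRCC and PLCC metrics. A minimal sketch of that head-to-head using scipy; the arrays are placeholders, not reported results.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder predictions from the two training variants on one split.
mos            = [3.8, 1.2, 4.5, 2.9, 3.1]   # human ground truth
pred_levels    = [3.9, 1.5, 4.4, 2.7, 3.2]   # discrete-level syllabus
pred_regressed = [3.5, 2.0, 4.0, 3.0, 2.8]   # direct-score baseline

for name, pred in [("levels", pred_levels), ("scores", pred_regressed)]:
    srcc, _ = spearmanr(mos, pred)           # rank correlation
    plcc, _ = pearsonr(mos, pred)            # linear correlation
    print(f"{name}: SRCC={srcc:.3f} PLCC={plcc:.3f}")
```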

read the original abstract

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Q-Align, which trains large multi-modality models (LMMs) for visual scoring by using discrete text-defined rating levels (e.g., 'excellent', 'good') rather than numerical scores to better emulate human subjective judgment. It reports state-of-the-art results on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks while preserving the original LMM architecture, and introduces a unified OneAlign model across the three tasks via a shared syllabus.

Significance. If the performance gains are shown to arise specifically from the discrete-level syllabus rather than output-format compatibility, the approach would provide a simple, architecture-preserving method for aligning LMMs with human perceptual judgments, with potential impact on fine-tuning strategies for subjective visual assessment tasks.

major comments (1)
  1. [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.
minor comments (1)
  1. [Experiments] The manuscript should provide explicit details on dataset splits, full baseline implementations, and statistical significance tests to substantiate the SOTA claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on isolating the source of performance gains. We address the major comment point-by-point below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.

    Authors: We agree that a text-formatted numerical baseline would provide a cleaner control for output-format effects. In the current experiments the direct-score variants prompted the LMM for numerical regression, which is indeed less native to its text-token pretraining than discrete text labels. The core motivation remains that human raters in subjective studies use discrete text-defined levels rather than continuous numbers; this is why we adopted the syllabus. To address the referee's concern directly, we will add the suggested text-formatted numerical baseline (e.g., prompting for outputs such as 'score: 4') in the revised experiments and report the results alongside the existing comparisons. This addition will be included in the updated manuscript and supplementary material. · revision: yes
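For concreteness, the three training variants at issue differ only in the target the LMM is asked to emit. A hypothetical sketch of the three target formats; prompt wording and formatting are illustrative, not drawn from the paper.

```python
mos = 3.8  # continuous human score on a 1-5 scale

# (a) Discrete text level -- the paper's syllabus.
target_level = "The quality of the image is good."

# (b) Text-formatted number -- the referee's proposed control: still
#     plain next-token prediction over text tokens, but the target
#     stays numeric, isolating output-format effects.
target_text_number = f"score: {mos:.1f}"   # or coarser: "score: 4"

# (c) Direct score regression -- a scalar target, typically trained
#     with an L2 loss on a numeric readout rather than cross-entropy.
target_regression = mos
```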

Circularity Check

0 steps flagged

Empirical fine-tuning shows no circular derivation

full rationale

The paper is an empirical study on fine-tuning LMMs for visual scoring tasks using discrete text-defined levels. No mathematical derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. The method relies on standard next-token prediction training with released code and weights; performance claims are validated experimentally across IQA, IAA, and VQA benchmarks rather than through self-referential logic or fitted parameters renamed as predictions. Minor self-citations exist but are not load-bearing for the core syllabus or unification into OneAlign.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that discrete levels from human studies can be directly used as training targets for LMMs and that this yields better alignment than numerical scores.

axioms (1)
  • domain assumption: Human raters in subjective studies judge visual content using discrete text-defined levels.
    Stated as an observation from subjective studies that the method emulates.

pith-pipeline@v0.9.0 · 5549 in / 1167 out tokens · 65154 ms · 2026-05-15T16:31:49.640259+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore Jcost_pos_of_ne_one · unclear

    Relation between the paper passage and the cited Recognition theorem:

    Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores.

  • Foundation.PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem:

    The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure.

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    With the syllabus, we further unify the three tasks into one model, termed the OneAlign.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  3. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  4. Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

    cs.CV 2026-05 unverdicted novelty 7.0

    FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences.

  5. GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

    cs.CV 2026-05 unverdicted novelty 7.0

    GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.

  6. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  7. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

  8. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  9. Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

    cs.CV 2026-05 unverdicted novelty 6.0

    FuScore trains an MLLM to produce continuous IVIF quality scores supervised by per-image soft labels and Thurstone fidelity terms, reaching state-of-the-art correlation with human preferences.

  10. Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

    cs.CV 2026-05 unverdicted novelty 6.0

    RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.

  11. You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

    cs.CV 2026-04 unverdicted novelty 6.0

    YOGO reformulates stochastic 3D Gaussian Splatting into a deterministic budget-aware system and supplies an ultra-dense dataset to enforce physical fidelity over viewpoint interpolation.

  12. Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

    cs.CV 2026-04 unverdicted novelty 6.0

    DS-IEQA jointly learns evaluation criteria via feedback-driven prompt optimization and continuous score modeling via token-decoupled distance regression, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2...

  13. Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

  14. On the Global Photometric Alignment for Low-Level Vision

    cs.CV 2026-04 unverdicted novelty 6.0

    PAL uses closed-form affine color alignment on prediction-target pairs to discount global photometric discrepancies from the supervision signal, improving restoration across low-level vision tasks.

  15. LumiVideo: An Intelligent Agentic System for Video Color Grading

    cs.CV 2026-04 unverdicted novelty 6.0

    LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

  16. LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution

    cs.CV 2026-03 unverdicted novelty 6.0

    LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.

  17. HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

    cs.CV 2026-03 unverdicted novelty 6.0

    HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.

  18. Embody4D: A Generalist 4D World Model for Embodied AI

    cs.CV 2026-05 unverdicted novelty 5.0

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  19. FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs

    cs.CV 2026-04 unverdicted novelty 5.0

    FDIM is a new hybrid feature-distance video quality metric trained on over 16k sequences that shows strong generalization and correlation with human judgments across ten unseen SDR/HDR datasets and diverse codecs.

  20. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement.

  21. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

Reference graph

Works this paper leans on

293 extracted references · 293 canonical work pages · cited by 20 Pith papers · 9 internal anchors
