Context and Pixel Aware Large Language Model for Video Quality Assessment

Balu Adsumilli; Neil Birkbeck; Wen Wen; Yaohong Wu; Yilin Wang; Yue Sheng

arXiv: 2505.16025 · v4 · submitted 2025-05-21 · 💻 cs.CV · cs.MM · eess.IV

Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

Pith reviewed 2026-05-22 13:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · eess.IV

keywords video quality assessment · multimodal large language model · dual vision encoders · pixel distortions · context awareness · quality scoring · quality description

The pith

A multimodal LLM with separate encoders for video context and pixel distortions generates quality scores and descriptions together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CP-LLM as a way to fix two gaps in video quality assessment: traditional models miss broader scene understanding, while most multimodal language models overlook small pixel-level problems like compression artifacts. It does this by running two vision encoders in parallel—one that looks at overall video context and one that examines fine pixel details—then feeding both into a language decoder that reasons about how they interact. The result is a single model that outputs both a numerical quality score and a readable description. A sympathetic reader would care because reliable VQA matters for video delivery, editing, and compression pipelines where context and distortion must both be weighed.

Core claim

CP-LLM is built around dual vision encoders that separately process high-level video context and low-level pixel distortions, followed by a language decoder that reasons over their combined information. This architecture lets the model output both robust quality scores and interpretable natural-language descriptions in one forward pass, while remaining more sensitive to pixel-level issues such as compression artifacts than prior multimodal approaches.

What carries the argument

Dual vision encoders that run independently on context and pixel signals, then feed into a shared language decoder for joint reasoning about quality.
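
As a rough illustration of that data flow, the sketch below wires two independent toy encoders into one token sequence for a decoder. Everything in it is an assumption made for illustration: the function names (`context_encoder`, `pixel_encoder`, `build_decoder_input`), the one-number "features", and fusion by plain concatenation are hypothetical stand-ins, not the paper's components.

```python
from typing import List

Token = List[float]


def context_encoder(frames: List[List[float]]) -> List[Token]:
    """Toy stand-in for the high-level stream: one token per frame (mean intensity)."""
    return [[sum(frame) / len(frame)] for frame in frames]


def pixel_encoder(patches: List[List[float]]) -> List[Token]:
    """Toy stand-in for the low-level stream: one token per patch
    (mean absolute difference between neighboring pixels)."""
    tokens = []
    for patch in patches:
        diffs = [abs(a - b) for a, b in zip(patch, patch[1:])]
        tokens.append([sum(diffs) / len(diffs)])
    return tokens


def build_decoder_input(
    context_tokens: List[Token],
    pixel_tokens: List[Token],
    prompt_tokens: List[Token],
) -> List[Token]:
    """Assumed fusion: concatenate the two visual streams, then the text prompt.
    Neither encoder sees the other's output before this point."""
    return context_tokens + pixel_tokens + prompt_tokens


frames = [[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]]   # two tiny placeholder "frames"
patches = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]  # two tiny placeholder "patches"
prompt = [[1.0]]                               # placeholder prompt embedding
sequence = build_decoder_input(context_encoder(frames), pixel_encoder(patches), prompt)
```

Keeping the streams as separate segments until the decoder is what makes a later ablation simple: dropping one segment from `sequence` removes that signal without touching the other encoder.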

If this is right

The model can supply both a numeric score and a human-readable explanation for the same video input.
Performance holds up when the test videos come from entirely different sources than the training data.
Sensitivity to subtle pixel distortions improves without sacrificing awareness of overall scene semantics.
Quality assessment and description generation become a single unified task rather than two separate models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-encoder split could be tested on related tasks such as detecting video editing artifacts or guiding adaptive streaming decisions.
If the separation truly prevents interference, it might reduce the amount of task-specific data needed to adapt the model to new distortion types.
One could measure whether removing either encoder drops performance on the other signal, confirming the independence assumption holds in practice.

Load-bearing premise

The two encoders can extract context and distortion signals independently, and the decoder can combine them without losing sensitivity or creating new errors in the quality judgment.

What would settle it

A controlled test set of videos that share identical high-level content but differ only in compression strength, where CP-LLM fails to rank the versions correctly or produces descriptions that ignore the visible artifacts.
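
One way such a test could be scored is sketched below, assuming each clip has been encoded at several compression levels and the model returns one scalar score per version. The helper names (`ranks`, `spearman`, `ranks_compression_correctly`) and the `-0.9` threshold are hypothetical choices for this sketch, not anything specified in the paper. Stronger compression should lower the predicted score, so a well-behaved model gives a rank correlation close to -1.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def ranks_compression_correctly(
    compression_levels: Sequence[float],
    predicted_scores: Sequence[float],
    threshold: float = -0.9,  # hypothetical pass mark
) -> bool:
    """True when predicted quality falls almost monotonically as compression rises."""
    return spearman(compression_levels, predicted_scores) <= threshold
```

Running this per source clip, rather than pooling all clips, keeps content differences from masking a ranking failure.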

Original abstract

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experiment results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CP-LLM adds a dual-encoder MLLM for VQA that splits context and pixel streams, but the independence claim needs ablations to hold up.

The main takeaway is that this paper builds a multimodal LLM with two separate vision encoders, one for high-level video context and one for low-level pixel distortions, then routes both into a shared language decoder that produces quality scores and descriptions. That split-and-reason design is the concrete new piece relative to earlier MLLM work on VQA. It directly targets a divide in the field: older models catch compression artifacts but miss scene semantics, while recent language models often lose sensitivity to small pixel changes. The architecture is a straightforward way to try keeping both signals intact, and the joint output of score plus explanation is a useful practical feature for media pipelines. The paper earns credit for naming the problem cleanly and for shipping an end-to-end trainable structure that could be reproduced if the code appears.

The soft spot is exactly the one the stress-test flags: there is no clear evidence yet that the two encoders actually stay independent, or that the decoder meaningfully reasons over their interplay rather than just adding capacity. Without single-encoder baselines, fusion ablations, or controlled distortion tests, the reported cross-dataset gains and robustness could come from the training regime or overall scale instead of the dual design. If feature projection mixes the streams, low-level sensitivity could drop even while overall numbers look better.

This work is aimed at people building perceptual models for video delivery or moderation who already use MLLMs and want to add distortion awareness. A reader who cares about architecture variants in multimodal quality assessment would get value from the design sketch and could test the independence idea themselves. It is coherent enough on its own terms to deserve a serious referee, mainly so the experimental controls can be added and checked. I would send it to review with a note asking for those ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CP-LLM, a multimodal LLM for video quality assessment that employs dual vision encoders—one context-aware for high-level video understanding and one pixel-aware for low-level distortions—followed by a language decoder that reasons over their interplay. The central claims are state-of-the-art cross-dataset performance on VQA benchmarks together with improved robustness to pixel-level distortions and the ability to generate both quality scores and interpretable descriptions.

Significance. If the empirical results are substantiated, the dual-encoder design could meaningfully advance VQA by addressing the complementary limitations of traditional pixel-focused methods and existing MLLMs, potentially yielding both higher sensitivity to small distortions and greater interpretability in a unified model.

major comments (2)

[Experiments] The experimental section provides no ablation studies comparing the dual-encoder architecture against single-encoder baselines or alternative fusion mechanisms. Without these controls it is impossible to attribute the reported robustness gains specifically to independent context and pixel streams rather than increased model capacity or training regime.
[Results] Cross-dataset SOTA claims are stated without accompanying statistical significance tests, standard deviations across runs, or detailed baseline tables that would allow verification that improvements exceed what could arise from dataset biases or hyperparameter tuning.

minor comments (1)

[Abstract] The abstract refers to 'experiment results' demonstrating SOTA performance but does not name the specific VQA benchmarks or metrics employed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, agreeing that the suggested additions will strengthen the empirical support for our claims. We will incorporate the requested analyses and controls in the revised manuscript.

Referee: [Experiments] The experimental section provides no ablation studies comparing the dual-encoder architecture against single-encoder baselines or alternative fusion mechanisms. Without these controls it is impossible to attribute the reported robustness gains specifically to independent context and pixel streams rather than increased model capacity or training regime.

Authors: We agree that ablation studies are necessary to isolate the contribution of the dual-encoder design from potential confounds such as model capacity or training differences. In the revised manuscript we will add a dedicated ablation section that includes: (i) single-encoder baselines using only the context-aware encoder or only the pixel-aware encoder, (ii) alternative fusion mechanisms (early concatenation, cross-attention, and simple averaging), and (iii) capacity-matched variants where parameter counts are equalized across configurations. These experiments will be run under identical training regimes to enable direct attribution of robustness gains to the independent context and pixel streams.

revision: yes

Referee: [Results] Cross-dataset SOTA claims are stated without accompanying statistical significance tests, standard deviations across runs, or detailed baseline tables that would allow verification that improvements exceed what could arise from dataset biases or hyperparameter tuning.

Authors: We acknowledge that statistical rigor and transparency are required to substantiate the cross-dataset SOTA claims. In the revision we will: (i) report means and standard deviations over at least five independent runs with different random seeds, (ii) include statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) against the strongest baselines, (iii) expand the baseline tables with hyperparameter details, training settings, and dataset statistics, and (iv) add a short discussion of how the cross-dataset protocol and diverse benchmark selection mitigate dataset-specific biases. These results will be presented in updated tables and text in the experimental section.

revision: yes
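
The reporting promised in items (i) and (ii) can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' evaluation code: `summarize` and `paired_t_statistic` are hypothetical helper names, and the inputs are assumed to be one metric value per random seed, with both models evaluated on the same seeds.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(per_seed_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(per_seed_scores), statistics.stdev(per_seed_scores)


def paired_t_statistic(model_a: Sequence[float], model_b: Sequence[float]) -> float:
    """t statistic of a paired t-test on per-seed differences (a - b).

    The p-value comes from a t distribution with len(model_a) - 1 degrees
    of freedom; that lookup is not included in this sketch.
    """
    if len(model_a) != len(model_b) or len(model_a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 runs")
    diffs = [a - b for a, b in zip(model_a, model_b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
```

In practice the p-values would more likely come from an existing statistics package, for example `scipy.stats.ttest_rel` for the paired t-test and `scipy.stats.wilcoxon` for the signed-rank test. With only five runs the Wilcoxon test has very few attainable p-values, which is worth keeping in mind when reading the results.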

Circularity Check

0 steps flagged

No significant circularity in empirical architecture and benchmark results

The paper introduces CP-LLM as a multimodal architecture with dual vision encoders for context and pixel-level analysis plus a language decoder for their interplay, claiming SOTA cross-dataset VQA performance via experiments. No derivation chain, equations, or first-principles results are presented that reduce by construction to fitted inputs, self-defined quantities, or self-citation chains. The central claims rest on empirical evaluation against external benchmarks rather than any internal reduction or renaming of known results. This qualifies as a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The model rests on standard deep-learning assumptions about vision-language fusion and the ability of separate encoders to extract complementary signals; no machine-checked proofs or parameter-free derivations are provided.

free parameters (1)

encoder fusion mechanism
The relative weighting or integration of the two vision encoder outputs is almost certainly tuned during training.

axioms (1)

domain assumption: Multimodal LLMs can integrate and reason over separate visual feature streams
Invoked when the language decoder is said to reason about the interplay between context and pixel information.

invented entities (1)

Dual vision encoders (context-aware and pixel-aware) · no independent evidence
purpose: To separately analyze high-level video context and low-level pixel distortions
New architectural component introduced to overcome limitations of single-encoder MLLMs.

pith-pipeline@v0.9.0 · 5712 in / 1324 out tokens · 42918 ms · 2026-05-22T13:17:46.775359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
cs.CV · 2026-04 · unverdicted · novelty 7.0

DPC-VQA decouples a frozen MLLM perceptual prior from a lightweight residual calibration branch to adapt video quality assessment to new scenarios with under 2% trainable parameters and 20% of typical MOS labels.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Context and Pixel Aware Large Language Model for Video Quality Assessment

INTRODUCTION. Video quality assessment (VQA) is a fundamental research area with broad applications in video compression, transcoding, transmission, playback, content search, and recommendation systems. Early no-reference VQA models focused on low-level distortions like blur and blocking artifacts. However, with the emergence of user-generated conten...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Early knowledge-driven approaches, such as TLVQM [7], and VIDEVAL [8], primarily relied on handcrafted spatial and temporal features

RELATED WORK. Knowledge-driven and Learning-based Methods. Objective methods for assessing UGC video quality have evolved significantly in past decades. Early knowledge-driven approaches, such as TLVQM [7], and VIDEVAL [8], primarily relied on handcrafted spatial and temporal features. Subsequently, hybrid models like CNN-TLVQM [9] and RAPIQUE [3] c...

work page
[3]

CP-LLM: CONTEXT- AND PIXEL-AWARE LARGE LANGUAGE MODEL. 3.1. Model Components. To achieve good context understanding capability and good sensitivity to pixel distortions, the proposed CP-LLM model incorporates an MLLM architecture with dual vision encoders: one encoder is dedicated to extracting high-level semantic information, while the other focuses on...

work page
[4]

Answer / Score

EXPERIMENTS. 4.1. Experimental Setups. Datasets. The proposed model, CP-LLM, was trained on LSVQ [10] and augmented with compression variants. These variants were generated by encoding the original LSVQ videos using the H.264 codec at 20 distinct constant rate factor (CRF) levels, ranging from 15 to 53. To train the language decoder, videos from LSVQ were annota...

work page
[5]

The maximum sequence length was set to L = 512 tokens

bilinear resizing preserving aspect ratio to fit within 540 × 1080 pixels, followed by cropping K = 8 non-overlapping 224 × 224 pixel patches for the low-level vision encoder. The maximum sequence length was set to L = 512 tokens. We fine-tuned the language decoder using low-rank adaptation (LoRA), applying a rank of R = 4 to its query, key, and value attention projecti...

work page
[6]

CONCLUSION. We have presented CP-LLM, a novel MLLM-based framework for VQA that simultaneously predicts quality scores and generates interpretable textual descriptions. Our method addresses key challenges in VQA, particularly the perception of pixel-level distortions and the integration of quantitative and qualitative feedback, leveraging dual vision enc...

work page
[7]

YouTube UGC dataset for video compression research,

Yilin Wang, Sasi Inguva, and Balu Adsumilli, “YouTube UGC dataset for video compression research,” in IEEE International Workshop on Multimedia Signal Processing, 2019, pp. 1–5

work page 2019
[8]

YouTube SFV+ HDR quality dataset,

Yilin Wang, Joong Gon Yim, Neil Birkbeck, and Balu Adsumilli, “YouTube SFV+ HDR quality dataset,” in IEEE International Conference on Image Processing, 2024, pp. 96–102

work page 2024
[9]

RAPIQUE: Rapid and accurate video quality prediction of user generated content,

Zhengzhong Tu, Xiangxu Yu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C. Bovik, “RAPIQUE: Rapid and accurate video quality prediction of user generated content,” IEEE Open Journal of Signal Processing, vol. 2, pp. 425–440, 2021

work page 2021
[10]

Rich features for perceptual quality assessment of UGC videos,

Yilin Wang, Junjie Ke, Hossein Talebi, Joong Gon Yim, Neil Birkbeck, Balu Adsumilli, Peyman Milanfar, and Feng Yang, “Rich features for perceptual quality assessment of UGC videos,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 13435–13444

work page 2021
[11]

FAST-VQA: Efficient end-to-end video quality assessment with fragment sampling,

Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “FAST-VQA: Efficient end-to-end video quality assessment with fragment sampling,” in European Conference on Computer Vision, 2022, pp. 538–554

work page 2022
[12]

Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in IEEE International Conference on Computer Vision, 2023, pp. 20144–20154

work page 2023
[13]

Two-level approach for no-reference consumer video quality assessment,

Jari Korhonen, “Two-level approach for no-reference consumer video quality assessment,” IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5923–5938, 2019

work page 2019
[14]

UGC-VQA: Benchmarking blind video quality assessment for user generated content,

Zhengzhong Tu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C. Bovik, “UGC-VQA: Benchmarking blind video quality assessment for user generated content,” IEEE Transactions on Image Processing, vol. 30, pp. 4449–4464, 2021

work page 2021
[15]

Blind natural video quality prediction via statistical temporal features and deep spatial features,

Jari Korhonen, Yicheng Su, and Junyong You, “Blind natural video quality prediction via statistical temporal features and deep spatial features,” in ACM International Conference on Multimedia, 2020, pp. 3311–3319

work page 2020
[16]

Patch-VQ: ‘Patching up’ the video quality problem,

Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan C. Bovik, “Patch-VQ: ‘Patching up’ the video quality problem,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14019–14029

work page 2021
[17]

Modular blind video quality assessment,

Wen Wen, Mu Li, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang, and Kede Ma, “Modular blind video quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 2763–2772

work page 2024
[18]

Q-Instruct: Improving low-level visual abilities for multi-modality foundation models,

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Q-Instruct: Improving low-level visual abilities for multi-modality foundation models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 25490–25500

work page 2024
[19]

Depicting beyond scores: Advancing image quality assessment through multi-modal language models,

Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong, “Depicting beyond scores: Advancing image quality assessment through multi-modal language models,” in European Conference on Computer Vision, 2024, pp. 259–276

work page 2024
[20]

Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels,

Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin, “Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels,” in International Conference on Machine Learning, 2024, pp. 81–92

work page 2024
[21]

An image is worth 16x16 words: Transformers for image recognition at scale,

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021, pp. 1–22

work page 2021
[22]

Subjective quality assessment of user-generated content gaming videos,

Xiangxu Yu, Zhengzhong Tu, Zhenqiang Ying, Alan C. Bovik, Neil Birkbeck, Yilin Wang, and Balu Adsumilli, “Subjective quality assessment of user-generated content gaming videos,” in IEEE Winter Conference on Applications of Computer Vision, 2022, pp. 74–83

work page 2022
[23]

PaLi-3 vision language models: Smaller, faster, stronger,

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut, “PaLi-3 vision language models: Smaller, faster, stronger,” arXiv prepri...

work page arXiv 2023
[24]

VideoPrism: A foundational visual encoder for video understanding,

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong, “VideoPrism: A foundational visual encoder for video understanding,” in...

work page 2024
[25]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team and Google DeepMind, “Gemma: Open models based on Gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
