Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Huyang Sun; Kai Zhao; Tianjiao Li; Xiang Li; Yang Liu

arxiv: 2606.01897 · v3 · pith:WVSNMT6Lnew · submitted 2026-06-01 · 💻 cs.AI


Pith reviewed 2026-06-28 14:49 UTC · model grok-4.3

classification 💻 cs.AI

keywords: user-generated content · community resonance · social chain-of-thought · multimodal assessment · CASTER-Bench · MEDEA · video quality assessment


The pith

MEDEA assesses user-generated content quality by simulating diverse community perspectives through Social-CoT rather than focusing on visual aesthetics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that quality in user-generated content depends on social resonance within communities, not just visual fidelity. It introduces CASTER as a task for evaluating this resonance and MEDEA as a model that uses Social Chain-of-Thought to instantiate multiple viewer personas and simulate collective reactions. MEDEA is trained with supervised fine-tuning followed by reinforcement learning using a Social Alignment Reward. The approach is tested on the new CASTER-Bench benchmark, where it outperforms existing methods and generates reasoning that matches human community feedback.

Core claim

MEDEA introduces a Social Chain-of-Thought mechanism that performs multimodal perspective-taking by instantiating diverse viewer personas to simulate the community mind before making a quality judgment, trained via two-stage supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ground reasoning in authentic human social cognition.

What carries the argument

Social Chain-of-Thought (Social-CoT), which instantiates diverse viewer personas for multimodal perspective-taking to simulate collective cognitive and emotional reactions.

Load-bearing premise

That instantiating diverse viewer personas via Social-CoT and training with Social Alignment Reward produces reasoning paths grounded in authentic human social cognition rather than artifacts of the training process or benchmark.

What would settle it

A study where independent human raters compare MEDEA's reasoning paths and judgments against actual community responses on held-out UGC items, finding no better alignment than traditional VQA methods.

Figures

Figures reproduced from arXiv: 2606.01897 by Huyang Sun, Kai Zhao, Tianjiao Li, Xiang Li, Yang Liu.

**Figure 2.** Overview of the MEDEA framework. The upper part depicts the Social-CoT construction pipeline, in …

**Figure 3.** Cover and 7 uniformly sampled key frames of the example.

**Figure 4.** Representative examples of “inflated bubbles”: videos with high popularity metrics that experts rated …

**Figure 5.** Oracle Social Context: Social-CoT reasoning path generated by Gemini, grounded in real high …

**Figure 6.** Social-CoT with Alignment: Reasoning paths generated by MEDEA trained with Social Alignment Re…

**Figure 7.** Social-CoT without Alignment: Reasoning paths generated by MEDEA trained without Social Align…

**Figure 8.** Prompt used to generate reasoning content.

**Figure 9.** Prompt used to train MEDEA.

| Threshold | High-Quality Precision | High-Quality Recall | High-Quality F1 | Low-Quality Precision | Low-Quality Recall | Low-Quality F1 | Macro Average Precision | Macro Average Recall | Macro Average F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0.206 | 0.277 | 1.000 | 0.434 | 0.000 | 0.000 | 0.000 | 0.139 | 0.500 | 0.217 |
| 0.216 | 0.279 | 0.998 | 0.436 | 0.923 | 0.011 | 0.022 | 0.601 | 0.504 | 0.229 |
| 0.396 | 0.308 | 0.890 | 0.458 | 0.847 | 0.233 | 0.365 | 0.577 | 0.562 | 0.412 |
| 0.586 | 0.333 | 0.527 | 0.408 | 0.767 | 0.596 | 0.671 | 0.550 | 0.561 | 0.539 |
| 0.616⋆ | 0.358 | 0.454 | 0.400 | 0.766 | 0… | | | | |
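The threshold rows attached to the Figure 9 entry above are consistent with the Macro Average columns being the unweighted mean of the High-Quality and Low-Quality values (including F1, which matches the mean of the two per-class F1 scores). The following is a minimal Python check on the four complete rows; the numbers are copied from those rows, while the names `ROWS`, `macro_average`, and `max_deviation` are illustrative and not taken from the paper.

```python
# Each row: (threshold, high-quality P/R/F1, low-quality P/R/F1, reported macro P/R/F1).
# Values are copied from the four complete threshold rows; the truncated row is omitted.
ROWS = [
    (0.206, (0.277, 1.000, 0.434), (0.000, 0.000, 0.000), (0.139, 0.500, 0.217)),
    (0.216, (0.279, 0.998, 0.436), (0.923, 0.011, 0.022), (0.601, 0.504, 0.229)),
    (0.396, (0.308, 0.890, 0.458), (0.847, 0.233, 0.365), (0.577, 0.562, 0.412)),
    (0.586, (0.333, 0.527, 0.408), (0.767, 0.596, 0.671), (0.550, 0.561, 0.539)),
]


def macro_average(high_quality, low_quality):
    """Unweighted mean of the two per-class (precision, recall, F1) triples."""
    return tuple((h + l) / 2 for h, l in zip(high_quality, low_quality))


def max_deviation(rows):
    """Largest absolute gap between the recomputed and the reported macro values."""
    worst = 0.0
    for _threshold, high_quality, low_quality, reported in rows:
        computed = macro_average(high_quality, low_quality)
        for c, r in zip(computed, reported):
            worst = max(worst, abs(c - r))
    return worst


if __name__ == "__main__":
    # A gap below 0.001 is what three-decimal rounding alone would produce.
    print(f"largest gap: {max_deviation(ROWS):.4f}")
```

A gap within rounding error supports reading the macro columns as simple two-class averages, which weights the minority class equally with the majority class.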

Original abstract

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines a new community-resonance task for UGC evaluation and releases a human-annotated benchmark, but the abstract supplies no metrics, baselines, or ablations to support the outperformance claim.


The main takeaway is that this work wants to move UGC assessment from visual signal metrics to whether content resonates with communities, and it introduces CASTER as the task plus MEDEA with Social-CoT for simulating viewer personas. They also release CASTER-Bench with human annotations across UGC categories and train via SFT then process-supervised RL with a Social Alignment Reward.

What the paper does reasonably well is lay out a clear motivation for why traditional VQA misses social dynamics on platforms, and the persona-simulation step in Social-CoT is a straightforward way to operationalize collective reactions. Releasing the benchmark is a concrete output that others could build on.

The soft spots sit in the evaluation details. The abstract states that MEDEA significantly outperforms baselines and produces reasoning paths aligned with real community feedback, yet it gives no numbers, no baseline names, no dataset statistics, and no ablation results. That makes the central claim impossible to check from the abstract alone. The circularity concern is also present: the reward is defined to enforce grounding in human cognition and is optimized against the same human-annotated CASTER-Bench, so without evidence of held-out judgments or independent validation it is unclear whether the paths reflect authentic community cognition or just benchmark fitting. The stress-test note did not identify an internal contradiction, but the missing experimental specifics remain the practical limit.

This is for researchers working on multimodal models for social media, engagement prediction, or content systems. Someone looking for new benchmarks might get value from CASTER-Bench; anyone needing rigorous evidence on whether the method improves over existing approaches will need the full results section.

I would send it for peer review. The framing is timely and the benchmark is a usable contribution even if the method section needs more empirical grounding to be convincing.

Referee Report

3 major / 2 minor

Summary. The paper introduces CASTER, a new task for assessing whether user-generated content achieves positive community resonance via multimodal attributes rather than visual quality. It proposes MEDEA, which uses a Social Chain-of-Thought (Social-CoT) mechanism to instantiate diverse viewer personas and simulate collective reactions before judging quality. MEDEA is trained in two stages (supervised fine-tuning followed by process-supervised RL with a Social Alignment Reward) and evaluated on the newly released human-annotated CASTER-Bench, with the central claim that it significantly outperforms baselines while producing interpretable, empathetic reasoning paths aligned with real community feedback.

Significance. If the empirical claims hold after verification, the shift from signal-centric VQA to human-centric social resonance assessment could influence UGC recommendation, moderation, and content creation tools. The release of CASTER-Bench and the Social-CoT mechanism for multimodal perspective-taking represent concrete contributions that enable future work on community-aware evaluation. The two-stage training approach with process supervision is a standard strength when accompanied by ablations.

major comments (3)

[Abstract] The assertion that MEDEA 'significantly outperforms state-of-the-art baselines on CASTER-Bench' supplies no metrics, baseline names, dataset statistics, or significance tests, which is load-bearing for the central empirical claim.
[§3.2] (Social Alignment Reward definition) The reward is stated to enforce grounding in authentic human social cognition and is optimized against CASTER-Bench annotations, but the text does not specify whether the reward model is trained on held-out human judgments independent of the benchmark labels or whether it re-uses the same annotations; this creates a direct risk that reported gains reduce to benchmark fitting rather than independent prediction of community resonance.
[§5] (Experiments) No ablation isolating the contribution of Social-CoT persona instantiation versus standard CoT, or of the RL stage versus SFT alone, is reported; without these controls the claim that the reasoning paths reflect genuine community cognition rather than training artifacts cannot be evaluated.
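The two controls requested in the §5 comment form a 2×2 design: reasoning style crossed with training stage. A minimal Python sketch of enumerating those four configurations follows; the labels `standard_cot`, `social_cot`, `sft_only`, and `sft_plus_rl` are hypothetical names chosen for illustration, not identifiers from the paper.

```python
from itertools import product

# Hypothetical labels for the two factors named in the referee comment.
REASONING_STYLES = ("standard_cot", "social_cot")
TRAINING_STAGES = ("sft_only", "sft_plus_rl")


def ablation_grid():
    """Return every (reasoning, training) combination as a config dict."""
    return [
        {"reasoning": reasoning, "training": training}
        for reasoning, training in product(REASONING_STYLES, TRAINING_STAGES)
    ]


if __name__ == "__main__":
    for config in ablation_grid():
        print(config)
```

Running all four cells, rather than only the full model and one baseline, is what lets the effect of persona instantiation be separated from the effect of the RL stage.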

minor comments (2)

[§2] The related-work section should explicitly contrast Social-CoT with prior persona-based or theory-of-mind simulation methods in NLP and multimodal reasoning.
[Figure 2] The Social-CoT diagram would benefit from an explicit legend distinguishing the persona instantiation step from the final judgment aggregation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and controls will strengthen the paper. We address each major comment below.


Referee: [Abstract] The assertion that MEDEA 'significantly outperforms state-of-the-art baselines on CASTER-Bench' supplies no metrics, baseline names, dataset statistics, or significance tests, which is load-bearing for the central empirical claim.

Authors: We agree that the abstract should supply concrete support for the central claim. In the revised manuscript we will insert the key performance numbers, baseline names, and reference to the statistical tests already present in §5.

Revision: yes

Referee: [§3.2] (Social Alignment Reward definition) The reward is stated to enforce grounding in authentic human social cognition and is optimized against CASTER-Bench annotations, but the text does not specify whether the reward model is trained on held-out human judgments independent of the benchmark labels or whether it re-uses the same annotations; this creates a direct risk that reported gains reduce to benchmark fitting rather than independent prediction of community resonance.

Authors: We will revise §3.2 to state explicitly that the Social Alignment Reward model is trained on a held-out annotation set that is disjoint from the CASTER-Bench test labels used for final evaluation, thereby removing any ambiguity about data leakage.

Revision: yes

Referee: [§5] (Experiments) No ablation isolating the contribution of Social-CoT persona instantiation versus standard CoT, or of the RL stage versus SFT alone, is reported; without these controls the claim that the reasoning paths reflect genuine community cognition rather than training artifacts cannot be evaluated.

Authors: We accept that the current experiments lack these controls. We will add the requested ablations (Social-CoT vs. standard CoT and RL vs. SFT) to the revised §5, together with the corresponding performance deltas and reasoning-path analyses.

Revision: yes
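The disjointness promised in the response to the §3.2 comment is something a reader could verify mechanically if item identifiers for both sets were released. A minimal Python sketch, assuming each annotated item carries a string ID; the function name `leaked_ids` and the example IDs are hypothetical and not drawn from CASTER-Bench.

```python
from typing import Iterable, Set


def leaked_ids(reward_train_ids: Iterable[str], benchmark_test_ids: Iterable[str]) -> Set[str]:
    """Return item IDs present in both sets; an empty result means the sets are disjoint."""
    return set(reward_train_ids) & set(benchmark_test_ids)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    reward_train = ["ugc_001", "ugc_002", "ugc_003"]
    benchmark_test = ["ugc_101", "ugc_102"]
    overlap = leaked_ids(reward_train, benchmark_test)
    print("disjoint" if not overlap else f"leak: {sorted(overlap)}")
```

An ID-level check only rules out exact reuse; near-duplicate videos or re-uploads would need a content-level comparison.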

Circularity Check

0 steps flagged

No significant circularity identified


The provided abstract and description outline a two-stage training process (SFT followed by process-supervised RL using Social Alignment Reward) evaluated on the separately constructed human-annotated CASTER-Bench. No quoted equation, definition, or step reduces a claimed prediction or result to its own inputs by construction, nor does any load-bearing premise collapse into a self-citation or ansatz smuggled from prior work by the same authors. The Social Alignment Reward is presented as a mechanism to align with human cognition rather than a fitted parameter whose output is then relabeled as an independent prediction. The central claim of outperformance on the benchmark therefore remains an external empirical result rather than a definitional tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the untested premise that persona-based multimodal perspective-taking plus a learned Social Alignment Reward can faithfully reproduce community-level human judgments; the benchmark itself is a new constructed artifact whose construction details are not supplied.

free parameters (1)

Social Alignment Reward model parameters: learned during process-supervised RL to enforce alignment with human social cognition; value not reported.

axioms (1)

Domain assumption: community resonance can be accurately simulated by instantiating diverse viewer personas and aggregating their cognitive/emotional reactions. Invoked to justify the Social-CoT mechanism in the abstract.

invented entities (2)

Social-CoT (no independent evidence). Purpose: multimodal perspective-taking to simulate the community mind. New reasoning mechanism introduced without external validation in the abstract.
CASTER-Bench (no independent evidence). Purpose: human-annotated benchmark for the new resonance assessment task. New dataset whose annotation protocol and coverage details are not provided.


Prompt text (extracted as references [39]–[62])

1. Cover Image: The video’s cover image
2. Key Frames: Seven key frames extracted from the video
3. Title: {title}
4. Tags: {tag}
5. ASR: {asr}
6. Primary Category: {new_tid_name}
7. Secondary Category: {new_sub_tid_name}
8. Duration: {duration}
9. Resolution: {resolution}
10. Vertical Format: {vertical}
11. Top-liked Comments: A pool of high-like comments from which 15–20 strongly content-related comments must be selected

Output Requirements

The output...

- Exact Content Matching (Highest Priority): Comments should directly correspond to specific elements of the video content. Examples:
  - “This looks amazing” → linked to visual features
  - “The mixed language makes it hard to understand” → linked to ASR content
- Thematic Relevance (Secondary Priority): Comments should relate to the overall theme or quality of the video. Examples:
  - “The image quality is too blurry” → linked to visual resolution
  - “This is a waste of time” → linked to perceived content value
- Mandatory Exclusion Rule: Comments referring to auditory or sound-related elements must be excluded.
- Handling Offensive Comments: Highly liked comments containing insults toward the uploader should be categorized as opposing the video’s creative quality and retained if they satisfy content relevance criteria.

Reasoning Process Construction Rules

- Independent Coverage Requirement: Each selected comment must appear at least once independently. Merging or collapsing similar comments is prohibited.
- Video–Comment Alignment: Only the provided 11 video attributes may be referenced.
  - Precise alignment: “When viewers see {visual information} / read {ASR content}, they may express {comment}.”
  - Thematic alignment: “Given the video’s overall characteristics, it may lead to comments such as {comment}.”
- Speculative Expression Style: Use inferential phrasing such as “viewers may point out...” and incorporate audience expectations.
- Mandatory Statistical Summary:
  - Report the number of supportive and opposing comments.
  - Ensure strict numerical consistency.
  - Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)
  - Decision rule: If z ≥ 1.5, conclude Support; otherwise, Not Clearly Supportive.
  - The z-score must be enclosed in \boxed{}.

...

1. Insert a blank line between each simulated comment.
2. Use <video> to mark video information and <comment> to mark simulated comments.
3. Annotate each comment with its stance and index:
   - Support Comment + index
   - Opposing Comment + index

<Current Task> Cover Image: <image> Key Frames: <image><image><image><image><image><image><image> Titl...

- Cover Image: The video’s cover image
- Key Frames: Seven key frames extracted from the video
- Primary Category: {new_tid_name}
- Secondary Category: {new_sub_tid_name}
- Duration: {duration}
- Resolution: {resolution}
- Vertical Format: {vertical}

Criteria for Overall Comment Tendency

- The simulated comments must contain at least 15 entries. All comments must be non-duplicated and explicitly appear in the reasoning process.
- Assume that among the simulated comments, X comments are classified as *supportive* and Y comments are classified as *opposing*.
- Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)
- If z ≥ 1.5, the overall comment tendency is classified as "Support"; otherwise, it is classified as "Not Clearly Supportive".
- In the output, the z value must be wrapped using \boxed, for example: "z = \boxed{-2}"
- The numbers of supportive and opposing comments reported in the final summary must strictly match those generated during the reasoning process. Fabrication or inconsistency is not allowed.

<Current Task> Cover Image: <image> Key Frames: <image><image><image><image><image><image><image> Title: Tags: ASR: Primary Category: Secondary Category: Duration: Reso...


1. Cover Image: The video’s cover image
2. Key Frames: Seven key frames extracted from the video
3. Title: {title}
4. Tags: {tag}
5. ASR: {asr}
6. Primary Category: {new_tid_name}
7. Secondary Category: {new_sub_tid_name}
8. Duration: {duration}
9. Resolution: {resolution}
10. Vertical Format: {vertical}
11. Top-liked Comments: A pool of high-like comments from which 15–20 strongly content-related comments must be selected

————————————————–
Output Requirements
The output...

Exact Content Matching (Highest Priority): Comments should directly correspond to specific elements of the video content. Examples:
- “This looks amazing” → linked to visual features
- “The mixed language makes it hard to understand” → linked to ASR content

Thematic Relevance (Secondary Priority): Comments should relate to the overall theme or quality of the video. Examples:
- “The image quality is too blurry” → linked to visual resolution
- “This is a waste of time” → linked to perceived content value

Mandatory Exclusion Rule: Comments referring to auditory or sound-related elements must be excluded.

Handling Offensive Comments: Highly liked comments containing insults toward the uploader should be categorized as opposing the video’s creative quality and retained if they satisfy content relevance criteria.

————————————————–
Reasoning Process Construction Rules

Independent Coverage Requirement: Each selected comment must appear at least once independently. Merging or collapsing similar comments is prohibited.

Video–Comment Alignment:
- Precise alignment: “When viewers see {visual information} / read {ASR content}, they may express {comment}.”
- Thematic alignment: “Given the video’s overall characteristics, it may lead to comments such as {comment}.”
Only the provided 11 video attributes may be referenced.

Speculative Expression Style: Use inferential phrasing such as “viewers may point out...” and incorporate audience expectations.

Mandatory Statistical Summary:
- Report the number of supportive and opposing comments.
- Ensure strict numerical consistency.
- Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)
- Decision rule: If z ≥ 1.5, conclude Support; otherwise, Not Clearly Supportive.
- The z-score must be enclosed in \boxed{}.

————————————————–
...

1. Insert a blank line between each simulated comment.
2. Use <video> to mark video information and <comment> to mark simulated comments.
3. Annotate each comment with its stance and index:
- Support Comment + index
- Opposing Comment + index

————————————————–
<Current Task>
Cover Image: <image>
Key Frames: <image><image><image><image><image><image><image>
Titl...

Cover Image: The video’s cover image
Key Frames: Seven key frames extracted from the video
Primary Category: {new_tid_name}
Secondary Category: {new_sub_tid_name}
Duration: {duration}
Resolution: {resolution}
Vertical Format: {vertical}

Criteria for Overall Comment Tendency

The simulated comments must contain at least 15 entries. All comments must be non-duplicated and explicitly appear in the reasoning process.

Assume that among the simulated comments:
- X comments are classified as *supportive*
- Y comments are classified as *opposing*

Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)

If z ≥ 1.5, the overall comment tendency is classified as "Support"; otherwise, it is classified as "Not Clearly Supportive".
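The decision rule above can be illustrated with a few lines of Python. This is only a minimal sketch of the stated formula and threshold; the function names `skellam_z` and `overall_tendency` are hypothetical and do not come from the paper, and rejecting an empty comment set is an assumption made here so that the division is well defined.

```python
import math


def skellam_z(supportive: int, opposing: int) -> float:
    """Sigma-normalized difference z = (X - Y) / sqrt(X + Y)."""
    if supportive < 0 or opposing < 0:
        raise ValueError("comment counts must be non-negative")
    total = supportive + opposing
    if total == 0:
        # Assumption: with no classified comments the score is undefined.
        raise ValueError("need at least one classified comment")
    return (supportive - opposing) / math.sqrt(total)


def overall_tendency(supportive: int, opposing: int, threshold: float = 1.5) -> str:
    """Apply the prompt's rule: z >= 1.5 means Support."""
    z = skellam_z(supportive, opposing)
    return "Support" if z >= threshold else "Not Clearly Supportive"


# 12 supportive vs. 4 opposing: z = 8 / sqrt(16) = 2.0
example_z = skellam_z(12, 4)
example_label = overall_tendency(12, 4)
```

With 9 supportive and 7 opposing comments the same rule gives z = 2 / 4 = 0.5, so a simple majority of supportive comments is not enough to be labelled "Support".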

In the output, the z value must be wrapped using \boxed{}, for example: "z = \boxed{-2}".
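A boxed value makes the final score easy to recover from free-form model output. The sketch below shows one possible way to do this, assuming the value is written in LaTeX style as `\boxed{...}` and contains a plain decimal number; the helper name `extract_boxed_z` and the choice to take the last boxed value are illustrative assumptions, not part of the paper.

```python
import re
from typing import Optional

# Assumed output form: \boxed{<number>}, e.g. "z = \boxed{-2}".
_BOXED_NUMBER = re.compile(r"\\boxed\{\s*(-?\d+(?:\.\d+)?)\s*\}")


def extract_boxed_z(output_text: str) -> Optional[float]:
    """Return the last boxed number in the text, or None if there is none."""
    matches = _BOXED_NUMBER.findall(output_text)
    if not matches:
        return None
    return float(matches[-1])


parsed = extract_boxed_z(r"Statistical summary: z = \boxed{-2}")
```

Taking the last match is a design choice for outputs that mention intermediate boxed values before the final summary; a stricter parser could instead reject any output with more than one boxed value.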

The numbers of supportive and opposing comments reported in the final summary must strictly match those generated during the reasoning process. Fabrication or inconsistency is not allowed.

<Current Task>
Cover Image: <image>
Key Frames: <image><image><image><image><image><image><image>
Title:
Tags:
ASR:
Primary Category:
Secondary Category:
Duration:
Reso...
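The consistency requirement stated before the task template (reported counts must match the comments generated during reasoning) lends itself to a mechanical check. The following sketch assumes that stance annotations appear literally as "Support Comment N" and "Opposing Comment N" in the reasoning text; that rendering, the tolerance, and the function names are assumptions for illustration only.

```python
import math
import re
from typing import Tuple


def count_stances(reasoning: str) -> Tuple[int, int]:
    """Count distinct indexed supportive and opposing comments."""
    support = set(re.findall(r"\bSupport Comment\s*(\d+)", reasoning))
    opposing = set(re.findall(r"\bOpposing Comment\s*(\d+)", reasoning))
    return len(support), len(opposing)


def summary_is_consistent(reasoning: str, reported_x: int, reported_y: int,
                          reported_z: float, tol: float = 0.01) -> bool:
    """Check reported counts and z against what the reasoning contains."""
    x, y = count_stances(reasoning)
    if (x, y) != (reported_x, reported_y) or x + y == 0:
        return False
    expected_z = (x - y) / math.sqrt(x + y)
    return abs(expected_z - reported_z) <= tol


reasoning_text = (
    "Support Comment 1: the editing is clean\n\n"
    "Support Comment 2: great cover image\n\n"
    "Opposing Comment 1: the title is misleading\n\n"
    "Support Comment 3: very informative"
)
# 3 supportive, 1 opposing: z = 2 / sqrt(4) = 1.0
ok = summary_is_consistent(reasoning_text, 3, 1, 1.0)
```

A check of this kind could be used to filter generated reasoning traces whose summary disagrees with their own simulated comments.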
