arxiv: 2606.01897 · v3 · pith:WVSNMT6Lnew · submitted 2026-06-01 · 💻 cs.AI

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Pith reviewed 2026-06-28 14:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords user-generated content · community resonance · social chain-of-thought · multimodal assessment · CASTER-Bench · MEDEA · video quality assessment

The pith

MEDEA assesses user-generated content quality by simulating diverse community perspectives through Social-CoT rather than focusing on visual aesthetics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that quality in user-generated content depends on social resonance within communities, not just visual fidelity. It introduces CASTER as a task for evaluating this resonance and MEDEA as a model that uses Social Chain-of-Thought to instantiate multiple viewer personas and simulate collective reactions. MEDEA is trained with supervised fine-tuning followed by reinforcement learning using a Social Alignment Reward. The approach is tested on the new CASTER-Bench benchmark where it outperforms existing methods and generates reasoning that matches human community feedback.

Core claim

MEDEA introduces a Social Chain-of-Thought mechanism that performs multimodal perspective-taking: it instantiates diverse viewer personas to simulate the community mind before making a quality judgment. The model is trained in two stages, supervised fine-tuning followed by process-supervised reinforcement learning with a Social Alignment Reward, with the aim of grounding its reasoning in authentic human social cognition.

What carries the argument

Social Chain-of-Thought (Social-CoT), which instantiates diverse viewer personas for multimodal perspective-taking to simulate collective cognitive and emotional reactions.

Load-bearing premise

That instantiating diverse viewer personas via Social-CoT and training with Social Alignment Reward produces reasoning paths grounded in authentic human social cognition rather than artifacts of the training process or benchmark.

What would settle it

A study where independent human raters compare MEDEA's reasoning paths and judgments against actual community responses on held-out UGC items, finding no better alignment than traditional VQA methods.

Figures

Figures reproduced from arXiv: 2606.01897 by Huyang Sun, Kai Zhao, Tianjiao Li, Xiang Li, Yang Liu.

Figure 1: Overview of CASTER-Bench. (a) Category-level composition of the benchmark, covering 1,485 UGC…
Figure 2: Overview of the MEDEA framework. The upper part depicts the Social-CoT construction pipeline, in…
Figure 3: Cover and 7 uniformly sampled key frames of the example.
Figure 4: Representative examples of “inflated bubbles”: videos with high popularity metrics that experts rated…
Figure 5: Oracle Social Context: Social-CoT reasoning path generated by Gemini, grounded in real high…
Figure 6: Social-CoT with Alignment: Reasoning paths generated by MEDEA trained with Social Alignment Re…
Figure 7: Social-CoT without Alignment: Reasoning paths generated by MEDEA trained without Social Align…
Figure 8: Prompt used to generate reasoning content.
Figure 9: Prompt used to train MEDEA.

Threshold table extracted alongside Figure 9 (last row truncated in the source):

            High-Quality                Low-Quality                 Macro Average
threshold   Precision  Recall  F1       Precision  Recall  F1       Precision  Recall  F1
0.206       0.277      1.000   0.434    0.000      0.000   0.000    0.139      0.500   0.217
0.216       0.279      0.998   0.436    0.923      0.011   0.022    0.601      0.504   0.229
0.396       0.308      0.890   0.458    0.847      0.233   0.365    0.577      0.562   0.412
0.586       0.333      0.527   0.408    0.767      0.596   0.671    0.550      0.561   0.539
0.616⋆      0.358      0.454   0.400    0.766      0…
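The per-class and macro-average columns of the threshold table can be cross-checked with a few lines of arithmetic. The sketch below assumes the table uses the standard F1 definition (harmonic mean of precision and recall) and an unweighted mean over the two classes for the macro average. The page does not state either definition, and the function names are illustrative rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(high_quality: float, low_quality: float) -> float:
    """Unweighted mean of a metric over the two classes (assumed definition)."""
    return (high_quality + low_quality) / 2


# Row with threshold 0.216 from the table above.
hq_f1 = f1_score(0.279, 0.998)
lq_f1 = f1_score(0.923, 0.011)
print(round(hq_f1, 3), round(lq_f1, 3), round(macro_average(hq_f1, lq_f1), 3))
```

Under these assumptions the row with threshold 0.216 reproduces the tabulated F1 values (0.436, 0.022, and 0.229 for the macro average), which suggests the macro column is a plain average of the two class scores rather than a support-weighted one.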
Original abstract

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CASTER, a new task for assessing whether user-generated content achieves positive community resonance via multimodal attributes rather than visual quality. It proposes MEDEA, which uses a Social Chain-of-Thought (Social-CoT) mechanism to instantiate diverse viewer personas and simulate collective reactions before judging quality. MEDEA is trained in two stages (supervised fine-tuning followed by process-supervised RL with a Social Alignment Reward) and evaluated on the newly released human-annotated CASTER-Bench, with the central claim that it significantly outperforms baselines while producing interpretable, empathetic reasoning paths aligned with real community feedback.

Significance. If the empirical claims hold after verification, the shift from signal-centric VQA to human-centric social resonance assessment could influence UGC recommendation, moderation, and content creation tools. The release of CASTER-Bench and the Social-CoT mechanism for multimodal perspective-taking represent concrete contributions that enable future work on community-aware evaluation. The two-stage training approach with process supervision is a standard strength when accompanied by ablations.

major comments (3)
  1. [Abstract] Abstract: the assertion that MEDEA 'significantly outperforms state-of-the-art baselines on CASTER-Bench' supplies no metrics, baseline names, dataset statistics, or significance tests, which is load-bearing for the central empirical claim.
  2. [§3.2] §3.2 (Social Alignment Reward definition): the reward is stated to enforce grounding in authentic human social cognition and is optimized against CASTER-Bench annotations, but the text does not specify whether the reward model is trained on held-out human judgments independent of the benchmark labels or whether it re-uses the same annotations; this creates a direct risk that reported gains reduce to benchmark fitting rather than independent prediction of community resonance.
  3. [§5] §5 (Experiments): no ablation isolating the contribution of Social-CoT persona instantiation versus standard CoT, or of the RL stage versus SFT alone, is reported; without these controls the claim that the reasoning paths reflect genuine community cognition rather than training artifacts cannot be evaluated.
minor comments (2)
  1. [§2] The related-work section should explicitly contrast Social-CoT with prior persona-based or theory-of-mind simulation methods in NLP and multimodal reasoning.
  2. [Figure 2] Figure 2 (Social-CoT diagram) would benefit from an explicit legend distinguishing the persona instantiation step from the final judgment aggregation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and controls will strengthen the paper. We address each major comment below.

  1. Referee: [Abstract] Abstract: the assertion that MEDEA 'significantly outperforms state-of-the-art baselines on CASTER-Bench' supplies no metrics, baseline names, dataset statistics, or significance tests, which is load-bearing for the central empirical claim.

    Authors: We agree that the abstract should supply concrete support for the central claim. In the revised manuscript we will insert the key performance numbers, baseline names, and reference to the statistical tests already present in §5. revision: yes

  2. Referee: [§3.2] §3.2 (Social Alignment Reward definition): the reward is stated to enforce grounding in authentic human social cognition and is optimized against CASTER-Bench annotations, but the text does not specify whether the reward model is trained on held-out human judgments independent of the benchmark labels or whether it re-uses the same annotations; this creates a direct risk that reported gains reduce to benchmark fitting rather than independent prediction of community resonance.

    Authors: We will revise §3.2 to state explicitly that the Social Alignment Reward model is trained on a held-out annotation set that is disjoint from the CASTER-Bench test labels used for final evaluation, thereby removing any ambiguity about data leakage. revision: yes

  3. Referee: [§5] §5 (Experiments): no ablation isolating the contribution of Social-CoT persona instantiation versus standard CoT, or of the RL stage versus SFT alone, is reported; without these controls the claim that the reasoning paths reflect genuine community cognition rather than training artifacts cannot be evaluated.

    Authors: We accept that the current experiments lack these controls. We will add the requested ablations (Social-CoT vs. standard CoT and RL vs. SFT) to the revised §5, together with the corresponding performance deltas and reasoning-path analyses. revision: yes
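The held-out guarantee promised in the response to comment 2 comes down to a disjointness check between two sets of item identifiers. The sketch below is a minimal illustration of such a check. The identifier lists and the function name are hypothetical, and nothing here reflects how the authors actually partition their annotations.

```python
from typing import Iterable, List


def leaked_ids(reward_train_ids: Iterable[str],
               benchmark_test_ids: Iterable[str]) -> List[str]:
    """Return the identifiers present in both splits, sorted for stable output.

    An empty result means the reward-model training set and the evaluation
    set are disjoint at the item level.
    """
    return sorted(set(reward_train_ids) & set(benchmark_test_ids))


# Hypothetical identifiers, for illustration only.
overlap = leaked_ids(["v1", "v2", "v3"], ["v3", "v4"])
if overlap:
    print("items shared between reward training and test:", overlap)
```

An item-level check of this kind would not catch near-duplicate videos or shared comment pools, so it is a necessary condition for independence rather than a sufficient one.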

Circularity Check

0 steps flagged

No significant circularity identified

The provided abstract and description outline a two-stage training process (SFT followed by process-supervised RL using Social Alignment Reward) evaluated on the separately constructed human-annotated CASTER-Bench. No quoted equation, definition, or step reduces a claimed prediction or result to its own inputs by construction, nor does any load-bearing premise collapse into a self-citation or ansatz smuggled from prior work by the same authors. The Social Alignment Reward is presented as a mechanism to align with human cognition rather than a fitted parameter whose output is then relabeled as an independent prediction. The central claim of outperformance on the benchmark therefore remains an external empirical result rather than a definitional tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the untested premise that persona-based multimodal perspective-taking plus a learned Social Alignment Reward can faithfully reproduce community-level human judgments; the benchmark itself is a new constructed artifact whose construction details are not supplied.

free parameters (1)
  • Social Alignment Reward model parameters
    Learned during process-supervised RL to enforce alignment with human social cognition; value not reported.
axioms (1)
  • domain assumption: Community resonance can be accurately simulated by instantiating diverse viewer personas and aggregating their cognitive/emotional reactions
    Invoked to justify the Social-CoT mechanism in the abstract.
invented entities (2)
  • Social-CoT (no independent evidence)
    purpose: Multimodal perspective-taking to simulate the community mind
    New reasoning mechanism introduced without external validation in the abstract.
  • CASTER-Bench (no independent evidence)
    purpose: Human-annotated benchmark for the new resonance assessment task
    New dataset whose annotation protocol and coverage details are not provided.

pith-pipeline@v0.9.1-grok · 5773 in / 1365 out tokens · 27519 ms · 2026-06-28T14:49:15.605030+00:00 · methodology


©2026 Bilibili Index Team. All Rights Reserved.

Prompt fragments

  1. Cover Image: The video’s cover image
  2. Key Frames: Seven key frames extracted from the video
  3. Title: {title}
  4. Tags: {tag}
  5. ASR: {asr}
  6. Primary Category: {new_tid_name}
  7. Secondary Category: {new_sub_tid_name}
  8. Duration: {duration}
  9. Resolution: {resolution}
  10. Vertical Format: {vertical}
  11. Top-liked Comments: A pool of high-like comments from which 15–20 strongly content-related comments must be selected

————————————————–
Output Requirements

The output...

  • Exact Content Matching (Highest Priority): Comments should directly correspond to specific elements of the video content. Examples:
    - “This looks amazing” → linked to visual features
    - “The mixed language makes it hard to understand” → linked to ASR content
  • Thematic Relevance (Secondary Priority): Comments should relate to the overall theme or quality of the video. Examples:
    - “The image quality is too blurry” → linked to visual resolution
    - “This is a waste of time” → linked to perceived content value
  • Mandatory Exclusion Rule: Comments referring to auditory or sound-related elements must be excluded.
  • Handling Offensive Comments: Highly liked comments containing insults toward the uploader should be categorized as opposing the video’s creative quality and retained if they satisfy content relevance criteria.

————————————————–
Reasoning Process Construction Rules

  • Independent Coverage Requirement: Each selected comment must appear at least once independently. Merging or collapsing similar comments is prohibited.
  • Video–Comment Alignment:
    - Precise alignment: “When viewers see {visual information} / read {ASR content}, they may express {comment}.”
    - Thematic alignment: “Given the video’s overall characteristics, it may lead to comments such as {comment}.”
    Only the provided 11 video attributes may be referenced.
  • Speculative Expression Style: Use inferential phrasing such as “viewers may point out...” and incorporate audience expectations.
  • Mandatory Statistical Summary:
    - Report the number of supportive and opposing comments.
    - Ensure strict numerical consistency.
    - Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)
    - Decision rule: If z ≥ 1.5, conclude Support; otherwise, Not Clearly Supportive.
    - The z-score must be enclosed in \boxed{}.

————————————————–
...

  1. Insert a blank line between each simulated comment.
  2. Use <video> to mark video information and <comment> to mark simulated comments.
  3. Annotate each comment with its stance and index:
    - Support Comment + index
    - Opposing Comment + index

————————————————–
<Current Task>
Cover Image: <image>
Key Frames: <image><image><image><image><image><image><image>
Titl...

————————————————–

  Cover Image: The video’s cover image
  Key Frames: Seven key frames extracted from the video
  Primary Category: {new_tid_name}
  Secondary Category: {new_sub_tid_name}
  Duration: {duration}
  Resolution: {resolution}
  Vertical Format: {vertical}

Criteria for Overall Comment Tendency

  • The simulated comments must contain at least 15 entries. All comments must be non-duplicated and explicitly appear in the reasoning process.
  • Assume that among the simulated comments:
    - X comments are classified as *supportive*
    - Y comments are classified as *opposing*
  • Compute the Sigma-normalized difference (Skellam z-score): z = (X - Y) / sqrt(X + Y)
  • If z ≥ 1.5, the overall comment tendency is classified as "Support"; otherwise, it is classified as "Not Clearly Supportive".
  • In the output, the z value must be wrapped using \boxed, for example: "z = \boxed{-2}"
  • The numbers of supportive and opposing comments reported in the final summary must strictly match those generated during the reasoning process. Fabrication or inconsistency is not allowed.

<Current Task>
Cover Image: <image>
Key Frames: <image><image><image><image><image><image><image>
Title: Tags: ASR: Primary Category: Secondary Category: Duration: Reso...
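
The decision rule stated in the prompt text, z = (X - Y) / sqrt(X + Y) with a cut-off of 1.5, can be sketched as follows. The function names and the choice to raise an error when there are no comments are assumptions made for illustration, since the prompt does not say how an empty comment set should be handled.

```python
import math


def skellam_z(supportive: int, opposing: int) -> float:
    """Sigma-normalized difference between supportive (X) and opposing (Y) counts."""
    total = supportive + opposing
    if total == 0:
        # Assumption: the prompt requires at least 15 comments, so an empty
        # set is treated as invalid input rather than given a score.
        raise ValueError("at least one comment is required")
    return (supportive - opposing) / math.sqrt(total)


def comment_tendency(supportive: int, opposing: int, threshold: float = 1.5) -> str:
    """Apply the prompt's rule: 'Support' if z >= threshold, else 'Not Clearly Supportive'."""
    if skellam_z(supportive, opposing) >= threshold:
        return "Support"
    return "Not Clearly Supportive"


# With 16 comments, 12 supportive vs. 4 opposing gives z = 8 / 4 = 2.0.
print(skellam_z(12, 4), comment_tendency(12, 4))
# 9 supportive vs. 7 opposing gives z = 2 / 4 = 0.5, below the cut-off.
print(skellam_z(9, 7), comment_tendency(9, 7))
```

Because the denominator grows with the total count, the same supportive share yields a larger z as more comments are simulated. With the minimum of 15 comments, for instance, 11 supportive against 4 opposing clears the 1.5 cut-off while 10 against 5 does not.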