pith. machine review for the scientific record.

arxiv: 2312.17090 · v1 · submitted 2023-12-28 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 3 Lean theorem links

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords large multi-modality models · visual quality assessment · image aesthetic assessment · discrete text levels · subjective judgment · OneAlign · video quality assessment

The pith

LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large multi-modality models can be trained to rate images and videos by learning the discrete text-defined levels that human raters use in subjective studies. This replaces direct numerical score regression and produces stronger results on image quality assessment, image aesthetic assessment, and video quality assessment under the unchanged model structure. The same discrete-level syllabus then unifies the three tasks into a single model, OneAlign. The approach shows a clear edge over score-based training variants while requiring no extra data or architecture changes.
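To make the syllabus concrete: a minimal sketch of how a continuous mean opinion score (MOS) could be bucketed into five text-defined levels and folded into an ordinary next-token-prediction target. The level names, equal-width binning, and prompt wording here are illustrative assumptions, not lifted from the paper's released code.

```python
# Five text-defined levels in ascending order, mirroring the
# ITU-style categories used in subjective studies (assumed names).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def mos_to_level(mos: float, mos_min: float, mos_max: float) -> str:
    """Bucket a continuous MOS into one of five equal-width bins
    over the dataset's score range and return its text label."""
    frac = (mos - mos_min) / (mos_max - mos_min)   # normalize to [0, 1]
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]

# The level word is appended to a rating prompt, so the LMM learns it
# through standard next-token prediction -- no new head, no new data.
prompt = "Rate the quality of the image."
target = f"The quality of the image is {mos_to_level(3.8, 1.0, 5.0)}."
# -> "The quality of the image is good."
```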

Core claim

Observing that human raters learn and judge discrete text-defined levels in subjective studies, the work teaches LMMs to output these levels for visual rating. The resulting Q-Align model reaches state-of-the-art performance on image quality assessment, image aesthetic assessment, and video quality assessment tasks under the original LMM structure. A syllabus built from the same discrete levels further unifies the three tasks into a single model termed OneAlign.

What carries the argument

The discrete text-defined rating levels syllabus that replaces numerical score regression to train LMMs for visual assessment.
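Benchmarks still report continuous correlations against human scores, so at inference the discrete levels must be collapsed back into a scalar. One natural readout, sketched below, is a probability-weighted expectation over the five level tokens; the 1-5 level weights and the softmax readout are assumptions for illustration, not a description of the released implementation.

```python
import numpy as np

LEVELS = ["bad", "poor", "fair", "good", "excellent"]
LEVEL_VALUES = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed ordinal weights

def logits_to_score(level_logits: np.ndarray) -> float:
    """Softmax over the five level-token logits, then take the
    expectation of the level values to recover a scalar score."""
    z = level_logits - level_logits.max()   # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs @ LEVEL_VALUES)

# The model mostly believes "good", with some mass on "excellent".
score = logits_to_score(np.array([0.1, 0.3, 1.0, 3.0, 2.0]))
print(round(score, 2))  # ~3.96, finer-grained than the argmax level "good"
```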

If this is right

  • The discrete-level syllabus outperforms direct-score training on IQA, IAA, and VQA benchmarks.
  • Three separate visual assessment tasks can be unified into one model without architectural changes.
  • State-of-the-art results are obtained while keeping the original LMM structure and data requirements unchanged.
  • The same training approach extends across image and video content types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar discrete-level training could apply to other subjective rating domains where humans use categorical language.
  • Different choices of text phrasing for the levels might further improve alignment with specific human populations.
  • The method suggests that discrete outputs could reduce calibration issues in other LMM alignment tasks.

Load-bearing premise

Human subjective judgment in visual scoring relies primarily on discrete text-defined levels rather than continuous numerical values.

What would settle it

If an LMM trained with direct numerical score regression matches or exceeds the discrete-level version on the same image quality, aesthetic, and video quality benchmarks, the claimed advantage would not hold.
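Concretely, settling this would come down to rank and linear correlation against human MOS on shared test splits, the field's standard SRCC and PLCC metrics. A minimal sketch of that head-to-head using scipy; the arrays are placeholders, not reported results.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder predictions from the two training variants on one split.
mos            = [3.8, 1.2, 4.5, 2.9, 3.1]   # human ground truth
pred_levels    = [3.9, 1.5, 4.4, 2.7, 3.2]   # discrete-level syllabus
pred_regressed = [3.5, 2.0, 4.0, 3.0, 2.8]   # direct-score baseline

for name, pred in [("levels", pred_levels), ("scores", pred_regressed)]:
    srcc, _ = spearmanr(mos, pred)           # rank correlation
    plcc, _ = pearsonr(mos, pred)            # linear correlation
    print(f"{name}: SRCC={srcc:.3f} PLCC={plcc:.3f}")
```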

read the original abstract

The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Q-Align, which trains large multi-modality models (LMMs) for visual scoring by using discrete text-defined rating levels (e.g., 'excellent', 'good') rather than numerical scores to better emulate human subjective judgment. It reports state-of-the-art results on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks while preserving the original LMM architecture, and introduces a unified OneAlign model across the three tasks via a shared syllabus.

Significance. If the performance gains are shown to arise specifically from the discrete-level syllabus rather than output-format compatibility, the approach would provide a simple, architecture-preserving method for aligning LMMs with human perceptual judgments, with potential impact on fine-tuning strategies for subjective visual assessment tasks.

major comments (1)
  1. [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.
minor comments (1)
  1. [Experiments] The manuscript should provide explicit details on dataset splits, full baseline implementations, and statistical significance tests to substantiate the SOTA claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on isolating the source of performance gains. We address the major comment point-by-point below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.

    Authors: We agree that a text-formatted numerical baseline would provide a cleaner control for output-format effects. In the current experiments the direct-score variants prompted the LMM for numerical regression, which is indeed less native to its text-token pretraining than discrete text labels. The core motivation remains that human raters in subjective studies use discrete text-defined levels rather than continuous numbers; this is why we adopted the syllabus. To address the referee's concern directly, we will add the suggested text-formatted numerical baseline (e.g., prompting for outputs such as 'score: 4') in the revised experiments and report the results alongside the existing comparisons. This addition will be included in the updated manuscript and supplementary material. · revision: yes
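For concreteness, the three training variants at issue differ only in the target the LMM is asked to emit. A hypothetical sketch of the three target formats; prompt wording and formatting are illustrative, not drawn from the paper.

```python
mos = 3.8  # continuous human score on a 1-5 scale

# (a) Discrete text level -- the paper's syllabus.
target_level = "The quality of the image is good."

# (b) Text-formatted number -- the referee's proposed control: still
#     plain next-token prediction over text tokens, but the target
#     stays numeric, isolating output-format effects.
target_text_number = f"score: {mos:.1f}"   # or coarser: "score: 4"

# (c) Direct score regression -- a scalar target, typically trained
#     with an L2 loss on a numeric readout rather than cross-entropy.
target_regression = mos
```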

Circularity Check

0 steps flagged

Empirical fine-tuning shows no circular derivation

full rationale

The paper is an empirical study on fine-tuning LMMs for visual scoring tasks using discrete text-defined levels. No mathematical derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. The method relies on standard next-token prediction training with released code and weights; performance claims are validated experimentally across IQA, IAA, and VQA benchmarks rather than through self-referential logic or fitted parameters renamed as predictions. Minor self-citations exist but are not load-bearing for the core syllabus or unification into OneAlign.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that discrete levels from human studies can be directly used as training targets for LMMs and that this yields better alignment than numerical scores.

axioms (1)
  • domain assumption: Human raters in subjective studies judge visual content using discrete text-defined levels.
    Stated as an observation from subjective studies that the method emulates.

pith-pipeline@v0.9.0 · 5549 in / 1167 out tokens · 65154 ms · 2026-05-15T16:31:49.640259+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore Jcost_pos_of_ne_one · unclear

    Relation between the paper passage and the cited Recognition theorem:

    Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores.

  • Foundation.PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem:

    The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure.

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    With the syllabus, we further unify the three tasks into one model, termed the OneAlign.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  3. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  4. Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

    cs.CV 2026-05 unverdicted novelty 7.0

    FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlation with human preferences.

  5. GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

    cs.CV 2026-05 unverdicted novelty 7.0

    GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.

  6. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  7. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

  8. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  9. Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

    cs.CV 2026-05 unverdicted novelty 6.0

    FuScore trains an MLLM to produce continuous IVIF quality scores supervised by per-image soft labels and Thurstone fidelity terms, reaching state-of-the-art correlation with human preferences.

  10. Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

    cs.CV 2026-05 unverdicted novelty 6.0

    RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.

  11. You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

    cs.CV 2026-04 unverdicted novelty 6.0

    YOGO reformulates stochastic 3D Gaussian Splatting into a deterministic budget-aware system and supplies an ultra-dense dataset to enforce physical fidelity over viewpoint interpolation.

  12. Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

    cs.CV 2026-04 unverdicted novelty 6.0

    DS-IEQA jointly learns evaluation criteria via feedback-driven prompt optimization and continuous score modeling via token-decoupled distance regression, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2...

  13. Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

  14. On the Global Photometric Alignment for Low-Level Vision

    cs.CV 2026-04 unverdicted novelty 6.0

    PAL uses closed-form affine color alignment on prediction-target pairs to discount global photometric discrepancies from the supervision signal, improving restoration across low-level vision tasks.

  15. LumiVideo: An Intelligent Agentic System for Video Color Grading

    cs.CV 2026-04 unverdicted novelty 6.0

    LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

  16. LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution

    cs.CV 2026-03 unverdicted novelty 6.0

    LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.

  17. HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

    cs.CV 2026-03 unverdicted novelty 6.0

    HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.

  18. Embody4D: A Generalist 4D World Model for Embodied AI

    cs.CV 2026-05 unverdicted novelty 5.0

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  19. FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs

    cs.CV 2026-04 unverdicted novelty 5.0

    FDIM is a new hybrid feature-distance video quality metric trained on over 16k sequences that shows strong generalization and correlation with human judgments across ten unseen SDR/HDR datasets and diverse codecs.

  20. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual refinement.

  21. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

Reference graph

Works this paper leans on

293 extracted references · 293 canonical work pages · cited by 20 Pith papers · 9 internal anchors
