Recognition: 3 Lean theorem links
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Pith reviewed 2026-05-15 16:31 UTC · model grok-4.3
The pith
LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Observing that human raters learn and judge discrete text-defined levels in subjective studies, the work teaches LMMs to output these levels for visual rating. The resulting Q-Align model reaches state-of-the-art performance on image quality assessment, image aesthetic assessment, and video quality assessment tasks under the original LMM structure. A syllabus built from the same discrete levels further unifies the three tasks into a single model termed OneAlign.
What carries the argument
The discrete text-defined rating levels syllabus that replaces numerical score regression to train LMMs for visual assessment.
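As a concrete illustration of what such a syllabus could look like, the sketch below bins continuous mean opinion scores (MOS) into five text levels. The level names and the equidistant binning are assumptions for this sketch, modeled on the standard ITU five-grade scale, not details confirmed by this summary.

```python
# Hypothetical sketch of a discrete-level syllabus: continuous mean
# opinion scores (MOS) are binned into five ITU-style text levels,
# which the LMM then learns to emit as ordinary text tokens.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def mos_to_level(mos: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a MOS in [lo, hi] to one of five equidistant text levels."""
    span = (hi - lo) / len(LEVELS)          # width of each level bin
    idx = min(int((mos - lo) / span), len(LEVELS) - 1)
    return LEVELS[idx]

# A training target is then just natural language, e.g.
# "The quality of the image is " + mos_to_level(4.6)
```

Training on such targets reuses the LMM's ordinary next-token objective, which is the sense in which the original architecture is left unchanged.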
If this is right
- The discrete-level syllabus outperforms direct-score training on IQA, IAA, and VQA benchmarks.
- Three separate visual assessment tasks can be unified into one model without architectural changes.
- State-of-the-art results are obtained while keeping the original LMM structure and data requirements unchanged.
- The same training approach extends across image and video content types.
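Benchmark comparison against numerical labels still requires a continuous prediction from a discrete-level model. One natural readout, sketched here under assumptions since this summary does not spell out the inference rule, takes the probability-weighted average of the level ranks:

```python
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def levels_to_score(probs: dict[str, float]) -> float:
    """Expected score in [1, 5] from a distribution over the five levels.

    `probs` maps each level token to its (possibly unnormalized)
    probability, e.g. softmax weights over the LMM's level-token logits.
    """
    total = sum(probs.get(lv, 0.0) for lv in LEVELS)
    return sum((i + 1) * probs.get(lv, 0.0)
               for i, lv in enumerate(LEVELS)) / total

# levels_to_score({"good": 0.5, "excellent": 0.5}) -> 4.5
```

This readout is what makes the discrete syllabus directly comparable to score-regression baselines on the same benchmarks.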
Where Pith is reading between the lines
- Similar discrete-level training could apply to other subjective rating domains where humans use categorical language.
- Different choices of text phrasing for the levels might further improve alignment with specific human populations.
- The method suggests that discrete outputs could reduce calibration issues in other LMM alignment tasks.
Load-bearing premise
Human subjective judgment in visual scoring relies primarily on discrete text-defined levels rather than continuous numerical values.
What would settle it
If an LMM trained with direct numerical score regression matches or exceeds the discrete-level version on the same image quality, aesthetic, and video quality benchmarks, the claimed advantage would not hold.
Original abstract
The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Q-Align, which trains large multi-modality models (LMMs) for visual scoring by using discrete text-defined rating levels (e.g., 'excellent', 'good') rather than numerical scores to better emulate human subjective judgment. It reports state-of-the-art results on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks while preserving the original LMM architecture, and introduces a unified OneAlign model across the three tasks via a shared syllabus.
Significance. If the performance gains are shown to arise specifically from the discrete-level syllabus rather than output-format compatibility, the approach would provide a simple, architecture-preserving method for aligning LMMs with human perceptual judgments, with potential impact on fine-tuning strategies for subjective visual assessment tasks.
Major comments (1)
- [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.
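The control asked for here can be made concrete. In this hypothetical sketch (the function and format names are illustrative, not taken from the paper), the three training-target formats differ only in how the same MOS is verbalized:

```python
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def make_target(mos: float, fmt: str) -> str:
    """Render one MOS (on a 1-5 scale) in each candidate output format."""
    if fmt == "level":        # discrete text level (the Q-Align syllabus)
        return LEVELS[min(max(round(mos) - 1, 0), 4)]
    if fmt == "raw":          # direct-score regression target
        return f"{mos:.2f}"
    if fmt == "text_number":  # proposed control: numeric, but text-native
        return f"score: {round(mos)}"
    raise ValueError(f"unknown format: {fmt}")
```

Comparing models trained on "level" versus "text_number" targets would separate the effect of discreteness from the effect of emitting familiar text tokens.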
Minor comments (1)
- [Experiments] The manuscript should provide explicit details on dataset splits, full baseline implementations, and statistical significance tests to substantiate the SOTA claims.
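The SOTA claims at issue rest on rank correlation between predicted and human scores. A minimal pure-Python Spearman correlation (SRCC), assuming no tied values for brevity, shows the quantity at stake:

```python
def srcc(pred: list[float], mos: list[float]) -> float:
    """Spearman rank correlation between predictions and MOS (no ties)."""
    def ranks(xs: list[float]) -> list[int]:
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, rm = ranks(pred), ranks(mos)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rm))
    return 1 - 6 * d2 / (n * (n * n - 1))

# srcc is 1.0 for any monotone-increasing prediction, -1.0 when reversed
```

Reporting SRCC on fixed, documented splits with significance tests is the standard way to substantiate such claims in IQA/VQA work.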
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on isolating the source of performance gains. We address the major comment point-by-point below and commit to strengthening the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The reported advantage of the discrete-level-based syllabus over direct-score-based variants is central to the contribution, yet the comparison does not isolate whether gains stem from the use of discrete text levels or from the fact that text-token output is native to LMM pretraining. No experiment tests a text-formatted numerical baseline (e.g., outputting 'score: 4') that would control for output-format mismatch while keeping the regression target continuous.
Authors: We agree that a text-formatted numerical baseline would provide a cleaner control for output-format effects. In the current experiments, the direct-score variants prompted the LMM for numerical regression, which is indeed less native to its text-token pretraining than discrete text labels. The core motivation remains that human raters in subjective studies use discrete text-defined levels rather than continuous numbers; this is why we adopted the syllabus. To address the referee's concern directly, we will add the suggested text-formatted numerical baseline (e.g., prompting for outputs such as 'score: 4') and report the results alongside the existing comparisons. This addition will be included in the revised manuscript and supplementary material.
Revision: yes
Circularity Check
Empirical fine-tuning shows no circular derivation
Full rationale
The paper is an empirical study on fine-tuning LMMs for visual scoring tasks using discrete text-defined levels. No mathematical derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. The method relies on standard next-token prediction training with released code and weights; performance claims are validated experimentally across IQA, IAA, and VQA benchmarks rather than through self-referential logic or fitted parameters renamed as predictions. Minor self-citations exist but are not load-bearing for the core syllabus or unification into OneAlign.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human raters in subjective studies judge visual content using discrete text-defined levels.
Lean theorems connected to this paper
- Cost.JcostCore.Jcost_pos_of_ne_one [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores."
- Foundation.PhiForcing.phi_equation [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure."
- Foundation.DimensionForcing.dimension_forced [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "With the syllabus, we further unify the three tasks into one model, termed the OneAlign."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
  GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
- EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
  EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
- Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
  FuScore uses MLLMs to output continuous quality scores for IVIF images, constructs per-image soft labels from four sub-dimensions, and applies a tripartite objective with Thurstone fidelity to achieve higher correlati...
- GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment
  GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.
- Personalizing Text-to-Image Generation to Individual Taste
  PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
- GeoR-Bench: Evaluating Geoscience Visual Reasoning
  GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
- ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
  ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
- Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
  FuScore trains an MLLM to produce continuous IVIF quality scores supervised by per-image soft labels and Thurstone fidelity terms, reaching state-of-the-art correlation with human preferences.
- Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
  RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.
- You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes
  YOGO reformulates stochastic 3D Gaussian Splatting into a deterministic budget-aware system and supplies an ultra-dense dataset to enforce physical fidelity over viewpoint interpolation.
- Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment
  DS-IEQA jointly learns evaluation criteria via feedback-driven prompt optimization and continuous score modeling via token-decoupled distance regression, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2...
- Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
  Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
- On the Global Photometric Alignment for Low-Level Vision
  PAL uses closed-form affine color alignment on prediction-target pairs to discount global photometric discrepancies from the supervision signal, improving restoration across low-level vision tasks.
- LumiVideo: An Intelligent Agentic System for Video Color Grading
  LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.
- LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution
  LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.
- HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
  HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.
- Embody4D: A Generalist 4D World Model for Embodied AI
  Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
- FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs
  FDIM is a new hybrid feature-distance video quality metric trained on over 16k sequences that shows strong generalization and correlation with human judgments across ten unseen SDR/HDR datasets and diverse codecs.
- Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
  Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
  HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...