pith. machine review for the scientific record.

arxiv: 2412.21059 · v4 · submitted 2024-12-30 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords human preference learning · reward model · image generation · video generation · preference optimization · hierarchical assessment · interpretability · multi-dimensional evaluation
0 comments

The pith

VisionReward learns fine-grained human preferences for image and video generation through hierarchical assessment and linear weighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisionReward as a framework to align visual generative models with human preferences by decomposing evaluations into multiple dimensions. Existing reward models often act as black boxes that produce scores without clear explanations and can introduce biases during optimization. VisionReward addresses this by applying a hierarchical visual assessment to break down preferences and then combining scores via linear weighting for interpretability. When used in preference optimization, a multi-dimensional consistent strategy keeps the process reliable across dimensions. Experiments demonstrate gains in preference prediction accuracy and higher win rates for generated images and videos compared to prior approaches.

Core claim

VisionReward employs a hierarchical visual assessment framework to capture fine-grained human preferences across multiple dimensions and uses linear weighting to combine them into an interpretable score. When applied as a reward model in preference optimization for visual generation, a multi-dimensional consistent strategy maintains alignment across dimensions. This results in superior performance on preference prediction accuracy and higher win rates for generated content compared to prior reward models.

What carries the argument

Hierarchical visual assessment framework with linear weighting that decomposes preferences into dimensions and aggregates them into interpretable scores.
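To make the aggregation step concrete, here is a minimal sketch of the idea stated above, assuming hypothetical dimension names and weight values (the paper's actual checklist questions and fitted weights are not reproduced here): per-dimension scores from the hierarchical assessment are combined by a fitted linear weight vector, so each dimension's contribution to the final score is directly attributable.

```python
# Minimal sketch of linear-weighted aggregation over hierarchical dimension
# scores. Dimension names and numeric values are illustrative assumptions,
# not the paper's actual checklist or fitted weights.

def aggregate_preference(dim_scores: dict, weights: dict) -> float:
    """Interpretable overall score: a weighted sum over assessment dimensions."""
    return sum(weights[d] * dim_scores[d] for d in weights)

# Hypothetical per-dimension scores for one generated sample.
dim_scores = {"alignment": 0.8, "quality": 0.6, "consistency": 0.9, "dynamics": 0.4}
# Hypothetical linear weights, assumed to be fitted to human preference data.
weights = {"alignment": 0.35, "quality": 0.25, "consistency": 0.25, "dynamics": 0.15}

overall = aggregate_preference(dim_scores, weights)
per_dim = {d: weights[d] * dim_scores[d] for d in weights}
print(overall)   # single interpretable score
print(per_dim)   # each term attributes part of the score to one dimension
```

The attribution dictionary is what the interpretability claim amounts to in practice: a low overall score can be traced back to the dimension terms that pulled it down.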

Load-bearing premise

The hierarchical visual assessment framework combined with linear weighting accurately captures fine-grained human preferences across dimensions without introducing unexpected biases or inconsistencies.

What would settle it

A direct comparison of human preference ratings on image and video outputs from models optimized with VisionReward versus prior reward models, checking whether the predicted multi-dimensional scores match the actual human choices without dimension-specific inconsistencies.
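A minimal sketch of what such a check could look like, under an assumed data layout (paired outputs with per-dimension model scores and one human choice per pair): it measures how often the model, overall and within each dimension, ranks the human-preferred output higher.

```python
import numpy as np

def pairwise_accuracy(score_a, score_b, human_prefers_a):
    """Fraction of pairs where the model ranks the human-preferred output higher."""
    model_prefers_a = np.asarray(score_a) > np.asarray(score_b)
    return float(np.mean(model_prefers_a == np.asarray(human_prefers_a)))

def per_dimension_accuracy(dim_scores_a, dim_scores_b, human_prefers_a):
    """Same check per dimension; a dimension far below the overall accuracy
    would flag a dimension-specific inconsistency."""
    return {d: pairwise_accuracy(dim_scores_a[d], dim_scores_b[d], human_prefers_a)
            for d in dim_scores_a}
```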

read the original abstract

Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and potentially resultant unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VisionReward, a framework for fine-grained multi-dimensional human preference learning for image and video generation. It employs a hierarchical visual assessment to capture preferences across dimensions such as quality, consistency, and aesthetics, combined with linear weighting for interpretability. A multi-dimensional consistent strategy is proposed for using the model as a reward in preference optimization. Experiments report that VisionReward outperforms baselines, surpassing VideoScore by 17.2% in preference prediction accuracy and yielding a 31.6% higher pairwise win rate for text-to-video models trained with it versus VideoScore.

Significance. If the central claims hold, VisionReward provides an interpretable alternative to black-box reward models for aligning visual generative models with human preferences. The reported gains on machine metrics and human evaluations, together with the open release of code and datasets, would support practical improvements in image and video synthesis and enable follow-up research on preference optimization.

major comments (2)
  1. [§3.3] Linear Weighting: The framework combines hierarchical dimension scores via fixed linear weights, but no ablation or analysis tests whether human preferences exhibit non-additive interactions (e.g., motion artifacts amplifying quality penalties). This assumption is load-bearing for the 17.2% accuracy and 31.6% win-rate claims, as violations could produce inflated alignment on the chosen test distributions without generalizing.
  2. [§5.1, Table 3] The preference prediction accuracy comparison to VideoScore reports a 17.2% lift, yet the evaluation uses the same linear aggregator for both training and testing; an ablation replacing the linear step with a non-linear aggregator (e.g., small MLP) is absent, leaving open whether the gain stems from the hierarchical assessment or from the specific additive construction.
minor comments (2)
  1. [§2] Related Work: The discussion of prior video reward models could explicitly contrast the proposed multi-dimensional consistency strategy with earlier single-score approaches to clarify the novelty.
  2. [Figure 1] The hierarchical assessment diagram would benefit from explicit labels on the dimension-specific heads and the final linear combination step for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on VisionReward. We appreciate the focus on the linear weighting assumption and its implications for the reported gains. We address each major comment below and will incorporate the suggested analyses in the revised manuscript to strengthen the claims.

read point-by-point responses
  1. Referee: [§3.3] Linear Weighting: The framework combines hierarchical dimension scores via fixed linear weights, but no ablation or analysis tests whether human preferences exhibit non-additive interactions (e.g., motion artifacts amplifying quality penalties). This assumption is load-bearing for the 17.2% accuracy and 31.6% win-rate claims, as violations could produce inflated alignment on the chosen test distributions without generalizing.

    Authors: We selected linear weighting to prioritize interpretability, enabling users to directly attribute the final preference score to specific dimensions such as quality or consistency, as emphasized in the manuscript. While non-additive interactions between dimensions may exist in human judgments, the hierarchical assessment already decomposes preferences into fine-grained components, and linear aggregation serves as an effective and transparent approximation supported by our empirical results. To rigorously test this, we will add an ablation study in the revision that compares the linear aggregator against a non-linear alternative (e.g., a small MLP) on the same hierarchical scores, evaluating impacts on both prediction accuracy and downstream win rates. revision: yes

  2. Referee: [§5.1, Table 3] The preference prediction accuracy comparison to VideoScore reports a 17.2% lift, yet the evaluation uses the same linear aggregator for both training and testing; an ablation replacing the linear step with a non-linear aggregator (e.g., small MLP) is absent, leaving open whether the gain stems from the hierarchical assessment or from the specific additive construction.

    Authors: The 17.2% accuracy improvement and 31.6% win-rate gains arise primarily from the hierarchical multi-dimensional assessment framework rather than the aggregator alone, as VideoScore employs a different architecture without our dimension decomposition. The linear aggregator is applied consistently for fair comparison and to maintain interpretability. We acknowledge the value of isolating this factor and will include a new ablation in the revised version: training and evaluating our hierarchical scores with a non-linear aggregator (small MLP) to quantify whether the performance lift persists independently of the linear construction. revision: yes
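A hedged sketch of the kind of ablation promised above, on synthetic stand-in data rather than the authors' hierarchical scores: hold the per-dimension score differences fixed and swap only the aggregator, comparing a linear (logistic) combiner against a small MLP on held-out pairwise accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for per-dimension score differences (A minus B) and the
# human-chosen winner; a real run would use the model's hierarchical scores.
rng = np.random.default_rng(0)
n_pairs, n_dims = 2000, 8
delta = rng.normal(size=(n_pairs, n_dims))
labels = (delta @ rng.normal(size=n_dims) + 0.3 * rng.normal(size=n_pairs) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(delta, labels, test_size=0.3, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)          # additive aggregator
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)    # non-additive aggregator

print("linear pairwise accuracy:", linear.score(X_te, y_te))
print("MLP pairwise accuracy:   ", mlp.score(X_te, y_te))
# A persistent gap in favor of the MLP on real preference data would indicate
# non-additive interactions that linear weighting cannot capture.
```

On real preference data the interesting outcome is the size of the gap rather than which side wins: a negligible gap would support the additivity premise questioned in the referee's first major comment.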

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external human annotations and benchmarks

full rationale

The paper trains VisionReward on human preference data using a hierarchical assessment plus linear weighting, then evaluates preference prediction accuracy and win rates on held-out test sets and public benchmarks. No equations or steps reduce the reported gains (17.2% accuracy, 31.6% win rate) to the training inputs by construction. Linear weighting is a standard fitted aggregator validated externally rather than a self-definitional or self-citation load-bearing step. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a circular manner. The central claims remain falsifiable against independent human feedback.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the structure of human visual preferences and the sufficiency of linear combinations; no new physical entities are introduced, but weights are expected to be fitted to preference data.

free parameters (1)
  • dimension weights
    Linear coefficients used to combine hierarchical assessment scores into an overall preference value; these are fitted to human feedback data.
axioms (2)
  • domain assumption Human visual preferences can be decomposed into a hierarchical set of fine-grained assessments
    Invoked as the basis for the assessment framework in the abstract.
  • domain assumption Linear weighting of dimension scores yields interpretable and consistent preference learning
    Stated as enabling interpretability and used in the multi-dimensional strategy.

pith-pipeline@v0.9.0 · 5571 in / 1301 out tokens · 28049 ms · 2026-05-16T11:45:06.312296+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  2. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  3. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  4. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  5. DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.

  6. Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    cs.CV 2026-01 unverdicted novelty 7.0

    LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...

  7. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  8. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  9. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  10. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  11. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  12. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    cs.CV 2025-10 conditional novelty 6.0

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  13. DanceGRPO: Unleashing GRPO on Visual Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  14. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  15. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  16. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  17. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

  18. Reward-Forcing: Autoregressive Video Generation with Reward Feedback

    cs.CV 2026-01 unverdicted novelty 5.0

    Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.

  19. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1] Zero-shot text-to-image generation. International Conference on Machine Learning, 2021.
  2. [2] CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems.
  3. [3] Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems.
  4. [4] High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  5. [5] SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
  6. [6] Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf.
  7. [7] CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
  8. [8] Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  9. [9] Phenaki: Variable length video generation from open domain textual descriptions. International Conference on Learning Representations.
  10. [10] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024.
  11. [11] VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  12. [12] Video generation models as world simulators. 2024.
  13. [13] CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
  14. [14] Panda-70M: Captioning 70M videos with multiple cross-modality teachers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  15. [15] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
  16. [16] WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  17. [17] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  18. [18] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
  19. [19] GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems.
  20. [20] LAION-5B: An open large-scale dataset for training next generation image-text models. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  21. [21] ImageReward: Learning and evaluating human preferences for text-to-image generation. Proceedings of the 37th International Conference on Neural Information Processing Systems.
  22. [22] Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems.
  23. [23] Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.
  24. [24] Learning multi-dimensional human preference for text-to-image generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  25. [25] DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Proceedings of the 37th International Conference on Neural Information Processing Systems.
  26. [26] Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
  27. [27] Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400.
  28. [28] Deep reward supervisions for tuning text-to-image diffusion models. arXiv preprint arXiv:2405.00760.
  29. [29] Diffusion model alignment using direct preference optimization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  30. [30] VBench: Comprehensive benchmark suite for video generative models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  31. [31] VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
  32. [32] InstructVideo: Instructing video diffusion models with human feedback. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  33. [33] Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737.
  34. [34] GPT-4V(ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361.
  35. [35] Evaluating text-to-visual generation with image-to-text generation. European Conference on Computer Vision, 2025.
  36. [36] CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
  37. [37] CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500.
  38. [38] Visual instruction tuning.
  39. [39] LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895.
  40. [40] GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  41. [41] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  42. [42] Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2015.
  43. [43] Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
  44. [44] Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
  45. [45] Variational diffusion models. Advances in Neural Information Processing Systems.
  46. [46] Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems.
  47. [47] VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098.
  48. [48] GenAI Arena: An open evaluation platform for generative models. arXiv preprint arXiv:2406.04485.
  49. [49] Ties Matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  50. [50] ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. 2024.
  51. [51] Lin, Chin-Yew. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  52. [52] Wu, Hao; Mao, Jiayuan; Zhang, Yufeng; Jiang, Yuning; Li, Lei; Sun, Weiwei; Ma, Wei-Ying. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.
  53. [53] UniFL: Improve Stable Diffusion via unified feedback learning. 2024.
  54. [54] Margin-aware preference optimization for aligning diffusion models without reference. arXiv preprint arXiv:2406.06424.
  55. [55] MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation? arXiv preprint arXiv:2407.04842.
  56. [56] AGIQA-3K: An open database for AI-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  57. [57] Rich human feedback for text-to-image generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  58. [58] Visual aesthetics and human preference. Annual Review of Psychology, 2013.
  59. [59] Image feature types and their predictions of aesthetic preference and naturalness. Frontiers in Psychology, 2017.