pith. machine review for the scientific record.

arxiv: 2412.21059 · v4 · submitted 2024-12-30 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords human preference learning · reward model · image generation · video generation · preference optimization · hierarchical assessment · interpretability · multi-dimensional evaluation
0 comments

The pith

VisionReward learns fine-grained human preferences for image and video generation through hierarchical assessment and linear weighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisionReward as a framework to align visual generative models with human preferences by decomposing evaluations into multiple dimensions. Existing reward models often act as black boxes that produce scores without clear explanations and can introduce biases during optimization. VisionReward addresses this by applying a hierarchical visual assessment to break down preferences and then combining scores via linear weighting for interpretability. When used in preference optimization, a multi-dimensional consistent strategy keeps the process reliable across dimensions. Experiments demonstrate gains in preference prediction accuracy and higher win rates for generated images and videos compared to prior approaches.

Core claim

VisionReward employs a hierarchical visual assessment framework to capture fine-grained human preferences across multiple dimensions and uses linear weighting to combine them into an interpretable score. When applied as a reward model in preference optimization for visual generation, a multi-dimensional consistent strategy maintains alignment across dimensions. This results in superior performance on preference prediction accuracy and higher win rates for generated content compared to prior reward models.

What carries the argument

Hierarchical visual assessment framework with linear weighting that decomposes preferences into dimensions and aggregates them into interpretable scores.
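To make the aggregation step concrete, here is a minimal sketch of the idea stated above, assuming hypothetical dimension names and weight values (the paper's actual checklist questions and fitted weights are not reproduced here): per-dimension scores from the hierarchical assessment are combined by a fitted linear weight vector, so each dimension's contribution to the final score is directly attributable.

```python
# Minimal sketch of linear-weighted aggregation over hierarchical dimension
# scores. Dimension names and numeric values are illustrative assumptions,
# not the paper's actual checklist or fitted weights.

def aggregate_preference(dim_scores: dict, weights: dict) -> float:
    """Interpretable overall score: a weighted sum over assessment dimensions."""
    return sum(weights[d] * dim_scores[d] for d in weights)

# Hypothetical per-dimension scores for one generated sample.
dim_scores = {"alignment": 0.8, "quality": 0.6, "consistency": 0.9, "dynamics": 0.4}
# Hypothetical linear weights, assumed to be fitted to human preference data.
weights = {"alignment": 0.35, "quality": 0.25, "consistency": 0.25, "dynamics": 0.15}

overall = aggregate_preference(dim_scores, weights)
per_dim = {d: weights[d] * dim_scores[d] for d in weights}
print(overall)   # single interpretable score
print(per_dim)   # each term attributes part of the score to one dimension
```

The attribution dictionary is what the interpretability claim amounts to in practice: a low overall score can be traced back to the dimension terms that pulled it down.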

Load-bearing premise

The hierarchical visual assessment framework combined with linear weighting accurately captures fine-grained human preferences across dimensions without introducing unexpected biases or inconsistencies.

What would settle it

A direct comparison of human preference ratings on image and video outputs from models optimized with VisionReward versus prior reward models, checking whether the predicted multi-dimensional scores match the actual human choices without dimension-specific inconsistencies.
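A minimal sketch of what such a check could look like, under an assumed data layout (paired outputs with per-dimension model scores and one human choice per pair): it measures how often the model, overall and within each dimension, ranks the human-preferred output higher.

```python
import numpy as np

def pairwise_accuracy(score_a, score_b, human_prefers_a):
    """Fraction of pairs where the model ranks the human-preferred output higher."""
    model_prefers_a = np.asarray(score_a) > np.asarray(score_b)
    return float(np.mean(model_prefers_a == np.asarray(human_prefers_a)))

def per_dimension_accuracy(dim_scores_a, dim_scores_b, human_prefers_a):
    """Same check per dimension; a dimension far below the overall accuracy
    would flag a dimension-specific inconsistency."""
    return {d: pairwise_accuracy(dim_scores_a[d], dim_scores_b[d], human_prefers_a)
            for d in dim_scores_a}
```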

read the original abstract

Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and potentially resultant unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VisionReward, a framework for fine-grained multi-dimensional human preference learning for image and video generation. It employs a hierarchical visual assessment to capture preferences across dimensions such as quality, consistency, and aesthetics, combined with linear weighting for interpretability. A multi-dimensional consistent strategy is proposed for using the model as a reward in preference optimization. Experiments report that VisionReward outperforms baselines, surpassing VideoScore by 17.2% in preference prediction accuracy and yielding a 31.6% higher pairwise win rate for text-to-video models trained with it versus VideoScore.

Significance. If the central claims hold, VisionReward provides an interpretable alternative to black-box reward models for aligning visual generative models with human preferences. The reported gains on machine metrics and human evaluations, together with the open release of code and datasets, would support practical improvements in image and video synthesis and enable follow-up research on preference optimization.

major comments (2)
  1. [§3.3] Linear Weighting: The framework combines hierarchical dimension scores via fixed linear weights, but no ablation or analysis tests whether human preferences exhibit non-additive interactions (e.g., motion artifacts amplifying quality penalties). This assumption is load-bearing for the 17.2% accuracy and 31.6% win-rate claims, as violations could produce inflated alignment on the chosen test distributions without generalizing.
  2. [§5.1, Table 3] The preference prediction accuracy comparison to VideoScore reports a 17.2% lift, yet the evaluation uses the same linear aggregator for both training and testing; an ablation replacing the linear step with a non-linear aggregator (e.g., small MLP) is absent, leaving open whether the gain stems from the hierarchical assessment or from the specific additive construction.
minor comments (2)
  1. [§2] Related Work: The discussion of prior video reward models could explicitly contrast the proposed multi-dimensional consistency strategy with earlier single-score approaches to clarify the novelty.
  2. [Figure 1] The hierarchical assessment diagram would benefit from explicit labels on the dimension-specific heads and the final linear combination step for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on VisionReward. We appreciate the focus on the linear weighting assumption and its implications for the reported gains. We address each major comment below and will incorporate the suggested analyses in the revised manuscript to strengthen the claims.

read point-by-point responses
  1. Referee: [§3.3] Linear Weighting: The framework combines hierarchical dimension scores via fixed linear weights, but no ablation or analysis tests whether human preferences exhibit non-additive interactions (e.g., motion artifacts amplifying quality penalties). This assumption is load-bearing for the 17.2% accuracy and 31.6% win-rate claims, as violations could produce inflated alignment on the chosen test distributions without generalizing.

    Authors: We selected linear weighting to prioritize interpretability, enabling users to directly attribute the final preference score to specific dimensions such as quality or consistency, as emphasized in the manuscript. While non-additive interactions between dimensions may exist in human judgments, the hierarchical assessment already decomposes preferences into fine-grained components, and linear aggregation serves as an effective and transparent approximation supported by our empirical results. To rigorously test this, we will add an ablation study in the revision that compares the linear aggregator against a non-linear alternative (e.g., a small MLP) on the same hierarchical scores, evaluating impacts on both prediction accuracy and downstream win rates. revision: yes

  2. Referee: [§5.1, Table 3] The preference prediction accuracy comparison to VideoScore reports a 17.2% lift, yet the evaluation uses the same linear aggregator for both training and testing; an ablation replacing the linear step with a non-linear aggregator (e.g., small MLP) is absent, leaving open whether the gain stems from the hierarchical assessment or from the specific additive construction.

    Authors: The 17.2% accuracy improvement and 31.6% win-rate gains arise primarily from the hierarchical multi-dimensional assessment framework rather than the aggregator alone, as VideoScore employs a different architecture without our dimension decomposition. The linear aggregator is applied consistently for fair comparison and to maintain interpretability. We acknowledge the value of isolating this factor and will include a new ablation in the revised version: training and evaluating our hierarchical scores with a non-linear aggregator (small MLP) to quantify whether the performance lift persists independently of the linear construction. revision: yes
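A hedged sketch of the kind of ablation promised above, on synthetic stand-in data rather than the authors' hierarchical scores: hold the per-dimension score differences fixed and swap only the aggregator, comparing a linear (logistic) combiner against a small MLP on held-out pairwise accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for per-dimension score differences (A minus B) and the
# human-chosen winner; a real run would use the model's hierarchical scores.
rng = np.random.default_rng(0)
n_pairs, n_dims = 2000, 8
delta = rng.normal(size=(n_pairs, n_dims))
labels = (delta @ rng.normal(size=n_dims) + 0.3 * rng.normal(size=n_pairs) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(delta, labels, test_size=0.3, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)          # additive aggregator
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)    # non-additive aggregator

print("linear pairwise accuracy:", linear.score(X_te, y_te))
print("MLP pairwise accuracy:   ", mlp.score(X_te, y_te))
# A persistent gap in favor of the MLP on real preference data would indicate
# non-additive interactions that linear weighting cannot capture.
```

On real preference data the interesting outcome is the size of the gap rather than which side wins: a negligible gap would support the additivity premise questioned in the referee's first major comment.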

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external human annotations and benchmarks

full rationale

The paper trains VisionReward on human preference data using a hierarchical assessment plus linear weighting, then evaluates preference prediction accuracy and win rates on held-out test sets and public benchmarks. No equations or steps reduce the reported gains (17.2% accuracy, 31.6% win rate) to the training inputs by construction. Linear weighting is a standard fitted aggregator validated externally rather than a self-definitional or self-citation load-bearing step. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a circular manner. The central claims remain falsifiable against independent human feedback.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the structure of human visual preferences and the sufficiency of linear combinations; no new physical entities are introduced, but weights are expected to be fitted to preference data.

free parameters (1)
  • dimension weights
    Linear coefficients used to combine hierarchical assessment scores into an overall preference value; these are fitted to human feedback data.
axioms (2)
  • domain assumption Human visual preferences can be decomposed into a hierarchical set of fine-grained assessments
    Invoked as the basis for the assessment framework in the abstract.
  • domain assumption Linear weighting of dimension scores yields interpretable and consistent preference learning
    Stated as enabling interpretability and used in the multi-dimensional strategy.

pith-pipeline@v0.9.0 · 5571 in / 1301 out tokens · 28049 ms · 2026-05-16T11:45:06.312296+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  2. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  3. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  4. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  5. DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.

  6. Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    cs.CV 2026-01 unverdicted novelty 7.0

    LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...

  7. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  8. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  9. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  10. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  11. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  12. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    cs.CV 2025-10 conditional novelty 6.0

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  13. DanceGRPO: Unleashing GRPO on Visual Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  14. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  15. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  16. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  17. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

  18. Reward-Forcing: Autoregressive Video Generation with Reward Feedback

    cs.CV 2026-01 unverdicted novelty 5.0

    Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.

  19. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1] Zero-shot text-to-image generation. International Conference on Machine Learning, 2021.
  2. [2] CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems.
  3. [3] Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems.
  4. [4] High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  5. [5] SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
  6. [6] Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf.
  7. [7] CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
  8. [8] Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  9. [9] Phenaki: Variable length video generation from open domain textual descriptions. International Conference on Learning Representations.
  10. [10] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024.
  11. [11] VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  12. [12] Video generation models as world simulators. 2024.
  13. [13] CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
  14. [14] Panda-70M: Captioning 70M videos with multiple cross-modality teachers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  15. [15] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
  16. [16] WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  17. [17] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  18. [18] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
  19. [19] GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems.
  20. [20] LAION-5B: An open large-scale dataset for training next generation image-text models. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  21. [21] ImageReward: Learning and evaluating human preferences for text-to-image generation. Proceedings of the 37th International Conference on Neural Information Processing Systems.
  22. [22] Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems.
  23. [23] Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.
  24. [24] Learning multi-dimensional human preference for text-to-image generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  25. [25] DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Proceedings of the 37th International Conference on Neural Information Processing Systems.
  26. [26] Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
  27. [27] Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400.
  28. [28] Deep reward supervisions for tuning text-to-image diffusion models. arXiv preprint arXiv:2405.00760.
  29. [29] Diffusion model alignment using direct preference optimization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  30. [30] VBench: Comprehensive benchmark suite for video generative models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  31. [31] VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
  32. [32] InstructVideo: Instructing video diffusion models with human feedback. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  33. [33] Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737.
  34. [34] GPT-4V(ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361.
  35. [35] Evaluating text-to-visual generation with image-to-text generation. European Conference on Computer Vision, 2025.
  36. [36] CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
  37. [37] CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500.
  38. [38] Visual instruction tuning.
  39. [39] LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895.
  40. [40] GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  41. [41] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  42. [42] Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2015.
  43. [43] Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
  44. [44] Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
  45. [45] Variational diffusion models. Advances in Neural Information Processing Systems.
  46. [46] Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems.
  47. [47] VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098.
  48. [48] GenAI Arena: An open evaluation platform for generative models. arXiv preprint arXiv:2406.04485.
  49. [49] Ties Matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  50. [50] ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. 2024.
  51. [51] Lin, Chin-Yew. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  52. [52] Wu, Hao; Mao, Jiayuan; Zhang, Yufeng; Jiang, Yuning; Li, Lei; Sun, Weiwei; Ma, Wei-Ying. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.
  53. [53] UniFL: Improve Stable Diffusion via unified feedback learning. 2024.
  54. [54] Margin-aware preference optimization for aligning diffusion models without reference. arXiv preprint arXiv:2406.06424.
  55. [55] MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation? arXiv preprint arXiv:2407.04842.
  56. [56] AGIQA-3K: An open database for AI-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  57. [57] Rich human feedback for text-to-image generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  58. [58] Visual aesthetics and human preference. Annual Review of Psychology, 2013.
  59. [59] Image feature types and their predictions of aesthetic preference and naturalness. Frontiers in Psychology, 2017.