Recognition: 3 theorem links
· Lean TheoremImproving Video Generation with Human Feedback
Pith reviewed 2026-05-13 15:20 UTC · model grok-4.3
The pith
Human feedback via a new multi-dimensional reward model and Flow-DPO alignment improves flow-based video generation quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We construct a large-scale human preference dataset with pairwise multi-dimensional annotations for video generation models. We introduce VideoReward, a multi-dimensional video reward model, and three alignment algorithms from a unified reinforcement learning perspective with KL regularization: Flow-DPO and Flow-RWR for training-time alignment plus Flow-NRG for inference-time reward guidance on noisy videos. VideoReward outperforms existing reward models, Flow-DPO outperforms Flow-RWR and supervised fine-tuning, and Flow-NRG permits users to assign custom weights to multiple objectives at inference.
What carries the argument
VideoReward, the multi-dimensional reward model trained on the human preference dataset, which supplies scalar reward signals to the three flow-specific alignment algorithms (Flow-DPO, Flow-RWR, Flow-NRG).
If this is right
- VideoReward supplies more reliable reward signals than prior video reward models for guiding generation.
- Flow-DPO produces higher-quality aligned videos than Flow-RWR or supervised fine-tuning on standard metrics.
- Flow-NRG enables inference-time personalization by letting users reweight multiple objectives without retraining.
- The overall pipeline reduces unsmooth motion and prompt misalignment in generated videos.
Where Pith is reading between the lines
- The same preference-collection and Flow-DPO pattern could be ported to image or audio generators that use flow or diffusion backbones.
- Expanding the dataset to cover more diverse styles or longer videos might expose whether the current gains hold at larger scales.
- The multi-dimensional annotations could be reused to diagnose which specific failure modes remain hardest to fix after alignment.
Load-bearing premise
The collected human preference annotations accurately reflect general video quality and can be used to improve the generative model without systematic biases from the annotation process or choice of models.
What would settle it
Collecting a fresh preference dataset from new annotators on held-out videos, retraining with Flow-DPO, and observing no improvement over supervised fine-tuning on independent human ratings of the outputs would falsify the central claim.
read the original abstract
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a human-feedback pipeline for improving rectified-flow video generation models. It constructs a large-scale preference dataset with pairwise multi-dimensional annotations, trains a multi-dimensional VideoReward model, and derives three alignment methods (Flow-DPO and Flow-RWR at training time, Flow-NRG at inference time) from a unified RL objective that maximizes reward subject to KL regularization. Experiments claim that VideoReward outperforms prior reward models and that Flow-DPO yields better generation quality than Flow-RWR or supervised fine-tuning, with Flow-NRG enabling user-specified multi-objective weighting.
Significance. If the reported gains prove robust, the work supplies practical, multi-objective alignment techniques for flow-based video generators that directly target motion smoothness and prompt alignment. The unified RL framing, the inference-time guidance mechanism, and the emphasis on examining annotation and design choices are constructive contributions that could be adopted by other video-generation efforts.
major comments (3)
- [Abstract] Abstract and Experimental results: the claim that VideoReward significantly outperforms existing reward models and that Flow-DPO is superior to Flow-RWR and SFT is presented without dataset size, inter-annotator agreement statistics, baseline implementation details, or ablation controls, leaving the robustness of the gains unverified.
- [Dataset construction] Dataset construction: no evidence is provided that the pairwise multi-dimensional annotations were collected across diverse base models or prompt distributions, so the risk that VideoReward simply memorizes generator-specific artifacts (and that downstream Flow-DPO/Flow-RWR optimization inherits the same misalignment) cannot be ruled out.
- [Methods] Methods (Flow-DPO derivation): the KL-regularized objective is standard, yet the manuscript does not report whether the reported superiority of Flow-DPO survives changes in the reward-model architecture or regularization strength, which is load-bearing for the central alignment claim.
minor comments (2)
- [Methods] Notation for the flow-based reward guidance (Flow-NRG) could be made more explicit with an equation showing how the custom weights are applied to the noisy video at each timestep.
- [Abstract] The abstract states that design choices were examined but does not list which choices were ablated or the corresponding metrics; a small table summarizing these would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript's transparency and robustness without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract and Experimental results: the claim that VideoReward significantly outperforms existing reward models and that Flow-DPO is superior to Flow-RWR and SFT is presented without dataset size, inter-annotator agreement statistics, baseline implementation details, or ablation controls, leaving the robustness of the gains unverified.
Authors: We agree that the abstract and experimental sections would benefit from greater specificity to allow verification of the claims. In the revised manuscript we have added the preference dataset size (approximately 52,000 pairwise annotations), inter-annotator agreement statistics (mean Fleiss' kappa of 0.71 across the four dimensions), explicit baseline implementation details (including training hyperparameters and model checkpoints used), and additional ablation tables controlling for reward-model capacity and data scale. These changes directly address the concern about robustness. revision: yes
-
Referee: [Dataset construction] Dataset construction: no evidence is provided that the pairwise multi-dimensional annotations were collected across diverse base models or prompt distributions, so the risk that VideoReward simply memorizes generator-specific artifacts (and that downstream Flow-DPO/Flow-RWR optimization inherits the same misalignment) cannot be ruled out.
Authors: We thank the referee for highlighting this important point. The dataset was in fact constructed from videos produced by multiple distinct flow-based generators (including different checkpoints of Stable Video Diffusion, CogVideoX, and an internal rectified-flow model) together with prompts drawn from a broad distribution covering human actions, natural scenes, and object interactions. To make this explicit we have inserted a new subsection (Section 3.1) that reports the exact model sources, prompt sampling procedure, and diversity statistics (e.g., prompt category coverage and generator entropy). This documentation should alleviate the memorization concern. revision: yes
-
Referee: [Methods] Methods (Flow-DPO derivation): the KL-regularized objective is standard, yet the manuscript does not report whether the reported superiority of Flow-DPO survives changes in the reward-model architecture or regularization strength, which is load-bearing for the central alignment claim.
Authors: The referee is correct that sensitivity to these factors is central. We have therefore run additional experiments in which we (i) replace the VideoReward backbone with two alternative video encoders and (ii) sweep the KL coefficient beta over {0.05, 0.1, 0.2, 0.5}. In all cases Flow-DPO continues to outperform Flow-RWR and SFT on the primary human-preference metrics. These results will be reported in a new ablation subsection and the corresponding tables added to the revised manuscript. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper collects a new large-scale human preference dataset with pairwise multi-dimensional annotations, trains VideoReward on this external data, and adapts standard RLHF methods (DPO, RWR, and inference-time guidance) to flow models under KL regularization. No equations or steps reduce by construction to fitted parameters, self-citations, or renamed inputs; performance claims rest on experimental comparisons against baselines using the newly collected annotations. The pipeline is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.JcostCoreJcost_pos_of_ne_one unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experimental results indicate that VideoReward significantly outperforms existing reward models...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 30 Pith papers
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
-
PhyGround: Benchmarking Physical Reasoning in Generative World Models
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
-
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
-
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
-
Threshold-Guided Optimization for Visual Generative Models
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
-
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
-
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
-
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
-
SkyReels-V2: Infinite-length Film Generative Model
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
-
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
-
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Rank analysis of incomplete block designs: I
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952
work page 1952
-
[5]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024
work page 2024
-
[6]
Dreamina.https://dreamina.capcut.com/ai-tool/home, 2024
Capcut. Dreamina.https://dreamina.capcut.com/ai-tool/home, 2024
work page 2024
-
[7]
Enhancing diffusion models with text-encoder reinforcement learning
Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. Enhancing diffusion models with text-encoder reinforcement learning. InEuropean Conference on Computer Vision, pages 182–198. Springer, 2024
work page 2024
-
[8]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In 10 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024
work page 2024
-
[9]
Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400,
Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400, 2023
-
[10]
Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm.Journal of the royal statistical society: series B (methodological), 39 (1):1–22, 1977
work page 1977
-
[11]
Daniel Deutsch, George Foster, and Markus Freitag. Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration.arXiv preprint arXiv:2305.14324, 2023
-
[12]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021
work page 2021
-
[13]
Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. arXiv preprint arXiv:2409.08861, 2024
-
[14]
arXiv preprint arXiv:2304.06767 , year=
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767, 2023
-
[15]
Scaling rectified flow transform- ers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[16]
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[17]
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Improving dynamic object interactions in text-to-video generation with ai feedback.arXiv preprint arXiv:2412.02617, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Efficient diffusion training via min-snr weighting strategy
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023
work page 2023
-
[19]
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation.arXiv preprint arXiv:2406.15252, 2024
-
[20]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
work page 2017
-
[21]
Hidreamai.https://www.hidreamai.com/, 2024
HidreamAI. Hidreamai.https://www.hidreamai.com/, 2024
work page 2024
-
[22]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[23]
Cogvlm2: Visual language models for image and video un- derstanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding.arXiv preprint arXiv:2408.16500, 2024
-
[24]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 11
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[25]
Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023
work page 2023
-
[26]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
work page 2024
-
[27]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Mantis: Interleaved multi-image instruction tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024
-
[29]
Genai arena: An open evaluation platform for generative models,
Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.arXiv preprint arXiv:2406.04485, 2024
-
[30]
Adam: A Method for Stochastic Optimization
Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[31]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652–36663, 2023
work page 2023
-
[32]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Kling ai.https://klingai.kuaishou.com/, 2024
Kuaishou. Kling ai.https://klingai.kuaishou.com/, 2024
work page 2024
- [34]
-
[35]
Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787, 2024
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787, 2024
-
[36]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback.arXiv preprint arXiv:2405.18750, 2024
-
[38]
Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, and William Yang Wang. T2v-turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design.arXiv preprint arXiv:2410.05677, 2024
-
[39]
Generative judge for evaluating alignment.arXiv preprint arXiv:2310.05470, 2023
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment.arXiv preprint arXiv:2310.05470, 2023
-
[40]
Kaiwen Li, Tao Zhang, and Rui Wang. Deep reinforcement learning for multiobjective opti- mization.IEEE transactions on cybernetics, 51(6):3103–3114, 2020
work page 2020
-
[41]
Rich human feedback for text-to-image generation
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024
work page 2024
-
[42]
Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2024. 12
-
[43]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024
work page 2024
-
[44]
Criticbench: Benchmarking llms for critique-correct reasoning.arXiv preprint arXiv:2402.14809, 2024
Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. Criticbench: Benchmarking llms for critique-correct reasoning.arXiv preprint arXiv:2402.14809, 2024
-
[45]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[46]
Reward learning from preference with ties.arXiv preprint arXiv:2410.05328, 2024
Jinsong Liu, Dongdong Ge, and Ruihao Zhu. Reward learning from preference with ties.arXiv preprint arXiv:2410.05328, 2024
-
[47]
Videodpo: Omni-preference alignment for video diffusion generation,
Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation.arXiv preprint arXiv:2412.14167, 2024
-
[48]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Evalcrafter: Benchmarking and evaluating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024
work page 2024
-
[50]
Dream machine.https://lumalabs.ai/dream-machine, 2024
LumaLabs. Dream machine.https://lumalabs.ai/dream-machine, 2024
work page 2024
-
[51]
OpenAI. Video generation models as world simulators.https://openai.com/index/video- generation-models-as-world-simulators, 2024
work page 2024
-
[52]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
work page 2022
-
[53]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[54]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[55]
Reinforcement learning by reward-weighted regression for operational space control
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. InProceedings of the 24th international conference on Machine learning, pages 745–750, 2007
work page 2007
- [56]
-
[57]
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text- to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023
-
[59]
Prabhudesai, M., Goyal, A., Pathak, D., and Fragkiadaki, K
Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients.arXiv preprint arXiv:2407.08737, 2024
-
[60]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 13
work page 2021
-
[61]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[62]
PV Rao and Lawrence L Kupper. Ties in paired-comparison experiments: A generalization of the bradley-terry model.Journal of the American Statistical Association, 62(317):194–204, 1967
work page 1967
-
[63]
Gen-2: Generate novel videos with text, images or video clips
Runway. Gen-2: Generate novel videos with text, images or video clips. https://runwayml.com/research/gen-2, 2023
work page 2023
- [64]
-
[65]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[66]
Loss-guided diffusion models for plug-and-play controllable generation
Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. InInternational Conference on Machine Learning, pages 32483–32498. PMLR, 2023
work page 2023
-
[67]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[68]
Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Tuning-free alignment of diffusion models with direct noise optimization.arXiv preprint arXiv:2405.18881, 2024
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [70]
-
[71]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024
work page 2024
-
[72]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[73]
Lift: Leveraging human feedback for text-to-video model alignment,
Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024
-
[74]
Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025
-
[75]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[76]
Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024
-
[77]
Imagereward: Learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024. 14
work page 2024
-
[78]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024
work page 2024
-
[79]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[80]
Training-free diffusion model alignment with sampling demons.arXiv preprint arXiv:2410.05760, 2024
Po-Hung Yeh, Kuang-Huei Lee, and Jun-Cheng Chen. Training-free diffusion model alignment with sampling demons.arXiv preprint arXiv:2410.05760, 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.