Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng · 2025 · cs.CV · arXiv 2501.13918

Canonical reference. 73% of citing Pith papers cite this work as background.

37 Pith papers citing it

abstract

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
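The abstract frames all three alignment algorithms as instances of maximizing reward under KL regularization. As a rough illustration of that objective only, the sketch below works on a toy discrete distribution rather than on a flow-based video model; it is not the paper's Flow-DPO, Flow-RWR, or Flow-NRG. The function names, the three-outcome reference distribution, the reward values, and `beta` are all hypothetical. It relies on the standard result that the maximizer of E_p[r] - beta * KL(p || ref) is the reference distribution reweighted by exp(r / beta).

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_p[r] - beta * KL(p || ref) over discrete p.

    Assumes every reference probability is strictly positive and beta > 0.
    """
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(probs, ref_probs, rewards, beta):
    """Expected reward minus the beta-weighted KL divergence from the reference."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy setup: three candidate outputs with made-up rewards.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

tilted = kl_regularized_optimum(ref, rewards, beta)
print("tilted distribution:", [round(p, 3) for p in tilted])
print("objective at reference:", round(objective(ref, ref, rewards, beta), 3))
print("objective at optimum:", round(objective(tilted, ref, rewards, beta), 3))
```

In this toy setting a smaller `beta` shifts more probability mass toward high-reward outcomes, while a larger `beta` keeps the result closer to the reference distribution.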

citation-role summary

background: 16 · baseline: 4 · method: 2

citation-polarity summary

background: 16 · baseline: 4 · use method: 2

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.

PhyGround: Benchmarking Physical Reasoning in Generative World Models

cs.CV · 2026-05-11 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

cs.AI · 2025-07-29 · unverdicted · novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

Unified Reward Model for Multimodal Understanding and Generation

cs.CV · 2025-03-07 · unverdicted · novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

cs.CV · 2026-05-14 · unverdicted · novelty 6.0 · 3 refs

Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity objective.

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

Threshold-Guided Optimization for Visual Generative Models

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

Stream-T1: Test-Time Scaling for Streaming Video Generation

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.

Leveraging Verifier-Based Reinforcement Learning in Image Editing

cs.CV · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.

HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt benchmark.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

citing papers explorer

Showing 37 of 37 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV · 2025-05-08 · unverdicted · none · ref 14 · internal anchor
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
cs.CV · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
cs.CV · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV · 2026-05-14 · conditional · none · ref 24 · internal anchor
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.
PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV · 2026-05-11 · accept · none · ref 23 · internal anchor
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI · 2026-05-09 · unverdicted · none · ref 16 · internal anchor
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV · 2026-05-05 · unverdicted · none · ref 22 · internal anchor
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV · 2026-04-26 · unverdicted · none · ref 19 · internal anchor
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
cs.CV · 2026-04-21 · unverdicted · none · ref 26 · internal anchor
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
cs.CV · 2026-03-01 · unverdicted · none · ref 38 · internal anchor
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
cs.CV · 2026-02-11 · unverdicted · none · ref 20 · internal anchor
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI · 2025-07-29 · unverdicted · none · ref 19 · internal anchor
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
Unified Reward Model for Multimodal Understanding and Generation
cs.CV · 2025-03-07 · unverdicted · none · ref 7 · internal anchor
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
cs.CV · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
cs.CV · 2026-05-14 · unverdicted · none · ref 35 · 3 links · internal anchor
Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity objective.
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
cs.CV · 2026-05-12 · unverdicted · none · ref 26 · internal anchor
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
Threshold-Guided Optimization for Visual Generative Models
cs.LG · 2026-05-06 · unverdicted · none · ref 35 · internal anchor
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
Stream-T1: Test-Time Scaling for Streaming Video Generation
cs.CV · 2026-05-06 · unverdicted · none · ref 22 · internal anchor
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.
Leveraging Verifier-Based Reinforcement Learning in Image Editing
cs.CV · 2026-04-30 · unverdicted · none · ref 34 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
cs.CV · 2026-04-28 · unverdicted · none · ref 18 · internal anchor
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt benchmark.
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
cs.CV · 2026-04-17 · unverdicted · none · ref 19 · internal anchor
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
cs.CV · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
cs.CV · 2026-04-03 · unverdicted · none · ref 2 · internal anchor
MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
cs.CV · 2026-04-02 · conditional · none · ref 32 · internal anchor
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
cs.CV · 2025-12-04 · conditional · none · ref 44 · internal anchor
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
cs.CV · 2025-11-24 · unverdicted · none · ref 23 · internal anchor
ViPO enhances GRPO for visual generation by creating spatially and temporally aware advantage maps from pretrained vision models to focus optimization on perceptually important regions.
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
cs.CV · 2025-10-02 · conditional · none · ref 37 · internal anchor
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
DanceGRPO: Unleashing GRPO on Visual Generation
cs.CV · 2025-05-12 · unverdicted · none · ref 14 · internal anchor
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
SkyReels-V2: Infinite-length Film Generative Model
cs.CV · 2025-04-17 · unverdicted · none · ref 46 · internal anchor
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
A Systematic Post-Train Framework for Video Generation
cs.CV · 2026-04-28 · unverdicted · none · ref 43 · internal anchor
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV · 2026-04-16 · unverdicted · none · ref 19 · internal anchor
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG · 2026-03-23 · unverdicted · none · ref 25 · 2 links · internal anchor
CellFluxRL post-trains the CellFlux model with RL using seven biological reward functions to generate more biologically valid virtual cell images.
Reward-Forcing: Autoregressive Video Generation with Reward Feedback
cs.CV · 2026-01-23 · unverdicted · none · ref 14 · internal anchor
Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
World Simulation with Video Foundation Models for Physical AI
cs.CV · 2025-10-28 · unverdicted · none · ref 47 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Seedance 1.0: Exploring the Boundaries of Video Generation Models
cs.CV · 2025-06-10 · unverdicted · none · ref 18 · internal anchor
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
