arxiv: 2601.04068 · v4 · pith:HFQBZVIDnew · submitted 2026-01-07 · 💻 cs.CV · cs.AI

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Pith reviewed 2026-05-21 15:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video diffusion models, direct preference optimization, local alignment, post-training, text-to-video, human preference, spatio-temporal regions

The pith

LocalDPO aligns text-to-video diffusion models by optimizing preferences only on localized corrupted regions drawn from real videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LocalDPO as a post-training method that builds preference pairs directly from high-quality real videos rather than relying on multiple generations or external critics. High-quality videos serve as positives while the base model restores randomly masked spatio-temporal regions to create negatives. A region-aware DPO loss then restricts learning to those masked areas for faster convergence. Experiments on Wan2.1 and CogVideoX show gains in fidelity, temporal coherence, and human preference scores compared with prior post-training approaches.

Core claim

LocalDPO constructs localized preference pairs by treating real videos as positives and generating negatives through random spatio-temporal masking followed by restoration with the frozen base model, then applies a region-restricted DPO loss to align the generator at the detail level with a single inference per prompt and without critic models or manual labels.
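The pair construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: `random_box_mask`, `build_pair`, and `toy_restore` are hypothetical names, a single random box stands in for the closed Bézier shapes mentioned in the Figure 3 caption, the blending happens in pixel space rather than in latents, and `toy_restore` is a placeholder for the frozen base model's inpainting call.

```python
import numpy as np

def random_box_mask(shape, rng, min_frac=0.2, max_frac=0.5):
    """Binary mask (1 = corrupted) over a (T, H, W) grid: one random box.

    Assumption: a box replaces whatever mask shapes the paper samples.
    """
    mask = np.zeros(shape, dtype=np.float32)
    slices = []
    for size in shape:
        extent = max(1, int(size * rng.uniform(min_frac, max_frac)))
        start = int(rng.integers(0, size - extent + 1))
        slices.append(slice(start, start + extent))
    mask[tuple(slices)] = 1.0
    return mask

def build_pair(real_video, restore_fn, rng):
    """Return (positive, negative, mask) for one video of shape (T, H, W, C)."""
    mask = random_box_mask(real_video.shape[:3], rng)
    restored = restore_fn(real_video, mask)  # hypothetical frozen-model inpainting
    m = mask[..., None]                      # broadcast the mask over channels
    # The negative equals the real video everywhere except the masked region.
    negative = m * restored + (1.0 - m) * real_video
    return real_video, negative, mask

def toy_restore(video, mask):
    # Placeholder for the frozen base model: returns noise of the same shape.
    noise_rng = np.random.default_rng(0)
    return noise_rng.normal(size=video.shape).astype(video.dtype)

rng = np.random.default_rng(42)
video = np.ones((8, 16, 16, 3), dtype=np.float32)
pos, neg, mask = build_pair(video, toy_restore, rng)
```

Because the negative is blended back into the real video, the two samples differ only inside the mask, which is what lets a loss be restricted to that region later.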

What carries the argument

The automated pipeline that produces localized preference pairs from real videos via masking and frozen-model restoration, paired with a region-aware DPO loss that limits supervision to corrupted spatio-temporal areas.
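One plausible reading of a region-aware DPO loss is sketched below. It assumes a Diffusion-DPO style objective in which the denoising errors of the trained model and the frozen reference are compared, with every error averaged over masked elements only. The function names, the `beta` default, and the requirement that all arrays share one shape `(B, T, H, W)` are assumptions for illustration; the summary above does not give the paper's exact formula.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Per-sample squared error averaged over masked elements only."""
    axes = tuple(range(1, pred.ndim))
    sq = (pred - target) ** 2 * mask
    return sq.sum(axis=axes) / np.maximum(mask.sum(axis=axes), 1.0)

def region_dpo_loss(eps_w, eps_l, pol_w, pol_l, ref_w, ref_l, mask, beta=1.0):
    """Assumed Diffusion-DPO style loss restricted to the masked region.

    eps_*: noise added to the winner (real) / loser (restored) sample.
    pol_*: noise predicted by the model being trained.
    ref_*: noise predicted by the frozen reference model.
    """
    diff_w = masked_mse(pol_w, eps_w, mask) - masked_mse(ref_w, eps_w, mask)
    diff_l = masked_mse(pol_l, eps_l, mask) - masked_mse(ref_l, eps_l, mask)
    logits = -beta * (diff_w - diff_l)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp keeps it stable.
    return np.logaddexp(0.0, -logits).mean()

rng = np.random.default_rng(0)
shape = (2, 4, 8, 8)
eps_w, eps_l = rng.normal(size=shape), rng.normal(size=shape)
ref_w, ref_l = rng.normal(size=shape), rng.normal(size=shape)
mask = np.zeros(shape)
mask[:, :, 2:5, 2:5] = 1.0

# A policy identical to the reference gives the neutral value log(2).
neutral = region_dpo_loss(eps_w, eps_l, ref_w, ref_l, ref_w, ref_l, mask)
```

In this sketch, predictions outside the mask contribute nothing to the loss, so the gradient signal would come only from the corrupted region.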

If this is right

  • Video generators reach higher fidelity and temporal coherence after post-training.
  • Alignment requires only one inference per prompt instead of multiple samples.
  • No external critic models or manual annotations are needed for preference data.
  • Learning focuses on fine-grained spatio-temporal details rather than global video quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-corruption approach could be tested on image or 3D generators to check whether region-level supervision transfers.
  • If the restored negatives reliably track human judgments, the method might reduce the need for large-scale preference datasets in other generative domains.
  • Restricting loss to masked regions might allow targeted fixes for specific artifact types without retraining the entire model.

Load-bearing premise

Locally corrupting high-quality real videos with random spatio-temporal masks and restoring only the masked regions with the frozen base model creates negative samples that give clear, region-level supervision matching human preferences.

What would settle it

A side-by-side human evaluation that measures whether viewers consistently judge LocalDPO outputs higher than baseline outputs specifically inside the masked regions of the same video clips.
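A study like this could be scored with an exact sign test on per-clip votes. The sketch below is an assumption about how such an analysis might look, not something the paper reports: `sign_test_p_value` is a hypothetical helper, ties are assumed to be dropped before counting, and the null hypothesis is that raters pick either output with probability 0.5.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided exact binomial sign test against a 50/50 null (ties excluded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least as lopsided as k out of n, one tail.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts: clips where the LocalDPO output was preferred
# inside the masked region versus clips where the baseline was preferred.
p_value = sign_test_p_value(wins=10, losses=0)
```

A small p-value would indicate that the preference inside the masked regions is unlikely to be a coin flip; it says nothing about the size of the quality gap.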

Figures

Figures reproduced from arXiv: 2601.04068 by Chao Gao, Kaidong Zhang, Rui Ding, Wangmeng Zuo, Ying Chen, Yukang Ding, Zitong Huang.

Figure 1: Comparison between (a) vanilla DPO and (b) LocalDPO.
Figure 2: Comparison of video pairs generated by CogVideoX-5B from the same prompt but different seeds reveals significant discrepan…
Figure 3: Pipeline of locally corrupted videos generation. We first randomly sample several Bézier curves on the original video and ensure that these curves form closed shapes. The interior of each closed shape defines the region to be corrupted in subsequent steps. Then, the masked area of real video is inpainted by the pretrained VDM. Specifically, given the latent of input real video, the model first adds a con…
Figure 4: Human evaluation of LocalDPO vs. SFT and VanillaDPO. LocalDPO achieves the best results on all dimensions of human evaluation. 5.2. Experimental Setup Baselines and comparisons. To demonstrate the effectiveness of our method, we conduct extensive experiments on multiple DiT-based VDMs with varying parameter scales, including CogVideoX-2B [78], CogVideoX-5B [78], and Wan2.1-1.3B [61]. We compare our met…
Figure 5: Qualitative Comparison between SFT, Vanilla DPO and LocalDPO for CogVideoX models. Our LocalDPO generates rich
Figure 6: Convergence of the models on aesthetic and image qual
Figure 7: Human evaluation of LocalDPO vs. Baseline, SFT and Vanilla DPO on CogvideoX-2B [78], CogvideoX-5B [78] and Wan2.1-1.3B [61]. LocalDPO achieves the best results on all dimensions of human evaluation. filter out blurry content. Aesthetics: A pre-trained aesthetic scoring model [49] is utilized to evaluate the perceptual and artistic appeal of each frame. Motion Smoothness: The “vmafmotion” filter from FFmp…
Figure 8: Category Distribution of the constructed video dataset.
Figure 9: Visualization of generated locally corrupted videos.
Figure 10: Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on CogvideoX-2B. dation most affects user perception. As a result, the preference signal may be less effective for improving generation fidelity of specific object classes. In future work, we will incorporate vision foundation models, such as Grounding DINO [38] for object detection and SAM [28, 45] for segmentation, to guide mask place-
Figure 11: Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on CogvideoX-5B. ment towards semantically meaningful regions. This would enable targeted refinement of object-level realism and controllability in text-to-video generation.
Figure 12: Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on Wan2.1-1.3B. 12. More Qualitative Comparisons We present additional visual comparisons between our method and other methods, including the baseline, SFT, and vanilla DPO

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LocalDPO, a post-training framework for aligning text-to-video diffusion models with human preferences via localized spatio-temporal optimization. High-quality real videos are treated as positive samples; corresponding negatives are generated by applying random spatio-temporal masks and restoring only the masked regions with the frozen base model. A region-aware DPO loss restricts learning to the corrupted areas. Experiments on Wan2.1 and CogVideoX report consistent gains in fidelity, temporal coherence, and human preference scores over prior post-training methods.

Significance. If the synthetic negative generation procedure produces preference signals that reliably align with human judgments at the region level, the approach would offer a meaningful efficiency gain over global DPO variants by requiring only a single inference per prompt and no external critic models. The public code release supports reproducibility.

major comments (2)
  1. [§3] §3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.
  2. [§4] §4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: Typo 'Otimization' should read 'Optimization'.
  2. The description of the region-aware DPO loss would benefit from an explicit equation or pseudocode showing how the loss is restricted to masked regions.
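
For concreteness, one form such an equation could take is written out here. It is an assumed form built on the standard Diffusion-DPO objective with a binary mask $M$ applied to every denoising error; the symbols $M$, $d_M$, and $\beta$ are introduced for illustration and the manuscript's actual loss may differ.

```latex
\[
\mathcal{L}_{\text{region}}
= -\,\mathbb{E}\Big[\log \sigma\Big(-\beta\,\big[
\big(d_M(\epsilon_\theta; x^w_t, \epsilon^w) - d_M(\epsilon_{\text{ref}}; x^w_t, \epsilon^w)\big)
- \big(d_M(\epsilon_\theta; x^l_t, \epsilon^l) - d_M(\epsilon_{\text{ref}}; x^l_t, \epsilon^l)\big)
\big]\Big)\Big],
\]
\[
d_M(\epsilon_\bullet; x_t, \epsilon)
= \frac{\lVert M \odot (\epsilon - \epsilon_\bullet(x_t, t)) \rVert_2^2}{\lVert M \rVert_1}.
\]
```

Here $x^w_t$ and $x^l_t$ are the noised real (preferred) and restored (dispreferred) samples, $\epsilon^w$ and $\epsilon^l$ are the noises added to them, $\epsilon_\theta$ is the model being trained, and $\epsilon_{\text{ref}}$ is the frozen reference. Setting $M$ to all ones recovers an unrestricted, global comparison.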

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions where the concerns are valid.

  1. Referee: [§3] §3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.

    Authors: We acknowledge that explicit human validation of the negative samples would strengthen the central claim. The manuscript relies on the design rationale that the frozen base model, lacking any preference alignment, produces restorations that are systematically lower quality than the original real videos in the masked regions; the region-aware DPO loss then directly optimizes the model to prefer the real details. To address the concern rigorously, we will add both quantitative ablations (comparing masked vs. restored regions using standard video quality metrics) and a small-scale human preference study on the generated pairs in the revised Section 3. This will confirm that the supervision signal aligns with human judgments beyond inpainting artifacts. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.

    Authors: The full manuscript in Section 4 already reports quantitative results on both Wan2.1 and CogVideoX, including specific metrics for fidelity, temporal coherence, and human preference scores, with comparisons against prior post-training baselines. We agree the abstract is too high-level and will revise it to include key numerical improvements. We will also add explicit details on statistical significance testing and experimental controls (e.g., fixed random seeds, identical inference settings) in the revised experimental section to improve clarity and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: independent pipeline and region-aware loss are self-contained

The paper's core contribution is an automated data-generation pipeline that treats real videos as positives and produces negatives via random spatio-temporal masking followed by restoration with the frozen base model, plus a region-restricted DPO loss. No equation or claim reduces a prediction to a fitted parameter by construction, nor does any load-bearing step collapse to a self-citation whose validity depends on the present work. The reported gains on Wan2.1 and CogVideoX are framed as empirical outcomes of this external-to-the-model procedure, leaving the derivation chain independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on domain assumptions about real videos as positives and the effectiveness of local corruption for negatives, with likely unstated hyperparameters in masking and loss weighting.

free parameters (1)
  • mask generation parameters
    Random spatio-temporal masks for corruption likely involve tunable choices such as size, density, or probability not specified in the abstract.
axioms (1)
  • domain assumption: High-quality real videos serve as reliable positive samples reflecting human preferences
    Invoked as the foundation for constructing positive examples in the preference pair collection pipeline.

pith-pipeline@v0.9.0 · 5760 in / 1501 out tokens · 64519 ms · 2026-05-21T15:56:43.552524+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    Pexels

    Pexels. https://www.pexels.com/, 2025.10. accessed: 2025-11-01.

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.

  4. [4]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  6. [6]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023.

  7. [7]

    VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In ICML, 2025.

  8. [8]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331,

  9. [9]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.

  10. [10]

    Diffusion models in vision: A survey

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. PAMI, 45(9):10850–10869, 2023.

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

  12. [12]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  13. [13]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,

  14. [14]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR, 2024.

  15. [15]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In EMNLP, pages 2105–2123, 2024.

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

  17. [17]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

  18. [18]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

  19. [19]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.

  20. [20]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.

  21. [21]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.

  22. [22]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  23. [23]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024.

  24. [24]

    Miradata: A large-scale video dataset with long durations and structured captions

    Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. NeurIPS, 37:48955–48970, 2024.

  25. [25]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In ICCV, pages 5148–5157, 2021.

  26. [26]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In ICCV, pages 15954–15964, 2023.

  27. [27]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

  28. [28]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.

  29. [29]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 36:36652–36663, 2023.

  30. [30]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  31. [31]

    Personalvideo: High id-fidelity video customization without dynamic and semantic degradation

    Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, and Deng Cai. Personalvideo: High id-fidelity video customization without dynamic and semantic degradation. In ICCV, pages 19406–19416, 2025.

  32. [32]

    Stiv: Scalable text and image conditioned video generation

    Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation. In ICCV, pages 16249–16259,

  33. [33]

    Vmbench: A benchmark for perception-aligned video motion generation

    Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, and Xiangxiang Chu. Vmbench: A benchmark for perception-aligned video motion generation. In ICCV, pages 13087–13098, 2025.

  34. [34]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

  35. [35]

    Improving video generation with human feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. NeurIPS, 2025.

  36. [36]

    Hoigen-1m: A large-scale dataset for human-object interaction video generation

    Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, and Wu Liu. Hoigen-1m: A large-scale dataset for human-object interaction video generation. In CVPR, pages 24001–24010, 2025.

  37. [37]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In CVPR, pages 8009–8019, 2025.

  38. [38]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024.

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

  40. [40]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In ICLR, 2025.

  41. [41]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.

  42. [42]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.

  43. [43]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 36:53728–53741, 2023.

  44. [44]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  45. [45]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In ICLR, 2025.

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

  47. [47]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.

  48. [48]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022.

  49. [49]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

  50. [50]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  51. [51]

    Seaweed-7b: Cost-effective training of video generation foundation model

    Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.

  52. [52]

    Make-a-video: Text-to-video generation without text-video data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023.

  53. [53]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. 1, 3, 6

  54. [54]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 3

  55. [55]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. In ACMMM, pages 11218–11221, 2024. 5, 1, 2

  56. [56]

    VidGen-1M: A large-scale dataset for text-to-video generation

    Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024. 3

  57. [57]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InECCV, pages 402–419. Springer, 2020. 5, 1, 2

  58. [58]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 5, 1, 2

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 3

  60. [60]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, pages 8228–8238, 2024. 3

  61. [61]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 1, 3, 5, 6, 2

  62. [62]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In CVPR, pages 8428–8437, 2025. 3, 5, 1

  63. [63]

    Videoufo: A million-scale user-focused dataset for text-to-video generation

    Wenhao Wang and Yi Yang. Videoufo: A million-scale user-focused dataset for text-to-video generation. arXiv preprint arXiv:2503.01739, 2025. 3

  64. [64]

    Cogvlm: Visual expert for pretrained language models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for pretrained language models. NeurIPS, 37:121475–121499, 2024. 3

  65. [65]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In ICCV, pages 1905–1914, 2021. 3

  66. [66]

    Lift: Leveraging human feedback for text-to-video model alignment

    Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment. arXiv preprint arXiv:2412.04814, 2024. 2, 3

  67. [67]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. IJCV, 133(5):3059–3078, 2025. 3

  68. [68]

    General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

    Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704,

  69. [69]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In ICCV, pages 20144–20154, 2023. 5, 1

  70. [70]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, pages 7623–7633, 2023. 3

  71. [71]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  72. [72]

    Densedpo: Fine-grained temporal preference optimization for video diffusion models

    Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. Densedpo: Fine-grained temporal preference optimization for video diffusion models. NeurIPS,

  73. [73]

    A survey on video diffusion models

    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024. 1

  74. [74]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 36:15903–15935, 2023. 6

  75. [75]

    VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024. 2

  76. [76]

    Diffusion models: A comprehensive survey of methods and applications

    Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023. 1

  77. [77]

    Ipo: Iterative preference optimization for text-to-video generation

    Xiaomeng Yang, Zhiyu Tan, and Hao Li. Ipo: Iterative preference optimization for text-to-video generation. arXiv preprint arXiv:2502.02088, 2025. 3

  78. [78]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025. 1, 3, 6, 2

  79. [79]

    The dawn of video generation: Preliminary explorations with sora-like models

    Ailing Zeng, Yuhang Yang, Weidong Chen, and Wei Liu. The dawn of video generation: Preliminary explorations with sora-like models. arXiv preprint arXiv:2410.05227, 2024. 3

  80. [80]

    Designing a practical degradation model for deep blind image super-resolution

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In ICCV, pages 4791–4800, 2021. 3

Showing first 80 references.