Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Pith reviewed 2026-05-21 15:56 UTC · model grok-4.3
The pith
LocalDPO aligns text-to-video diffusion models by optimizing preferences only on localized corrupted regions drawn from real videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LocalDPO constructs localized preference pairs by treating real videos as positives and generating negatives through random spatio-temporal masking followed by restoration with the frozen base model, then applies a region-restricted DPO loss to align the generator at the detail level with a single inference per prompt and without critic models or manual labels.
What carries the argument
The automated pipeline that produces localized preference pairs from real videos via masking and frozen-model restoration, paired with a region-aware DPO loss that limits supervision to corrupted spatio-temporal areas.
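The pair-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single-channel video stored as nested lists, a single random box as the spatio-temporal mask, and a hypothetical `placeholder_restore` function standing in for inpainting with the frozen base model. The names `random_box_mask`, `make_pair`, and `max_frac` are invented for this sketch.

```python
import random


def random_box_mask(T, H, W, rng, max_frac=0.5):
    """Return a T x H x W nested list of 0/1 with one random box set to 1.

    Assumption: the mask is a single axis-aligned box; the paper only says
    "random spatio-temporal masks" and may use a different shape.
    """
    dt = rng.randint(1, max(1, int(T * max_frac)))
    dh = rng.randint(1, max(1, int(H * max_frac)))
    dw = rng.randint(1, max(1, int(W * max_frac)))
    t0 = rng.randint(0, T - dt)
    h0 = rng.randint(0, H - dh)
    w0 = rng.randint(0, W - dw)
    return [[[1 if (t0 <= t < t0 + dt and h0 <= h < h0 + dh and w0 <= w < w0 + dw) else 0
              for w in range(W)] for h in range(H)] for t in range(T)]


def placeholder_restore(video, mask):
    """Hypothetical stand-in for the frozen base model: shifts every value by 1."""
    return [[[v + 1.0 for v in row] for row in frame] for frame in video]


def make_pair(real, restore_fn, rng):
    """Build (positive, negative, mask) from one real video.

    The negative copies the real video everywhere except inside the mask,
    where it takes the restored content.
    """
    T, H, W = len(real), len(real[0]), len(real[0][0])
    mask = random_box_mask(T, H, W, rng)
    restored = restore_fn(real, mask)
    negative = [[[restored[t][h][w] if mask[t][h][w] else real[t][h][w]
                  for w in range(W)] for h in range(H)] for t in range(T)]
    return real, negative, mask
```

Because the negative differs from the positive only inside the mask, the mask can later be reused to tell the loss where supervision applies.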
If this is right
- Video generators reach higher fidelity and temporal coherence after post-training.
- Alignment requires only one inference per prompt instead of multiple samples.
- No external critic models or manual annotations are needed for preference data.
- Learning focuses on fine-grained spatio-temporal details rather than global video quality.
Where Pith is reading between the lines
- The same local-corruption approach could be tested on image or 3D generators to check whether region-level supervision transfers.
- If the restored negatives reliably track human judgments, the method might reduce the need for large-scale preference datasets in other generative domains.
- Restricting loss to masked regions might allow targeted fixes for specific artifact types without retraining the entire model.
Load-bearing premise
Locally corrupting high-quality real videos with random spatio-temporal masks and restoring only the masked regions with the frozen base model creates negative samples that give clear, region-level supervision matching human preferences.
What would settle it
A side-by-side human evaluation that measures whether viewers consistently judge LocalDPO outputs higher than baseline outputs specifically inside the masked regions of the same video clips.
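One simple way to score such a side-by-side study is an exact one-sided sign test on the per-clip votes. The sketch below is an assumption about how the analysis might be run, not something the paper or this review specifies; `sign_test_p_value`, `wins`, and `losses` are hypothetical names, and ties are assumed to be dropped before counting.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided exact binomial p-value for H0: P(win) = 0.5.

    wins   = clips where raters preferred the LocalDPO output in the masked region
    losses = clips where raters preferred the baseline output
    Returns P(X >= wins) for X ~ Binomial(wins + losses, 0.5).
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no informative votes, so no evidence either way
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

A small p-value would indicate that the preference for one side is unlikely under a coin flip; it does not by itself show that the preference is localized to the masked regions unless raters were asked about those regions specifically.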
Original abstract
Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LocalDPO, a post-training framework for aligning text-to-video diffusion models with human preferences via localized spatio-temporal optimization. High-quality real videos are treated as positive samples; corresponding negatives are generated by applying random spatio-temporal masks and restoring only the masked regions with the frozen base model. A region-aware DPO loss restricts learning to the corrupted areas. Experiments on Wan2.1 and CogVideoX report consistent gains in fidelity, temporal coherence, and human preference scores over prior post-training methods.
Significance. If the synthetic negative generation procedure produces preference signals that reliably align with human judgments at the region level, the approach would offer a meaningful efficiency gain over global DPO variants by requiring only a single inference per prompt and no external critic models. The public code release supports reproducibility.
major comments (2)
- [§3] §3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.
- [§4] §4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.
minor comments (2)
- [Abstract] Abstract: Typo 'Otimization' should read 'Optimization'.
- The description of the region-aware DPO loss would benefit from an explicit equation or pseudocode showing how the loss is restricted to masked regions.
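As an illustration of what such pseudocode might look like, the sketch below restricts a Diffusion-DPO-style loss to masked positions. It is an assumption-laden reading rather than the paper's formulation: it uses flat Python lists in place of tensors, a single diffusion timestep in place of an expectation over timesteps, and a plain number `eta` as a placeholder for a mask-dependent weight whose definition is not given on this page. The function names are invented for the sketch.

```python
import math


def masked_mean_sq_err(pred, target, mask):
    """Mean squared error over positions where mask == 1 (0.0 if the mask is empty)."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


def region_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta, eta=0.0):
    """Region-restricted DPO loss for one (winner, loser) pair at one timestep.

    err_*     = masked denoising error of the trained model on winner / loser
    err_*_ref = the same error for the frozen reference model
    eta       = placeholder for a mask-dependent weight (assumption)
    """
    delta_w = err_w - err_w_ref
    delta_l = err_l - err_l_ref
    z = -beta * (1.0 + eta) * (delta_w - delta_l)
    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the trained model and the reference model have identical errors, `z` is zero and the loss equals log 2; the loss falls below that value as the trained model fits the real (winner) region better, or the restored (loser) region worse, than the reference does.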
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions where the concerns are valid.
- Referee: [§3] §3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.
Authors: We acknowledge that explicit human validation of the negative samples would strengthen the central claim. The manuscript relies on the design rationale that the frozen base model, lacking any preference alignment, produces restorations that are systematically lower quality than the original real videos in the masked regions; the region-aware DPO loss then directly optimizes the model to prefer the real details. To address the concern rigorously, we will add both quantitative ablations (comparing masked vs. restored regions using standard video quality metrics) and a small-scale human preference study on the generated pairs in the revised Section 3. This will confirm that the supervision signal aligns with human judgments beyond inpainting artifacts.
Revision: yes
- Referee: [§4] §4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.
Authors: The full manuscript in Section 4 already reports quantitative results on both Wan2.1 and CogVideoX, including specific metrics for fidelity, temporal coherence, and human preference scores, with comparisons against prior post-training baselines. We agree the abstract is too high-level and will revise it to include key numerical improvements. We will also add explicit details on statistical significance testing and experimental controls (e.g., fixed random seeds, identical inference settings) in the revised experimental section to improve clarity and reproducibility.
Revision: yes
Circularity Check
No circularity: independent pipeline and region-aware loss are self-contained
The paper's core contribution is an automated data-generation pipeline that treats real videos as positives and produces negatives via random spatio-temporal masking followed by restoration with the frozen base model, plus a region-restricted DPO loss. No equation or claim reduces a prediction to a fitted parameter by construction, nor does any load-bearing step collapse to a self-citation whose validity depends on the present work. The reported gains on Wan2.1 and CogVideoX are framed as empirical outcomes of this external-to-the-model procedure, leaving the derivation chain independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- mask generation parameters
axioms (1)
- domain assumption: High-quality real videos serve as reliable positive samples reflecting human preferences
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model... region-aware DPO loss that restricts preference learning to corrupted areas"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: $\mathcal{L}_{\text{RA-DPO}} = -\mathbb{E}\,\log \sigma\!\left(-\beta\,(1+\eta(\alpha))\,\mathbb{E}_t\!\left[\Delta'_w - \Delta'_l\right]\right)$, with $\Delta'$ restricted by mask $M$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...