Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Chao Gao; Kaidong Zhang; Rui Ding; Wangmeng Zuo; Ying Chen; Yukang Ding; Zitong Huang

arxiv: 2601.04068 · v4 · pith:HFQBZVIDnew · submitted 2026-01-07 · 💻 cs.CV · cs.AI


Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

Pith reviewed 2026-05-21 15:56 UTC · model grok-4.3


keywords: video diffusion models · direct preference optimization · local alignment · post-training · text-to-video · human preference · spatio-temporal regions


The pith

LocalDPO aligns text-to-video diffusion models by optimizing preferences only on localized corrupted regions drawn from real videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LocalDPO as a post-training method that builds preference pairs directly from high-quality real videos rather than relying on multiple generations or external critics. High-quality videos serve as positives while the base model restores randomly masked spatio-temporal regions to create negatives. A region-aware DPO loss then restricts learning to those masked areas for faster convergence. Experiments on Wan2.1 and CogVideoX show gains in fidelity, temporal coherence, and human preference scores compared with prior post-training approaches.

Core claim

LocalDPO constructs localized preference pairs by treating real videos as positives and generating negatives through random spatio-temporal masking followed by restoration with the frozen base model, then applies a region-restricted DPO loss to align the generator at the detail level with a single inference per prompt and without critic models or manual labels.
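The pair construction in the core claim can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: videos are flattened to lists of floats, `make_local_pair` and `noisy_restore` are hypothetical names, and `noisy_restore` is a toy stand-in for the frozen base model's restoration step.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # flattened pixel or latent values (illustrative only)
Mask = List[int]     # 1 = corrupted position, 0 = untouched position


def make_local_pair(
    real: Video,
    mask: Mask,
    restore: Callable[[Video, Mask], Video],
) -> Tuple[Video, Video]:
    """Return (positive, negative) built from one real video.

    `restore` is a hypothetical stand-in for the frozen base model;
    only its output inside the mask is kept.
    """
    if len(real) != len(mask):
        raise ValueError("video and mask must have the same length")
    restored = restore(real, mask)
    # Composite: generated content inside the mask, real content outside.
    negative = [r if m else x for x, r, m in zip(real, restored, mask)]
    return list(real), negative


def noisy_restore(video: Video, mask: Mask, seed: int = 0) -> Video:
    """Toy restorer that perturbs every value (placeholder for a real model)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, 0.1) for v in video]


real = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]
mask = [0, 1, 1, 0, 0, 1]
positive, negative = make_local_pair(real, mask, noisy_restore)
```

Outside the mask the negative is identical to the positive, so any preference signal in this toy pair can only come from the masked positions.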

What carries the argument

The automated pipeline that produces localized preference pairs from real videos via masking and frozen-model restoration, paired with a region-aware DPO loss that limits supervision to corrupted spatio-temporal areas.
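A region-restricted DPO objective of the form −E log σ(−β·(1+η(α))·E_t[Δ′w − Δ′l]), as quoted further down this page, can be sketched as follows. The sketch rests on assumptions that are not confirmed by the page: Δ′ is taken to be the policy-minus-reference denoising error averaged over masked positions only (in the style of diffusion DPO objectives), and `scale` stands in for the factor (1+η(α)), whose definition is not given here. All function names are hypothetical.

```python
import math
from typing import Sequence


def masked_mse(pred: Sequence[float], target: Sequence[float],
               mask: Sequence[int]) -> float:
    """Mean squared error over positions where mask == 1."""
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no positions")
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / count


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def region_aware_dpo_loss(delta_w: Sequence[float], delta_l: Sequence[float],
                          beta: float, scale: float = 1.0) -> float:
    """-log sigmoid(-beta * scale * mean_t(delta_w - delta_l)).

    delta_w[i] / delta_l[i]: masked error gap (policy minus reference)
    for the preferred / rejected video at sampled timestep i.
    `scale` stands in for (1 + eta(alpha)); its definition is assumed.
    """
    if not delta_w or len(delta_w) != len(delta_l):
        raise ValueError("need equally many gaps for both videos")
    gap = sum(w - l for w, l in zip(delta_w, delta_l)) / len(delta_w)
    z = -beta * scale * gap
    return softplus(-z)  # equals -log(sigmoid(z))


# Toy single-timestep example; all numbers are made up.
target = [0.0, 1.0, 0.0, 1.0]
mask = [0, 1, 1, 0]                   # only positions 1 and 2 count
ref_pred = [0.0, 0.8, 0.2, 1.0]       # frozen reference prediction
policy_pred_w = [0.5, 0.9, 0.1, 0.5]  # better than the reference inside the mask
policy_pred_l = [0.0, 0.7, 0.3, 1.0]  # worse than the reference inside the mask

delta_w = masked_mse(policy_pred_w, target, mask) - masked_mse(ref_pred, target, mask)
delta_l = masked_mse(policy_pred_l, target, mask) - masked_mse(ref_pred, target, mask)
loss = region_aware_dpo_loss([delta_w], [delta_l], beta=10.0)
```

In this toy example the policy's large errors at positions 0 and 3 have no effect on the loss, which is the point of restricting supervision to the mask. When policy and reference agree everywhere, the loss sits at log 2.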

If this is right

Video generators reach higher fidelity and temporal coherence after post-training.
Alignment requires only one inference per prompt instead of multiple samples.
No external critic models or manual annotations are needed for preference data.
Learning focuses on fine-grained spatio-temporal details rather than global video quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same local-corruption approach could be tested on image or 3D generators to check whether region-level supervision transfers.
If the restored negatives reliably track human judgments, the method might reduce the need for large-scale preference datasets in other generative domains.
Restricting loss to masked regions might allow targeted fixes for specific artifact types without retraining the entire model.

Load-bearing premise

Locally corrupting high-quality real videos with random spatio-temporal masks and restoring only the masked regions with the frozen base model creates negative samples that give clear, region-level supervision matching human preferences.

What would settle it

A side-by-side human evaluation that measures whether viewers consistently judge LocalDPO outputs higher than baseline outputs specifically inside the masked regions of the same video clips.
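One way to turn such a side-by-side study into a decision is an exact sign test on the per-clip votes. The sketch below is an illustration only: the function name and the vote counts are hypothetical and do not come from the paper, and ties are assumed to be discarded before counting.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial test of H0: P(win) = 0.5, ties excluded."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: raters preferred the LocalDPO clip inside the
# masked region in 8 of 10 non-tied comparisons.
p_value = sign_test_p_value(wins=8, losses=2)
```

A small p-value would indicate that the preference inside the masked regions is unlikely to be a coin flip; with only ten comparisons, as here, even an 8 to 2 split is not significant at the 0.05 level.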

Figures

Figures reproduced from arXiv: 2601.04068 by Chao Gao, Kaidong Zhang, Rui Ding, Wangmeng Zuo, Ying Chen, Yukang Ding, Zitong Huang.

**Figure 2.** Comparison of video pairs generated by CogVideoX-5B from the same prompt but different seeds reveals significant discrepan…

**Figure 3.** Pipeline for generating locally corrupted videos. We first randomly sample several Bézier curves on the original video and ensure that these curves form closed shapes. The interior of each closed shape defines the region to be corrupted in subsequent steps. Then, the masked area of the real video is inpainted by the pretrained VDM. Specifically, given the latent of the input real video, the model first adds a con…
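The mask-sampling step described in the Figure 3 caption can be illustrated with a small stdlib-only sketch. Everything below is an assumption made for illustration: the anchor placement, the use of quadratic curves, the jitter size, and all function names are hypothetical, and the caption does not say how the shape is extended along the time axis, so only a single-frame mask is produced.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def quad_bezier(p0: Point, p1: Point, p2: Point, steps: int) -> List[Point]:
    """Sample `steps` points on a quadratic Bezier curve, end point excluded."""
    points = []
    for i in range(steps):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points


def closed_bezier_outline(center: Point, radius: float, n_anchors: int,
                          rng: random.Random, steps: int = 16) -> List[Point]:
    """Chain Bezier curves through anchors placed around `center`.

    Anchors sit at increasing angles, so joining the last anchor back to
    the first one closes the shape.
    """
    cx, cy = center
    anchors = []
    for i in range(n_anchors):
        angle = 2 * math.pi * (i + rng.uniform(0.0, 0.5)) / n_anchors
        r = radius * rng.uniform(0.5, 1.0)
        anchors.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    jitter = 0.2 * radius
    outline: List[Point] = []
    for i in range(n_anchors):
        p0, p2 = anchors[i], anchors[(i + 1) % n_anchors]
        # Control point: chord midpoint plus a small random offset.
        p1 = ((p0[0] + p2[0]) / 2 + rng.uniform(-jitter, jitter),
              (p0[1] + p2[1]) / 2 + rng.uniform(-jitter, jitter))
        outline.extend(quad_bezier(p0, p1, p2, steps))
    return outline


def inside(x: float, y: float, outline: List[Point]) -> bool:
    """Even-odd ray casting test against the sampled outline."""
    hit = False
    for i in range(len(outline)):
        (x0, y0), (x1, y1) = outline[i], outline[(i + 1) % len(outline)]
        if (y0 > y) != (y1 > y):
            if x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
                hit = not hit
    return hit


def rasterize(outline: List[Point], height: int, width: int) -> List[List[int]]:
    """Binary mask: 1 where the pixel centre lies inside the outline."""
    return [[int(inside(col + 0.5, row + 0.5, outline)) for col in range(width)]
            for row in range(height)]


outline = closed_bezier_outline((16.0, 16.0), 10.0, 6, random.Random(0))
mask = rasterize(outline, height=32, width=32)
```

Seeding the generator makes the mask reproducible, which matters if the same corrupted regions need to be regenerated for an ablation.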

**Figure 4.** Human evaluation of LocalDPO vs. SFT and VanillaDPO. LocalDPO achieves the best results on all dimensions of human evaluation.

5.2. Experimental Setup. Baselines and comparisons. To demonstrate the effectiveness of our method, we conduct extensive experiments on multiple DiT-based VDMs with varying parameter scales, including CogVideoX-2B [78], CogVideoX-5B [78], and Wan2.1-1.3B [61]. We compare our met…

**Figure 5.** Qualitative comparison between SFT, Vanilla DPO and LocalDPO for CogVideoX models. Our LocalDPO generates rich…

**Figure 6.** Convergence of the models on aesthetic and image qual…

**Figure 7.** Human evaluation of LocalDPO vs. Baseline, SFT and Vanilla DPO on CogVideoX-2B [78], CogVideoX-5B [78] and Wan2.1-1.3B [61]. LocalDPO achieves the best results on all dimensions of human evaluation.

… filter out blurry content. Aesthetics: A pre-trained aesthetic scoring model [49] is utilized to evaluate the perceptual and artistic appeal of each frame. Motion Smoothness: The “vmafmotion” filter from FFmp…

**Figure 8.** Category distribution of the constructed video dataset.

**Figure 9.** Visualization of generated locally corrupted videos.

**Figure 10.** Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on CogVideoX-2B.

**Figure 11.** Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on CogVideoX-5B.

…dation most affects user perception. As a result, the preference signal may be less effective for improving generation fidelity of specific object classes. In future work, we will incorporate vision foundation models, such as Grounding DINO [38] for object detection and SAM [28, 45] for segmentation, to guide mask placement towards semantically meaningful regions. This would enable targeted refinement of object-level realism and controllability in text-to-video generation.

**Figure 12.** Visualization of LocalDPO vs. Baseline, SFT and VanillaDPO on Wan2.1-1.3B.

12. More Qualitative Comparisons. We present additional visual comparisons between our method and other methods, including the baseline, SFT, and vanilla DPO…

Original abstract

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LocalDPO gives a practical automated route to localized preference pairs for video models via mask-and-restore on real footage, but the claim that those synthetic negatives supply unambiguous region-level human signals rests on an untested assumption.


The main takeaway is that this work adapts DPO to video by generating its own localized preference data instead of relying on multiple samples or a separate critic. They start with real high-quality videos as positives, apply random spatio-temporal masks, and use the frozen base model to inpaint only the masked parts as negatives. Training then uses a region-aware DPO loss that focuses the update on those corrupted areas. This cuts the data collection to one inference per prompt and removes the need for external models or manual labels. The experiments report gains in fidelity, temporal coherence, and human preference scores on both Wan2.1 and CogVideoX compared with other post-training baselines, and the code is public.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LocalDPO, a post-training framework for aligning text-to-video diffusion models with human preferences via localized spatio-temporal optimization. High-quality real videos are treated as positive samples; corresponding negatives are generated by applying random spatio-temporal masks and restoring only the masked regions with the frozen base model. A region-aware DPO loss restricts learning to the corrupted areas. Experiments on Wan2.1 and CogVideoX report consistent gains in fidelity, temporal coherence, and human preference scores over prior post-training methods.

Significance. If the synthetic negative generation procedure produces preference signals that reliably align with human judgments at the region level, the approach would offer a meaningful efficiency gain over global DPO variants by requiring only a single inference per prompt and no external critic models. The public code release supports reproducibility.

major comments (2)

§3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.
§4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.

minor comments (2)

Abstract: The typo 'Otimization' should read 'Optimization'.
The description of the region-aware DPO loss would benefit from an explicit equation or pseudocode showing how the loss is restricted to masked regions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions where the concerns are valid.


Referee: §3 (Automated pipeline for preference pair data): The central premise that negatives obtained by random spatio-temporal masking followed by restoration with the frozen base model yield effective, unambiguous region-level supervision aligned with human preferences is load-bearing yet unsupported. No human evaluation, ablation, or analysis is described to show that restored patches are consistently inferior according to human raters rather than reflecting inpainting-specific artifacts (e.g., temporal inconsistency or texture issues). This directly affects whether the region-aware DPO loss supplies valid supervision.

Authors: We acknowledge that explicit human validation of the negative samples would strengthen the central claim. The manuscript relies on the design rationale that the frozen base model, lacking any preference alignment, produces restorations that are systematically lower quality than the original real videos in the masked regions; the region-aware DPO loss then directly optimizes the model to prefer the real details. To address the concern rigorously, we will add both quantitative ablations (comparing masked vs. restored regions using standard video quality metrics) and a small-scale human preference study on the generated pairs in the revised Section 3. This will confirm that the supervision signal aligns with human judgments beyond inpainting artifacts. revision: yes
Referee: §4 (Experiments): The abstract states that LocalDPO 'consistently improves' fidelity, coherence, and human scores on Wan2.1 and CogVideoX, but provides no quantitative metrics, baseline details, statistical significance tests, or controls. Without these, the magnitude and reliability of the claimed gains cannot be assessed.

Authors: The full manuscript in Section 4 already reports quantitative results on both Wan2.1 and CogVideoX, including specific metrics for fidelity, temporal coherence, and human preference scores, with comparisons against prior post-training baselines. We agree the abstract is too high-level and will revise it to include key numerical improvements. We will also add explicit details on statistical significance testing and experimental controls (e.g., fixed random seeds, identical inference settings) in the revised experimental section to improve clarity and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: independent pipeline and region-aware loss are self-contained


The paper's core contribution is an automated data-generation pipeline that treats real videos as positives and produces negatives via random spatio-temporal masking followed by restoration with the frozen base model, plus a region-restricted DPO loss. No equation or claim reduces a prediction to a fitted parameter by construction, nor does any load-bearing step collapse to a self-citation whose validity depends on the present work. The reported gains on Wan2.1 and CogVideoX are framed as empirical outcomes of this external-to-the-model procedure, leaving the derivation chain independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on domain assumptions about real videos as positives and the effectiveness of local corruption for negatives, with likely unstated hyperparameters in masking and loss weighting.

free parameters (1)

mask generation parameters
Random spatio-temporal masks for corruption likely involve tunable choices such as size, density, or probability not specified in the abstract.

axioms (1)

domain assumption: High-quality real videos serve as reliable positive samples reflecting human preferences
Invoked as the foundation for constructing positive examples in the preference pair collection pipeline.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation to the cited theorem: unclear

Paper passage: "We treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model... region-aware DPO loss that restricts preference learning to corrupted areas"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation to the cited theorem: unclear

Paper passage: $\mathcal{L}_{\text{RA-DPO}} = -\mathbb{E}\,\log \sigma\big(-\beta\,(1+\eta(\alpha))\,\mathbb{E}_t[\Delta'_w - \Delta'_l]\big)$, with $\Delta'$ restricted by the mask $M$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Pexels. https://www.pexels.com/, 2025.10. Accessed: 2025-11-01.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.
[4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
[5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[6] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023.
[7] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In ICML, 2025.
[8] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331,
[9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.
[10] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. PAMI, 45(9):10850–10869, 2023.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[12] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,
[13] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,
[14] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR, 2024.
[15] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In EMNLP, pages 2105–2123, 2024.
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[18] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[19] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.
[20] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.
[21] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[23] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024.
[24] Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. NeurIPS, 37:48955–48970, 2024.
[25] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In ICCV, pages 5148–5157, 2021.
[26] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In ICCV, pages 15954–15964, 2023.
[27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
[29] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 36:36652–36663, 2023.
[30] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[31] Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, and Deng Cai. Personalvideo: High id-fidelity video customization without dynamic and semantic degradation. In ICCV, pages 19406–19416, 2025.
[32] Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation. In ICCV, pages 16249–16259,
[33] Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, and Xiangxiang Chu. Vmbench: A benchmark for perception-aligned video motion generation. In ICCV, pages 13087–13098, 2025.
[34] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
[35] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. NeurIPS, 2025.
[36] Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, and Wu Liu. Hoigen-1m: A large-scale dataset for human-object interaction video generation. In CVPR, pages 24001–24010, 2025.
[37] Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In CVPR, pages 8009–8019, 2025.
[38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024.
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[40] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In ICLR, 2025.
[41] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
[42] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
[43] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 36:53728–53741, 2023.
[44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[45] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In ICLR, 2025.
[46] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[47] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[48] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022.
[49] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[51] Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.
[52] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023.
[53] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[54] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
[55] Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. In ACMMM, pages 11218–11221, 2024.
[56] Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024.
[57] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419. Springer, 2020.
[58] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
[60] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, pages 8228–8238, 2024.
[61] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[62] Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In CVPR, pages 8428–8437, 2025.
[63] Wenhao Wang and Yi Yang. Videoufo: A million-scale user-focused dataset for text-to-video generation. arXiv preprint arXiv:2503.01739, 2025.
[64] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for pretrained language models. NeurIPS, 37:121475–121499, 2024.
[65] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In ICCV, pages 1905–1914, 2021.
[66] Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment. arXiv preprint arXiv:2412.04814, 2024.
[67] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. IJCV, 133(5):3059–3078, 2025.
[68] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704,
[69] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In ICCV, pages 20144–20154, 2023.
[70] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, pages 7623–7633, 2023.
[71] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,
[72] Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. Densedpo: Fine-grained temporal preference optimization for video diffusion models. NeurIPS,
[73] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024.
[74] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 36:15903–15935, 2023.
[75] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024.
[76] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
[77] Xiaomeng Yang, Zhiyu Tan, and Hao Li. Ipo: Iterative preference optimization for text-to-video generation. arXiv preprint arXiv:2502.02088, 2025.
[78] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.
[79] Ailing Zeng, Yuhang Yang, Weidong Chen, and Wei Liu. The dawn of video generation: Preliminary explorations with sora-like models. arXiv preprint arXiv:2410.05227, 2024.
[80] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In ICCV, pages 4791–4800, 2021.


[1] Pexels. https://www.pexels.com/, 2025.10. Accessed: 2025-11-01.

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.

[4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

[5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[6] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023.

[7] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In ICML, 2025.

[8] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331,

[9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.

[10] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. PAMI, 45(9):10850–10869, 2023.

[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

[12] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

[13] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,

[14] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR, 2024.

[15] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In EMNLP, pages 2105–2123, 2024.

[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

[18] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

[19] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.

[20] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. NeurIPS, 35:8633–8646, 2022.

[21] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.

[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[23] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024.

[24] Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. NeurIPS, 37:48955–48970, 2024.

[25] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In ICCV, pages 5148–5157, 2021.

[26] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In ICCV, pages 15954–15964, 2023.

[27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.

[29] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 36:36652–36663, 2023.

[30] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[31] Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, and Deng Cai. Personalvideo: High id-fidelity video customization without dynamic and semantic degradation. In ICCV, pages 19406–19416, 2025.

[32] Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation. In ICCV, pages 16249–16259,

[33] Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, and Xiangxiang Chu. Vmbench: A benchmark for perception-aligned video motion generation. In ICCV, pages 13087–13098, 2025.

[34] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

[35] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. NeurIPS, 2025.

[36] Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, and Wu Liu. Hoigen-1m: A large-scale dataset for human-object interaction video generation. In CVPR, pages 24001–24010, 2025.

[37] Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In CVPR, pages 8009–8019, 2025.

[38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024.

[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[40] [40]

Openvid-1m: A large-scale high-quality dataset for text-to- video generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation. InICLR, 2025. 3

work page 2025

[41] [41]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023. 3

work page 2023

[42] [42]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2024. 3

work page 2024

[43] [43]

Direct preference optimization: Your language model is secretly a reward model.NeurIPS, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.NeurIPS, 36:53728–53741, 2023. 2, 3

work page 2023

[44] [44]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents.arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[45] [45]

Sam 2: Seg- ment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Seg- ment anything in images and videos. InICLR, 2025. 5

work page 2025

[46] [46]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InCVPR, pages 10684– 10695, 2022. 3

work page 2022

[47] [47]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMICCAI, pages 234–241. Springer, 2015. 3

work page 2015

[48] [48]

Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, 35:36479–36494, 2022. 3

work page 2022

[49] [49]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 5, 6, 1, 2 10

work page internal anchor Pith review Pith/arXiv arXiv 2021

[50] [50]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2017

[51] [51]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al

Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective train- ing of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025. 1, 5

work page arXiv 2025

[52] [52]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. InICLR, 2023. 3

work page 2023

[53] [53]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InICLR, 2021. 1, 3, 6

work page 2021

[54] [54]

Score-based generative modeling through stochastic differential equa- tions

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. InInternational Conference on Learning Represen- tations, 2021. 3

work page 2021

[55] [55]

Transnet v2: An effective deep network architecture for fast shot transition detection

Tom ´as Soucek and Jakub Lokoc. Transnet v2: An effective deep network architecture for fast shot transition detection. InACMMM, pages 11218–11221, 2024. 5, 1, 2

work page 2024

[56] [56]

VidGen-1M: A large-scale dataset for text-to-video generation

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video genera- tion.arXiv preprint arXiv:2408.02629, 2024. 3

work page arXiv 2024

[57] [57]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InECCV, pages 402–419. Springer, 2020. 5, 1, 2

work page 2020

[58] [58]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 5, 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Attention is all you need.NeurIPS, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 30, 2017. 3

work page 2017

[60] [60]

Diffusion model align- ment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model align- ment using direct preference optimization. InCVPR, pages 8228–8238, 2024. 3

work page 2024

[61] [61]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 1, 3, 5, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InCVPR, pages 8428–8437, 2025. 3, 5, 1

work page 2025

[63] [63]

Videoufo: A million-scale user- focused dataset for text-to-video generation.arXiv preprint arXiv:2503.01739, 2025

Wenhao Wang and Yi Yang. Videoufo: A million-scale user- focused dataset for text-to-video generation.arXiv preprint arXiv:2503.01739, 2025. 3

work page arXiv 2025

[64] [64]

Cogvlm: Visual expert for pretrained language models.NeurIPS, 37:121475–121499, 2024

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiX- uan, et al. Cogvlm: Visual expert for pretrained language models.NeurIPS, 37:121475–121499, 2024. 3

work page 2024

[65] [65]

Real-esrgan: Training real-world blind super-resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. InICCV, pages 1905–1914, 2021. 3

work page 1905

[66] [66]

Lift: Leveraging human feed- back for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feed- back for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024. 2, 3

work page arXiv 2024

[67] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. IJCV, 133(5):3059–3078, 2025.

[68] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704,

[69] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In ICCV, pages 20144–20154, 2023.

[70] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, pages 7623–7633, 2023.

[71] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

[72] Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. Densedpo: Fine-grained temporal preference optimization for video diffusion models. NeurIPS,

[73] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 57(2):1–42, 2024.

[74] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 36:15903–15935, 2023.

[75] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024.

[76] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.

[77] Xiaomeng Yang, Zhiyu Tan, and Hao Li. Ipo: Iterative preference optimization for text-to-video generation. arXiv preprint arXiv:2502.02088, 2025.

[78] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.

[79] Ailing Zeng, Yuhang Yang, Weidong Chen, and Wei Liu. The dawn of video generation: Preliminary explorations with sora-like models. arXiv preprint arXiv:2410.05227, 2024.

[80] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In ICCV, pages 4791–4800, 2021.