Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Hae-Gon Jeon; Hyeonho Jeong; Sangeyl Lee; Seungho Park; Seunghyun Shin; Wooseok Jeon

arxiv: 2605.19398 · v3 · pith:UNOHQWOFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon

Pith reviewed 2026-06-30 18:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: image-to-video generation, motion enhancement, attention rebalancing, diffusion models, training-free method, reference frame, denoising steps, DyMoS

The pith

Non-reference frames in image-to-video models over-attend to the reference frame, suppressing motion; rebalancing attention in early denoising steps restores dynamics without retraining or fidelity loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies reference-frame dominance as a key mechanism behind overly static outputs in image-to-video models. Non-reference frames allocate too much self-attention to reference-frame key tokens, which over-propagates reference information across time and reduces inter-frame dynamics. The authors show that selectively rebalancing this attention pathway only during the initial denoising steps increases motion while leaving the input image and model weights untouched. A single scalar parameter gives continuous control over motion strength. This matters because earlier fixes often required additional training or traded away reference fidelity, whereas this approach avoids both costs.

Core claim

Non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. DyMoS rebalances the attention pathway from generated frames to the reference frame during initial denoising steps, introduces one scalar for continuous motion control, and leaves both the input image and model weights unchanged. Experiments across multiple state-of-the-art I2V backbones show consistent gains in motion dynamics while preserving visual quality and reference fidelity.

What carries the argument

DyMoS (Dynamic Motion Slider), a training-free adjustment that reduces attention scores from non-reference frames to the reference frame's keys in the first denoising steps.
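
A minimal sketch of what such a rebalancing could look like, in plain Python. The paper's exact formula is not reproduced on this page, so the rule below (scaling the post-softmax mass on reference-frame keys by `1 - gamma` and renormalizing) is an assumption for illustration only. The names `rebalance_row`, `attention_weights`, `ref_len`, `gamma`, and `step_ratio` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rebalance_row(probs, ref_len, gamma):
    """Scale attention mass on the first `ref_len` keys by (1 - gamma), then renormalize.

    Assumed rule, not the paper's equation: gamma = 0 leaves the row unchanged,
    gamma > 0 moves mass away from reference keys, gamma < 0 moves mass toward them.
    Assumes gamma <= 1 and that some non-reference key carries mass.
    """
    scaled = [p * (1.0 - gamma) if k < ref_len else p for k, p in enumerate(probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

def attention_weights(logits_rows, ref_len, gamma, step, num_steps, step_ratio=0.1):
    """Rebalance only non-reference query rows, and only during early steps."""
    early = step < step_ratio * num_steps
    out = []
    for q, logits in enumerate(logits_rows):
        probs = softmax(logits)
        if early and q >= ref_len:  # query comes from a generated frame
            probs = rebalance_row(probs, ref_len, gamma)
        out.append(probs)
    return out

# Toy example: 1 reference token followed by 2 generated-frame tokens.
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
early = attention_weights(logits, ref_len=1, gamma=0.6, step=0, num_steps=50)
late = attention_weights(logits, ref_len=1, gamma=0.6, step=40, num_steps=50)
```

In the toy example the generated-frame rows put less weight on the reference key at step 0 than at step 40, while the reference-frame row and all later steps use the original attention.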

If this is right

Motion dynamics increase across multiple I2V backbones without any model retraining.
Reference image fidelity and overall visual quality stay intact.
A single scalar parameter provides continuous, user-controllable motion strength.
The method applies at inference time only and requires no changes to input conditioning.
Because the method is presented as model-agnostic, the same rebalancing principle should carry over to other diffusion-based I2V architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dominance pattern may appear in other conditioned generation tasks where one input strongly anchors the output sequence.
Extending the rebalancing window beyond the initial steps might be needed for very long video outputs.
Combining the scalar adjustment with existing motion-regularization losses could produce additive improvements.
Measuring attention maps directly before and after DyMoS would confirm the proposed causal link between reference dominance and static outputs.

Load-bearing premise

Reference-frame dominance is the primary cause of motion suppression, and rebalancing attention only in the initial denoising steps will reliably increase inter-frame dynamics without degrading visual quality or reference fidelity.

What would settle it

Generate videos with and without DyMoS on the same seeds and reference images, then measure whether motion metrics such as average optical flow between consecutive frames rise while CLIP similarity to the reference image and perceptual quality scores remain statistically unchanged.
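
The paired comparison described above can be sketched as follows. This is a rough illustration, not an evaluation protocol from the paper: mean absolute frame difference is swapped in as a crude stand-in for average optical flow, the fidelity and quality checks are omitted, and the names `motion_proxy` and `paired_deltas` are hypothetical.

```python
from statistics import mean

def motion_proxy(frames):
    """Mean absolute pixel change between consecutive frames.

    A crude stand-in for average optical-flow magnitude; a real evaluation
    would use an optical-flow estimator instead.
    """
    diffs = [
        mean(abs(a - b) for a, b in zip(prev, nxt))
        for prev, nxt in zip(frames, frames[1:])
    ]
    return mean(diffs)

def paired_deltas(baseline_videos, treated_videos):
    """Per-seed change in the motion proxy (treated minus baseline)."""
    return [
        motion_proxy(treated) - motion_proxy(base)
        for base, treated in zip(baseline_videos, treated_videos)
    ]

# Toy data: each video is a list of frames, each frame a flat list of pixel values.
static_video = [[0.5, 0.5, 0.5, 0.5]] * 3
moving_video = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
deltas = paired_deltas([static_video], [moving_video])
```

Pairing by seed and reference image keeps scene content fixed, so a positive delta reflects the intervention and not a different sample.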

Figures

Figures reproduced from arXiv: 2605.19398 by Hae-Gon Jeon, Hyeonho Jeong, Sangeyl Lee, Seungho Park, Seunghyun Shin, Wooseok Jeon.

**Figure 1.** Example videos from our method. We present DyMoS, a training-free and model-agnostic method for improving motion dynamics in image-to-video generation. (a) Comparison of generated videos from the same input image. DyMoS produces dynamic motion while preserving video quality. (b) Furthermore, DyMoS provides continuous control over motion dynamics.

**Figure 2.** Reference-frame dominance in I2V self-attention. (a) Qualitative comparison between paired T2V and I2V generations. (b) Frame-to-frame self-attention difference map A_I2V − A_T2V, averaged over the first 10% of inference steps. Text prompts are sourced from T2V-CompBench [32], and the paired setup uses the same prompt and first frame for both models.

**Figure 3.** Modulating reference-frame dominance controls motion dynamics. (a) Absolute difference in reference-frame attention between the vanilla I2V model and the modulated I2V model with γ = 0.6, measured over non-reference query frames. (b) Dynamic Degree and Video Quality as γ varies. The yellow star denotes the Dynamic Degree of the paired T2V generation. (c) T2V–I2V attention distance D(γ) measured by Jensen–S…

**Figure 4.** Qualitative comparison with vanilla I2V baseline and ALG. The leftmost images are the reference images. Our method (DyMoS) produces substantially more dynamic motion than the vanilla baseline and ALG while preserving fidelity to the reference image. In contrast, ALG introduces motion at the cost of visible degradation.

**Figure 5.** Hyperparameter analysis and user study. (a) Effect of modulation strength γ. (b) Effect of modulation step ratio λ. (c) User study results.

**Figure 6.** Continuous control over static-to-dynamic generation with DyMoS. Rows from top to bottom correspond to γ ∈ {−2, −1, 0, 0.6, 1.0}, with γ = 0 denoting the baseline.

**Figure 7.** Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. Our method qualitatively outperforms the vanilla baseline and ALG across various cases, demonstrating superior motion dynamics and visual fidelity. (Top) The vanilla baseline and ALG exhibit static scenes of a man riding a mountain bike. In contrast, our method generates fluid and natural riding m…

**Figure 8.** Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. (Top) ALG produces highly static scenes where the crab barely moves. In contrast, our method successfully synthesizes vivid and realistic motions. (Bottom) The vanilla baseline produces physically weird motions, such as the bird flying backwards. While ALG generates physically plausible movements,…

**Figure 9.** Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. (Top) Both the vanilla baseline and ALG struggle to synthesize dynamic motions, resulting in rigid and unnatural movements of the man. In contrast, our method generates natural motions of the man hoisting a spear. (Middle) Unlike the other methods, DyMoS successfully generates a semantically align…

**Figure 10.** Additional examples of continuous control over motion dynamics with DyMoS. Rows from top to bottom correspond to γ ∈ {−2, −1, 0, 0.6, 1.0}, with γ = 0 denoting the baseline. (Top Left) The movement of the person riding a horse transitions naturally across the static-to-dynamic spectrum as the parameter increases. (Top Right) Our guidance smoothly scales the tiger’s walking speed from slow to fast. Similar…
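
Figure 3(c) reports a T2V–I2V attention distance D(γ). The caption is cut off after "Jensen–S", so reading it as the Jensen–Shannon divergence is an assumption here, as are the toy attention rows and the function names. Under that assumption, a distance of this kind could be computed as follows:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy frame-level attention rows; index 0 is the reference frame.
attn_t2v = [0.25, 0.25, 0.25, 0.25]
attn_i2v = [0.70, 0.10, 0.10, 0.10]
attn_i2v_rebalanced = [0.40, 0.20, 0.20, 0.20]

d_vanilla = js_divergence(attn_t2v, attn_i2v)
d_rebalanced = js_divergence(attn_t2v, attn_i2v_rebalanced)
```

In this toy setup the rebalanced row sits closer to the T2V row than the unmodified I2V row does, which is the qualitative trend a D(γ) curve would track.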

Original abstract

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper pins static outputs in I2V on reference-frame dominance in attention and offers a simple training-free reweighting fix called DyMoS.

The main thing to know is that they trace motion suppression to non-reference frames over-attending to reference-frame keys in self-attention, which over-propagates static information, and they correct it with DyMoS by rebalancing that pathway only during early denoising steps.

What stands out is how direct the intervention is. It requires no training, no weight changes, and no alteration to the input image. A single scalar gives continuous control over motion strength, and they report it works across several current I2V backbones while preserving visual quality and reference fidelity. That combination of minimal overhead and cross-model applicability is the practical contribution.

The soft spots are limited but worth noting. The argument rests on their attention observations as the primary driver, and the fix is applied selectively in the initial steps. If the full experiments include clear quantitative motion metrics, attention visualizations, and comparisons showing the gains are not just from reduced conditioning strength, the case holds up. The method is low-risk because it is so lightweight, but scene-dependent variation or edge cases where motion increases at the expense of coherence would be useful to see.

This is for people working on image-to-video generation who need better motion from existing models without retraining. A reader who wants a reproducible inference-time tweak will find the method and the attention framing directly usable.

It deserves peer review. The problem is real, the solution is clean and verifiable, and the analysis provides a concrete handle even if other factors also affect motion.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that image-to-video (I2V) diffusion models produce overly static outputs because non-reference frames over-allocate self-attention to reference-frame key tokens, over-propagating static reference information across time. The authors introduce DyMoS (Dynamic Motion Slider), a training-free, model-agnostic inference-time intervention that rebalances the attention generated frames pay to the reference frame, controlled by a scalar parameter and applied only during the initial denoising steps. The method leaves model weights and the conditioning image unchanged and exposes a single continuous parameter for motion strength. Experiments on multiple state-of-the-art I2V backbones are reported to show improved inter-frame dynamics while preserving visual quality and reference fidelity.

Significance. If the attention observation and the reported gains hold, the work supplies a lightweight, immediately deployable remedy for a widespread practical limitation in I2V generation. Its training-free and model-agnostic character, together with the continuous scalar control, distinguishes it from prior approaches that require retraining or trade off fidelity. The mechanistic account of reference-frame dominance could also inform architectural refinements in future video diffusion models.

major comments (2)

[attention analysis / method description] The central claim that reference-frame dominance is the primary driver of motion suppression (stated in the abstract and the opening of the method section) rests on an observational attention analysis. The manuscript does not supply quantitative statistics, cross-model attention histograms, or causal ablations that isolate this mechanism from other factors such as overall conditioning strength; without such evidence the attribution remains correlational and load-bearing for the proposed fix.
[DyMoS subsection] DyMoS restricts the rebalancing scalar to the initial denoising steps (described in the DyMoS subsection). No ablation is presented on the precise cutoff (fixed timestep count versus noise-level threshold), nor on whether applying the scalar throughout denoising would degrade fine-detail synthesis or reference fidelity; this choice is therefore not yet shown to be optimal.
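
The two cutoff rules contrasted in the second major comment can be sketched in a few lines. Everything here is illustrative: the function names, the 0.25 step ratio, the 0.78 threshold, and both noise schedules are assumptions, not values from the paper.

```python
def active_by_step_count(step, num_steps, step_ratio):
    """Gate on schedule position: active for the first `step_ratio` fraction of steps."""
    return step < step_ratio * num_steps

def active_by_noise_level(sigma, sigma_threshold):
    """Gate on noise level: active while the current noise level is at or above a threshold."""
    return sigma >= sigma_threshold

def linear_sigmas(num_steps):
    """Hypothetical schedule: noise level falls linearly from 1.0 toward 0.0."""
    return [1.0 - i / num_steps for i in range(num_steps)]

def shifted_sigmas(num_steps, shift=3.0):
    """Hypothetical time-shifted schedule that spends more steps at high noise."""
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in linear_sigmas(num_steps)]

num_steps = 20
by_count = [active_by_step_count(i, num_steps, 0.25) for i in range(num_steps)]
by_noise_linear = [active_by_noise_level(s, 0.78) for s in linear_sigmas(num_steps)]
by_noise_shifted = [active_by_noise_level(s, 0.78) for s in shifted_sigmas(num_steps)]
```

On the linear schedule both gates cover the same five steps, while on the shifted schedule the noise-level gate stays active for ten. The two rules are therefore only interchangeable for a fixed sampler, which is why the requested ablation would be informative.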

minor comments (2)

[method] The single scalar parameter is introduced without an explicit equation showing how it modulates the attention matrix (e.g., scaling the reference-frame key contributions). Adding this equation would improve reproducibility.
[experiments / figures] Figure captions and axis labels for any attention visualizations or motion-metric plots should explicitly state the number of samples and random seeds used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation for minor revision. We address each major comment below and will make the corresponding changes to the manuscript.

Referee: [attention analysis / method description] The central claim that reference-frame dominance is the primary driver of motion suppression (stated in the abstract and the opening of the method section) rests on an observational attention analysis. The manuscript does not supply quantitative statistics, cross-model attention histograms, or causal ablations that isolate this mechanism from other factors such as overall conditioning strength; without such evidence the attribution remains correlational and load-bearing for the proposed fix.

Authors: We acknowledge that the attention analysis presented is observational. In the revised manuscript we will add quantitative statistics (mean and standard deviation of attention scores allocated to reference-frame tokens versus non-reference tokens) together with cross-model attention histograms. While a full causal ablation isolating the mechanism from conditioning strength would require retraining and therefore falls outside the training-free scope of the work, we will include supplementary experiments that modulate conditioning strength to provide further supporting evidence. These additions will strengthen the mechanistic claim. revision: yes
Referee: [DyMoS subsection] DyMoS restricts the rebalancing scalar to the initial denoising steps (described in the DyMoS subsection). No ablation is presented on the precise cutoff (fixed timestep count versus noise-level threshold), nor on whether applying the scalar throughout denoising would degrade fine-detail synthesis or reference fidelity; this choice is therefore not yet shown to be optimal.

Authors: We agree that an ablation study on the application window is needed. In the revision we will add experiments comparing fixed-timestep cutoffs against noise-level thresholds and will also report results obtained when the scalar is applied across the entire denoising trajectory, quantifying any effects on fine-detail synthesis and reference fidelity. These results will justify the current design choice of restricting the intervention to the initial steps. revision: yes
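
The statistic promised in the first response (mean and standard deviation of attention allocated to reference-frame tokens) could be computed along these lines. This is a sketch with toy numbers; the function names and the convention that reference tokens occupy the first `ref_len` positions are assumptions.

```python
from statistics import mean, stdev

def reference_mass(attn_row, ref_len):
    """Fraction of one query's attention that lands on the first `ref_len` (reference) keys."""
    return sum(attn_row[:ref_len])

def summarize_reference_mass(attn_rows, ref_len):
    """Mean and sample standard deviation of reference mass over non-reference queries."""
    masses = [reference_mass(row, ref_len) for row in attn_rows[ref_len:]]
    return mean(masses), stdev(masses)

# Toy attention matrix: row 0 is the reference token, rows 1-3 are generated-frame queries.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
avg_mass, spread = summarize_reference_mass(attn, ref_len=1)
uniform_share = 1 / len(attn[0])  # share the reference would get under uniform attention
```

Comparing `avg_mass` against `uniform_share` gives a simple, scale-free indication of how far the reference frame is over-attended.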

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central contribution is an observational identification of reference-frame dominance in attention maps of I2V models, followed by a training-free inference-time intervention (DyMoS) that applies a single scalar reweighting to attention pathways during early denoising steps. No derivation chain, equations, or fitted parameters are described that reduce by construction to the inputs; the method is presented as a heuristic fix whose effect is externally verifiable on multiple backbones while leaving model weights and reference image unchanged. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that reference-frame dominance is the key mechanism, plus one tunable scalar for motion control; no invented entities or additional free parameters are described.

free parameters (1)

motion strength scalar
Single scalar parameter introduced for continuous control over motion strength.

axioms (1)

domain assumption: Reference-frame dominance via excessive self-attention to reference key tokens is the primary mechanism suppressing motion in I2V models
This observation underpins both the diagnosis and the design of DyMoS.

pith-pipeline@v0.9.1-grok · 5731 in / 1195 out tokens · 37034 ms · 2026-06-30T18:30:55.346168+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TriMotion: Modality-Agnostic Camera Control for Video Generation
cs.CV 2026-06 unverdicted novelty 6.0

TriMotion is a modality-agnostic framework that maps video, pose, and text descriptions of the same camera trajectory into a shared motion embedding space, trained with a new triplet dataset and latent consistency obj...

Reference graph

Works this paper leans on

54 extracted references · 16 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[2] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS
[3] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[5] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[6] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t
[7] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=li7qeBbCR1t
[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.
[9] Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/2511.18870
[10] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[11] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Represen...
[12] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[13] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022.
[14] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.
[15] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[16] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Fx2SbBgcte
[17] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[18] Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18444–18455. IEEE, 2023.
[19] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[20] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.
[21] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024.
[22] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023.
[23] Hyeonho Jeong, Chun-Hao P Huang, Jong Chul Ye, Niloy J Mitra, and Duygu Ceylan. Track4gen: Teaching video diffusion models to track points improves video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7276–7287, 2025.
[24] Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, and Jun Zhu. Identifying and solving conditional image leakage in image-to-video diffusion model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=o9Lkiv1qpc
[25] Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, and Li Yuan. Flashi2v: Fourier-guided latent shifting prevents conditional image leakage in image-to-video generation. arXiv preprint arXiv:2509.25187, 2025.
[26] June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, and Kimin Lee. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance. arXiv preprint arXiv:2506.08456, 2025.
[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
[28] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
[29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
[31] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[32] Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In Forty-second International Conference on Machine Learning. URL https://openreview.net/forum?id=j8Vr3E3vhy
[34] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025.
[35] Byungjun Kim, Soobin Um, and Jong Chul Ye. Motioncfg: Boosting motion dynamics via stochastic concept perturbation. arXiv preprint arXiv:2603.14073, 2026.
[36] Wooseok Jeon, Seunghyun Shin, Dongmin Shin, and Hae-Gon Jeon. Motion prior distillation in time reversal sampling for generative inbetweening. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GRElsj9W2t
[37] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[38] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 2002.
[39] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024.
[40]

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.n...

2024
[41]

Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026

2026
[42]

Perception encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. In The...

2026
[43]

Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models.Advances in Neural Information Processing Systems, 37:65618–65642, 2024

Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models.Advances in Neural Information Processing Systems, 37:65618–65642, 2024

2024
[44]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[45]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020. 12

2020
[46]

Memory-v2v: Memory-augmented video-to-video diffusion for consistent multi-turn editing.arXiv preprint arXiv:2601.16296, 2026

Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, and Hyeonho Jeong. Memory-v2v: Augmenting video-to-video diffusion models with memory.arXiv preprint arXiv:2601.16296, 2026

work page arXiv 2026
[47]

Amazon mechanical turk: A research tool for organizations and information systems scholars

Kevin Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. InShaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference, Tampa, FL, USA, December 13-14, 2012. Proceedings, 2012

2012
[48]

Video color grading via look-up table generation

Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, and Joon-Young Lee. Video color grading via look-up table generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19141–19152, 2025

2025
[49]

Close imitation of expert retouching for black-and-white photography

Seunghyun Shin, Jisu Shin, Jihwan Bae, Inwook Shim, and Hae-Gon Jeon. Close imitation of expert retouching for black-and-white photography. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25037–25046, June 2024

2024
[50]

Kinetic typography diffusion model

Seonmi Park, Inhwan Bae, Seunghyun Shin, and Hae-Gon Jeon. Kinetic typography diffusion model. InEuropean Conference on Computer Vision, pages 166–185. Springer, 2024

2024
[51]

Universal image immunization against diffusion-based image editing via semantic injection.arXiv preprint arXiv:2602.14679, 2026

Chanhui Lee, Seunghyun Shin, Donggyu Choi, Hae-gon Jeon, and Jeany Son. Universal image immunization against diffusion-based image editing via semantic injection.arXiv preprint arXiv:2602.14679, 2026

work page internal anchor Pith review arXiv 2026
[52]

Reangle-a-video: 4d video generation as video-to-video translation

Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation. In2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11164–11175. IEEE, 2025. 13 A Full algorithm of DyMoS Algorithm 1DyMoS Input: Reference image Iref, text prompt c, total inference steps N, guidance scale ω, modulation ...

2025
[53]

Motion:Which video has the most dynamic and realistic motion? Examples include water ripples, cloth movement, human action, and camera motion
[54]

Fidelity:Which video best preserves the appearance of the reference image throughout the sequence? Examples include the subject, background, and colors. 3.Text alignment:Which video most faithfully reflects the content described in the text prompt? 4.Overall preference:Overall, which video do you prefer? We collect 30 responses for each question over 25 r...

[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[2] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

[3] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.

[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.

[5] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[6] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

[7] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=li7qeBbCR1t

[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.

[9] Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/2511.18870

[10] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[11] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Represen... (2025)

[12] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[13] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022.

[14] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

[15] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

[16] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Fx2SbBgcte

[17] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[18] Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18444–18455. IEEE, 2023.

[19] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

[20] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

[21] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024.

[22] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023.

[23] Hyeonho Jeong, Chun-Hao P Huang, Jong Chul Ye, Niloy J Mitra, and Duygu Ceylan. Track4gen: Teaching video diffusion models to track points improves video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7276–7287, 2025.

[24] Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, and Jun Zhu. Identifying and solving conditional image leakage in image-to-video diffusion model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=o9Lkiv1qpc

[25] Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, and Li Yuan. Flashi2v: Fourier-guided latent shifting prevents conditional image leakage in image-to-video generation. arXiv preprint arXiv:2509.25187, 2025.

[26] June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, and Kimin Lee. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance. arXiv preprint arXiv:2506.08456, 2025.

[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[28] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.

[29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[31] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[32] Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=j8Vr3E3vhy

[34] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025.

[35] Byungjun Kim, Soobin Um, and Jong Chul Ye. Motioncfg: Boosting motion dynamics via stochastic concept perturbation. arXiv preprint arXiv:2603.14073, 2026.

[36] Wooseok Jeon, Seunghyun Shin, Dongmin Shin, and Hae-Gon Jeon. Motion prior distillation in time reversal sampling for generative inbetweening. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GRElsj9W2t

[37] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[38] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.

[39] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024.

[40] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.n...

[41] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026.

[42] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. In The... (2026)

[43] Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024.

[44] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[45] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020.

[46] Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, and Hyeonho Jeong. Memory-v2v: Augmenting video-to-video diffusion models with memory. arXiv preprint arXiv:2601.16296, 2026.

[47] Kevin Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference, Tampa, FL, USA, December 13-14, 2012. Proceedings, 2012.

[48] Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, and Joon-Young Lee. Video color grading via look-up table generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19141–19152, 2025.

[49] Seunghyun Shin, Jisu Shin, Jihwan Bae, Inwook Shim, and Hae-Gon Jeon. Close imitation of expert retouching for black-and-white photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25037–25046, June 2024.

[50] Seonmi Park, Inhwan Bae, Seunghyun Shin, and Hae-Gon Jeon. Kinetic typography diffusion model. In European Conference on Computer Vision, pages 166–185. Springer, 2024.

[51] Chanhui Lee, Seunghyun Shin, Donggyu Choi, Hae-gon Jeon, and Jeany Son. Universal image immunization against diffusion-based image editing via semantic injection. arXiv preprint arXiv:2602.14679, 2026.

[52] Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation. In 2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11164–11175. IEEE, 2025.

A Full algorithm of DyMoS

Algorithm 1 DyMoS
Input: Reference image I_ref, text prompt c, total inference steps N, guidance scale ω, modulation ...
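
The attention rebalancing that DyMoS applies during the initial denoising steps can be illustrated with a toy example. The sketch below is not the authors' algorithm: it assumes that rebalancing is done by adding log(scale) to the pre-softmax scores that non-reference queries assign to reference-frame keys, and the names `rebalanced_attention`, `scale_for_step`, `scale`, and `early_steps` are hypothetical. A `scale` of 1.0 leaves attention unchanged, and smaller values shift attention mass away from the reference frame.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(logits, ref_keys, ref_queries, scale):
    """Softmax attention with reference-key scores damped for non-reference queries.

    logits:      list of rows, one row of pre-softmax scores per query token
    ref_keys:    set of key indices that belong to the reference frame
    ref_queries: set of query indices that belong to the reference frame
                 (their rows are left untouched)
    scale:       hypothetical scalar > 0; 1.0 means no rebalancing
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    bias = math.log(scale)  # multiplies the unnormalized weight by `scale`
    weights = []
    for q, row in enumerate(logits):
        if q in ref_queries:
            weights.append(softmax(row))
        else:
            shifted = [x + bias if k in ref_keys else x
                       for k, x in enumerate(row)]
            weights.append(softmax(shifted))
    return weights


def scale_for_step(step, early_steps, scale):
    """Apply the rebalancing scalar only during the first `early_steps` steps."""
    return scale if step < early_steps else 1.0


# Two queries over three keys; key 0 and query 0 belong to the reference frame.
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
attn = rebalanced_attention(logits, {0}, {0}, scale_for_step(0, 3, 0.5))
```

In this toy setting the non-reference query still attends to the reference key, only less strongly, and every row remains a valid probability distribution because the softmax renormalizes after the shift.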

1. Motion: Which video has the most dynamic and realistic motion? Examples include water ripples, cloth movement, human action, and camera motion.
2. Fidelity: Which video best preserves the appearance of the reference image throughout the sequence? Examples include the subject, background, and colors.
3. Text alignment: Which video most faithfully reflects the content described in the text prompt?
4. Overall preference: Overall, which video do you prefer?

We collect 30 responses for each question over 25 r...