pith. machine review for the scientific record.

arxiv: 2605.10523 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Improving Human Image Animation via Semantic Representation Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human image animation · image-to-video generation · semantic representation alignment · diffusion models · structure alignment · identity consistency · depth estimation · face recognition

The pith

SemanticREPA improves human image animation by treating depth and face features as fixed supervision signals instead of input conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses persistent problems in image-to-video generation for humans, including limb twisting and facial distortion during long sequences or intensive motions. Existing methods add semantic representations such as dense poses or ID embeddings directly as conditions, which reduces flexibility and relies on pixel-level supervision that overlooks 3D geometry and temporal coherence. SemanticREPA instead trains separate alignment modules: one that matches structure representations from video latents to depth estimation features, and another that matches ID representations to face recognition features. These pretrained modules then supply fixed supervision to the diffusion model, rectifying structures for stability and refining identities in relevant regions. The result is higher quality on extended motions while preserving the model's original flexibility.
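To make the mechanism concrete, here is a minimal sketch of what supervision-as-alignment could look like during fine-tuning, assuming a PyTorch-style latent video diffusion setup; the module interfaces, conditioning keys, and loss weights are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def finetune_step(dit, struct_align, id_align, latents, cond,
                  scheduler, w_struct=0.5, w_id=0.5):
    """One fine-tuning step in which two frozen alignment modules add
    representation-level supervision on top of the usual denoising loss.
    All interfaces and weights here are assumptions, not the paper's."""
    b = latents.size(0)
    t = torch.randint(0, scheduler.num_train_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # The backbone returns its noise prediction plus intermediate features.
    pred_noise, hidden = dit(noisy, t, cond["ref_image"], return_hidden=True)
    loss_diff = F.mse_loss(pred_noise, noise)

    # Frozen structure module: latent features -> depth-like latents,
    # pulled toward depth targets precomputed by a video depth estimator.
    pred_depth = struct_align(hidden)
    loss_struct = F.mse_loss(pred_depth, cond["depth_latents"])

    # Frozen ID module: latent features (+ predicted structure) -> face embedding,
    # aligned to ArcFace features of the reference identity by cosine similarity.
    pred_id = id_align(hidden, pred_depth)
    loss_id = 1.0 - F.cosine_similarity(pred_id, cond["arcface"], dim=-1).mean()

    return loss_diff + w_struct * loss_struct + w_id * loss_id
```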

Core claim

SemanticREPA introduces representation alignment as supervision rather than conditioning: a structure alignment module is trained to align latent structure representations with video depth features, then frozen to supervise the diffusion process for coherent 3D geometry and temporal stability; simultaneously an ID alignment module aligns generated video identities to face recognition features, with predicted structures used to refine identity restoration in key regions.

What carries the argument

Semantic representation alignment via two fixed modules—one matching video latent structures to depth estimation features for geometric rectification, the other matching ID representations to face recognition features for consistency.
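A companion sketch of how the two modules might be pretrained before being frozen, following the pipeline summarized in Figure 2; the VAE, video depth estimator, and ArcFace encoder are treated as opaque callables, and the MSE and cosine losses are assumptions rather than the paper's stated objectives.

```python
import torch
import torch.nn.functional as F

def pretrain_step(struct_align, id_align, vae, depth_model, arcface, video, opt):
    """Illustrative joint pretraining step for the structure and ID alignment
    modules (the abstract does not specify the exact losses or architectures)."""
    with torch.no_grad():
        latents = vae.encode(video)                     # clean video latents
        depth_latents = vae.encode(depth_model(video))  # VAE-encoded video depth
        face_feats = arcface(video)                     # per-frame ArcFace features

    # (a) Structure alignment: predict depth latents from clean video latents.
    pred_depth = struct_align(latents)
    loss_struct = F.mse_loss(pred_depth, depth_latents)

    # (b) ID alignment: predict facial representations from video latents
    # concatenated with depth latents, aligned to ArcFace by cosine similarity.
    pred_face = id_align(torch.cat([latents, depth_latents], dim=1))
    loss_id = 1.0 - F.cosine_similarity(pred_face, face_feats, dim=-1).mean()

    loss = loss_struct + loss_id
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss_struct.detach(), loss_id.detach()
```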

If this is right

  • Human structures in generated videos become more coherent and stable across frames.
  • Character identity remains consistent even in extended sequences with large motions.
  • The diffusion model retains its original generation flexibility since semantic representations are not used as input conditions.
  • Predicted structure representations improve identity restoration in face and body regions.
  • The approach separates training of alignment from the main diffusion training, allowing reuse of the modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of alignment training from diffusion training could allow the same modules to supervise other video generation backbones without retraining them from scratch.
  • If depth and face features prove sufficient here, similar fixed-supervision alignment might extend to additional semantic signals such as optical flow or segmentation maps for further coherence gains.
  • Success would imply that representation-level supervision can substitute for conditioning in other conditional generation tasks where adding inputs reduces output variety.

Load-bearing premise

Training separate alignment modules on depth and face recognition features and then using the fixed modules as supervision will successfully add 3D geometric relationships and temporal coherence without creating new artifacts or limiting the model's flexibility.

What would settle it

Generate long videos of intensive human motions with the alignment supervision applied; if limb twisting, facial distortion, or identity drift remain at levels comparable to unaligned baselines, the claim that the modules impart the intended relationships fails.
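One way such a test could be scored is an identity-drift probe over generated frames; the function below is illustrative, with `arcface` standing in for any face-embedding callable, and it is not the paper's evaluation protocol.

```python
import torch
import torch.nn.functional as F

def identity_drift(frames, reference, arcface):
    """Cosine similarity between the reference identity embedding and each
    generated frame; a sustained drop signals identity drift. Illustrative
    probe only."""
    ref = F.normalize(arcface(reference.unsqueeze(0)), dim=-1)  # [1, D]
    emb = F.normalize(arcface(frames), dim=-1)                  # [T, D]
    sims = (emb @ ref.t()).squeeze(-1)                          # [T]
    return {"mean_sim": sims.mean().item(),
            "min_sim": sims.min().item(),
            "max_drift": (1.0 - sims).max().item()}
```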

Figures

Figures reproduced from arXiv: 2605.10523 by Chang Liu, Chen Ju, Haoning Wu, Jinsong Lan, Mengting Chen, Shuai Xiao, Yanfeng Wang, Yixuan Huang.

Figure 1
Figure 1: In the human image animation task, most existing image-to-video models exhibit issues like …
Figure 2
Figure 2: Alignment Module Pretraining Pipeline Overview. (a) The Structure Alignment Module takes clean video latents as input and outputs depth latents that align with the VAE-encoded RGB video depth. (b) The ID Alignment Module predicts facial representations based on video latents concatenated with depth latents, and aligns them with ArcFace features.
Figure 3
Figure 3: Diffusion Transformer Fine-tuning Pipeline Overview. With the assistance of the pretrained structure alignment module and ID alignment module, we apply additional supervision to the diffusion transformer fine-tuning through semantic representation alignment. We fix the two pretrained alignment modules and fine-tune only the diffusion transformer backbone.
Figure 4
Figure 4: Qualitative comparison with other baselines. Our method significantly outperforms other models in terms of human structure …
Figure 5
Figure 5: Qualitative comparison with other baselines. Our method significantly outperforms other models in terms of human structure …
Figure 6
Figure 6: Qualitative comparison with other baselines. Our method significantly outperforms other models in terms of human structure …
Figure 7
Figure 7: Qualitative comparison with other baselines. Our method significantly outperforms other models in terms of human structure …
Figure 8
Figure 8: Qualitative ablation comparison results. Our structure representation alignment supervision enables the model to produce …
Figure 9
Figure 9: Qualitative ablation comparison results. Our structure representation alignment supervision enables the model to produce …
Figure 10
Figure 10: Qualitative ablation comparison results. Our structure representation alignment supervision enables the model to produce …
Original abstract

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SemanticREPA, a method for human image animation that trains separate structure and ID alignment modules on depth estimation and face recognition features, then freezes them to provide supervision signals on the latent representations of a diffusion model. This is intended to enforce 3D geometric coherence and temporal stability for long videos and intensive motions while avoiding the flexibility loss associated with direct conditioning on semantic representations.

Significance. If the alignment modules successfully impart genuine 3D structure and frame-to-frame consistency without introducing artifacts or allowing shortcut solutions, the approach could meaningfully advance controllable human animation by decoupling supervision from conditioning. The use of representation-level alignment rather than pixel or direct conditioning is a potentially useful distinction from prior work.

major comments (3)
  1. [Abstract / Method description] The central claim that fixed, separately trained alignment modules enforce coherent 3D relationships and temporal coherence rests on an unverified assumption that the diffusion model cannot satisfy the supervision via shortcuts (e.g., depth-consistent but physically implausible poses or over-smoothed motion). No representation-level diagnostics, adversarial robustness checks, or joint fine-tuning procedure are described that would rule this out.
  2. [Abstract] The abstract states that the structure alignment module is trained to match video-latent structure representations to depth features and then used to supervise the diffusion model, yet provides no equations, loss formulations, or training details for either the alignment modules or the supervised diffusion objective. Without these, it is impossible to assess whether the supervision actually targets the claimed geometric and temporal properties.
  3. [Abstract] The claim of 'superior quality on extended character motions and enhanced character consistency' is presented without any quantitative results, ablation studies, or comparison tables in the provided abstract. The absence of metrics (e.g., FID, temporal consistency scores, user studies) makes the superiority assertion unverifiable from the given information.
minor comments (2)
  1. [Method] Clarify the exact architecture and input/output dimensions of the structure and ID alignment modules, including whether they operate on the same latent space as the diffusion model.
  2. [Method] The phrase 'use the predicted structure representations to refine identity restoration in relevant regions' is underspecified; provide the precise mechanism or loss term used for this refinement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful for the referee's insightful feedback, which has helped us improve the clarity and rigor of our work. We address each major comment in detail below.

Point-by-point responses
  1. Referee: [Abstract / Method description] The central claim that fixed, separately trained alignment modules enforce coherent 3D relationships and temporal coherence rests on an unverified assumption that the diffusion model cannot satisfy the supervision via shortcuts (e.g., depth-consistent but physically implausible poses or over-smoothed motion). No representation-level diagnostics, adversarial robustness checks, or joint fine-tuning procedure are described that would rule this out.

    Authors: We thank the referee for highlighting this important point regarding potential shortcut solutions. While our experiments demonstrate improved coherence through qualitative visualizations and comparisons, we agree that explicit diagnostics would strengthen the claim. In the revised manuscript, we have added representation-level analysis, including feature visualizations and comparisons to ground-truth depth maps, to show that the supervision enforces meaningful 3D structure rather than superficial consistency. We also discuss why joint fine-tuning was not pursued to maintain the decoupling of supervision and conditioning. revision: yes

  2. Referee: [Abstract] The abstract states that the structure alignment module is trained to match video-latent structure representations to depth features and then used to supervise the diffusion model, yet provides no equations, loss formulations, or training details for either the alignment modules or the supervised diffusion objective. Without these, it is impossible to assess whether the supervision actually targets the claimed geometric and temporal properties.

    Authors: The abstract provides a concise overview of the approach. Detailed equations for the alignment losses (L_struct and L_id) and the supervised diffusion objective are presented in Section 3 of the manuscript. To improve accessibility, we have revised the abstract to include a pointer to the method details and briefly mention the alignment losses used. revision: partial
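For readers without the full text, one plausible instantiation of the losses named in this response, consistent with the alignment targets described in the abstract but not taken from Section 3, is:

```latex
% Illustrative only; the definitions in Section 3 of the paper may differ.
\mathcal{L}_{\mathrm{struct}} = \left\lVert f_{\theta}(\mathbf{h}) - \mathbf{z}_{\mathrm{depth}} \right\rVert_2^2,
\qquad
\mathcal{L}_{\mathrm{id}} = 1 - \cos\!\left( g_{\phi}(\mathbf{h}, \hat{\mathbf{z}}_{\mathrm{depth}}),\; \mathbf{e}_{\mathrm{ArcFace}} \right),
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{s}\,\mathcal{L}_{\mathrm{struct}} + \lambda_{id}\,\mathcal{L}_{\mathrm{id}}.
```

Here h denotes intermediate diffusion features, z_depth the VAE-encoded depth latents, e_ArcFace the face recognition embedding, and λ_s, λ_id the supervision weights; all of these symbols are assumptions introduced for illustration.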

  3. Referee: [Abstract] The claim of 'superior quality on extended character motions and enhanced character consistency' is presented without any quantitative results, ablation studies, or comparison tables in the provided abstract. The absence of metrics (e.g., FID, temporal consistency scores, user studies) makes the superiority assertion unverifiable from the given information.

    Authors: The abstract summarizes the key findings, with supporting quantitative evidence, including FID scores, temporal consistency metrics, and user study results, provided in Section 4 of the paper. We have updated the abstract to reference these improvements more specifically, noting the gains in consistency metrics over baselines. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training procedure with independent external supervision signals

Full rationale

The paper presents an empirical method that pre-trains separate structure and ID alignment modules on external features (depth estimation and face recognition) and then freezes them to supervise a diffusion model. No equations, derivations, or self-referential definitions appear in the provided text that would make the claimed improvements equivalent to the inputs by construction. The procedure relies on independently trained modules and external benchmarks rather than any fitted parameter renamed as prediction or uniqueness imported via self-citation. The central claim of improved coherence therefore remains a testable empirical outcome rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or audited from the text.

pith-pipeline@v0.9.0 · 5529 in / 1025 out tokens · 35453 ms · 2026-05-12T04:32:27.897226+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages

  1. [1]

    Imagen 3

    Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, et al. Imagen 3.

  2. [2]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models, 2024

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models, 2024.

  3. [3]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, 2024.

  4. [4]

    Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023. Accessed: 2025-07-21.

  5. [5]

    Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.

  6. [6]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.

  7. [7]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025.

  8. [8]

    Motion-conditioned diffusion model for controllable video synthesis, 2023

    Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis, 2023.

  9. [9]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  10. [10]

    Seine: Short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In Proceedings of the International Conference on Learning Representations, 2023.

  11. [11]

    Livephoto: Real image animation with text-guided motion control

    Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, and Hengshuang Zhao. Livephoto: Real image animation with text-guided motion control. In Proceedings of the European Conference on Computer Vision.

  12. [12]

    Deconstructing denoising diffusion models for self-supervised learning

    Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He. Deconstructing denoising diffusion models for self-supervised learning. In Proceedings of the International Conference on Learning Representations, 2025.

  13. [13]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

  14. [14]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning, 2024.

  15. [15]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. In Proceedings of the European Conference on Computer Vision, 2024.

  16. [16]

    Imagen 2. https://deepmind.google/technologies/imagen-2/, 2023

    Google. Imagen 2. https://deepmind.google/technologies/imagen-2/, 2023. Accessed: 2025-07-21.

  17. [17]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In Proceedings of the International Conference on Learning Representations, 2024.

  18. [18]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. In Proceedings of the International Conference on Learning Representations, 2025.

  19. [19]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  20. [20]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

  22. [22]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In Proceedings of the International Conference on Learning Representations, 2023.

  23. [23]

    Image quality metrics: Psnr vs. ssim

    Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In International Conference on Pattern Recognition.

  24. [24]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024.

  25. [25]

    Track4gen: Teaching video diffusion models to track points improves video generation

    Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, and Duygu Ceylan. Track4gen: Teaching video diffusion models to track points improves video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025.

  26. [26]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024.

  27. [27]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, 2014.

  28. [28]

    Kling ai. https://klingai.com/, 2024

    KlingAI. Kling ai. https://klingai.com/, 2024. Accessed: 2025-07-21.

  29. [29]

    Return of unconditional generation: A self-supervised representation generation method

    Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. In Advances in Neural Information Processing Systems, 2024.

  30. [30]

    Image conductor: Precision control for interactive video synthesis

    Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, and Ying Shan. Image conductor: Precision control for interactive video synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

  31. [31]

    Guiding text-to-image diffusion model towards grounded generation

    Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Guiding text-to-image diffusion model towards grounded generation. In Proceedings of the International Conference on Computer Vision, 2023.

  32. [32]

    Physgen: Rigid-body physics-grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In Proceedings of the European Conference on Computer Vision, 2024.

  33. [33]

    Latte: Latent diffusion transformer for video generation. Transactions on Machine Learning Research, 2025

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. Transactions on Machine Learning Research, 2025.

  34. [34]

    Cinemo: Consistent and controllable image animation with motion diffusion models

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Cinemo: Consistent and controllable image animation with motion diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025.

  35. [35]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In Proceedings of the International Conference on Learning Representations, 2025.

  36. [36]

    A no-reference image blur metric based on the cumulative probability of blur detection (cpbd). IEEE Transactions on Image Processing

    Niranjan D Narvekar and Lina J Karam. A no-reference image blur metric based on the cumulative probability of blur detection (cpbd). IEEE Transactions on Image Processing.

  37. [37]

    Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model

    Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. In Proceedings of the European Conference on Computer Vision, 2024.

  38. [38]

    Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/. Accessed: 2025-07-21.

  39. [39]

  40. [40]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

  41. [41]

    Arc2face: A foundation model for id-consistent human faces

    Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, and Stefanos Zafeiriou. Arc2face: A foundation model for id-consistent human faces. In Proceedings of the European Conference on Computer Vision, 2024.

  42. [42]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the International Conference on Computer Vision, 2023.

  43. [43]

    Würstchen: An efficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In Proceedings of the International Conference on Learning Representations, 2024.

  44. [44]

    Movie gen: A cast of media foundation models, 2024

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models, 2024.

  45. [45]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 2021.

  46. [46]

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.

  47. [47]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, 2021.

  48. [48]

    Hierarchical text-conditional image generation with clip latents, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.

  49. [49]

    Consisti2v: Enhancing visual consistency for image-to-video generation. Transactions on Machine Learning Research, 2024

    Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. Transactions on Machine Learning Research, 2024.

  50. [50]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022.

  51. [51]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.

  52. [52]

    Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. SIGGRAPH, 2024

    Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. SIGGRAPH, 2024.

  53. [53]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, 2020.

  54. [54]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, 2020.

  55. [55]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. In Proceedings of the International Conference on Learning Representations.

  56. [56]

    Yolov8: A novel object detection algorithm with enhanced performance and robustness

    Rejin Varghese and M Sambath. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), 2024.

  57. [57]

    Modelscope text-to-video technical report, 2023

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report, 2023.

  58. [58]

    Videocomposer: Compositional video synthesis with motion controllability

    Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. In Advances in Neural Information Processing Systems, 2024.

  59. [59]

    Lavie: High-quality video generation with cascaded latent diffusion models, 2023

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models, 2023.

  60. [60]

    Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing.

  61. [61]

    Humanvid: Demystifying training data for camera-controllable human image animation

    Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, et al. Humanvid: Demystifying training data for camera-controllable human image animation. In Advances in Neural Information Processing Systems, 2024.

  62. [62]

    Denoising diffusion autoencoders are unified self-supervised learners

    Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the International Conference on Computer Vision, 2023.

  63. [63]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In Proceedings of the European Conference on Computer Vision.

  64. [64]

    Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.

  65. [65]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  66. [66]

    Unified dense prediction of video diffusion

    Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, and Ming-Hsuan Yang. Unified dense prediction of video diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025.

  67. [67]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In Proceedings of the International Conference on Learning Representations.

  68. [68]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In Proceedings of the International Conference on Learning Representations.

  69. [69]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  70. [70]

    I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models, 2023

    Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models, 2023.

  71. [71]

    Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance

    Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. In Proceedings of the International Conference on Machine Learning, 2025.

  72. [72]

    Diffree: Text-guided shape free object inpainting with diffusion model, 2024

    Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Rongrong Ji. Diffree: Text-guided shape free object inpainting with diffusion model, 2024.

  73. [73]

    Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2024

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2024. Accessed: 2025-07-21.

  74. [74]

    Champ: Controllable and consistent human image animation with 3d parametric guidance

    Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In Proceedings of the European Conference on Computer Vision, 2024.

  75. [75]

    Our experiments are conducted on 8 NVIDIA A100 GPUs

    Implementation Details: Our base model is CogVideoX 1.0, which uses T5 as the text encoder, with VAE compression ratios of 4 for temporal and 8×8 for spatial dimensions. Our experiments are conducted on 8 NVIDIA A100 GPUs. We use 8-bit Adam as the optimizer with a learning rate of 1×10^-5. Both the structure alignment module pretraining and diffusion transfor...

  76. [76]

    Qualitative Comparison with Other Baselines

    Qualitative Visualization, 8.1 Qualitative Comparison with Other Baselines: We conduct a qualitative comparison of our method against other baselines. As illustrated in Figure 4, Figure 5, Figure 6, and Figure 7, our proposed method demonstrates significantly better character consistency and human structure stability.