Recognition: no theorem link
Improving Human Image Animation via Semantic Representation Alignment
Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3
The pith
SemanticREPA improves human image animation by treating depth and face features as fixed supervision signals instead of input conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemanticREPA introduces representation alignment as supervision rather than conditioning. A structure alignment module is first trained to align latent structure representations with video depth features, then frozen and used to supervise the diffusion process toward coherent 3D geometry and temporal stability. In parallel, an ID alignment module aligns the identities in generated videos with face recognition features, and the predicted structure representations are used to refine identity restoration in key regions.
What carries the argument
Semantic representation alignment via two fixed modules—one matching video latent structures to depth estimation features for geometric rectification, the other matching ID representations to face recognition features for consistency.
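The mechanism can be sketched concretely. In the minimal sketch below (the names, shapes, and weight `lam` are illustrative assumptions, not taken from the paper), a frozen linear head maps diffusion features into the teacher's feature space, and a negative-cosine-similarity term is added to the denoising loss:

```python
import numpy as np

def cosine_align_loss(latents, targets, W):
    """Negative mean cosine similarity between projected diffusion
    features and frozen teacher features (depth or face ID)."""
    proj = latents @ W  # map latents into the teacher's feature space
    proj = proj / (np.linalg.norm(proj, axis=-1, keepdims=True) + 1e-8)
    tgt = targets / (np.linalg.norm(targets, axis=-1, keepdims=True) + 1e-8)
    return -np.mean(np.sum(proj * tgt, axis=-1))

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 64))      # per-token diffusion features
depth_feats = rng.normal(size=(16, 32))  # frozen depth-estimator features
W = rng.normal(size=(64, 32))            # alignment head: pretrained, then frozen

diffusion_loss = 0.5  # stand-in for the usual denoising loss
lam = 0.5             # supervision weight (a hypothetical hyperparameter)
total = diffusion_loss + lam * cosine_align_loss(latents, depth_feats, W)
```

Freezing `W` is what distinguishes this from conditioning: the depth features never enter the generator's inputs; they only shape its internal representations through the loss.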
If this is right
- Human structures in generated videos become more coherent and stable across frames.
- Character identity remains consistent even in extended sequences with large motions.
- The diffusion model retains its original generation flexibility since semantic representations are not used as input conditions.
- Predicted structure representations improve identity restoration in face and body regions.
- The approach separates training of alignment from the main diffusion training, allowing reuse of the modules.
Where Pith is reading between the lines
- The separation of alignment training from diffusion training could allow the same modules to supervise other video generation backbones without retraining them from scratch.
- If depth and face features prove sufficient here, similar fixed-supervision alignment might extend to additional semantic signals such as optical flow or segmentation maps for further coherence gains.
- Success would imply that representation-level supervision can substitute for conditioning in other conditional generation tasks where adding inputs reduces output variety.
Load-bearing premise
Training separate alignment modules on depth and face recognition features and then using the fixed modules as supervision will successfully add 3D geometric relationships and temporal coherence without creating new artifacts or limiting the model's flexibility.
What would settle it
Generate long videos of intensive human motions with the alignment supervision applied; if limb twisting, facial distortion, or identity drift remain at levels comparable to unaligned baselines, the claim that the modules impart the intended relationships fails.
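One concrete protocol for the identity-drift half of this test (a sketch; the embedding source, e.g. an ArcFace-style face encoder, is an assumption): embed the reference face and every generated frame, then track how far per-frame cosine similarity falls from the reference.

```python
import numpy as np

def identity_drift(ref_embed, frame_embeds):
    """Cosine similarity of each generated frame's face embedding to the
    reference image's embedding; the minimum tracks the worst frame."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    sims = unit(frame_embeds) @ unit(ref_embed)
    return float(sims.mean()), float(sims.min())

rng = np.random.default_rng(1)
ref = rng.normal(size=512)                        # stand-in face embedding
frames = ref + 0.1 * rng.normal(size=(120, 512))  # mildly perturbed frames
mean_sim, worst_sim = identity_drift(ref, frames)
```

A `worst_sim` floor no higher than unaligned baselines on long, high-motion clips would count against the claim; a consistently higher floor would support it.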
Original abstract
The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SemanticREPA, a method for human image animation that trains separate structure and ID alignment modules on depth estimation and face recognition features, then freezes them to provide supervision signals on the latent representations of a diffusion model. This is intended to enforce 3D geometric coherence and temporal stability for long videos and intensive motions while avoiding the flexibility loss associated with direct conditioning on semantic representations.
Significance. If the alignment modules successfully impart genuine 3D structure and frame-to-frame consistency without introducing artifacts or allowing shortcut solutions, the approach could meaningfully advance controllable human animation by decoupling supervision from conditioning. The use of representation-level alignment rather than pixel or direct conditioning is a potentially useful distinction from prior work.
Major comments (3)
- [Abstract / Method description] The central claim that fixed, separately trained alignment modules enforce coherent 3D relationships and temporal coherence rests on an unverified assumption that the diffusion model cannot satisfy the supervision via shortcuts (e.g., depth-consistent but physically implausible poses or over-smoothed motion). No representation-level diagnostics, adversarial robustness checks, or joint fine-tuning procedure are described that would rule this out.
- [Abstract] The abstract states that the structure alignment module is trained to match video-latent structure representations to depth features and then used to supervise the diffusion model, yet provides no equations, loss formulations, or training details for either the alignment modules or the supervised diffusion objective. Without these, it is impossible to assess whether the supervision actually targets the claimed geometric and temporal properties.
- [Abstract] The claim of 'superior quality on extended character motions and enhanced character consistency' is presented without any quantitative results, ablation studies, or comparison tables in the provided abstract. The absence of metrics (e.g., FID, temporal consistency scores, user studies) makes the superiority assertion unverifiable from the given information.
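For concreteness, one plausible REPA-style formulation of the missing objectives (our assumption; the provided text states no equations) pairs frozen alignment heads $h_{\phi}$ and $g_{\psi}$ with teacher encoders:

```latex
\mathcal{L}_{\mathrm{struct}} = -\,\mathbb{E}_{t}\!\left[\cos\!\left(h_{\phi}(z_t),\, f_{\mathrm{depth}}(x)\right)\right],
\qquad
\mathcal{L}_{\mathrm{id}} = -\,\mathbb{E}_{t}\!\left[\cos\!\left(g_{\psi}(z_t),\, f_{\mathrm{face}}(x)\right)\right],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{s}\,\mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}.
```

Here $z_t$ is the diffusion latent at timestep $t$ and $f_{\mathrm{depth}}, f_{\mathrm{face}}$ are the frozen depth and face-recognition encoders; whether the paper uses cosine similarity, another distance, or different weighting is precisely what the referee asks to be stated.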
Minor comments (2)
- [Method] Clarify the exact architecture and input/output dimensions of the structure and ID alignment modules, including whether they operate on the same latent space as the diffusion model.
- [Method] The phrase 'use the predicted structure representations to refine identity restoration in relevant regions' is underspecified; provide the precise mechanism or loss term used for this refinement.
Simulated Author's Rebuttal
We are grateful for the referee's insightful feedback, which has helped us improve the clarity and rigor of our work. We address each major comment in detail below.
Point-by-point responses
Referee: [Abstract / Method description] The central claim that fixed, separately trained alignment modules enforce coherent 3D relationships and temporal coherence rests on an unverified assumption that the diffusion model cannot satisfy the supervision via shortcuts (e.g., depth-consistent but physically implausible poses or over-smoothed motion). No representation-level diagnostics, adversarial robustness checks, or joint fine-tuning procedure are described that would rule this out.
Authors: We thank the referee for highlighting this important point regarding potential shortcut solutions. While our experiments demonstrate improved coherence through qualitative visualizations and comparisons, we agree that explicit diagnostics would strengthen the claim. In the revised manuscript, we have added representation-level analysis, including feature visualizations and comparisons to ground-truth depth maps, to show that the supervision enforces meaningful 3D structure rather than superficial consistency. We also explain that joint fine-tuning was deliberately not pursued, in order to preserve the decoupling of supervision from conditioning. revision: yes
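One representation-level diagnostic of the kind mentioned in this response can be sketched as a linear probe (a hypothetical check, not described in the paper): regress ground-truth depth from the structure features and inspect R^2.

```python
import numpy as np

def depth_probe_r2(features, depth):
    """Fit a least-squares linear probe from structure features to
    per-patch depth and return R^2 on the same data; a low score
    suggests the features do not actually encode geometry."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, depth, rcond=None)
    resid = depth - X @ w
    return 1.0 - np.sum(resid**2) / np.sum((depth - depth.mean()) ** 2)

rng = np.random.default_rng(2)
feats = rng.normal(size=(200, 16))  # stand-in structure features
depth = feats @ rng.normal(size=16) + 0.05 * rng.normal(size=200)
r2 = depth_probe_r2(feats, depth)   # high for this synthetic, depth-linear case
```

A shortcut solution (depth-consistent but geometry-free features) would show up as a probe score near chance on held-out frames.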
Referee: [Abstract] The abstract states that the structure alignment module is trained to match video-latent structure representations to depth features and then used to supervise the diffusion model, yet provides no equations, loss formulations, or training details for either the alignment modules or the supervised diffusion objective. Without these, it is impossible to assess whether the supervision actually targets the claimed geometric and temporal properties.
Authors: The abstract provides a concise overview of the approach. Detailed equations for the alignment losses (L_struct and L_id) and the supervised diffusion objective are presented in Section 3 of the manuscript. To improve accessibility, we have revised the abstract to include a pointer to the method details and briefly mention the alignment losses used. revision: partial
Referee: [Abstract] The claim of 'superior quality on extended character motions and enhanced character consistency' is presented without any quantitative results, ablation studies, or comparison tables in the provided abstract. The absence of metrics (e.g., FID, temporal consistency scores, user studies) makes the superiority assertion unverifiable from the given information.
Authors: The abstract summarizes the key findings, with supporting quantitative evidence, including FID scores, temporal consistency metrics, and user study results, provided in Section 4 of the paper. We have updated the abstract to reference these improvements more specifically, noting the gains in consistency metrics over baselines. revision: partial
Circularity Check
No circularity: empirical training procedure with independent external supervision signals
Full rationale
The paper presents an empirical method that pre-trains separate structure and ID alignment modules on external features (depth estimation and face recognition) and then freezes them to supervise a diffusion model. No equations, derivations, or self-referential definitions appear in the provided text that would make the claimed improvements equivalent to the inputs by construction. The procedure relies on independently trained modules and external benchmarks rather than any fitted parameter renamed as prediction or uniqueness imported via self-citation. The central claim of improved coherence therefore remains a testable empirical outcome rather than a tautology.
Supplementary excerpts
Implementation details. The base model is CogVideoX 1.0, which uses T5 as the text encoder, with VAE compression ratios of 4 for the temporal and 8×8 for the spatial dimensions. Experiments are conducted on 8 NVIDIA A100 GPUs, using 8-bit Adam as the optimizer with a learning rate of 1×10⁻⁵. Both the structure alignment module pretraining and diffusion transfor…
Qualitative comparison. As illustrated in Figure 4, Figure 5, Figure 6, and Figure 7, the proposed method demonstrates significantly better character consistency and human structure stability than other baselines.