One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
Pith reviewed 2026-05-21 18:21 UTC · model grok-4.3
The pith
One-to-All Animation enables high-fidelity character animation and pose transfer from references with arbitrary layouts by treating training as self-supervised outpainting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that reformulating training as a self-supervised outpainting task on diverse-layout references, together with a reference extractor, hybrid fusion attention, identity-robust pose control, and a token-replace strategy, produces a model capable of high-fidelity animation and pose transfer for references with arbitrary layouts, including those that are spatially misaligned or only partially visible, while avoiding identity loss and artifacts that limit prior methods.
What carries the argument
The self-supervised outpainting task that transforms diverse-layout references into a unified occluded-input format, which enables the model to generalize to misaligned and partial references.
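To make the phrase "unified occluded-input format" concrete, here is a minimal sketch of one way such an input could be built: a reference crop of arbitrary size is pasted at an arbitrary offset onto a fixed-size canvas, and a visibility mask marks which pixels the model would have to outpaint. The function name `to_occluded_input`, the canvas size, and the fill value are assumptions made for illustration. The paper's actual preprocessing is not described on this page.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def to_occluded_input(
    reference: Image,
    canvas_hw: Tuple[int, int],
    top_left: Tuple[int, int],
    fill: int = 0,
) -> Tuple[Image, Image]:
    """Paste `reference` onto a blank canvas and return (canvas, mask).

    mask[y][x] is 1 where the reference is visible and 0 where the model
    would have to outpaint. Pixels that fall outside the canvas are dropped,
    which mimics a partially visible reference. (Hypothetical helper.)
    """
    height, width = canvas_hw
    off_y, off_x = top_left
    canvas = [[fill] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for ry, row in enumerate(reference):
        for rx, pixel in enumerate(row):
            y, x = off_y + ry, off_x + rx
            if 0 <= y < height and 0 <= x < width:
                canvas[y][x] = pixel
                mask[y][x] = 1
    return canvas, mask


# A 2x2 reference placed so that its right column falls off a 3x3 canvas.
canvas, mask = to_occluded_input([[5, 6], [7, 8]], (3, 3), (1, 2))
visible_fraction = sum(map(sum, mask)) / (3 * 3)
```

Because every reference ends up as a canvas plus a mask, aligned, shifted, and cropped references all share one input shape, which is the property the review attributes to the reformulation.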
If this is right
- The model produces coherent long videos through the token-replace strategy.
- Identity features remain stable even when references are only partially visible.
- Pose control is decoupled from appearance to reduce overfitting to specific skeletal structures.
- Hybrid attention allows processing of inputs with varying resolutions and dynamic lengths.
- Overall generation quality exceeds that of methods restricted to aligned reference-pose pairs.
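The page does not say how the token-replace strategy works, so the following is only one plausible reading, sketched under stated assumptions: a long sequence is generated in fixed-size chunks, and the first few tokens of each new chunk are replaced by the last tokens of the previous chunk so that consecutive chunks share context. `generate_chunk`, `toy_chunk`, `chunk_len`, and `overlap` are all hypothetical names, and the toy generator merely counts upward in place of a real model.

```python
from typing import Callable, List, Sequence

Token = float


def generate_long_sequence(
    generate_chunk: Callable[[Sequence[Token], int], List[Token]],
    total_len: int,
    chunk_len: int,
    overlap: int,
) -> List[Token]:
    """Stitch chunks so each new chunk starts from the previous chunk's tail.

    `generate_chunk(prefix, n)` is a hypothetical model call that returns n
    tokens whose first len(prefix) entries have been replaced by `prefix`.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    output: List[Token] = []
    prefix: List[Token] = []
    while len(output) < total_len:
        chunk = generate_chunk(prefix, chunk_len)
        # Skip the replaced prefix: those tokens are already in `output`.
        output.extend(chunk[len(prefix):])
        prefix = chunk[-overlap:] if overlap else []
    return output[:total_len]


def toy_chunk(prefix: Sequence[Token], n: int) -> List[Token]:
    """Stand-in generator: continues counting from the last prefix token."""
    start = prefix[-1] + 1 if prefix else 0.0
    fresh = [start + i for i in range(n - len(prefix))]
    return list(prefix) + fresh


frames = generate_long_sequence(toy_chunk, total_len=10, chunk_len=4, overlap=1)
```

With the toy generator the stitched output is a gap-free count, which is the kind of continuity across chunk boundaries that a coherent long video would need.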
Where Pith is reading between the lines
- The same outpainting reformulation might be tested on other diffusion-based tasks that currently require aligned conditioning, such as text-to-image editing with loose spatial hints.
- If the reference extractor proves robust, it could support animation pipelines that ingest casual smartphone photos without manual cropping or alignment preprocessing.
- The decoupling of identity and pose might reduce the need for large paired datasets in future animation work.
Load-bearing premise
Reformulating training as a self-supervised outpainting task on diverse-layout references will produce a model that generalizes to real misaligned and partially visible inputs without introducing identity loss or artifacts.
What would settle it
Running the trained model on a set of real-world reference images that are spatially misaligned or cropped and observing consistent identity changes or visible artifacts in the generated animations would falsify the central claim.
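A falsification run of this kind could be scored roughly as follows. The sketch assumes that some identity encoder (not specified here) has already mapped the reference image and every generated frame to an embedding vector, and it flags frames whose cosine similarity to the reference drops below a threshold. The helper names and the threshold value are illustrative assumptions, not part of the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / norm


def drifting_frames(
    reference_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Indices of frames whose identity similarity falls below `threshold`."""
    return [
        i
        for i, emb in enumerate(frame_embs)
        if cosine_similarity(reference_emb, emb) < threshold
    ]


# Toy 2-D embeddings; real identity embeddings would be much larger.
reference = [1.0, 0.0]
frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
flagged = drifting_frames(reference, frames, threshold=0.5)  # assumed threshold
```

Consistently flagged frames across many misaligned or cropped references would count against the central claim. Isolated flags would be harder to interpret without a baseline.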
Original abstract
Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces One-to-All Animation, a unified diffusion-based framework for high-fidelity character animation and image pose transfer from references with arbitrary layouts. It reformulates training as a self-supervised outpainting task to convert diverse-layout references into a unified occluded-input format, adds a reference extractor for identity features from partially visible inputs, hybrid reference fusion attention for varying resolutions and sequence lengths, identity-robust pose control to decouple appearance from skeletal structure, and a token-replace strategy for coherent long-video generation. The authors claim that extensive experiments demonstrate outperformance over prior methods, with code and models released.
Significance. If the generalization claims hold, the work would meaningfully advance pose-driven animation by removing the requirement for spatially aligned reference-pose pairs, enabling practical use on real-world misaligned or partially occluded references. The self-supervised outpainting reformulation and identity-robust control are potentially reusable ideas for other conditional generation tasks.
major comments (3)
- [§3.2] §3.2 (Training reformulation): The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.
- [§4] §4 (Experiments): The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.
- [§3.4] §3.4 (Identity-robust pose control): The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.
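The distribution comparison requested for §3.2 could start from something as simple as the check below: take one layout statistic (crop ratio is used here), find the range covered by the synthetic training layouts, and report what fraction of real test cases falls inside it. The function name and the sample values are made up for illustration and are not drawn from the paper.

```python
from typing import Sequence, Tuple


def support_coverage(
    train_values: Sequence[float], test_values: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (train_min, train_max, fraction of test values in that range)."""
    if not train_values or not test_values:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(train_values), max(train_values)
    inside = sum(1 for v in test_values if lo <= v <= hi)
    return lo, hi, inside / len(test_values)


# Illustrative numbers only: fraction of the body visible in each reference.
synthetic_crop_ratios = [0.3, 0.5, 0.7, 0.9]
real_crop_ratios = [0.1, 0.4, 0.6, 0.95]
lo, hi, covered = support_coverage(synthetic_crop_ratios, real_crop_ratios)
```

Test cases that fall outside the range are the ones a shifted-distribution ablation would need to target. A range check ignores the shape of the two distributions, so it is a first filter rather than a full comparison.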
minor comments (2)
- [§3.3] Notation for the hybrid reference fusion attention and token-replace strategy should be introduced with explicit equations or pseudocode rather than high-level descriptions only.
- [Figure 5] Figure captions and axis labels in the qualitative results should explicitly state the reference layout type (e.g., 'extreme crop', 'partial occlusion') for each example to allow readers to assess the claimed robustness.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We appreciate the recognition of the potential impact of our alignment-free approach. We address each major comment below and have revised the manuscript to incorporate additional analyses, clarifications, and supporting evidence where appropriate.
- Referee: [§3.2] §3.2 (Training reformulation): The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.
  Authors: We agree that validating generalization to real misalignment distributions is critical for the alignment-free claim. In the revised manuscript, we have expanded §3.2 with a quantitative comparison of the synthetic layout distributions (reporting statistics on crop ratios, occlusion levels, and viewpoint variance) against real test cases from our evaluation datasets. We have also added an ablation that deliberately shifts the test distribution toward more extreme misalignments outside the training support and measures identity preservation via feature similarity scores. These additions support the robustness of the self-supervised reformulation.
  revision: yes
- Referee: [§4] §4 (Experiments): The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.
  Authors: The full manuscript already presents quantitative results in Section 4, including FID, LPIPS, identity similarity scores, baseline comparisons, dataset statistics, and ablation tables, along with user studies. To strengthen the presentation, we have added error bars from multiple runs and statistical significance tests (paired t-tests with p-values) in the revised version. The abstract summarizes the outperformance while directing readers to the detailed experiments.
  revision: yes
- Referee: [§3.4] §3.4 (Identity-robust pose control): The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.
  Authors: The identity-robust pose control is realized via an architectural constraint within the pose control module together with a dedicated loss term that encourages decoupling of appearance from skeletal structure; this is not primarily a data-augmentation schedule. We have clarified the exact mechanism in the revised §3.4. We have also added an ablation study that removes the pose-control module and quantifies identity drift using face and clothing consistency metrics, confirming its role in mitigating pose overfitting.
  revision: yes
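For readers unfamiliar with the paired t-test mentioned in the response on §4, the statistic itself is short to compute: take the per-sample difference between two methods' scores on the same test items, then divide the mean difference by its standard error. The sketch below uses only the Python standard library; the score lists are placeholders rather than results from the paper, and turning the statistic into a p-value additionally requires the t distribution with n - 1 degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder per-item scores for two methods on the same four test items.
method_a = [2.0, 4.0, 6.0, 8.0]
method_b = [1.0, 2.0, 3.0, 4.0]
t_value = paired_t_statistic(method_a, method_b)
```

Pairing matters here because both methods are evaluated on the same references and poses, so per-item difficulty cancels out in the differences.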
Circularity Check
No circularity: architectural choices presented as independent design decisions
The paper's core contributions consist of explicit methodological decisions—reformulating training as self-supervised outpainting on diverse-layout references, designing a reference extractor, integrating hybrid fusion attention, and introducing identity-robust pose control plus token replacement—rather than any derived quantities, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or uniqueness theorems are invoked that reduce to the paper's own inputs by construction; the framework is self-contained as a set of engineering choices whose validity rests on empirical generalization rather than definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Diffusion models trained via self-supervised outpainting on misaligned references will learn robust identity and pose representations.
Forward citations
Cited by 1 Pith paper
- SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
  SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.