
arxiv: 2511.22940 · v3 · pith:EWRCAILLnew · submitted 2025-11-28 · 💻 cs.CV

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

Pith reviewed 2026-05-21 18:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords character animation · pose transfer · diffusion models · self-supervised outpainting · reference alignment · image generation · video synthesis · identity preservation

The pith

One-to-All Animation enables high-fidelity character animation and pose transfer from references with arbitrary layouts by treating training as self-supervised outpainting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a unified framework that removes the requirement for spatial alignment between reference images and target poses in character animation and image pose transfer. It achieves this by reformulating the training process as a self-supervised outpainting task that converts diverse reference layouts into a consistent occluded format. A dedicated reference extractor captures complete identity features even from partially visible inputs, while hybrid reference fusion attention manages different resolutions and sequence lengths. Additional components decouple appearance from pose to reduce overfitting and use token replacement for consistent long-video output. If correct, the approach would let animation systems work directly with everyday photos that do not match the target pose or show the full body.

Core claim

The authors claim that reformulating training as a self-supervised outpainting task on diverse-layout references, together with a reference extractor, hybrid fusion attention, identity-robust pose control, and a token-replace strategy, produces a model capable of high-fidelity animation and pose transfer for references with arbitrary layouts, including those that are spatially misaligned or only partially visible, while avoiding identity loss and artifacts that limit prior methods.

What carries the argument

The self-supervised outpainting task carries the argument: it transforms diverse-layout references into a unified occluded-input format, which is what lets the model generalize to misaligned and partial references.
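As a rough illustration of the outpainting idea, the sketch below builds one self-supervised training sample from a single full frame: a randomly placed window stays visible, everything else is masked out, and the untouched frame serves as the reconstruction target. The function name `make_outpainting_sample`, the rectangular window, the zero fill value, and the size bounds are all assumptions made for illustration; the paper's actual preprocessing is not described in this summary.

```python
import random


def make_outpainting_sample(frame, rng, min_frac=0.3, max_frac=0.8, fill=0):
    """Build (occluded_input, mask, target) from one full frame.

    frame: 2D list of pixel values (rows of equal length).
    rng: random.Random instance, so sampling is reproducible.
    min_frac / max_frac: hypothetical bounds on the visible window size,
    expressed as a fraction of each frame dimension.
    """
    height, width = len(frame), len(frame[0])
    # Sample the visible window size, keeping at least one pixel.
    win_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    win_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    # Sample the window position so that it stays inside the frame.
    top = rng.randint(0, height - win_h)
    left = rng.randint(0, width - win_w)

    # mask is 1 inside the visible window and 0 in the region to outpaint.
    mask = [[1 if top <= r < top + win_h and left <= c < left + win_w else 0
             for c in range(width)] for r in range(height)]
    occluded = [[frame[r][c] if mask[r][c] else fill
                 for c in range(width)] for r in range(height)]
    # The target is the untouched frame: supervision comes from the data itself.
    return occluded, mask, frame


frame = [[r * 10 + c for c in range(8)] for r in range(8)]
occluded, mask, target = make_outpainting_sample(frame, random.Random(0))
```

Because the target is just the original frame, no paired or manually aligned data is needed to produce such samples, which is the property the summary attributes to the reformulation.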

If this is right

  • The model produces coherent long videos through the token replace strategy.
  • Identity features remain stable even when references are only partially visible.
  • Pose control is decoupled from appearance to reduce overfitting to specific skeletal structures.
  • Hybrid attention allows processing of inputs with varying resolutions and dynamic lengths.
  • Overall generation quality exceeds that of methods restricted to aligned reference-pose pairs.
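The token replace strategy in the first bullet can be pictured as follows. This is a minimal sketch of one plausible reading, assuming that at every denoising step the leading tokens of the chunk being generated are overwritten with context tokens carried over from the end of the previous chunk. The summary does not spell out the mechanism, and `denoise_step`, `generate_chunk`, and the toy update rule are hypothetical stand-ins, not the authors' implementation.

```python
def replace_context_tokens(tokens, context):
    """Return a copy of `tokens` whose first len(context) entries are `context`."""
    if len(context) > len(tokens):
        raise ValueError("context is longer than the token sequence")
    return list(context) + list(tokens[len(context):])


def denoise_step(tokens, step):
    """Hypothetical stand-in for one model denoising step (halves every value)."""
    return [t * 0.5 for t in tokens]


def generate_chunk(noisy_tokens, context, num_steps):
    """Denoise one chunk while pinning its leading tokens to the known context."""
    tokens = list(noisy_tokens)
    for step in range(num_steps):
        # Re-inject the clean context before each step so the model always
        # conditions on it, then denoise the whole sequence.
        tokens = replace_context_tokens(tokens, context)
        tokens = denoise_step(tokens, step)
    # Pin the context one last time so the chunk starts exactly where the
    # previous one ended.
    return replace_context_tokens(tokens, context)


previous_chunk = [1.0, 2.0, 3.0, 4.0]
context = previous_chunk[-2:]  # last two tokens of the previous chunk
chunk = generate_chunk([8.0, 8.0, 8.0, 8.0], context, num_steps=3)
print(chunk)  # [3.0, 4.0, 1.0, 1.0]
```

In this toy run the first two tokens of the new chunk equal the last two of the previous one, which is the continuity property a long-video scheme of this kind would rely on.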

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same outpainting reformulation might be tested on other diffusion-based tasks that currently require aligned conditioning, such as text-to-image editing with loose spatial hints.
  • If the reference extractor proves robust, it could support animation pipelines that ingest casual smartphone photos without manual cropping or alignment preprocessing.
  • The decoupling of identity and pose might reduce the need for large paired datasets in future animation work.

Load-bearing premise

Reformulating training as a self-supervised outpainting task on diverse-layout references will produce a model that generalizes to real misaligned and partially visible inputs without introducing identity loss or artifacts.

What would settle it

Running the trained model on a set of real-world reference images that are spatially misaligned or cropped and observing consistent identity changes or visible artifacts in the generated animations would falsify the central claim.
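One way to make "consistent identity changes" measurable is to compare an identity embedding of the reference image with embeddings of each generated frame. The sketch below assumes such embeddings already exist (for example from a face or person re-identification encoder, which is not specified here) and reports drift as one minus cosine similarity. The function names and the 0.3 threshold are illustrative assumptions, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_drift(reference_embedding, frame_embeddings):
    """Per-frame drift, defined here as 1 - cosine similarity to the reference."""
    return [1.0 - cosine_similarity(reference_embedding, f)
            for f in frame_embeddings]


def flag_drifting_frames(drifts, threshold=0.3):
    """Indices of frames whose drift exceeds a (hypothetical) threshold."""
    return [i for i, d in enumerate(drifts) if d > threshold]


# Toy embeddings: the third frame points in a different direction entirely.
reference = [1.0, 0.0, 0.0]
frames = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
drifts = identity_drift(reference, frames)
flagged = flag_drifting_frames(drifts)
```

Running the same measurement on aligned and deliberately misaligned references would show whether drift grows with misalignment, which is the pattern that would count against the central claim.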

Figures

Figures reproduced from arXiv: 2511.22940 by Chunli Peng, Jiangning Zhang, Jing Xu, Kai Hu, Lijing Lu, Shijun Shi, Xiaoda Yang, Zhihang Li.

Figure 1. We introduce One-to-All Animation, a unified framework for pose-driven personalized generation. Unlike prior methods that require both spatially-aligned references and pose retargeting, our framework supports: (1) cross-scale video animation with either retargeted or original driving motion, (2) cross-scale image pose transfer, and (3) temporally coherent long video generation.
Figure 2. Visual comparison under spatial-misaligned inputs.
Figure 3. Overview of the proposed framework. We introduce outpainting preprocess to handle diverse body proportions through face …
Figure 5. At each denoising timestep, context tokens from the last …
Figure 7. Human evaluation with current SOTA.
… across both datasets, indicating strong scalability and generalization. Qualitative results are shown in Fig. 6a. Evaluation on Image Dataset. For image pose transfer, following prior works [25, 30], we evaluate on 8,570 test pairs from the DeepFashion dataset. Previous methods typically perform inference at a low resolution of 512 × 352, which often leads to loss of …
Figure 6. Qualitative comparisons with state-of-the-art methods.
Figure 8. Qualitative comparison of different reference feature ex …
Figure 9. Qualitative ablation of Identity-Robust Pose Control.
Figure 11. Cartoon Dataset construction: (a) pose filtering and …
Figure 12. Visualization of region-weighted loss components.
Figure 14. Failure cases of crop-and-resize baseline.
Figure 15. Results from the 1.3B model showing that higher …
Figure 16. Our model enables prompt-based editing while maintaining identity and motion.
Figure 17. Additional Cartoon benchmark results.
Original abstract

Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces One-to-All Animation, a unified diffusion-based framework for high-fidelity character animation and image pose transfer from references with arbitrary layouts. It reformulates training as a self-supervised outpainting task to convert diverse-layout references into a unified occluded-input format, adds a reference extractor for identity features from partially visible inputs, hybrid reference fusion attention for varying resolutions and sequence lengths, identity-robust pose control to decouple appearance from skeletal structure, and a token-replace strategy for coherent long-video generation. The authors claim that extensive experiments demonstrate outperformance over prior methods, with code and models released.

Significance. If the generalization claims hold, the work would meaningfully advance pose-driven animation by removing the requirement for spatially aligned reference-pose pairs, enabling practical use on real-world misaligned or partially occluded references. The self-supervised outpainting reformulation and identity-robust control are potentially reusable ideas for other conditional generation tasks.

major comments (3)
  1. [§3.2] §3.2 (Training reformulation): The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.
  2. [§4] §4 (Experiments): The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.
  3. [§3.4] §3.4 (Identity-robust pose control): The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.
minor comments (2)
  1. [§3.3] Notation for the hybrid reference fusion attention and token-replace strategy should be introduced with explicit equations or pseudocode rather than high-level descriptions only.
  2. [Figure 5] Figure captions and axis labels in the qualitative results should explicitly state the reference layout type (e.g., 'extreme crop', 'partial occlusion') for each example to allow readers to assess the claimed robustness.
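The distribution comparison requested in major comment 1 could start from something as simple as the statistic below. It is a sketch under stated assumptions: each reference is summarised by a single "visible ratio" (area of the visible person box divided by the area of the full-body box), and the training support is approximated by the min-max range of that ratio over synthetic samples. Both the statistic and the helper names are hypothetical; the paper may characterise layouts differently.

```python
def box_area(box):
    """Area of an axis-aligned box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def visible_ratio(visible_box, full_box):
    """Fraction of the full-body box that is visible in the reference."""
    full = box_area(full_box)
    if full == 0:
        raise ValueError("full-body box has zero area")
    # Clip the visible box to the full-body box before measuring it.
    clipped = (max(visible_box[0], full_box[0]), max(visible_box[1], full_box[1]),
               min(visible_box[2], full_box[2]), min(visible_box[3], full_box[3]))
    return box_area(clipped) / full


def fraction_outside_support(train_ratios, test_ratios):
    """Share of test samples whose ratio falls outside the training range."""
    lo, hi = min(train_ratios), max(train_ratios)
    outside = [r for r in test_ratios if r < lo or r > hi]
    return len(outside) / len(test_ratios)


full_body = (0, 0, 100, 200)
train = [visible_ratio(b, full_body) for b in [(0, 0, 100, 100), (0, 0, 100, 160)]]
test = [visible_ratio(b, full_body)
        for b in [(0, 0, 100, 120), (0, 0, 100, 40), (0, 0, 50, 200), (0, 0, 100, 200)]]
shifted_share = fraction_outside_support(train, test)
print(shifted_share)  # 0.5
```

A large share of out-of-range test samples would indicate that the evaluation probes layouts the synthetic training distribution never covered, which is the situation the referee asks the authors to test explicitly.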

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We appreciate the recognition of the potential impact of our alignment-free approach. We address each major comment below and have revised the manuscript to incorporate additional analyses, clarifications, and supporting evidence where appropriate.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Training reformulation): The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.

    Authors: We agree that validating generalization to real misalignment distributions is critical for the alignment-free claim. In the revised manuscript, we have expanded §3.2 with a quantitative comparison of the synthetic layout distributions (reporting statistics on crop ratios, occlusion levels, and viewpoint variance) against real test cases from our evaluation datasets. We have also added an ablation that deliberately shifts the test distribution toward more extreme misalignments outside the training support and measures identity preservation via feature similarity scores. These additions support the robustness of the self-supervised reformulation.
    Revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.

    Authors: The full manuscript already presents quantitative results in Section 4, including FID, LPIPS, identity similarity scores, baseline comparisons, dataset statistics, and ablation tables, along with user studies. To strengthen the presentation, we have added error bars from multiple runs and statistical significance tests (paired t-tests with p-values) in the revised version. The abstract summarizes the outperformance while directing readers to the detailed experiments.
    Revision: yes

  3. Referee: [§3.4] §3.4 (Identity-robust pose control): The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.

    Authors: The identity-robust pose control is realized via an architectural constraint within the pose control module together with a dedicated loss term that encourages decoupling of appearance from skeletal structure; this is not primarily a data-augmentation schedule. We have clarified the exact mechanism in the revised §3.4. We have also added an ablation study that removes the pose-control module and quantifies identity drift using face and clothing consistency metrics, confirming its role in mitigating pose overfitting.
    Revision: yes
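The paired t-test mentioned in response 2 compares two methods on the same test items. As a reminder of what is being computed, the sketch below derives the t statistic from per-item score differences using only the standard library. The score lists are made-up numbers for illustration, not results from the paper, and converting the statistic to a p-value would additionally need the Student t distribution (for instance from a statistics package).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Illustrative per-item scores for two methods on the same four test items.
method_a = [0.25, 0.5, 0.75, 1.0]
method_b = [0.0, 0.5, 0.25, 0.25]
t, dof = paired_t_statistic(method_a, method_b)
```

Pairing matters because both methods are evaluated on identical references and poses, so item-level difficulty cancels out in the differences.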

Circularity Check

0 steps flagged

No circularity: architectural choices presented as independent design decisions

full rationale

The paper's core contributions consist of explicit methodological decisions—reformulating training as self-supervised outpainting on diverse-layout references, designing a reference extractor, integrating hybrid fusion attention, and introducing identity-robust pose control plus token replacement—rather than any derived quantities, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or uniqueness theorems are invoked that reduce to the paper's own inputs by construction; the framework is self-contained as a set of engineering choices whose validity rests on empirical generalization rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard diffusion model assumptions plus the novel training reformulation and modules introduced in the paper; no new physical entities or heavily fitted constants are described.

axioms (1)
  • Domain assumption: Diffusion models trained via self-supervised outpainting on misaligned references will learn robust identity and pose representations.
    This premise is invoked when the authors reformulate training as an outpainting task to handle arbitrary layouts.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1] Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, page 2, 2019.

  2. [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, 2023.

  3. [3] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.

  4. [4] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. arXiv preprint arXiv:2311.12052, 2023.

  5. [5] Chaofeng Chen. Iqa-pytorch: Pytorch toolbox for image quality assessment. https://github.com/chaofengc/IQA-PyTorch, 2022.

  6. [6] Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, et al. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025.

  7. [7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  8. [8] Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation. arXiv preprint arXiv:2502.04847, 2025.

  9. [9] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,

  10. [10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In 12th International Conference on Learning Representations, ICLR 2024, 2024.

  11. [11] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103,

  12. [12] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.

  13. [13] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221,

  14. [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

  15. [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  16. [16] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010.

  17. [17] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.

  18. [18] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.

  19. [19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

  20. [20] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  21. [21] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  22. [22] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13847–13858,

  23. [23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  24. [24] Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, et al. Lumina-video: Efficient and flexible video generation with multi-scale next-dit. arXiv preprint arXiv:2502.06782,

  25. [25] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025.

  26. [26] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025.

  27. [27] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025.

  28. [28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  29. [29] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.

  30. [30] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024.

  31. [31] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024.

  32. [32] PaddlePaddle. Paddleocr: Awesome multilingual ocr toolkits. https://github.com/PaddlePaddle/PaddleOCR, 2021.

  33. [33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  34. [34] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024.

  35. [35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

  36. [36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  37. [37] Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7690–7699,

  38. [38] Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, and Thomas H Li. Neural texture extraction and distribution for controllable person image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13535–13544, 2022.

  39. [39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  40. [40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.

  41. [41]

    Advancing pose-guided image synthesis with pro- gressive conditional diffusion models

    Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, and Yang Wei. Advancing pose-guided image synthesis with pro- gressive conditional diffusion models. InThe Twelfth Inter- national Conference on Learning Representations. 3

  42. [42]

    Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution

    Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7385–7395, 2025. 3

  43. [43]

    Deformable gans for pose-based human im- age generation

    Aliaksandr Siarohin, Enver Sangineto, St ´ephane Lathuiliere, and Nicu Sebe. Deformable gans for pose-based human im- age generation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416,

  44. [44]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

  45. [45]

    Denois- ing diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations. 3

  46. [46]

    Animate-x: Universal character image ani- mation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024

    Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, and Ming Yang. Animate-x: Universal character image ani- mation with enhanced motion representation.arXiv preprint arXiv:2410.10306, 2024. 2, 6, 8

  47. [47]

    Stableanimator: High- quality identity-preserving human image animation

    Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High- quality identity-preserving human image animation. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 21096–21106, 2025. 2, 3, 6 10

  48. [48]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 7

  49. [49]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 3, 6

  50. [50]

    Disco: Disentangled control for realistic human dance generation

    Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for realistic human dance generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9326–9336, 2024. 7, 2

  51. [51]

    Unianimate: Taming unified video diffusion models for consistent human image animation

    Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, and Nong Sang. Unianimate: Taming unified video diffusion models for consistent human image animation. arXiv preprint arXiv:2406.01188, 2024. 2

  52. [52]

    Unianimate-dit: Human image animation with large-scale video diffusion transformer

    Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, and Nong Sang. Unianimate-dit: Human image animation with large-scale video diffusion transformer. arXiv preprint arXiv:2504.11289, 2025. 2, 3, 6

  53. [53]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7

  54. [54]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024. 7

  55. [55]

    Effective whole-body pose estimation with two-stages distillation

    Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023. 1

  56. [56]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 3

  57. [57]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

  58. [58]

    Dwnet: Dense warp-based network for pose-guided human video generation

    Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139, 2019. 6

  59. [59]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 3

  60. [60]

    Exploring dual-task correlation for pose guided person image generation

    Pengze Zhang, Lingxiao Yang, Jian-Huang Lai, and Xiaohua Xie. Exploring dual-task correlation for pose guided person image generation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 7713–7722, 2022. 3

  61. [61]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 7

  62. [62]

    Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance

    Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680, 2024. 2, 3, 6, 8

  63. [63]

    Matrix-game: Interactive world foundation model

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025. 3

  64. [64]

    Champ: Controllable and consistent human image animation with 3d parametric guidance

    Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024. 6