One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

Chunli Peng; Jiangning Zhang; Jing Xu; Kai Hu; Lijing Lu; Shijun Shi; Xiaoda Yang; Zhihang Li

arxiv: 2511.22940 · v3 · pith:EWRCAILLnew · submitted 2025-11-28 · 💻 cs.CV

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang

Pith reviewed 2026-05-21 18:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords: character animation · pose transfer · diffusion models · self-supervised outpainting · reference alignment · image generation · video synthesis · identity preservation

The pith

One-to-All Animation enables high-fidelity character animation and pose transfer from references with arbitrary layouts by treating training as self-supervised outpainting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a unified framework that removes the requirement for spatial alignment between reference images and target poses in character animation and image pose transfer. It achieves this by reformulating the training process as a self-supervised outpainting task that converts diverse reference layouts into a consistent occluded format. A dedicated reference extractor captures complete identity features even from partially visible inputs, while hybrid reference fusion attention manages different resolutions and sequence lengths. Additional components decouple appearance from pose to reduce overfitting and use token replacement for consistent long-video output. If correct, the approach would let animation systems work directly with everyday photos that do not match the target pose or show the full body.

Core claim

The authors claim that reformulating training as a self-supervised outpainting task on diverse-layout references, together with a reference extractor, hybrid fusion attention, identity-robust pose control, and a token-replace strategy, produces a model capable of high-fidelity animation and pose transfer for references with arbitrary layouts, including those that are spatially misaligned or only partially visible, while avoiding identity loss and artifacts that limit prior methods.

What carries the argument

The self-supervised outpainting task carries the argument: it transforms diverse-layout references into a unified occluded-input format, which is what enables the model to generalize to misaligned and partial references.
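To make the "unified occluded-input format" concrete, here is a minimal sketch of how one self-supervised training pair might be built from a single frame. It is an illustration under stated assumptions, not the paper's preprocessing: the function name `make_occluded_input`, the rectangular `keep_box`, and the zero fill for hidden pixels are all hypothetical choices.

```python
import numpy as np

def make_occluded_input(frame, keep_box):
    """Hypothetical helper: keep only keep_box = (top, left, bottom, right) visible.

    Returns the occluded frame and a boolean mask (True = visible pixel).
    """
    top, left, bottom, right = keep_box
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = True
    # Hide everything outside the kept region (zero fill is an assumption).
    occluded = np.where(mask[..., None], frame, np.zeros_like(frame))
    return occluded, mask

rng = np.random.default_rng(0)
# Stand-in for a full-body training frame (height 64, width 48, RGB).
frame = rng.integers(1, 256, size=(64, 48, 3), dtype=np.uint8)
# Example: keep only an upper region, as if the reference showed head and shoulders.
occluded, mask = make_occluded_input(frame, keep_box=(0, 8, 24, 40))
# Self-supervised pair: the model would see `occluded` and be asked to reconstruct `frame`.
```

Because the target is the original frame itself, no aligned reference-pose pair has to be collected; how the visible region is chosen during training is not specified in this summary.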

If this is right

The model produces coherent long videos through the token replace strategy.
Identity features remain stable even when references are only partially visible.
Pose control is decoupled from appearance to reduce overfitting to specific skeletal structures.
Hybrid attention allows processing of inputs with varying resolutions and dynamic lengths.
Overall generation quality exceeds that of methods restricted to aligned reference-pose pairs.
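The token replace idea for long videos can be sketched as a chunked loop in which, at every denoising step, the leading tokens of the current chunk are overwritten with context tokens taken from the end of the previous chunk. This is a toy sketch only: `denoise_step` is a hypothetical stand-in for a real model update, and the chunk length, overlap, and the choice to pin clean (un-noised) context are assumptions rather than details taken from the paper.

```python
import numpy as np

def denoise_step(tokens, step):
    """Hypothetical stand-in for one denoising update of a real model."""
    return tokens * 0.5

def generate_chunk(context, chunk_len, dim, num_steps, rng):
    """Denoise one chunk while pinning its leading tokens to `context`."""
    tokens = rng.standard_normal((chunk_len, dim))
    k = 0 if context is None else len(context)
    for step in range(num_steps):
        if k:
            tokens[:k] = context  # token replace: overwrite with known context
        tokens = denoise_step(tokens, step)
    if k:
        tokens[:k] = context  # the emitted chunk keeps the context exactly
    return tokens

def generate_long(num_chunks, chunk_len=8, overlap=2, dim=4, num_steps=3, seed=0):
    """Generate chunks in sequence, carrying the last `overlap` tokens forward."""
    rng = np.random.default_rng(seed)
    chunks, context = [], None
    for _ in range(num_chunks):
        chunk = generate_chunk(context, chunk_len, dim, num_steps, rng)
        # Drop the overlapping tokens already emitted by the previous chunk.
        chunks.append(chunk if context is None else chunk[overlap:])
        context = chunk[-overlap:].copy()
    return np.concatenate(chunks)

video_tokens = generate_long(num_chunks=3)
```

Pinning the overlap at every step, rather than only at initialization, keeps each new chunk conditioned on content that has already been emitted, which is the intuition behind coherent long outputs.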

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same outpainting reformulation might be tested on other diffusion-based tasks that currently require aligned conditioning, such as text-to-image editing with loose spatial hints.
If the reference extractor proves robust, it could support animation pipelines that ingest casual smartphone photos without manual cropping or alignment preprocessing.
The decoupling of identity and pose might reduce the need for large paired datasets in future animation work.

Load-bearing premise

Reformulating training as a self-supervised outpainting task on diverse-layout references will produce a model that generalizes to real misaligned and partially visible inputs without introducing identity loss or artifacts.

What would settle it

Running the trained model on a set of real-world reference images that are spatially misaligned or cropped and observing consistent identity changes or visible artifacts in the generated animations would falsify the central claim.
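One way such a check could be quantified is to compare an identity embedding of the reference with embeddings of the generated frames. The sketch below assumes the embeddings already exist (for example from some face or person re-identification encoder, which is not specified here); the function names and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_drift(reference_embedding, frame_embeddings):
    """Return (mean, min) similarity between the reference and each generated frame."""
    sims = [cosine_similarity(reference_embedding, f) for f in frame_embeddings]
    return float(np.mean(sims)), float(np.min(sims))

# Toy embeddings: the first frame matches the reference, the second is orthogonal to it.
ref = [1.0, 0.0, 0.0]
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mean_sim, min_sim = identity_drift(ref, frames)
```

A low minimum with a high mean would point to isolated identity failures, while a low mean would suggest systematic drift on misaligned or cropped references.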

Figures

Figures reproduced from arXiv: 2511.22940 by Chunli Peng, Jiangning Zhang, Jing Xu, Kai Hu, Lijing Lu, Shijun Shi, Xiaoda Yang, Zhihang Li.

**Figure 1.** We introduce One-to-All Animation, a unified framework for pose-driven personalized generation. Unlike prior methods that require both spatially-aligned references and pose retargeting, our framework supports: (1) cross-scale video animation with either retargeted or original driving motion, (2) cross-scale image pose transfer, and (3) temporally coherent long video generation.

**Figure 2.** Visual comparison under spatial-misaligned inputs.

**Figure 3.** Overview of the proposed framework. We introduce outpainting preprocess to handle diverse body proportions through face …

**Figure 5.** At each denoising timestep, context tokens from the last …

**Figure 7.** Human evaluation with current SOTA.

… across both datasets, indicating strong scalability and generalization. Qualitative results are shown in Fig. 6a. Evaluation on Image Dataset. For image pose transfer, following prior works [25, 30], we evaluate on 8,570 test pairs from the DeepFashion dataset. Previous methods typically perform inference at a low resolution of 512 × 352, which often leads to loss of …

**Figure 6.** Qualitative comparisons with state-of-the-art methods.

**Figure 8.** Qualitative comparison of different reference feature ex…

**Figure 9.** Qualitative ablation of Identity-Robust Pose Control.

**Figure 11.** Cartoon Dataset construction: (a) pose filtering and …

**Figure 12.** Visualization of region-weighted loss components.

**Figure 14.** Failure cases of crop-and-resize baseline.

**Figure 15.** Results from the 1.3B model showing that higher …

**Figure 16.** Our model enables prompt-based editing while maintaining identity and motion.

**Figure 17.** Additional Cartoon benchmark results.

Abstract

Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical framework for misalignment in pose-driven animation via self-supervised outpainting and a few targeted modules, but the generalization claim needs real data checks.

The main thing to know is that this work targets the open problem of handling reference images with arbitrary layouts or partial visibility in character animation and pose transfer. The authors reformulate training as self-supervised outpainting to create a unified occluded-input format, then add a reference extractor, hybrid fusion attention, identity-robust pose control, and a token-replace step for long videos. The code is released, which helps verification.

Referee Report

3 major / 2 minor

Summary. The paper introduces One-to-All Animation, a unified diffusion-based framework for high-fidelity character animation and image pose transfer from references with arbitrary layouts. It reformulates training as a self-supervised outpainting task to convert diverse-layout references into a unified occluded-input format, adds a reference extractor for identity features from partially visible inputs, hybrid reference fusion attention for varying resolutions and sequence lengths, identity-robust pose control to decouple appearance from skeletal structure, and a token-replace strategy for coherent long-video generation. The authors claim that extensive experiments demonstrate outperformance over prior methods, with code and models released.

Significance. If the generalization claims hold, the work would meaningfully advance pose-driven animation by removing the requirement for spatially aligned reference-pose pairs, enabling practical use on real-world misaligned or partially occluded references. The self-supervised outpainting reformulation and identity-robust control are potentially reusable ideas for other conditional generation tasks.

major comments (3)

[§3.2] Training reformulation. The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.
[§4] Experiments. The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.
[§3.4] Identity-robust pose control. The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.

minor comments (2)

[§3.3] Notation for the hybrid reference fusion attention and token-replace strategy should be introduced with explicit equations or pseudocode rather than high-level descriptions only.
[Figure 5] Figure captions and axis labels in the qualitative results should explicitly state the reference layout type (e.g., 'extreme crop', 'partial occlusion') for each example to allow readers to assess the claimed robustness.
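For orientation on the kind of pseudocode the first minor comment asks for, the sketch below shows one generic way a variable number of reference tokens can enter attention: video tokens act as queries over the concatenation of video and reference tokens. This is not the paper's hybrid reference fusion attention, whose formulation is not given in this summary; the function names are hypothetical, and learned projections and multiple heads are deliberately omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_reference(video_tokens, ref_tokens):
    """Single-head attention: video queries over [video; reference] keys and values."""
    kv = np.concatenate([video_tokens, ref_tokens], axis=0)
    scores = video_tokens @ kv.T / np.sqrt(video_tokens.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ kv, weights

rng = np.random.default_rng(0)
video = rng.standard_normal((6, 8))          # 6 video tokens of width 8
for n_ref in (3, 10):                        # reference length can vary freely
    ref = rng.standard_normal((n_ref, 8))
    fused, weights = attend_with_reference(video, ref)
```

The output keeps the video token count regardless of how many reference tokens are supplied, which is the property that lets references of different resolutions or lengths share one model.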

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We appreciate the recognition of the potential impact of our alignment-free approach. We address each major comment below and have revised the manuscript to incorporate additional analyses, clarifications, and supporting evidence where appropriate.

Referee: [§3.2] Training reformulation. The central assumption that self-supervised outpainting on synthetically generated diverse-layout references will produce a model robust to real misalignment distributions (extreme crops, unusual viewpoints, partial occlusions) is load-bearing for the 'alignment-free' claim. The manuscript should provide a quantitative comparison of the synthetic layout distribution against real test cases (e.g., via statistics on crop ratios, occlusion levels, or viewpoint variance) and an ablation showing identity preservation when the test distribution is deliberately shifted outside the training support.

Authors: We agree that validating generalization to real misalignment distributions is critical for the alignment-free claim. In the revised manuscript, we have expanded §3.2 with a quantitative comparison of the synthetic layout distributions (reporting statistics on crop ratios, occlusion levels, and viewpoint variance) against real test cases from our evaluation datasets. We have also added an ablation that deliberately shifts the test distribution toward more extreme misalignments outside the training support and measures identity preservation via feature similarity scores. These additions support the robustness of the self-supervised reformulation.

revision: yes

Referee: [§4] Experiments. The abstract asserts outperformance, yet the provided summary and abstract contain no quantitative metrics, baseline names, dataset statistics, or ablation tables. The full paper must include these (e.g., FID, LPIPS, identity similarity scores, user studies) with error bars or statistical significance tests; without them the empirical support for the central claim cannot be evaluated.

Authors: The full manuscript already presents quantitative results in Section 4, including FID, LPIPS, identity similarity scores, baseline comparisons, dataset statistics, and ablation tables, along with user studies. To strengthen the presentation, we have added error bars from multiple runs and statistical significance tests (paired t-tests with p-values) in the revised version. The abstract summarizes the outperformance while directing readers to the detailed experiments.

revision: yes

Referee: [§3.4] Identity-robust pose control. The decoupling of appearance from skeletal structure is presented as mitigating pose overfitting, but the manuscript should clarify whether this is achieved via an architectural constraint, a loss term, or a data-augmentation schedule, and report an ablation measuring identity drift (e.g., face or clothing consistency) when the pose-control module is removed.

Authors: The identity-robust pose control is realized via an architectural constraint within the pose control module together with a dedicated loss term that encourages decoupling of appearance from skeletal structure; this is not primarily a data-augmentation schedule. We have clarified the exact mechanism in the revised §3.4. We have also added an ablation study that removes the pose-control module and quantifies identity drift using face and clothing consistency metrics, confirming its role in mitigating pose overfitting.

revision: yes
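The paired t-tests mentioned in the simulated rebuttal compare two methods on the same test items. As a minimal sketch of the statistic involved, using only the Python standard library: the score lists below are made-up placeholder numbers for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper. Turning the statistic into a p-value additionally requires the Student t distribution, which is left out here.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder per-item scores for two methods on the same four test items.
method_a = [0.82, 0.79, 0.85, 0.81]
method_b = [0.78, 0.77, 0.80, 0.79]
t, dof = paired_t_statistic(method_a, method_b)
```

Pairing matters because both methods are scored on identical references and poses, so per-item differences remove item difficulty from the comparison.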

Circularity Check

0 steps flagged

No circularity: architectural choices presented as independent design decisions

The paper's core contributions consist of explicit methodological decisions—reformulating training as self-supervised outpainting on diverse-layout references, designing a reference extractor, integrating hybrid fusion attention, and introducing identity-robust pose control plus token replacement—rather than any derived quantities, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or uniqueness theorems are invoked that reduce to the paper's own inputs by construction; the framework is self-contained as a set of engineering choices whose validity rests on empirical generalization rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard diffusion model assumptions plus the novel training reformulation and modules introduced in the paper; no new physical entities or heavily fitted constants are described.

axioms (1)

Domain assumption: Diffusion models trained via self-supervised outpainting on misaligned references will learn robust identity and pose representations.
This premise is invoked when the authors reformulate training as an outpainting task to handle arbitrary layouts.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
cs.CV · 2026-05 · unverdicted · novelty 6.0

SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, page 2, 2019.
[2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, 2023.
[3] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.
[4] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. arXiv preprint arXiv:2311.12052, 2023.
[5] Chaofeng Chen. Iqa-pytorch: Pytorch toolbox for image quality assessment. https://github.com/chaofengc/IQA-PyTorch, 2022.
[6] Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, et al. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025.
[7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,
[8] Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation. arXiv preprint arXiv:2502.04847, 2025.
[9] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,
[10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In 12th International Conference on Learning Representations, ICLR 2024, 2024.
[11] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103,
[12] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.
[13] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221,
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[16] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010.
[17] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
[18] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.
[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[20] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[21] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[22] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13847–13858,
[23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[24] Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, et al. Lumina-video: Efficient and flexible video generation with multi-scale next-dit. arXiv preprint arXiv:2502.06782,
[25] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025.
[26] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025.
[27] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025.
[28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[29] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.
[30] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024.
[31] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024.
[32] PaddlePaddle. Paddleocr: Awesome multilingual ocr toolkits. https://github.com/PaddlePaddle/PaddleOCR, 2021.
[33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,
[34] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024.
[35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,
[36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[37] Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7690–7699,
[38] Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, and Thomas H Li. Neural texture extraction and distribution for controllable person image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13535–13544, 2022.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
[41]

Advancing pose-guided image synthesis with pro- gressive conditional diffusion models

Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, and Yang Wei. Advancing pose-guided image synthesis with pro- gressive conditional diffusion models. InThe Twelfth Inter- national Conference on Learning Representations. 3

work page
[42]

Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution

Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7385–7395, 2025. 3

work page 2025
[43]

Deformable gans for pose-based human im- age generation

Aliaksandr Siarohin, Enver Sangineto, St ´ephane Lathuiliere, and Nicu Sebe. Deformable gans for pose-based human im- age generation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416,

[44] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

[45] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations. 3

[46] Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, and Ming Yang. Animate-x: Universal character image animation with enhanced motion representation. arXiv preprint arXiv:2410.10306, 2024. 2, 6, 8

[47] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21096–21106, 2025. 2, 3, 6

[48] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 7

[49] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 3, 6

[50] Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for realistic human dance generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9326–9336, 2024. 7, 2

[51] Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, and Nong Sang. Unianimate: Taming unified video diffusion models for consistent human image animation. arXiv preprint arXiv:2406.01188, 2024. 2

[52] Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, and Nong Sang. Unianimate-dit: Human image animation with large-scale video diffusion transformer. arXiv preprint arXiv:2504.11289, 2025. 2, 3, 6

[53] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7

[54] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024. 7

[55] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023. 1

[56] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 3

[57] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

[58] Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139, 2019. 6

[59] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 3

[60] Pengze Zhang, Lingxiao Yang, Jian-Huang Lai, and Xiaohua Xie. Exploring dual-task correlation for pose guided person image generation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 7713–7722, 2022. 3

[61] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 7

[62] Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680, 2024. 2, 3, 6, 8

[63] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025. 3

[64] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024. 6

[1] Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, page 2, 2019. 7

[2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, 2023. 3

[3] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023. 6

[4] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. arXiv preprint arXiv:2311.12052, 2023. 2, 3

[5] Chaofeng Chen. Iqa-pytorch: Pytorch toolbox for image quality assessment. https://github.com/chaofengc/IQA-PyTorch, 2022. 2

[6] Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, et al. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025. 2, 3, 6, 7

[7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

[8] Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Humandit: Pose-guided diffusion transformer for long-form human motion video generation. arXiv preprint arXiv:2502.04847, 2025. 2

[9] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,

[10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In 12th International Conference on Learning Representations, ICLR 2024, 2024. 3

[11] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103,

[12] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

[13] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221,

[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 7

[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3

[16] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 7

[17] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024. 2, 3

[18] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021. 6

[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 3

[20] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 3

[21] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024. 3, 7

[22] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13847–13858,

[23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 6

[24] Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, et al. Lumina-video: Efficient and flexible video generation with multi-scale next-dit. arXiv preprint arXiv:2502.06782,

[25] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025. 7

[26] Jiaqi Liu, Jichao Zhang, Paolo Rota, and Nicu Sebe. Multi-focal conditioned latent diffusion for person image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16019–16028, 2025. 3

[27] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025. 4

[28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 6

[29] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016. 3, 7

[30] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024. 7

[31] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2024. 3

[32] PaddlePaddle. Paddleocr: Awesome multilingual ocr toolkits. https://github.com/PaddlePaddle/PaddleOCR, 2021. 1

[33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

[34] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024. 3

[35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

[36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022. 3

[37] Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7690–7699,
