pith. machine review for the scientific record.

arxiv: 2604.05961 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:18 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: human video generation · articulated noise · diffusion models · motion consistency · video synthesis · clothing deformation · human motion control

The pith

Sampling noise on 3D human body surfaces lets diffusion models generate motion-consistent videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HumANDiff to address the failure of standard video diffusion models to capture the physics and dynamics of real human motion. It replaces unstructured Gaussian noise with noise sampled on the surface of a statistical 3D human body template, adds joint prediction of appearance and motion, and enforces consistency through a loss defined in that noise space. These changes are applied during fine-tuning of existing models, so no architecture redesign is needed. The result is scalable image-to-video generation that produces high-fidelity humans whose clothing and movements remain coherent across frames.
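To make the sampling step concrete, here is a minimal sketch of one way articulated noise could be constructed, assuming per-vertex noise drawn once per sequence on the body template and splatted into each frame's latent grid at the projected vertex locations, with ordinary Gaussian noise elsewhere. The function and argument names (articulated_noise, verts_2d_per_frame, latent_hw) are illustrative assumptions, not the paper's implementation.

```python
import torch

def articulated_noise(verts_2d_per_frame, latent_hw, n_channels):
    """Hedged sketch, not the authors' code.

    verts_2d_per_frame: (T, V, 2) projected template-vertex positions in [0, 1].
    Returns (T, C, H, W) noise in which pixels covered by the body reuse the
    same per-vertex noise in every frame, while the background stays i.i.d.
    Gaussian.
    """
    T, V, _ = verts_2d_per_frame.shape
    H, W = latent_hw
    vert_noise = torch.randn(V, n_channels)    # drawn once, shared across frames
    frames = torch.randn(T, n_channels, H, W)  # Gaussian background per frame
    for t in range(T):
        cols = (verts_2d_per_frame[t, :, 0] * (W - 1)).round().long().clamp(0, W - 1)
        rows = (verts_2d_per_frame[t, :, 1] * (H - 1)).round().long().clamp(0, H - 1)
        frames[t][:, rows, cols] = vert_noise.T  # body pixels follow the surface
    return frames
```

Because vert_noise is tied to surface points rather than pixels, a point that moves across the image carries its noise with it, which is the spatiotemporal correlation the paper attributes its consistency gains to.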

Core claim

HumANDiff replaces random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. This noise inherits body topology priors for spatially and temporally consistent sampling. The training objective is extended to jointly predict pixel appearances and physical motions from the noise, while a geometric motion consistency loss defined in the articulated noise space enforces physical consistency across frames. The resulting fine-tuned models support controllable human video generation, including image-to-video synthesis, without any changes to the underlying diffusion architecture.
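For intuition about how the extended objective could combine the pieces, here is a hedged sketch of a joint appearance-motion loss, assuming the model predicts both the articulated noise and a per-frame motion field, and that the geometric consistency term is computed separately and passed in. The weights lambda_motion and lambda_gmc are hypothetical; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def joint_appearance_motion_loss(pred_noise, target_noise,
                                 pred_motion, target_motion,
                                 gmc_term, lambda_motion=1.0, lambda_gmc=1.0):
    """Hedged sketch: standard denoising (appearance) loss, plus a motion
    prediction term, plus a geometric motion consistency term defined in the
    articulated noise space (computed elsewhere and passed in as gmc_term)."""
    appearance = F.mse_loss(pred_noise, target_noise)   # usual diffusion target
    motion = F.mse_loss(pred_motion, target_motion)     # joint motion prediction
    return appearance + lambda_motion * motion + lambda_gmc * gmc_term
```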

What carries the argument

Articulated motion-consistent noise sampling on the dense surface manifold of a statistical human body template, which correlates spatiotemporal noise distributions and supplies body topology priors for consistent motion control.

If this is right

  • Existing video diffusion models can be adapted for high-fidelity human video synthesis through fine-tuning alone.
  • Image-to-video generation becomes possible inside one framework with built-in motion control and no extra modules.
  • Motion-dependent effects such as clothing wrinkles arise directly from the joint appearance-motion objective.
  • The approach remains model-agnostic and works with diverse clothing styles while preserving temporal consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surface-manifold noise idea could be tested on other articulated objects such as animals or robots by swapping the body template.
  • Adding explicit physics simulation on top of the noise sampling might further reduce implausible deformations that the current method still produces.
  • Because the method needs no architecture changes, it could be combined with future advances in base video diffusion models to improve human results without re-engineering.

Load-bearing premise

That noise sampled on the surface of a statistical human body template will automatically encode the real dynamics and physics of human motion and clothing deformation.

What would settle it

Videos in which clothing fails to wrinkle or stretch in ways that match the body motion, or in which limbs and clothing exhibit temporal jitter or intersections absent from real footage.

Figures

Figures reproduced from arXiv: 2604.05961 by Tao Hu, Varun Jampani.

Figure 1. HumANDiff generates temporally consistent human videos from a single image by sampling motion-consistent noise on the UV manifold of a parametric body model (e.g., SMPL [36]), enabling intrinsic motion control without explicit motion adapters.
Figure 2. Training framework. During training, HumANDiff learns to generate motion-controllable human videos using a latent video diffusion model with four components: 1) Video encoding: a 3D VAE encodes video segments x0 into latent space; 2) Articulated motion-consistent noise sampling (Sec. 3.2): motions are modeled on the surface UV manifold of SMPL [36], and motion control m0 is integrated as a warping field…
Figure 3. Qualitative comparisons for fashion video synthesis. Results of each method are shown under similar novel poses. Our method faithfully preserves the appearance details of the reference image, e.g., face identity and clothing details.
Figure 4. Comparisons against Animate Anyone [21] and Wan2.1-based UniAnimate [69]. Our method generates high-quality clothing details (○1, ○3) and face details (○2).
Figure 5. Qualitative comparisons of appearance consistency across different noise-warping methods; two different frames are visualized for evaluation. Compared with the 2D optical-flow-based noise warping of GWTF [6], our 3D articulated noise warping enables more precise motion control and generates more consistent renderings for complex textures.
Figure 6. Generalization results. Our method, trained on fashion videos, generalizes well to unseen out-of-domain data and synthesizes novel-pose human videos from a single image in diverse clothing styles (a) and diverse in-the-wild backgrounds (b).
Figure 7. Ablation study of Joint Appearance-Motion Learning (JAML). JAML improves motion coherence for faithful motion control, while the variant without JAML fails to generate accurate animation videos.
Figure 8. Ablation study of Geometric Motion Consistency Learning (GMCL), with two frames at different time steps shown. GMCL improves appearance consistency, i.e., it better preserves face identity details (○2 vs. ○1, and ○4 vs. ○3).
Figure 9. Generation results of model variants trained on Wan2.1-I2V-14B and CogVideoX-I2V-5B. Our method is model-agnostic, and the model trained on Wan2.1-I2V-14B generates higher-quality appearance details (○1, ○2, ○3).
Figure 10. Ablation study of noise degradation. The noise degradation strategy [6] controls the strength of motion conditioning at inference time; γ = 0.1 is chosen to balance motion control and video quality.
Figure 11. Illustration of video VAE encoding in CogVideoX. Fine details such as faces and hands are often lost during video encoding, which makes it challenging for the model to generate high-quality face and hand details.
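Figure 10 refers to a noise degradation parameter γ taken from the strategy of [6]. As a rough illustration of how such a knob can trade motion conditioning against fresh randomness, here is a minimal, variance-preserving blend of articulated noise with fresh Gaussian noise; this is a hedged sketch of the general idea, not necessarily the exact scheme used in [6] or in the paper.

```python
import torch

def degrade_noise(articulated, gamma=0.1):
    """Blend structured (articulated) noise with fresh Gaussian noise.
    gamma = 0 keeps the fully structured noise (strongest motion conditioning);
    gamma = 1 falls back to plain Gaussian noise. The weights are chosen so
    the mixture stays approximately unit-variance."""
    fresh = torch.randn_like(articulated)
    return (1.0 - gamma ** 2) ** 0.5 * articulated + gamma * fresh
```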
Original abstract

Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces HumANDiff, a framework for human video generation via fine-tuning of video diffusion models. It proposes three components: (1) articulated motion-consistent noise sampling that replaces unstructured Gaussian noise with 3D noise sampled on the dense surface manifold of a statistical human body template to inherit topology priors for spatiotemporal consistency; (2) joint appearance-motion learning that augments the diffusion training objective to predict both pixel appearances and physical motions; and (3) a geometric motion consistency loss defined in the articulated noise space to enforce cross-frame physical consistency. The method is architecture-agnostic, supports image-to-video generation intrinsically, and claims state-of-the-art performance in producing motion-consistent, high-fidelity human videos with diverse clothing styles.

Significance. If the empirical claims hold, the work would be significant for controllable video generation by showing how to inject 3D body topology priors directly into the noise process of existing diffusion models without architectural modifications or external motion modules. This could improve scalability and motion fidelity in human synthesis tasks, addressing a recognized limitation in current generative video models regarding faithful dynamics and clothing deformation.

major comments (3)
  1. [§3.1] The articulated noise sampling is defined on the dense surface manifold of a statistical body template and is claimed to capture spatiotemporal distributions for motion consistency. However, the construction supplies only topology priors; no derivation shows how this sampling encodes real motion dynamics or clothing physics (e.g., wrinkles, forces) without physics simulation or per-sequence fitting, which is load-bearing for the central claim of faithful dynamics.
  2. [§4, Experiments] The abstract and method claim SOTA results in motion consistency and high-fidelity rendering, but the provided description lacks explicit quantitative metrics (e.g., FVD, motion accuracy scores), ablation tables isolating each of the three components, or error analysis on unseen motions and complex clothing. Without these, the attribution of improvements to the articulated noise and geometric loss cannot be verified.
  3. [§3.3] The geometric motion consistency loss is defined in the same articulated noise space as the sampling. This risks circularity: the loss operates on quantities constructed by the method itself rather than on independent physical quantities, so it is unclear whether it enforces true physical consistency or merely geometric self-consistency.
minor comments (2)
  1. [Abstract] The abstract states the method is 'agnostic to diffusion model design' and requires 'no modifications to the model architecture'; this strength should be highlighted with a concrete example of the base model used in experiments.
  2. [§3] Notation for the articulated noise (e.g., how the manifold sampling is parameterized) should be introduced earlier and used consistently in the loss definitions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on HumANDiff. We address each major comment below, providing clarifications and indicating revisions to the manuscript where the points strengthen the presentation or evidence.

point-by-point responses
  1. Referee: [§3.1] The articulated noise sampling is defined on the dense surface manifold of a statistical body template and claimed to capture spatiotemporal distributions for motion consistency. However, the construction supplies only topology priors; no derivation shows how this sampling encodes real motion dynamics or clothing physics (e.g., wrinkles, forces) without physics simulation or per-sequence fitting, which is load-bearing for the central claim of faithful dynamics.

    Authors: We appreciate this point. The articulated noise sampling is intended to replace unstructured Gaussian noise with a structured distribution that respects the fixed topology and articulation of the SMPL body template, thereby inducing correlated spatiotemporal noise patterns that aid the diffusion process in producing consistent motion. It does not claim to directly encode clothing physics or forces; those emerge from the combination of joint appearance-motion prediction (which learns motion-dependent appearance changes such as wrinkles) and the geometric consistency loss. We have revised §3.1 to explicitly state that the sampling supplies topological priors for consistency rather than a physics derivation, and we reference how the other two components enable the capture of dynamic effects during fine-tuning. revision: partial

  2. Referee: [§4] The abstract and method claim SOTA results in motion consistency and high-fidelity rendering, but the provided description lacks explicit quantitative metrics (e.g., FVD, motion accuracy scores), ablation tables isolating each of the three components, or error analysis on unseen motions/complex clothing. Without these, the attribution of improvements to the articulated noise and geometric loss cannot be verified.

    Authors: We agree that clear quantitative support and component-wise ablations are necessary to substantiate the claims. The full manuscript contains FVD, motion consistency, and user-study metrics in §4 along with some ablations, but we acknowledge the presentation could be more explicit. In the revision we will add a dedicated ablation table isolating the contribution of each component (noise sampling, joint prediction, and geometric loss), include motion accuracy scores on held-out sequences, and provide error analysis for complex clothing and unseen motions. These additions will directly address attribution. revision: yes

  3. Referee: [§3.3] The geometric motion consistency loss is defined in the same articulated noise space as the sampling. This risks circularity: the loss operates on quantities constructed by the method itself rather than on independent physical quantities, so it is unclear whether it enforces true physical consistency or merely geometric self-consistency.

    Authors: The loss is computed by comparing the predicted motion fields (derived from the diffusion model's output) projected back onto the fixed articulated template across frames. Because the template articulation is independent of the generated video content and the noise field is sampled once per sequence, the loss penalizes temporal inconsistencies in the body manifold coordinates. This is not purely self-referential; it enforces agreement with the physical joint structure encoded in the template. We have expanded §3.3 with a short derivation showing the correspondence between noise-space differences and 3D joint displacements, and we note that the loss is applied only during training to regularize the learned dynamics. revision: partial
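As a concrete illustration of what such a consistency term could look like, here is a minimal sketch, assuming the predicted per-frame motion features have been resampled onto fixed UV coordinates of the template so that the same surface point can be compared across frames. The resampling step and the name motion_uv are assumptions for illustration, not the paper's notation.

```python
def geometric_motion_consistency(motion_uv):
    """Hedged sketch of a GMC-style term.

    motion_uv: (T, C, H_uv, W_uv) motion features resampled onto the body
    template's UV map, one map per frame. The same (u, v) location refers to
    the same surface point in every frame, so penalizing frame-to-frame
    differences in this space encourages temporally consistent motion.
    """
    diffs = motion_uv[1:] - motion_uv[:-1]  # finite difference along time
    return diffs.pow(2).mean()
```

Applied only at training time, a term of this form regularizes the learned dynamics without constraining the architecture.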

Circularity Check

0 steps flagged

No significant circularity; method introduces independent sampling and loss designs

full rationale

The paper proposes three explicit new components—articulated noise sampling on a body template manifold, joint appearance-motion prediction, and a geometric consistency loss in noise space—for fine-tuning existing video diffusion models. No equations, derivations, or claims reduce the motion-consistency or fidelity improvements to quantities defined by the inputs themselves, nor do they rely on load-bearing self-citations or fitted parameters renamed as predictions. The SOTA performance is presented as an empirical outcome of the new designs rather than a tautological re-expression of prior quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method rests on the assumption that a fixed statistical human body template provides sufficient topology priors for noise sampling to enforce physical motion consistency; no free parameters are explicitly named in the abstract, but the template itself and any scaling factors for the noise manifold are implicit.

axioms (2)
  • domain assumption Video diffusion models can be fine-tuned by replacing unstructured Gaussian noise with structured noise sampled on a human body surface manifold without architectural changes.
    Stated as the basis for the scalable fine-tuning approach.
  • domain assumption Jointly predicting pixel appearance and physical motion from the same articulated noise improves fidelity of motion-dependent effects such as clothing wrinkles.
    Core of the joint appearance-motion learning design.
invented entities (1)
  • Articulated motion-consistent noise · no independent evidence
    purpose: To correlate spatiotemporal distribution of latent noise with human body topology for consistent sampling.
    New sampling procedure introduced in the first key design.

pith-pipeline@v0.9.0 · 5569 in / 1530 out tokens · 43308 ms · 2026-05-10T20:18:16.037834+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

85 extracted references · 27 canonical work pages · 8 internal anchors

  1. [1] EasyMocap: Make human motion capture easier. GitHub (2021). https://github.com/zju3dv/EasyMocap
  2. [2] Aberman, K., Shi, M., Liao, J., Lischinski, D., Chen, B., Cohen-Or, D.: Deep video-based performance cloning. Computer Graphics Forum 38 (2019)
  3. [3] Bhunia, A.K., Khan, S.H., Cholakkal, H., Anwer, R.M., Laaksonen, J., Shah, M., Khan, F.S.: Person image synthesis via denoising diffusion model. CVPR (2023), pp. 5968–5976
  4. [5] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127 (2023)
  5. [6] Burgert, R., Xu, Y., Xian, W., Pilarski, O., Clausen, P., He, M., Ma, L., Deng, Y., Li, L., Mousavi, M., Ryoo, M., Debevec, P., Yu, N.: Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. CVPR (2025). https://arxiv.org/abs/2501.08331
  6. [7] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(1), 172–186 (2019)
  7. [8] Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. ICCV (2019), pp. 5932–5941
  8. [9] Chan, E., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., Mello, S.D., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., Karras, T., Wetzstein, G.: Efficient geometry-aware 3D generative adversarial networks. arXiv:2112.07945 (2021)
  9. [10] Chang, P., Tang, J., Gross, M., Azevedo, V.C.: How I warped your noise: A temporally-correlated noise prior for diffusion models. arXiv:2504.03072 (2025)
  10. [11] Chefer, H., Singer, U., Zohar, A., Kirstain, Y., Polyak, A., Taigman, Y., Wolf, L., Sheynin, S.: VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. arXiv:2502.02492 (2025)
  11. [12] Chen, J., Zhang, Y., Kang, D., Zhe, X., Bao, L., Lu, H.: Animatable neural radiance fields from monocular RGB video. arXiv:2106.13629 (2021)
  12. [13] Grigor'ev, A.K., Sevastopolsky, A., Vakhitov, A., Lempitsky, V.S.: Coordinate-based texture inpainting for pose-guided human image generation. CVPR (2019), pp. 12127–12136
  13. [14] Gu, J., Liu, L., Wang, P., Theobalt, C.: StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv:2110.08985 (2021)
  14. [15] Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725 (2023)
  15. [16] Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: Dense human pose estimation in the wild. CVPR (2018), pp. 7297–7306
  16. [17] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NIPS (2017)
  17. [18] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020)
  18. [19] Hong, Y., Peng, B., Xiao, H., Liu, L., Zhang, J.: HeadNeRF: A real-time NeRF-based parametric head model. arXiv:2112.05637 (2021)
  19. [20] Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: LoRA: Low-rank adaptation of large language models. arXiv:2106.09685 (2021)
  20. [21] Hu, L., Gao, X., Zhang, P., Sun, K., Zhang, B., Bo, L.: Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. CVPR (2024)
  21. [22] Hu, L., Wang, G., Shen, Z., Gao, X., Meng, D., Zhuo, L., Zhang, P., Zhang, B., Bo, L.: Animate Anyone 2: High-fidelity character image animation with environment affordance. arXiv:2502.06145 (2025)
  22. [23] Hu, T., Yu, T., Zheng, Z., Zhang, H., Liu, Y., Zwicker, M.: HVTR: Hybrid volumetric-textural rendering for human avatars. 3DV (2022)
  23. [24] Hu, T., Hong, F., Liu, Z.: SurMo: Surface-based 4D motion modeling for dynamic human rendering. CVPR (2024), pp. 6550–6560
  24. [25] Hu, T., Sarkar, K., Liu, L., Zwicker, M., Theobalt, C.: EgoRenderer: Rendering human avatars from egocentric camera images. ICCV (2021)
  25. [26] Hu, T., Xu, H., Luo, L., Yu, T., Zheng, Z., Zhang, H., Liu, Y., Zwicker, M.: HVTR++: Image and pose driven human avatars using hybrid volumetric-textural rendering. IEEE Transactions on Visualization and Computer Graphics, pp. 1–15 (2023). https://doi.org/10.1109/TVCG.2023.3297721
  26. [27] Işık, M., Rünz, M., Georgopoulos, M., Khakhulin, T., Starck, J., Agapito, L., Nießner, M.: HumanRF: High-fidelity neural radiance fields for humans in motion. ACM Transactions on Graphics 42(4), 1–12 (2023)
  27. [28] Karras, J., Holynski, A., Wang, T.C., Kemelmacher-Shlizerman, I.: DreamPose: Fashion image-to-video synthesis via Stable Diffusion. ICCV (2023), pp. 22623–22633
  28. [29] Kratzwald, B., Huang, Z., Paudel, D.P., Van Gool, L.: Towards an understanding of our world by GANing videos in the wild. arXiv:1711.11453 (2017)
  29. [30] Lei, W., Wang, J., Ma, F., Huang, G., Liu, L.: A comprehensive survey on human video generation: Challenges, methods, and insights. arXiv:2407.08428 (2024)
  30. [31] Liu, C., Vahdat, A.: On equivariance and fast sampling in video diffusion models trained with warped noise (2025)
  31. [32] Liu, L., Habermann, M., Rudnev, V., Sarkar, K., Gu, J., Theobalt, C.: Neural Actor: Neural free-view synthesis of human actors with pose control. ACM Transactions on Graphics 40 (2021)
  32. [33] Liu, L., Xu, W., Habermann, M., Zollhöfer, M., Bernard, F., Kim, H., Wang, W., Theobalt, C.: Neural human video rendering by learning dynamic textures and rendering-to-video translation. IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.2996594
  33. [34] Liu, L., Xu, W., Zollhoefer, M., Kim, H., Bernard, F., Habermann, M., Wang, W., Theobalt, C.: Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (2019)
  34. [35] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. CVPR (2016), pp. 1096–1104
  35. [36] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics 34, 248:1–16 (2015)
  36. [37] Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. NeurIPS (2017), pp. 405–415
  37. [38] Ma, L., Sun, Q., Georgoulis, S., van Gool, L., Schiele, B., Fritz, M.: Disentangled person image generation. CVPR (2018)
  38. [39] Neverova, N., Güler, R.A., Kokkinos, I.: Dense pose transfer. ECCV (2018)
  39. [40] Niemeyer, M., Geiger, A.: GIRAFFE: Representing scenes as compositional generative neural feature fields. CVPR (2021), pp. 11448–11459
  40. [41] Noguchi, A., Sun, X., Lin, S., Harada, T.: Neural articulated radiance field. ICCV (2021)
  41. [42] Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J.J., Kemelmacher-Shlizerman, I.: StyleSDF: High-resolution 3D-consistent image and geometry generation. arXiv:2112.11427 (2021)
  42. [43] Peng, S., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Zhou, X., Bao, H.: Animatable neural radiance fields for modeling dynamic human bodies. ICCV (2021)
  43. [44] Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR (2021)
  44. [45] Prokudin, S., Black, M.J., Romero, J.: SMPLpix: Neural avatars from 3D human models. WACV (2021)
  45. [46] Pumarola, A., Agudo, A., Sanfeliu, A., Moreno-Noguer, F.: Unsupervised person image synthesis in arbitrary poses. CVPR (2018)
  46. [47] Qiu, L., Gu, X., Li, P., Zuo, Q., Shen, W., Zhang, J., Qiu, K., Yuan, W., Chen, G., Dong, Z., et al.: LHM: Large animatable human reconstruction model from a single image in seconds. arXiv:2503.10625 (2025)
  47. [48] Qiu, L., Zhu, S., Zuo, Q., Gu, X., Dong, Y., Zhang, J., Xu, C., Li, Z., Yuan, W., Bo, L., et al.: AniGS: Animatable Gaussian avatar from a single image with inconsistent Gaussian reconstruction. arXiv:2412.02684 (2024)
  48. [49] Raj, A., Tanke, J., Hays, J., Vo, M., Stoll, C., Lassner, C.: ANR: Articulated neural rendering for virtual avatars. CVPR (2021), pp. 3721–3730
  49. [50] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. CVPR (2022)
  50. [51] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  51. [52] Sarkar, K., Mehta, D., Xu, W., Golyanik, V., Theobalt, C.: Neural re-rendering of humans from a single image. ECCV (2020)
  52. [53] Shao, R., Pang, Y., Zheng, Z., Sun, J., Liu, Y.: Human4DiT: 360-degree human video generation with 4D diffusion transformer. ACM Transactions on Graphics 43(6), 1–13 (2024)
  53. [54] Shao, R., Xu, Y., Shen, Y., Yang, C., Zheng, Y., Chen, C., Liu, Y., Wetzstein, G.: ISA4D: Interspatial attention for efficient 4D human video generation. ACM Transactions on Graphics (2025)
  54. [55] Siarohin, A., Sangineto, E., Lathuiliere, S., Sebe, N.: Deformable GANs for pose-based human image generation. CVPR (2018)
  55. [56] Siarohin, A., Woodford, O.J., Ren, J., Chai, M., Tulyakov, S.: Motion representations for articulated animation. CVPR (2021), pp. 13648–13657
  56. [57] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-A-Video: Text-to-video generation without text-video data. arXiv:2209.14792 (2022)
  57. [58] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv:2011.13456 (2020)
  58. [59] Su, S.Y., Yu, F., Zollhoefer, M., Rhodin, H.: A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose. NeurIPS (2021)
  59. [60] Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics 38 (2019)
  60. [61] Unterthiner, T., van Steenkiste, S., Kurach, K., Roth, K., et al.: Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717 (2018)
  61. [62] Unterthiner, T., et al.: FVD: A new metric for video generation. arXiv:1907.06578 (2019)
  62. [63] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., et al.: Wan: Open and advanced large-scale video generative models
  63. [64] Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A., Li, H., Tang, X., Hu, Y.: InstantID: Zero-shot identity-preserving generation in seconds. arXiv:2401.07519 (2024)
  64. [65] Wang, T., Li, L., Lin, K., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: DisCo: Disentangled control for realistic human dance generation. CVPR (2024)
  65. [66] Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B.: Video-to-video synthesis. NeurIPS (2018)
  66. [67] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. CVPR (2018)
  67. [68] Wang, W., Liu, J., Lin, Z., Yan, J., Chen, S., Low, C., Hoang, T., Wu, J., Liew, J.H., Yan, H., et al.: MagicVideo-V2: Multi-stage high-aesthetic video generation. arXiv:2401.04468 (2024)
  68. [69] Wang, X., Zhang, S., Gao, C., Wang, J., Zhou, X., Zhang, Y., Yan, L., Sang, N.: UniAnimate: Taming unified video diffusion models for consistent human image animation. Science China Information Sciences (2025)
  69. [70] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  70. [71] Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. ICCV (2023)
  71. [72] Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: MagicAnimate: Temporally consistent human image animation using diffusion model. CVPR (2024)
  72. [73] Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: VideoGPT: Video generation using VQ-VAE and transformers. arXiv:2104.10157 (2021)
  73. [74] Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. ICCV (2023)
  74. [75] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Zhang, Y., Wang, W., Cheng, Y., Xu, B., Gu, X., Dong, Y., Tang, J.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv (2024)
  75. [76] Yu, L., Cheng, Y., Sohn, K., Lezama, J., Zhang, H., Chang, H., Hauptmann, A.G., Yang, M.H., Hao, Y., Essa, I., et al.: MAGVIT: Masked generative video transformer. CVPR (2023)
  76. [77] Yu, W.Y., Po, L.M., Cheung, R.C.C., Zhao, Y., Xue, Y.Z., Li, K.: Bidirectionally deformable motion modulation for video-based human pose transfer. ICCV (2023), pp. 7468–7478
  77. [78] Zablotskaia, P., Siarohin, A., Sigal, L., Zhao, B.: DwNet: Dense warp-based network for pose-guided human video generation. BMVC (2019)
  78. [79] Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4Diffusion: Multi-view video diffusion model for 4D generation. Advances in Neural Information Processing Systems 37 (2025)
  79. [80] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. ICCV (2023)
  80. [81] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. CVPR (2018)

Showing first 80 references.