
arxiv: 2604.17623 · v3 · pith:2RLAJPX5new · submitted 2026-04-19 · 💻 cs.CV · cs.GR

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

Pith reviewed 2026-05-22 11:27 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords pose space · video diffusion · kinematic rigs · auto-rigging · 3D mesh articulation · generative priors · zero-shot generalization · manifold learning

The pith

ViPS distills video diffusion priors into a latent distribution over rig parameters for any auto-rigged mesh.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-informed Pose Spaces (ViPS) to give kinematic rigs an explicit representation of valid joint configurations without relying on scarce artist-created 4D datasets. It transfers motion priors from a pretrained video diffusion model into a feedforward network that outputs a smooth distribution over the rig parameters of a given skinned mesh. Differentiable geometric validators applied after skinning enforce mesh-specific constraints, such as the absence of self-intersections and hyperextension. The resulting space supports direct sampling of diverse poses, projection onto the manifold for inverse kinematics, and generation of temporally coherent animation paths. Evaluations indicate that this video-only training matches the plausibility and diversity of models trained on synthetic 4D data while generalizing to new species and skeletal structures.

Core claim

ViPS is a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike methods that require artist-authored 4D data or reconstruct single motions, it transfers generative video priors into a universal distribution over the given rig parameterization, with differentiable geometric validators enforcing shape-specific integrity without manual regularizers.

What carries the argument

The ViPS feedforward model, which distills a universal distribution over rig parameters from video diffusion priors and is guided by differentiable geometric validators applied to the skinned mesh.
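
To make "differentiable geometric validator" concrete, here is a minimal Python sketch of one plausible form: a squared-hinge joint-limit penalty together with its analytic gradient. The function name, the limit values, and the penalty form are illustrative assumptions. The paper's validators act on the skinned mesh and their exact form is not given in this summary.

```python
def joint_limit_penalty(angles, lower, upper):
    """Squared-hinge penalty: zero inside [lower, upper], quadratic outside.

    Returns (penalty, gradient) so the term can be minimized by gradient descent.
    Hypothetical helper for illustration, not taken from the paper.
    """
    penalty = 0.0
    grad = []
    for a, lo, hi in zip(angles, lower, upper):
        if a < lo:
            d = a - lo          # negative: amount below the lower limit
        elif a > hi:
            d = a - hi          # positive: amount above the upper limit
        else:
            d = 0.0             # inside the allowed range, no penalty
        penalty += d * d
        grad.append(2.0 * d)    # derivative of d^2 with respect to the angle
    return penalty, grad


# Toy example: an elbow allowed to bend between 0 and 2.5 rad, hyperextended to -0.3.
p, g = joint_limit_penalty([-0.3, 1.0], [0.0, 0.0], [2.5, 2.5])
print(p, g)
```

Because the penalty is zero inside the limits and smooth at the boundary, it only pushes on poses that are already out of range, which is the behavior one would want from a validator that should not reshape the valid part of the distribution.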

If this is right

  • Direct sampling yields diverse yet plausible shape variations for the input mesh.
  • Projection onto the learned manifold performs inverse kinematics while staying in valid configurations.
  • Temporally coherent trajectories can be generated for animation and keyframing.
  • The distilled 3D pose samples act as semantic guides that close the loop back to video diffusion models.
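
The projection-based inverse kinematics in the second bullet can be sketched on a toy problem. The Python example below solves IK for a planar two-link arm by gradient descent, with a hand-written squared-hinge penalty on the elbow angle standing in for the learned pose prior. The arm, the limits, the weights, and all function names are assumptions made for illustration; ViPS itself projects onto a learned manifold rather than a fixed joint range.

```python
import math


def forward_kinematics(t1, t2):
    """End-effector position of a planar two-link arm with unit link lengths."""
    return (math.cos(t1) + math.cos(t1 + t2),
            math.sin(t1) + math.sin(t1 + t2))


def constrained_ik(target, start, elbow_range=(0.0, 2.5),
                   weight=10.0, lr=0.05, steps=2000):
    """Gradient descent on handle error plus a joint-limit penalty (prior stand-in)."""
    t1, t2 = start
    lo, hi = elbow_range
    for _ in range(steps):
        x, y = forward_kinematics(t1, t2)
        ex, ey = x - target[0], y - target[1]
        # Gradient of the squared handle error via the analytic Jacobian.
        g1 = (2 * ex * (-math.sin(t1) - math.sin(t1 + t2))
              + 2 * ey * (math.cos(t1) + math.cos(t1 + t2)))
        g2 = 2 * ex * (-math.sin(t1 + t2)) + 2 * ey * math.cos(t1 + t2)
        # Squared-hinge penalty gradient: active only when the elbow is out of range.
        d = (t2 - lo) if t2 < lo else (t2 - hi) if t2 > hi else 0.0
        g2 += weight * 2 * d
        t1 -= lr * g1
        t2 -= lr * g2
    return t1, t2


t1, t2 = constrained_ik(target=(1.0, 1.0), start=(0.3, 0.5))
x, y = forward_kinematics(t1, t2)
print(round(x, 3), round(y, 3), round(t2, 3))
```

The design point is that the handle constraint and the validity term are optimized together, so a sparse edit cannot be satisfied by bending the elbow backwards.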

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation process could be applied to other generative image or video models to bootstrap pose spaces for non-humanoid characters.
  • Real-time applications such as live character correction in games might become feasible if the feedforward network is distilled into a smaller runtime model.
  • Combining ViPS with user-provided motion clips could refine the pose space for domain-specific behaviors without retraining from scratch.
  • The approach opens a route to automatically equipping user-uploaded meshes with controllable animation interfaces in consumer tools.

Load-bearing premise

Priors learned by a video diffusion model contain enough information to produce accurate, shape-specific distributions over the joint angles of an arbitrary rigged mesh.

What would settle it

Sample many poses from a trained ViPS model on an unseen rigged mesh. A high rate of anatomically impossible configurations that violate the geometric validators, such as joint hyperextension or mesh self-intersection, would count against the claim.
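
A minimal sketch of how such a check could be scored, assuming per-joint angle limits are available for the test rig. The sampler here is uniform noise and is only a placeholder for a trained model; the limits and the function name are hypothetical, and self-intersection is not covered by this simple test.

```python
import random


def violation_rate(poses, lower, upper):
    """Fraction of poses with at least one joint angle outside its limits."""
    bad = sum(
        any(a < lo or a > hi for a, lo, hi in zip(pose, lower, upper))
        for pose in poses
    )
    return bad / len(poses)


# Placeholder sampler: uniform noise, NOT samples from a trained ViPS model.
rng = random.Random(0)
lower, upper = [0.0, -1.0], [2.5, 1.0]
poses = [[rng.uniform(-0.5, 3.0), rng.uniform(-1.5, 1.5)] for _ in range(1000)]

rate = violation_rate(poses, lower, upper)
print(f"violation rate: {rate:.2f}")
```

A model that has actually learned the pose space should score far below an unconstrained sampler like this one on the same rig.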

Figures

Figures reproduced from arXiv: 2604.17623 by Ayush Tewari, Changxi Zheng, Honglin Chen, Karran Pandey, Matheus Gadelha, Niloy J. Mitra, Paul Guerrero, Rundi Wu, Yannick Hold-Geoffroy.

Figure 1: Overview. We introduce ViPS, a universal feed-forward model that lifts static, auto-rigged meshes into a plausible and editable pose manifold. ViPS leverages the rich priors of foundational video models to automatically reveal a pose space that enables (a) manifold-constrained editing; (b) smooth pose-space interpolation, and (c) pose-guided video synthesis by using 3D proxies as structural guidance. The p…
Figure 2: The ViPS data pipeline. Given a species like ‘bear’, we first generate a clean single-object image of diverse species representatives using an image generator with prompts generated by a carefully instructed VLM. We then expand these images into many videos of diverse motions using a VLM-prompted video generator. We use a VLM to choose a frame α as rest pose, reconstruct a textured mesh from the frame usin…
Figure 3: The architecture of our universal pose denoiser. We base our architecture on AnyTop [6] that we modify to (i) model a…
Figure 4: Qualitative Comparison. We compare random pose samples generated with ViPS to all baselines on three individuals of the cross-species dataset and one individual of the single species dataset. Strongly implausible poses are marked in red. Our data generation allows generalization to a large range of species, such as the alien, and gives pose plausibility and diversity on par with the much more expensive Pu…
Figure 5: Manifold-Constrained Semantic Editing. ViPS enables precise Inverse Kinematics (IK) by projecting user-driven joint handles (orange→green) into the discovered plausible pose space. By leveraging our video-informed priors, our model ensures that sparse edits result in anatomically valid and structurally consistent poses across a wide range of species and skeletal topologies, effectively avoiding the geometr…

Original abstract

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
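
Since the validators are applied to the skinned mesh, it helps to see how rig parameters become vertex positions. The sketch below uses linear blend skinning in 2D as an assumed skinning model; this summary does not state which skinning scheme the paper uses. Bone transforms are given directly as rotations about pivots, with no joint hierarchy, and all names and values are illustrative.

```python
import math


def rotate_about(point, pivot, angle):
    """Rotate a 2D point around a pivot by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)


def linear_blend_skinning(vertices, weights, pivots, angles):
    """Each posed vertex is the weighted average of its per-bone transformed copies."""
    posed = []
    for v, w in zip(vertices, weights):
        x = y = 0.0
        for w_j, pivot, angle in zip(w, pivots, angles):
            px, py = rotate_about(v, pivot, angle)
            x += w_j * px
            y += w_j * py
        posed.append((x, y))
    return posed


# Two bones: bone 0 stays fixed, bone 1 rotates 90 degrees about the joint at (1, 0).
vertices = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]   # per-vertex weights sum to 1
posed = linear_blend_skinning(vertices, weights,
                              pivots=[(0.0, 0.0), (1.0, 0.0)],
                              angles=[0.0, math.pi / 2])
print(posed)
```

Every operation here is smooth in the joint angles, which is what allows a loss defined on posed vertices to be differentiated back to the rig parameters.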

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ViPS, a feedforward framework that distills motion priors from a pretrained video diffusion model into a latent distribution over rig parameters for auto-rigged skinned meshes. Differentiable geometric validators enforce shape-specific validity (self-intersection, joint limits) without 3D supervision or artist-authored 4D data. The central claims are that ViPS matches the performance of state-of-the-art models trained on synthetic 4D data in plausibility and diversity; that it supports sampling, IK projection, and animation; and that it exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

Significance. If the claims hold, the work would meaningfully reduce dependence on scarce artist-authored 4D datasets for creating controllable pose spaces, enabling broader use of auto-rigged meshes in animation and generative pipelines. The closed-loop use of distilled 3D samples to guide video diffusion is a notable technical direction.

major comments (2)
  1. [§4.2] §4.2 (zero-shot experiments): the claim of robust generalization to unseen skeletal topologies rests on a narrow set of test rigs; the quantitative tables report aggregate scores but do not break out failure modes for topologies with substantially different joint counts or connectivity, which is load-bearing for the 'universal model' assertion.
  2. [§3.2] §3.2 (differentiable validators): the description of the self-intersection and joint-limit validators does not address coverage of non-local interpenetrations or topology-specific hyperextension; without explicit completeness arguments or failure-case analysis, it is unclear whether the validators are tight enough to guarantee that video-prior samples remain geometrically valid after skinning.
minor comments (2)
  1. [Figure 5] Figure 5 caption: the diversity metric (e.g., whether it is average pairwise distance in pose space or in skinned vertex space) is not stated explicitly, making direct comparison to the SOTA baseline difficult.
  2. [§2] §2 (related work): the discussion of prior video-to-3D distillation methods omits recent works on motion priors from diffusion models that use explicit 3D consistency losses; adding these would better situate the contribution.
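
The first minor comment turns on where the diversity metric is measured. As a sketch of why that matters, the Python snippet below computes average pairwise distance for the same three poses, once in pose space and once in vertex space. The single-hinge toy rig, the bone length, and the function name are assumptions; the paper's actual metric is not specified in this summary.

```python
import math
from itertools import combinations


def average_pairwise_distance(samples):
    """Mean Euclidean distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Three poses of a single hinge joint driving a bone of length 2 (toy values).
poses = [[0.0], [math.pi / 2], [math.pi]]
tips = [[2 * math.cos(p[0]), 2 * math.sin(p[0])] for p in poses]

pose_diversity = average_pairwise_distance(poses)    # measured in radians
vertex_diversity = average_pairwise_distance(tips)   # measured in mesh units
print(pose_diversity, vertex_diversity)
```

The two numbers differ in both unit and scale, so a diversity score is only comparable across methods when the space it is computed in is stated.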

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript where they strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (zero-shot experiments): the claim of robust generalization to unseen skeletal topologies rests on a narrow set of test rigs; the quantitative tables report aggregate scores but do not break out failure modes for topologies with substantially different joint counts or connectivity, which is load-bearing for the 'universal model' assertion.

    Authors: The zero-shot section evaluates generalization on multiple rigs that differ in joint count and connectivity from the training distribution, with aggregate metrics showing comparable plausibility and diversity to supervised baselines. We agree that a per-topology breakdown would more directly support the universal-model claim. In the revision we will add further test rigs with substantially different skeletal structures, report disaggregated scores, and discuss observed failure modes by joint-count and connectivity categories. revision: yes

  2. Referee: [§3.2] §3.2 (differentiable validators): the description of the self-intersection and joint-limit validators does not address coverage of non-local interpenetrations or topology-specific hyperextension; without explicit completeness arguments or failure-case analysis, it is unclear whether the validators are tight enough to guarantee that video-prior samples remain geometrically valid after skinning.

    Authors: The validators combine a differentiable surface-intersection loss that penalizes both local and non-local collisions after skinning with per-joint limit constraints derived from the rig. While a formal completeness guarantee is difficult in the continuous high-dimensional space, the losses are applied directly to the skinned mesh. We will expand §3.2 with additional implementation details on non-local coverage, include a short failure-case analysis for topology-specific hyperextension, and report the fraction of samples that remain valid after projection. revision: yes
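
The disaggregated reporting promised in both responses could look like the following sketch, which groups per-rig valid-sample fractions by joint-count bucket. The bucket edges, the function names, and the input numbers are placeholders chosen for illustration; none of them are results from the paper.

```python
from collections import defaultdict


def bucket_label(joint_count, edges=(20, 40)):
    """Map a joint count to a coarse topology bucket (edges are hypothetical)."""
    if joint_count < edges[0]:
        return f"<{edges[0]} joints"
    if joint_count < edges[1]:
        return f"{edges[0]}-{edges[1] - 1} joints"
    return f">={edges[1]} joints"


def disaggregate(results, edges=(20, 40)):
    """Mean valid-sample fraction per joint-count bucket."""
    groups = defaultdict(list)
    for joint_count, valid_fraction in results:
        groups[bucket_label(joint_count, edges)].append(valid_fraction)
    return {label: sum(v) / len(v) for label, v in groups.items()}


# Placeholder inputs (joint count, fraction of valid samples); NOT results from the paper.
results = [(12, 1.0), (18, 0.5), (25, 0.75), (52, 0.25)]
table = disaggregate(results)
for label, mean_valid in table.items():
    print(label, mean_valid)
```

Reporting per bucket rather than in aggregate is what would expose a failure concentrated in rigs with many joints, which an overall mean can hide.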

Circularity Check

0 steps flagged

No significant circularity: derivation transfers external video priors via distillation and validators without self-referential reduction

full rationale

The paper's claimed derivation chain starts from an external pretrained video diffusion model whose motion priors are distilled into a latent distribution over rig parameters for a given skinned mesh. Differentiable geometric validators (self-intersection, joint limits) are then applied directly to skinned outputs to enforce validity. This setup does not define any quantity in terms of itself, fit a parameter on a data subset and rename the fit as a prediction, or invoke self-citations for uniqueness or ansatz smuggling. The central performance claim is benchmarked against independent external SOTA models trained on artist-authored 4D data, and zero-shot generalization is tested on out-of-distribution rigs. No equation or step reduces the output distribution to the input priors by algebraic construction; the transfer and validation steps remain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the transferability of 2D video priors to 3D kinematic validity and on the sufficiency of differentiable mesh validators; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Pretrained video diffusion models contain transferable priors over plausible 3D articulations for skinned meshes.
    This underpins the distillation step and is invoked to justify training without 4D data.
  • domain assumption: Differentiable geometric validators can enforce mesh integrity for any rig configuration without manual per-shape regularizers.
    Stated as enabling shape-specific integrity checks.

pith-pipeline@v0.9.0 · 5857 in / 1358 out tokens · 61610 ms · 2026-05-22T11:27:14.225467+00:00 · methodology



Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 5 internal anchors

  1. [1] Adobe: Adobe firefly. Software (2026), https://firefly.adobe.com, accessed: 2026-01-18
  2. [2] Cao, W., Luo, C., Zhang, B., Nießner, M., Tang, J.: Motion2vecsets: 4d latent vector set diffusion for non-rigid shape reconstruction and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024), https://vveicao.github.io/projects/Motion2VecSets/
  3. [3] Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)
  4. [4] Deng, Y., Zhang, Y., Geng, C., Wu, S., Wu, J.: Anymate: A dataset and baselines for learning 3d object rigging. In: SIGGRAPH (2025)
  5. [5] Dutt, N.S., Muralikrishnan, S., Mitra, N.J.: Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4494–4504 (June 2024)
  6. [6] Gat, I., Raab, S., Tevet, G., Reshef, Y., Bermano, A.H., Cohen-Or, D.: Anytop: Character animation diffusion with any topology. SIGGRAPH (2025)
  7. [7] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  8. [8] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshop on Deep Generative Models and Downstream Applications (2021)
  9. [9] Jiang, Z., Zheng, C., Laina, I., Larlus, D., Vedaldi, A.: Geo4d: Leveraging video generators for geometric 4d scene reconstruction (2025), https://arxiv.org/abs/2504.07961
  10. [10] Kavan, L., Collins, S., Žára, J., O’Sullivan, C.: Geometric skinning with approximate dual quaternion blending. SIGGRAPH pp. 105:1–105:10 (2008). https://doi.org/10.1145/1399504.1360717
  11. [11] Kilian, M., Mitra, N.J., Pottmann, H.: Geometric modeling in shape space. ACM Transactions on Graphics (SIGGRAPH) 26(3), #64, 1–8 (2007)
  12. [12] Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025)
  13. [13] Lengyel, J.E.: Compression of time-dependent geometry. In: Proceedings of the 1999 symposium on Interactive 3D graphics. pp. 89–95 (1999)
  14. [14] Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (2000)
  15. [15] Li, P., Aberman, K., Zhang, Z., Hanocka, R., Sorkine-Hornung, O.: Ganimator: Neural motion synthesis from a single sequence. ACM Transactions on Graphics (TOG) 41(4), 138 (2022)
  16. [16] Li, R., Yao, Y., Zheng, C., Rupprecht, C., Lasenby, J., Wu, S., Vedaldi, A.: Particulate: Feed-forward 3d object articulation. arXiv preprint arXiv:2512.11798 (2025)
  17. [17] Lipman, Y., Cohen-Or, D., Gal, R., Levin, D.: Volume and shape preservation via moving frame manipulation. ACM Trans. Graph. 26(1), 5–es (Jan 2007)
  18. [18] Liu, I., Xu, Z., Yifan, W., Tan, H., Xu, Z., Wang, X., Su, H., Shi, Z.: Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM TOG 44(4), 1–12 (2025)
  19. [19] Liu, J., Iliash, D., Chang, A.X., Savva, M., Mahdavi-Amiri, A.: SINGAPO: Single image controlled generation of articulated parts in object. arXiv preprint arXiv:2410.16499 (2024)
  20. [20] Liu, J., Kong, L., Zhou, M., Chen, J., Xu, D.: Mono4dgs-hdr: High dynamic range 4d gaussian splatting from alternating-exposure monocular videos. In: ICLR (2025), https://arxiv.org/abs/2510.18489
  21. [21] Lu, J., Lin, J., Dou, H., Zeng, A., Deng, Y., Liu, X., Cai, Z., Yang, L., Zhang, Y., Wang, H., Liu, Z.: Dposer-x: Diffusion model as robust 3d whole-body human pose prior. In: ICCV (2025)
  22. [22] Luo, Z., Ran, H., Lu, L.: Instant4d: 4d gaussian splatting in minutes. Advances in neural information processing systems (2025)
  23. [23] Magnenat-Thalmann, N., Laperrière, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on Graphics interface ’88. pp. 26–33 (1989)
  24. [24] Maiorca, A., Bohy, H., Yoon, Y., Dutoit, T.: Objective evaluation metric for motion generative models: Validating fréchet motion distance on foot skating and over-smoothing artifacts. In: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games (2023)
  25. [25] Maiorca, A., Yoon, Y., Dutoit, T.: Evaluating the quality of a synthesized motion with the fréchet motion distance. In: ACM SIGGRAPH 2022 Posters (2022)
  26. [26] Maystre, L., Grossglauser, M.: Fast and accurate inference of plackett–luce models. Advances in neural information processing systems 28 (2015)
  27. [27] Mo, C., Hu, K., Long, C., Yuan, D., Wang, Z.: Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction. In: Proc. ECCV. pp. 159–175 (2024)
  28. [28] Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion generation for arbitrary objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)
  29. [29] Muralikrishnan, S., Dutt, N.S., Mitra, N.J.: Smf: Template-free and rig-free animation transfer using kinetic codes (2025), https://arxiv.org/abs/2504.04831
  30. [30] Noronha, I., Chowdhury, A., Bharti, S., Kaur, U.: Quadforecaster: Diffusion-based quadruped pose prediction for animal communication analysis. In: NeurIPS workshop: AI for non-human animal communication (2025)
  31. [31] OpenAI: Sora 2. Software (2026), https://openai.com/index/sora-2/, accessed: 2026-01-18
  32. [32] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)
  33. [33] Pandey, K., Hold-Geoffroy, Y., Gadelha, M., Mitra, N.J., Singh, K., Guerrero, P.: Motion modes: What could happen next? In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2030–2039 (2025)
  34. [34] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 40(6) (Dec 2021)
  35. [35] Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: CVPR (2020)
  36. [36] Qiu, X., Yang, J., Wang, Y., Chen, Z., Wang, Y., Wang, T.H., Xian, Z., Gan, C.: Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590 (2025)
  37. [37] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
  38. [38] Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruction model. Adv. Neural Inform. Process. Syst. (2024), https://arxiv.org/abs/2406.10324
  39. [39] Sabathier, R., Novotny, D., Mitra, N.J., Monnier, T.: Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. In: CVPR (2026)
  40. [40] Song, C., Li, X., Yang, F., Xu, Z., Wei, J., Liu, F., Feng, J., Lin, G., Zhang, J.: Puppeteer: Rig and animate your 3d models. In: Adv. Neural Inform. Process. Syst. (2025)
  41. [41] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021)
  42. [42] Sumner, R.W., Popović, J.: Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23(3), 399–405 (2004)
  43. [43] THU-ML: TurboDiffusion: 100–200× acceleration for video diffusion models. https://github.com/thu-ml/TurboDiffusion (2025), GitHub repository, accessed 2026-01-18
  44. [44] Tiwari, G., Antic, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)
  45. [45] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,... Wan: Open and Advanced Large-Scale Video Generative Models
  46. [46] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
  47. [47] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: CVPR (2024)
  48. [48] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)
  49. [49] Wu, D., Liu, F., Hung, Y.H., Qian, Y., Zhan, X., Duan, Y.: 4d-fly: Fast 4d reconstruction from a single monocular video. In: Proc. CVPR. pp. 16663–16673 (06 2025). https://doi.org/10.1109/CVPR52734.2025.01553
  50. [50] Xu, Z., Zhou, Y., Kalogerakis, E., Landreth, C., Singh, K.: Rignet: neural rigging for articulated characters. ACM TOG 39(4), 58–1 (2020)
  51. [51] Xu, Z., Li, Z., Dong, Z., Zhou, X., Newcombe, R., Lv, Z.: 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. In: Adv. Neural Inform. Process. Syst. (2025)
  52. [52] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
  53. [53] Ye, Z., Liu, J.W., Jia, J., Sun, S., Shou, M.Z.: Skinned motion retargeting with dense geometric interaction perception. Advances in Neural Information Processing Systems (2024)
  54. [54] Yenphraphai, J., Mirzaei, A., Chen, J., Zou, J., Tulyakov, S., Yeh, R.A., Wonka, P., Wang, C.: Shapegen4d: Towards high quality 4d shape generation from videos. arXiv preprint (2025)
  55. [55] Yu, Y., Zhou, K., Xu, D., Shi, X., Bao, H., Guo, B., Shum, H.Y.: Mesh editing with poisson-based gradient field manipulation. In: ACM TOG. pp. 644–651 (2004)
  56. [56] Zhang, H., Luo, J., Wan, B., Zhao, Y., Li, Z., Vasilkovsky, M., Wang, C., Wang, J., Ahuja, N., Zhou, B.: Rigmo: Unifying rig and motion learning for generative animation. arXiv preprint arXiv:2601.06378 (2026)
  57. [57] Zhang, H., Xu, H., Feng, C., Jampani, V., Ahuja, N.: Physrig: Differentiable physics-based skinning and rigging framework for realistic articulated object modeling. In: ICCV (2025)
  58. [58] Zhang, J., Huang, S., Tu, Z., Chen, X., Zhan, X., Yu, G., Shan, Y.: Tapmo: Shape-aware motion generation of skeleton-free characters. In: The Twelfth International Conference on Learning Representations (ICLR) (2024)
  59. [59] Zhang, J., Weng, J., Kang, D., Zhao, F., Huang, S., Zhe, X., Bao, L., Shan, Y., Wang, J., Tu, Z.: Skinned motion retargeting with residual perception of motion semantics & geometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13864–13872 (2023)
  60. [60] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)

  61. [61] EXACT SUBJECT: Use the provided object/species name as the subject; do NOT change it to a different species/object
  62. [62] FULL-BODY / FULL-OBJECT (CRITICAL): The ENTIRE subject must be visible from end to end (e.g., head-to-toe / nose-to-tail / top-to-bottom). Absolutely NO cropping, NO cut-off limbs/ears/tail, NO partial framing. Keep the subject centered with generous margins
  63. [63] WIDE SHOT (CRITICAL): Use a wide shot with the camera pulled back enough to guarantee the whole subject fits comfortably in frame, with extra space around it
  64. [64] BACKGROUND: Solid pure white studio background. NO ground plane, NO visible floor line, NO horizon
  65. [65] LIGHTING: Even, soft, shadowless studio lighting. Avoid harsh cast shadows; keep reflections minimal
  66. [66] NO NEW ELEMENTS: Do NOT introduce any other objects, humans, animals, props, text, logos, watermarks, scenery, or clutter.
  67. [67] Use side or front three-quarter viewpoint; AVOID direct front or rear views. Diversity requirements (across prompts for the same subject): - Vary viewpoint (side / front three-quarter), slight tilt angle (especially for animals), and pose/orientation while ALWAYS keeping full-body / ...
  68. [68] **Use the Provided Motion EXACTLY:** The prompt MUST describe the target motion faithfully. Do NOT replace it with a different motion. Do NOT add extra actions beyond what is stated
  69. [69] **Keep Appearance CONSISTENT:** Use the provided appearance description as-is. Do NOT invent new colors, accessories, markings, clothing, species, or extra objects
  70. [70] **NO NEW ELEMENTS (VERY STRICT):** Do NOT introduce any new objects, humans, animals, props, accessories, text, logos, extra scenery items, or additional entities not explicitly present in the provided appearance description. If something is not mentioned in the appearance ...
  71. [71] **Background & Lighting:** Keep the background simple and stable, consistent with the provided appearance. No new items in the scene
  72. [72] **Camera Behavior:** STRICTLY STATIC FULL BODY SHOT. Tripod-locked. The entire object visible at all times. **Single shot, no transitions, no cuts.** NO camera movement. Avoid words like "pan", "zoom", "track", "dolly", "close-up", "follow", "cut", "scene change", "transition", "montage"
  73. [73] **Natural Physics:** Motion should have realistic weight and timing appropriate to the object
  74. [74] **Style Default:** If no style is specified, default to "Photorealistic, 4k, high fidelity."
  75. [75] **Length:** 60-90 words. Output ONLY the final prompt. No quotes. No bullet points. No extra commentary. Object (appearance, keep exact): <appearance_text> Target Motion (must use exact): <motion_text> Camera Constraint (must use exact): Static Full Body Shot (Tripod View). The entire object...