ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

Ayush Tewari; Changxi Zheng; Honglin Chen; Karran Pandey; Matheus Gadelha; Niloy J. Mitra; Paul Guerrero; Rundi Wu; Yannick Hold-Geoffroy

arxiv: 2604.17623 · v3 · pith:2RLAJPX5new · submitted 2026-04-19 · 💻 cs.CV · cs.GR

Pith reviewed 2026-05-22 11:27 UTC · model grok-4.3

keywords: pose space, video diffusion, kinematic rigs, auto-rigging, 3D mesh articulation, generative priors, zero-shot generalization, manifold learning

The pith

ViPS distills video diffusion priors into a latent distribution over rig parameters for any auto-rigged mesh.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-informed Pose Spaces to give kinematic rigs an explicit representation of valid joint configurations without relying on scarce artist-created 4D datasets. It transfers motion priors from a pretrained video diffusion model into a feedforward network that outputs a smooth distribution over the rig parameters of a given skinned mesh. Differentiable geometric validators applied after skinning enforce mesh-specific constraints such as no self-intersections or hyperextension. The resulting space supports direct sampling of diverse poses, projection onto the manifold for inverse kinematics, and generation of temporally coherent animation paths. Evaluations indicate that this video-only training matches the plausibility and diversity of models trained on synthetic 4D data while generalizing to new species and skeletal structures.

Core claim

ViPS is a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike methods that require artist-authored 4D data or reconstruct single motions, it transfers generative video priors into a universal distribution over the given rig parameterization, with differentiable geometric validators enforcing shape-specific integrity without manual regularizers.

What carries the argument

A feedforward model that distills a universal distribution over rig parameters from video diffusion priors, guided by differentiable geometric validators applied to the skinned mesh.

If this is right

Direct sampling yields diverse yet plausible shape variations for the input mesh.
Projection onto the learned manifold performs inverse kinematics while staying in valid configurations.
Temporally coherent trajectories can be generated for animation and keyframing.
The distilled 3D pose samples act as semantic guides that close the loop back to video diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation process could be applied to other generative image or video models to bootstrap pose spaces for non-humanoid characters.
Real-time applications such as live character correction in games might become feasible if the feedforward network is distilled into a smaller runtime model.
Combining ViPS with user-provided motion clips could refine the pose space for domain-specific behaviors without retraining from scratch.
The approach opens a route to automatically equipping user-uploaded meshes with controllable animation interfaces in consumer tools.

Load-bearing premise

Priors learned by a video diffusion model contain enough information to produce accurate, shape-specific distributions over the joint angles of an arbitrary rigged mesh.

What would settle it

Sampling many poses from a trained ViPS model on an unseen rigged mesh and observing a high rate of anatomically impossible configurations such as joint hyperextension or mesh self-intersections that violate the geometric validators.
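One way to make this test concrete is to measure the fraction of sampled poses that fail a validity check. The sketch below is illustrative only: it covers joint-limit violations, not self-intersections, and the function name `violation_rate`, the limit values, and the Gaussian stand-in for model samples are all assumptions rather than anything taken from the paper.

```python
import numpy as np

def violation_rate(poses, lower, upper, tol=0.0):
    """Fraction of poses with at least one joint angle outside [lower, upper].

    poses: (N, J) array of joint angles; lower/upper: (J,) per-joint limits.
    Hypothetical helper, not part of ViPS.
    """
    poses = np.asarray(poses, dtype=float)
    below = poses < (np.asarray(lower, dtype=float) - tol)
    above = poses > (np.asarray(upper, dtype=float) + tol)
    invalid = np.any(below | above, axis=1)  # one flag per pose
    return float(invalid.mean())

# Assumed limits for a toy 3-joint rig, in radians.
lower = np.array([-1.0, 0.0, -0.5])
upper = np.array([1.0, 2.0, 0.5])

# Stand-in for poses sampled from a trained model (assumption: Gaussian noise).
rng = np.random.default_rng(0)
samples = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.4, size=(1000, 3))

rate = violation_rate(samples, lower, upper)
print(f"violation rate: {rate:.3f}")
```

A high rate on an unseen rig would count against the premise; a full version of the test would also need a self-intersection check on the skinned mesh.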

Figures

Figures reproduced from arXiv: 2604.17623 by Ayush Tewari, Changxi Zheng, Honglin Chen, Karran Pandey, Matheus Gadelha, Niloy J. Mitra, Paul Guerrero, Rundi Wu, Yannick Hold-Geoffroy.

**Figure 1.** Overview. We introduce ViPS, a universal feed-forward model that lifts static, auto-rigged meshes into a plausible and editable pose manifold. ViPS leverages the rich priors of foundational video models to automatically reveal a pose space that enables (a) manifold-constrained editing; (b) smooth pose-space interpolation, and (c) pose-guided video synthesis by using 3D proxies as structural guidance. The p…

**Figure 2.** The ViPS data pipeline. Given a species like ‘bear’, we first generate a clean single-object image of diverse species representatives using an image generator with prompts generated by a carefully instructed VLM. We then expand these images into many videos of diverse motions using a VLM-prompted video generator. We use a VLM to choose a frame α as rest pose, reconstruct a textured mesh from the frame usin…

**Figure 3.** The architecture of our universal pose denoiser. We base our architecture on AnyTop [6] that we modify to (i) model a …

**Figure 4.** Qualitative Comparison. We compare random pose samples generated with ViPS to all baselines on three individuals of the cross-species dataset and one individual of the single species dataset. Strongly implausible poses are marked in red. Our data generation allows generalization to a large range of species, such as the alien, and gives pose plausibility and diversity on par with the much more expensive Pu…

**Figure 5.** Manifold-Constrained Semantic Editing. ViPS enables precise Inverse Kinematics (IK) by projecting user-driven joint handles (orange→green) into the discovered plausible pose space. By leveraging our video-informed priors, our model ensures that sparse edits result in anatomically valid and structurally consistent poses across a wide range of species and skeletal topologies, effectively avoiding the geometr…

Original abstract

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ViPS distills video diffusion priors into a universal rig pose space for auto-rigged meshes and claims to match SOTA without 4D data, but the validators' ability to enforce full geometric validity is the part that needs checking.

The main takeaway is that ViPS learns a distribution over rig parameters by pulling motion knowledge from a pretrained video diffusion model, then uses differentiable validators on the skinned mesh to filter for valid poses. This sidesteps the usual need for scarce artist 4D datasets and produces a feedforward model that supports sampling, inverse kinematics, and animation trajectories. They also close the loop by feeding the 3D samples back to guide the video model. That transfer from 2D priors to structured 3D kinematic control is the genuinely new piece here, and it looks more practical than per-instance reconstruction methods.

The zero-shot results on out-of-distribution species and unseen skeletal topologies are the part that could make this useful beyond the training rigs. Their evaluations report matching prior SOTA models on plausibility and diversity, which is a concrete result if the controls are solid. The soft spot is exactly the one the stress test flags: the validators have to catch the right classes of violations, including non-local interpenetrations and topology-specific issues, or the sampled poses will include invalid ones that still score well on the metrics. Video priors can produce 2D-consistent motion that does not hold up in 3D, and the zero-shot claim makes the validator generalization risk higher.

From the abstract the logic is consistent and the claims are testable rather than circular. This is for graphics and animation researchers who work with rigged meshes and want better priors for generation or editing tasks. It has enough substance and a clear experimental setup to deserve a serious referee, even if the validator details will need scrutiny in review. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes ViPS, a feedforward framework that distills motion priors from a pretrained video diffusion model into a latent distribution over rig parameters for auto-rigged skinned meshes. Differentiable geometric validators enforce shape-specific validity (self-intersection, joint limits) without 3D supervision or artist 4D data. The central claims are that ViPS matches SOTA performance (trained on synthetic 4D data) in plausibility and diversity, supports sampling, IK projection, and animation, and exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

Significance. If the claims hold, the work would meaningfully reduce dependence on scarce artist-authored 4D datasets for creating controllable pose spaces, enabling broader use of auto-rigged meshes in animation and generative pipelines. The closed-loop use of distilled 3D samples to guide video diffusion is a notable technical direction.

major comments (2)

§4.2 (zero-shot experiments): the claim of robust generalization to unseen skeletal topologies rests on a narrow set of test rigs; the quantitative tables report aggregate scores but do not break out failure modes for topologies with substantially different joint counts or connectivity, which is load-bearing for the 'universal model' assertion.
§3.2 (differentiable validators): the description of the self-intersection and joint-limit validators does not address coverage of non-local interpenetrations or topology-specific hyperextension; without explicit completeness arguments or failure-case analysis, it is unclear whether the validators are tight enough to guarantee that video-prior samples remain geometrically valid after skinning.

minor comments (2)

Figure 5 caption: the diversity metric (e.g., whether it is average pairwise distance in pose space or in skinned vertex space) is not stated explicitly, making direct comparison to the SOTA baseline difficult.
§2 (related work): the discussion of prior video-to-3D distillation methods omits recent works on motion priors from diffusion models that use explicit 3D consistency losses; adding these would better situate the contribution.
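The ambiguity raised in the first minor comment can be illustrated with a short sketch. Assuming diversity is measured as an average pairwise Euclidean distance (one of the two options the referee names, not a metric confirmed by the paper), the same function gives different numbers depending on whether it is fed joint parameters or skinned vertex positions. The name `mean_pairwise_distance` and the random stand-in data are hypothetical.

```python
import numpy as np

def mean_pairwise_distance(features):
    """Average Euclidean distance over all unordered pairs of samples.

    features: (N, ...) array; each sample is flattened to a vector first.
    Hypothetical helper, not the paper's metric implementation.
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[0]
    if n < 2:
        return 0.0
    x = x.reshape(n, -1)
    diffs = x[:, None, :] - x[None, :, :]   # (N, N, D) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)  # (N, N) distance matrix
    upper = np.triu_indices(n, k=1)         # count each pair once
    return float(dists[upper].mean())

rng = np.random.default_rng(0)
pose_samples = rng.normal(size=(8, 5))         # stand-in: 8 poses, 5 rig parameters
vertex_samples = rng.normal(size=(8, 100, 3))  # stand-in: 8 skinned meshes, 100 vertices

print("pose-space diversity:  ", mean_pairwise_distance(pose_samples))
print("vertex-space diversity:", mean_pairwise_distance(vertex_samples))
```

Because the two spaces have different dimensions and units, scores computed in one are not comparable to scores computed in the other, which is why the choice needs to be stated.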

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript where they strengthen the presentation of our results.

Referee: §4.2 (zero-shot experiments): the claim of robust generalization to unseen skeletal topologies rests on a narrow set of test rigs; the quantitative tables report aggregate scores but do not break out failure modes for topologies with substantially different joint counts or connectivity, which is load-bearing for the 'universal model' assertion.

Authors: The zero-shot section evaluates generalization on multiple rigs that differ in joint count and connectivity from the training distribution, with aggregate metrics showing comparable plausibility and diversity to supervised baselines. We agree that a per-topology breakdown would more directly support the universal-model claim. In the revision we will add further test rigs with substantially different skeletal structures, report disaggregated scores, and discuss observed failure modes by joint-count and connectivity categories.
Revision: yes

Referee: §3.2 (differentiable validators): the description of the self-intersection and joint-limit validators does not address coverage of non-local interpenetrations or topology-specific hyperextension; without explicit completeness arguments or failure-case analysis, it is unclear whether the validators are tight enough to guarantee that video-prior samples remain geometrically valid after skinning.

Authors: The validators combine a differentiable surface-intersection loss that penalizes both local and non-local collisions after skinning with per-joint limit constraints derived from the rig. While a formal completeness guarantee is difficult in the continuous high-dimensional space, the losses are applied directly to the skinned mesh. We will expand §3.2 with additional implementation details on non-local coverage, include a short failure-case analysis for topology-specific hyperextension, and report the fraction of samples that remain valid after projection.
Revision: yes
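To make the idea of a differentiable per-joint limit constraint concrete, here is a minimal sketch of a squared-hinge penalty with its analytic gradient. This is an assumed form chosen for illustration, not the paper's validator: the function name `joint_limit_penalty`, the limits, and the step size are hypothetical, and the surface-intersection loss is not covered.

```python
import numpy as np

def joint_limit_penalty(theta, lower, upper):
    """Squared-hinge penalty for joint angles outside [lower, upper].

    Returns (loss, grad). The loss is zero inside the limits and grows
    quadratically outside, so it is differentiable everywhere.
    Hypothetical form, not the validator used in ViPS.
    """
    theta = np.asarray(theta, dtype=float)
    over = np.maximum(theta - upper, 0.0)   # amount above the upper limit
    under = np.maximum(lower - theta, 0.0)  # amount below the lower limit
    loss = float(np.sum(over**2 + under**2))
    grad = 2.0 * over - 2.0 * under         # d(loss)/d(theta)
    return loss, grad

# Assumed limits for a toy 3-joint rig, in radians.
lower = np.array([-1.0, -1.0, -0.5])
upper = np.array([1.0, 1.0, 0.5])

theta = np.array([1.5, 0.0, -1.0])          # joints 0 and 2 violate their limits
loss, grad = joint_limit_penalty(theta, lower, upper)

# One gradient step with step size 0.5 lands exactly on the limits for this penalty.
theta_fixed = theta - 0.5 * grad
loss_fixed, _ = joint_limit_penalty(theta_fixed, lower, upper)
print(loss, grad, theta_fixed, loss_fixed)
```

A penalty of this kind only pushes individual joints back into range; it says nothing about collisions between distant body parts, which is the coverage question the referee raises.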

Circularity Check

0 steps flagged

No significant circularity: derivation transfers external video priors via distillation and validators without self-referential reduction

The paper's claimed derivation chain starts from an external pretrained video diffusion model whose motion priors are distilled into a latent distribution over rig parameters for a given skinned mesh. Differentiable geometric validators (self-intersection, joint limits) are then applied directly to skinned outputs to enforce validity. This setup does not define any quantity in terms of itself, fit a parameter on a data subset and rename the fit as a prediction, or invoke self-citations for uniqueness or ansatz smuggling. The central performance claim is benchmarked against independent external SOTA models trained on artist-authored 4D data, and zero-shot generalization is tested on out-of-distribution rigs. No equation or step reduces the output distribution to the input priors by algebraic construction; the transfer and validation steps remain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the transferability of 2D video priors to 3D kinematic validity and on the sufficiency of differentiable mesh validators; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Pretrained video diffusion models contain transferable priors over plausible 3D articulations for skinned meshes.
This underpins the distillation step and is invoked to justify training without 4D data.
domain assumption: Differentiable geometric validators can enforce mesh integrity for any rig configuration without manual per-shape regularizers.
Stated as enabling shape-specific integrity checks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

ViPS models a generative distribution over rig parameters using a diffusion model in rig space... Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 5 internal anchors

[1] Adobe: Adobe firefly. Software (2026), https://firefly.adobe.com, accessed: 2026-01-18
[2] Cao, W., Luo, C., Zhang, B., Nießner, M., Tang, J.: Motion2vecsets: 4d latent vector set diffusion for non-rigid shape reconstruction and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024), https://vveicao.github.io/projects/Motion2VecSets/
[3] Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)
[4] Deng, Y., Zhang, Y., Geng, C., Wu, S., Wu, J.: Anymate: A dataset and baselines for learning 3d object rigging. In: SIGGRAPH (2025)
[5] Dutt, N.S., Muralikrishnan, S., Mitra, N.J.: Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4494–4504 (June 2024)
[6] Gat, I., Raab, S., Tevet, G., Reshef, Y., Bermano, A.H., Cohen-Or, D.: Anytop: Character animation diffusion with any topology. SIGGRAPH (2025)
[7] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[8] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshop on Deep Generative Models and Downstream Applications (2021)
[9] Jiang, Z., Zheng, C., Laina, I., Larlus, D., Vedaldi, A.: Geo4d: Leveraging video generators for geometric 4d scene reconstruction (2025), https://arxiv.org/abs/2504.07961
[10] Kavan, L., Collins, S., Žára, J., O’Sullivan, C.: Geometric skinning with approximate dual quaternion blending. SIGGRAPH pp. 105:1–105:10 (2008). https://doi.org/10.1145/1399504.1360717
[11] Kilian, M., Mitra, N.J., Pottmann, H.: Geometric modeling in shape space. ACM Transactions on Graphics (SIGGRAPH) 26(3), #64, 1–8 (2007)
[12] Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025)
[13] Lengyel, J.E.: Compression of time-dependent geometry. In: Proceedings of the 1999 symposium on Interactive 3D graphics. pp. 89–95 (1999)
[14] Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (2000)
[15] Li, P., Aberman, K., Zhang, Z., Hanocka, R., Sorkine-Hornung, O.: Ganimator: Neural motion synthesis from a single sequence. ACM Transactions on Graphics (TOG) 41(4), 138 (2022)
[16] Li, R., Yao, Y., Zheng, C., Rupprecht, C., Lasenby, J., Wu, S., Vedaldi, A.: Particulate: Feed-forward 3d object articulation. arXiv preprint arXiv:2512.11798 (2025)
[17] Lipman, Y., Cohen-Or, D., Gal, R., Levin, D.: Volume and shape preservation via moving frame manipulation. ACM Trans. Graph. 26(1), 5–es (Jan 2007)
[18] Liu, I., Xu, Z., Yifan, W., Tan, H., Xu, Z., Wang, X., Su, H., Shi, Z.: Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM TOG 44(4), 1–12 (2025)
[19] Liu, J., Iliash, D., Chang, A.X., Savva, M., Mahdavi-Amiri, A.: SINGAPO: Single image controlled generation of articulated parts in object. arXiv preprint arXiv:2410.16499 (2024)
[20] Liu, J., Kong, L., Zhou, M., Chen, J., Xu, D.: Mono4dgs-hdr: High dynamic range 4d gaussian splatting from alternating-exposure monocular videos. In: ICLR (2025), https://arxiv.org/abs/2510.18489
[21] Lu, J., Lin, J., Dou, H., Zeng, A., Deng, Y., Liu, X., Cai, Z., Yang, L., Zhang, Y., Wang, H., Liu, Z.: Dposer-x: Diffusion model as robust 3d whole-body human pose prior. In: ICCV (2025)
[22] Luo, Z., Ran, H., Lu, L.: Instant4d: 4d gaussian splatting in minutes. Advances in neural information processing systems (2025)
[23] Magnenat-Thalmann, N., Laperrière, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on Graphics interface’88. pp. 26–33 (1989)
[24] Maiorca, A., Bohy, H., Yoon, Y., Dutoit, T.: Objective evaluation metric for motion generative models: Validating fréchet motion distance on foot skating and over-smoothing artifacts. In: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games (2023)
[25] Maiorca, A., Yoon, Y., Dutoit, T.: Evaluating the quality of a synthesized motion with the fréchet motion distance. In: ACM SIGGRAPH 2022 Posters (2022)
[26] Maystre, L., Grossglauser, M.: Fast and accurate inference of plackett–luce models. Advances in neural information processing systems 28 (2015)
[27] Mo, C., Hu, K., Long, C., Yuan, D., Wang, Z.: Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction. In: Proc. ECCV. pp. 159–175 (2024)
[28] Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion generation for arbitrary objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)
[29] Muralikrishnan, S., Dutt, N.S., Mitra, N.J.: Smf: Template-free and rig-free animation transfer using kinetic codes (2025), https://arxiv.org/abs/2504.04831
[30] Noronha, I., Chowdhury, A., Bharti, S., Kaur, U.: Quadforecaster: Diffusion-based quadruped pose prediction for animal communication analysis. In: NeurIPS workshop: AI for non-human animal communication (2025)
[31] OpenAI: Sora 2. Software (2026), https://openai.com/index/sora-2/, accessed: 2026-01-18
[32] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)
[33] Pandey, K., Hold-Geoffroy, Y., Gadelha, M., Mitra, N.J., Singh, K., Guerrero, P.: Motion modes: What could happen next? In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2030–2039 (2025)
[34] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 40(6) (Dec 2021)
[35] Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: CVPR (2020)
[36] Qiu, X., Yang, J., Wang, Y., Chen, Z., Wang, Y., Wang, T.H., Xian, Z., Gan, C.: Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590 (2025)
[37] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
[38] Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruction model. Adv. Neural Inform. Process. Syst. (2024), https://arxiv.org/abs/2406.10324
[39] Sabathier, R., Novotny, D., Mitra, N.J., Monnier, T.: Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. In: CVPR (2026)
[40] Song, C., Li, X., Yang, F., Xu, Z., Wei, J., Liu, F., Feng, J., Lin, G., Zhang, J.: Puppeteer: Rig and animate your 3d models. In: Adv. Neural Inform. Process. Syst. (2025)
[41] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021)
[42] Sumner, R.W., Popović, J.: Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23(3), 399–405 (2004)
[43] THU-ML: TurboDiffusion: 100–200× acceleration for video diffusion models. https://github.com/thu-ml/TurboDiffusion (2025), GitHub repository, accessed 2026-01-18
[44] Tiwari, G., Antic, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)
[45] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...: Wan: Open and Advanced Large-Scale Video Generative Models
[46] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
[47] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: CVPR (2024)
[48] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)
[49] Wu, D., Liu, F., Hung, Y.H., Qian, Y., Zhan, X., Duan, Y.: 4d-fly: Fast 4d reconstruction from a single monocular video. In: Proc. CVPR. pp. 16663–16673 (06 2025). https://doi.org/10.1109/CVPR52734.2025.01553
[50] Xu, Z., Zhou, Y., Kalogerakis, E., Landreth, C., Singh, K.: Rignet: neural rigging for articulated characters. ACM TOG 39(4), 58–1 (2020)
[51] Xu, Z., Li, Z., Dong, Z., Zhou, X., Newcombe, R., Lv, Z.: 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. In: Adv. Neural Inform. Process. Syst. (2025)
[52] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
[53] Ye, Z., Liu, J.W., Jia, J., Sun, S., Shou, M.Z.: Skinned motion retargeting with dense geometric interaction perception. Advances in Neural Information Processing Systems (2024)
[54] Yenphraphai, J., Mirzaei, A., Chen, J., Zou, J., Tulyakov, S., Yeh, R.A., Wonka, P., Wang, C.: Shapegen4d: Towards high quality 4d shape generation from videos. arXiv preprint (2025)
[55] Yu, Y., Zhou, K., Xu, D., Shi, X., Bao, H., Guo, B., Shum, H.Y.: Mesh editing with poisson-based gradient field manipulation. In: ACM TOG. pp. 644–651 (2004)
[56] Zhang, H., Luo, J., Wan, B., Zhao, Y., Li, Z., Vasilkovsky, M., Wang, C., Wang, J., Ahuja, N., Zhou, B.: Rigmo: Unifying rig and motion learning for generative animation. arXiv preprint arXiv:2601.06378 (2026)
[57] Zhang, H., Xu, H., Feng, C., Jampani, V., Ahuja, N.: Physrig: Differentiable physics-based skinning and rigging framework for realistic articulated object modeling. In: ICCV (2025)
[58] Zhang, J., Huang, S., Tu, Z., Chen, X., Zhan, X., Yu, G., Shan, Y.: Tapmo: Shape-aware motion generation of skeleton-free characters. In: The Twelfth International Conference on Learning Representations (ICLR) (2024)
[59] Zhang, J., Weng, J., Kang, D., Zhao, F., Huang, S., Zhe, X., Bao, L., Shan, Y., Wang, J., Tu, Z.: Skinned motion retargeting with residual perception of motion semantics & geometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13864–13872 (2023)
[60] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
[61]

EXACT SUBJECT : Use the pro vi de d object / species name as the subject ; do NOT change it to a d i f f e r e n t species / object

work page
[62]

FULL - BODY / FULL - OBJECT ( CR IT IC AL ) : The ENTIRE subject must be visible from end to end ( e . g . , head - to - toe / nose - to - tail / top - to - bottom ) . A b s o l u t e l y NO cropping , NO cut - off limbs / ears / tail , NO partial framing . Keep the subject ce nt er ed with ge ner ou s margins

work page
[63]

WIDE SHOT ( C RI TI CA L ) : Use a wide shot with the camera pulled back enough to g u a r a n t e e the whole subject fits c o m f o r t a b l y in frame , with extra space around it

work page
[64]

NO ground plane , NO visible floor line , NO horizon

B A C K G R O U N D : Solid pure white studio b a c k g r o u n d . NO ground plane , NO visible floor line , NO horizon

work page
[65]

Avoid harsh cast shadows ; keep r e f l e c t i o n s minimal

L IG HT IN G : Even , soft , s h a d o w l e s s studio li gh ti ng . Avoid harsh cast shadows ; keep r e f l e c t i o n s minimal

work page
[66]

ViPS: Video-informed Pose Spaces – Supplementary Material 7

NO NEW E LE ME NTS : Do NOT i n t r o d u c e any other objects , humans , animals , props , text , logos , watermarks , scenery , or clutter . ViPS: Video-informed Pose Spaces – Supplementary Material 7

work page
[67]

Use side or front three - quarter v i e w p o i n t ; AVOID direct front or rear views . D i v e r s i t y r e q u i r e m e n t s ( across prompts for the same subject ) : - Vary v i e w p o i n t ( side / front three - quarter ) , slight tilt angle ( e s p e c i a l l y for animals ) , and pose / o r i e n t a t i o n while ALWAYS keeping full - body / ...

work page
[68]

Do NOT replace it with a d i f f e r e n t motion

** Use the P rov id ed Motion EXACTLY :** The prompt MUST des cr ib e the target motion f a i t h f u l l y . Do NOT replace it with a d i f f e r e n t motion . Do NOT add extra actions beyond what is stated

work page
[69]

Do NOT invent new colors , accessories , markings , clothing , species , or extra objects

** Keep A p p e a r a n c e C O N S I S T E N T :** Use the pr ov id ed a p p e a r a n c e d e s c r i p t i o n as - is . Do NOT invent new colors , accessories , markings , clothing , species , or extra objects

work page
[70]

If s o m e t h i n g is not m e n t i o n e d in the appearance , it must not appear in the prompt

** NO NEW EL EM EN TS ( VERY STRICT ) :** Do NOT i n t r o d u c e any new objects , humans , animals , props , accessories , text , logos , extra scenery items , or a d d i t i o n a l en ti tie s not e x p l i c i t l y present in the p ro vid ed a p p e a r a n c e d e s c r i p t i o n . If s o m e t h i n g is not m e n t i o n e d in the appearance ...

work page
[71]

No new items in the scene

** B a c k g r o u n d & L ig ht in g :** Keep the b a c k g r o u n d simple and stable , c o n s i s t e n t with the p ro vid ed a p p e a r a n c e . No new items in the scene

work page
[72]

pan " ,

** Camera Be ha vio r :** ST RI CT LY STATIC FULL BODY SHOT . Tripod - locked . The entire object visible at all times . ** Single shot , no transitions , no cuts .** NO camera m ov eme nt . Avoid words like " pan " , " zoom " , " track " , " dolly " , " close - up " , " follow " , " cut " , " scene change " , " t r a n s i t i o n " , " montage "

work page
[73]

** Natural Physics :** Motion should have r e a l i s t i c weight and timing a p p r o p r i a t e to the object

work page
[74]

Photorealistic , 4 k , high f id eli ty

** Style Default :** If no style is specified , default to " Photorealistic , 4 k , high f id eli ty ."

work page
[75]

8 Honglin Chen et al

** Length :** 60 -90 words . 8 Honglin Chen et al. Output ONLY the final prompt . No quotes . No bullet points . No extra c o m m e n t a r y . Object ( appearance , keep exact ) : < appearance_text > Target Motion ( must use exact ) : < motion_text > Camera C o n s t r a i n t ( must use exact ) : Static Full Body Shot ( Tripod View ) . The entire object...

work page

[1] Adobe: Adobe firefly. Software (2026), https://firefly.adobe.com, accessed: 2026-01-18

[2] Cao, W., Luo, C., Zhang, B., Nießner, M., Tang, J.: Motion2vecsets: 4d latent vector set diffusion for non-rigid shape reconstruction and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024), https://vveicao.github.io/projects/Motion2VecSets/

[3] Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

[4] Deng, Y., Zhang, Y., Geng, C., Wu, S., Wu, J.: Anymate: A dataset and baselines for learning 3d object rigging. In: SIGGRAPH (2025)

[5] Dutt, N.S., Muralikrishnan, S., Mitra, N.J.: Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4494–4504 (June 2024)

[6] Gat, I., Raab, S., Tevet, G., Reshef, Y., Bermano, A.H., Cohen-Or, D.: Anytop: Character animation diffusion with any topology. SIGGRAPH (2025)

[7] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

[8] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshop on Deep Generative Models and Downstream Applications (2021)

[9] Jiang, Z., Zheng, C., Laina, I., Larlus, D., Vedaldi, A.: Geo4d: Leveraging video generators for geometric 4d scene reconstruction (2025), https://arxiv.org/abs/2504.07961

[10] Kavan, L., Collins, S., Žára, J., O’Sullivan, C.: Geometric skinning with approximate dual quaternion blending. SIGGRAPH pp. 105:1–105:10 (2008). https://doi.org/10.1145/1399504.1360717

[11] Kilian, M., Mitra, N.J., Pottmann, H.: Geometric modeling in shape space. ACM Transactions on Graphics (SIGGRAPH) 26(3), #64, 1–8 (2007)

[12] Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025)

[13] Lengyel, J.E.: Compression of time-dependent geometry. In: Proceedings of the 1999 symposium on Interactive 3D graphics. pp. 89–95 (1999)

[14] Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (2000)

[15] Li, P., Aberman, K., Zhang, Z., Hanocka, R., Sorkine-Hornung, O.: Ganimator: Neural motion synthesis from a single sequence. ACM Transactions on Graphics (TOG) 41(4), 138 (2022)

[16] Li, R., Yao, Y., Zheng, C., Rupprecht, C., Lasenby, J., Wu, S., Vedaldi, A.: Particulate: Feed-forward 3d object articulation. arXiv preprint arXiv:2512.11798 (2025)

[17] Lipman, Y., Cohen-Or, D., Gal, R., Levin, D.: Volume and shape preservation via moving frame manipulation. ACM Trans. Graph. 26(1), 5–es (Jan 2007)

[18] Liu, I., Xu, Z., Yifan, W., Tan, H., Xu, Z., Wang, X., Su, H., Shi, Z.: Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM TOG 44(4), 1–12 (2025)

[19] Liu, J., Iliash, D., Chang, A.X., Savva, M., Mahdavi-Amiri, A.: SINGAPO: Single image controlled generation of articulated parts in object. arXiv preprint arXiv:2410.16499 (2024)

[20] Liu, J., Kong, L., Zhou, M., Chen, J., Xu, D.: Mono4dgs-hdr: High dynamic range 4d gaussian splatting from alternating-exposure monocular videos. In: ICLR (2025), https://arxiv.org/abs/2510.18489

[21] Lu, J., Lin, J., Dou, H., Zeng, A., Deng, Y., Liu, X., Cai, Z., Yang, L., Zhang, Y., Wang, H., Liu, Z.: Dposer-x: Diffusion model as robust 3d whole-body human pose prior. In: ICCV (2025)

[22] Luo, Z., Ran, H., Lu, L.: Instant4d: 4d gaussian splatting in minutes. Advances in neural information processing systems (2025)

[23] Magnenat-Thalmann, N., Laperrière, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on Graphics interface ’88. pp. 26–33 (1989)

[24] Maiorca, A., Bohy, H., Yoon, Y., Dutoit, T.: Objective evaluation metric for motion generative models: Validating fréchet motion distance on foot skating and over-smoothing artifacts. In: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games (2023)

[25] Maiorca, A., Yoon, Y., Dutoit, T.: Evaluating the quality of a synthesized motion with the fréchet motion distance. In: ACM SIGGRAPH 2022 Posters (2022)

[26] Maystre, L., Grossglauser, M.: Fast and accurate inference of plackett–luce models. Advances in neural information processing systems 28 (2015)

[27] Mo, C., Hu, K., Long, C., Yuan, D., Wang, Z.: Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction. In: Proc. ECCV. pp. 159–175 (2024)

[28] Mou, L., Lei, J., Wang, C., Liu, L., Daniilidis, K.: Dimo: Diverse 3d motion generation for arbitrary objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

[29] Muralikrishnan, S., Dutt, N.S., Mitra, N.J.: Smf: Template-free and rig-free animation transfer using kinetic codes (2025), https://arxiv.org/abs/2504.04831

[30] Noronha, I., Chowdhury, A., Bharti, S., Kaur, U.: Quadforecaster: Diffusion-based quadruped pose prediction for animal communication analysis. In: NeurIPS workshop: AI for non-human animal communication (2025)

[31] OpenAI: Sora 2. Software (2026), https://openai.com/index/sora-2/, accessed: 2026-01-18

[32] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)

[33] Pandey, K., Hold-Geoffroy, Y., Gadelha, M., Mitra, N.J., Singh, K., Guerrero, P.: Motion modes: What could happen next? In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2030–2039 (2025)

[34] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 40(6) (Dec 2021)

[35] Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: CVPR (2020)

[36] Qiu, X., Yang, J., Wang, Y., Chen, Z., Wang, Y., Wang, T.H., Xian, Z., Gan, C.: Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590 (2025)

[37] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)

[38] Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruction model. Adv. Neural Inform. Process. Syst. (2024), https://arxiv.org/abs/2406.10324

[39] Sabathier, R., Novotny, D., Mitra, N.J., Monnier, T.: Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. In: CVPR (2026)

[40] Song, C., Li, X., Yang, F., Xu, Z., Wei, J., Liu, F., Feng, J., Lin, G., Zhang, J.: Puppeteer: Rig and animate your 3d models. In: Adv. Neural Inform. Process. Syst. (2025)

[41] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021)

[42] Sumner, R.W., Popović, J.: Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23(3), 399–405 (2004)

[43] THU-ML: TurboDiffusion: 100–200× acceleration for video diffusion models. https://github.com/thu-ml/TurboDiffusion (2025), GitHub repository, accessed 2026-01-18

[44] Tiwari, G., Antic, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)

[45] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,... Wan: Open and Advanced Large-Scale Video Generative Models. arXiv (2025)

[46] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

[47] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: CVPR (2024)

[48] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

[49] Wu, D., Liu, F., Hung, Y.H., Qian, Y., Zhan, X., Duan, Y.: 4d-fly: Fast 4d reconstruction from a single monocular video. In: Proc. CVPR. pp. 16663–16673 (06 2025). https://doi.org/10.1109/CVPR52734.2025.01553

[50] Xu, Z., Zhou, Y., Kalogerakis, E., Landreth, C., Singh, K.: Rignet: neural rigging for articulated characters. ACM TOG 39(4), 58–1 (2020)

[51] Xu, Z., Li, Z., Dong, Z., Zhou, X., Newcombe, R., Lv, Z.: 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. In: Adv. Neural Inform. Process. Syst. (2025)

[52] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[53] Ye, Z., Liu, J.W., Jia, J., Sun, S., Shou, M.Z.: Skinned motion retargeting with dense geometric interaction perception. Advances in Neural Information Processing Systems (2024)

[54] Yenphraphai, J., Mirzaei, A., Chen, J., Zou, J., Tulyakov, S., Yeh, R.A., Wonka, P., Wang, C.: Shapegen4d: Towards high quality 4d shape generation from videos. arXiv preprint (2025)

[55] Yu, Y., Zhou, K., Xu, D., Shi, X., Bao, H., Guo, B., Shum, H.Y.: Mesh editing with poisson-based gradient field manipulation. In: ACM TOG. pp. 644–651 (2004)

[56] Zhang, H., Luo, J., Wan, B., Zhao, Y., Li, Z., Vasilkovsky, M., Wang, C., Wang, J., Ahuja, N., Zhou, B.: Rigmo: Unifying rig and motion learning for generative animation. arXiv preprint arXiv:2601.06378 (2026)

[57] Zhang, H., Xu, H., Feng, C., Jampani, V., Ahuja, N.: Physrig: Differentiable physics-based skinning and rigging framework for realistic articulated object modeling. In: ICCV (2025)

[58] Zhang, J., Huang, S., Tu, Z., Chen, X., Zhan, X., Yu, G., Shan, Y.: Tapmo: Shape-aware motion generation of skeleton-free characters. In: The Twelfth International Conference on Learning Representations (ICLR) (2024)

[59] Zhang, J., Weng, J., Kang, D., Zhao, F., Huang, S., Zhe, X., Bao, L., Shan, Y., Wang, J., Tu, Z.: Skinned motion retargeting with residual perception of motion semantics & geometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13864–13872 (2023)

[60] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)

[61] EXACT SUBJECT: Use the provided object/species name as the subject; do NOT change it to a different species/object

[62] FULL-BODY / FULL-OBJECT (CRITICAL): The ENTIRE subject must be visible from end to end (e.g., head-to-toe / nose-to-tail / top-to-bottom). Absolutely NO cropping, NO cut-off limbs/ears/tail, NO partial framing. Keep the subject centered with generous margins

[63] WIDE SHOT (CRITICAL): Use a wide shot with the camera pulled back enough to guarantee the whole subject fits comfortably in frame, with extra space around it

[64] BACKGROUND: Solid pure white studio background. NO ground plane, NO visible floor line, NO horizon

[65] LIGHTING: Even, soft, shadowless studio lighting. Avoid harsh cast shadows; keep reflections minimal

[66] NO NEW ELEMENTS: Do NOT introduce any other objects, humans, animals, props, text, logos, watermarks, scenery, or clutter.

[67] Use side or front three-quarter viewpoint; AVOID direct front or rear views. Diversity requirements (across prompts for the same subject): - Vary viewpoint (side / front three-quarter), slight tilt angle (especially for animals), and pose/orientation while ALWAYS keeping full-body / ...

[68] **Use the Provided Motion EXACTLY:** The prompt MUST describe the target motion faithfully. Do NOT replace it with a different motion. Do NOT add extra actions beyond what is stated

[69] **Keep Appearance CONSISTENT:** Use the provided appearance description as-is. Do NOT invent new colors, accessories, markings, clothing, species, or extra objects

[70] **NO NEW ELEMENTS (VERY STRICT):** Do NOT introduce any new objects, humans, animals, props, accessories, text, logos, extra scenery items, or additional entities not explicitly present in the provided appearance description. If something is not mentioned in the appearance, it must not appear in the prompt ...

[71] **Background & Lighting:** Keep the background simple and stable, consistent with the provided appearance. No new items in the scene

[72] **Camera Behavior:** STRICTLY STATIC FULL BODY SHOT. Tripod-locked. The entire object visible at all times. **Single shot, no transitions, no cuts.** NO camera movement. Avoid words like "pan", "zoom", "track", "dolly", "close-up", "follow", "cut", "scene change", "transition", "montage"

[73] **Natural Physics:** Motion should have realistic weight and timing appropriate to the object

[74] **Style Default:** If no style is specified, default to "Photorealistic, 4k, high fidelity."

[75] **Length:** 60-90 words. Output ONLY the final prompt. No quotes. No bullet points. No extra commentary. Object (appearance, keep exact): <appearance_text> Target Motion (must use exact): <motion_text> Camera Constraint (must use exact): Static Full Body Shot (Tripod View). The entire object...