pith. machine review for the scientific record.

arxiv: 2512.10881 · v2 · submitted 2025-12-11 · 💻 cs.CV

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Pith reviewed 2026-05-16 23:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion capture · 3D skeletal animation · monocular video · inverse kinematics · rigged assets · motion retargeting · category-agnostic · BVH animation

The pith

MoCapAnything reconstructs rotation-based animations for arbitrary rigged 3D assets directly from monocular video using the asset itself as a structural prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Category-Agnostic Motion Capture as the task of turning a monocular video and any rigged 3D asset into a rotation sequence such as BVH that drives that exact asset. It introduces a factorized pipeline that first predicts 3D joint trajectories by combining video-derived features with per-joint queries extracted from the asset, then converts those trajectories into asset-specific rotations through constraint-aware inverse kinematics. The system relies on three learned components: a Reference Prompt Encoder that ingests the asset skeleton, mesh and rendered views; a Video Feature Extractor that builds dense descriptors and a coarse 4D deforming mesh; and a Unified Motion Decoder that fuses the cues into coherent trajectories. Experiments on both controlled benchmarks and in-the-wild footage, supported by a new dataset of 1038 standardized skeleton-mesh-render clips, show the method produces usable animations and performs cross-species retargeting on rigs never seen during training.
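For readers unfamiliar with the output format, the sketch below shows roughly what a rotation-based BVH clip looks like. It is an illustration only: the function name `write_bvh`, the two-joint chain, the joint names, the offsets and the Z/X/Y channel order are all assumptions made here, not details taken from the paper.

```python
from typing import Sequence


def write_bvh(frames: Sequence[Sequence[float]], frame_time: float = 1.0 / 30.0) -> str:
    """Serialize a hypothetical two-joint chain (Root -> Child) as BVH text.

    Each frame holds 9 channel values: root translation (3), root Z/X/Y
    rotation in degrees (3), child Z/X/Y rotation in degrees (3).
    """
    header = [
        "HIERARCHY",
        "ROOT Root",
        "{",
        "  OFFSET 0.0 0.0 0.0",
        "  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation",
        "  JOINT Child",
        "  {",
        "    OFFSET 0.0 1.0 0.0",
        "    CHANNELS 3 Zrotation Xrotation Yrotation",
        "    End Site",
        "    {",
        "      OFFSET 0.0 1.0 0.0",
        "    }",
        "  }",
        "}",
        "MOTION",
        f"Frames: {len(frames)}",
        f"Frame Time: {frame_time:.6f}",
    ]
    rows = []
    for frame in frames:
        if len(frame) != 9:
            raise ValueError("expected 9 channel values per frame")
        # One line per frame, channels in the order declared in the hierarchy.
        rows.append(" ".join(f"{value:.4f}" for value in frame))
    return "\n".join(header + rows) + "\n"


if __name__ == "__main__":
    # Two frames: rest pose, then a small rotation on both joints.
    print(write_bvh([[0.0] * 9, [0, 0, 0, 10, 0, 0, 20, 0, 0]]))
```

The point of the format is that the hierarchy block is fixed by the asset, so any capture method targeting that asset only has to fill in the per-frame channel rows.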

Core claim

A reference-guided framework first extracts per-joint queries from an arbitrary asset's skeleton, mesh and rendered images, fuses them with dense video features and a reconstructed coarse 4D mesh, decodes temporally coherent 3D joint trajectories, and finally recovers asset-specific rotations via constraint-aware inverse kinematics, thereby producing driveable BVH-style animations for any rigged asset from a single monocular video.

What carries the argument

The Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh and rendered images to condition the entire motion prediction pipeline.

If this is right

  • Prompt-driven capture becomes possible for any rigged asset without species- or template-specific retraining.
  • Cross-species retargeting works across heterogeneous skeleton topologies while preserving motion semantics.
  • Temporally coherent 3D trajectories can be produced directly from in-the-wild monocular footage.
  • A standardized skeleton-mesh-render dataset of over one thousand clips can serve as a common benchmark for category-agnostic methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorization could be tested on multi-view video inputs to reduce depth ambiguity in the trajectory prediction stage.
  • Replacing the lightweight IK stage with a learned rotation predictor might allow end-to-end differentiability for further fine-tuning on custom rigs.
  • Combining the prompt encoder with generative video models could enable creation of new motions that still respect the target asset's exact skeleton constraints.

Load-bearing premise

The rigged 3D asset supplies enough structural information through its skeleton, mesh and rendered images for the prompt encoder to produce accurate per-joint queries that generalize to unseen rigs and motions.

What would settle it

Running the system on videos of animals whose limb topology and joint count differ markedly from the Truebones Zoo training distribution and checking whether the output rotations visibly fail to reproduce the observed motion when applied to the input asset mesh.

Figures

Figures reproduced from arXiv: 2512.10881 by Dongze Lian, Kehong Gong, Mingxi Xu, Mingyuan Zhang, Ning Zhang, Qi Wang, Weixia He, Wei Zhao, Xiaoyu He, Zhengyu Li, Zhengyu Wen.

Figure 1: Overview of our MoCapAnything framework. Given a monocular video and a reference 3D asset (mesh/skeleton/appearance), …
Figure 2: Detailed architecture of our method. A multi-modal Reference Prompt Encoder fuses mesh, skeleton, and appearance of the target …
Figure 3 presents representative qualitative results on the Truebones Zoo dataset. Compared to GenZoo, our predictions exhibit smoother motion trajectories, higher anatomical fidelity, and robust stability across both quadruped and non-quadruped skeletons, including bipeds, birds, reptiles, and even non-biological assets. GenZoo, while currently the most widely applicable animal motion capture method, is fundament…
Figure 4 presents representative Truebones Zoo-test results. Row 1 shows input video Jugar attack. Row 2 displays the same-species reference and predicted mocap outputs. Rows 3–5 show results when retargeting to skeletons of three different species. Our approach generalizes robustly across species and maintains temporally consistent, anatomically plausible 3D motion even with significant appearance and shape varia…
Figure 5: Real-world (Wild) results. Similar layout as …
Figure 6: More Truebones in-the-wild mocap results. Our method generalizes to a diverse range of species and scenarios. Although not explicitly trained for cross-species transfer, our model leverages structural, visual, and geometric cues to synthesize plausible retargeted motion. We observe a wide range of creative results: bird videos drive quadrupeds to perform flapping-like actions or animate pterosaurs; fish s…
Original abstract

Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MoCapAnything, a reference-guided, factorized framework for Category-Agnostic Motion Capture (CAMoCap). Given a monocular video and an arbitrary rigged 3D asset as prompt, it predicts 3D joint trajectories via a Reference Prompt Encoder (extracting per-joint queries from skeleton/mesh/renders), Video Feature Extractor (dense descriptors plus coarse 4D mesh), and Unified Motion Decoder, then recovers asset-specific rotations through constraint-aware inverse kinematics. The method is trained on the curated Truebones Zoo dataset (1038 clips with standardized triads) and claims high-quality animations plus meaningful cross-species retargeting on in-domain benchmarks and in-the-wild videos.

Significance. If the generalization claims hold with supporting evidence, the work would be significant for enabling scalable, prompt-driven 3D motion capture beyond human- or template-specific pipelines, with potential impact on animation, gaming, and cross-species content creation. The factorized design and new dataset curation are positive elements that could support further research.

major comments (3)
  1. [Experiments] Experiments section: The abstract and results claim 'high-quality skeletal animations' and 'meaningful cross-species retargeting' on in-domain benchmarks and in-the-wild videos, yet no quantitative metrics (e.g., MPJPE, rotation error, or temporal consistency scores), ablation studies, or error analysis are reported. This leaves the central performance claims resting solely on qualitative description, which is insufficient to substantiate the generalization assertions.
  2. [§3.3] §3.3, Unified Motion Decoder: The decoder produces trajectories in a unified space before asset-specific IK. Because training relies on Truebones Zoo's standardized triads, it is unclear whether the learned trajectories encode purely geometric, topology-agnostic features or dataset-specific joint semantics. This assumption is load-bearing for the arbitrary-rig and cross-species claims (e.g., quadruped vs. avian hierarchies); without explicit tests on varying joint counts/connectivities, the subsequent IK stage may receive misaligned inputs.
  3. [§3.1] §3.1, Reference Prompt Encoder: The design assumes that the target asset's skeleton, mesh, and rendered images supply sufficient cues for accurate per-joint queries that generalize to unseen rigs. No analysis or ablation demonstrates robustness when these cues differ substantially from the training distribution, which directly affects the weakest assumption underlying the prompt-driven approach.
minor comments (2)
  1. [Abstract] Abstract: References to 'in-domain benchmarks' are made without naming the specific datasets or providing any numerical context, reducing clarity for readers.
  2. [§3] Notation and figures: The distinction between the unified trajectory space and asset-specific rotations could be clarified with an explicit diagram or equation in §3; some figure captions lack detail on which species/rigs are shown in cross-retargeting examples.
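The quantitative metrics requested in major comment 1 are straightforward to compute once ground-truth joint trajectories exist. The sketch below is a minimal pure-Python version of MPJPE and of an acceleration-based temporal consistency score. The function names `mpjpe` and `accel_error`, the data layout (frames, then joints, then xyz), and the choice of second finite differences for the temporal score are assumptions made here; the paper does not specify an evaluation protocol.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Sequence[Point]    # one frame: a 3D position per joint
Motion = Sequence[Pose]   # one clip: a pose per frame


def mpjpe(pred: Motion, gt: Motion) -> float:
    """Mean per-joint position error: average Euclidean distance over all joints and frames."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for pose_p, pose_g in zip(pred, gt):
        if len(pose_p) != len(pose_g):
            raise ValueError("pred and gt must have the same number of joints")
        for p, g in zip(pose_p, pose_g):
            total += math.dist(p, g)
            count += 1
    if count == 0:
        raise ValueError("empty motion")
    return total / count


def accel_error(pred: Motion, gt: Motion) -> float:
    """Mean distance between second finite differences of pred and gt trajectories."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for t in range(1, len(gt) - 1):
        for j in range(len(gt[t])):
            acc_p = tuple(pred[t - 1][j][k] - 2 * pred[t][j][k] + pred[t + 1][j][k] for k in range(3))
            acc_g = tuple(gt[t - 1][j][k] - 2 * gt[t][j][k] + gt[t + 1][j][k] for k in range(3))
            total += math.dist(acc_p, acc_g)
            count += 1
    if count == 0:
        raise ValueError("need at least three frames")
    return total / count


if __name__ == "__main__":
    gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
    pred = [[(0.0, 3.0, 4.0)], [(1.0, 3.0, 4.0)], [(2.0, 3.0, 4.0)]]
    print(mpjpe(pred, gt), accel_error(pred, gt))
```

A constant offset between prediction and ground truth inflates MPJPE but leaves the acceleration score at zero, which is why the two are usually reported together.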

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our methodology and commit to revisions that provide stronger empirical support for the claims.

  1. Referee: [Experiments] Experiments section: The abstract and results claim 'high-quality skeletal animations' and 'meaningful cross-species retargeting' on in-domain benchmarks and in-the-wild videos, yet no quantitative metrics (e.g., MPJPE, rotation error, or temporal consistency scores), ablation studies, or error analysis are reported. This leaves the central performance claims resting solely on qualitative description, which is insufficient to substantiate the generalization assertions.

    Authors: We agree that quantitative metrics and ablations would strengthen the presentation. In the revised manuscript we will report MPJPE and per-joint rotation errors on the held-out portion of Truebones Zoo (where ground-truth 3D trajectories are available from the standardized triads), add temporal consistency scores, and include ablation studies on the Reference Prompt Encoder and Unified Motion Decoder. For in-the-wild results we will supplement the qualitative examples with a small-scale user study measuring perceived animation quality and retargeting fidelity. revision: yes

  2. Referee: [§3.3] §3.3, Unified Motion Decoder: The decoder produces trajectories in a unified space before asset-specific IK. Because training relies on Truebones Zoo's standardized triads, it is unclear whether the learned trajectories encode purely geometric, topology-agnostic features or dataset-specific joint semantics. This assumption is load-bearing for the arbitrary-rig and cross-species claims (e.g., quadruped vs. avian hierarchies); without explicit tests on varying joint counts/connectivities, the subsequent IK stage may receive misaligned inputs.

    Authors: The decoder is trained to regress 3D joint positions in a canonical coordinate frame; the per-joint queries supplied by the Reference Prompt Encoder are the only mechanism that injects rig-specific information, so the trajectory output itself is intended to be topology-agnostic. The subsequent constraint-aware IK stage then solves for asset-specific rotations given the target skeleton's joint limits and connectivity. To make this explicit, the revision will add a dedicated experiment that evaluates the decoder on rigs whose joint counts and connectivities differ from the training triads (e.g., by subsampling joints or introducing additional cross-species skeletons) and measures the quality of the IK-recovered animations. revision: partial

  3. Referee: [§3.1] §3.1, Reference Prompt Encoder: The design assumes that the target asset's skeleton, mesh, and rendered images supply sufficient cues for accurate per-joint queries that generalize to unseen rigs. No analysis or ablation demonstrates robustness when these cues differ substantially from the training distribution, which directly affects the weakest assumption underlying the prompt-driven approach.

    Authors: The encoder fuses three complementary signals (skeleton graph, mesh geometry, and rendered views) precisely to increase robustness to variations in any single cue. We will add an ablation that systematically removes or perturbs each cue and evaluates performance on rigs whose visual and structural statistics differ from the Truebones Zoo training distribution, thereby quantifying the contribution of each modality to generalization. revision: yes
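The per-joint rotation error promised in the first response is commonly measured as the geodesic angle between a predicted and a ground-truth rotation. The sketch below assumes rotations are given as quaternions in (w, x, y, z) order; the function name `rotation_error_deg` and that convention are assumptions made here, and the authors may choose a different rotation metric.

```python
import math
from typing import Sequence


def rotation_error_deg(q_pred: Sequence[float], q_gt: Sequence[float]) -> float:
    """Geodesic angle in degrees between two rotations given as (w, x, y, z) quaternions."""
    if len(q_pred) != 4 or len(q_gt) != 4:
        raise ValueError("quaternions must have four components")
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_g = math.sqrt(sum(c * c for c in q_gt))
    if norm_p == 0.0 or norm_g == 0.0:
        raise ValueError("zero-length quaternion")
    dot = sum(a * b for a, b in zip(q_pred, q_gt)) / (norm_p * norm_g)
    # q and -q encode the same rotation, so use |dot|; clamp against rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(rotation_error_deg(identity, quarter_turn_z))
```

Averaging this angle over joints and frames gives a single number that complements MPJPE, since a rig can match joint positions while still carrying wrong twist around a bone.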

Circularity Check

0 steps flagged

No circularity: empirical learned pipeline with independent training data


The paper presents a factorized neural architecture (Reference Prompt Encoder, Video Feature Extractor, Unified Motion Decoder) trained end-to-end on the curated Truebones Zoo dataset of 1038 clips. No equations, fitted parameters, or self-citations are shown that reduce the output trajectories or retargeting results to definitional equivalence with the inputs. The central claims rest on empirical results for in-domain and in-the-wild videos rather than any self-referential construction. The IK stage is described as a lightweight post-process, not a fitted component. This is the expected non-circular outcome for a data-driven CV method.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard domain assumptions about rigged assets and learned visual features rather than explicit free parameters or new entities.

axioms (2)
  • domain assumption A rigged 3D asset's skeleton, mesh, and rendered images contain sufficient information to generate per-joint queries that enable accurate motion retargeting.
    Invoked in the Reference Prompt Encoder description as the basis for guiding the decoder.
  • domain assumption Constraint-aware inverse kinematics can reliably convert predicted 3D joint trajectories into valid bone rotations for arbitrary rigs.
    Stated as the final recovery stage without further justification in the abstract.
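To make the second axiom concrete, the sketch below shows the simplest possible positions-to-rotation step: the shortest-arc rotation that turns a bone's rest direction into the direction implied by two predicted joint positions. It is deliberately not the paper's constraint-aware IK. It ignores joint limits, twist around the bone, and parent frames, and the names `shortest_arc` and `rotate` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z)


def normalize(v: Sequence[float]) -> Tuple[float, ...]:
    """Scale a vector (of any length) to unit norm."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / n for c in v)


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def shortest_arc(rest_dir: Sequence[float], target_dir: Sequence[float]) -> Quat:
    """Unit quaternion rotating rest_dir onto target_dir along the shortest arc."""
    a, b = normalize(rest_dir), normalize(target_dir)
    d = sum(x * y for x, y in zip(a, b))
    if d < -1.0 + 1e-9:
        # Opposite directions: rotate 180 degrees about any axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = normalize(cross(a, helper))
        return (0.0, axis[0], axis[1], axis[2])
    c = cross(a, b)
    w, x, y, z = normalize((1.0 + d, c[0], c[1], c[2]))
    return (w, x, y, z)


def rotate(q: Quat, v: Sequence[float]) -> Vec3:
    """Apply unit quaternion q to vector v: v + 2w(u x v) + 2 u x (u x v)."""
    w, u = q[0], q[1:]
    t = cross(u, v)
    tt = cross(u, t)
    return (v[0] + 2.0 * (w * t[0] + tt[0]),
            v[1] + 2.0 * (w * t[1] + tt[1]),
            v[2] + 2.0 * (w * t[2] + tt[2]))


if __name__ == "__main__":
    parent, child = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)   # predicted joint positions
    bone = tuple(c - p for c, p in zip(child, parent))
    q = shortest_arc((0.0, 1.0, 0.0), bone)             # rest bone points along +Y
    print(q, rotate(q, (0.0, 1.0, 0.0)))
```

Because the shortest-arc solution leaves twist undetermined, two positions alone cannot fix a bone's full orientation; this is one reason the axiom about reliable conversion to rotations carries real weight.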

pith-pipeline@v0.9.0 · 5595 in / 1356 out tokens · 32860 ms · 2026-05-16T23:02:42.894519+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Multi-hmr: Multi-person whole-body human mesh recovery in a single shot

    Fabien Baradel, Matthieu Armando, Salma Galaaoui, Ro- main Br ´egier, Philippe Weinzaepfel, Gr ´egory Rogez, and Thomas Lucas. Multi-hmr: Multi-person whole-body human mesh recovery in a single shot. InEuropean Conference on Computer Vision, pages 202–218. Springer, 2024. 3

  2. [2]

    End-to- end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2

  3. [3]

    Cascaded pyramid net- work for multi-person pose estimation

    Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid net- work for multi-person pose estimation. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7103–7112, 2018. 2

  4. [4]

    Huang, and Lei Zhang

    Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estima- tion. InCVPR, 2020. 2

  5. [5]

    Beyond static features for temporally consis- tent 3d human pose and shape from a video

    Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Ky- oung Mu Lee. Beyond static features for temporally consis- tent 3d human pose and shape from a video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1964–1973, 2021. 3

  6. [6]

    Deitke, D

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- verse: A universe of annotated 3d objects.arXiv preprint arXiv:2212.08051, 2022. 6

  7. [7]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Chris- tian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023. 6

  8. [8]

    Anytop: Character animation diffusion with any topology.arXiv preprint arXiv:2502.17327, 2025

    Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit H Bermano, and Daniel Cohen-Or. Anytop: Character animation diffusion with any topology.arXiv preprint arXiv:2502.17327, 2025. 4, 5

  9. [9]

    Humans in 4d: Re- constructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Re- constructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023. 3

  10. [10]

    A graph-based approach for category-agnostic pose estimation.arXiv preprint arXiv:2311.17891, 2023

    Or Hirschorn and Shai Avidan. A graph-based approach for category-agnostic pose estimation.arXiv preprint arXiv:2311.17891, 2023. 2

  11. [11]

    End-to-end recovery of human shape and pose

    Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018. 1, 3

  12. [12]

    Learning category-specific mesh reconstruc- tion from image collections

    Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruc- tion from image collections. InProceedings of the Euro- pean conference on computer vision (ECCV), pages 371– 386, 2018. 3

  13. [13]

    Vibe: Video inference for human body pose and shape estimation

    Muhammed Kocabas, Nikos Athanasiou, and Michael J Black. Vibe: Video inference for human body pose and shape estimation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 5253–5263, 2020. 3

  14. [14]

    Pace: Human and camera motion estimation from in- the-wild videos

    Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in- the-wild videos. In2024 International Conference on 3D Vision (3DV), pages 397–408. IEEE, 2024. 3

  15. [15]

    Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation

    Jia Li, Wen Su, and Zengfu Wang. Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation. InProceedings of the AAAI conference on artifi- cial intelligence, pages 11354–11361, 2020. 2

  16. [16]

    Human pose regression with residual log-likelihood estimation

    Jiefeng Li, Siyuan Bian, Ailing Zeng, Can Wang, Bo Pang, Wentao Liu, and Cewu Lu. Human pose regression with residual log-likelihood estimation. InProceedings of the IEEE/CVF international conference on computer vision, pages 11025–11034, 2021. 2

  17. [17]

    Simcc: A simple coordinate classification perspective for hu- man pose estimation

    Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunx- iao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. Simcc: A simple coordinate classification perspective for hu- man pose estimation. InEuropean Conference on Computer Vision, pages 89–106. Springer, 2022. 2

  18. [18]

    Learning the 3d fauna of the web

    Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. Learning the 3d fauna of the web. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9752–9762, 2024. 3

  19. [19]

    One-stage 3d whole-body mesh recovery with component aware transformer

    Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-stage 3d whole-body mesh recovery with component aware transformer. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21159–21168, 2023. 3

  20. [20]

    SMPL: A skinned multi- person linear model.ACM transactions on graphics (TOG), 34(6):1–16, 2015

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi- person linear model.ACM transactions on graphics (TOG), 34(6):1–16, 2015. 1, 3 9

  21. [21]

    Stacked hour- glass networks for human pose estimation

    Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hour- glass networks for human pose estimation. InEuropean con- ference on computer vision, pages 483–499. Springer, 2016. 2

  22. [22]

    Black, Sil- via Zuffi, and Peter Kulits

    Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Sil- via Zuffi, and Peter Kulits. Generative zoo.CoRR, abs/2412.08101, 2024. 3

  23. [23]

    Black, Silvia Zuffi, and Peter Kulits

    Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, and Peter Kulits. Generative zoo. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 6

  24. [24]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Rus- sell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv´e J´egou, Julien Ma...

  25. [25]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 1, 3

  26. [26]

    Capex: Category-agnostic pose estimation from textual point expla- nation.arXiv preprint arXiv:2406.00384, 2024

    Matan Rusanovsky, Or Hirschorn, and Shai Avidan. Capex: Category-agnostic pose estimation from textual point expla- nation.arXiv preprint arXiv:2406.00384, 2024. 3

  27. [27]

    Capex: Category-agnostic pose estimation from textual point ex- planation

    Matan Rusanovsky, Or Hirschorn, and Shai Avidan. Capex: Category-agnostic pose estimation from textual point ex- planation. InThe Thirteenth International Conference on Learning Representations, 2025. 1

  28. [28]

    End-to-end multi-person pose estimation with transformers

    Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 11069–11078, 2022. 2

  29. [29]

    Matching is not enough: A two-stage frame- work for category-agnostic pose estimation

    Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, and Zhiguo Cao. Matching is not enough: A two-stage frame- work for category-agnostic pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 7308–7317, 2023. 2

  30. [30]

    Wham: Reconstructing world-grounded humans with accu- rate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accu- rate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070– 2080, 2024. 3

  31. [31]

    Deep high-resolution representation learning for human pose es- timation

    Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose es- timation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703,

  32. [32]

    Aios: All-in-one-stage expres- sive human pose and shape estimation

    Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi-Sing Leung, Zi- wei Liu, Lei Yang, et al. Aios: All-in-one-stage expres- sive human pose and shape estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1834–1843, 2024. 3

  33. [33]

    Deeppose: Hu- man pose estimation via deep neural networks

    Alexander Toshev and Christian Szegedy. Deeppose: Hu- man pose estimation via deep neural networks. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 1

  34. [34]

    Deeppose: Human pose estimation via deep neural networks

    Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1653–1660, 2014. 2

  35. [35]

    Truebones motion capture – mocap files, n.d

    Truebones. Truebones motion capture – mocap files, n.d. Accessed: 2025-05-22. 1, 2, 6

  36. [36]

    Locllm: Exploiting generalizable human keypoint localization via large language model

    Dongkai Wang, Shiyu Xuan, and Shiliang Zhang. Locllm: Exploiting generalizable human keypoint localization via large language model. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 614–623, 2024. 2

  37. [37]

    Tram: Global trajectory and motion of 3d humans from in- the-wild videos

    Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in- the-wild videos. InEuropean Conference on Computer Vi- sion, pages 467–487. Springer, 2024. 3

  38. [38]

    Magicpony: Learning ar- ticulated 3d animals in the wild

    Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rup- precht, and Andrea Vedaldi. Magicpony: Learning ar- ticulated 3d animals in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8792–8802, 2023. 3

  39. [39]

    CASA: Category-agnostic skeletal an- imal reconstruction

    Yuefan Wu*, Zeyuan Chen*, Shaowei Liu, Zhongzheng Ren, and Shenlong Wang. CASA: Category-agnostic skeletal an- imal reconstruction. InNeurIPS, 2022. 3

  40. [40]

    Simple baselines for human pose estimation and tracking

    Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. InProceedings of the European conference on computer vision (ECCV), pages 466–481, 2018. 2

  41. [41]

    Querypose: Sparse multi- person pose regression via spatial-aware part-level query

    Yabo Xiao, Kai Su, Xiaojuan Wang, Dongdong Yu, Lei Jin, Mingshu He, and Zehuan Yuan. Querypose: Sparse multi- person pose regression via spatial-aware part-level query. Advances in Neural Information Processing Systems, 35: 12464–12477, 2022. 2

  42. [42]

    Pose for everything: Towards category-agnostic pose estimation

    Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, and Xiaogang Wang. Pose for everything: Towards category-agnostic pose estimation. In European conference on computer vision, pages 398–416. Springer, 2022. 2

  43. [43]

    Vit- pose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing sys- tems, 35:38571–38584, 2022

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vit- pose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing sys- tems, 35:38571–38584, 2022. 2

  44. [44]

    Lasr: Learning articulated shape re- construction from a monocular video

    Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Huiwen Chang, Deva Ramanan, William T Freeman, and Ce Liu. Lasr: Learning articulated shape re- construction from a monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15980–15989, 2021. 3

  45. [45]

    Viser: Video-specific surface embeddings for articulated 3d shape reconstruction

    Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Ce Liu, and Deva Ramanan. Viser: Video-specific surface embeddings for articulated 3d shape reconstruction. Advances in Neural Information Processing Systems, 34:19326–19338, 2021.

  46. [46]

    Banmo: Building animatable 3d neural models from many casual videos

    Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In CVPR, 2022.

  47. [47]

    Physically plausible reconstruction from monocular videos

    Gengshan Yang, Shuo Yang, John Z. Zhang, Zachary Manchester, and Deva Ramanan. Physically plausible reconstruction from monocular videos. In ICCV, 2023.

  48. [48]

    Explicit box detection unifies end-to-end multi-person pose estimation

    Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit box detection unifies end-to-end multi-person pose estimation. arXiv preprint arXiv:2302.01593, 2023.

  49. [49]

    Lassie: Learning articulated shapes from sparse image ensemble via 3d part discovery

    Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani. Lassie: Learning articulated shapes from sparse image ensemble via 3d part discovery. Advances in Neural Information Processing Systems, 35:15296–15308, 2022.

  50. [50]

    Whac: World-grounded humans and cameras

    Wanqi Yin, Zhongang Cai, Ruisi Wang, Fanzhou Wang, Chen Wei, Haiyi Mei, Weiye Xiao, Zhitao Yang, Qingping Sun, Atsushi Yamashita, et al. Whac: World-grounded humans and cameras. In European Conference on Computer Vision, pages 20–37. Springer, 2024.

  51. [51]

    Glamr: Global occlusion-aware human mesh recovery with dynamic cameras

    Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11038–11049, 2022.

  52. [52]

    Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and Michael J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

    MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos: Supplementary Material

    More Visualization Results

    In this section, we summarize additional qualitative results from our supplementary webpage. These visualizations highlight the effectiveness of our approach across controlled multi-species datasets, in-the-wild videos, and cross-species retargeting scenarios, showing that our model produces high-fidelity and temporally...

    Implementation Details

    A. Dataset and Environment Details

    Dataset Processing. All meshes are first scaled by the bounding box of their rest pose, normalizing each mesh into a unit-volume space. For sequence data, we remove the global translation of every frame, compute a sequence-level super bounding box, and uniformly scale the entire sequence into the...
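    The normalization steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the choice of the longest bounding-box side as the scale factor, the centering step, and the way the per-frame global translation is supplied (here as an explicit array) are all assumptions, and the target space of the sequence scaling is taken to be the same unit space as for the rest-pose meshes.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero for degenerate meshes


def _bbox_center_and_scale(points):
    """Center and longest side of the axis-aligned bounding box of (N, 3) points."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (lo + hi) / 2.0, max(float((hi - lo).max()), EPS)


def normalize_rest_mesh(vertices):
    """Scale a rest-pose mesh (V, 3) by its bounding box.

    Assumption: the mesh is centered and its longest side is mapped to 1.
    """
    vertices = np.asarray(vertices, dtype=float)
    center, scale = _bbox_center_and_scale(vertices)
    return (vertices - center) / scale


def normalize_sequence(frames, translations):
    """Normalize an animated sequence with one shared scale.

    frames:       (T, V, 3) vertex positions per frame
    translations: (T, 3) global translation per frame (assumed to be given,
                  e.g. a root-joint position)
    """
    frames = np.asarray(frames, dtype=float)
    translations = np.asarray(translations, dtype=float)
    # Step 1: remove the global translation of every frame.
    local = frames - translations[:, None, :]
    # Step 2: bounding box over all frames at once (the "super" bounding box).
    center, scale = _bbox_center_and_scale(local.reshape(-1, 3))
    # Step 3: one uniform scale for the whole sequence, so relative motion
    # amplitude between frames is preserved.
    return (local - center) / scale
```

    Using a single sequence-level box rather than a per-frame box keeps the scale constant over time, so a limb that extends in one frame does not make the whole body appear to shrink.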

    Evaluation Metrics

    This section describes the computation of the proposed metric (CD-Skeleton), which evaluates the alignment between two articulated skeletons. Each skeleton is represented by a set of 3D joint positions and a kinematic hierarchy defined by a parent array.

    Notation. Let Skeleton A and Skeleton B be defined as:

    • Joint positions: X_A = {x_i^A ∈ R...
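    The skeleton representation named above (3D joint positions plus a parent array) can be illustrated with a short sketch. It covers only the data structure, not the CD-Skeleton metric itself. The helper names and the convention that the root joint has parent index -1 are assumptions made for this example.

```python
import numpy as np


def bone_segments(joints, parents):
    """Return child indices and the start/end points of every bone.

    joints:  (J, 3) array of 3D joint positions
    parents: length-J sequence; parents[i] is the parent joint of joint i,
             with -1 marking the root (assumed convention)
    """
    joints = np.asarray(joints, dtype=float)
    parents = np.asarray(parents, dtype=int)
    child = np.nonzero(parents >= 0)[0]  # every non-root joint defines a bone
    return child, joints[parents[child]], joints[child]


def bone_lengths(joints, parents):
    """Euclidean length of each bone, ordered by child joint index."""
    _, start, end = bone_segments(joints, parents)
    return np.linalg.norm(end - start, axis=1)


# A toy 4-joint skeleton: a root, a spine joint, and two children of the spine.
toy_joints = np.array([[0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 2.0, 0.0],
                       [1.0, 1.0, 0.0]])
toy_parents = [-1, 0, 1, 1]
toy_lengths = bone_lengths(toy_joints, toy_parents)
```

    Because the hierarchy is stored as a flat parent array, two skeletons with different joint counts and topologies can be handled by the same code, which is what a category-agnostic comparison requires.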