arxiv: 2512.10881 · v2 · submitted 2025-12-11 · 💻 cs.CV

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

Pith reviewed 2026-05-16 23:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: motion capture, 3D skeletal animation, monocular video, inverse kinematics, rigged assets, motion retargeting, category-agnostic, BVH animation

The pith

MoCapAnything reconstructs rotation-based animations for arbitrary rigged 3D assets directly from monocular video using the asset itself as a structural prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Category-Agnostic Motion Capture as the task of turning a monocular video and any rigged 3D asset into a rotation sequence such as BVH that drives that exact asset. It introduces a factorized pipeline that first predicts 3D joint trajectories by combining video-derived features with per-joint queries extracted from the asset, then converts those trajectories into asset-specific rotations through constraint-aware inverse kinematics. The system relies on three learned components: a Reference Prompt Encoder that ingests the asset skeleton, mesh and rendered views; a Video Feature Extractor that builds dense descriptors and a coarse 4D deforming mesh; and a Unified Motion Decoder that fuses the cues into coherent trajectories. Experiments on both controlled benchmarks and in-the-wild footage, supported by a new dataset of 1038 standardized skeleton-mesh-render clips, show the method produces usable animations and performs cross-species retargeting on rigs never seen during training.
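To make "a rotation sequence that drives that exact asset" concrete, the sketch below runs plain forward kinematics on a toy three-joint chain: per-frame local rotations plus fixed rest-pose offsets determine every joint position. This is the generic convention used by BVH-style rigs, not code from the paper; the chain, the helper names (`rot_z`, `forward_kinematics`) and the angles are illustrative assumptions.

```python
import math

def rot_z(deg):
    """3x3 rotation matrix about the z axis, angle in degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def forward_kinematics(parents, offsets, local_rots, root_pos=(0.0, 0.0, 0.0)):
    """World-space joint positions for one frame.

    parents[j]    : parent index of joint j, -1 for the root (parents come first)
    offsets[j]    : rest-pose offset of joint j from its parent
    local_rots[j] : 3x3 local rotation of joint j for this frame
    """
    n = len(parents)
    world_rot = [None] * n
    world_pos = [None] * n
    for j in range(n):
        p = parents[j]
        if p < 0:
            world_rot[j] = local_rots[j]
            world_pos[j] = list(root_pos)
        else:
            world_rot[j] = matmul(world_rot[p], local_rots[j])
            moved = apply(world_rot[p], offsets[j])
            world_pos[j] = [world_pos[p][i] + moved[i] for i in range(3)]
    return world_pos

# Hypothetical toy rig: a straight chain of three joints along +x.
parents = [-1, 0, 1]
offsets = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

# A three-frame "animation": only the root rotates, by 0, 45 and 90 degrees.
frames = [
    forward_kinematics(parents, offsets, [rot_z(a), rot_z(0), rot_z(0)])
    for a in (0, 45, 90)
]
print([[round(x, 3) for x in frame[2]] for frame in frames])
```

The second stage of the pipeline described above has to solve the inverse of this mapping: given joint positions per frame, find local rotations that reproduce them on the target rig.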

Core claim

A reference-guided framework first extracts per-joint queries from an arbitrary asset's skeleton, mesh and rendered images, fuses them with dense video features and a reconstructed coarse 4D mesh, decodes temporally coherent 3D joint trajectories, and finally recovers asset-specific rotations via constraint-aware inverse kinematics, thereby producing driveable BVH-style animations for any rigged asset from a single monocular video.

What carries the argument

The Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh and rendered images to condition the entire motion prediction pipeline.
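One common way to consume per-joint queries is cross-attention over video tokens, which naturally yields one output per joint for any joint count. The page does not say that the paper uses exactly this operator, so the sketch below is an assumption for illustration only; `cross_attend`, the two-dimensional features and the query values are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def cross_attend(queries, tokens):
    """Scaled dot-product attention where keys and values are both `tokens`.

    queries : list of J query vectors (one per joint of the target rig)
    tokens  : list of N video feature vectors of the same dimension
    returns : list of J fused feature vectors
    """
    d = len(tokens[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        fused.append([sum(weights[n] * tokens[n][k] for n in range(len(tokens)))
                      for k in range(d)])
    return fused

# Two hypothetical video tokens and two rigs with different joint counts.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
rig_a_queries = [[50.0, 0.0], [0.0, 50.0], [0.0, 0.0]]   # 3 joints
rig_b_queries = [[0.0, 0.0]] * 5                          # 5 joints

print(len(cross_attend(rig_a_queries, video_tokens)))  # one output per joint of rig A
print(len(cross_attend(rig_b_queries, video_tokens)))  # one output per joint of rig B
```

The point of the sketch is structural: the number of outputs follows the number of queries, so the same weights can in principle serve rigs with different joint counts. Whether the learned queries carry the right information for unseen rigs is the empirical question the review raises.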

If this is right

Prompt-driven capture becomes possible for any rigged asset without species- or template-specific retraining.
Cross-species retargeting works across heterogeneous skeleton topologies while preserving motion semantics.
Temporally coherent 3D trajectories can be produced directly from in-the-wild monocular footage.
A standardized skeleton-mesh-render dataset of over one thousand clips can serve as a common benchmark for category-agnostic methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization could be tested on multi-view video inputs to reduce depth ambiguity in the trajectory prediction stage.
Replacing the lightweight IK stage with a learned rotation predictor might allow end-to-end differentiability for further fine-tuning on custom rigs.
Combining the prompt encoder with generative video models could enable creation of new motions that still respect the target asset's exact skeleton constraints.

Load-bearing premise

The rigged 3D asset supplies enough structural information through its skeleton, mesh and rendered images for the prompt encoder to produce accurate per-joint queries that generalize to unseen rigs and motions.

What would settle it

Running the system on videos of animals whose limb topology and joint count differ markedly from the Truebones Zoo training distribution and checking whether the output rotations visibly fail to reproduce the observed motion when applied to the input asset mesh.

Figures

Figures reproduced from arXiv: 2512.10881 by Dongze Lian, Kehong Gong, Mingxi Xu, Mingyuan Zhang, Ning Zhang, Qi Wang, Weixia He, Wei Zhao, Xiaoyu He, Zhengyu Li, Zhengyu Wen.

**Figure 1.** Overview of our MoCapAnything framework. Given a monocular video and a reference 3D asset (mesh/skeleton/appearance), …

**Figure 2.** Detailed architecture of our method. A multi-modal Reference Prompt Encoder fuses mesh, skeleton, and appearance of the target …

**Figure 3.** Representative qualitative results on the Truebones Zoo dataset. Compared to GenZoo, our predictions exhibit smoother motion trajectories, higher anatomical fidelity, and robust stability across both quadruped and non-quadruped skeletons, including bipeds, birds, reptiles, and even non-biological assets. GenZoo, while currently the most widely applicable animal motion capture method, is fundament…

**Figure 4.** Representative Truebones Zoo-test results. Row 1 shows input video Jugar attack. Row 2 displays the same-species reference and predicted mocap outputs. Rows 3–5 show results when retargeting to skeletons of three different species. Our approach generalizes robustly across species and maintains temporally consistent, anatomically plausible 3D motion even with significant appearance and shape varia…

**Figure 5.** Real-world (Wild) results. Similar layout as …

**Figure 6.** More Truebones in-the-wild mocap results. Our method generalizes to a diverse range of species and scenarios. Although not explicitly trained for cross-species transfer, our model leverages structural, visual, and geometric cues to synthesize plausible retargeted motion. We observe a wide range of creative results: bird videos drive quadrupeds to perform flapping-like actions or animate pterosaurs; fish s…

Original abstract

Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes category-agnostic motion capture with a factorized reference-guided model, but its claims depend on unquantified qualitative results.

The main takeaway is that this work defines Category-Agnostic Motion Capture as a new task and gives a concrete architecture for turning a monocular video plus an arbitrary rigged asset into a usable animation. The system predicts 3D joint trajectories in a shared space first, then runs asset-specific inverse kinematics, with a Reference Prompt Encoder that pulls cues from the target skeleton, mesh, and rendered images. That factorization and the prompt encoder are the pieces that actually move past human-only or fixed-template pipelines. Curating Truebones Zoo with 1038 standardized skeleton-mesh-render clips is also useful; training data of that form has been a practical blocker for this kind of work. The cross-species retargeting results described in the abstract look plausible on the surface and suggest the reference cues transfer in at least some cases.

The soft spot is the complete absence of numbers. The abstract claims good performance on in-domain benchmarks and in-the-wild videos but supplies no error metrics, no baseline comparisons, no ablations on the encoder or decoder, and no analysis of failure modes. Without those, it is hard to tell whether the unified trajectory decoder is truly producing geometry that transfers across joint counts and hierarchies or whether it is mostly fitting patterns from the training skeletons. The stress-test worry about dataset-specific joint semantics is therefore worth checking in the full text; if the decoder outputs are not purely geometric, the later IK stage will inherit the misalignment.

This is aimed at people building general animation tools or prompt-driven 3D pipelines. A reader who needs a working reference architecture and a new dataset will get value from the description even if the numbers are still missing. It deserves a serious referee because the task definition is clean and the components are clearly motivated, though any review should press hard for quantitative evidence and generalization tests.

Referee Report

3 major / 2 minor

Summary. The paper introduces MoCapAnything, a reference-guided, factorized framework for Category-Agnostic Motion Capture (CAMoCap). Given a monocular video and an arbitrary rigged 3D asset as prompt, it predicts 3D joint trajectories via a Reference Prompt Encoder (extracting per-joint queries from skeleton/mesh/renders), Video Feature Extractor (dense descriptors plus coarse 4D mesh), and Unified Motion Decoder, then recovers asset-specific rotations through constraint-aware inverse kinematics. The method is trained on the curated Truebones Zoo dataset (1038 clips with standardized triads) and claims high-quality animations plus meaningful cross-species retargeting on in-domain benchmarks and in-the-wild videos.

Significance. If the generalization claims hold with supporting evidence, the work would be significant for enabling scalable, prompt-driven 3D motion capture beyond human- or template-specific pipelines, with potential impact on animation, gaming, and cross-species content creation. The factorized design and new dataset curation are positive elements that could support further research.

major comments (3)

[Experiments] Experiments section: The abstract and results claim 'high-quality skeletal animations' and 'meaningful cross-species retargeting' on in-domain benchmarks and in-the-wild videos, yet no quantitative metrics (e.g., MPJPE, rotation error, or temporal consistency scores), ablation studies, or error analysis are reported. This leaves the central performance claims resting solely on qualitative description, which is insufficient to substantiate the generalization assertions.
[§3.3] §3.3, Unified Motion Decoder: The decoder produces trajectories in a unified space before asset-specific IK. Because training relies on Truebones Zoo's standardized triads, it is unclear whether the learned trajectories encode purely geometric, topology-agnostic features or dataset-specific joint semantics. This assumption is load-bearing for the arbitrary-rig and cross-species claims (e.g., quadruped vs. avian hierarchies); without explicit tests on varying joint counts/connectivities, the subsequent IK stage may receive misaligned inputs.
[§3.1] §3.1, Reference Prompt Encoder: The design assumes that the target asset's skeleton, mesh, and rendered images supply sufficient cues for accurate per-joint queries that generalize to unseen rigs. No analysis or ablation demonstrates robustness when these cues differ substantially from the training distribution, which directly affects the weakest assumption underlying the prompt-driven approach.
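For readers unfamiliar with the metrics named in the first major comment, the following is a minimal sketch of how they are usually computed: MPJPE as the mean Euclidean joint error, rotation error as the geodesic angle between two rotation matrices, and a simple jitter proxy based on second finite differences. The function names and the toy data are assumptions for illustration; the paper's own evaluation protocol is not described on this page.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error; inputs are frames x joints x 3."""
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)
            count += 1
    return total / count

def geodesic_deg(ra, rb):
    """Angle in degrees of the relative rotation ra^T * rb (3x3 matrices)."""
    trace = sum(ra[k][i] * rb[k][i] for i in range(3) for k in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def mean_accel(traj):
    """Mean magnitude of the second finite difference, a simple jitter proxy."""
    total, count = 0.0, 0
    for t in range(1, len(traj) - 1):
        for prev, cur, nxt in zip(traj[t - 1], traj[t], traj[t + 1]):
            acc = [nxt[i] - 2.0 * cur[i] + prev[i] for i in range(3)]
            total += math.sqrt(sum(a * a for a in acc))
            count += 1
    return total / count

# Toy data: every predicted joint is offset from ground truth by (3, 4, 0).
gt = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]] for _ in range(2)]
pred = [[[x + 3.0, y + 4.0, z] for x, y, z in frame] for frame in gt]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(mpjpe(pred, gt))
print(geodesic_deg(identity, quarter_turn_z))
print(mean_accel([[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[4.0, 0.0, 0.0]]]))
```

Position error and rotation error measure different stages of the factorized pipeline: the first scores the trajectory decoder, the second scores the result after inverse kinematics.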

minor comments (2)

[Abstract] Abstract: References to 'in-domain benchmarks' are made without naming the specific datasets or providing any numerical context, reducing clarity for readers.
[§3] Notation and figures: The distinction between the unified trajectory space and asset-specific rotations could be clarified with an explicit diagram or equation in §3; some figure captions lack detail on which species/rigs are shown in cross-retargeting examples.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our methodology and commit to revisions that provide stronger empirical support for the claims.

Referee: [Experiments] Experiments section: The abstract and results claim 'high-quality skeletal animations' and 'meaningful cross-species retargeting' on in-domain benchmarks and in-the-wild videos, yet no quantitative metrics (e.g., MPJPE, rotation error, or temporal consistency scores), ablation studies, or error analysis are reported. This leaves the central performance claims resting solely on qualitative description, which is insufficient to substantiate the generalization assertions.

Authors: We agree that quantitative metrics and ablations would strengthen the presentation. In the revised manuscript we will report MPJPE and per-joint rotation errors on the held-out portion of Truebones Zoo (where ground-truth 3D trajectories are available from the standardized triads), add temporal consistency scores, and include ablation studies on the Reference Prompt Encoder and Unified Motion Decoder. For in-the-wild results we will supplement the qualitative examples with a small-scale user study measuring perceived animation quality and retargeting fidelity.
Revision: yes
Referee: [§3.3] §3.3, Unified Motion Decoder: The decoder produces trajectories in a unified space before asset-specific IK. Because training relies on Truebones Zoo's standardized triads, it is unclear whether the learned trajectories encode purely geometric, topology-agnostic features or dataset-specific joint semantics. This assumption is load-bearing for the arbitrary-rig and cross-species claims (e.g., quadruped vs. avian hierarchies); without explicit tests on varying joint counts/connectivities, the subsequent IK stage may receive misaligned inputs.

Authors: The decoder is trained to regress 3D joint positions in a canonical coordinate frame; the per-joint queries supplied by the Reference Prompt Encoder are the only mechanism that injects rig-specific information, so the trajectory output itself is intended to be topology-agnostic. The subsequent constraint-aware IK stage then solves for asset-specific rotations given the target skeleton's joint limits and connectivity. To make this explicit, the revision will add a dedicated experiment that evaluates the decoder on rigs whose joint counts and connectivities differ from the training triads (e.g., by subsampling joints or introducing additional cross-species skeletons) and measures the quality of the IK-recovered animations.
Revision: partial
Referee: [§3.1] §3.1, Reference Prompt Encoder: The design assumes that the target asset's skeleton, mesh, and rendered images supply sufficient cues for accurate per-joint queries that generalize to unseen rigs. No analysis or ablation demonstrates robustness when these cues differ substantially from the training distribution, which directly affects the weakest assumption underlying the prompt-driven approach.

Authors: The encoder fuses three complementary signals (skeleton graph, mesh geometry, and rendered views) precisely to increase robustness to variations in any single cue. We will add an ablation that systematically removes or perturbs each cue and evaluates performance on rigs whose visual and structural statistics differ from the Truebones Zoo training distribution, thereby quantifying the contribution of each modality to generalization.
Revision: yes
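The joint-subsampling test proposed in the second response needs a way to derive a smaller but still valid hierarchy from an existing rig. A minimal sketch, assuming the skeleton is stored as a parent-index array, is to re-attach each kept joint to its nearest kept ancestor. The function name, the requirement that the root is kept, and the toy rig are assumptions made for this illustration, not the authors' protocol.

```python
def subsample_skeleton(parents, keep):
    """Drop joints not in `keep` and re-attach survivors to their nearest kept ancestor.

    parents : parents[j] is the parent index of joint j, -1 for the root
    keep    : iterable of joint indices to retain (must include the root)
    returns : (new_parents, old_indices), where old_indices[i] is the original
              index of new joint i
    """
    root = parents.index(-1)
    old_indices = sorted(set(keep))
    if root not in old_indices:
        raise ValueError("the root joint must be kept so the result stays a single tree")
    new_index = {old: new for new, old in enumerate(old_indices)}
    new_parents = []
    for old in old_indices:
        p = parents[old]
        # Walk up the original hierarchy until a kept joint (or the top) is found.
        while p != -1 and p not in new_index:
            p = parents[p]
        new_parents.append(new_index[p] if p != -1 else -1)
    return new_parents, old_indices

# Hypothetical rig: spine chain 0-1-2-3 plus a limb 0-4-5.
parents = [-1, 0, 1, 2, 0, 4]
new_parents, old_indices = subsample_skeleton(parents, keep=[0, 2, 3, 5])
print(new_parents)   # hierarchy of the reduced rig
print(old_indices)   # which original joints survived
```

Because the decoder is described as predicting joint positions, ground-truth trajectories for the reduced rig can be obtained by selecting the columns listed in `old_indices`; rest-pose offsets for the new bones would need to be summed along each skipped chain.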

Circularity Check

0 steps flagged

No circularity: empirical learned pipeline with independent training data

The paper presents a factorized neural architecture (Reference Prompt Encoder, Video Feature Extractor, Unified Motion Decoder) trained end-to-end on the curated Truebones Zoo dataset of 1038 clips. No equations, fitted parameters, or self-citations are shown that reduce the output trajectories or retargeting results to definitional equivalence with the inputs. The central claims rest on empirical results for in-domain and in-the-wild videos rather than any self-referential construction. The IK stage is described as a lightweight post-process, not a fitted component. This is the expected non-circular outcome for a data-driven CV method.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard domain assumptions about rigged assets and learned visual features rather than explicit free parameters or new entities.

axioms (2)

[domain assumption] A rigged 3D asset's skeleton, mesh, and rendered images contain sufficient information to generate per-joint queries that enable accurate motion retargeting.
Invoked in the Reference Prompt Encoder description as the basis for guiding the decoder.
[domain assumption] Constraint-aware inverse kinematics can reliably convert predicted 3D joint trajectories into valid bone rotations for arbitrary rigs.
Stated as the final recovery stage without further justification in the abstract.
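To show what the second assumption asks of the IK stage, here is a deliberately simplified single-bone step: find the rotation that swings a rest-pose bone direction onto a predicted direction, with an angle clamp standing in for a joint limit. It ignores twist, hierarchy and temporal smoothing, and it is not the paper's constraint-aware solver; `align_bone` and `max_angle_deg` are hypothetical names.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def apply(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula for a unit axis and an angle in radians."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c]]

def align_bone(rest_dir, target_dir, max_angle_deg=180.0):
    """Smallest rotation taking rest_dir toward target_dir, clamped to max_angle_deg."""
    a, b = normalize(rest_dir), normalize(target_dir)
    axis = cross(a, b)
    sin_t = math.sqrt(sum(x * x for x in axis))
    cos_t = sum(x * y for x, y in zip(a, b))
    angle = math.atan2(sin_t, cos_t)
    if sin_t < 1e-9:
        if cos_t > 0.0:
            # Directions already agree: no rotation needed.
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        # Opposite directions: any axis perpendicular to `a` works.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        axis = normalize(cross(a, helper))
    else:
        axis = [x / sin_t for x in axis]
    angle = min(angle, math.radians(max_angle_deg))  # crude stand-in for a joint limit
    return axis_angle_to_matrix(axis, angle)

free = align_bone([1.0, 0.0, 0.0], [0.0, 2.0, 0.0])
limited = align_bone([1.0, 0.0, 0.0], [0.0, 2.0, 0.0], max_angle_deg=45.0)
print([round(x, 3) for x in apply(free, [1.0, 0.0, 0.0])])
print([round(x, 3) for x in apply(limited, [1.0, 0.0, 0.0])])
```

When the clamp is active the bone no longer reaches the predicted direction, which is one concrete way the assumption can fail: predicted trajectories that are inconsistent with the rig's limits or bone lengths cannot be reproduced exactly by any valid rotation.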

pith-pipeline@v0.9.0 · 5595 in / 1356 out tokens · 32860 ms · 2026-05-16T23:02:42.894519+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories... followed by an IK fitting stage

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Multi-hmr: Multi-person whole-body human mesh recovery in a single shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Ro- main Br ´egier, Philippe Weinzaepfel, Gr ´egory Rogez, and Thomas Lucas. Multi-hmr: Multi-person whole-body human mesh recovery in a single shot. InEuropean Conference on Computer Vision, pages 202–218. Springer, 2024. 3

work page 2024
[2]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2

work page 2020
[3]

Cascaded pyramid net- work for multi-person pose estimation

Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid net- work for multi-person pose estimation. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7103–7112, 2018. 2

work page 2018
[4]

Huang, and Lei Zhang

Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estima- tion. InCVPR, 2020. 2

work page 2020
[5]

Beyond static features for temporally consis- tent 3d human pose and shape from a video

Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Ky- oung Mu Lee. Beyond static features for temporally consis- tent 3d human pose and shape from a video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1964–1973, 2021. 3

work page 1964
[6]

Deitke, D

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- verse: A universe of annotated 3d objects.arXiv preprint arXiv:2212.08051, 2022. 6

work page arXiv 2022
[7]

Objaverse-XL: A Universe of 10M+ 3D Objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Chris- tian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023. 6

work page internal anchor Pith review arXiv 2023
[8]

Anytop: Character animation diffusion with any topology.arXiv preprint arXiv:2502.17327, 2025

Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit H Bermano, and Daniel Cohen-Or. Anytop: Character animation diffusion with any topology.arXiv preprint arXiv:2502.17327, 2025. 4, 5

work page arXiv 2025
[9]

Humans in 4d: Re- constructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Re- constructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023. 3

work page 2023
[10]

A graph-based approach for category-agnostic pose estimation.arXiv preprint arXiv:2311.17891, 2023

Or Hirschorn and Shai Avidan. A graph-based approach for category-agnostic pose estimation.arXiv preprint arXiv:2311.17891, 2023. 2

work page arXiv 2023
[11]

End-to-end recovery of human shape and pose

Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018. 1, 3

work page 2018
[12]

Learning category-specific mesh reconstruc- tion from image collections

Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruc- tion from image collections. InProceedings of the Euro- pean conference on computer vision (ECCV), pages 371– 386, 2018. 3

work page 2018
[13]

Vibe: Video inference for human body pose and shape estimation

Muhammed Kocabas, Nikos Athanasiou, and Michael J Black. Vibe: Video inference for human body pose and shape estimation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 5253–5263, 2020. 3

work page 2020
[14]

Pace: Human and camera motion estimation from in- the-wild videos

Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in- the-wild videos. In2024 International Conference on 3D Vision (3DV), pages 397–408. IEEE, 2024. 3

work page 2024
[15]

Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation

Jia Li, Wen Su, and Zengfu Wang. Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation. InProceedings of the AAAI conference on artifi- cial intelligence, pages 11354–11361, 2020. 2

work page 2020
[16]

Human pose regression with residual log-likelihood estimation

Jiefeng Li, Siyuan Bian, Ailing Zeng, Can Wang, Bo Pang, Wentao Liu, and Cewu Lu. Human pose regression with residual log-likelihood estimation. InProceedings of the IEEE/CVF international conference on computer vision, pages 11025–11034, 2021. 2

work page 2021
[17]

Simcc: A simple coordinate classification perspective for hu- man pose estimation

Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunx- iao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. Simcc: A simple coordinate classification perspective for hu- man pose estimation. InEuropean Conference on Computer Vision, pages 89–106. Springer, 2022. 2

work page 2022
[18]

Learning the 3d fauna of the web

Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. Learning the 3d fauna of the web. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9752–9762, 2024. 3

work page 2024
[19]

One-stage 3d whole-body mesh recovery with component aware transformer

Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-stage 3d whole-body mesh recovery with component aware transformer. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21159–21168, 2023. 3

work page 2023
[20]

SMPL: A skinned multi- person linear model.ACM transactions on graphics (TOG), 34(6):1–16, 2015

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi- person linear model.ACM transactions on graphics (TOG), 34(6):1–16, 2015. 1, 3 9

work page 2015
[21]

Stacked hour- glass networks for human pose estimation

Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hour- glass networks for human pose estimation. InEuropean con- ference on computer vision, pages 483–499. Springer, 2016. 2

work page 2016
[22]

Black, Sil- via Zuffi, and Peter Kulits

Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Sil- via Zuffi, and Peter Kulits. Generative zoo.CoRR, abs/2412.08101, 2024. 3

work page arXiv 2024
[23]

Black, Silvia Zuffi, and Peter Kulits

Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, and Peter Kulits. Generative zoo. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 6

work page 2025
[24]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Rus- sell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv´e J´egou, Julien Ma...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Expressive body capture: 3d hands, face, and body from a single image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 1, 3

work page 2019
[26]

Capex: Category-agnostic pose estimation from textual point expla- nation.arXiv preprint arXiv:2406.00384, 2024

Matan Rusanovsky, Or Hirschorn, and Shai Avidan. Capex: Category-agnostic pose estimation from textual point expla- nation.arXiv preprint arXiv:2406.00384, 2024. 3

work page arXiv 2024
[27]

Capex: Category-agnostic pose estimation from textual point ex- planation

Matan Rusanovsky, Or Hirschorn, and Shai Avidan. Capex: Category-agnostic pose estimation from textual point ex- planation. InThe Thirteenth International Conference on Learning Representations, 2025. 1

work page 2025
[28]

End-to-end multi-person pose estimation with transformers

Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 11069–11078, 2022. 2

work page 2022
[29]

Matching is not enough: A two-stage frame- work for category-agnostic pose estimation

Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, and Zhiguo Cao. Matching is not enough: A two-stage frame- work for category-agnostic pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 7308–7317, 2023. 2

work page 2023
[30]

Wham: Reconstructing world-grounded humans with accu- rate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accu- rate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070– 2080, 2024. 3

work page 2070
[31]

Deep high-resolution representation learning for human pose es- timation

Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose es- timation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703,

work page
[32]

Aios: All-in-one-stage expres- sive human pose and shape estimation

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi-Sing Leung, Zi- wei Liu, Lei Yang, et al. Aios: All-in-one-stage expres- sive human pose and shape estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1834–1843, 2024. 3

work page 2024
[33]

Deeppose: Hu- man pose estimation via deep neural networks

Alexander Toshev and Christian Szegedy. Deeppose: Hu- man pose estimation via deep neural networks. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 1

work page 2014
[34]

Deeppose: Human pose estimation via deep neural networks

Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1653–1660, 2014. 2

work page 2014
[35]

Truebones motion capture – mocap files, n.d

Truebones. Truebones motion capture – mocap files, n.d. Accessed: 2025-05-22. 1, 2, 6

work page 2025
[36]

Dongkai Wang, Shiyu Xuan, and Shiliang Zhang. Locllm: Exploiting generalizable human keypoint localization via large language model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 614–623, 2024. 2
[37]

Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. In European Conference on Computer Vision, pages 467–487. Springer, 2024. 3
[38]

Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, and Andrea Vedaldi. Magicpony: Learning articulated 3d animals in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8792–8802, 2023. 3
[39]

Yuefan Wu*, Zeyuan Chen*, Shaowei Liu, Zhongzheng Ren, and Shenlong Wang. CASA: Category-agnostic skeletal animal reconstruction. In NeurIPS, 2022. 3
[40]

Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 466–481, 2018. 2
[41]

Yabo Xiao, Kai Su, Xiaojuan Wang, Dongdong Yu, Lei Jin, Mingshu He, and Zehuan Yuan. Querypose: Sparse multi-person pose regression via spatial-aware part-level query. Advances in Neural Information Processing Systems, 35:12464–12477, 2022. 2
[42]

Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, and Xiaogang Wang. Pose for everything: Towards category-agnostic pose estimation. In European conference on computer vision, pages 398–416. Springer, 2022. 2
[43]

Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in neural information processing systems, 35:38571–38584, 2022. 2
[44]

Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Huiwen Chang, Deva Ramanan, William T Freeman, and Ce Liu. Lasr: Learning articulated shape reconstruction from a monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15980–15989, 2021. 3
[45]

Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Ce Liu, and Deva Ramanan. Viser: Video-specific surface embeddings for articulated 3d shape reconstruction. Advances in Neural Information Processing Systems, 34:19326–19338, 2021. 3
[46]

Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In CVPR, 2022. 3
[47]

Gengshan Yang, Shuo Yang, John Z. Zhang, Zachary Manchester, and Deva Ramanan. Physically plausible reconstruction from monocular videos. In ICCV, 2023. 3
[48]

Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit box detection unifies end-to-end multi-person pose estimation. arXiv preprint arXiv:2302.01593, 2023. 2
[49]

Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani. Lassie: Learning articulated shapes from sparse image ensemble via 3d part discovery. Advances in Neural Information Processing Systems, 35:15296–15308, 2022. 3
[50]

Wanqi Yin, Zhongang Cai, Ruisi Wang, Fanzhou Wang, Chen Wei, Haiyi Mei, Weiye Xiao, Zhitao Yang, Qingping Sun, Atsushi Yamashita, et al. Whac: World-grounded humans and cameras. In European Conference on Computer Vision, pages 20–37. Springer, 2024. 3
[51]

Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11038–11049, 2022. 3
[52]

Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and Michael J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 3

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Supplementary Material


More Visualization Results

In this section, we summarize additional qualitative results from our supplementary webpage. These visualizations highlight the effectiveness of our approach across controlled multi-species datasets, in-the-wild videos, and cross-species retargeting scenarios, showing that our model produces high-fidelity and temporally...


Implementation Details

A. Dataset and Environment Details

Dataset Processing. All meshes are first scaled by the bounding box of their rest pose, normalizing each mesh into a unit-volume space. For sequence data, we remove the global translation of every frame, compute a sequence-level super bounding box, and uniformly scale the entire sequence into the...
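The normalization steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and two details the text does not specify are assumed here, namely that the per-frame global translation is the center of that frame's bounding box, and that the uniform scale maps the longest bounding-box edge to length 1.

```python
import numpy as np

def normalize_rest_mesh(vertices):
    """Center a rest-pose mesh (V, 3) and scale it by its bounding box.

    Assumption: the longest bounding-box edge is mapped to length 1.
    """
    vmin, vmax = vertices.min(axis=0), vertices.max(axis=0)
    center = 0.5 * (vmin + vmax)
    scale = float((vmax - vmin).max())
    return (vertices - center) / scale

def normalize_sequence(frames):
    """Normalize a vertex sequence of shape (T, V, 3).

    1. Remove each frame's global translation
       (assumption: the center of the per-frame bounding box).
    2. Compute one bounding box over the whole centered sequence.
    3. Apply a single uniform scale to every frame.
    """
    fmin = frames.min(axis=1, keepdims=True)
    fmax = frames.max(axis=1, keepdims=True)
    centered = frames - 0.5 * (fmin + fmax)

    # Sequence-level ("super") bounding box over all frames and vertices.
    smin = centered.min(axis=(0, 1))
    smax = centered.max(axis=(0, 1))
    scale = float((smax - smin).max())
    return centered / scale
```

Using one scale for the whole sequence, instead of rescaling each frame, keeps relative size changes between frames intact, which is presumably why a sequence-level box is computed. Degenerate inputs with zero extent are not handled in this sketch.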


Evaluation Metrics

This section describes the computation of the proposed metric (CD-Skeleton), which evaluates the alignment between two articulated skeletons. Each skeleton is represented by a set of 3D joint positions and a kinematic hierarchy defined by a parent array.

Notation

Let Skeleton A and Skeleton B be defined as:
• Joint positions: X_A = {x^A_i ∈ R...
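To make the skeleton representation concrete, the sketch below stores a skeleton as a joint-position array plus a parent array and derives its bone segments. It illustrates only the data structure, not the CD-Skeleton metric itself, whose definition is not reproduced in this excerpt. The helper name `bone_segments`, the toy skeleton, and the convention that the root joint has parent index -1 are assumptions made for illustration.

```python
import numpy as np

def bone_segments(joints, parents):
    """Return a (B, 2, 3) array of (parent, child) endpoints, one per bone.

    joints  : (J, 3) array of 3D joint positions
    parents : length-J sequence; parents[i] is the parent index of joint i,
              with -1 marking the root (assumed convention)
    """
    joints = np.asarray(joints, dtype=float)
    child_idx = [i for i, p in enumerate(parents) if p >= 0]
    parent_idx = [parents[i] for i in child_idx]
    return np.stack([joints[parent_idx], joints[child_idx]], axis=1)

# Toy 4-joint skeleton: a root, a two-joint chain above it, and one side branch.
joints = np.array([[0, 0, 0], [0, 1, 0], [0, 2, 0], [1, 1, 0]], dtype=float)
parents = [-1, 0, 1, 1]

segments = bone_segments(joints, parents)
lengths = np.linalg.norm(segments[:, 1] - segments[:, 0], axis=1)
```

Every non-root joint contributes exactly one bone, so a skeleton with J joints and a single root yields J - 1 segments.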
