
arxiv: 2606.02366 · v1 · pith:A7YTOZLQnew · submitted 2026-06-01 · 💻 cs.CV

PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation

Pith reviewed 2026-06-28 14:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords animal mesh recovery · 3D reconstruction · biological priors · test-time adaptation · quadruped mesh · pseudo-3D dataset · SMAL model

The pith

Biological priors from image embeddings combined with test-time adaptation using 2D constraints improve 3D mesh recovery for diverse quadruped species and poses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to overcome the tendency of existing animal reconstruction methods to default to average shapes and poses when data is scarce and species distributions are long-tailed. It injects semantic and morphological knowledge through BioCLIP embeddings and refines SMAL predictions at test time with reprojection losses plus keypoint guidance. These steps also produce a large pseudo-3D dataset called Quadruped3D from existing 2D sources. If the approach works, models should generalize better to rare animals and difficult articulations without requiring new 3D annotations. Experiments on four benchmarks report state-of-the-art numbers, with the largest gains on underrepresented species.

Core claim

Biological priors supplied by BioCLIP embeddings, together with a test-time adaptation procedure that enforces 2D reprojection consistency and auxiliary keypoint constraints, enable more accurate and generalizable SMAL-based mesh predictions across quadrupeds; the same adaptation process can be used to bootstrap a large-scale pseudo-3D training set that further lifts performance.

What carries the argument

BioCLIP embeddings acting as biological priors to condition shape prediction, paired with a test-time adaptation loop that optimizes SMAL parameters against 2D reprojection and keypoint losses.
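
The adaptation loop described above can be illustrated with a deliberately small stand-in. This is a sketch under stated assumptions, not the paper's implementation: instead of SMAL pose and shape, it fits only a hypothetical weak-perspective camera (`scale`, `tx`, `ty`) so that fixed 3D joints reproject onto observed 2D keypoints, and it masks out keypoints marked as not visible. All names, the learning rate, and the toy data are invented for illustration.

```python
# Toy stand-in for test-time adaptation with a 2D reprojection loss.
# Assumption: only a weak-perspective camera is optimized; a real pipeline
# would also update SMAL pose and shape through a differentiable mesh model.

def project(joints3d, scale, tx, ty):
    """Weak-perspective projection: drop depth, scale, then translate."""
    return [(scale * x + tx, scale * y + ty) for x, y, _z in joints3d]

def reprojection_loss(joints3d, keypoints2d, visible, scale, tx, ty):
    """Mean squared 2D error over visible keypoints only."""
    pred = project(joints3d, scale, tx, ty)
    n = max(sum(visible), 1)
    total = 0.0
    for (px, py), (gx, gy), v in zip(pred, keypoints2d, visible):
        if v:
            total += (px - gx) ** 2 + (py - gy) ** 2
    return total / n

def adapt(joints3d, keypoints2d, visible, scale=1.0, tx=0.0, ty=0.0,
          lr=0.1, steps=300):
    """Plain gradient descent on the masked reprojection loss."""
    n = max(sum(visible), 1)
    for _ in range(steps):
        g_s = g_tx = g_ty = 0.0
        for (x, y, _z), (gx, gy), v in zip(joints3d, keypoints2d, visible):
            if not v:
                continue  # occluded keypoints contribute no gradient
            rx = scale * x + tx - gx
            ry = scale * y + ty - gy
            g_s += 2.0 * (rx * x + ry * y) / n
            g_tx += 2.0 * rx / n
            g_ty += 2.0 * ry / n
        scale -= lr * g_s
        tx -= lr * g_tx
        ty -= lr * g_ty
    return scale, tx, ty

# Hypothetical data: four joints, the last one occluded with a junk label.
joints3d = [(-1.0, 0.0, 0.2), (1.0, 0.0, 0.1), (0.0, 1.0, 0.5), (0.0, -1.0, 0.3)]
keypoints2d = project(joints3d, 2.0, 0.5, -0.3)
keypoints2d[3] = (99.0, 99.0)
visible = [1, 1, 1, 0]

before = reprojection_loss(joints3d, keypoints2d, visible, 1.0, 0.0, 0.0)
scale, tx, ty = adapt(joints3d, keypoints2d, visible)
after = reprojection_loss(joints3d, keypoints2d, visible, scale, tx, ty)
print(f"loss before={before:.4f} after={after:.2e}")
```

In this toy the loss is a convex quadratic, so the loop recovers the generating camera. With a full articulated model the 2D objective is ambiguous in depth, so more optimization steps do not necessarily give a better mesh.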

If this is right

  • Models trained with the generated Quadruped3D dataset achieve higher accuracy on long-tailed species distributions.
  • Test-time refinement produces usable pseudo-3D labels from ordinary 2D animal photographs.
  • Performance gains concentrate on underrepresented species and extreme poses.
  • The same prior-plus-adaptation pattern applies to any SMAL-style parametric model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be tested on non-quadruped animals or on other parametric body models by swapping the embedding source and loss terms.
  • If the priors remain effective, similar adaptation pipelines might reduce reliance on expensive 3D animal capture in behavioral studies.
  • One could measure whether the improvement scales with the diversity of the 2D source data used for adaptation.

Load-bearing premise

The embeddings carry semantic and morphological information that improves shape estimates across quadrupeds without importing unwanted biases from the image collection used to train them.

What would settle it

A controlled test on a held-out set of 2D images of rare species where independent 3D ground truth or expert-verified meshes show no improvement over a non-adapted baseline after PRIMA is applied.

Figures

Figures reproduced from arXiv: 2606.02366 by Mackenzie Weygandt Mathis, Ti Wang, Xiaohang Yu.

Figure 1: Overview of PRIMA. Due to the unbalanced distribution of 3D animal training data, vanilla networks often suffer from mean regression bias, yielding implausible meshes that collapse toward the average shape of majority species (e.g., a "dog-like" zebra in wrong prediction). We propose to inject biological taxonomies and keypoint-aware tokens as strong inductive biases into the latent space. This semantic an…
Figure 2: Overview of the proposed model architecture. Given an input image, a ViT encoder extracts image tokens and a BioCLIP encoder produces a bio token. An MLP maps the bio token to initialize the SMAL shape parameter β̂₀, and the SMAL parameters are projected into a SMAL parameter token. The transformer decoder and regression heads follow an Iterative Error Feedback (IEF) scheme for N iterations. Across L laye…
Figure 3: Overview of the proposed three-stage training framework. The proposed methodology consists of an alternating three-stage training paradigm. Stage 1, warm-up training, leverages available 3D datasets and a 2D dataset to initialize the model parameters. Stage 2, per-instance refinement, then optimizes pose and shape parameters under 2D keypoint supervision to construct pseudo Ground Truth (pGT) SMAL paramete…
Figure 4: Quadruped3D pGT examples. Each row visualizes the input image (left), the initial 3D prediction before test-time adaptation (middle), and the refined pGT after adaptation (right), illustrating pose alignment and shape recovery across diverse species and articulated poses.

3.5 Our Quadruped3D Dataset. Quadruped2D curation. Quadruped2D is constructed from the SuperAnimal dataset [50], which contains 80k imag…
Figure 5: Qualitative comparisons on Animal3D (rows 1–4), CtrlAni3D (rows 5–6), Animal Kingdom (rows 7–8), and Quadruped2D (row 9) datasets. We compare our results with AniMer [28] and AniMer+ [1].
Figure 6: t-SNE [29] embeddings of predicted and GT shape parameters on Animal3D. Colors indicate different animal family category labels.

4.5 Ablation Study. Ablation on model architecture. We assess the contribution of biological priors and learnable keypoint tokens through three ablation variants. (a) w/o F_bio: We remove biologically initialized shape features while retaining the coarse-to-fine shape estimation st…
Figure 7: Ablation on model architecture. The top row shows a sheep from Quadruped2D, and the bottom row shows a zebra from Animal3D.

                  Animal3D        CtrlAni3D       Quadruped2D       Animal Kingdom
Method            PAJ↓    PAV↓    PAJ↓    PAV↓    AUC↑    P@0.1↑    AUC↑    P@0.1↑
w/o F_bio         75.5    79.7    43.5    46.2    91.7    78.5      83.6    37.3
w/o β_init        75.6    80.1    41.8    44.4    91.9    78.9      83.4    36.1
w/o T_keypoint    75.6    80.2    43.5    46.2    91.1    75.9      83.7    37.0
Ours (Stage-1)    75.3    79.5    40.1    42.8    …
Figure 8: Qualitative results on Animal3D dataset before and after test-time adaptation. The orange arrow indicates better alignment refined after TTA compared to before.

                 PAJ↓               PAV↓               AUC↑
Instance         Before   After     Before   After     Before   After   pGT
Cat licking      273.6    313.9     295.8    318.1     83.5     92.6    81.6
Dog jumping      133.1    117.6     114.2    80.6      81.9     90.1    95.8
Cat sitting      113.6    88.3      77.2     67.2      86.5     94.6    91.8
Horse running    52.1     46.0      53.2     52.…
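
The before/after PAJ values in the Figure 8 table can be turned into relative changes with a few lines of arithmetic. The sketch below assumes that PAJ is a lower-is-better error and that relative improvement is defined as (before − after) / before; that definition is an assumption here, since the page does not state the formula. The numbers are copied from the table; the function name is illustrative.

```python
# PAJ before/after test-time adaptation, copied from the Figure 8 table.
paj = {
    "Cat licking": (273.6, 313.9),
    "Dog jumping": (133.1, 117.6),
    "Cat sitting": (113.6, 88.3),
    "Horse running": (52.1, 46.0),
}

def relative_improvement(before, after):
    """Assumed definition: positive when a lower-is-better error shrinks."""
    return (before - after) / before

improvement = {name: relative_improvement(b, a) for name, (b, a) in paj.items()}
for name, value in improvement.items():
    print(f"{name}: {value:+.1%}")
```

Under this definition, three of the four instances improve (roughly 12% to 22%), while the cat-licking instance gets worse by about 15% on PAJ even though its AUC rises from 83.5 to 92.6. Better 2D alignment does not guarantee a better 3D mesh.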
Figure 1: Family-level image distributions of the datasets used in this work.
Figure 2: Species-level image distribution of the Animal3D dataset across 40 categories, sorted by total count in descending order. Bar colors indicate the biological family: Felidae, Equidae, Canidae, Bovidae, and Hippopotamidae. The dataset contains 3,385 images in total and exhibits significant class imbalance, with the most populated category (sorrel horse, 710 images) containing ∼39× more images than the least…
Figure 3: Qualitative comparisons on Animal3D and Quadruped2D dataset.
Figure 4: More examples on Quadruped3D pGT.
Figure 5: Qualitative results on Animal Kingdom datasets.

D Additional Ablation Study. D.1 Ablation on biological priors. To validate the biological embedding, we analyze the t-SNE [29] visualizations of the BioCLIP [41] feature and the shape init features, both with and without F_bio. As shown in …
Figure 6: t-SNE [29] visualization of BioCLIP feature and shape init feature on the Animal3D test set.

D.2 Ablation on Quadruped3D. To further demonstrate the impact of our Quadruped3D dataset, we conduct a comparative analysis between stage-1 and stage-3 training regimes, in which the model is trained with Quadruped2D and Quadruped3D, respectively, along with the Animal3D and CtrlAni3D datasets. As shown in …
Figure 7: Effect of training with Quadruped3D on Animal3D (rows 1-2), and Animal Kingdom (rows 3-4) datasets.

steps), performance degrades and even drops below the no-adaptation baseline. This suggests that excessive optimization overfits the 2D observations and degrades mesh reconstruction, especially without using silhouette information, unlike the SMALify [5] optimization process. We also investigate how the nu…
Figure 8: Effect of optimization iterations and visible keypoints in the horse-running case. Left: Relative improvement of PA-MPJPE, PA-MPVPE, and AUC as the number of optimization iterations increases. Right: Relative PA-MPVPE improvement under different numbers of visible keypoint conditions as the optimization iteration increases.
Abstract

We present PRIMA (*PRI*ors for *M*esh *A*daptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents PRIMA, a framework for 3D quadruped mesh recovery that injects BioCLIP embeddings as biological priors for shape prediction and employs test-time adaptation (TTA) with 2D reprojection and keypoint guidance on the SMAL model to generate pseudo-3D annotations. This enables construction of the Quadruped3D dataset from existing 2D sources to mitigate species/pose imbalance. The central empirical claim is state-of-the-art performance on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom, with largest gains on underrepresented species and rare poses; code is released.

Significance. If the pseudo-label quality and BioCLIP contribution hold under scrutiny, the work provides a practical route to scalable animal mesh recovery by combining semantic priors with adaptation-driven data expansion. The explicit release of code and the Quadruped3D construction process are concrete strengths that could support follow-on research on long-tailed 3D animal datasets.

major comments (3)
  1. [§3.3 and §4.1] §3.3 (TTA procedure) and §4.1 (Quadruped3D construction): the pseudo-3D labels are generated solely from 2D reprojection loss plus keypoint guidance without any held-out 3D ground-truth validation or cross-check against an independent 3D test set for tail species; this leaves open the possibility that reported gains on underrepresented animals simply reinforce systematic errors in the pseudo-annotations rather than demonstrate genuine generalization from BioCLIP priors.
  2. [§5] §5 (experiments): the main tables report SOTA numbers but provide no per-species error breakdowns, confidence intervals, or ablation isolating the BioCLIP embedding contribution from the TTA data-augmentation effect; without these controls it is difficult to attribute the claimed improvements on rare poses specifically to the biological priors.
  3. [§3.2] §3.2 (BioCLIP integration): the claim that BioCLIP supplies morphological knowledge that improves shape prediction across diverse quadrupeds is not accompanied by any analysis of domain shift between BioCLIP's pretraining corpus and the target animal images, nor by a controlled experiment replacing BioCLIP with a generic vision-language embedding.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 refer to "Quadruped2D" and "Animal Kingdom" without citing the original dataset papers; add these references for completeness.
  2. [§5] Figure captions in §5 could more explicitly state whether reported metrics are mean per-vertex error or PA-MPJPE to aid direct comparison with prior animal mesh work.
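
To make the metric distinction in the last minor comment concrete, here is a minimal sketch of a mean per-point Euclidean error. Applied to joints it gives an MPJPE-style number, and applied to mesh vertices it gives a per-vertex error. The "PA" variants first rigidly align the prediction to the ground truth (Procrustes alignment over rotation, translation, and scale) before measuring; that alignment step is not implemented here. The function name and toy coordinates are illustrative assumptions.

```python
import math

def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points.

    Pass joints for a per-joint error or mesh vertices for a per-vertex
    error. No Procrustes alignment is applied, so this is the non-PA form.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Hypothetical two-joint example: the errors are 3 and 4 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mean_per_point_error(pred, gt)
print(error)
```

Because the aligned and unaligned forms can differ a lot for the same prediction, a caption that names the exact variant removes the ambiguity the referee points to.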

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with clarifications and commitments to revisions where they strengthen the work.

Point-by-point responses
  1. Referee: [§3.3 and §4.1] the pseudo-3D labels are generated solely from 2D reprojection loss plus keypoint guidance without any held-out 3D ground-truth validation or cross-check against an independent 3D test set for tail species; this leaves open the possibility that reported gains on underrepresented animals simply reinforce systematic errors in the pseudo-annotations rather than demonstrate genuine generalization from BioCLIP priors.

    Authors: We acknowledge the concern about pseudo-label validation for tail species. The TTA is driven by reliable 2D keypoints and reprojection from source datasets, and the final model is evaluated on held-out 3D benchmarks (Animal3D, CtrlAni3D) containing underrepresented species. Consistent SOTA gains across these sets indicate genuine improvement rather than error reinforcement. In revision we will expand §4.1 with additional details on pseudo-label quality, including qualitative examples and quantitative checks against any available 3D data. revision: partial

  2. Referee: [§5] the main tables report SOTA numbers but provide no per-species error breakdowns, confidence intervals, or ablation isolating the BioCLIP embedding contribution from the TTA data-augmentation effect; without these controls it is difficult to attribute the claimed improvements on rare poses specifically to the biological priors.

    Authors: We agree these controls would strengthen attribution of gains. The revised manuscript will add per-species error breakdowns on benchmarks with species annotations, include confidence intervals on main metrics, and present an ablation isolating BioCLIP priors from the TTA-driven Quadruped3D data expansion. revision: yes

  3. Referee: [§3.2] the claim that BioCLIP supplies morphological knowledge that improves shape prediction across diverse quadrupeds is not accompanied by any analysis of domain shift between BioCLIP's pretraining corpus and the target animal images, nor by a controlled experiment replacing BioCLIP with a generic vision-language embedding.

    Authors: We agree a direct comparison would better substantiate the biological priors. In revision we will add both a short discussion of domain considerations for BioCLIP and a controlled ablation in §5 that replaces BioCLIP with a generic vision-language embedding (e.g., CLIP) to quantify the specific benefit. revision: yes
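
The per-species breakdowns and confidence intervals promised in the second response could be produced along the following lines. This is a hedged sketch, not the authors' evaluation code: the records are hypothetical (species, error) pairs, and the interval is a simple percentile bootstrap over instances.

```python
import random
from collections import defaultdict
from statistics import mean

def per_species_mean(records):
    """Group (species, error) pairs and average the error within each species."""
    groups = defaultdict(list)
    for species, error in records:
        groups[species].append(error)
    return {species: mean(values) for species, values in groups.items()}

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_index], means[hi_index]

# Hypothetical per-instance errors; not taken from the paper.
records = [
    ("zebra", 80.0), ("zebra", 90.0),
    ("horse", 50.0), ("horse", 54.0), ("horse", 46.0),
    ("dog", 60.0),
]
by_species = per_species_mean(records)
errors = [error for _species, error in records]
lo, hi = bootstrap_ci(errors)
print(by_species)
print(f"mean={mean(errors):.1f}, 95% CI=({lo:.1f}, {hi:.1f})")
```

With few instances per tail species the interval will be wide, which is itself useful: it shows whether a reported gain on rare animals is distinguishable from noise.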

Circularity Check

0 steps flagged

No circularity: derivation relies on external BioCLIP priors and standard TTA/reprojection losses without self-referential reduction

full rationale

The paper's core steps—injecting BioCLIP embeddings as biological priors, applying TTA with 2D reprojection and keypoint guidance on the external SMAL model to generate Quadruped3D pseudo-labels, then training and evaluating on held-out benchmarks (Animal3D, CtrlAni3D, etc.)—do not reduce any claimed prediction or result to quantities defined inside the paper by construction. No equations equate fitted parameters to outputs, no self-citation chain bears the central claim, and the TTA-generated dataset is an input expansion step whose value is assessed via external metrics rather than tautological reuse. This matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the effectiveness of BioCLIP embeddings as morphological priors and on the assumption that 2D reprojection plus keypoint guidance yields reliable pseudo-3D labels.

axioms (2)
  • domain assumption: BioCLIP embeddings capture useful semantic and morphological knowledge for quadruped shape prediction.
    Invoked in the first contribution to justify injection of BioCLIP into the reconstruction network.
  • domain assumption: Test-time adaptation with 2D reprojection constraints improves pose and shape estimates without accumulating errors.
    Central to the second contribution and the construction of Quadruped3D.

pith-pipeline@v0.9.1-grok · 5802 in / 1407 out tokens · 26072 ms · 2026-06-28T14:53:08.961564+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 7 canonical work pages · 1 internal anchor

  [1] An, L., Lyu, J., Lin, L., Cheng, P., Liu, Y., Tang, X.: Animer+: Unified pose and shape estimation across mammalia and aves via family-aware transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  [2] Badger, M., Wang, Y., Modh, A., Perkes, A., Kolotouros, N., Pfrommer, B.G., Schmidt, M.F., Daniilidis, K.: 3d bird reconstruction: a dataset, model, and shape recovery from a single view. In: ECCV. pp. 1–17. Springer (2020)
  [3] Bidulka, L., Gholami, M., Zheng, J., McKeown, M.J., Wang, Z.J.: Escape: Energy-based selective adaptive correction for out-of-distribution 3d human pose estimation. Neurocomputing 611, 128605 (2025)
  [4] Biggs, B., Boyne, O., Charles, J., Fitzgibbon, A., Cipolla, R.: Who left the dogs out? 3d animal reconstruction with expectation maximization in the loop. In: ECCV. pp. 195–211. Springer (2020)
  [5] Biggs, B., Roddick, T., Fitzgibbon, A., Cipolla, R.: Creatures great and SMAL: Recovering the shape and motion of animals from video. In: ACCV (2018)
  [6] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: ECCV. pp. 561–578 (2016)
  [7] Cho, G., Kang, C., Soon, D., Joo, K.: Dogrecon: Canine prior-guided animatable 3d gaussian dog reconstruction from a single image. International Journal of Computer Vision 133(9), 6332–6346 (2025)
  [8] Dwivedi, S.K., Sun, Y., Patel, P., Feng, Y., Black, M.J.: Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. In: CVPR. pp. 1323–1333 (2024)
  [9] Fang, Q., Shuai, Q., Dong, J., Bao, H., Zhou, X.: Reconstructing 3d human pose by watching humans in the mirror. In: CVPR (2021)
  [10] Ferguson, A., Osman, A.A., Bescos, B., Stoll, C., Twigg, C., Lassner, C., Otte, D., Vignola, E., Prada, F., Bogo, F., et al.: Mhr: Momentum human rig. arXiv preprint arXiv:2511.15586 (2025)
  [12] Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4d: Reconstructing and tracking humans with transformers. In: ICCV. pp. 14783–14794 (2023)
  [13] Gu, J., Stevens, S., Campolongo, E.G., Thompson, M.J., Zhang, N., Wu, J., Kopanev, A., Mai, Z., White, A.E., Balhoff, J., et al.: Bioclip 2: Emergent properties from scaling hierarchical contrastive learning. arXiv preprint arXiv:2505.23883 (2025)
  [14] Guan, S., Xu, J., He, M.Z., Wang, Y., Ni, B., Yang, X.: Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  [15] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), 1325–1339 (2013)
  [16] Joo, H., Neverova, N., Vedaldi, A.: Exemplar fine-tuning for 3D human model fitting towards in-the-wild 3D human pose estimation. In: 3DV. pp. 42–52. IEEE (2021)

  [17] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR. pp. 7122–7131 (2018)
  [18] Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: ECCV. pp. 371–386 (2018)
  [19] Kulits, P., Black, M.J., Zuffi, S.: Reconstructing animals and the wild. In: CVPR. pp. 16565–16577 (June 2025)
  [20] Kulkarni, N., Gupta, A., Fouhey, D.F., Tulsiani, S.: Articulation-aware canonical surface mapping. In: CVPR. pp. 452–461 (2020)
  [21] Li, C., Lee, G.H.: Coarse-to-fine animal pose and shape estimation. Advances in Neural Information Processing Systems 34, 11757–11768 (2021)
  [22] Li, C., Ghorbani, N., Broomé, S., Rashid, M., Black, M.J., Hernlund, E., Kjellström, H., Zuffi, S.: hsmal: Detailed horse shape and pose reconstruction for motion pattern recognition. arXiv preprint arXiv:2106.10102 (2021)
  [23] Li, C., Yang, Y., Weng, Z., Hernlund, E., Zuffi, S., Kjellström, H.: Dessie: Disentanglement for articulated 3d horse shape and pose estimation from images. In: Proceedings of the Asian Conference on Computer Vision. pp. 764–783 (2024)
  [24] Li, Z., Amrani, A., Rai, S., Laga, H.: Advances and trends in the 3d reconstruction of the shape and motion of animals. arXiv preprint arXiv:2508.16062 (2025)
  [25] Li, Z., Litvak, D., Li, R., Zhang, Y., Jakab, T., Rupprecht, C., Wu, S., Vedaldi, A., Wu, J.: Learning the 3d fauna of the web. In: CVPR. pp. 9752–9762 (2024)
  [26] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM TOG 34(6), 1–16 (2015)
  [27] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  [28] Lyu, J., Zhu, T., Gu, Y., Lin, L., Cheng, P., Liu, Y., Tang, X., An, L.: Animer: Animal pose and shape estimation using family aware transformer. In: CVPR. pp. 17486–17496 (2025)
  [29] Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research 9(11) (2008)
  [30] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: ICCV. pp. 5442–5451 (Oct 2019)
  [31] Mu, J., Qiu, W., Hager, G.D., Yuille, A.L.: Learning from synthetic animals. In: CVPR (June 2020)
  [32] Nam, H., Jung, D.S., Oh, Y., Lee, K.M.: Cyclic test-time adaptation on monocular video for 3d human mesh reconstruction. In: ICCV. pp. 14829–14839 (2023)
  [33] Niewiadomski, T., Yiannakidis, A., Cuevas-Velasquez, H., Sanyal, S., Black, M.J., Zuffi, S., Kulits, P.: Generative zoo. In: ICCV. pp. 8492–8502 (2025)
  [34] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR. pp. 10975–10985 (2019)
  [35] Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3d with transformers. In: CVPR. pp. 9826–9836 (2024)
  [36] Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: CVPR (2021)
  [37] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: ICCV. pp. 11488–11499 (2021)
  [38] Rueegg, N., Zuffi, S., Schindler, K., Black, M.J.: Barc: Learning to regress 3d dog shape from images by exploiting breed information. In: CVPR. pp. 3876–3884 (2022)
  [39] Rüegg, N., Tripathi, S., Schindler, K., Black, M.J., Zuffi, S.: Bite: Beyond priors for improved three-d dog pose estimation. In: CVPR. pp. 8867–8876 (2023)

  [40] Sabathier, R., Mitra, N.J., Novotny, D.: Animal avatars: Reconstructing animatable 3d animals from casual videos. In: ECCV. pp. 270–287. Springer (2024)
  [41] Stevens, S., Wu, J., Thompson, M.J., Campolongo, E.G., Song, C.H., Carlyn, D.E., Dong, L., Dahdul, W.M., Stewart, C., Berger-Wolf, T., Chao, W.L., Su, Y.: BioCLIP: A vision foundation model for the tree of life. In: CVPR. pp. 19412–19424 (2024)
  [42] Tan, J., Yang, G., Ramanan, D.: Distilling neural fields for real-time articulated shape reconstruction. In: CVPR. pp. 4692–4701 (2023)
  [43] Wang, T., Liu, M., Liu, H., Ren, B., You, Y., Li, W., Sebe, N., Li, X.: Uncertainty-aware testing-time optimization for 3d human pose estimation. IEEE Transactions on Multimedia (2026)
  [44] Wang, Y., Kolotouros, N., Daniilidis, K., Badger, M.: Birds of a feather: Capturing avian shape models from images. In: CVPR. pp. 14739–14749 (2021)
  [45] Wu, J., Pavlakos, G., Gkioxari, G., Malik, J.: Reconstructing hand-held objects in 3d from images and videos. arXiv preprint arXiv:2404.06507 (2024)
  [46] Xu, J., Zhang, Y., Peng, J., Ma, W., Jesslen, A., Ji, P., Hu, Q., Zhang, J., Liu, Q., Wang, J., et al.: Animal3d: A comprehensive dataset of 3d animal pose and shape. In: ICCV. pp. 9099–9109 (2023)
  [47] Yang, X., Kukreja, D., Pinkus, D., Sagar, A., Fan, T., Park, J., Shin, S., Cao, J., Liu, J., Ugrinovic, N., Feiszli, M., Malik, J., Dollar, P., Kitani, K.: Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989 (2026)
  [48] Yao, C.H., Hung, W.C., Li, Y., Rubinstein, M., Yang, M.H., Jampani, V.: Lassie: Learning articulated shapes from sparse image ensemble via 3d part discovery. Advances in Neural Information Processing Systems 35, 15296–15308 (2022)
  [49] Yao, C.H., Hung, W.C., Li, Y., Rubinstein, M., Yang, M.H., Jampani, V.: Hi-lassie: High-fidelity articulated shape and skeleton discovery from sparse image ensemble. In: CVPR. pp. 4853–4862 (2023)
  [50] Ye, S., Filippova, A., Lauer, J., Schneider, S., Vidal, M., Qiu, T., Mathis, A., Mathis, M.W.: Superanimal pretrained pose estimation models for behavioral analysis. Nature communications 15(1), 5165 (2024)
  [51] You, Y., Liu, H., Wang, T., Li, W., Ding, R., Li, X.: Co-evolution of pose and mesh for 3d human body estimation from video. In: ICCV. pp. 14963–14973 (2023)
  [52] Zhang, J., Nie, X., Feng, J.: Inference stage optimization for cross-scenario 3D human pose estimation. In: NeurIPS. pp. 2408–2419 (2020)
  [53] Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: Deephuman: 3d human reconstruction from a single image. In: ICCV (October 2019)
  [54] Zioulis, N., O’Brien, J.F.: Kbody: Towards general, robust, and aligned monocular whole-body estimation. In: CVPR. pp. 6215–6225 (2023)
  [55] Zuffi, S., Black, M.J.: Awol: Analysis without synthesis using language. In: ECCV. pp. 1–19. Springer (2024)
  [56] Zuffi, S., Kanazawa, A., Berger-Wolf, T., Black, M.J.: Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In: ICCV. pp. 5359–5368 (2019)
  [57] Zuffi, S., Kanazawa, A., Jacobs, D., Black, M.J.: 3D menagerie: Modeling the 3D shape and pose of animals. In: CVPR (Jul 2017)

  [58] Zuffi, S., Mellbin, Y., Li, C., Hoeschle, M., Kjellström, H., Polikovsky, S., Hernlund, E., Black, M.J.: Varen: Very accurate and realistic equine network. In: CVPR. pp. 5374–5383 (2024)

Supplementary Material

A Dataset Statistics: Taxonomic Distribution

To characterize...

dataset. As shown in Fig. 2, the dataset contains 40 animal species, but the number of samples per species varies significantly. The resulting distribution exhibits a long-tailed pattern, where a small subset of species accounts for the majority of samples, while many species have relatively few instances. Such an imbalance may hinder the learning of robu...

These features are subsequently projected to a 1280-dimensional bio-token space. The ViT encoder extracts visual representations, producing a sequence of feature tokens of size 192×1280. For the keypoint-aware decoder, we employ an Iterative Error Feedback (IEF) loop to progressively refine the SMAL parameters. In our configuration, we perform N = 3 Ite...
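
The Iterative Error Feedback loop mentioned in the excerpt above follows a simple pattern: start from an initial parameter estimate and repeatedly add a predicted correction. The sketch below illustrates only that control flow, assuming N = 3 iterations as in the excerpt. The `toy_head` function is a hypothetical stand-in that peeks at a known target; a real regression head would predict the update from image features and the current parameters, not from ground truth.

```python
def iterative_error_feedback(initial, predict_update, n_iters=3):
    """Refine parameters by repeatedly adding a predicted residual."""
    params = list(initial)
    for _ in range(n_iters):
        delta = predict_update(params)
        params = [p + d for p, d in zip(params, delta)]
    return params

# Hypothetical target parameters, used only to build a toy update rule.
target = [0.8, -0.4, 0.2]

def toy_head(params):
    """Stand-in for a learned head: closes half of the remaining gap."""
    return [0.5 * (t - p) for t, p in zip(target, params)]

refined = iterative_error_feedback([0.0, 0.0, 0.0], toy_head, n_iters=3)
print(refined)
```

With this toy rule each iteration halves the remaining error, so three iterations leave one eighth of the initial gap. In the described architecture the initial shape would come from the bio token through an MLP, not from zeros.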