pith. machine review for the scientific record.

arxiv: 2604.09210 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Adding Another Dimension to Image-based Animal Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords animal detection · 3D bounding boxes · monocular imaging · SMAL models · camera pose refinement · dataset labeling · wildlife computer vision

The pith

Skinned Multi-Animal Linear models estimate 3D bounding boxes from 2D animal images and project them as labels via camera pose refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Monocular images of animals lose depth and orientation information, so standard 2D detectors cannot support full 3D understanding. The authors introduce a pipeline that fits Skinned Multi-Animal Linear models to animals visible in ordinary photos to recover their 3D position, size, and rotation. A dedicated camera pose refinement step then projects these 3D boxes back onto the original image to create reliable 2D training labels without any 3D capture equipment. Cuboid face visibility metrics are computed to record which sides of each animal face the camera. The resulting labels and metrics are evaluated on the Animal3D dataset and shown to work across species and imaging conditions.
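The geometric core of such a pipeline — enclosing a fitted 3D mesh in a cuboid and reprojecting the cuboid's corners through a pinhole camera to obtain a 2D label — can be sketched as follows. This is a minimal editorial illustration under assumed conventions (axis-aligned box, known intrinsics K, pose (R, t)), not the authors' implementation:

```python
import numpy as np

def cuboid_from_vertices(verts):
    """Eight corners of the axis-aligned 3D bounding box of an Nx3 vertex array."""
    lo, hi = verts.min(axis=0), verts.max(axis=0)
    # Every min/max combination per axis gives one corner.
    return np.array([[x, y, z] for x in (lo[0], hi[0])
                               for y in (lo[1], hi[1])
                               for z in (lo[2], hi[2])])

def project_points(points, K, R, t):
    """Project 3D points into pixel coordinates with intrinsics K and pose (R, t)."""
    cam = points @ R.T + t           # world -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Toy usage: random "mesh vertices" two metres in front of the camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
corners = cuboid_from_vertices(np.random.rand(100, 3))
uv = project_points(corners, K, np.eye(3), np.array([0.0, 0.0, 2.0]))
box2d = (uv.min(axis=0), uv.max(axis=0))  # enclosing 2D label
```

In the paper's setting, the vertex array would come from a fitted SMAL mesh and the pose from the refinement step; the sketch only shows how a cuboid, once posed, becomes a 2D label.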

Core claim

The paper presents a pipeline that utilises Skinned Multi-Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. Cuboid face visibility metrics are computed to assess which sides of the animal are captured. These 3D bounding boxes and metrics form a step toward developing and benchmarking future monocular 3D animal detection algorithms, with accurate performance demonstrated on the Animal3D dataset across species and settings.

What carries the argument

Skinned Multi-Animal Linear models fitted to 2D images together with a camera pose refinement algorithm that projects the recovered 3D boxes into accurate 2D labels.

If this is right

  • Existing 2D animal image collections can be retroactively labeled with 3D bounding boxes.
  • Monocular 3D animal detection algorithms can be trained and benchmarked without requiring 3D sensors at data collection time.
  • Visibility metrics supply explicit orientation cues that 2D detectors normally lack.
  • The same pipeline applies to multiple species and varied imaging conditions as shown on Animal3D.
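The visibility cue in the third point admits a simple reading: score each of the six cuboid faces by how directly its outward normal points at the camera. The paper does not spell out its formula, so the scoring rule below is an assumption, not the authors' metric:

```python
import numpy as np

# Outward unit normals of the six faces of an axis-aligned cuboid.
FACE_NORMALS = {
    "+x": np.array([1., 0., 0.]), "-x": np.array([-1., 0., 0.]),
    "+y": np.array([0., 1., 0.]), "-y": np.array([0., -1., 0.]),
    "+z": np.array([0., 0., 1.]), "-z": np.array([0., 0., -1.]),
}

def face_visibility(box_center, camera_center):
    """Cosine between each face normal and the box-to-camera direction;
    positive means the face points toward the camera."""
    view = camera_center - box_center
    view = view / np.linalg.norm(view)
    return {name: float(n @ view) for name, n in FACE_NORMALS.items()}

# A box five metres in front of a camera at the origin: its -z face faces us.
scores = face_visibility(np.array([0., 0., 5.]), np.zeros(3))
```

Thresholding these cosines at zero would give the binary "which sides are captured" record the review describes.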

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The generated labels could be extended to video sequences to support 3D tracking of moving animals.
  • Consumer-grade cameras might now suffice for field researchers to gather 3D animal data at scale.
  • The approach could be combined with detailed pose estimation to recover not only boxes but full 3D body shapes.
  • Similar model-based projection techniques might transfer to other monocular 3D tasks such as vehicle or object detection.

Load-bearing premise

Skinned Multi-Animal Linear models can be fitted accurately to the animals appearing in the target 2D images and the camera pose refinement step can succeed without any 3D ground-truth input.

What would settle it

Running the full pipeline on a collection of animal images that also have independent 3D ground-truth measurements and checking whether the estimated boxes match the ground truth in position, size, and orientation within acceptable error bounds.
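Such a check could report simple per-box errors — centre distance, per-axis size error, and yaw difference — between estimated and ground-truth boxes. This is a hedged sketch of one plausible metric set; the paper does not commit to these exact measures:

```python
import numpy as np

def box_errors(est, gt):
    """Compare two 3D boxes given as dicts with 'center' (3,), 'size' (3,),
    and 'yaw' (radians). Returns position, per-axis size, and yaw errors."""
    pos_err = float(np.linalg.norm(est["center"] - gt["center"]))
    size_err = np.abs(est["size"] - gt["size"])
    # Wrap the yaw difference into [-pi, pi] before taking its magnitude.
    dyaw = (est["yaw"] - gt["yaw"] + np.pi) % (2 * np.pi) - np.pi
    return pos_err, size_err, abs(float(dyaw))

est = {"center": np.array([1.0, 0.0, 5.0]), "size": np.array([2.0, 1.0, 1.0]), "yaw": 0.1}
gt  = {"center": np.array([1.0, 0.0, 5.5]), "size": np.array([2.2, 1.0, 1.0]), "yaw": -0.1}
pos, size, yaw = box_errors(est, gt)  # position 0.5 m, size [0.2, 0, 0] m, yaw 0.2 rad
```

Aggregating these over a 3D-ground-truthed image set, per species and viewpoint, would settle the question posed above.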

Figures

Figures reproduced from arXiv: 2604.09210 by Benjamin Risse, Fabio Remondino, Vandita Shukla.

Figure 1. Illustrative description of lifting 2D detection to oriented …
Figure 2. Zero-shot predictions from OVMono3D [26] on ImageNet samples. Newer purely RGB-based approaches tend to support only rigid objects of known shapes or almost always require manual perspective snapping, known camera intrinsics, or video sequences, impractical for in-the-wild animal images with arbitrary viewpoints [20, 23]. Introducing new object classes also suffers from a cold-start problem: without high…
Figure 3. Pipeline overview. Regarding inputs, it is important to …
Figure 4. Reprojected bounding box derived with the Basic Method (top row), PCA (middle row), and Our Method (bottom row). The proposed …
Figure 6. For each view on the zebra, we estimate the 3D bounding …
read the original abstract

Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal's orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a pipeline to generate 3D bounding-box labels for monocular RGB animal images. It fits Skinned Multi-Animal Linear (SMAL) models to estimate 3D shape and pose, applies a dedicated camera-pose refinement step to project the resulting 3D cuboids into 2D image space, and computes cuboid-face visibility metrics. The method is evaluated on the Animal3D dataset, with the abstract stating that it demonstrates accurate performance across species and settings.

Significance. If the SMAL fitting and camera-pose refinement steps can be shown to produce accurate 3D boxes without 3D ground truth, the pipeline would provide a practical route to large-scale 3D-labeled animal datasets from existing 2D imagery. This directly addresses the data scarcity noted in the introduction and could support downstream monocular 3D detection research. The reliance on an established SMAL model family is a strength for reproducibility.

major comments (2)
  1. [Abstract and evaluation section] Abstract and evaluation section: the claim of 'demonstrating accurate performance' on Animal3D is unsupported by any quantitative metrics (e.g., SMAL fitting error, 3D-to-2D projection accuracy, or comparison against held-out 3D annotations). Without these numbers, error analysis, or ablations of the refinement step, the central assertion that the projected labels are robust cannot be verified.
  2. [Method section on camera-pose refinement] Method section on camera-pose refinement: the algorithm is presented as converging without 3D ground-truth input, yet no optimization objective, convergence criteria, or sensitivity analysis to species/viewpoint variation is supplied. This is load-bearing for the claim that SMAL models can be fitted accurately to the target images.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including one or two key numerical results (e.g., mean projection error) to substantiate the performance claim.
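A projection-accuracy number of the kind the referee requests could be as simple as the 2D IoU between a reprojected box and a ground-truth box. A minimal sketch (an editorial illustration, not the paper's evaluation code):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

iou_2d((0, 0, 2, 2), (1, 1, 3, 3))  # overlap 1 over union 7, about 0.143
```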

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas where the manuscript's claims and methodological details require strengthening. We address each point below and have revised the manuscript to include the requested quantitative support and algorithmic specifications.

read point-by-point responses
  1. Referee: [Abstract and evaluation section] Abstract and evaluation section: the claim of 'demonstrating accurate performance' on Animal3D is unsupported by any quantitative metrics (e.g., SMAL fitting error, 3D-to-2D projection accuracy, or comparison against held-out 3D annotations). Without these numbers, error analysis, or ablations of the refinement step, the central assertion that the projected labels are robust cannot be verified.

    Authors: We agree that the abstract's claim of 'accurate performance' is not supported by quantitative evidence in the submitted version, which relied on qualitative examples. We have revised the abstract to remove this phrasing and added a dedicated quantitative evaluation subsection. This includes SMAL fitting errors (vertex-to-vertex distances), 3D-to-2D projection accuracy (reprojection error and 2D IoU), direct comparison against held-out 3D annotations from Animal3D, and an ablation isolating the contribution of the camera-pose refinement step across species and viewpoints. revision: yes

  2. Referee: [Method section on camera-pose refinement] Method section on camera-pose refinement: the algorithm is presented as converging without 3D ground-truth input, yet no optimization objective, convergence criteria, or sensitivity analysis to species/viewpoint variation is supplied. This is load-bearing for the claim that SMAL models can be fitted accurately to the target images.

    Authors: We acknowledge that the original method section lacked sufficient detail on the refinement procedure. The revised manuscript now specifies the optimization objective (a weighted sum of landmark reprojection loss and pose regularization terms), the convergence criteria (loss change below 1e-4 or maximum 200 iterations), and a sensitivity analysis table showing fitting stability across species (e.g., dogs, horses, cows) and viewpoint angles without any 3D ground-truth supervision. revision: yes
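The objective and stopping rule described in this response can be made concrete in a few lines. The sketch below is an editorial illustration rather than the authors' code: it optimises only the translation under an identity rotation, uses a reprojection loss plus a weighted quadratic pose regulariser, and stops when the loss change falls below 1e-4 or after 200 iterations:

```python
import numpy as np

def loss_fn(t, pts3d, pts2d, K, t_prior, lam=1e-3):
    """Landmark reprojection loss plus a quadratic pose regulariser on t.
    Rotation is fixed to identity to keep the sketch short."""
    cam = pts3d + t
    uvw = cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return np.mean((uv - pts2d) ** 2) + lam * np.sum((t - t_prior) ** 2)

def refine_translation(pts3d, pts2d, K, t0, lr=1e-5, max_iter=200, tol=1e-4):
    """First-order descent with the stated stopping rule:
    loss change below tol (1e-4) or max_iter (200) iterations."""
    t = np.asarray(t0, dtype=float).copy()
    prev = loss_fn(t, pts3d, pts2d, K, t0)
    for _ in range(max_iter):
        grad = np.zeros(3)
        for i in range(3):  # central finite differences, chosen for clarity
            e = np.zeros(3)
            e[i] = 1e-6
            grad[i] = (loss_fn(t + e, pts3d, pts2d, K, t0)
                       - loss_fn(t - e, pts3d, pts2d, K, t0)) / 2e-6
        t -= lr * grad
        cur = loss_fn(t, pts3d, pts2d, K, t0)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return t

# Synthetic check: recover a small translation offset from five landmarks,
# with no 3D ground truth beyond the 2D landmark observations themselves.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts3d = np.array([[1., 0, 0], [-1., 0, 0], [0, 1., 0], [0, -1., 0], [0.5, 0.5, 0.3]])
t_true = np.array([0.1, -0.05, 5.0])
uvw = (pts3d + t_true) @ K.T
pts2d = uvw[:, :2] / uvw[:, 2:3]
t_hat = refine_translation(pts3d, pts2d, K, t0=np.array([0.0, 0.0, 5.0]))
```

A full refinement would also optimise rotation (e.g. via an axis-angle parameterisation) and use analytic gradients, but the loss structure and stopping criteria mirror what the rebuttal specifies.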

Circularity Check

0 steps flagged

No circularity: pipeline uses pre-existing SMAL models and external dataset

full rationale

The paper presents a pipeline that applies existing Skinned Multi-Animal Linear (SMAL) models to estimate 3D bounding boxes from monocular RGB images, projects them to 2D labels via a camera-pose refinement step, and evaluates on the external Animal3D dataset. No equations, derivations, or load-bearing steps in the abstract or described method reduce by construction to parameters fitted within the paper itself, nor do they rely on self-citations whose validity depends on the current work. The central claims rest on independent prior models and data, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the suitability of pre-existing SMAL models for the imaged animals and on the effectiveness of the pose-refinement algorithm; neither is derived or validated from first principles in this work.

axioms (1)
  • domain assumption: Skinned Multi-Animal Linear models provide sufficiently accurate 3D shape priors for the animals in the target images.
    The pipeline directly uses SMAL models to estimate 3D boxes; accuracy depends on this prior matching real animals.

pith-pipeline@v0.9.0 · 5446 in / 1173 out tokens · 55197 ms · 2026-05-10T16:40:00.018240+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references

  1. Ted Cheeseman, Ken Southerland, Jinmo Park, Marilia Olio, Kiirsten Flynn, John Calambokidis, Lindsey Jones, Claire Garrigue, Astrid Frisch Jordán, Addison Howard, Walter Reade, Janet Neilson, Christine Gabriele, and Phil Clapham. Advanced image recognition: a fully automated, high-accuracy photo-identification matching system for humpback whales. Mammalian Biology, 102(3):915–929, 2022.

  2. Melanie Clapham, Ed Miller, Mary Nguyen, and Russell C. Van Horn. Multispecies facial detection for individual identification of wildlife: a case study across ursids. Mammalian Biology, 102(3):943–955, 2022.

  3. Marcelo Contreras, Aayush Jain, Neel P. Bhatt, Arunava Banerjee, and Ehsan Hashemi. A survey on 3D object detection in real time for autonomous driving. Frontiers in Robotics and AI, 11, 2024.

  4. Arjun Dheer, Dinal Samarasinghe, Stephanie M. Dloniak, and Alexander Braczkowski. Using camera traps to study hyenas: challenges, opportunities, and outlook. Mammalian Biology, 102(3):847–854, 2022.

  5. Zhaoxin Fan, Yazhi Zhu, Yulin He, Qi Sun, Hongyan Liu, and Jun He. Deep learning on monocular object pose detection and tracking: A comprehensive overview. ACM Computing Surveys, 55(4):81:1–81:40, 2022.

  6. Mahmoud M. S. Farrag. Biometrics of aquatic animals. In Recent Advances in Biometrics. IntechOpen, 2022.

  7. Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

  8. Jorge A. Gallo, Agustín M. Abba, and Mariella Superina. Individual identification of armadillos (Mammalia, Cingulata) using a photo-identification software. Mammalian Biology, 102(3):855–861, 2022.

  9. Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, and Gao Huang. Training an open-vocabulary monocular 3D detection model without 3D data. Advances in Neural Information Processing Systems, 37:72145–72169.

  10. Jenna Kline, Alison Zhong, Kevyn Irizarry, Charles V. Stewart, Christopher Stewart, Daniel I. Rubenstein, and Tanya Berger-Wolf. WildWing: An open-source, autonomous and affordable UAS for animal behaviour video monitoring. Methods in Ecology and Evolution, 2025.

  11. Santosh Kumar and Sanjay Kumar Singh. Visual animal biometrics: survey. IET Biometrics, 6(3):139–156, 2017.

  12. Santosh Kumar and Sanjay Kumar Singh. Cattle recognition: A new frontier in visual animal biometrics research. Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, 90(4):689–708, 2020.

  13. Hjalmar S. Kühl and Tilo Burghardt. Animal biometrics: quantifying and detecting phenotypic appearance. Trends in Ecology & Evolution, 28(7):432–441, 2013.

  14. Izzy Langley, Emily Hague, and Mònica Arso Civil. Assessing the performance of open-source, semi-automated pattern recognition software for harbour seal (P. v. vitulina) photo ID. Mammalian Biology, 102(3):973–982, 2022.

  15. Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, 2009.

  16. Kilian Meier, Arthur Richards, Matthew Watson, Guy Maalouf, Caspian Johnson, Duncan Hine, and Tom Richardson. WildBridge: Conservation software for animal localisation using commercial drones. 15th Annual International Micro Air Vehicle Conference and Competition, pages 324–333, 2024.

  17. Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool. Towards a weakly supervised framework for 3D point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4454–4468, 2022.

  18. Issa Mouawad, Nikolas Brasch, Fabian Manhardt, Federico Tombari, and Francesca Odone. View-to-label: Multi-view consistency for self-supervised monocular 3D object detection. Computer Vision and Image Understanding, 254:104320, 2025.

  19. Tinao Petso, Rodrigo Jamisola, and Dimane Mpoeleng. Review on methods used for wildlife species and individual identification. European Journal of Wildlife Research, 68.

  20. Shenming Qu, Xinyu Yang, Yiming Gao, and Shengbin Liang. MonoDCN: Monocular 3D object detection based on dynamic convolution. PLOS ONE, 17(10):e0275438, 2022.

  21. Prashanth C. Ravoor and Sudarshan T.S.B. Deep learning methods for multi-species animal re-identification and tracking – a survey. Computer Science Review, 38:100289, 2020.

  22. V. Shukla, L. Morelli, F. Remondino, A. Micheli, D. Tuia, and B. Risse. Towards estimation of 3D poses and shapes of animals from oblique drone imagery. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLVIII-2-2024:379–386, 2024.

  23. Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Manuel López Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection: From single to multi-class recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1219–1231, 2022.

  24. Dennis Stumpf, Stephan Krauß, Gerd Reis, Oliver Wasenmüller, and Didier Stricker. SALT: A semi-automatic labeling tool for RGB-D video sequences. In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pages 595–603. SCITEPRESS, 2021.

  25. Jiacong Xu, Yi Zhang, Jiawei Peng, Wufei Ma, Artur Jesslen, Pengliang Ji, Qixin Hu, Jiehua Zhang, Qihao Liu, Jiahao Wang, Wei Ji, Chen Wang, Xiaoding Yuan, Prakhar Kaushik, Guofeng Zhang, Jie Liu, Yushan Xie, Yawen Cui, Alan Yuille, and Adam Kortylewski. Animal3D: A comprehensive dataset of 3D animal pose and shape. In 2023 IEEE/CVF International Conference…

  26. Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, and Zezhou Cheng. Open vocabulary monocular 3D object detection.

  27. Silvia Zuffi, Angjoo Kanazawa, David W. Jacobs, and Michael J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5524–5532. IEEE, 2017.