Multi-Object Tracking Consistently Improves Wildlife Inference
Pith reviewed 2026-05-20 18:06 UTC · model grok-4.3
The pith
Multi-object tracking on camera-trap sequences improves wildlife classification by fusing frame predictions into consensus labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adopting multi-object tracking models to associate detections into trajectories and fusing the softmax probabilities along those trajectories, the method produces a consensus class label per individual that corrects misclassifications from environmental noise, delivering higher accuracy than a standalone classifier on every dataset and metric tested.
What carries the argument
Multi-object tracking models that link detections across frames into trajectories, enabling fusion of class probabilities for a consensus label.
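The fusion step can be illustrated with a minimal sketch. It assumes mean fusion of softmax vectors per track; the paper's exact fusion rule is not given in the text above, and the function name `fuse_tracks` and the input layout are hypothetical.

```python
from collections import defaultdict

def fuse_tracks(detections):
    """Mean-fuse softmax vectors per track (assumed rule, not the paper's).

    detections: iterable of (track_id, probs), probs being a list of softmax scores.
    Returns {track_id: (consensus_class_index, fused_probs)}.
    """
    sums = {}
    counts = defaultdict(int)
    for track_id, probs in detections:
        if track_id not in sums:
            sums[track_id] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[track_id][k] += p
        counts[track_id] += 1
    fused = {}
    for track_id, total in sums.items():
        mean = [s / counts[track_id] for s in total]
        # Consensus label is the argmax of the fused probability vector.
        fused[track_id] = (max(range(len(mean)), key=mean.__getitem__), mean)
    return fused

# One track over three frames; the middle frame is a noisy misclassification.
detections = [
    (7, [0.80, 0.15, 0.05]),
    (7, [0.30, 0.60, 0.10]),
    (7, [0.70, 0.20, 0.10]),
]
per_frame = [max(range(3), key=p.__getitem__) for _, p in detections]
consensus, fused_probs = fuse_tracks(detections)[7]
```

In this toy input the per-frame labels flip on the second frame, while the fused vector keeps the first class as the consensus for the whole track.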
If this is right
- Species identification from camera traps becomes more reliable without requiring new classifier training.
- Biodiversity monitoring datasets accumulate fewer labeling errors from transient poor frames.
- Existing MOT algorithms can be inserted into current wildlife analysis pipelines for immediate gains.
- Performance improvements appear across varied real-world conditions and multiple datasets.
Where Pith is reading between the lines
- The same trajectory-based fusion idea could apply to drone or underwater video surveys where animals move through sequences.
- Pairing this method with individual re-identification might support longitudinal population studies from the same footage.
- Temporal consistency signals appear underused in other ecological computer-vision tasks and could be tested on additional video sources.
Load-bearing premise
The chosen multi-object tracking models will correctly associate detections of the same individual animal across frames despite occlusions, similar-looking species, and variable camera-trap conditions.
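To see why this premise is fragile, consider a deliberately simple stand-in for a tracker: greedy IoU matching between consecutive frames. This is not one of the MOT models used in the paper; the function names and the 0.3 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(frames, threshold=0.3):
    """Greedy frame-to-frame association (illustrative stand-in only).

    frames: list of frames, each a list of boxes.
    Returns a list of per-frame track-id lists.
    """
    next_id = 0
    prev = []  # (track_id, box) pairs from the previous frame only
    out = []
    for boxes in frames:
        pairs = sorted(
            ((iou(pb, b), pi, bi)
             for pi, (_, pb) in enumerate(prev)
             for bi, b in enumerate(boxes)),
            reverse=True,
        )
        ids = [None] * len(boxes)
        used_prev = set()
        for score, pi, bi in pairs:
            if score < threshold:
                break  # remaining pairs overlap too little to link
            if pi in used_prev or ids[bi] is not None:
                continue
            ids[bi] = prev[pi][0]
            used_prev.add(pi)
        for bi in range(len(boxes)):
            if ids[bi] is None:  # unmatched detection starts a new track
                ids[bi] = next_id
                next_id += 1
        prev = list(zip(ids, boxes))
        out.append(ids)
    return out

smooth = associate([[(0, 0, 10, 10), (50, 50, 60, 60)],
                    [(1, 1, 11, 11), (51, 50, 61, 60)],
                    [(2, 2, 12, 12)]])
occluded = associate([[(0, 0, 10, 10)], [], [(0, 0, 10, 10)]])
```

In the second call a single missed frame is enough for this naive matcher to assign a new identity, which splits the trajectory and shortens the span over which probabilities can be fused. Trackers with motion models or appearance features are designed to reduce such breaks, but the premise is that they do so reliably on camera-trap footage.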
What would settle it
A camera-trap sequence dataset with frequent occlusions and visually similar animals where the MOT models produce incorrect track associations, yielding no gain or a drop in classification F1-score compared with the classifier alone.
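Such a test comes down to comparing the weighted F1 of per-frame labels against that of track-consensus labels. A minimal sketch, assuming weighted F1 means the support-weighted mean of per-class F1; the helper name `weighted_f1` and the toy labels are hypothetical, not results from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over the classes present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy labels: four frames of one deer track, two frames of one fox track.
y_true = ["deer"] * 4 + ["fox"] * 2
per_frame = ["deer", "fox", "deer", "deer", "fox", "deer"]
fused = ["deer"] * 4 + ["fox"] * 2  # every frame takes its track's consensus label

baseline_f1 = weighted_f1(y_true, per_frame)
fused_f1 = weighted_f1(y_true, fused)
```

If the tracker merged the deer and fox detections into one track, the fused labels would be wrong for every fox frame, and `fused_f1` could fall below `baseline_f1`. That is the outcome the proposed test looks for.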
Original abstract
Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier over all datasets and for each metric. Specifically, the best-performing MOT models gain a weighted F1-Score of 5.1%, 3.1% and 2.0% over the classifier across three MOT datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes augmenting a wildlife image classifier with multi-object tracking (MOT) on camera-trap sequences: detections are linked into trajectories by off-the-shelf MOT models, softmax probabilities are fused along each trajectory, and the resulting consensus label replaces per-frame predictions. On three datasets the best MOT variants yield weighted F1 gains of 5.1 %, 3.1 % and 2.0 % over the standalone classifier baseline.
Significance. If the observed gains are shown to arise from MOT-specific identity-consistent associations rather than generic temporal smoothing, the method supplies a lightweight, training-free post-processing step that can improve real-world camera-trap inference. The empirical improvements are modest yet consistent across datasets; the approach is therefore of practical interest to ecological monitoring provided the contribution of the tracking component is isolated.
major comments (2)
- [Experimental results / evaluation section] No ablation is reported that replaces MOT trajectories with a non-associative temporal smoother (e.g., fixed-window averaging of detections at the same spatial location). Without this control, the 2.0–5.1 % weighted F1 gains could come from any form of per-location averaging rather than from the cross-frame identity associations that MOT is claimed to provide; this bears directly on the central claim that MOT “consistently improves” inference.
- [Experimental results / evaluation section] The manuscript does not report standard MOT quality metrics (ID switches, fragmentation, trajectory purity, or MOTA/MOTP) on the camera-trap sequences. In the presence of stationary animals, partial occlusions and similar-looking species, these numbers are needed to verify that the trajectories actually group frames of the same individual before the fusion step is credited with the observed gains.
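The control requested in the first major comment could look like the sketch below. It assumes detections are bucketed into a coarse pixel grid and averaged over a fixed frame window, with no identities tracked; the cell size, window length and function name are hypothetical choices, not taken from the paper.

```python
from collections import defaultdict

def location_window_smooth(detections, cell=100, window=2):
    """Non-associative smoothing baseline (illustrative assumption).

    detections: list of (frame, cx, cy, probs), with (cx, cy) the box centre.
    Averages softmax vectors of detections that fall in the same grid cell
    within +/- `window` frames. Returns one class index per detection.
    """
    by_cell = defaultdict(list)
    for i, (frame, cx, cy, _) in enumerate(detections):
        by_cell[(int(cx // cell), int(cy // cell))].append(i)
    labels = [None] * len(detections)
    for members in by_cell.values():
        for i in members:
            frame_i = detections[i][0]
            near = [detections[j][3] for j in members
                    if abs(detections[j][0] - frame_i) <= window]
            mean = [sum(col) / len(near) for col in zip(*near)]
            labels[i] = max(range(len(mean)), key=mean.__getitem__)
    return labels

dets = [
    (0, 50, 50, [0.8, 0.2]),
    (1, 55, 52, [0.4, 0.6]),   # noisy frame at the same location
    (2, 60, 54, [0.7, 0.3]),
    (1, 400, 300, [0.1, 0.9]),  # a different animal elsewhere in the image
]
smoothed = location_window_smooth(dets)
```

If a baseline of this kind matched the MOT-based gains, the improvement would be attributable to temporal averaging in general. If it fell short, the gap would be evidence for the value of identity-consistent association.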
minor comments (3)
- [Method] The precise fusion rule (mean, max, or weighted sum of softmax vectors along a trajectory) is stated only at a high level; an equation or short pseudocode would remove ambiguity.
- [Datasets / Experiments] Dataset characteristics (number of sequences, average trajectory length, occlusion frequency) are not tabulated; these details would help readers assess how representative the reported gains are.
- [Abstract and results tables] A few minor typographical inconsistencies appear in the abstract and results tables (e.g., inconsistent use of “weighted F1-Score” vs. “weighted F1”); these do not affect readability but should be harmonized.
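For reference, the candidate fusion rules named in the first minor comment can be written as follows. The notation is an assumption introduced here: a trajectory $T$ has frames $t = 1, \dots, |T|$, $p_t \in \mathbb{R}^C$ is the softmax vector at frame $t$, and $w_t \ge 0$ is an optional per-frame weight such as detection confidence. Which rule the paper uses is not stated in the text above.

```latex
% Mean fusion
\bar{p}_c = \frac{1}{|T|} \sum_{t=1}^{|T|} p_{t,c}

% Weighted-sum fusion
\bar{p}_c = \frac{\sum_{t=1}^{|T|} w_t \, p_{t,c}}{\sum_{t=1}^{|T|} w_t}

% Max fusion
\bar{p}_c = \max_{1 \le t \le |T|} p_{t,c}

% Consensus label, identical for all three rules
\hat{y}_T = \arg\max_{c \in \{1, \dots, C\}} \bar{p}_c
```

Mean and weighted-sum fusion return a valid probability vector; max fusion generally does not sum to one, although the argmax remains well defined.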
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the contribution of the MOT component. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Experimental results / evaluation section] No ablation is reported that replaces MOT trajectories with a non-associative temporal smoother (e.g., fixed-window averaging of detections at the same spatial location). Without this control, the 2.0–5.1 % weighted F1 gains could come from any form of per-location averaging rather than from the cross-frame identity associations that MOT is claimed to provide; this bears directly on the central claim that MOT “consistently improves” inference.
  Authors: We agree that an explicit ablation against non-associative temporal smoothing would more cleanly isolate the benefit of identity-consistent associations. The current experiments already vary the MOT model (and thus the quality of associations) while keeping the fusion step fixed, and the largest gains occur with the strongest trackers; this pattern is consistent with the value of proper linking rather than generic averaging. Nevertheless, to address the concern directly, we will add a controlled baseline that performs fixed-window averaging of softmax scores for detections at the same spatial location without any cross-frame association. The revised manuscript will report this comparison on all three datasets.
  Revision: yes
- Referee: [Experimental results / evaluation section] The manuscript does not report standard MOT quality metrics (ID switches, fragmentation, trajectory purity, or MOTA/MOTP) on the camera-trap sequences. In the presence of stationary animals, partial occlusions and similar-looking species, these numbers are needed to verify that the trajectories actually group frames of the same individual before the fusion step is credited with the observed gains.
  Authors: We acknowledge the value of such metrics for validating trajectory quality. However, the datasets are annotated only for species classification; no ground-truth identities or trajectories are available, precluding computation of MOTA, ID switches, or similar measures. We will add a qualitative discussion of the generated trajectories, including examples of how the chosen MOT models handle stationary animals and brief occlusions, together with any internal consistency statistics (e.g., average trajectory length) that can be obtained without external ground truth.
  Revision: partial
- Quantitative MOT metrics (MOTA, ID switches, etc.) cannot be reported because the classification-only datasets lack ground-truth tracking annotations.
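The internal consistency statistics mentioned in the rebuttal need no tracking ground truth. A minimal sketch of two such statistics follows: mean trajectory length and within-track label agreement. The function name and the choice of statistics are assumptions, not the authors' stated protocol.

```python
from collections import Counter, defaultdict

def track_stats(detections):
    """Ground-truth-free trajectory statistics (illustrative assumption).

    detections: iterable of (track_id, per_frame_predicted_label).
    """
    tracks = defaultdict(list)
    for track_id, label in detections:
        tracks[track_id].append(label)
    lengths = [len(labels) for labels in tracks.values()]
    # Fraction of frames in each track that agree with the track's majority label.
    agreement = [Counter(labels).most_common(1)[0][1] / len(labels)
                 for labels in tracks.values()]
    return {
        "num_tracks": len(tracks),
        "mean_length": sum(lengths) / len(lengths),
        "mean_label_agreement": sum(agreement) / len(agreement),
    }

dets = [(1, "deer"), (1, "deer"), (1, "fox"), (1, "deer"),
        (2, "fox"), (2, "fox")]
stats = track_stats(dets)
```

These numbers describe the trajectories but do not validate them: a track that wrongly merges two animals of the same species would still show high label agreement, so the referee's request for identity-level metrics is not met by statistics of this kind.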
Circularity Check
No circularity: purely empirical comparison using off-the-shelf components
The paper describes an experimental pipeline that applies existing MOT algorithms to link detections and then fuses softmax outputs along the resulting trajectories. All reported gains (weighted F1 improvements of 2.0–5.1 %) are measured directly against a per-frame classifier baseline on three datasets. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text; the central claim rests on external MOT implementations and straightforward empirical evaluation rather than any reduction to the paper’s own inputs.