Multi-Object Tracking Consistently Improves Wildlife Inference

Fredrik Gustafsson; Jiahao Huo; Mufhumudzi Muthivhi; Terence L. van Zyl

arxiv: 2605.16672 · v1 · pith:LRTWELQAnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI· cs.LG

Multi-Object Tracking Consistently Improves Wildlife Inference

Mufhumudzi Muthivhi , Jiahao Huo , Fredrik Gustafsson , Terence L. van Zyl This is my paper

Pith reviewed 2026-05-20 18:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords multi-object trackingwildlife classificationcamera trapsspecies identificationtemporal fusionbiodiversity monitoring

0 comments

The pith

Multi-object tracking on camera-trap sequences improves wildlife classification by fusing frame predictions into consensus labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard multi-object tracking models can link detections of the same animal across consecutive frames in camera-trap footage. These trajectories allow the fusion of softmax class probabilities from a wildlife classifier, yielding a single stable label that overrides frame-by-frame errors caused by noise or poor image quality. Experiments across three datasets demonstrate consistent gains, with the best MOT models raising weighted F1-scores by 5.1 percent, 3.1 percent, and 2.0 percent over the classifier used alone. The approach exploits the temporal coherence inherent in the data to make inference more robust without retraining the underlying model.

Core claim

By adopting multi-object tracking models to associate detections into trajectories and fusing the softmax probabilities along those trajectories, the method produces a consensus class label per individual that corrects misclassifications from environmental noise, delivering higher accuracy than a standalone classifier on every dataset and metric tested.

What carries the argument

Multi-object tracking models that link detections across frames into trajectories, enabling fusion of class probabilities for a consensus label.

If this is right

Species identification from camera traps becomes more reliable without requiring new classifier training.
Biodiversity monitoring datasets accumulate fewer labeling errors from transient poor frames.
Existing MOT algorithms can be inserted into current wildlife analysis pipelines for immediate gains.
Performance improvements appear across varied real-world conditions and multiple datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-based fusion idea could apply to drone or underwater video surveys where animals move through sequences.
Pairing this method with individual re-identification might support longitudinal population studies from the same footage.
Temporal consistency signals appear underused in other ecological computer-vision tasks and could be tested on additional video sources.

Load-bearing premise

The chosen multi-object tracking models will correctly associate detections of the same individual animal across frames despite occlusions, similar-looking species, and variable camera-trap conditions.

What would settle it

A camera-trap sequence dataset with frequent occlusions and visually similar animals where the MOT models produce incorrect track associations, yielding no gain or a drop in classification F1-score compared with the classifier alone.

Figures

Figures reproduced from arXiv: 2605.16672 by Fredrik Gustafsson, Jiahao Huo, Mufhumudzi Muthivhi, Terence L. van Zyl.

**Figure 2.** Figure 2: The proposed framework processes sequential camera trap frames through a detector and a standalone classifier. The MOT module links these detections [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The figure presents F1-score improvements across the AnimalTrack, MammAlps and SA-FARI datasets. The dotted line presents the classifier baseline [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: The radar charts presents per animal class accuracy@1 performance on the classifier baseline and the improvement gains from the inference augmentation [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Consecutive-frame examples comparing the standalone classifier (top) to centroid-based MOT (bottom) of each example. Centroid association reduces [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier over all datasets and for each metric. Specifically, the best-performing MOT models gain a weighted F1-Score of 5.1%, 3.1% and 2.0% over the classifier across three MOT datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MOT fusion on camera-trap sequences gives small consistent F1 lifts over per-frame classification, but the gains may not require the tracking associations themselves.

read the letter

This paper applies standard multi-object tracking to link detections of the same animal across frames in camera-trap footage, then fuses the softmax probabilities along those trajectories to produce a single label per individual. The reported result is that the best trackers improve weighted F1 by 5.1%, 3.1%, and 2.0% over a standalone classifier on three datasets. That is the core finding and it is presented clearly in the abstract and results sections. The work is new mainly as a domain application rather than a new algorithm; MOT itself is off-the-shelf and the fusion step is a straightforward average of probabilities. What the paper does well is show that this post-processing step produces measurable, repeatable gains across multiple datasets and metrics without retraining the base classifier. For ecological monitoring that is a practical win, since label flips on the same animal are a known nuisance in real deployments. The soft spot is the lack of controls that would isolate the contribution of the MOT associations. The concern that a simple per-location temporal average could deliver comparable smoothing is reasonable here, especially when many camera-trap animals are stationary or slow-moving. The paper does not report tracking quality numbers such as ID switches or fragmentation, nor does it compare against a non-tracking smoother. Without those checks it remains possible that the observed lift comes from any form of temporal aggregation rather than from correct cross-frame linking. This is a minor rather than fatal gap, but it limits how strongly one can claim the MOT component is essential. The paper is aimed at researchers who build or use camera-trap pipelines for biodiversity monitoring. A reader working on applied wildlife CV would get concrete numbers and implementation ideas to try. It deserves a serious referee because the claim is empirical, the datasets are real, and the improvement is quantified even if the mechanism needs tighter validation. I would send it for peer review with the expectation that reviewers will ask for the missing ablations.

Referee Report

2 major / 3 minor

Summary. The paper proposes augmenting a wildlife image classifier with multi-object tracking (MOT) on camera-trap sequences: detections are linked into trajectories by off-the-shelf MOT models, softmax probabilities are fused along each trajectory, and the resulting consensus label replaces per-frame predictions. On three datasets the best MOT variants yield weighted F1 gains of 5.1 %, 3.1 % and 2.0 % over the standalone classifier baseline.

Significance. If the observed gains are shown to arise from MOT-specific identity-consistent associations rather than generic temporal smoothing, the method supplies a lightweight, training-free post-processing step that can improve real-world camera-trap inference. The empirical improvements are modest yet consistent across datasets; the approach is therefore of practical interest to ecological monitoring provided the contribution of the tracking component is isolated.

major comments (2)

[Experimental results / evaluation section] Experimental results (as summarized in the abstract and detailed in the full evaluation): no ablation is reported that replaces MOT trajectories with a non-associative temporal smoother (e.g., fixed-window averaging of detections at the same spatial location). Without this control it remains possible that the 2–5.1 % F1 lifts are produced by any form of per-location averaging rather than by the cross-frame identity associations that MOT is claimed to provide; this directly affects the central claim that MOT “consistently improves” inference.
[Experimental results / evaluation section] The manuscript does not report standard MOT quality metrics (ID switches, fragmentation, trajectory purity, or MOTA/MOTP) on the camera-trap sequences. In the presence of stationary animals, partial occlusions and similar-looking species, these numbers are needed to verify that the trajectories actually group frames of the same individual before the fusion step is credited with the observed gains.

minor comments (3)

[Method] The precise fusion rule (mean, max, or weighted sum of softmax vectors along a trajectory) is stated only at a high level; an equation or short pseudocode would remove ambiguity.
[Datasets / Experiments] Dataset characteristics (number of sequences, average trajectory length, occlusion frequency) are not tabulated; these details would help readers assess how representative the reported gains are.
[Abstract and results tables] A few minor typographical inconsistencies appear in the abstract and results tables (e.g., inconsistent use of “weighted F1-Score” vs. “weighted F1”); these do not affect readability but should be harmonized.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the contribution of the MOT component. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Experimental results / evaluation section] Experimental results (as summarized in the abstract and detailed in the full evaluation): no ablation is reported that replaces MOT trajectories with a non-associative temporal smoother (e.g., fixed-window averaging of detections at the same spatial location). Without this control it remains possible that the 2–5.1 % F1 lifts are produced by any form of per-location averaging rather than by the cross-frame identity associations that MOT is claimed to provide; this directly affects the central claim that MOT “consistently improves” inference.

Authors: We agree that an explicit ablation against non-associative temporal smoothing would more cleanly isolate the benefit of identity-consistent associations. The current experiments already vary the MOT model (and thus the quality of associations) while keeping the fusion step fixed, and the largest gains occur with the strongest trackers; this pattern is consistent with the value of proper linking rather than generic averaging. Nevertheless, to directly address the concern we will add a controlled baseline that performs fixed-window averaging of softmax scores for detections at the same spatial location without any cross-frame association. The revised manuscript will report this comparison on all three datasets. revision: yes
Referee: [Experimental results / evaluation section] The manuscript does not report standard MOT quality metrics (ID switches, fragmentation, trajectory purity, or MOTA/MOTP) on the camera-trap sequences. In the presence of stationary animals, partial occlusions and similar-looking species, these numbers are needed to verify that the trajectories actually group frames of the same individual before the fusion step is credited with the observed gains.

Authors: We acknowledge the value of such metrics for validating trajectory quality. However, the datasets are annotated only for species classification; no ground-truth identities or trajectories are available, precluding computation of MOTA, ID switches, or similar measures. We will add a qualitative discussion of the generated trajectories, including examples of how the chosen MOT models handle stationary animals and brief occlusions, together with any internal consistency statistics (e.g., average trajectory length) that can be obtained without external ground truth. revision: partial

standing simulated objections not resolved

Quantitative MOT metrics (MOTA, ID switches, etc.) cannot be reported because the classification-only datasets lack ground-truth tracking annotations.

Circularity Check

0 steps flagged

No circularity: purely empirical comparison using off-the-shelf components

full rationale

The paper describes an experimental pipeline that applies existing MOT algorithms to link detections and then fuses softmax outputs along the resulting trajectories. All reported gains (weighted F1 improvements of 2.0–5.1 %) are measured directly against a per-frame classifier baseline on three datasets. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text; the central claim rests on external MOT implementations and straightforward empirical evaluation rather than any reduction to the paper’s own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard MOT assumptions and off-the-shelf classifiers whose internal details are not audited here.

pith-pipeline@v0.9.0 · 5733 in / 1113 out tokens · 41512 ms · 2026-05-20T18:06:48.873906+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

WildlifeReID-10k: Wildlife re- identification dataset with 10k individual animals

Luk ´aˇs Adam, V ojt ˇech ˇCerm´ak, Kostas Papafitsoros, and Lukas Picek. WildlifeReID-10k: Wildlife re- identification dataset with 10k individual animals. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2090–

work page 2025
[2]

BoT- SORT: Robust associations multi-pedestrian tracking

Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT- SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, July 2022

work page arXiv 2022
[3]

Simple online and realtime track- ing

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime track- ing. In2016 IEEE International Conference on Im- age Processing (ICIP), pages 3464–3468, 2016. doi: 10.1109/ICIP.2016.7533003

work page doi:10.1109/icip.2016.7533003 2016
[4]

Observation-centric SORT: Rethinking SORT for robust multi-object tracking

Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9686–9696, 2023

work page 2023
[5]

A review of camera trapping for conservation behaviour research

Anthony Caravaggi, Peter B Banks, A Cole Burton, Caroline MV Finlay, Peter M Haswell, Matt W Hayward, Marcus J Rowcliffe, and Mike D Wood. A review of camera trapping for conservation behaviour research. Remote Sensing in Ecology and Conservation, 3(3):109– 122, 2017

work page 2017
[6]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Wildlifedatasets: An open-source toolkit for animal re-identification

V ojtˇech ˇCerm´ak, Lukas Picek, Luk ´aˇs Adam, and Kostas Papafitsoros. Wildlifedatasets: An open-source toolkit for animal re-identification. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5953–5963, 2024

work page 2024
[8]

Deepfaune New England: A species classification model for trail camera images in northeastern North America.Ecology and Evolution, 15 (11):e72174, 2025

Laurence A Clarfeld, Katherina D Gieder, Angela Fuller, Zhongqi Miao, Alexej PK Sir ´en, Shevenell M Webb, Toni Lyn Morelli, Tammy L Wilson, Jillian Kilborn, Catherine B Callahan, et al. Deepfaune New England: A species classification model for trail camera images in northeastern North America.Ecology and Evolution, 15 (11):e72174, 2025

work page 2025
[9]

Being confident in confidence scores: calibration in deep learning models for camera trap image sequences.Remote Sensing in Ecology and Conservation, 11(1):88–99, 2025

Gaspard Dussert, Simon Chamaill ´e-Jammes, St ´ephane Dray, and Vincent Miele. Being confident in confidence scores: calibration in deep learning models for camera trap image sequences.Remote Sensing in Ecology and Conservation, 11(1):88–99, 2025

work page 2025
[10]

Paying attention to other animal detections improves camera trap classification models.bioRxiv, pages 2025–07, 2025

Gaspard Dussert, St ´ephane Dray, Simon Chamaill ´e- Jammes, and Vincent Miele. Paying attention to other animal detections improves camera trap classification models.bioRxiv, pages 2025–07, 2025

work page 2025
[11]

Multiobject tracking of wildlife in videos using few-shot learning.Animals, 12 (9):1223, 2022

Jiangfan Feng and Xinxin Xiao. Multiobject tracking of wildlife in videos using few-shot learning.Animals, 12 (9):1223, 2022

work page 2022
[12]

MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps

Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbul, Alexander Mathis, and Devis Tuia. MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13854–13864, 2025

work page 2025
[13]

Bioclip 2: Emergent properties from scaling hierarchical contrastive learning.arXiv preprint arXiv:2505.23883, 2025

Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, An- drei Kopanev, Zheda Mai, Alexander E White, James Balhoff, et al. Bioclip 2: Emergent properties from scaling hierarchical contrastive learning.arXiv preprint arXiv:2505.23883, 2025

work page arXiv 2025
[14]

Nearest-class mean and logits agreement for wildlife open-set recognition

Jiahao Huo, Mufhumudzi Muthivhi, Terence L van Zyl, and Fredrik Gustafsson. Nearest-class mean and logits agreement for wildlife open-set recognition. InSouthern African Conference for Artificial Intelligence Research, pages 316–329. Springer, 2025

work page 2025
[15]

Large models for time series and spatio-temporal data: A survey and outlook.arXiv preprint arXiv:2310.10196, 2023

Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, et al. Large models for time series and spatio-temporal data: A survey and outlook.arXiv preprint arXiv:2310.10196, 2023

work page arXiv 2023
[16]

A new approach to linear filtering and prediction problems.Journal of Basic Engineering, 82(1):35–45, 1960

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems.Journal of Basic Engineering, 82(1):35–45, 1960. doi: 10.1115/1.3662552

work page doi:10.1115/1.3662552 1960
[17]

Improved wildlife recog- nition through fusing camera trap images and temporal metadata.Diversity, 16(3):139, 2024

Lei Liu, Chao Mou, and Fu Xu. Improved wildlife recog- nition through fusing camera trap images and temporal metadata.Diversity, 16(3):139, 2024

work page 2024
[18]

Deep learning in multiple animal tracking: A survey

Yeqiang Liu, Weiran Li, Xue Liu, Zhenbo Li, and Jun Yue. Deep learning in multiple animal tracking: A survey. Computers and Electronics in Agriculture, 224:109161, 2024

work page 2024
[19]

Trackformer: Multi-object tracking with transformers

Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8844–8854, 2022

work page 2022
[20]

Wildlife target re-identification using self-supervised learning in non-urban settings

Mufhumudzi Muthivhi and Terence L Van Zyl. Wildlife target re-identification using self-supervised learning in non-urban settings. In2025 28th International Confer- ence on Information Fusion (FUSION), pages 1–8. IEEE, 2025

work page 2025
[21]

Improving wildlife out-of- distribution detection: Africas big five.arXiv preprint arXiv:2506.06719, 2025

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, and Terence L van Zyl. Improving wildlife out-of- distribution detection: Africas big five.arXiv preprint arXiv:2506.06719, 2025

work page arXiv 2025
[22]

Nascimento, Arnaldo J

Jacinto C. Nascimento, Arnaldo J. Abrantes, and Jorge S. Marques. An algorithm for centroid-based tracking of moving objects. In1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceed- ings. ICASSP99 (Cat. No.99CH36258), volume 6, pages 3305–3308, 1999. URL https://api.semanticscholar.org/ CorpusID:6330699

work page 1999
[23]

Two- stream collaborative learning with spatial-temporal at- tention for video classification.IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773– 786, 2018

Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. Two- stream collaborative learning with spatial-temporal at- tention for video classification.IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773– 786, 2018

work page 2018
[24]

C3d-convlstm based cow behaviour classi- fication using video data for precision livestock farming

Yongliang Qiao, Yangyang Guo, Keping Yu, and Dongjian He. C3d-convlstm based cow behaviour classi- fication using video data for precision livestock farming. Computers and electronics in agriculture, 193:106650, 2022

work page 2022
[25]

Rial Arifin Rajagukguk, Se-yeon Lee, Ji-yeon Park, Ke- hinde Favour Daniel, Chae-rin Lee, Zheng Chen, Dong Liu, Tom ´as Norton, Jinseon Park, and Se-woon Hong. Deep learning for visual animal monitoring (detection, tracking, pose estimation, and behavior classification): a comprehensive review.Smart Agricultural Technology, page 101539, 2025

work page 2025
[26]

Generalized intersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019
[27]

Beyond observation: Deep learning for animal behavior and ecological conservation.Ecological Informatics, 84: 102893, 2024

Lyes Saad Saoud, Atif Sultan, Mahmoud Elmezain, Mo- hamed Heshmat, Lakmal Seneviratne, and Irfan Hussain. Beyond observation: Deep learning for animal behavior and ecological conservation.Ecological Informatics, 84: 102893, 2024

work page 2024
[28]

Boost- Track: boosting the similarity measure and detection con- fidence for improved multiple object tracking.Machine Vision and Applications, 35(3), 2024

Vukasin D Stanojevic and Branimir T Todorovic. Boost- Track: boosting the similarity measure and detection con- fidence for improved multiple object tracking.Machine Vision and Applications, 35(3), 2024. ISSN 0932-8092. doi: 10.1007/s00138-024-01531

work page doi:10.1007/s00138-024-01531 2024
[29]

Snap- shot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna

Alexandra Swanson, Margaret Kosmala, Chris Lintott, Robert Simpson, Arfon Smith, and Craig Packer. Snap- shot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna. Scientific data, 2(1):1–14, 2015

work page 2015
[30]

Alexander Gomez Villa, Augusto Salazar, and Francisco Vargas. Towards automatic wild animal monitoring: Iden- tification of animal species in camera-trap images using very deep convolutional neural networks.Ecological informatics, 41:24–32, 2017

work page 2017
[31]

Identifying animal species in camera trap images using deep learning and citizen science.Methods in Ecology and Evolution, 10 (1):80–91, 2019

Marco Willi, Ross T Pitman, Anabelle W Cardoso, Christina Locke, Alexandra Swanson, Amy Boyer, Marten Veldthuis, and Lucy Fortson. Identifying animal species in camera trap images using deep learning and citizen science.Methods in Ecology and Evolution, 10 (1):80–91, 2019

work page 2019
[32]

Simple online and realtime tracking with a deep association metric

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017

work page 2017
[33]

Motr: End-to-end multiple-object tracking with transformer

Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. InEuropean conference on computer vision, pages 659–675. Springer, 2022

work page 2022
[34]

Animaltrack: A benchmark for multi-animal tracking in the wild.International Journal of Computer Vision, 131 (2):496–513, 2023

Libo Zhang, Junyuan Gao, Zhen Xiao, and Heng Fan. Animaltrack: A benchmark for multi-animal tracking in the wild.International Journal of Computer Vision, 131 (2):496–513, 2023

work page 2023
[35]

ByteTrack: Multi-object tracking by associ- ating every detection box

Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xing- gang Wang. ByteTrack: Multi-object tracking by associ- ating every detection box. InProceedings of the Euro- pean Conference on Computer Vision (ECCV), 2022

work page 2022

[1] [1]

WildlifeReID-10k: Wildlife re- identification dataset with 10k individual animals

Luk ´aˇs Adam, V ojt ˇech ˇCerm´ak, Kostas Papafitsoros, and Lukas Picek. WildlifeReID-10k: Wildlife re- identification dataset with 10k individual animals. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2090–

work page 2025

[2] [2]

BoT- SORT: Robust associations multi-pedestrian tracking

Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT- SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, July 2022

work page arXiv 2022

[3] [3]

Simple online and realtime track- ing

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime track- ing. In2016 IEEE International Conference on Im- age Processing (ICIP), pages 3464–3468, 2016. doi: 10.1109/ICIP.2016.7533003

work page doi:10.1109/icip.2016.7533003 2016

[4] [4]

Observation-centric SORT: Rethinking SORT for robust multi-object tracking

Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9686–9696, 2023

work page 2023

[5] [5]

A review of camera trapping for conservation behaviour research

Anthony Caravaggi, Peter B Banks, A Cole Burton, Caroline MV Finlay, Peter M Haswell, Matt W Hayward, Marcus J Rowcliffe, and Mike D Wood. A review of camera trapping for conservation behaviour research. Remote Sensing in Ecology and Conservation, 3(3):109– 122, 2017

work page 2017

[6] [6]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Wildlifedatasets: An open-source toolkit for animal re-identification

V ojtˇech ˇCerm´ak, Lukas Picek, Luk ´aˇs Adam, and Kostas Papafitsoros. Wildlifedatasets: An open-source toolkit for animal re-identification. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5953–5963, 2024

work page 2024

[8] [8]

Deepfaune New England: A species classification model for trail camera images in northeastern North America.Ecology and Evolution, 15 (11):e72174, 2025

Laurence A Clarfeld, Katherina D Gieder, Angela Fuller, Zhongqi Miao, Alexej PK Sir ´en, Shevenell M Webb, Toni Lyn Morelli, Tammy L Wilson, Jillian Kilborn, Catherine B Callahan, et al. Deepfaune New England: A species classification model for trail camera images in northeastern North America.Ecology and Evolution, 15 (11):e72174, 2025

work page 2025

[9] [9]

Being confident in confidence scores: calibration in deep learning models for camera trap image sequences.Remote Sensing in Ecology and Conservation, 11(1):88–99, 2025

Gaspard Dussert, Simon Chamaill ´e-Jammes, St ´ephane Dray, and Vincent Miele. Being confident in confidence scores: calibration in deep learning models for camera trap image sequences.Remote Sensing in Ecology and Conservation, 11(1):88–99, 2025

work page 2025

[10] [10]

Paying attention to other animal detections improves camera trap classification models.bioRxiv, pages 2025–07, 2025

Gaspard Dussert, St ´ephane Dray, Simon Chamaill ´e- Jammes, and Vincent Miele. Paying attention to other animal detections improves camera trap classification models.bioRxiv, pages 2025–07, 2025

work page 2025

[11] [11]

Multiobject tracking of wildlife in videos using few-shot learning.Animals, 12 (9):1223, 2022

Jiangfan Feng and Xinxin Xiao. Multiobject tracking of wildlife in videos using few-shot learning.Animals, 12 (9):1223, 2022

work page 2022

[12] [12]

MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps

Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbul, Alexander Mathis, and Devis Tuia. MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the swiss alps. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13854–13864, 2025

work page 2025

[13] [13]

Bioclip 2: Emergent properties from scaling hierarchical contrastive learning.arXiv preprint arXiv:2505.23883, 2025

Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, An- drei Kopanev, Zheda Mai, Alexander E White, James Balhoff, et al. Bioclip 2: Emergent properties from scaling hierarchical contrastive learning.arXiv preprint arXiv:2505.23883, 2025

work page arXiv 2025

[14] [14]

Nearest-class mean and logits agreement for wildlife open-set recognition

Jiahao Huo, Mufhumudzi Muthivhi, Terence L van Zyl, and Fredrik Gustafsson. Nearest-class mean and logits agreement for wildlife open-set recognition. InSouthern African Conference for Artificial Intelligence Research, pages 316–329. Springer, 2025

work page 2025

[15] [15]

Large models for time series and spatio-temporal data: A survey and outlook.arXiv preprint arXiv:2310.10196, 2023

Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, et al. Large models for time series and spatio-temporal data: A survey and outlook.arXiv preprint arXiv:2310.10196, 2023

work page arXiv 2023

[16] [16]

A new approach to linear filtering and prediction problems.Journal of Basic Engineering, 82(1):35–45, 1960

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems.Journal of Basic Engineering, 82(1):35–45, 1960. doi: 10.1115/1.3662552

work page doi:10.1115/1.3662552 1960

[17] [17]

Improved wildlife recog- nition through fusing camera trap images and temporal metadata.Diversity, 16(3):139, 2024

Lei Liu, Chao Mou, and Fu Xu. Improved wildlife recog- nition through fusing camera trap images and temporal metadata.Diversity, 16(3):139, 2024

work page 2024

[18] [18]

Deep learning in multiple animal tracking: A survey

Yeqiang Liu, Weiran Li, Xue Liu, Zhenbo Li, and Jun Yue. Deep learning in multiple animal tracking: A survey. Computers and Electronics in Agriculture, 224:109161, 2024

work page 2024

[19] [19]

Trackformer: Multi-object tracking with transformers

Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8844–8854, 2022

work page 2022

[20] [20]

Wildlife target re-identification using self-supervised learning in non-urban settings

Mufhumudzi Muthivhi and Terence L Van Zyl. Wildlife target re-identification using self-supervised learning in non-urban settings. In2025 28th International Confer- ence on Information Fusion (FUSION), pages 1–8. IEEE, 2025

work page 2025

[21] [21]

Improving wildlife out-of- distribution detection: Africas big five.arXiv preprint arXiv:2506.06719, 2025

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, and Terence L van Zyl. Improving wildlife out-of- distribution detection: Africas big five.arXiv preprint arXiv:2506.06719, 2025

work page arXiv 2025

[22] [22]

Nascimento, Arnaldo J

Jacinto C. Nascimento, Arnaldo J. Abrantes, and Jorge S. Marques. An algorithm for centroid-based tracking of moving objects. In1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceed- ings. ICASSP99 (Cat. No.99CH36258), volume 6, pages 3305–3308, 1999. URL https://api.semanticscholar.org/ CorpusID:6330699

work page 1999

[23] [23]

Two- stream collaborative learning with spatial-temporal at- tention for video classification.IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773– 786, 2018

Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. Two- stream collaborative learning with spatial-temporal at- tention for video classification.IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773– 786, 2018

work page 2018

[24] [24]

C3d-convlstm based cow behaviour classi- fication using video data for precision livestock farming

Yongliang Qiao, Yangyang Guo, Keping Yu, and Dongjian He. C3d-convlstm based cow behaviour classi- fication using video data for precision livestock farming. Computers and electronics in agriculture, 193:106650, 2022

work page 2022

[25] [25]

Rial Arifin Rajagukguk, Se-yeon Lee, Ji-yeon Park, Ke- hinde Favour Daniel, Chae-rin Lee, Zheng Chen, Dong Liu, Tom ´as Norton, Jinseon Park, and Se-woon Hong. Deep learning for visual animal monitoring (detection, tracking, pose estimation, and behavior classification): a comprehensive review.Smart Agricultural Technology, page 101539, 2025

work page 2025

[26] [26]

Generalized intersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019

[27] [27]

Beyond observation: Deep learning for animal behavior and ecological conservation.Ecological Informatics, 84: 102893, 2024

Lyes Saad Saoud, Atif Sultan, Mahmoud Elmezain, Mo- hamed Heshmat, Lakmal Seneviratne, and Irfan Hussain. Beyond observation: Deep learning for animal behavior and ecological conservation.Ecological Informatics, 84: 102893, 2024

work page 2024

[28] [28]

Boost- Track: boosting the similarity measure and detection con- fidence for improved multiple object tracking.Machine Vision and Applications, 35(3), 2024

Vukasin D Stanojevic and Branimir T Todorovic. Boost- Track: boosting the similarity measure and detection con- fidence for improved multiple object tracking.Machine Vision and Applications, 35(3), 2024. ISSN 0932-8092. doi: 10.1007/s00138-024-01531

work page doi:10.1007/s00138-024-01531 2024

[29] [29]

Snap- shot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna

Alexandra Swanson, Margaret Kosmala, Chris Lintott, Robert Simpson, Arfon Smith, and Craig Packer. Snap- shot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna. Scientific data, 2(1):1–14, 2015

work page 2015

[30] [30]

Alexander Gomez Villa, Augusto Salazar, and Francisco Vargas. Towards automatic wild animal monitoring: Iden- tification of animal species in camera-trap images using very deep convolutional neural networks.Ecological informatics, 41:24–32, 2017

work page 2017

[31] [31]

Identifying animal species in camera trap images using deep learning and citizen science.Methods in Ecology and Evolution, 10 (1):80–91, 2019

Marco Willi, Ross T Pitman, Anabelle W Cardoso, Christina Locke, Alexandra Swanson, Amy Boyer, Marten Veldthuis, and Lucy Fortson. Identifying animal species in camera trap images using deep learning and citizen science.Methods in Ecology and Evolution, 10 (1):80–91, 2019

work page 2019

[32] [32]

Simple online and realtime tracking with a deep association metric

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017

work page 2017

[33] [33]

Motr: End-to-end multiple-object tracking with transformer

Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. InEuropean conference on computer vision, pages 659–675. Springer, 2022

work page 2022

[34] [34]

Animaltrack: A benchmark for multi-animal tracking in the wild.International Journal of Computer Vision, 131 (2):496–513, 2023

Libo Zhang, Junyuan Gao, Zhen Xiao, and Heng Fan. Animaltrack: A benchmark for multi-animal tracking in the wild.International Journal of Computer Vision, 131 (2):496–513, 2023

work page 2023

[35] [35]

ByteTrack: Multi-object tracking by associ- ating every detection box

Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xing- gang Wang. ByteTrack: Multi-object tracking by associ- ating every detection box. InProceedings of the Euro- pean Conference on Computer Vision (ECCV), 2022

work page 2022