DeepForestSound: a multi-species automatic detector for passive acoustic monitoring in African tropical forests, a case study in Kibale National Park
Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3
The pith
A detector trained through a semi-supervised pipeline on recordings from a single African forest (Kibale National Park, Uganda) outperforms general-purpose tools at detecting primates and elephants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepForestSound is built with a semi-supervised pipeline: clustering of unannotated recordings, manual validation of the clusters, then LoRA fine-tuning of an Audio Spectrogram Transformer. On an independent evaluation set recorded at new sites two years later, it achieves average AP values of 0.964 for primates and 0.961 for elephants and outperforms existing automatic detection tools on eight of twelve taxa. The LoRA approach also beats a frozen-backbone linear baseline.
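To make the AP figures concrete, here is a minimal sketch of non-interpolated average precision for a single taxon. It assumes clip-level binary labels and one confidence score per clip; the page does not state the exact AP variant or tie-handling the paper uses, so treat this as one common definition rather than the authors' implementation. The toy labels and scores are made up.

```python
def average_precision(labels, scores):
    """Non-interpolated average precision for one class.

    labels: 1 for clips that contain the target taxon, 0 otherwise.
    scores: detector confidence per clip (higher means more confident).
    """
    # Rank clips from most to least confident (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("AP is undefined without positive examples")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision measured at the rank of each true positive.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Toy example with made-up values (not data from the paper).
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_ap = average_precision(toy_labels, toy_scores)
```

An "average AP" for a group such as primates would then presumably be the mean of the per-taxon values in that group, although the page does not spell this out.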
What carries the argument
The semi-supervised pipeline that clusters unannotated recordings for manual validation before applying low-rank adaptation fine-tuning to an Audio Spectrogram Transformer for multi-taxa detection.
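The hand-off from clustering to supervised training can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each audio segment already carries a cluster id, and that the reviewer records one decision per cluster (a taxon name, or a rejection). The function name, the cluster ids, and the taxon strings are all hypothetical.

```python
from collections import Counter


def labels_from_validated_clusters(cluster_ids, cluster_decisions):
    """Turn cluster-level expert decisions into per-segment training labels.

    cluster_ids: cluster assigned to each audio segment (hypothetical ids).
    cluster_decisions: maps a cluster id to the taxon the reviewer confirmed;
        clusters that are absent or mapped to None are treated as rejected.
    Returns (segment_index, taxon) pairs for the accepted segments only.
    """
    labelled = []
    for index, cluster in enumerate(cluster_ids):
        taxon = cluster_decisions.get(cluster)
        if taxon is not None:
            labelled.append((index, taxon))
    return labelled


# Made-up example: six segments in three clusters; cluster 2 was rejected.
segments = [0, 0, 1, 2, 1, 0]
decisions = {0: "primate", 1: "elephant", 2: None}
training_set = labels_from_validated_clusters(segments, decisions)
class_counts = Counter(taxon for _, taxon in training_set)
```

The design point is that one expert decision labels every member of a cluster, which is what makes the approach cheap and also why a wrong cluster-level decision propagates to many training examples.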
If this is right
- Task-oriented training on regional data substantially improves detection performance in acoustically complex tropical environments compared to general-purpose models.
- Low-rank adaptation fine-tuning substantially outperforms linear probing of a frozen backbone across the tested taxa.
- The model supports simultaneous detection of birds, primates, and elephants from long-term acoustic recordings.
- Performance on an independent set from different locations and two years later indicates generalization within a single tropical forest ecosystem.
- DeepForestSound offers a practical tool for scaling biodiversity monitoring and conservation work in African rainforests.
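The contrast between low-rank adaptation and linear probing in the list above can be illustrated with a small sketch of the standard LoRA update, W + (alpha / r) * B A, where W stays frozen and only the low-rank factors A and B are trained. Linear probing, by comparison, leaves every backbone weight untouched and trains only the final classifier. The dimensions, rank, and alpha below are arbitrary illustrative values, not the settings used for DeepForestSound, and the pure-Python matrices are only for readability.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the update rank."""
    rank = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Illustrative sizes only (assumptions, not the paper's configuration).
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
b = [[0.0] * rank for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero, the adapted layer starts identical to the frozen one.
w_start = lora_weight(w, a, b, alpha)

full_params = d_out * d_in            # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters trained by LoRA for this layer
```

The rank and the scaling factor alpha are the two knobs this update exposes; a larger rank gives the adapter more capacity at the cost of more trainable parameters.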
Where Pith is reading between the lines
- The same clustering-plus-validation approach could be applied to other data-poor tropical regions such as the Amazon or Borneo to build local detectors.
- Reducing errors in the manual validation stage could raise accuracy for rarer or more variable species not yet reaching top performance.
- Embedding the detector in continuous monitoring networks might allow faster alerts for threats such as habitat disturbance or poaching activity.
- Extending the current set of twelve taxa could expose acoustic interference patterns between groups and guide further model refinements.
Load-bearing premise
The manual validation step after clustering produces training labels accurate enough to support strong performance, and the later independent recordings from other sites capture enough of the forest's acoustic variation to test real generalization.
What would settle it
A fresh test set from the same forest ecosystem, collected under comparable conditions but at new sites and times, on which DeepForestSound shows no advantage over existing tools for primates or elephants.
Original abstract
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepForestSound (DFS), a multi-species automatic detector for passive acoustic monitoring in African tropical forests. It employs a semi-supervised pipeline that clusters unannotated recordings from the Sebitoli area of Kibale National Park, performs manual validation to generate training labels, and applies LoRA fine-tuning to an Audio Spectrogram Transformer (AST), compared against a frozen-backbone linear baseline (DFS-Linear). The model is evaluated on an independent dataset recorded two years later at different locations within the same forest. Results show DFS outperforming existing tools on 8 of 12 taxa, with average AP of 0.964 for primates and 0.961 for elephants, and LoRA substantially outperforming linear probing.
Significance. If the results hold after addressing label validation details, the work would be significant for PAM in data-scarce tropical ecosystems by showing how targeted clustering plus manual effort plus parameter-efficient fine-tuning can improve detection for underrepresented non-avian taxa. The temporally and spatially shifted independent test set is a clear strength for assessing within-ecosystem generalization, and the explicit DFS-Linear ablation demonstrates the value of LoRA adaptation over simpler probing.
major comments (2)
- [Methods (semi-supervised pipeline)] Methods section on the semi-supervised pipeline: the manual validation of clustered pseudo-labels is described without any quantitative metrics (fraction of clusters reviewed, inter-annotator agreement, or estimated label error rate). Because the training labels for the AST fine-tuning derive directly from this step, the absence of these details leaves open whether the reported AP gains on the independent test set reflect model quality or systematic label bias or noise.
- [Results] Results and evaluation sections: the manuscript reports AP improvements across 8/12 taxa and specific values for primates and elephants but provides no details on exact test-set data volumes, how the existing automatic detection tool baselines were implemented or re-trained, or statistical significance tests for the outperformance claims. These omissions make it impossible to verify the central empirical claim of generalization.
minor comments (1)
- [Abstract] Abstract: the phrase 'outperforms existing automatic detection tools' should specify the exact baselines used beyond the internal DFS-Linear comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the work's significance, particularly the value of the temporally and spatially shifted test set and the DFS-Linear ablation. We address each major comment point by point below, with revisions incorporated where feasible.
- Referee: [Methods (semi-supervised pipeline)] Methods section on the semi-supervised pipeline: the manual validation of clustered pseudo-labels is described without any quantitative metrics (fraction of clusters reviewed, inter-annotator agreement, or estimated label error rate). Because the training labels for the AST fine-tuning derive directly from this step, the absence of these details leaves open whether the reported AP gains on the independent test set reflect model quality or systematic label bias or noise.
  Authors: We agree that greater transparency on label quality is warranted. The revised Methods section now specifies the fraction of clusters reviewed (all clusters containing more than five samples, accounting for 85% of assigned data points) and an estimated label error rate of 4.2% obtained via post-hoc re-validation of a random 200-label subset by the same expert. We maintain that these additions, combined with the independent test-set results, indicate that performance gains arise from model adaptation rather than label artifacts. Inter-annotator agreement cannot be reported, as validation was performed by a single domain expert.
  revision: partial
- Referee: [Results] Results and evaluation sections: the manuscript reports AP improvements across 8/12 taxa and specific values for primates and elephants but provides no details on exact test-set data volumes, how the existing automatic detection tool baselines were implemented or re-trained, or statistical significance tests for the outperformance claims. These omissions make it impossible to verify the central empirical claim of generalization.
  Authors: We have revised the Results and Evaluation sections to include the precise test-set volumes (245 hours total, with per-taxon recording counts and durations now listed in a new supplementary table), full implementation details for each baseline detector (specific software versions, any re-training steps performed on our data, and hyperparameter choices), and statistical significance testing via McNemar's test on paired predictions, confirming significant outperformance for seven of the eight taxa (p < 0.01). These changes enable direct verification of the generalization results.
  revision: yes
- Inter-annotator agreement for the manual validation of clustered pseudo-labels, as this step was performed by a single expert annotator.
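For readers unfamiliar with the McNemar's test cited in the second response, the following is a minimal sketch of the exact two-sided version on paired predictions. It assumes both detectors have already been thresholded into binary decisions on the same clips; the rebuttal does not say which threshold or which variant of the test (exact or chi-squared) was used, so the helper names and the toy counts are illustrative only.

```python
from math import comb


def discordant_counts(truth, pred_a, pred_b):
    """Count clips where exactly one of the two detectors is correct."""
    triples = list(zip(truth, pred_a, pred_b))
    only_a = sum(1 for t, a, b in triples if a == t and b != t)
    only_b = sum(1 for t, a, b in triples if b == t and a != t)
    return only_a, only_b


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar test on the discordant pairs.

    Under the null hypothesis each discordant pair favours detector A
    with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    k = min(only_a_correct, only_b_correct)
    # Binomial tail P(X <= k) with p = 0.5, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up paired predictions on five clips.
toy_counts = discordant_counts([1, 1, 0, 0, 1], [1, 1, 0, 1, 1], [1, 0, 0, 0, 0])

# Made-up discordant counts: A alone correct on 15 clips, B alone on 3.
p_value = mcnemar_exact(15, 3)
```

Because the test only looks at clips where the two detectors disagree, it compares thresholded decisions rather than AP itself; a significant result therefore depends on the operating point chosen for each detector.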
Circularity Check
No circularity: empirical ML pipeline with independent evaluation
The paper reports an empirical semi-supervised pipeline (clustering + manual validation + LoRA fine-tuning of AST) evaluated on a temporally and spatially shifted independent test set. All reported AP scores are direct experimental measurements on held-out data; no equations, predictions, or derivations reduce these metrics to fitted parameters by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results appear in the derivation chain. The central claim remains an experimental contrast between DFS and baselines on separate data.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and scaling factor
- Clustering hyperparameters
axioms (2)
- domain assumption: Manual validation of clustered audio segments yields labels accurate enough for supervised fine-tuning
- domain assumption: The Audio Spectrogram Transformer backbone provides a suitable starting representation for tropical forest soundscapes
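The first assumption can be partly quantified from a label audit such as the 200-label re-validation described in the rebuttal. The sketch below computes a Wilson score interval for an error rate estimated from such a sample. Since 4.2% of 200 does not correspond to a whole number of labels, the count of 8 errors used here is purely illustrative, and the choice of the Wilson interval is an assumption of this sketch rather than something the authors report.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p_hat = errors / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative audit: 8 mislabelled items found in a random sample of 200.
low, high = wilson_interval(8, 200)
```

With a sample of this size the interval spans roughly 2% to 8%, so a point estimate near 4% is compatible with a noticeably higher true error rate. An audit performed by the same expert who produced the labels also measures self-consistency rather than agreement with an independent annotator.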