CoarseSoundNet: Building a reliable model for ecological soundscape analysis
Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3
The pith
CoarseSoundNet distinguishes biophony, geophony, and anthropophony in realistic passive acoustic monitoring recordings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoarseSoundNet is a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. Model performance improves with additional PAM data, especially when that data is similar to the target domain, and with an explicit silence class introduced during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses show that anthropophony is difficult to detect because of masking effects, and that confusions involving silence and insect sounds affect geophony and biophony. Pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering.
What carries the argument
CoarseSoundNet, a deep learning model for coarse soundscape classification into biophony, geophony, anthropophony and an added silence class, combined with class-specific thresholds and duration constraints for post-processing.
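The post-processing step named above can be illustrated with a minimal sketch. It assumes the model emits one probability per class for each analysis window; the function names, threshold values, and minimum run lengths below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def apply_thresholds(
    probs: List[Dict[str, float]], thresholds: Dict[str, float]
) -> Dict[str, List[bool]]:
    """Binarize per-window probabilities using one threshold per class."""
    return {c: [p[c] >= thresholds[c] for p in probs] for c in thresholds}


def enforce_min_duration(active: List[bool], min_windows: int) -> List[bool]:
    """Clear runs of consecutive active windows shorter than min_windows."""
    out = list(active)
    start = None
    # The trailing False acts as a sentinel that closes a run ending at the last window.
    for i, flag in enumerate(list(active) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start < min_windows:
                for j in range(start, i):
                    out[j] = False
            start = None
    return out


def postprocess(
    probs: List[Dict[str, float]],
    thresholds: Dict[str, float],
    min_windows: Dict[str, int],
) -> Dict[str, List[bool]]:
    """Apply class-specific thresholds, then a class-specific duration constraint."""
    binary = apply_thresholds(probs, thresholds)
    return {c: enforce_min_duration(binary[c], min_windows[c]) for c in binary}


# Hypothetical probabilities for five consecutive windows (illustration only).
probs = [
    {"biophony": 0.9, "geophony": 0.2, "anthropophony": 0.6},
    {"biophony": 0.8, "geophony": 0.1, "anthropophony": 0.1},
    {"biophony": 0.3, "geophony": 0.7, "anthropophony": 0.1},
    {"biophony": 0.2, "geophony": 0.8, "anthropophony": 0.1},
    {"biophony": 0.1, "geophony": 0.9, "anthropophony": 0.1},
]
thresholds = {"biophony": 0.5, "geophony": 0.6, "anthropophony": 0.5}
min_windows = {"biophony": 1, "geophony": 2, "anthropophony": 2}
result = postprocess(probs, thresholds, min_windows)
```

In this toy input the single isolated anthropophony window is cleared by the two-window minimum, while the three-window geophony run is kept. Whether the paper's constraint works on runs of windows or on seconds of audio is not stated in this summary, so the run-length formulation is an assumption.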
If this is right
- Performance improves with additional PAM data especially when similar to the target domain.
- Introducing an explicit silence class during training reduces confusion with other categories.
- Class-specific decision thresholds and duration constraints further enhance performance for anthropophony and geophony.
- Pre-filtering recordings with the model yields acoustic index trends comparable to ground-truth filtering.
Where Pith is reading between the lines
- The reported error patterns point to masking effects as a persistent challenge for detecting human sounds in complex environments.
- The systematic structure for building such models could be reused to train classifiers for other soundscape tasks or regions.
- Using the model for preprocessing may allow larger-scale ecoacoustic studies without proportional increases in manual labeling effort.
Load-bearing premise
The labeled training data and class definitions accurately represent the acoustic variability and labeling conventions present in unseen target-domain PAM recordings.
What would settle it
Apply CoarseSoundNet to a new set of PAM recordings from a different site or season, compute acoustic indices after model-based filtering, and check whether the resulting trends deviate substantially from those obtained after manual ground-truth filtering.
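One way to carry out this check is sketched below. It assumes each analysis window already has an acoustic index value and a day label, and that two keep/discard masks are available: one from manual ground-truth labels and one from the model. The helper names, the daily-mean aggregation, and the use of Pearson correlation as the agreement measure are all illustrative assumptions, not the paper's procedure.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def daily_mean_index(
    windows: Sequence[Tuple[str, float]], keep: Sequence[bool]
) -> Dict[str, float]:
    """Average a per-window acoustic index per day over the retained windows."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for (day, value), kept in zip(windows, keep):
        if kept:
            sums[day] = sums.get(day, 0.0) + value
            counts[day] = counts.get(day, 0) + 1
    return {day: sums[day] / counts[day] for day in sums}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def trend_agreement(
    windows: Sequence[Tuple[str, float]],
    keep_truth: Sequence[bool],
    keep_model: Sequence[bool],
) -> float:
    """Correlate daily index trends obtained under the two filtering masks."""
    truth = daily_mean_index(windows, keep_truth)
    model = daily_mean_index(windows, keep_model)
    days = sorted(set(truth) & set(model))  # only days present under both masks
    return pearson([truth[d] for d in days], [model[d] for d in days])


# Hypothetical (day, index value) pairs and masks, for illustration only.
windows = [("d1", 0.2), ("d1", 0.4), ("d2", 0.5), ("d2", 0.7), ("d3", 0.9), ("d3", 0.3)]
keep_truth = [True, True, True, True, True, False]
keep_model = [True, True, True, True, True, True]
agreement = trend_agreement(windows, keep_truth, keep_model)
```

A correlation near 1 would indicate that model-based filtering preserves the trend seen with manual filtering; a substantially lower value on a new site or season would count against the transfer claim. The threshold for "substantially" is a judgment the summary does not fix.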
Original abstract
A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CoarseSoundNet, a deep learning model for coarse classification of ecological soundscapes into biophony, geophony, and anthropophony from realistic passive acoustic monitoring (PAM) recordings. It outlines a systematic investigation of model architectures, the addition of an explicit silence class, effects of data composition (especially similar PAM data), evaluation strategies including class-specific decision thresholds and duration constraints, error analysis highlighting masking of anthropophony and confusions involving insects/silence, and a case study showing that pre-filtering recordings with the model produces acoustic index trends comparable to ground-truth filtering.
Significance. If the central claims hold, this work offers a practical preprocessing tool for soundscape ecology that could improve the reliability of acoustic indices by separating sound types in noisy field data, addressing a noted gap in analytical instruments. The reproducible structure, focus on realistic PAM conditions, and exploration of factors such as additional similar data and a silence class are strengths that could aid adoption. Credit is given for the case-study validation approach linking model output to ecological metrics.
major comments (2)
- [Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.
- [Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.
minor comments (2)
- [Abstract] The abstract contains a minor grammatical issue: 'We systematically investigate model architectures...' should be revised for subject-verb agreement.
- [Methods] Clarify the exact labeling criteria and acoustic characteristics used to define the added silence class versus low-energy segments of the other classes, as this directly affects the reported confusions with insects and geophony/biophony.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of CoarseSoundNet as a practical tool for soundscape analysis. We address the major comments point by point below, with clarifications based on the manuscript content and proposed revisions to strengthen the presentation of our results.
Point-by-point responses
Referee: [Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.
Authors: We appreciate the referee's focus on generalization and domain matching. The Data Composition section details that the training data consist of labeled PAM recordings drawn from multiple sites with documented acoustic variability, including varying levels of masking and background noise; class definitions follow standard soundscape ecology conventions, with the silence class explicitly added for segments lacking audible events above a defined threshold. The case study applies the model to an unseen but ecologically comparable site. To strengthen this, we will expand the relevant sections with explicit comparisons of acoustic features (e.g., spectrogram statistics and masking prevalence) between training and case-study data, plus a discussion of potential domain shifts. We will also report any available cross-site performance indicators from within our multi-site training corpus. These additions will better support the transferability claims without requiring new data collection.
Revision: partial
Referee: [Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.
Authors: We agree that explicit quantitative comparisons would improve clarity. The Evaluation Strategies and Results sections describe the systematic investigation of architectures, data composition, and strategies, reporting final-model metrics along with qualitative indications of gains from each addition (additional similar PAM data, silence class, thresholds, and duration constraints). To address the concern directly, we will insert a new table in the revised manuscript that tabulates precision, recall, F1, and confusion matrices for the key configurations before and after each enhancement, accompanied by appropriate statistical comparisons (e.g., paired tests on per-class performance). This will provide stronger, quantitative support for the stated improvements, especially for anthropophony and geophony.
Revision: yes
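The per-class metrics requested in this exchange can be derived from a confusion matrix as in the sketch below. It assumes a single-label setting in which each window has exactly one true and one predicted class; if the model is evaluated as a multi-label detector, per-class binary counts would be needed instead. The function name is hypothetical and the counts are made up for illustration; they are not results from the paper.

```python
from typing import Dict, List


def per_class_metrics(
    confusion: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and F1 per class.

    confusion[i][j] is the number of windows whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics: Dict[str, Dict[str, float]] = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(len(labels))) - tp  # column minus diagonal
        fn = sum(confusion[k]) - tp  # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up counts for illustration only (rows: true class, columns: predicted class).
labels = ["biophony", "geophony", "anthropophony", "silence"]
confusion = [
    [8, 1, 0, 1],
    [2, 6, 0, 2],
    [1, 1, 3, 0],
    [0, 2, 0, 8],
]
metrics = per_class_metrics(confusion, labels)
```

Computing the same table for each configuration (with and without the silence class, extra PAM data, thresholds, and duration constraints) would give the before/after comparison the referee asks for.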
Circularity Check
No circularity: empirical ML training and held-out evaluation are self-contained
The paper trains CoarseSoundNet on labeled PAM recordings to classify biophony, geophony, and anthropophony, then reports accuracy, error patterns, and a case-study comparison of acoustic-index trends before/after model-based pre-filtering versus ground-truth labels. All performance numbers and ecological conclusions derive directly from standard supervised training plus independent test-set and case-study measurements; no equation, prediction, or uniqueness claim reduces by construction to a fitted parameter or self-citation chain. The derivation chain is therefore externally falsifiable against new recordings and does not loop back to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- class-specific decision thresholds
- model architecture hyperparameters
axioms (2)
- domain assumption: Human-provided labels for biophony, geophony, anthropophony and silence accurately reflect acoustic content in the training recordings
- domain assumption: Additional PAM data drawn from similar environments improves generalization to the target domain
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies... class-specific decision thresholds and duration-based constraints"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Example spectrograms for the four acoustic classes... sliding-window manner using non-overlapping windows"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.