CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Alexander Gebhard; Andreas Triantafyllopoulos; Björn W. Schuller; Dominik Arend; Michael Scherer-Lorenzen; Sandra Müller; Svenja Schmidt

arxiv: 2605.21143 · v2 · pith:Z532GEJMnew · submitted 2026-05-20 · 💻 cs.SD · cs.LG

Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3

keywords: soundscape ecology · passive acoustic monitoring · biophony · geophony · anthropophony · deep learning · machine learning · ecological sound analysis

The pith

CoarseSoundNet distinguishes biophony, geophony, and anthropophony in realistic passive acoustic monitoring recordings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CoarseSoundNet, a deep learning model designed to classify soundscapes into animal-made sounds, natural abiotic sounds, and human-made sounds using noisy field recordings from passive acoustic monitoring. It shows that performance improves when the training set includes additional data similar to the target recordings and when an explicit silence class is added. Class-specific decision thresholds combined with duration constraints reduce specific errors, particularly for human and natural sound categories. In an ecological case study, pre-filtering data with the model produces acoustic index trends that closely match results from manually filtered ground-truth data, indicating the model can act as an effective preprocessing step for broader ecoacoustic analyses.

Core claim

CoarseSoundNet is a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. Model performance improves with additional PAM data, especially data similar to the target domain, and with an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses reveal challenges for anthropophony due to masking effects, and confusions involving silence and insect sounds for geophony and biophony. Pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering.

What carries the argument

CoarseSoundNet, a deep learning model for coarse soundscape classification into biophony, geophony, anthropophony and an added silence class, combined with class-specific thresholds and duration constraints for post-processing.

If this is right

Performance improves with additional PAM data especially when similar to the target domain.
Introducing an explicit silence class during training reduces confusion with other categories.
Class-specific decision thresholds and duration constraints further enhance performance for anthropophony and geophony.
Pre-filtering recordings with the model yields acoustic index trends comparable to ground-truth filtering.
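
The post-processing named above (class-specific thresholds followed by duration constraints) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the model emits one probability per class per analysis window, that classes are thresholded independently, and that the duration constraint is a minimum run length of consecutive positive windows. The threshold and duration values, and the names `THRESHOLDS`, `MIN_WINDOWS`, and `postprocess`, are hypothetical.

```python
from typing import Dict, List

# Hypothetical values for illustration only; the tuned thresholds and
# duration limits used in the paper are not given in this summary.
THRESHOLDS: Dict[str, float] = {"anthropophony": 0.6, "biophony": 0.5, "geophony": 0.7}
MIN_WINDOWS: Dict[str, int] = {"anthropophony": 2, "biophony": 1, "geophony": 3}


def apply_threshold(probs: List[float], threshold: float) -> List[bool]:
    """Binarise the per-window probabilities of one class."""
    return [p >= threshold for p in probs]


def enforce_min_duration(active: List[bool], min_windows: int) -> List[bool]:
    """Discard runs of consecutive detections shorter than min_windows."""
    out = list(active)
    i = 0
    while i < len(out):
        if not out[i]:
            i += 1
            continue
        j = i
        while j < len(out) and out[j]:
            j += 1  # j ends one past the last window of the current run
        if j - i < min_windows:
            for k in range(i, j):
                out[k] = False
        i = j
    return out


def postprocess(window_probs: Dict[str, List[float]]) -> Dict[str, List[bool]]:
    """Apply the class-specific threshold, then the duration constraint."""
    return {
        name: enforce_min_duration(
            apply_threshold(probs, THRESHOLDS[name]), MIN_WINDOWS[name]
        )
        for name, probs in window_probs.items()
    }
```

With these assumed settings, an isolated single-window anthropophony detection is dropped, while a two-window run is kept.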

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported error patterns point to masking effects as a persistent challenge for detecting human sounds in complex environments.
The systematic structure for building such models could be reused to train classifiers for other soundscape tasks or regions.
Using the model for preprocessing may allow larger-scale ecoacoustic studies without proportional increases in manual labeling effort.

Load-bearing premise

The labeled training data and class definitions accurately represent the acoustic variability and labeling conventions present in unseen target-domain PAM recordings.

What would settle it

Apply CoarseSoundNet to a new set of PAM recordings from a different site or season, compute acoustic indices after model-based filtering, and check whether the resulting trends deviate substantially from those obtained after manual ground-truth filtering.
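
One way to run this check is sketched below, under stated assumptions: each recording has a single precomputed acoustic index value, a time bin (for example a day or month), and two keep/discard flags, one from the model and one from manual annotation. Trend agreement is measured here as the Pearson correlation of per-bin means, which is an illustrative choice rather than the paper's stated protocol. The function names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def mean_per_bin(
    values: Sequence[float], bins: Sequence[Hashable], keep: Sequence[bool]
) -> Dict[Hashable, float]:
    """Average the index values of the kept recordings within each time bin."""
    grouped: Dict[Hashable, List[float]] = {}
    for value, bin_id, kept in zip(values, bins, keep):
        if kept:
            grouped.setdefault(bin_id, []).append(value)
    return {bin_id: sum(vals) / len(vals) for bin_id, vals in grouped.items()}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def trend_agreement(
    values: Sequence[float],
    bins: Sequence[Hashable],
    keep_model: Sequence[bool],
    keep_truth: Sequence[bool],
) -> float:
    """Correlate per-bin index means after model-based vs manual filtering."""
    model_trend = mean_per_bin(values, bins, keep_model)
    truth_trend = mean_per_bin(values, bins, keep_truth)
    # Only bins that retain at least one recording under both filters count.
    shared = sorted(set(model_trend) & set(truth_trend))
    return pearson([model_trend[b] for b in shared], [truth_trend[b] for b in shared])
```

A correlation near 1 on the new site or season would support the comparability claim; a substantially lower value would count against it.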

Figures

Figures reproduced from arXiv: 2605.21143 by Alexander Gebhard, Andreas Triantafyllopoulos, Björn W. Schuller, Dominik Arend, Michael Scherer-Lorenzen, Sandra Müller, Svenja Schmidt.

**Figure 2.** The Precision-Recall (PR) curves for the three classes Anthropophony (Anth), …

**Figure 3.** The receiver-operating characteristic (ROC) curves for the three classes An…

**Figure 4.** False positives (FPs; top row) and false negatives (FNs; bottom row) for the …

**Figure 5.** Distribution boxplots for ecoacoustic indices (top) vs CoarseSoundNet model …

**Figure 6.** Pearson correlation of three standard ecoacoustic indices (ACI, ADI, NDSI) …

Original abstract

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CoarseSoundNet, a deep learning model for coarse classification of ecological soundscapes into biophony, geophony, and anthropophony from realistic passive acoustic monitoring (PAM) recordings. It outlines a systematic investigation of model architectures, the addition of an explicit silence class, effects of data composition (especially similar PAM data), evaluation strategies including class-specific decision thresholds and duration constraints, error analysis highlighting masking of anthropophony and confusions involving insects/silence, and a case study showing that pre-filtering recordings with the model produces acoustic index trends comparable to ground-truth filtering.

Significance. If the central claims hold, this work offers a practical preprocessing tool for soundscape ecology that could improve the reliability of acoustic indices by separating sound types in noisy field data, addressing a noted gap in analytical instruments. The reproducible structure, focus on realistic PAM conditions, and exploration of factors such as additional similar data and a silence class are strengths that could aid adoption. Credit is given for the case-study validation approach linking model output to ecological metrics.

major comments (2)

[Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.
[Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.

minor comments (2)

[Abstract] The abstract contains a minor grammatical issue: 'We systematically investigate model architectures...' should be revised for subject-verb agreement.
[Methods] Clarify the exact labeling criteria and acoustic characteristics used to define the added silence class versus low-energy segments of the other classes, as this directly affects the reported confusions with insects and geophony/biophony.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of CoarseSoundNet as a practical tool for soundscape analysis. We address the major comments point by point below, with clarifications based on the manuscript content and proposed revisions to strengthen the presentation of our results.

Referee: [Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.

Authors: We appreciate the referee's focus on generalization and domain matching. The Data Composition section details that the training data consist of labeled PAM recordings drawn from multiple sites with documented acoustic variability, including varying levels of masking and background noise; class definitions follow standard soundscape ecology conventions, with the silence class explicitly added for segments lacking audible events above a defined threshold. The case study applies the model to an unseen but ecologically comparable site. To strengthen this, we will expand the relevant sections with explicit comparisons of acoustic features (e.g., spectrogram statistics and masking prevalence) between training and case-study data, plus a discussion of potential domain shifts. We will also report any available cross-site performance indicators from within our multi-site training corpus. These additions will better support the transferability claims without requiring new data collection. revision: partial
Referee: [Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.

Authors: We agree that explicit quantitative comparisons would improve clarity. The Evaluation Strategies and Results sections describe the systematic investigation of architectures, data composition, and strategies, reporting final-model metrics along with qualitative indications of gains from each addition (additional similar PAM data, silence class, thresholds, and duration constraints). To address the concern directly, we will insert a new table in the revised manuscript that tabulates precision, recall, F1, and confusion matrices for the key configurations before and after each enhancement, accompanied by appropriate statistical comparisons (e.g., paired tests on per-class performance). This will provide stronger, quantitative support for the stated improvements, especially for anthropophony and geophony. revision: yes
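
The per-configuration table discussed in this exchange could be derived from confusion matrices along the following lines. This is a sketch only: it assumes single-label predictions (one class per window) so that a square confusion matrix applies, with rows as true classes and columns as predicted classes. The function name `per_class_metrics` is hypothetical, and no numbers from the paper are reproduced here.

```python
from typing import Dict, List


def per_class_metrics(
    confusion: List[List[int]], names: List[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 per class from a square confusion matrix.

    confusion[i][j] counts windows whose true class is i and whose
    predicted class is j.
    """
    metrics: Dict[str, Dict[str, float]] = {}
    for i, name in enumerate(names):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp   # predicted i, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

Running this once per configuration (for example with and without the silence class) on the same test set gives directly comparable per-class rows.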

Circularity Check

0 steps flagged

No circularity: empirical ML training and held-out evaluation are self-contained

The paper trains CoarseSoundNet on labeled PAM recordings to classify biophony, geophony, and anthropophony, then reports accuracy, error patterns, and a case-study comparison of acoustic-index trends before/after model-based pre-filtering versus ground-truth labels. All performance numbers and ecological conclusions derive directly from standard supervised training plus independent test-set and case-study measurements; no equation, prediction, or uniqueness claim reduces by construction to a fitted parameter or self-citation chain. The derivation chain is therefore externally falsifiable against new recordings and does not loop back to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions plus domain-specific premises about label quality and data representativeness; no new physical entities are postulated.

free parameters (2)

class-specific decision thresholds
Tuned post-training to improve per-class performance on validation data.
model architecture hyperparameters
Chosen during systematic architecture search and training.
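
To make the first free parameter concrete, the sketch below tunes one class's decision threshold on validation data by grid search. The selection criterion (maximum F1) and the grid are assumptions for illustration; this summary does not state how the paper selects its thresholds. The names `f1_score` and `tune_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_score(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """F1 of binary predictions against binary ground truth."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(
    probs: Sequence[float], labels: Sequence[bool], grid: Sequence[float]
) -> Tuple[float, float]:
    """Return the grid threshold with the highest validation F1, and that F1.

    Ties are resolved in favour of the earliest threshold in the grid.
    """
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        score = f1_score([p >= t for p in probs], labels)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Usage with made-up validation scores for a single class.
grid: List[float] = [i / 10 for i in range(1, 10)]
threshold, score = tune_threshold([0.1, 0.4, 0.35, 0.8], [False, False, True, True], grid)
```

Because the threshold is fitted to validation data, the reported gains depend on that data resembling the target recordings, which is the load-bearing premise noted earlier.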

axioms (2)

domain assumption: Human-provided labels for biophony, geophony, anthropophony and silence accurately reflect acoustic content in the training recordings
The model is trained and evaluated under the assumption that these coarse labels are reliable and consistent.
domain assumption: Additional PAM data drawn from similar environments improves generalization to the target domain
The reported performance gains presuppose that domain similarity is the operative factor.

pith-pipeline@v0.9.0 · 5853 in / 1609 out tokens · 44127 ms · 2026-05-22T08:48:17.646056+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies... class-specific decision thresholds and duration-based constraints

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Example spectrograms for the four acoustic classes... sliding-window manner using non-overlapping windows

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


