arxiv: 2605.21143 · v2 · pith:Z532GEJMnew · submitted 2026-05-20 · 💻 cs.SD · cs.LG

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords soundscape ecology · passive acoustic monitoring · biophony · geophony · anthropophony · deep learning · machine learning · ecological sound analysis

The pith

CoarseSoundNet distinguishes biophony, geophony, and anthropophony in realistic passive acoustic monitoring recordings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CoarseSoundNet, a deep learning model designed to classify soundscapes into animal-made sounds, natural abiotic sounds, and human-made sounds using noisy field recordings from passive acoustic monitoring. It shows that performance improves when the training set includes additional data similar to the target recordings and when an explicit silence class is added. Class-specific decision thresholds combined with duration constraints reduce specific errors, particularly for human and natural sound categories. In an ecological case study, pre-filtering data with the model produces acoustic index trends that closely match results from manually filtered ground-truth data, indicating the model can act as an effective preprocessing step for broader ecoacoustic analyses.

Core claim

CoarseSoundNet is a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. Model performance improves when additional PAM data are added to training, especially data similar to the target domain, and when an explicit silence class is introduced. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses reveal challenges for anthropophony due to masking effects, and confusions involving silence and insect sounds for geophony and biophony. Pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering.

What carries the argument

CoarseSoundNet, a deep learning model for coarse soundscape classification into biophony, geophony, anthropophony and an added silence class, combined with class-specific thresholds and duration constraints for post-processing.
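
The post-processing step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the model emits one score per class per frame, reads "duration constraint" as a minimum run length of consecutive active frames, and uses made-up values in `THRESHOLDS` and `MIN_FRAMES`. All function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical per-class settings; the paper's actual values are not given here.
THRESHOLDS: Dict[str, float] = {"anthropophony": 0.6, "biophony": 0.5, "geophony": 0.7}
MIN_FRAMES: Dict[str, int] = {"anthropophony": 3, "biophony": 1, "geophony": 2}


def apply_threshold(scores: List[float], threshold: float) -> List[bool]:
    """Binarise per-frame scores with a class-specific threshold."""
    return [s >= threshold for s in scores]


def enforce_min_duration(active: List[bool], min_frames: int) -> List[bool]:
    """Drop runs of consecutive active frames shorter than min_frames."""
    out = list(active)
    start = None
    # The trailing False is a sentinel that closes a run ending at the last frame.
    for i, flag in enumerate(list(active) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start < min_frames:
                for j in range(start, i):
                    out[j] = False
            start = None
    return out


def postprocess(scores_by_class: Dict[str, List[float]]) -> Dict[str, List[bool]]:
    """Threshold each class independently, then apply its duration constraint."""
    return {
        name: enforce_min_duration(
            apply_threshold(scores, THRESHOLDS[name]), MIN_FRAMES[name]
        )
        for name, scores in scores_by_class.items()
    }
```

Treating each class independently means a frame can carry several labels at once, which fits recordings where, for example, wind and birdsong overlap; whether the paper handles overlap this way is an assumption of the sketch.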

If this is right

  • Performance improves with additional PAM data especially when similar to the target domain.
  • Introducing an explicit silence class during training reduces confusion with other categories.
  • Class-specific decision thresholds and duration constraints further enhance performance for anthropophony and geophony.
  • Pre-filtering recordings with the model yields acoustic index trends comparable to ground-truth filtering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported error patterns point to masking effects as a persistent challenge for detecting human sounds in complex environments.
  • The systematic structure for building such models could be reused to train classifiers for other soundscape tasks or regions.
  • Using the model for preprocessing may allow larger-scale ecoacoustic studies without proportional increases in manual labeling effort.

Load-bearing premise

The labeled training data and class definitions accurately represent the acoustic variability and labeling conventions present in unseen target-domain PAM recordings.

What would settle it

Apply CoarseSoundNet to a new set of PAM recordings from a different site or season, compute acoustic indices after model-based filtering, and check whether the resulting trends deviate substantially from those obtained after manual ground-truth filtering.
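
One way to make "deviate substantially" concrete is to correlate the two index series. The sketch below assumes each series holds one acoustic index value per time step (for example, per recording day) and uses Pearson correlation as the agreement measure; the `min_r` cut-off of 0.9 is an arbitrary illustrative choice, not a value from the paper.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def trends_agree(model_filtered: List[float],
                 manual_filtered: List[float],
                 min_r: float = 0.9) -> bool:
    """True if index trends after model-based and manual filtering correlate strongly."""
    return pearson(model_filtered, manual_filtered) >= min_r
```

A high correlation only shows that the two trends move together; a systematic offset between them would need a separate check, such as comparing the distributions directly.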

Figures

Figures reproduced from arXiv: 2605.21143 by Alexander Gebhard, Andreas Triantafyllopoulos, Björn W. Schuller, Dominik Arend, Michael Scherer-Lorenzen, Sandra Müller, Svenja Schmidt.

Figure 1. Example spectrograms for the four acoustic classes: anthropophony, biophony, [...]
Figure 2. The Precision-Recall (PR) curves for the three classes Anthropophony (Anth), [...]
Figure 3. The receiver-operating characteristic (ROC) curves for the three classes An[...]
Figure 4. False positives (FPs; top row) and false negatives (FNs; bottom row) for the [...]
Figure 5. Distribution boxplots for ecoacoustic indices (top) vs CoarseSoundNet model [...]
Figure 6. Pearson correlation of three standard ecoacoustic indices (ACI, ADI, NDSI) [...]

Original abstract

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CoarseSoundNet, a deep learning model for coarse classification of ecological soundscapes into biophony, geophony, and anthropophony from realistic passive acoustic monitoring (PAM) recordings. It outlines a systematic investigation of model architectures, the addition of an explicit silence class, effects of data composition (especially similar PAM data), evaluation strategies including class-specific decision thresholds and duration constraints, error analysis highlighting masking of anthropophony and confusions involving insects/silence, and a case study showing that pre-filtering recordings with the model produces acoustic index trends comparable to ground-truth filtering.

Significance. If the central claims hold, this work offers a practical preprocessing tool for soundscape ecology that could improve the reliability of acoustic indices by separating sound types in noisy field data, addressing a noted gap in analytical instruments. The reproducible structure, focus on realistic PAM conditions, and exploration of factors such as additional similar data and a silence class are strengths that could aid adoption. Credit is given for the case-study validation approach linking model output to ecological metrics.

major comments (2)
  1. [Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.
  2. [Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.
minor comments (2)
  1. [Abstract] The abstract contains a minor grammatical issue: 'We systematically investigate model architectures...' should be revised for subject-verb agreement.
  2. [Methods] Clarify the exact labeling criteria and acoustic characteristics used to define the added silence class versus low-energy segments of the other classes, as this directly affects the reported confusions with insects and geophony/biophony.
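
The per-configuration comparison requested in major comment 2 could be tabulated along the following lines. This is a hedged sketch under the assumption that each class is scored as a separate binary problem over frames or segments; the function names and the configuration labels in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple


def precision_recall_f1(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class, treated as a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def compare_configs(y_true: List[bool],
                    preds_by_config: Dict[str, List[bool]]) -> Dict[str, Tuple[float, float, float]]:
    """Score several model configurations against the same ground truth."""
    return {name: precision_recall_f1(y_true, pred) for name, pred in preds_by_config.items()}
```

For instance, `compare_configs(labels, {"baseline": a, "with_silence_class": b})` would return one `(precision, recall, f1)` tuple per configuration for a single class, which is the kind of before/after row the referee asks for.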

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of CoarseSoundNet as a practical tool for soundscape analysis. We address the major comments point by point below, with clarifications based on the manuscript content and proposed revisions to strengthen the presentation of our results.

point-by-point responses
  1. Referee: [Data Composition and Case Study] The generalization claim—that CoarseSoundNet distinguishes the three classes under realistic PAM conditions and that pre-filtering yields comparable acoustic-index trends—is load-bearing for the central contribution. However, the manuscript provides insufficient detail on how the labeled training data and class definitions (including the silence class) match the acoustic variability, masking patterns, site-specific backgrounds, and annotator conventions in the unseen target-domain recordings used in the case study. See the data composition and case study sections; without cross-site validation or domain-shift analysis, the reported performance gains and equivalence may not transfer.

    Authors: We appreciate the referee's focus on generalization and domain matching. The Data Composition section details that the training data consist of labeled PAM recordings drawn from multiple sites with documented acoustic variability, including varying levels of masking and background noise; class definitions follow standard soundscape ecology conventions, with the silence class explicitly added for segments lacking audible events above a defined threshold. The case study applies the model to an unseen but ecologically comparable site. To strengthen this, we will expand the relevant sections with explicit comparisons of acoustic features (e.g., spectrogram statistics and masking prevalence) between training and case-study data, plus a discussion of potential domain shifts. We will also report any available cross-site performance indicators from within our multi-site training corpus. These additions will better support the transferability claims without requiring new data collection. revision: partial

  2. Referee: [Evaluation Strategies and Results] The abstract and findings state that model performance improves with additional similar PAM data, an explicit silence class, class-specific thresholds, and duration constraints, with particular gains for anthropophony and geophony. Yet the evaluation lacks reported quantitative metrics (e.g., precision/recall/F1 scores, confusion matrices, or statistical tests) comparing configurations before and after these additions. This weakens support for the performance claims. See the evaluation strategies and results sections.

    Authors: We agree that explicit quantitative comparisons would improve clarity. The Evaluation Strategies and Results sections describe the systematic investigation of architectures, data composition, and strategies, reporting final-model metrics along with qualitative indications of gains from each addition (additional similar PAM data, silence class, thresholds, and duration constraints). To address the concern directly, we will insert a new table in the revised manuscript that tabulates precision, recall, F1, and confusion matrices for the key configurations before and after each enhancement, accompanied by appropriate statistical comparisons (e.g., paired tests on per-class performance). This will provide stronger, quantitative support for the stated improvements, especially for anthropophony and geophony. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML training and held-out evaluation are self-contained

full rationale

The paper trains CoarseSoundNet on labeled PAM recordings to classify biophony, geophony, and anthropophony, then reports accuracy, error patterns, and a case-study comparison of acoustic-index trends before/after model-based pre-filtering versus ground-truth labels. All performance numbers and ecological conclusions derive directly from standard supervised training plus independent test-set and case-study measurements; no equation, prediction, or uniqueness claim reduces by construction to a fitted parameter or self-citation chain. The derivation chain is therefore externally falsifiable against new recordings and does not loop back to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions plus domain-specific premises about label quality and data representativeness; no new physical entities are postulated.

free parameters (2)
  • class-specific decision thresholds
    Tuned post-training to improve per-class performance on validation data.
  • model architecture hyperparameters
    Chosen during systematic architecture search and training.
axioms (2)
  • domain assumption: Human-provided labels for biophony, geophony, anthropophony, and silence accurately reflect the acoustic content of the training recordings
    The model is trained and evaluated under the assumption that these coarse labels are reliable and consistent.
  • domain assumption: Additional PAM data drawn from similar environments improves generalization to the target domain
    The reported performance gains presuppose that domain similarity is the operative factor.
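
To illustrate why the class-specific decision thresholds count as free parameters, here is a minimal sketch of tuning one class's threshold on validation data. Using F1 as the selection criterion and a fixed candidate grid are assumptions made for the example; the ledger only states that thresholds are tuned post-training on validation data.

```python
from typing import List, Sequence


def f1_at(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 obtained when scores at or above the threshold count as positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores: Sequence[float],
                   labels: Sequence[bool],
                   candidates: List[float]) -> float:
    """Pick the candidate threshold with the highest validation F1 (first wins ties)."""
    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        score = f1_at(scores, labels, threshold)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold
```

Running `tune_threshold` once per class yields one threshold each, so the number of free parameters grows with the number of classes, and the chosen values depend on how representative the validation set is of the target recordings.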

pith-pipeline@v0.9.0 · 5853 in / 1609 out tokens · 44127 ms · 2026-05-22T08:48:17.646056+00:00 · methodology
