AVEX: What Matters for Animal Vocalization Encoding

Aza Raskin; Benjamin Hoffman; David Robinson; Diane Kim; Ellen Gilsenan-McMahon; Emmanuel Chemla; Felix Effenberger; Gagan Narula; Jane Lawton; Jen-Yu Liu

arxiv: 2508.11845 · v3 · pith:2URAYBMVnew · submitted 2025-08-15 · 💻 cs.SD · cs.AI· cs.IR· cs.LG

AVEX: What Matters for Animal Vocalization Encoding

Marius Miron , David Robinson , Milad Alizadeh , Ellen Gilsenan-McMahon , Gagan Narula , Emmanuel Chemla , Maddie Cusimano , Felix Effenberger

show 9 more authors

Masato Hagiwara Benjamin Hoffman Sara Keen Diane Kim Jane Lawton Jen-Yu Liu Aza Raskin Olivier Pietquin Matthieu Geist

This is my paper

Pith reviewed 2026-05-18 22:11 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.IRcs.LG

keywords bioacousticsself-supervised learninganimal vocalizationaudio encodersspecies classificationmachine learningbiodiversity monitoring

0 comments

The pith

Self-supervised pre-training on mixed bioacoustics and general audio followed by supervised post-training produces the strongest encoders for animal vocalization tasks across 26 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs a large empirical study to find effective ways to train general-purpose encoders for bioacoustics work such as species classification, individual identification, and behavior analysis. The field often lacks enough labeled recordings, so reusable representations from pre-training can help many downstream tasks. The authors vary training data scale and diversity, model architectures, and training recipes while testing on a broad set of tasks and datasets. They find that self-supervised pre-training on a combined bioacoustics and general-audio corpus, followed by supervised post-training, gives the best results both on familiar data and on new distributions. The study also shows that data diversity matters in both training stages and releases checkpoints to let others build on the findings.

Core claim

We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages.

What carries the argument

The two-stage training recipe of self-supervised pre-training on a mixed bioacoustics and general-audio corpus followed by supervised post-training on bioacoustics data.

If this is right

Encoders achieve state-of-the-art accuracy on species classification, detection, and individual identification tasks.
Results remain strong on out-of-distribution datasets not seen during training.
Data diversity in both pre-training and post-training stages is required for the performance gains.
The same recipe can be applied when larger bioacoustics datasets become available.
Released checkpoints allow direct use or further fine-tuning on new tasks without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

General-audio data supplies useful sound representations that transfer to animal vocalizations.
These encoders could support automated, large-scale biodiversity monitoring where labeled data remain scarce.
The emphasis on data diversity over architecture choice may apply to other audio domains with limited labels.
Community use of the released models can accelerate testing on additional real-world conservation problems.

Load-bearing premise

Strong benchmark results on the 26 selected datasets will translate into practical gains for conservation and behavioral studies in real-world settings.

What would settle it

A new bioacoustics dataset from an unseen habitat or species group on which the two-stage mixed-corpus recipe performs no better than a standard supervised baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2508.11845 by Aza Raskin, Benjamin Hoffman, David Robinson, Diane Kim, Ellen Gilsenan-McMahon, Emmanuel Chemla, Felix Effenberger, Gagan Narula, Jane Lawton, Jen-Yu Liu, Maddie Cusimano, Marius Miron, Masato Hagiwara, Matthieu Geist, Milad Alizadeh, Olivier Pietquin, Sara Keen.

**Figure 1.** Figure 1: Our empirical study diagram, assessing (1) models, (2) training data, (3) training paradigms, and proposing an (4) extended evaluation data and methodology. 1 Introduction Bioacoustics is the study of animal sound production and perception (Bradbury & Vehrencamp, 1998). It is a crucial component for understanding animal behavior (Fischer et al., 2013), for biodiversity monitoring and conservation ⋆Equal c… view at source ↗

**Figure 2.** Figure 2: (a) Win-rate of adding AudioSet in self-supervised pre-training vs. pure bioacoustic data, with average relative gain per metric. (b) Supervised encoders outperform self-supervised on BEANS classification, which is primarily focal recordings. However, self-supervised encoders suffer markedly smaller performance drops than supervised encoders when moving from focal recordings to soundscape (BEANS Detection… view at source ↗

**Figure 3.** Figure 3: Win-rate of post-trained SSL models vs. their raw SSL backbones. We plot the win-rates summing over all metrics for all our post-trained (EAT and BEATs) models, and show the average relative gain per model with respect to its base model [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

**Figure 4.** Figure 4: Detailed transfer of training data to taxa and tasks in the BEANS benchmark. Heatmap shows the performance change for an EfficientNet trained on each data mix as compared to a baseline “bio” dataset. “- Bio + General” is trained on only AudioSet, “+ Soundscape” adds soundscape datasets, “- Whales” ablates all marine mammal recordings, “Birds only” removes all non-bird recordings [PITH_FULL_IMAGE:figures/f… view at source ↗

**Figure 5.** Figure 5: Detailed transfer of training data to taxa and tasks in the BirdSet, Individual Identification, and Vocal Repertoire Discovery benchmarks. Heatmap shows the performance change for an EfficientNet trained on each data mix as compared to a baseline “bio” dataset. “- Bio + General” is trained on only AudioSet, “+ Soundscape” adds soundscape datasets, “- Whales” ablates all marine mammal recordings, “Birds onl… view at source ↗

read the original abstract

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core finding is that self-supervised pre-training plus supervised post-training on a mixed bioacoustics and general-audio corpus gives the best results across 26 datasets and tasks.

read the letter

Colleague, the main thing to know is that this work runs a broad empirical comparison of training recipes for animal sound encoders and concludes that self-supervised pre-training followed by supervised post-training on a mixed corpus works best for in- and out-of-distribution performance. They back this with results on species classification, detection, individual ID, and vocal repertoire discovery. The scale of the study stands out: they vary data diversity and size, test multiple architectures, and compare training stages in a way that goes beyond the narrower bird-focused or single-paradigm encoders in the cited prior work. Releasing the checkpoints is a practical plus that lets others build on the recipe directly. Data diversity in both stages also comes through as important in their comparisons. The soft spots are mostly in the level of detail provided. The abstract gives no specifics on model sizes, hyperparameter search, exact data splits, or statistical testing, which makes it difficult to assess how reliable the performance gaps really are. The concern about the 26 datasets is reasonable too; without a clear species breakdown, noise statistics, or confirmation that the eval sets are independent from the mixed training corpus, it's possible the gains are partly tied to benchmark construction rather than a general property of the approach. If the full paper fills in those gaps with transparent methods, the central claim holds up better. This is the sort of paper that researchers working on bioacoustics tools or conservation monitoring applications will want to read for the benchmarks and the reusable encoder. It has enough empirical scope and a clear practical recommendation to deserve a serious referee, though the methods section will likely need tightening for full verification.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a large-scale empirical comparison of training strategies for general-purpose animal vocalization encoders. It evaluates the impact of data diversity/scale, model architectures, and training paradigms (including self-supervised pre-training and supervised post-training) across 26 datasets spanning species classification, detection, individual identification, and vocal repertoire discovery. The central claim is that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus produces the strongest in- and out-of-distribution results; the authors also highlight the value of data diversity and commit to releasing model checkpoints.

Significance. If the performance ordering holds under rigorous verification, the work supplies actionable guidance on effective training recipes for bioacoustic models, which could improve downstream applications in conservation and behavioral ecology where labeled data are scarce. The breadth of tasks/datasets and the planned checkpoint release are concrete strengths that support reproducibility and extension by the community.

major comments (2)

[Abstract and evaluation-breadth paragraph] Abstract and evaluation-breadth paragraph: the central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.
[Evaluation section] Evaluation section: the claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.

minor comments (2)

Clarify the precise definition of 'mixed bioacoustics + general-audio corpus' (e.g., relative proportions and source datasets) to allow readers to replicate the data-diversity finding.
Add a short table summarizing the 26 datasets (task, species count, recording conditions) to make the evaluation breadth concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and verifiability of our empirical study. We address each major comment below and have revised the manuscript accordingly to incorporate additional details and analyses.

read point-by-point responses

Referee: [Abstract and evaluation-breadth paragraph] Abstract and evaluation-breadth paragraph: the central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.

Authors: We agree that these implementation details are necessary for rigorous verification of the performance ordering. In the revised manuscript we have expanded the Experimental Setup section with a dedicated 'Implementation Details' subsection. This now reports exact model sizes (parameter counts for each architecture), the full hyperparameter search ranges and final selected values, precise train/eval splits for every dataset, and the statistical procedures (including number of runs, bootstrap confidence intervals, and paired significance tests with p-values). These additions directly support assessment of the central claim. revision: yes
Referee: [Evaluation section] Evaluation section: the claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.

Authors: We acknowledge the importance of demonstrating that the evaluation suite is representative rather than benchmark-specific. The revised manuscript includes a new 'Evaluation Dataset Characterization' subsection that supplies (1) a species-level breakdown and overall taxonomic diversity across the 26 datasets, (2) noise-level statistics (estimated SNR distributions for recordings where metadata is available), and (3) an explicit overlap analysis with the mixed training corpus using audio fingerprinting and metadata comparison, which shows minimal overlap. We discuss the implications for generalizability to conservation and behavioral applications. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons on held-out datasets

full rationale

The paper reports an empirical study that trains encoders under different recipes (self-supervised pre-training, supervised post-training, mixed corpora) and measures performance directly on 26 held-out datasets for tasks such as species classification and detection. No equations, parameter fittings, or uniqueness theorems are invoked; the central claim is a ranking of observed benchmark scores rather than a derivation that reduces to its own inputs by construction. Evaluation uses explicit train/eval splits on external data sources, satisfying the criteria for an independent, self-contained result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical machine-learning study; the central claim rests on standard transfer-learning assumptions rather than new free parameters or invented entities.

axioms (1)

domain assumption Self-supervised representations learned on general audio transfer usefully to bioacoustic tasks when followed by supervised fine-tuning.
Invoked when claiming the mixed-corpus recipe is optimal (abstract).

pith-pipeline@v0.9.0 · 5884 in / 1172 out tokens · 39941 ms · 2026-05-18T22:11:51.375124+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-layer attentive probing improves transfer of audio representations for bioacoustics
cs.SD 2026-05 unverdicted novelty 7.0

Multi-layer attentive probing outperforms last-layer linear probing for transferring audio representations to bioacoustic tasks, indicating that standard evaluation setups may underestimate model quality.
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
cs.SD 2026-05 conditional novelty 6.0

In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research

Jakob Abeßer, Saichand Gourishetti, Andr´as K´atai, Tobias Clauß, Prachi Sharma, and Judith Liebetrau. IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 551–555. IEEE,

work page 2021
[2]

Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures

Jules Cauzinille, Beno ˆıt Favre, Ricard Marxer, Dena Clink, Abdul Hamid Ahmad, and Arnaud Rey. Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures. In Interspeech 2024, pp. 132–136. ISCA; ISCA,

work page 2024
[3]

Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn

doi: https://doi.org/10.25921/e12p-gj65. Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The inaturalist sounds dataset.Advances in Neural Information Processing Systems, 37:132524–132544,

work page doi:10.25921/e12p-gj65
[4]

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei

URL https://arxiv.org/abs/2506.00343. Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pp. 5178–5193. PMLR,

work page arXiv
[5]

Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong

Yarden Cohen, David Nicholson, Alexa Sanchioni, Emily K Mallaber, Viktoriya Skidanova, and Timothy J Gardner. Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong. BioRxiv, pp. 2020– 08,

work page 2020
[6]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pp. 4171–4186,

work page 2019
[7]

Julie E Elie and Frederic E Theunissen

URL https://arxiv.org/abs/2505.03071. Julie E Elie and Frederic E Theunissen. The vocal repertoire of the domesticated zebra finch: a data-driven approach to decipher the information-bearing acoustic features of communication signals. Animal cognition, 19(2):285–315,

work page arXiv
[8]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In ICASSP 2023-2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Beans: The benchmark of animal sounds

Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2023
[10]

arXiv preprint arXiv:2312.07439 , year=

URL https: //arxiv.org/abs/2312.07439. W Alexander Hopping, Christopher J Sayers, Noe Roger Huaraca-Charca, and Holger Klinck. Simultaneous passive acoustic monitoring uncovers evidence of potentially overlooked temporal variation in an amazonian bird commu- nity. Ibis, 166(3):986–1002,

work page arXiv
[11]

doi: https://doi.org/10.1016/j.eswa.2021.115270

ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2021.115270. URL https://www.sciencedirect.com/science/article/pii/S0957417421007016. Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236,

work page doi:10.1016/j.eswa.2021.115270 2021
[12]

A collection of fully-annotated soundscape recordings from the Northeastern United States, September 2022a

Stefan Kahl, Russell Charif, and Holger Klinck. A collection of fully-annotated soundscape recordings from the Northeastern United States, September 2022a. URL https://doi.org/10.5281/zenodo.7079380. Stefan Kahl, Connor M Wood, Philip Chaon, M Zachariah Peery, and Holger Klinck. A collection of fully-annotated soundscape recordings from the western united...

work page doi:10.5281/zenodo.7079380
[13]

org/abs/2503.02389

URL https://arxiv. org/abs/2503.02389. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016a. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 2016 24th European Signal Processing Conference (E...

work page arXiv 2016
[14]

Gill, Hanna Pamula, David Benvent, and Dan Stowell

Veronica Morfi, In ˆes Nolasco, Vincent Lostanlen, Shubhr Singh, Ariana Strandburg-Peshkin, Lisa F. Gill, Hanna Pamula, David Benvent, and Dan Stowell. Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge. In Detection and Classification of Acoustic Scenes and Events 2021,

work page 2021
[15]

Animal sound archive

Museum f¨ur Naturkunde Berlin. Animal sound archive. https://doi.org/10.15468/0bpalr. Accessed via gbif.org 2023-05-09. 14 WHAT MATTERS FOR BIOACOUSTIC ENCODING Amanda Navine, Stefan Kahl, Ann Tanimoto-Johnson, Holger Klinck, and Patrick Hart. A collection of fully- annotated soundscape recordings from the island of hawai’i. Zenodo https://doi. org/10.528...

work page doi:10.15468/0bpalr 2023
[16]

Acoustic identification of individual animals with hierarchical contrastive learning

Ines Nolasco, Ilyass Moummad, Dan Stowell, and Emmanouil Benetos. Acoustic identification of individual animals with hierarchical contrastive learning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025
[17]

URL https://doi.org/10.1038/s41597- 025-05281-5

doi: 10.1038/s41597-025-05281-5. URL https://doi.org/10.1038/s41597- 025-05281-5. Data descriptor for the DCLDE 2026 killer-whale annotation dataset. Michael A Pardo, Kurt Fristrup, David S Lolchuragi, Joyce H Poole, Petter Granli, Cynthia Moss, Iain Douglas- Hamilton, and George Wittemyer. African elephants address one another with individually specific ...

work page doi:10.1038/s41597-025-05281-5 2026
[18]

From massive detections and localisations of orca at orcalab over three years to real-time survey joint to environmental conditions

M Poupard, P Best, M Ferrari, P Spong, H Symonds, J-M Pr ´evot, T Soriano, and H Glotin. From massive detections and localisations of orca at orcalab over three years to real-time survey joint to environmental conditions. In e-Forum Acusticum 2020, pp. 3235–3237,

work page 2020
[19]

Can masked autoencoders also listen to birds?, 2025a

Lukas Rauch, Ren ´e Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds?, 2025a. URL https://arxiv.org/abs/2504.12880. Lukas Rauch, Raphael Schwinger, Moritz Wirth, Ren ´e Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Ste- fan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A l...

work page arXiv 2024
[20]

Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing

Eklavya Sarkar and Mathew Magimai Doss. Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025
[21]

The watkins marine mammal sound database: An online, freely accessible resource

doi: 10.1121/2.0000358. URL https://asa.scitation.org/doi/abs/ 10.1121/2.0000358. Julian C. Sch¨afer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, and Ariana Strandburg- Peshkin. animal2vec and meerkat: A self-supervised trans...

work page doi:10.1121/2.0000358
[22]

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

URL https://arxiv.org/abs/2406.01253. Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ,...

work page internal anchor Pith review Pith/arXiv arXiv
[23]

doi: 10.1121/1.4799597

ISSN 1939-800X. doi: 10.1121/1.4799597. URL https://doi.org/10. 1121/1.4799597. Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Bj ¨orn W Schuller, Christian J Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, et al. Hear: Holistic evaluation of audio representations. In NeurIPS 2021 Competitions and Demonstrations Track...

work page doi:10.1121/1.4799597 1939
[24]

librosa/librosa: 0.6.3,

Alvaro Vega-Hidalgo, Stefan Kahl, Laurel B Symes, Viviana Ruiz-Guti´errez, Ingrid Molina-Mora, Fernando Cediel, Luis Sandoval, and Holger Klinck. A collection of fully-annotated soundscape recordings from neotropical coffee farms in colombia and costa rica. Zenodo https://doi. org/10.5281/zenodo, 7525349,

work page doi:10.5281/zenodo
[25]

doi: 10.1101/2025.04.09. 648029. URL https://www.biorxiv.org/content/early/2025/04/10/2025.04.09.648029. Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux. Wham!: Extending speech separation to noisy environments. In Proceedings of Interspeech, September

work page doi:10.1101/2025.04.09 2025
[26]

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov

URL https://arxiv.org/abs/2404.16436. Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page arXiv 2023
[27]

arXiv preprint arXiv:2105.01051 , year=

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051,

work page arXiv
[28]

All cuts

ISSN 0003-3472. doi: https://doi.org/10.1016/j.anbehav.2003.07.016. URL https://www.sciencedirect.com/science/article/pii/S000334720400123X. 17 WHAT MATTERS FOR BIOACOUSTIC ENCODING A Experimental setup A.1 Evaluation Metrics We formalize the evaluation metrics we introduce in Section 3.4. We evaluate linear probing with accuracy for classification, and m...

work page doi:10.1016/j.anbehav.2003.07.016 2003
[29]

bio” dataset. “- Bio + General

From a baseline of (focal) bioacoustic data only, we show the performance of adding general audio, adding soundscape recordings, and ablating different taxonomic groups (whales, and then all taxa but birds.) Adding general audio to the training mix improved results overall, but in particular transferred consistently across our vocal repertoire datasets. T...

work page arXiv 2021

[1] [1]

IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research

Jakob Abeßer, Saichand Gourishetti, Andr´as K´atai, Tobias Clauß, Prachi Sharma, and Judith Liebetrau. IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 551–555. IEEE,

work page 2021

[2] [2]

Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures

Jules Cauzinille, Beno ˆıt Favre, Ricard Marxer, Dena Clink, Abdul Hamid Ahmad, and Arnaud Rey. Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures. In Interspeech 2024, pp. 132–136. ISCA; ISCA,

work page 2024

[3] [3]

Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn

doi: https://doi.org/10.25921/e12p-gj65. Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The inaturalist sounds dataset.Advances in Neural Information Processing Systems, 37:132524–132544,

work page doi:10.25921/e12p-gj65

[4] [4]

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei

URL https://arxiv.org/abs/2506.00343. Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pp. 5178–5193. PMLR,

work page arXiv

[5] [5]

Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong

Yarden Cohen, David Nicholson, Alexa Sanchioni, Emily K Mallaber, Viktoriya Skidanova, and Timothy J Gardner. Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong. BioRxiv, pp. 2020– 08,

work page 2020

[6] [6]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pp. 4171–4186,

work page 2019

[7] [7]

Julie E Elie and Frederic E Theunissen

URL https://arxiv.org/abs/2505.03071. Julie E Elie and Frederic E Theunissen. The vocal repertoire of the domesticated zebra finch: a data-driven approach to decipher the information-bearing acoustic features of communication signals. Animal cognition, 19(2):285–315,

work page arXiv

[8] [8]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In ICASSP 2023-2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Beans: The benchmark of animal sounds

Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2023

[10] [10]

arXiv preprint arXiv:2312.07439 , year=

URL https: //arxiv.org/abs/2312.07439. W Alexander Hopping, Christopher J Sayers, Noe Roger Huaraca-Charca, and Holger Klinck. Simultaneous passive acoustic monitoring uncovers evidence of potentially overlooked temporal variation in an amazonian bird commu- nity. Ibis, 166(3):986–1002,

work page arXiv

[11] [11]

doi: https://doi.org/10.1016/j.eswa.2021.115270

ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2021.115270. URL https://www.sciencedirect.com/science/article/pii/S0957417421007016. Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236,

work page doi:10.1016/j.eswa.2021.115270 2021

[12] [12]

A collection of fully-annotated soundscape recordings from the Northeastern United States, September 2022a

Stefan Kahl, Russell Charif, and Holger Klinck. A collection of fully-annotated soundscape recordings from the Northeastern United States, September 2022a. URL https://doi.org/10.5281/zenodo.7079380. Stefan Kahl, Connor M Wood, Philip Chaon, M Zachariah Peery, and Holger Klinck. A collection of fully-annotated soundscape recordings from the western united...

work page doi:10.5281/zenodo.7079380

[13] [13]

org/abs/2503.02389

URL https://arxiv. org/abs/2503.02389. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016a. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 2016 24th European Signal Processing Conference (E...

work page arXiv 2016

[14] [14]

Gill, Hanna Pamula, David Benvent, and Dan Stowell

Veronica Morfi, In ˆes Nolasco, Vincent Lostanlen, Shubhr Singh, Ariana Strandburg-Peshkin, Lisa F. Gill, Hanna Pamula, David Benvent, and Dan Stowell. Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge. In Detection and Classification of Acoustic Scenes and Events 2021,

work page 2021

[15] [15]

Animal sound archive

Museum f¨ur Naturkunde Berlin. Animal sound archive. https://doi.org/10.15468/0bpalr. Accessed via gbif.org 2023-05-09. 14 WHAT MATTERS FOR BIOACOUSTIC ENCODING Amanda Navine, Stefan Kahl, Ann Tanimoto-Johnson, Holger Klinck, and Patrick Hart. A collection of fully- annotated soundscape recordings from the island of hawai’i. Zenodo https://doi. org/10.528...

work page doi:10.15468/0bpalr 2023

[16] [16]

Acoustic identification of individual animals with hierarchical contrastive learning

Ines Nolasco, Ilyass Moummad, Dan Stowell, and Emmanouil Benetos. Acoustic identification of individual animals with hierarchical contrastive learning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025

[17] [17]

URL https://doi.org/10.1038/s41597- 025-05281-5

doi: 10.1038/s41597-025-05281-5. URL https://doi.org/10.1038/s41597- 025-05281-5. Data descriptor for the DCLDE 2026 killer-whale annotation dataset. Michael A Pardo, Kurt Fristrup, David S Lolchuragi, Joyce H Poole, Petter Granli, Cynthia Moss, Iain Douglas- Hamilton, and George Wittemyer. African elephants address one another with individually specific ...

work page doi:10.1038/s41597-025-05281-5 2026

[18] [18]

From massive detections and localisations of orca at orcalab over three years to real-time survey joint to environmental conditions

M Poupard, P Best, M Ferrari, P Spong, H Symonds, J-M Pr ´evot, T Soriano, and H Glotin. From massive detections and localisations of orca at orcalab over three years to real-time survey joint to environmental conditions. In e-Forum Acusticum 2020, pp. 3235–3237,

work page 2020

[19] [19]

Can masked autoencoders also listen to birds?, 2025a

Lukas Rauch, Ren ´e Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds?, 2025a. URL https://arxiv.org/abs/2504.12880. Lukas Rauch, Raphael Schwinger, Moritz Wirth, Ren ´e Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Ste- fan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A l...

work page arXiv 2024

[20] [20]

Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing

Eklavya Sarkar and Mathew Magimai Doss. Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025

[21] [21]

The watkins marine mammal sound database: An online, freely accessible resource

doi: 10.1121/2.0000358. URL https://asa.scitation.org/doi/abs/ 10.1121/2.0000358. Julian C. Sch¨afer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, and Ariana Strandburg- Peshkin. animal2vec and meerkat: A self-supervised trans...

work page doi:10.1121/2.0000358

[22] [22]

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

URL https://arxiv.org/abs/2406.01253. Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ,...

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

doi: 10.1121/1.4799597

ISSN 1939-800X. doi: 10.1121/1.4799597. URL https://doi.org/10. 1121/1.4799597. Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Bj ¨orn W Schuller, Christian J Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, et al. Hear: Holistic evaluation of audio representations. In NeurIPS 2021 Competitions and Demonstrations Track...

work page doi:10.1121/1.4799597 1939

[24] [24]

librosa/librosa: 0.6.3,

Alvaro Vega-Hidalgo, Stefan Kahl, Laurel B Symes, Viviana Ruiz-Guti´errez, Ingrid Molina-Mora, Fernando Cediel, Luis Sandoval, and Holger Klinck. A collection of fully-annotated soundscape recordings from neotropical coffee farms in colombia and costa rica. Zenodo https://doi. org/10.5281/zenodo, 7525349,

work page doi:10.5281/zenodo

[25] [25]

doi: 10.1101/2025.04.09. 648029. URL https://www.biorxiv.org/content/early/2025/04/10/2025.04.09.648029. Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux. Wham!: Extending speech separation to noisy environments. In Proceedings of Interspeech, September

work page doi:10.1101/2025.04.09 2025

[26] [26]

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov

URL https://arxiv.org/abs/2404.16436. Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page arXiv 2023

[27] [27]

arXiv preprint arXiv:2105.01051 , year=

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051,

work page arXiv

[28] [28]

All cuts

ISSN 0003-3472. doi: https://doi.org/10.1016/j.anbehav.2003.07.016. URL https://www.sciencedirect.com/science/article/pii/S000334720400123X. 17 WHAT MATTERS FOR BIOACOUSTIC ENCODING A Experimental setup A.1 Evaluation Metrics We formalize the evaluation metrics we introduce in Section 3.4. We evaluate linear probing with accuracy for classification, and m...

work page doi:10.1016/j.anbehav.2003.07.016 2003

[29] [29]

bio” dataset. “- Bio + General

From a baseline of (focal) bioacoustic data only, we show the performance of adding general audio, adding soundscape recordings, and ablating different taxonomic groups (whales, and then all taxa but birds.) Adding general audio to the training mix improved results overall, but in particular transferred consistently across our vocal repertoire datasets. T...

work page arXiv 2021