AVEX: What Matters for Animal Vocalization Encoding
Pith reviewed 2026-05-18 22:11 UTC · model grok-4.3
The pith
Self-supervised pre-training on mixed bioacoustics and general audio followed by supervised post-training produces the strongest encoders for animal vocalization tasks across 26 datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages.
What carries the argument
The two-stage training recipe of self-supervised pre-training on a mixed bioacoustics and general-audio corpus followed by supervised post-training on bioacoustics data.
If this is right
- Encoders achieve state-of-the-art accuracy on species classification, detection, and individual identification tasks.
- Results remain strong on out-of-distribution datasets not seen during training.
- Data diversity in both pre-training and post-training stages is required for the performance gains.
- The same recipe can be applied when larger bioacoustics datasets become available.
- Released checkpoints allow direct use or further fine-tuning on new tasks without retraining from scratch.
Where Pith is reading between the lines
- General-audio data supplies useful sound representations that transfer to animal vocalizations.
- These encoders could support automated, large-scale biodiversity monitoring where labeled data remain scarce.
- The emphasis on data diversity over architecture choice may apply to other audio domains with limited labels.
- Community use of the released models can accelerate testing on additional real-world conservation problems.
Load-bearing premise
Strong benchmark results on the 26 selected datasets will translate into practical gains for conservation and behavioral studies in real-world settings.
What would settle it
A new bioacoustics dataset from an unseen habitat or species group on which the two-stage mixed-corpus recipe performs no better than a standard supervised baseline would falsify the central claim.
Figures
read the original abstract
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a large-scale empirical comparison of training strategies for general-purpose animal vocalization encoders. It evaluates the impact of data diversity/scale, model architectures, and training paradigms (including self-supervised pre-training and supervised post-training) across 26 datasets spanning species classification, detection, individual identification, and vocal repertoire discovery. The central claim is that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus produces the strongest in- and out-of-distribution results; the authors also highlight the value of data diversity and commit to releasing model checkpoints.
Significance. If the performance ordering holds under rigorous verification, the work supplies actionable guidance on effective training recipes for bioacoustic models, which could improve downstream applications in conservation and behavioral ecology where labeled data are scarce. The breadth of tasks/datasets and the planned checkpoint release are concrete strengths that support reproducibility and extension by the community.
major comments (2)
- [Abstract and evaluation-breadth paragraph] Abstract and evaluation-breadth paragraph: the central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.
- [Evaluation section] Evaluation section: the claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.
minor comments (2)
- Clarify the precise definition of 'mixed bioacoustics + general-audio corpus' (e.g., relative proportions and source datasets) to allow readers to replicate the data-diversity finding.
- Add a short table summarizing the 26 datasets (task, species count, recording conditions) to make the evaluation breadth concrete.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and verifiability of our empirical study. We address each major comment below and have revised the manuscript accordingly to incorporate additional details and analyses.
read point-by-point responses
-
Referee: [Abstract and evaluation-breadth paragraph] Abstract and evaluation-breadth paragraph: the central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.
Authors: We agree that these implementation details are necessary for rigorous verification of the performance ordering. In the revised manuscript we have expanded the Experimental Setup section with a dedicated 'Implementation Details' subsection. This now reports exact model sizes (parameter counts for each architecture), the full hyperparameter search ranges and final selected values, precise train/eval splits for every dataset, and the statistical procedures (including number of runs, bootstrap confidence intervals, and paired significance tests with p-values). These additions directly support assessment of the central claim. revision: yes
-
Referee: [Evaluation section] Evaluation section: the claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.
Authors: We acknowledge the importance of demonstrating that the evaluation suite is representative rather than benchmark-specific. The revised manuscript includes a new 'Evaluation Dataset Characterization' subsection that supplies (1) a species-level breakdown and overall taxonomic diversity across the 26 datasets, (2) noise-level statistics (estimated SNR distributions for recordings where metadata is available), and (3) an explicit overlap analysis with the mixed training corpus using audio fingerprinting and metadata comparison, which shows minimal overlap. We discuss the implications for generalizability to conservation and behavioral applications. revision: yes
Circularity Check
No circularity: empirical comparisons on held-out datasets
full rationale
The paper reports an empirical study that trains encoders under different recipes (self-supervised pre-training, supervised post-training, mixed corpora) and measures performance directly on 26 held-out datasets for tasks such as species classification and detection. No equations, parameter fittings, or uniqueness theorems are invoked; the central claim is a ranking of observed benchmark scores rather than a derivation that reduces to its own inputs by construction. Evaluation uses explicit train/eval splits on external data sources, satisfying the criteria for an independent, self-contained result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Self-supervised representations learned on general audio transfer usefully to bioacoustic tasks when followed by supervised fine-tuning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Multi-layer attentive probing improves transfer of audio representations for bioacoustics
Multi-layer attentive probing outperforms last-layer linear probing for transferring audio representations to bioacoustic tasks, indicating that standard evaluation setups may underestimate model quality.
-
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.
Reference graph
Works this paper leans on
-
[1]
IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research
Jakob Abeßer, Saichand Gourishetti, Andr´as K´atai, Tobias Clauß, Prachi Sharma, and Judith Liebetrau. IDMT-Traffic: an open benchmark dataset for acoustic traffic monitoring research. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 551–555. IEEE,
work page 2021
-
[2]
Jules Cauzinille, Beno ˆıt Favre, Ricard Marxer, Dena Clink, Abdul Hamid Ahmad, and Arnaud Rey. Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures. In Interspeech 2024, pp. 132–136. ISCA; ISCA,
work page 2024
-
[3]
Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn
doi: https://doi.org/10.25921/e12p-gj65. Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The inaturalist sounds dataset.Advances in Neural Information Processing Systems, 37:132524–132544,
-
[4]
URL https://arxiv.org/abs/2506.00343. Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pp. 5178–5193. PMLR,
-
[5]
Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong
Yarden Cohen, David Nicholson, Alexa Sanchioni, Emily K Mallaber, Viktoriya Skidanova, and Timothy J Gardner. Tweetynet: a neural network that enables high-throughput, automated annotation of birdsong. BioRxiv, pp. 2020– 08,
work page 2020
-
[6]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pp. 4171–4186,
work page 2019
-
[7]
Julie E Elie and Frederic E Theunissen
URL https://arxiv.org/abs/2505.03071. Julie E Elie and Frederic E Theunissen. The vocal repertoire of the domesticated zebra finch: a data-driven approach to decipher the information-bearing acoustic features of communication signals. Animal cognition, 19(2):285–315,
-
[8]
URL https://arxiv.org/abs/2407.21783. Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In ICASSP 2023-2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Beans: The benchmark of animal sounds
Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,
work page 2023
-
[10]
arXiv preprint arXiv:2312.07439 , year=
URL https: //arxiv.org/abs/2312.07439. W Alexander Hopping, Christopher J Sayers, Noe Roger Huaraca-Charca, and Holger Klinck. Simultaneous passive acoustic monitoring uncovers evidence of potentially overlooked temporal variation in an amazonian bird commu- nity. Ibis, 166(3):986–1002,
-
[11]
doi: https://doi.org/10.1016/j.eswa.2021.115270
ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2021.115270. URL https://www.sciencedirect.com/science/article/pii/S0957417421007016. Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236,
-
[12]
Stefan Kahl, Russell Charif, and Holger Klinck. A collection of fully-annotated soundscape recordings from the Northeastern United States, September 2022a. URL https://doi.org/10.5281/zenodo.7079380. Stefan Kahl, Connor M Wood, Philip Chaon, M Zachariah Peery, and Holger Klinck. A collection of fully-annotated soundscape recordings from the western united...
-
[13]
URL https://arxiv. org/abs/2503.02389. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016a. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 2016 24th European Signal Processing Conference (E...
-
[14]
Gill, Hanna Pamula, David Benvent, and Dan Stowell
Veronica Morfi, In ˆes Nolasco, Vincent Lostanlen, Shubhr Singh, Ariana Strandburg-Peshkin, Lisa F. Gill, Hanna Pamula, David Benvent, and Dan Stowell. Few-shot bioacoustic event detection: A new task at the dcase 2021 challenge. In Detection and Classification of Acoustic Scenes and Events 2021,
work page 2021
-
[15]
Museum f¨ur Naturkunde Berlin. Animal sound archive. https://doi.org/10.15468/0bpalr. Accessed via gbif.org 2023-05-09. 14 WHAT MATTERS FOR BIOACOUSTIC ENCODING Amanda Navine, Stefan Kahl, Ann Tanimoto-Johnson, Holger Klinck, and Patrick Hart. A collection of fully- annotated soundscape recordings from the island of hawai’i. Zenodo https://doi. org/10.528...
-
[16]
Acoustic identification of individual animals with hierarchical contrastive learning
Ines Nolasco, Ilyass Moummad, Dan Stowell, and Emmanouil Benetos. Acoustic identification of individual animals with hierarchical contrastive learning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,
work page 2025
-
[17]
URL https://doi.org/10.1038/s41597- 025-05281-5
doi: 10.1038/s41597-025-05281-5. URL https://doi.org/10.1038/s41597- 025-05281-5. Data descriptor for the DCLDE 2026 killer-whale annotation dataset. Michael A Pardo, Kurt Fristrup, David S Lolchuragi, Joyce H Poole, Petter Granli, Cynthia Moss, Iain Douglas- Hamilton, and George Wittemyer. African elephants address one another with individually specific ...
-
[18]
M Poupard, P Best, M Ferrari, P Spong, H Symonds, J-M Pr ´evot, T Soriano, and H Glotin. From massive detections and localisations of orca at orcalab over three years to real-time survey joint to environmental conditions. In e-Forum Acusticum 2020, pp. 3235–3237,
work page 2020
-
[19]
Can masked autoencoders also listen to birds?, 2025a
Lukas Rauch, Ren ´e Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds?, 2025a. URL https://arxiv.org/abs/2504.12880. Lukas Rauch, Raphael Schwinger, Moritz Wirth, Ren ´e Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Ste- fan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A l...
-
[20]
Eklavya Sarkar and Mathew Magimai Doss. Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,
work page 2025
-
[21]
The watkins marine mammal sound database: An online, freely accessible resource
doi: 10.1121/2.0000358. URL https://asa.scitation.org/doi/abs/ 10.1121/2.0000358. Julian C. Sch¨afer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, and Ariana Strandburg- Peshkin. animal2vec and meerkat: A self-supervised trans...
-
[22]
URL https://arxiv.org/abs/2406.01253. Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ,...
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
ISSN 1939-800X. doi: 10.1121/1.4799597. URL https://doi.org/10. 1121/1.4799597. Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Bj ¨orn W Schuller, Christian J Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, et al. Hear: Holistic evaluation of audio representations. In NeurIPS 2021 Competitions and Demonstrations Track...
-
[24]
Alvaro Vega-Hidalgo, Stefan Kahl, Laurel B Symes, Viviana Ruiz-Guti´errez, Ingrid Molina-Mora, Fernando Cediel, Luis Sandoval, and Holger Klinck. A collection of fully-annotated soundscape recordings from neotropical coffee farms in colombia and costa rica. Zenodo https://doi. org/10.5281/zenodo, 7525349,
-
[25]
doi: 10.1101/2025.04.09. 648029. URL https://www.biorxiv.org/content/early/2025/04/10/2025.04.09.648029. Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux. Wham!: Extending speech separation to noisy environments. In Proceedings of Interspeech, September
-
[26]
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov
URL https://arxiv.org/abs/2404.16436. Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,
-
[27]
arXiv preprint arXiv:2105.01051 , year=
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051,
-
[28]
ISSN 0003-3472. doi: https://doi.org/10.1016/j.anbehav.2003.07.016. URL https://www.sciencedirect.com/science/article/pii/S000334720400123X. 17 WHAT MATTERS FOR BIOACOUSTIC ENCODING A Experimental setup A.1 Evaluation Metrics We formalize the evaluation metrics we introduce in Section 3.4. We evaluate linear probing with accuracy for classification, and m...
-
[29]
bio” dataset. “- Bio + General
From a baseline of (focal) bioacoustic data only, we show the performance of adding general audio, adding soundscape recordings, and ablating different taxonomic groups (whales, and then all taxa but birds.) Adding general audio to the training mix improved results overall, but in particular transferred consistently across our vocal repertoire datasets. T...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.