
arxiv: 2508.11845 · v3 · pith:2URAYBMVnew · submitted 2025-08-15 · 💻 cs.SD · cs.AI · cs.IR · cs.LG

AVEX: What Matters for Animal Vocalization Encoding

Pith reviewed 2026-05-18 22:11 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.IR · cs.LG
keywords bioacoustics · self-supervised learning · animal vocalization · audio encoders · species classification · machine learning · biodiversity monitoring

The pith

Self-supervised pre-training on mixed bioacoustics and general audio followed by supervised post-training produces the strongest encoders for animal vocalization tasks across 26 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs a large empirical study to find effective ways to train general-purpose encoders for bioacoustics work such as species classification, individual identification, and behavior analysis. The field often lacks enough labeled recordings, so reusable representations from pre-training can help many downstream tasks. The authors vary training data scale and diversity, model architectures, and training recipes while testing on a broad set of tasks and datasets. They find that self-supervised pre-training on a combined bioacoustics and general-audio corpus, followed by supervised post-training, gives the best results both on familiar data and on new distributions. The study also shows that data diversity matters in both training stages and releases checkpoints to let others build on the findings.

Core claim

We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages.

What carries the argument

The two-stage training recipe: self-supervised pre-training on a mixed bioacoustics and general-audio corpus, followed by supervised post-training.

If this is right

  • Encoders achieve state-of-the-art accuracy on species classification, detection, and individual identification tasks.
  • Results remain strong on out-of-distribution datasets not seen during training.
  • Data diversity in both pre-training and post-training stages is required for the performance gains.
  • The same recipe can be applied when larger bioacoustics datasets become available.
  • Released checkpoints allow direct use or further fine-tuning on new tasks without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General-audio data supplies useful sound representations that transfer to animal vocalizations.
  • These encoders could support automated, large-scale biodiversity monitoring where labeled data remain scarce.
  • The emphasis on data diversity over architecture choice may apply to other audio domains with limited labels.
  • Community use of the released models can accelerate testing on additional real-world conservation problems.

Load-bearing premise

Strong benchmark results on the 26 selected datasets will translate into practical gains for conservation and behavioral studies in real-world settings.

What would settle it

A new bioacoustics dataset from an unseen habitat or species group on which the two-stage mixed-corpus recipe performs no better than a standard supervised baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2508.11845 by Aza Raskin, Benjamin Hoffman, David Robinson, Diane Kim, Ellen Gilsenan-McMahon, Emmanuel Chemla, Felix Effenberger, Gagan Narula, Jane Lawton, Jen-Yu Liu, Maddie Cusimano, Marius Miron, Masato Hagiwara, Matthieu Geist, Milad Alizadeh, Olivier Pietquin, Sara Keen.

Figure 1
Figure 1: Our empirical study diagram, assessing (1) models, (2) training data, (3) training paradigms, and proposing (4) extended evaluation data and methodology.
Figure 2
Figure 2: (a) Win-rate of adding AudioSet in self-supervised pre-training vs. pure bioacoustic data, with average relative gain per metric. (b) Supervised encoders outperform self-supervised on BEANS classification, which is primarily focal recordings. However, self-supervised encoders suffer markedly smaller performance drops than supervised encoders when moving from focal recordings to soundscape (BEANS Detection…
Figure 3
Figure 3: Win-rate of post-trained SSL models vs. their raw SSL backbones. We plot the win-rates summing over all metrics for all our post-trained (EAT and BEATs) models, and show the average relative gain per model with respect to its base model.
Figure 4
Figure 4: Detailed transfer of training data to taxa and tasks in the BEANS benchmark. Heatmap shows the performance change for an EfficientNet trained on each data mix as compared to a baseline “bio” dataset. “- Bio + General” is trained on only AudioSet, “+ Soundscape” adds soundscape datasets, “- Whales” ablates all marine mammal recordings, “Birds only” removes all non-bird recordings.
Figure 5
Figure 5: Detailed transfer of training data to taxa and tasks in the BirdSet, Individual Identification, and Vocal Repertoire Discovery benchmarks. Heatmap shows the performance change for an EfficientNet trained on each data mix as compared to a baseline “bio” dataset. “- Bio + General” is trained on only AudioSet, “+ Soundscape” adds soundscape datasets, “- Whales” ablates all marine mammal recordings, “Birds onl…
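
The win-rates and average relative gains mentioned in the Figure 2 and Figure 3 captions can be illustrated with a short sketch. This page does not give the paper's exact definitions, so the code below assumes that a "win" means a strictly higher score on a metric where higher is better, and that relative gain is the score difference divided by the baseline score. The function name and the example metric names are hypothetical.

```python
from statistics import mean


def win_rate_and_gain(candidate, baseline):
    """Compare two models on the metrics they share (higher is better).

    candidate, baseline: dicts mapping a metric name to a score.
    Returns (win_rate, mean_relative_gain).
    Assumption: ties count as non-wins; baseline scores must be non-zero.
    """
    shared = sorted(set(candidate) & set(baseline))
    if not shared:
        raise ValueError("no shared metrics to compare")
    if any(baseline[m] == 0 for m in shared):
        raise ValueError("relative gain is undefined for a zero baseline")

    # Fraction of shared metrics on which the candidate is strictly better.
    wins = sum(candidate[m] > baseline[m] for m in shared)
    win_rate = wins / len(shared)

    # Average of per-metric relative improvements over the baseline.
    gains = [(candidate[m] - baseline[m]) / baseline[m] for m in shared]
    return win_rate, mean(gains)


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    post_trained = {"acc": 0.75, "map": 0.5, "auc": 0.5}
    backbone = {"acc": 0.5, "map": 0.5, "auc": 1.0}
    print(win_rate_and_gain(post_trained, backbone))
```

Reporting both numbers is useful because a high win-rate can coexist with a small average gain, and the reverse.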
Original abstract

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a large-scale empirical comparison of training strategies for general-purpose animal vocalization encoders. It evaluates the impact of data diversity/scale, model architectures, and training paradigms (including self-supervised pre-training and supervised post-training) across 26 datasets spanning species classification, detection, individual identification, and vocal repertoire discovery. The central claim is that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus produces the strongest in- and out-of-distribution results; the authors also highlight the value of data diversity and commit to releasing model checkpoints.

Significance. If the performance ordering holds under rigorous verification, the work supplies actionable guidance on effective training recipes for bioacoustic models, which could improve downstream applications in conservation and behavioral ecology where labeled data are scarce. The breadth of tasks/datasets and the planned checkpoint release are concrete strengths that support reproducibility and extension by the community.

major comments (2)
  1. [Abstract and evaluation-breadth paragraph] The central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.
  2. [Evaluation section] The claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.
minor comments (2)
  1. Clarify the precise definition of 'mixed bioacoustics + general-audio corpus' (e.g., relative proportions and source datasets) to allow readers to replicate the data-diversity finding.
  2. Add a short table summarizing the 26 datasets (task, species count, recording conditions) to make the evaluation breadth concrete.
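
To make the first minor comment concrete: one way a mixed corpus could be specified is by a per-source sampling weight. The sketch below is an assumption for illustration only. This page does not state the paper's actual mixing proportions or sampling scheme, and the function name, corpus names, and weights are all hypothetical.

```python
import random


def sample_mixed_batch(corpora, weights, batch_size, seed=None):
    """Draw a batch of (source, item) pairs from several corpora.

    corpora: dict mapping a source name to a non-empty list of items.
    weights: dict mapping the same source names to non-negative weights.
    Each draw first picks a source in proportion to its weight, then
    picks an item uniformly from that source.
    """
    if set(corpora) != set(weights):
        raise ValueError("corpora and weights must use the same source names")
    names = list(corpora)
    probs = [weights[n] for n in names]
    if any(w < 0 for w in probs) or sum(probs) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if any(weights[n] > 0 and not corpora[n] for n in names):
        raise ValueError("a source with positive weight has no items")

    rng = random.Random(seed)
    sources = rng.choices(names, weights=probs, k=batch_size)
    return [(src, rng.choice(corpora[src])) for src in sources]


if __name__ == "__main__":
    # Hypothetical file names and a hypothetical 70/30 mix.
    corpora = {
        "bioacoustics": ["bird_001.wav", "whale_014.wav", "frog_203.wav"],
        "general_audio": ["speech_77.wav", "traffic_12.wav"],
    }
    weights = {"bioacoustics": 0.7, "general_audio": 0.3}
    for source, item in sample_mixed_batch(corpora, weights, batch_size=5, seed=0):
        print(source, item)
```

Stating the weights explicitly, as in the `weights` dictionary here, is the kind of detail the referee is asking for.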

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and verifiability of our empirical study. We address each major comment below and have revised the manuscript accordingly to incorporate additional details and analyses.

Point-by-point responses
  1. Referee: [Abstract and evaluation-breadth paragraph] The central claim that the mixed-corpus recipe yields the strongest performance cannot be fully assessed because exact model sizes, hyperparameter search ranges, train/eval splits, and statistical testing procedures are not reported. These details are load-bearing for verifying that reported gains are not artifacts of implementation choices.

    Authors: We agree that these implementation details are necessary for rigorous verification of the performance ordering. In the revised manuscript we have expanded the Experimental Setup section with a dedicated 'Implementation Details' subsection. This now reports exact model sizes (parameter counts for each architecture), the full hyperparameter search ranges and final selected values, precise train/eval splits for every dataset, and the statistical procedures (including number of runs, bootstrap confidence intervals, and paired significance tests with p-values). These additions directly support assessment of the central claim. revision: yes

  2. Referee: [Evaluation section] The claim that superior benchmark results will translate to practical gains in conservation and behavioral studies rests on the assumption that the 26 datasets adequately sample acoustic variability, species diversity, and recording conditions. No species breakdown, noise-level statistics, or explicit check for overlap with the mixed training corpus is supplied, leaving open the possibility that the reported superiority is benchmark-specific rather than a robust property of the recipe.

    Authors: We acknowledge the importance of demonstrating that the evaluation suite is representative rather than benchmark-specific. The revised manuscript includes a new 'Evaluation Dataset Characterization' subsection that supplies (1) a species-level breakdown and overall taxonomic diversity across the 26 datasets, (2) noise-level statistics (estimated SNR distributions for recordings where metadata is available), and (3) an explicit overlap analysis with the mixed training corpus using audio fingerprinting and metadata comparison, which shows minimal overlap. We discuss the implications for generalizability to conservation and behavioral applications. revision: yes
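
The first response above mentions bootstrap confidence intervals and paired tests. As a rough illustration of what such a procedure might look like, the sketch below computes a percentile bootstrap interval for the mean per-dataset score difference between two recipes. It is not the authors' procedure: resampling at the dataset level, the number of resamples, and the function name are all assumptions made for this example.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (scores_a - scores_b).

    scores_a, scores_b: per-dataset scores of two models, in the same
    dataset order, so that each position forms a pair.
    Returns (observed_mean_difference, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # Work on paired differences so each dataset keeps its pairing.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample datasets with replacement and record the mean difference.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    low_idx = int((alpha / 2) * n_resamples)
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled[low_idx], resampled[high_idx]


if __name__ == "__main__":
    # Made-up per-dataset scores, purely for illustration.
    recipe = [0.82, 0.67, 0.91, 0.58, 0.74]
    baseline = [0.79, 0.66, 0.85, 0.60, 0.70]
    print(paired_bootstrap_ci(recipe, baseline))
```

An interval that excludes zero would suggest the difference is not explained by which datasets happened to be in the benchmark, although with few datasets such intervals are wide and should be read cautiously.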

Circularity Check

0 steps flagged

No circularity: empirical comparisons on held-out datasets

full rationale

The paper reports an empirical study that trains encoders under different recipes (self-supervised pre-training, supervised post-training, mixed corpora) and measures performance directly on 26 held-out datasets for tasks such as species classification and detection. No equations, parameter fittings, or uniqueness theorems are invoked; the central claim is a ranking of observed benchmark scores rather than a derivation that reduces to its own inputs by construction. Evaluation uses explicit train/eval splits on external data sources, satisfying the criteria for an independent, self-contained result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical machine-learning study; the central claim rests on standard transfer-learning assumptions rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Self-supervised representations learned on general audio transfer usefully to bioacoustic tasks when followed by supervised fine-tuning.
    Invoked when claiming the mixed-corpus recipe is optimal (abstract).

pith-pipeline@v0.9.0 · 5884 in / 1172 out tokens · 39941 ms · 2026-05-18T22:11:51.375124+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-layer attentive probing improves transfer of audio representations for bioacoustics

    cs.SD 2026-05 unverdicted novelty 7.0

    Multi-layer attentive probing outperforms last-layer linear probing for transferring audio representations to bioacoustic tasks, indicating that standard evaluation setups may underestimate model quality.

  2. Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

    cs.SD 2026-05 conditional novelty 6.0

    In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.
