Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Alexis Emanuelli; Chiara Semenzin; Faadil Mustun; German Sumbre; Gonzalo De Polavieja; Pierre Orhan; Roberto Dessi; Yair Lakretz

arxiv: 2606.12503 · v1 · pith:F7LPUWOXnew · submitted 2026-06-10 · 💻 cs.LG · cs.SD

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

Pith reviewed 2026-06-27 10:46 UTC · model grok-4.3

classification 💻 cs.LG cs.SD

keywords: dolphin vocalizations, self-supervised learning, bioacoustics, signature whistles, representation learning, animal communication, Wav2Vec

The pith

Dolph2Vec, trained only on dolphin recordings, outperforms general audio models on whistle classification and detection while its codebook units match known whistle categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors collect five years of recordings from five known dolphins and train Dolph2Vec, a version of the Wav2Vec2.0 architecture, exclusively on this data. They show the resulting embeddings beat general-purpose baselines on two tasks: identifying signature whistles and detecting whistles in recordings. The discrete units in the model's codebook further organize in ways that line up with established dolphin whistle types and hint at sub-whistle elements. This positions self-supervised learning as a way to study animal vocal systems at finer scale without heavy manual labeling. A reader would care because it turns large unlabeled audio into both a practical classifier and a potential map of communication structure.

Core claim

Dolph2Vec, the first large-scale species-specific self-supervised model trained on longitudinal recordings from five dolphins, produces embeddings that outperform general-purpose baselines on signature whistle classification and whistle detection; its learned codebook structure further captures interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure.

What carries the argument

Dolph2Vec, the adapted Wav2Vec2.0 model whose discrete codebook units function as acoustic building blocks that align with known whistle categories.
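
One way to make "codebook units align with whistle categories" concrete is to count how often each discrete unit fires within each labeled category. The sketch below is illustrative only and is not the paper's analysis code: the function names, the `(category, unit_id)` frame format, and the toy labels `SW_A` and `SW_B` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def codebook_usage_by_category(frames):
    """Return {category: {unit_id: fraction of that category's frames}}.

    frames: iterable of (category_label, codebook_unit_id) pairs, one per frame.
    """
    counts = defaultdict(Counter)
    for category, unit in frames:
        counts[category][unit] += 1
    usage = {}
    for category, unit_counts in counts.items():
        total = sum(unit_counts.values())
        usage[category] = {u: c / total for u, c in unit_counts.items()}
    return usage


def unit_purity(frames):
    """For each unit, the share of its frames that fall in its most common category."""
    by_unit = defaultdict(Counter)
    for category, unit in frames:
        by_unit[unit][category] += 1
    return {u: max(c.values()) / sum(c.values()) for u, c in by_unit.items()}


# Toy, made-up frames (hypothetical labels and unit ids, not data from the paper).
toy_frames = [
    ("SW_A", 3), ("SW_A", 3), ("SW_A", 7),
    ("SW_B", 7), ("SW_B", 7), ("SW_B", 12),
]
usage = codebook_usage_by_category(toy_frames)
purity = unit_purity(toy_frames)
```

A unit with purity near 1 is used almost exclusively by one category, while a unit shared across categories is a candidate for the kind of sub-whistle element the paper hints at.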

If this is right

Fine-grained study of dolphin communication patterns can proceed with reduced need for manual annotation of individual calls.
Sub-whistle acoustic elements become observable through the organization of the learned codebook.
Self-supervised models can function simultaneously as performance tools and as instruments for generating hypotheses about animal signaling.
Species-specific training data yields representations better matched to within-species structure than cross-species models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the codebook units remain stable when the model is applied to new individuals, they could form the basis for a standardized inventory of dolphin sound elements.
The same training approach could be used to compare communication structure across different dolphin populations or related species.
Longer recordings that include social context might reveal whether the learned units combine in systematic ways during interactions.

Load-bearing premise

Recordings from five dolphins in one semi-naturalistic setting supply a sufficient sample for learning generalizable features of dolphin communication.

What would settle it

Testing the same model on vocalizations from a separate group of dolphins recorded in a different environment and finding that classification accuracy falls to the level of general baselines or that codebook units no longer align with whistle categories.
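
The test described above depends on an evaluation set that shares no animals or sites with the training data. A minimal sketch of such a split follows, assuming each clip carries a hypothetical `group` field (for example a pod or site identifier). The field name, the helper `split_by_group`, and the toy clips are illustrative, and how the held-out side is then scored is left open.

```python
def split_by_group(samples, held_out_groups):
    """Partition samples so that no recording group appears on both sides.

    samples: list of dicts with a hypothetical "group" key (e.g. a pod or site id).
    held_out_groups: group ids reserved for out-of-domain evaluation.
    """
    held_out_groups = set(held_out_groups)
    in_domain = [s for s in samples if s["group"] not in held_out_groups]
    out_of_domain = [s for s in samples if s["group"] in held_out_groups]
    # Sanity check: the two sides must not share any group.
    shared = {s["group"] for s in in_domain} & {s["group"] for s in out_of_domain}
    assert not shared
    return in_domain, out_of_domain


# Toy, made-up clips (hypothetical ids).
toy = [
    {"group": "pod_A", "clip": "a1"},
    {"group": "pod_A", "clip": "a2"},
    {"group": "pod_B", "clip": "b1"},
]
in_domain, out_of_domain = split_by_group(toy, {"pod_B"})
```

Splitting by group rather than by clip is the point of the design: a random clip-level split would leave the same animals on both sides and could not distinguish species-level features from individual ones.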

Figures

Figures reproduced from arXiv: 2606.12503 by Alexis Emanuelli, Chiara Semenzin, Faadil Mustun, German Sumbre, Gonzalo De Polavieja, Pierre Orhan, Roberto Dessi, Yair Lakretz.

**Figure 2.** A) The Wav2Vec2.0 architecture used in Dolph2Vec. Raw audio is encoded into latent representations by a convolutional feature encoder, discretized via a quantization module with a learned codebook, and contextualized using a Transformer network. B) Downstream tasks: Top: whistle detection on spectrograms with highlighted whistles; Bottom: whistle classification of three distinct signature whistles from diffe…

**Figure 3.** A) UMAP projection of learned embeddings from AVES-bio, BioLingual and …

**Figure 4.** A) Codebook activations by signature whistle category in …

**Figure 5.** An aerial photo of our data collection setup.

**Figure 6.** Pretraining losses over total training steps.

**Figure 7.** Second Codebook activations by signature whistle category in Dolph2Vec trained …

Original abstract

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 Baevski et al. (2020) architecture to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dolph2Vec brings a new multi-year dolphin dataset and the first species-specific SSL model, but the abstract supplies no numbers for its performance claims and the five-animal sample makes generalization to broader communication structure uncertain.

The one or two things to know: Dolph2Vec is the first SSL model trained exclusively on dolphin data using a new five-year dataset from five known individuals, and it claims better results on signature whistle tasks plus interpretable embeddings. But the abstract gives no numbers to back the performance claims, and the small sample raises real questions about whether the model learns species-general features or just individual ones.

The new dataset stands out as a solid contribution. Longitudinal recordings from identified dolphins over multiple years are hard to come by, and releasing it would help the field. Adapting Wav2Vec2 specifically to this data is a direct way to get domain-adapted representations without relying on cross-species models.

The soft spots are in the evaluation and generalization. Without quantitative metrics, error bars, or details on how the data was split, the "significantly outperforms" statement is hard to assess. The concern about the five-dolphin sample follows directly from the abstract: all training and test data come from the same animals in one semi-naturalistic setting, so apparent alignment with whistle categories could reflect individual signatures instead of broader communication structure. The paper would need to show results on new dolphins or different locations to support the broader claims about studying dolphin communication systems.
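
To illustrate what an error bar on "outperforms" could look like, the sketch below computes a paired percentile-bootstrap interval for the accuracy gap between two models scored on the same test clips. It is one common choice and not the paper's protocol. The function name, the 95% level, the resample count, and the toy correctness flags are all assumptions.

```python
import random


def paired_bootstrap_accuracy_gap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed_gap, lower, upper) for accuracy(a) - accuracy(b).

    correct_a, correct_b: lists of 0/1 flags for the same test clips, in the same order.
    The interval is a 95% percentile bootstrap over clips, resampled in pairs.
    """
    assert correct_a and len(correct_a) == len(correct_b)
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / n
    gaps = []
    for _ in range(n_resamples):
        # Resample clip indices with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Toy, made-up correctness flags for ten test clips (not results from the paper).
model_a = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
model_b = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
observed, lower, upper = paired_bootstrap_accuracy_gap(model_a, model_b)
```

An interval that excludes zero would support the claimed gain on that test set. Because clips from the same animal or session are not independent, resampling by session or individual would give a more honest interval than resampling by clip.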

The math and methods are standard adaptations, with no obvious circularity in the empirical benchmarks.

This paper is aimed at bioacoustics researchers focused on dolphins or similar species-specific modeling. Readers working on limited-data SSL applications might find the approach useful once full results are available. It deserves a serious referee because the dataset is novel and the questions it raises about tailored models are worth checking with proper experiments.

I recommend sending it to peer review, but the authors should add the missing quantitative details and address the individual-versus-species issue with additional tests.

Referee Report

2 major / 1 minor

Summary. The paper introduces Dolph2Vec, an adaptation of the Wav2Vec2.0 architecture trained exclusively on a new longitudinal dataset of dolphin vocalizations collected over five years from five known individuals in a semi-naturalistic environment. It claims that the model significantly outperforms general-purpose baselines on signature whistle classification and whistle detection tasks, and that the learned embeddings and codebook units capture interpretable acoustic structure aligned with established whistle categories (and possibly sub-whistle units), thereby serving as a tool for studying dolphin communication systems.

Significance. If the empirical results hold under proper controls for generalization, the work would supply both a reusable species-specific SSL model and a public dataset that could accelerate hypothesis-driven research in bioacoustics; the explicit release of five years of labeled longitudinal recordings from identified animals is a concrete strength.

major comments (2)

[Abstract] The central framing that the model enables study of “dolphin communication systems” (rather than the communication of these five individuals) is load-bearing yet unsupported; all training data, signature-whistle labels, and evaluation sets come from the same five dolphins at one site, so any performance gain or apparent alignment with whistle categories could reflect individual or site-specific acoustic idiosyncrasies instead of transferable species-level features.
[Abstract] The assertion of “significant outperformance” on both tasks is presented without any quantitative metrics, error bars, dataset-split protocol, or statistical tests, rendering the magnitude and reliability of the claimed gains impossible to evaluate from the provided text.

minor comments (1)

[Abstract] The citation “Wav2Vec2.0 Baevski et al. (2020)” should be expanded to the full reference on first use for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that both points identify areas where the current text overstates scope and lacks supporting detail, and we will revise accordingly.

Referee: [Abstract] The central framing that the model enables study of “dolphin communication systems” (rather than the communication of these five individuals) is load-bearing yet unsupported; all training data, signature-whistle labels, and evaluation sets come from the same five dolphins at one site, so any performance gain or apparent alignment with whistle categories could reflect individual or site-specific acoustic idiosyncrasies instead of transferable species-level features.

Authors: We agree the framing risks implying species-wide transferability that the five-individual, single-site dataset does not demonstrate. 'Species-specific' in the manuscript denotes training exclusively on dolphin vocalizations rather than a multi-species corpus. We will revise the abstract to refer explicitly to vocalizations from these five individuals and add a limitations paragraph discussing possible individual or site-specific idiosyncrasies. Revision: yes.

Referee: [Abstract] The assertion of “significant outperformance” on both tasks is presented without any quantitative metrics, error bars, dataset-split protocol, or statistical tests, rendering the magnitude and reliability of the claimed gains impossible to evaluate from the provided text.

Authors: The detailed metrics, splits, error bars, and statistical tests appear in the experimental sections of the full manuscript. However, we accept that the abstract claim is insufficiently substantiated on its own. We will incorporate concise quantitative results and a reference to the evaluation protocol into the revised abstract. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical adaptation and benchmarks are self-contained

The paper adapts the external Wav2Vec2.0 architecture to a new longitudinal dolphin dataset and reports empirical results on signature whistle classification and detection tasks, plus qualitative embedding inspection. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters or self-citations; the central claims rest on performance deltas against general-purpose baselines and visual alignment with whistle categories, all externally falsifiable via the released data and standard SSL training. This matches the default expectation of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the five-dolphin dataset, the transferability of the Wav2Vec2.0 masking objective to dolphin vocalizations, and standard assumptions of deep learning optimization; full text would be needed to enumerate all hyperparameters and data-processing choices.

free parameters (1)

Wav2Vec2.0 adaptation hyperparameters
Learning rate, masking probability, codebook size and other training choices are fitted or chosen to make the model work on the new domain.

axioms (1)

domain assumption: The Wav2Vec2.0 self-supervised objective remains effective when the training distribution is restricted to a single species' vocalizations.
Invoked by the decision to train exclusively on the dolphin data rather than mixed-species corpora.

Reference graph

Works this paper leans on

33 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli

URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf. Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Ni...

2020
[2]

Peter C Bermant

URLhttps://arxiv.org/ abs/2303.10931. Peter C Bermant. Biocppnet: automatic bioacoustic source separation with deep neural networks.Scientific Reports, 11(1):23502,

arXiv
[3]

URLhttps://www.biorxiv.org/content/early/2022/10/16/2022.10.12.511740.1

doi: 10.1101/2022.10.12.511740. URLhttps://www.biorxiv.org/content/early/2022/10/16/2022.10.12.511740.1. Daniel T Blumstein, Daniel J Mennill, Patrick Clemins, Lewis Girod, Kung Yao, Gail Patricelli, Jill L Deppe, Alan H Krakauer, Christopher Clark, Kathryn A Cortopassi, et al. Acoustic monitoring in terrestrial environments using microphone arrays: appli...

work page doi:10.1101/2022.10.12.511740 2022
[4]

Enriching word vectors with subword information.Transactions of the Association for Computational Linguistics, 5: 135–146, 2017

doi: 10.1162/tacl_a_00051. URLhttps://aclanthology. org/Q17-1010/. John R Buck and Peter L Tyack. A quantitative measure of similarity for tursiops truncatus signature whistles.The Journal of the Acoustical Society of America, 94(5):2497–2506,

work page doi:10.1162/tacl_a_00051
[5]

Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures

JulesCauzinille, BenoîtFavre, RicardMarxer, DenaClink, AbdulHamidAhmad, andArnaud Rey. Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures. InInterspeech 2024, pages 132–136. ISCA; ISCA,

2024
[6]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020a. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xi...

2020
[7]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , volume=

doi: 10.1109/JSTSP.2022.3188113. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020b. Richard C Connor and Rachel A Smolker. ’pop’goes the dolphin: A vocalization male bottle...

work page doi:10.1109/jstsp.2022.3188113 2022
[8]

2023.10400331

doi: 10.1109/ICSPCC59353. 2023.10400331. Francesco Di Nardo, Rocco De Marco, Alessandro Lucchetti, and David Scaradozzi. A wav file dataset of bottlenose dolphin whistles, clicks, and pulse sounds during trawling interactions.Scientific Data, 10:650,

work page doi:10.1109/icspcc59353 2023
[9]

URL https://doi.org/10.1038/s41597-023-02547-8

doi: 10.1038/s41597-023-02547-8. URL https://doi.org/10.1038/s41597-023-02547-8. John Firth. A synopsis of linguistic theory, 1930-1955.Studies in linguistic analysis, pages 10–32,

work page doi:10.1038/s41597-023-02547-8 1930
[10]

URLhttps://doi.org/10.1038/s41598-025-00996-2

doi: 10.1038/s41598-025-00996-2. URLhttps://doi.org/10.1038/s41598-025-00996-2. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new ...

work page doi:10.1038/s41598-025-00996-2
[11]

URL https://proceedings.neurips.cc/paper/2020/file/ f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf. G. Gubnitsky, Y. Mevorach, S. Gero, et al. Automatic detection and annotation of eastern caribbean sperm whale codas.Scientific Reports, 15(12790),

2020
[12]

URLhttps://doi.org/10.1038/s41598-025-97009-z

doi: 10.1038/s41598-025-97009-z. URLhttps://doi.org/10.1038/s41598-025-97009-z. Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision,

work page doi:10.1038/s41598-025-97009-z
[13]

Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian

URL https://arxiv.org/abs/2210.14493. Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. InICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5,

arXiv 2023
[14]

A learnable spatial mapping for decoding the directional focus of auditory attention using EEG,

doi: 10.1109/ICASSP49357.2023.10096686. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June

work page doi:10.1109/icassp49357.2023.10096686 2023
[15]

Tyack, Randall S

Frants Havmand Jensen, Piper Wolters, Louisa van Zeeland, Evan Morrison, Gracie Ermi, Scott Smith, Peter L. Tyack, Randall S. Wells, Sam McKennoch, Vincent M. Janik, and Laela S. Sayigh.Automatic Deep-Learning-Based Classification of Bot- tlenose Dolphin Signature Whistles, pages 2059–2070. Springer International Publishing, Cham,

2059
[16]

URL https://doi.org/10.1007/ 978-3-031-50256-9_143

doi: 10.1007/978-3-031-50256-9_143. URL https://doi.org/10.1007/ 978-3-031-50256-9_143. Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring.Ecological Informatics, 61:101236,

work page doi:10.1007/978-3-031-50256-9_143
[17]

Lecun, L

doi: 10.1109/5.726791. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.Nature, 521:436–444,

work page doi:10.1109/5.726791
[18]

doi: 10.1111/j.1439-0310.1995.tb00325.x

ISSN 0179-1613. doi: 10.1111/j.1439-0310.1995.tb00325.x. URL http://dx.doi.org/10.1111/J.1439-0310. 1995.TB00325.X. Paulius Micikevicius, Sharan Narang, Jonah Alben, Greg Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740,

work page doi:10.1111/j.1439-0310.1995.tb00325.x 1995
[19]

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean

doi: 10.1109/ICASSP.2011.5947611. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space,

work page doi:10.1109/icassp.2011.5947611 2011
[20]

doi: 10.1121/1.1496079

ISSN 0001-4966. doi: 10.1121/1.1496079. URLhttp://dx.doi.org/10.1121/1.1496079. 13 Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, and Shinji Watanabe. Self-supervised speech representation learning: A review. IEEE Journal of Selec...

work page doi:10.1121/1.1496079
[21]

Faadil Mustun, Chiara Semenzin, Dean Rance, Emiliano Marachlian, Zohria-Lys Guillerm, Agathe Mancini, Inès Bouaziz, Elisabeth Fleck, Nadav Shashar, Gonzalo G de Polavieja, et al

1109/JSTSP.2022.3207050. Faadil Mustun, Chiara Semenzin, Dean Rance, Emiliano Marachlian, Zohria-Lys Guillerm, Agathe Mancini, Inès Bouaziz, Elisabeth Fleck, Nadav Shashar, Gonzalo G de Polavieja, et al. Whistle variability and social acoustic interactions in bottlenose dolphins.bioRxiv, pages 2024–10,

arXiv 2022
[22]

Machine learning for efficient segregation and labeling of potential biological sounds in long-term underwater recordings.Frontiers in Remote Sensing, Volume 5 - 2024,

Clea Parcerisas, Elena Schall, Kees te Velde, Dick Botteldooren, Paul Devos, and Elisabeth Debusschere. Machine learning for efficient segregation and labeling of potential biological sounds in long-term underwater recordings.Frontiers in Remote Sensing, Volume 5 - 2024,

2024
[23]

doi: 10.3389/frsen.2024.1390687

ISSN 2673-6187. doi: 10.3389/frsen.2024.1390687. URLhttps://www.frontiersin. org/journals/remote-sensing/articles/10.3389/frsen.2024.1390687. Michael A Pardo, Kurt Fristrup, David S Lolchuragi, Joyce H Poole, Petter Granli, Cynthia Moss, Iain Douglas-Hamilton, and George Wittemyer. African elephants address one another with individually specific name-like...

work page doi:10.3389/frsen.2024.1390687 2024
[24]

Transferable models for bioacoustics with human language supervision

David Robinson, Adelaide Robinson, and Lily Akrapongpisak. Transferable models for bioacoustics with human language supervision. InICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1316–1320,

2024
[25]

In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

doi: 10.1109/ICASSP48485.2024.10447250. David Robinson, Marius Miron, Masato Hagiwara, and Olivier Pietquin. NatureLM-audio: an audio-language foundation model for bioacoustics. InThe Thirteenth International Conference on Learning Representations,

work page doi:10.1109/icassp48485.2024.10447250 2024
[26]

Eklavya Sarkar and Mathew Magimai

doi: 10.1109/CRV.2019.00010. Eklavya Sarkar and Mathew Magimai. Doss. Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing,

work page doi:10.1109/crv.2019.00010 2019
[27]

14 Laela Sayigh, Mary Ann Daher, Julie Allen, Helen Gordon, Katherine Joyce, Claire Stuhlmann, and Peter Tyack

URLhttps://arxiv.org/abs/2501.05987. 14 Laela Sayigh, Mary Ann Daher, Julie Allen, Helen Gordon, Katherine Joyce, Claire Stuhlmann, and Peter Tyack. The watkins marine mammal sound database: an on- line, freely accessible resource. InProceedings of Meetings on Acoustics, volume

arXiv
[28]

Very Deep Convolutional Networks for Large-Scale Image Recognition

doi: 10.1038/s41467-024-47221-8. URLhttps://doi.org/10.1038/s41467-024-47221-8. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41467-024-47221-8
[29]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,

2023
[30]

A Additonal related work: self-super vised learning The traditional supervised learning approach for dolphin vocalization embeddings has long been criticized for enforcing a human-biased perspective Schlenker et al. (2022). This bias stems from linking each vocalization directly to expert annotations or predefined features assumed by humans to be importan...

2022
[31]

C Dataset properties Studying a small, stable pod of five dolphins across several years provides advantages rarely available in animal communication research

Data acquisition was automated using scheduled crontab commands. C Dataset properties Studying a small, stable pod of five dolphins across several years provides advantages rarely available in animal communication research. The individuals’ sex, family history, and kinship relations are well documented Mustun et al. (2024); Perelberg et al. (2010), enabli...

2024
[32]

Category Count SW_Luna 2,934 SW_Neo 2,239 SW_Nikita 888 NSW_9 658 SW_Yosefa 626 SW_Nana 521 SW_Dana 335 SW_Shy 81 NSW_3 45 NSW_6 27 Table 3: Distribution of annotated whistle categories. To balance categories, all classes with at least 500 examples were subsampled to 500 instances each (SW_Luna, SW_Neo, SW_Nikita, NSW_9, SW_Yosefa, SW_Nana), ensuring an e...

2019
[33]

Wav2Vec2 refers to the base model pretrained on human speech at 16 kHz Baevski et al. (2020).Dolph2Vec-random init denotes theDolph2Vecmodel with randomly initialized weights, i.e., before any self-supervised pretraining.Dolph2Vec-shuffled is a variant ofDolph2Vecin which the temporal structure of the learned representations is disrupted by shuffling the ...

2020

[1] [1]

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli

URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf. Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Ni...

2020

[2] [2]

Peter C Bermant

URLhttps://arxiv.org/ abs/2303.10931. Peter C Bermant. Biocppnet: automatic bioacoustic source separation with deep neural networks.Scientific Reports, 11(1):23502,

arXiv

[3] [3]

URLhttps://www.biorxiv.org/content/early/2022/10/16/2022.10.12.511740.1

doi: 10.1101/2022.10.12.511740. URLhttps://www.biorxiv.org/content/early/2022/10/16/2022.10.12.511740.1. Daniel T Blumstein, Daniel J Mennill, Patrick Clemins, Lewis Girod, Kung Yao, Gail Patricelli, Jill L Deppe, Alan H Krakauer, Christopher Clark, Kathryn A Cortopassi, et al. Acoustic monitoring in terrestrial environments using microphone arrays: appli...

work page doi:10.1101/2022.10.12.511740 2022

[4] [4]

Enriching word vectors with subword information.Transactions of the Association for Computational Linguistics, 5: 135–146, 2017

doi: 10.1162/tacl_a_00051. URLhttps://aclanthology. org/Q17-1010/. John R Buck and Peter L Tyack. A quantitative measure of similarity for tursiops truncatus signature whistles.The Journal of the Acoustical Society of America, 94(5):2497–2506,

work page doi:10.1162/tacl_a_00051

[5] [5]

Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures

JulesCauzinille, BenoîtFavre, RicardMarxer, DenaClink, AbdulHamidAhmad, andArnaud Rey. Investigating self-supervised speech models’ ability to classify animal vocalizations: The case of gibbon’s vocal signatures. InInterspeech 2024, pages 132–136. ISCA; ISCA,

2024

[6] [6]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020a. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xi...

2020

[7] [7]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , volume=

doi: 10.1109/JSTSP.2022.3188113. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020b. Richard C Connor and Rachel A Smolker. ’pop’goes the dolphin: A vocalization male bottle...

work page doi:10.1109/jstsp.2022.3188113 2022

[8] [8]

2023.10400331

doi: 10.1109/ICSPCC59353. 2023.10400331. Francesco Di Nardo, Rocco De Marco, Alessandro Lucchetti, and David Scaradozzi. A wav file dataset of bottlenose dolphin whistles, clicks, and pulse sounds during trawling interactions.Scientific Data, 10:650,

work page doi:10.1109/icspcc59353 2023

[9] [9]

URL https://doi.org/10.1038/s41597-023-02547-8

doi: 10.1038/s41597-023-02547-8. URL https://doi.org/10.1038/s41597-023-02547-8. John Firth. A synopsis of linguistic theory, 1930-1955.Studies in linguistic analysis, pages 10–32,

work page doi:10.1038/s41597-023-02547-8 1930

[10] [10]

URLhttps://doi.org/10.1038/s41598-025-00996-2

doi: 10.1038/s41598-025-00996-2. URLhttps://doi.org/10.1038/s41598-025-00996-2. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new ...

work page doi:10.1038/s41598-025-00996-2

[11] [11]

URL https://proceedings.neurips.cc/paper/2020/file/ f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf. G. Gubnitsky, Y. Mevorach, S. Gero, et al. Automatic detection and annotation of eastern caribbean sperm whale codas.Scientific Reports, 15(12790),

2020

[12] [12]

URLhttps://doi.org/10.1038/s41598-025-97009-z

doi: 10.1038/s41598-025-97009-z. URLhttps://doi.org/10.1038/s41598-025-97009-z. Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision,

[13]

URL https://arxiv.org/abs/2210.14493. Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, and Katie Zacarian. Beans: The benchmark of animal sounds. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5,

[14]


doi: 10.1109/ICASSP49357.2023.10096686. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June

[15]

Frants Havmand Jensen, Piper Wolters, Louisa van Zeeland, Evan Morrison, Gracie Ermi, Scott Smith, Peter L. Tyack, Randall S. Wells, Sam McKennoch, Vincent M. Janik, and Laela S. Sayigh. Automatic Deep-Learning-Based Classification of Bottlenose Dolphin Signature Whistles, pages 2059–2070. Springer International Publishing, Cham,

[16]

doi: 10.1007/978-3-031-50256-9_143. URL https://doi.org/10.1007/978-3-031-50256-9_143. Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236,

[17]

doi: 10.1109/5.726791. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444,

[18]

ISSN 0179-1613. doi: 10.1111/j.1439-0310.1995.tb00325.x. URL http://dx.doi.org/10.1111/J.1439-0310.1995.TB00325.X. Paulius Micikevicius, Sharan Narang, Jonah Alben, Greg Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

[19]

doi: 10.1109/ICASSP.2011.5947611. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space,

[20]

ISSN 0001-4966. doi: 10.1121/1.1496079. URL http://dx.doi.org/10.1121/1.1496079. Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, and Shinji Watanabe. Self-supervised speech representation learning: A review. IEEE Journal of Selec...

[21]

1109/JSTSP.2022.3207050. Faadil Mustun, Chiara Semenzin, Dean Rance, Emiliano Marachlian, Zohria-Lys Guillerm, Agathe Mancini, Inès Bouaziz, Elisabeth Fleck, Nadav Shashar, Gonzalo G de Polavieja, et al. Whistle variability and social acoustic interactions in bottlenose dolphins. bioRxiv, pages 2024–10,

[22]

Clea Parcerisas, Elena Schall, Kees te Velde, Dick Botteldooren, Paul Devos, and Elisabeth Debusschere. Machine learning for efficient segregation and labeling of potential biological sounds in long-term underwater recordings. Frontiers in Remote Sensing, Volume 5 - 2024,

[23]

ISSN 2673-6187. doi: 10.3389/frsen.2024.1390687. URL https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1390687. Michael A Pardo, Kurt Fristrup, David S Lolchuragi, Joyce H Poole, Petter Granli, Cynthia Moss, Iain Douglas-Hamilton, and George Wittemyer. African elephants address one another with individually specific name-like...

[24]

David Robinson, Adelaide Robinson, and Lily Akrapongpisak. Transferable models for bioacoustics with human language supervision. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1316–1320,

[25]

doi: 10.1109/ICASSP48485.2024.10447250. David Robinson, Marius Miron, Masato Hagiwara, and Olivier Pietquin. NatureLM-audio: an audio-language foundation model for bioacoustics. In The Thirteenth International Conference on Learning Representations,

[26]

doi: 10.1109/CRV.2019.00010. Eklavya Sarkar and Mathew Magimai.-Doss. Comparing self-supervised learning models pre-trained on human speech and animal vocalizations for bioacoustics processing,

[27]

URL https://arxiv.org/abs/2501.05987. Laela Sayigh, Mary Ann Daher, Julie Allen, Helen Gordon, Katherine Joyce, Claire Stuhlmann, and Peter Tyack. The Watkins marine mammal sound database: an online, freely accessible resource. In Proceedings of Meetings on Acoustics, volume

[28]

doi: 10.1038/s41467-024-47221-8. URL https://doi.org/10.1038/s41467-024-47221-8. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556,

[29]

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE,


A Additional related work: self-supervised learning

The traditional supervised learning approach for dolphin vocalization embeddings has long been criticized for enforcing a human-biased perspective Schlenker et al. (2022). This bias stems from linking each vocalization directly to expert annotations or predefined features assumed by humans to be importan...


Data acquisition was automated using scheduled crontab commands.

C Dataset properties

Studying a small, stable pod of five dolphins across several years provides advantages rarely available in animal communication research. The individuals’ sex, family history, and kinship relations are well documented Mustun et al. (2024); Perelberg et al. (2010), enabli...


Category     Count
SW_Luna      2,934
SW_Neo       2,239
SW_Nikita      888
NSW_9          658
SW_Yosefa      626
SW_Nana        521
SW_Dana        335
SW_Shy          81
NSW_3           45
NSW_6           27

Table 3: Distribution of annotated whistle categories.

To balance categories, all classes with at least 500 examples were subsampled to 500 instances each (SW_Luna, SW_Neo, SW_Nikita, NSW_9, SW_Yosefa, SW_Nana), ensuring an e...
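The balancing step described for Table 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balance_classes`, the fixed seed, and the use of uniform random sampling are assumptions. The source sentence is cut off before it says what happens to classes with fewer than 500 examples, so dropping them here is also an assumption.

```python
import random
from collections import defaultdict

# Per-category counts as listed in Table 3.
TABLE3_COUNTS = {
    "SW_Luna": 2934, "SW_Neo": 2239, "SW_Nikita": 888, "NSW_9": 658,
    "SW_Yosefa": 626, "SW_Nana": 521, "SW_Dana": 335, "SW_Shy": 81,
    "NSW_3": 45, "NSW_6": 27,
}


def balance_classes(labels, per_class=500, seed=0):
    """Return sorted indices with exactly `per_class` examples per kept class.

    Classes with at least `per_class` examples are subsampled uniformly at
    random. Smaller classes are dropped (assumption: the source text is
    truncated before stating how they are handled).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        if len(indices) >= per_class:
            kept.extend(rng.sample(indices, per_class))
    return sorted(kept)


# Stand-in label list with the Table 3 class sizes (no audio involved).
labels = [name for name, n in TABLE3_COUNTS.items() for _ in range(n)]
balanced = balance_classes(labels)
```

With the Table 3 counts, six categories meet the 500-example threshold, so the balanced subset holds 6 × 500 = 3,000 items. Seeding the sampler keeps the subset reproducible across runs.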


Wav2Vec2 refers to the base model pretrained on human speech at 16 kHz Baevski et al. (2020). Dolph2Vec-random init denotes the Dolph2Vec model with randomly initialized weights, i.e., before any self-supervised pretraining. Dolph2Vec-shuffled is a variant of Dolph2Vec in which the temporal structure of the learned representations is disrupted by shuffling the ...
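To make the idea of a temporally shuffled control concrete, here is a small sketch. The source text is truncated before it states exactly what is shuffled, so permuting the order of per-frame embedding vectors is only one possible reading. The helper names `shuffle_frames` and `mean_pool` and the toy data are hypothetical and not taken from the paper.

```python
import random


def shuffle_frames(frames, seed=0):
    """Return a copy of `frames` with the time order randomly permuted.

    Assumption: shuffling is applied across time steps of a sequence of
    frame-level embeddings; the source does not confirm this detail.
    """
    rng = random.Random(seed)
    shuffled = list(frames)  # copy, so the input sequence is left untouched
    rng.shuffle(shuffled)
    return shuffled


def mean_pool(frames):
    """Average the frame vectors over time, one value per dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


# Toy sequence: 8 time steps, 2 dimensions per frame.
frames = [[float(t), float(t * t)] for t in range(8)]
shuffled = shuffle_frames(frames)
```

Under this reading, the shuffled sequence contains the same frames in a different order. Any order-invariant summary such as `mean_pool` is unchanged, so a control of this kind only affects downstream components that actually use temporal order.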
