animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Ariana Strandburg-Peshkin; Baptiste Averly; Dan Stowell; Gabriella Gall; Julian C. Schäfer-Zimmermann; Kiran Dhanjal-Adams; Lily Johnson-Ulrich; Marie A. Roch; Marius Faiß; Marta B. Manser

arXiv: 2406.01253 · v3 · submitted 2024-06-03 · cs.SD · cs.AI · eess.AS · q-bio.QM · stat.AP


Julian C. Schäfer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, Ariana Strandburg-Peshkin


Pith reviewed 2026-05-24 00:29 UTC · model grok-4.3


keywords: bioacoustics, self-supervised learning, transformer, animal vocalizations, meerkat dataset, rare event detection, few-shot learning, audio classification


The pith

A self-supervised transformer learns rare animal vocalizations from unlabeled audio before refining with limited labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents animal2vec, a large transformer model built for bioacoustic recordings where vocalizations occur infrequently amid long stretches of silence. It employs a self-supervised training process that first extracts patterns from unlabeled audio and then adjusts using sparse labeled examples. The work also releases MeerKAT, a new dataset of meerkat calls annotated at millisecond resolution that is the largest such collection for non-human terrestrial mammals. Tests show the model surpasses prior methods on both MeerKAT and an existing birdsong benchmark, and it maintains strong results when only a few labeled samples are available. The combination supplies a practical route for handling the scale of modern bioacoustic archives despite limited ground-truth data.

Core claim

animal2vec is an interpretable large transformer model with a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data, outperforming existing methods on the MeerKAT meerkat vocalization dataset and the NIPS4Bplus birdsong dataset while performing well even with limited labeled data in few-shot settings.

What carries the argument

animal2vec, the self-supervised transformer for rare-event raw audio input that first learns representations from unlabeled bioacoustic recordings before supervised refinement on sparse labels.

If this is right

Large bioacoustic archives can be processed effectively even when animal sounds are rare and labels are scarce.
The model supports few-shot adaptation across different species and recording conditions.
MeerKAT provides a public benchmark for evaluating future methods on terrestrial mammal vocalizations at high temporal resolution.
Performance gains observed on both meerkat and bird data indicate the training scheme generalizes beyond a single taxon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-supervised pretraining plus sparse-label refinement pattern could transfer to other domains that face rare acoustic events, such as environmental sound monitoring.
Wider adoption might reduce the labeling burden in long-term field studies and allow faster turnaround from raw recordings to behavioral insights.
Testing the model on unlabeled corpora orders of magnitude larger than MeerKAT would clarify how much additional performance comes from scale alone.

Load-bearing premise

The self-supervised training scheme tailored for sparse and unbalanced bioacoustic data enables effective learning from unlabeled audio followed by refinement with labeled data.

What would settle it

A direct comparison on a fresh bioacoustic dataset in which animal2vec fails to exceed the accuracy of standard supervised transformers or other self-supervised audio models.

Figures

Figures reproduced from arXiv: 2406.01253 by Ariana Strandburg-Peshkin, Baptiste Averly, Dan Stowell, Gabriella Gall, Julian C. Schäfer-Zimmermann, Kiran Dhanjal-Adams, Lily Johnson-Ulrich, Marie A. Roch, Marius Faiß, Marta B. Manser, Mathieu Duteil, Vlad Demartsev.

**Figure 2.** Example Mel spectrograms for a representative audio snippet

**Figure 3.**

**Figure 4.** Globally averaged attention map of a four-second segment

**Figure 5.** Mask length distributions of the baseline (solid red line)

**Figure 6.** Cumulative frequency response (CFR) of the SincNet filters

Original abstract

Bioacoustic research, vital for understanding animal behavior, conservation, and ecology, faces a monumental challenge: analyzing vast datasets where animal vocalizations are rare. While deep learning techniques are becoming standard, adapting them to bioacoustics remains difficult. We address this with animal2vec, an interpretable large transformer model, and a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data. Furthermore, we introduce and publicly release MeerKAT: Meerkat Kalahari Audio Transcripts, a dataset of meerkat (Suricata suricatta) vocalizations with millisecond-resolution annotations, the largest labeled dataset on non-human terrestrial mammals currently available. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset. Moreover, animal2vec performs well even with limited labeled data (few-shot learning). animal2vec and MeerKAT provide a new reference point for bioacoustic research, enabling scientists to analyze large amounts of data even with scarce ground truth information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's real value is the public MeerKAT dataset release plus a self-supervised transformer adapted to sparse bioacoustic audio; performance claims look plausible but rest on details not visible in the abstract.


The punchline is that this work supplies a large new labeled audio dataset for meerkat vocalizations and shows a transformer can be pretrained self-supervised on raw, unbalanced bioacoustic recordings before fine-tuning. That combination is the concrete advance. The dataset is described as the largest available for non-human terrestrial mammals, with millisecond annotations, which gives the field a new public benchmark to test against. The model claims to handle the rarity of events better than prior approaches and to retain performance in few-shot regimes on both MeerKAT and the existing NIPS4Bplus birdsong set. Releasing both the data and the trained model is the part that actually moves the needle for people who work with passive acoustic monitoring.

Self-supervised pretraining on unlabeled audio is a reasonable extension of methods already used in speech and music, and the paper positions it specifically for the sparsity and class imbalance typical in animal sound. The stress-test note is right that nothing in the abstract creates an internal contradiction or circular evaluation. The soft spots sit in the evaluation section. The abstract gives no numbers on baselines, splits, error bars, or controls for recording conditions, so it is impossible to tell how much of the reported gains trace to the self-supervised scheme versus other modeling choices. The full methods and results need to show those comparisons explicitly.

This paper is for bioacoustics and ecology researchers who need reference data and few-shot methods for large unlabeled archives. It is not aimed at general audio ML. It deserves a serious referee because the dataset is new and substantial and the modeling approach is grounded enough to check in detail.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces animal2vec, a self-supervised transformer model with a training scheme tailored for sparse and unbalanced raw bioacoustic audio, and releases the MeerKAT dataset of meerkat vocalizations with millisecond-resolution annotations (the largest labeled dataset on non-human terrestrial mammals). The central claims are that animal2vec outperforms existing methods on MeerKAT and the public NIPS4Bplus birdsong dataset and that it supports effective few-shot learning with limited labeled data after self-supervised pretraining on unlabeled audio.

Significance. If the performance claims hold after proper evaluation, the work would provide a useful reference point for bioacoustics by addressing the challenge of rare vocalizations through self-supervised learning on unbalanced data and by releasing a large-scale annotated dataset that enables community benchmarking. The public dataset release and focus on few-shot regimes constitute concrete strengths that could facilitate progress in analyzing large unlabeled audio corpora common to the field.

major comments (1)

[Abstract] The assertion that the model 'outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset' and 'performs well even with limited labeled data (few-shot learning)' supplies no information on baselines, statistical tests, error bars, data splits, or potential confounds; the results section must supply these details to allow evaluation of the central performance claims.

minor comments (1)

[Methods] Clarify the exact self-supervised objective and any hyperparameters specific to the sparse-event regime in the methods section to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.


Referee: [Abstract] The assertion that the model 'outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset' and 'performs well even with limited labeled data (few-shot learning)' supplies no information on baselines, statistical tests, error bars, data splits, or potential confounds; the results section must supply these details to allow evaluation of the central performance claims.

Authors: The abstract is a concise summary and does not include methodological details by design. The Results section of the manuscript supplies the requested information, including the specific baselines used, statistical tests performed, error bars, data splits, and discussion of potential confounds for both the MeerKAT and NIPS4Bplus evaluations as well as the few-shot experiments.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected


The paper is an empirical ML contribution: it introduces a transformer model (animal2vec) with a tailored self-supervised pretraining scheme for sparse bioacoustic audio, releases the MeerKAT dataset, and reports outperformance on MeerKAT plus the external NIPS4Bplus benchmark, including few-shot regimes. No derivation chain, equations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Claims rest on standard train/fine-tune/evaluate protocols against public baselines and a new public dataset release, which constitute independent external support. No self-definitional, fitted-prediction, or load-bearing self-citation patterns appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no details on model hyperparameters, training objectives, or background assumptions, precluding identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5809 in / 1179 out tokens · 33114 ms · 2026-05-24T00:29:22.466758+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AVEX: What Matters for Animal Vocalization Encoding
cs.SD 2025-08 unverdicted novelty 5.0

Large empirical study finds self-supervised pre-training then supervised post-training on mixed bioacoustics and general audio data produces the strongest encoders across 26 datasets for species classification, detect...

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

The final model has 315M trainable parameters. We train for 100 epochs using the decoupled Adam optimizer (weight decay of 0.01) [110], a cosine learning rate schedule [111], linear warmup for 10000 steps, a final learning rate of 1 × 10⁻⁴, gradient clipping of 1 [112], and a batch size of 1020 s on four NVIDIA A100-SXM4-80GB GPUs for 20 d. The code is written in PyTorch
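The schedule named in this excerpt (linear warmup for 10000 steps, then cosine decay towards a final learning rate of 1 × 10⁻⁴) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' fairseq configuration: the excerpt does not give the peak learning rate or the total number of steps, so `peak_lr` and `total_steps` are hypothetical placeholders, and the function name `lr_at_step` is invented here.

```python
import math


def lr_at_step(step, total_steps, warmup_steps=10_000,
               peak_lr=1e-3, final_lr=1e-4):
    """Learning rate at a given optimizer step.

    warmup_steps and final_lr follow the excerpt; peak_lr and
    total_steps are hypothetical values chosen for illustration.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, lr_at_step(s, total_steps=100_000))
```

The rate rises linearly during warmup, peaks at the end of warmup, and then decays smoothly to `final_lr` at `total_steps`.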
[2]

using the fairseq framework [114]. The pretraining parameters for all settings can be found in table S1 in the supplemental material. We estimate that our setup consumed 3200 kWh (the typical yearly consumption of a German household [115]) with a carbon footprint of approximately 1400 kg CO2eq. (average emission factor of 2023 for Germany is 400 g CO2eq. kW−...
[3]

(BCL) with our modified window length for augmenting the input audio, and we mask parts of the input using the same stochastic masking strategy but with fewer masked spans, depending on the finetuning setting (table S1). For the MeerKAT (100%) setting, we use 𝑝 = 0.0825 and 𝑀 = 4, which sets the mode of the mask distribution to 22 ms while 60% of all time steps are masked...
[4]

Event boundary prediction: We slide a fixed-length average-pooling window (filter width is 100 ms) across the model's likelihood output to predict event onsets and offsets within a continuous audio stream. A fixed threshold is applied to binarize the output, generating a step function representing our event boundary estimates
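The event boundary step in this excerpt (average-pool the frame likelihoods, apply a fixed threshold, read onsets and offsets off the resulting step function) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the centered stride-1 window with truncation at the edges, the frame rate, and the threshold value of 0.5 are all assumptions, since the excerpt only states the 100 ms filter width and that the threshold is fixed.

```python
def smooth_likelihoods(likelihoods, frame_rate_hz, window_s=0.1):
    """Average-pool frame likelihoods with a centered sliding window (stride 1)."""
    win = max(1, round(window_s * frame_rate_hz))  # window length in frames
    half = win // 2
    n = len(likelihoods)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + win)  # the window is truncated at the edges
        smoothed.append(sum(likelihoods[lo:hi]) / (hi - lo))
    return smoothed


def predict_spans(likelihoods, frame_rate_hz, threshold=0.5, window_s=0.1):
    """Return (onset_s, offset_s) spans where the smoothed likelihood >= threshold.

    threshold=0.5 is a hypothetical value; the excerpt gives no number.
    """
    smoothed = smooth_likelihoods(likelihoods, frame_rate_hz, window_s)
    spans, start = [], None
    for i, value in enumerate(smoothed):
        active = value >= threshold
        if active and start is None:
            start = i  # rising edge of the step function: onset
        elif not active and start is not None:
            spans.append((start / frame_rate_hz, i / frame_rate_hz))
            start = None  # falling edge: offset
    if start is not None:  # event still active at the end of the stream
        spans.append((start / frame_rate_hz, len(smoothed) / frame_rate_hz))
    return spans


if __name__ == "__main__":
    frames = [0, 0, 0, 1, 1, 1, 0, 0, 0]
    print(predict_spans(frames, frame_rate_hz=10, window_s=0.3))
```

One effect of the pooling is that isolated single-frame spikes are averaged below the threshold and do not produce an event, while sustained activity survives.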
[5]

Intersection-over-union (IOU) calculation: Using the IOU metric, we measure the overlap between the ground truth event spans and our predictions. Predicted spans without corresponding ground truth events are assigned an IOU of zero
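For one-dimensional time spans, the IOU described in this excerpt reduces to interval arithmetic. The sketch below is illustrative only: spans are assumed to be `(onset_s, offset_s)` tuples, and the helper names `interval_iou` and `best_iou` are hypothetical rather than taken from the animal2vec code base.

```python
def interval_iou(pred, truth):
    """Intersection-over-union of two (onset_s, offset_s) time spans."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_iou(pred, truths):
    """Highest IOU of a predicted span against any ground truth span.

    A prediction with no overlapping ground truth event gets an IOU of zero.
    """
    return max((interval_iou(pred, t) for t in truths), default=0.0)


if __name__ == "__main__":
    print(interval_iou((0.0, 2.0), (1.0, 3.0)))  # 1 s overlap over a 3 s union
    print(best_iou((5.0, 6.0), [(0.0, 1.0), (2.0, 3.0)]))
```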
[6]

Final likelihood assignment: If the IOU for a predicted event exceeds 0.5, the average model likelihood within the predicted span is used as the final scalar likelihood. All reported metrics utilize these final likelihood values. TABLE III. Results on the evaluation split for the holdout generalizability study. The average precision scores (AP) [80] for...
[7]

Error identification: Ground truth events without predicted boundaries are considered false negatives. Predicted spans lacking a ground truth counterpart, or those with insufficient IOU, are considered false positives. A schematic of this process can be found in figure S14 in the supplemental material. 8: animal2vec generalizes well For analyzing the generalizability of anim...
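The final likelihood assignment and error identification steps in entries [6] and [7] can be combined into one scoring routine. The sketch below is an assumption-laden illustration, not the authors' implementation: it matches each predicted span to its best-overlapping ground truth span, and the names `score_events`, `mean_likelihood`, and the `frame_rate_hz` parameter are hypothetical. Only the 0.5 IOU cut-off and the false positive and false negative rules come from the excerpts.

```python
def interval_iou(pred, truth):
    """Intersection-over-union of two (onset_s, offset_s) time spans."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_likelihood(likelihoods, span, frame_rate_hz):
    """Average frame likelihood inside a (onset_s, offset_s) span."""
    start = int(round(span[0] * frame_rate_hz))
    end = int(round(span[1] * frame_rate_hz))
    frames = likelihoods[start:end]
    return sum(frames) / len(frames) if frames else 0.0


def score_events(pred_spans, true_spans, likelihoods, frame_rate_hz,
                 iou_threshold=0.5):
    """Split predictions into true/false positives and list missed events."""
    true_positives, false_positives, matched = [], [], set()
    for pred in pred_spans:
        best_idx, best = None, 0.0
        for idx, truth in enumerate(true_spans):
            iou = interval_iou(pred, truth)
            if iou > best:
                best_idx, best = idx, iou
        if best > iou_threshold:
            # Sufficient overlap: keep the span with its final scalar likelihood.
            matched.add(best_idx)
            score = mean_likelihood(likelihoods, pred, frame_rate_hz)
            true_positives.append((pred, score))
        else:
            # No counterpart or insufficient IOU: false positive.
            false_positives.append(pred)
    # Ground truth events that no prediction matched: false negatives.
    false_negatives = [t for i, t in enumerate(true_spans) if i not in matched]
    return true_positives, false_positives, false_negatives


if __name__ == "__main__":
    likelihoods = [0.0] * 10 + [0.8] * 10 + [0.0] * 10  # 3 s at 10 frames/s
    print(score_events([(1.0, 2.0), (2.5, 2.8)], [(1.0, 2.2), (0.0, 0.3)],
                       likelihoods, frame_rate_hz=10))
```

Because an IOU above 0.5 requires a prediction to cover more than half of a ground truth span, two non-overlapping predictions cannot both match the same event, so this simple best-match rule stays one-to-one when the spans on each side are disjoint.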
[8]

Bradbury, J. W. & Vehrencamp, S. L. Principles of animal communication (Sinauer Associates Sunderland, MA, 1998)

[9]

Demartsev, V. et al. Signalling in groups: New tools for the integration of animal communication and collective movement. Methods Ecol. Evol. (2022)

[10]

Rutz, C. et al. Using machine learning to decode animal communication. Science 381, 152–155 (2023)

[11]

Penar, W., Magiera, A. & Klocek, C. Applications of bioacoustics in animal ecology. Ecol. Complex. 43, 100847 (2020)

[12]

Fleishman, E. et al. Ecological inferences about marine mammals from passive acoustic data. Biol. Rev. Camb. Philos. Soc. 98, 1633–1647 (2023)

[13]

Rasmussen, J. H., Stowell, D. & Briefer, E. F. Sound evidence for biodiversity monitoring. Science 385, 138–140 (2024)

[14]

Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143, 1635–1645 (2010)

[15]

Mcloughlin, M. P., Stewart, R. & McElligott, A. G. Automated bioacoustics: methods in ecology and conservation and their potential for animal welfare monitoring. J. R. Soc. Interface 16, 20190225 (2019)

[16]

Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W., Jr & Llusia, D. Terrestrial passive acoustic monitoring: Review and perspectives. Bioscience 69, 15–25 (2019)

[17]

Lindseth, A. V. & Lobel, P. S. Underwater soundscape monitoring and fish bioacoustics: A review. Fishes 3, 36–30 (2018)

[18]

Madhusudhana, S. et al. Choosing equipment for animal bioacoustic research. Exploring Animal Behavior Through Sound: Volume 37 (2022)

[19]

Allen, A. N. et al. A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Front. Mar. Sci. 8 (2021)

[20]

Lostanlen, V., Salamon, J., Farnsworth, A., Kelling, S. & Bello, J. P. Birdvox-full-night: A dataset and benchmark for avian flight call detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 266–270 (IEEE, 2018)
[21]

Morfi, V., Bas, Y., Pamuła, H., Glotin, H. & Stowell, D. NIPS4Bplus: a richly annotated birdsong audio dataset. PeerJ Comput. Sci. 5, e223 (2019)

[22]

Ness, S., Symonds, H., Spong, P. & Tzanetakis, G. The orchive: Data mining a massive bioacoustic archive. Preprint at https://arxiv.org/abs/1307.0589 (2013)

[23]

Wall, C. C. et al. The next wave of passive acoustic data management: How centralized access can enhance science. Front. Mar. Sci. 8, 703682 (2021)

[24]

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015)

[25]

Vaswani, A. et al. Transformer: Attention is all you need. Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

[26]

Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. AI Open 3, 111–132 (2022)

[27]

Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 10, e13152 (2022)

[28]

Wyse, L. Audio spectrogram representations for processing with convolutional neural networks. In Proceedings of the First International Conference on Deep Learning and Music, 37–41 (2017)

[29]

Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020)

[30]

Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. (2021)

[31]

Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, 28492–28518 (2023)

[32]

Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009)

[33]

Gemmeke, J. F. et al. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 776–780 (IEEE, 2017)

[34]

Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5, 44–53 (2018)

[35]

Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), 5206–5210 (IEEE, 2015)
[36]

Longpre, S. et al. A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. Preprint at https://arxiv.org/abs/2305.13169 (2023)

[37]

Chen, H.-Y., Tu, C.-H., Li, Z.-H., Shen, H. & Chao, W.-L. On the importance and applicability of pre-training for federated learning. Int Conf Learn Represent (2022). 2206.11488

[38]

Liu, X. et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 1–1 (2021)

[39]

He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 9726–9735 (2020)

[40]

Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020)

[41]

Robinson, J. D. et al. Can contrastive learning avoid shortcut solutions? Conference on Neural Information Processing Systems (2021)

[42]

Tomasev, N. et al. Pushing the limits of self-supervised resnets: Can we outperform supervised learning without labels on imagenet? In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 (2022)

[43]

Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. E. Big self-supervised models are strong semi-supervised learners. vol. 33, 22243–22255 (2020)

[44]

Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. vol. 33, 12449–12460 (2020)

[45]

Saeed, A., Grangier, D. & Zeghidour, N. Contrastive learning of general-purpose audio representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3875–3879 (2021)

[46]

Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Self-supervised learning for general-purpose audio representation. In 2021 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2021)

[47]

Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Exploring pre-trained general-purpose audio representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 137–151 (2023)

[48]

Brown, T. et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

[49]

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1, 4171–4186 (2018)

[50]

Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)
[51]

Hsu, W.-N. et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)

[52]

Lin, C.-C., Jaech, A., Li, X., Gormley, M. R. & Eisner, J. Limitations of autoregressive models and their alternatives. In Toutanova, K. et al. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5147–5173 (Association for Computational Linguistics, Online, 2021)

[53]

Zhu, D., Hedderich, M. A., Zhai, F., Adelani, D. & Klakow, D. Is bert robust to label noise? a study on learning with noisy labels in text classification. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, 62–67 (2022)

[54]

Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D. & Makedon, F. A survey on contrastive self-supervised learning. Technologies (Basel) 9, 2 (2020)

[55]

Chung, Y.-A. et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 244–250 (IEEE, 2021)

[56]

Gong, Y., Chung, Y.-A. & Glass, J. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, 571–575 (2021)

[57]

Wyatt, S. et al. Environmental sound classification with tiny transformers in noisy edge environments. In 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), 309–314 (IEEE, 2021)

[58]

Wolters, P., Sizemore, L., Daw, C., Hutchinson, B. & Phillips, L. Proposal-based few-shot sound event detection for speech and environmental sounds with perceivers. Preprint at https://arxiv.org/abs/2107.13616 (2021)

[59]

You, L., Coyotl, E. P., Gunturu, S. & Van Segbroeck, M. Transformer-based bioacoustic sound event detection on few-shot learning tasks. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023)

[60]

Robinson, D., Robinson, A. & Akrapongpisak, L. Transferable models for bioacoustics with human language supervision. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1316–1320 (IEEE, 2024)

[61]

Gu, N. et al. Positive transfer of the whisper speech transformer to human and animal voice activity detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7505–7509 (IEEE, 2024)

[62]

Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30 (2017)

[63]

Grill, J.-B. et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)

[64]

Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

[65]

Baevski, A. et al. data2vec: A general framework for self-supervised learning in speech, vision and language. In Chaudhuri, K. et al. (eds.) Proceedings of the 39th International Conference on Machine Learning, vol. 162 of Proceedings of Machine Learning Research, 1298–1312 (PMLR, 2022)
[66]

& Auli, M

Baevski, A., Babu, A., Hsu, W.-N. & Auli, M. Efficient self- supervised learning with contextualized target representations for vision, speech and language. InInternational Conference on Machine Learning, 1416–1429 (PMLR, 2023)

work page 2023
[67]

& Lee, J.-G

Song, H., Kim, M., Park, D., Shin, Y. & Lee, J.-G. Learning from noisy labels with deep neural networks: A survey.IEEE Trans. Neural Netw. Learn. Syst.34, 8135–8153 (2023)

work page 2023
[68]

In2018 IEEE spoken language technology workshop (SLT), 1021–1028 (IEEE, 2018)

Ravanelli,M.&Bengio,Y.Speakerrecognitionfromrawwave- form with sincnet. In2018 IEEE spoken language technology workshop (SLT), 1021–1028 (IEEE, 2018)

work page 2018
[69]

https: //kalahariresearchcentre.org

The kalahari research centre KRC. https: //kalahariresearchcentre.org. Accessed: 2024- 04-25

work page 2024
[70]

https://opensource.org/license/ mit

The MIT license. https://opensource.org/license/ mit. Accessed: 2024-04-25

work page 2024
[71]

com/livingingroups/animal2vec

Official GitHub repository for animal2vec.https://github. com/livingingroups/animal2vec. Accessed: 2024-04- 25

work page 2024
[72]

https://creativecommons.org/licenses/ by-nc/4.0

Creative Commons Attribution-NonCommercial 4.0 Inter- national. https://creativecommons.org/licenses/ by-nc/4.0. Accessed: 2024-04-25

work page 2024
[73]

Schäfer-Zimmermann, J. C.et al. MeerKAT: Meerkat Kala- hari Audio Transcripts (2024). URLhttps://doi.org/10. 17617/3.0J0DYB

work page 2024
[74]

B.The evolution of auditory communication in suricates, Suricata suricatta

Manser, M. B.The evolution of auditory communication in suricates, Suricata suricatta. Ph.D. thesis, University of Cam- bridge (1998)

work page 1998
[75]

B., Jansen, D

Manser, M. B., Jansen, D. A. W. A., Graw, B. & le Roux, A. Vocalcomplexityinmeerkatsandothermongoosespecies. Adv. Stud. Behav.46, 281–310 (2014)

work page 2014
[76]

W., Charlton, B

Townsend, S. W., Charlton, B. D. & Manser, M. B. Acoustic 15 cues to identity and predator context in meerkat barks.Anim. Behav.94, 143–149 (2014)

work page 2014
[77]

W., Hollén, L

Townsend, S. W., Hollén, L. I. & Manser, M. B. Meerkat close calls encode group-specific signatures, but receivers fail to discriminate.Anim. Behav.80, 133–138 (2010)

work page 2010
[78]

Collier, K., Townsend, S. W. & Manser, M. B. Call concatena- tion in wild meerkats.Anim. Behav.134, 257–269 (2017)

work page 2017
[79]

& Manser, M

Demartsev, V., Strandburg-Peshkin, A., Ruffner, M. & Manser, M. Vocal turn-taking in meerkat group calling sessions.Curr. Biol.28, 3661–3666.e3 (2018)

[80] Manser, M. B. The acoustic structure of suricates’ alarm calls varies with predator type and the level of response urgency. Proc. Biol. Sci. 268, 2315–2324 (2001)


Showing first 80 references.

[1] The final model has 315M trainable parameters. We train for 100 epochs using the decoupled Adam optimizer (weight decay of 0.01) [110], a cosine learning rate schedule [111], linear warmup for 10000 steps, a final learning rate of 1 × 10^−4, gradient clipping of 1 [112], and a batch size of 1020 s on four NVIDIA A100-SXM4-80GB GPUs for 20 d. The code is written in PyTorch

[2] using the fairseq framework [114]. The pretraining parameters for all settings can be found in table S1 in the supplemental material. We estimate that our setup consumed 3200 kWh (the typical yearly consumption of a German household [115]) with a carbon footprint of approximately 1400 kgCO2eq. (average emission factor of 2023 for Germany is 400 gCO2eq.kW−...
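The training setup quoted above combines linear warmup with a cosine learning rate schedule. A minimal sketch of such a schedule follows. The 10000 warmup steps and the final learning rate of 1e-4 come from the text; the peak learning rate and the total step count are hypothetical placeholders (the real values are listed in table S1), starting the warmup at zero is an assumption, and `lr_at` is an illustrative name rather than a fairseq function.

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int,
          warmup_steps: int = 10_000, final_lr: float = 1e-4) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    peak_lr and total_steps are hypothetical inputs; the source text only
    fixes warmup_steps and final_lr.
    """
    if step < warmup_steps:
        # Linear ramp from 0 (assumed start value) up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from peak_lr down to final_lr.
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, lr_at(s, peak_lr=1e-3, total_steps=100_000))
```

With these placeholder values the rate rises linearly for the first 10000 steps, peaks, and then decays smoothly to the final learning rate.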

[3] (BCL) with our modified window length for augmenting the input audio, and we mask parts of the input using the same stochastic masking strategy but with fewer masked spans, depending on the finetuning setting (table S1). For the MeerKAT (100%) setting, we use 𝑝 = 0.0825 and 𝑀 = 4, which sets the mode of the mask distribution to 22 ms while 60% of all timesteps are masked

[4] Event boundary prediction: We slide a fixed-length average-pooling window (filter width is 100 ms) across the model’s likelihood output to predict event onsets and offsets within a continuous audio stream. A fixed threshold is applied to binarize the output, generating a step function representing our event boundary estimates
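A minimal sketch of the event boundary step described above, assuming the model emits one likelihood per fixed-length frame. The 100 ms pooling width comes from the text; the 10 ms frame period, the 0.5 threshold, and the names `average_pool` and `predict_events` are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def average_pool(likelihood: List[float], width: int) -> List[float]:
    """Moving average over `width` frames, same length as the input.

    Windows are truncated at the edges of the stream.
    """
    n = len(likelihood)
    half = width // 2
    pooled = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + width)
        window = likelihood[lo:hi]
        pooled.append(sum(window) / len(window))
    return pooled


def predict_events(likelihood: List[float], frame_ms: float = 10.0,
                   pool_ms: float = 100.0,
                   threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Return (onset, offset) pairs in seconds from per-frame likelihoods."""
    width = max(1, round(pool_ms / frame_ms))
    smoothed = average_pool(likelihood, width)
    # Binarize: the resulting step function marks frames inside an event.
    active = [value >= threshold for value in smoothed]
    spans = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i  # rising edge: event onset
        elif not flag and start is not None:
            spans.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None  # falling edge: event offset
    if start is not None:
        # Event still active at the end of the stream.
        spans.append((start * frame_ms / 1000, len(active) * frame_ms / 1000))
    return spans


if __name__ == "__main__":
    stream = [0.0] * 20 + [1.0] * 30 + [0.0] * 20
    print(predict_events(stream))
```

Smoothing before thresholding suppresses single-frame spikes, so short bursts of high likelihood do not split or create events on their own.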

[5] Intersection-over-union (IOU) calculation: Using the IOU metric, we measure the overlap between the ground truth event spans and our predictions. Predicted spans without corresponding ground truth events are assigned an IOU of zero
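For one-dimensional time spans, the IOU described above reduces to a ratio of interval lengths. The sketch below assumes spans are `(onset, offset)` pairs in seconds; `interval_iou` and `best_iou` are hypothetical helper names.

```python
from typing import Iterable, Tuple

Span = Tuple[float, float]


def interval_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_iou(prediction: Span, truths: Iterable[Span]) -> float:
    """Highest IOU between a prediction and any ground truth span.

    Returns 0.0 when there is no ground truth at all, matching the rule
    that predictions without a corresponding event get an IOU of zero.
    """
    return max((interval_iou(prediction, t) for t in truths), default=0.0)


if __name__ == "__main__":
    print(interval_iou((0.0, 1.0), (0.5, 1.5)))
    print(best_iou((3.0, 4.0), [(0.0, 1.0)]))
```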

[6] Final likelihood assignment: If the IOU for a predicted event exceeds 0.5, the average model likelihood within the predicted span is used as the final scalar likelihood. All reported metrics utilize these final likelihood values.
TABLE III. Results on the evaluation split for the holdout generalizability study. The average precision scores (AP) [80] for...

[7] Error identification: Ground truth events without predicted boundaries are considered false negatives. Predicted spans lacking a ground truth counterpart, or those with insufficient IOU, are considered false positives. A schematic of this process can be found in figure S14 in the supplemental material.
8: animal2vec generalizes well
For analyzing the generalizability of anim...
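The error identification rules above can be sketched as a small matching routine. It takes the 0.5 IOU cutoff from the final likelihood step; treating every ground truth event that no prediction matches as a false negative is an assumption about how "without predicted boundaries" is applied, and `classify_events` is a hypothetical name. Because a match needs an IOU above 0.5, two non-overlapping predictions cannot both match the same ground truth event, so the sketch does not need an explicit one-to-one assignment step.

```python
from typing import List, Tuple

Span = Tuple[float, float]


def interval_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def classify_events(predictions: List[Span], truths: List[Span],
                    min_iou: float = 0.5):
    """Split spans into true positives, false positives, false negatives."""
    matched = set()
    true_pos, false_pos = [], []
    for pred in predictions:
        best_index, best_value = None, 0.0
        for j, truth in enumerate(truths):
            value = interval_iou(pred, truth)
            if value > best_value:
                best_index, best_value = j, value
        if best_value > min_iou:
            true_pos.append(pred)
            matched.add(best_index)
        else:
            # No counterpart, or insufficient overlap.
            false_pos.append(pred)
    # Assumption: any ground truth event left unmatched is a miss.
    false_neg = [t for j, t in enumerate(truths) if j not in matched]
    return true_pos, false_pos, false_neg


if __name__ == "__main__":
    truths = [(0.0, 1.0), (2.0, 3.0)]
    preds = [(0.1, 1.0), (1.9, 2.2), (5.0, 5.5)]
    print(classify_events(preds, truths))
```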

[8] Bradbury, J. W. & Vehrencamp, S. L. Principles of animal communication (Sinauer Associates Sunderland, MA, 1998)

[9] Demartsev, V. et al. Signalling in groups: New tools for the integration of animal communication and collective movement. Methods Ecol. Evol. (2022)

[10] Rutz, C. et al. Using machine learning to decode animal communication. Science 381, 152–155 (2023)

[11] Penar, W., Magiera, A. & Klocek, C. Applications of bioacoustics in animal ecology. Ecol. Complex. 43, 100847 (2020)

[12] Fleishman, E. et al. Ecological inferences about marine mammals from passive acoustic data. Biol. Rev. Camb. Philos. Soc. 98, 1633–1647 (2023)

[13] Rasmussen, J. H., Stowell, D. & Briefer, E. F. Sound evidence for biodiversity monitoring. Science 385, 138–140 (2024)

[14] Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143, 1635–1645 (2010)

[15] Mcloughlin, M. P., Stewart, R. & McElligott, A. G. Automated bioacoustics: methods in ecology and conservation and their potential for animal welfare monitoring. J. R. Soc. Interface 16, 20190225 (2019)

[16] Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W., Jr & Llusia, D. Terrestrial passive acoustic monitoring: Review and perspectives. Bioscience 69, 15–25 (2019)

[17] Lindseth, A. V. & Lobel, P. S. Underwater soundscape monitoring and fish bioacoustics: A review. Fishes 3, 36–30 (2018)

[18] Madhusudhana, S. et al. Choosing equipment for animal bioacoustic research. Exploring Animal Behavior Through Sound: Volume 37 (2022)

[19] Allen, A. N. et al. A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Front. Mar. Sci. 8 (2021)

[20] Lostanlen, V., Salamon, J., Farnsworth, A., Kelling, S. & Bello, J. P. Birdvox-full-night: A dataset and benchmark for avian flight call detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 266–270 (IEEE, 2018)

[21] Morfi, V., Bas, Y., Pamuła, H., Glotin, H. & Stowell, D. NIPS4Bplus: a richly annotated birdsong audio dataset. PeerJ Comput. Sci. 5, e223 (2019)

[22] Ness, S., Symonds, H., Spong, P. & Tzanetakis, G. The orchive: Data mining a massive bioacoustic archive. Preprint at https://arxiv.org/abs/1307.0589 (2013)

[23] Wall, C. C. et al. The next wave of passive acoustic data management: How centralized access can enhance science. Front. Mar. Sci. 8, 703682 (2021)

[24] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015)

[25] Vaswani, A. et al. Transformer: Attention is all you need. Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

[26] Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. AI Open 3, 111–132 (2022)

[27] Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 10, e13152 (2022)

[28] Wyse, L. Audio spectrogram representations for processing with convolutional neural networks. In Proceedings of the First International Conference on Deep Learning and Music, 37–41 (2017)

[29] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020)

[30] Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. (2021)

[31] Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, 28492–28518 (2023)

[32] Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009)

[33] Gemmeke, J. F. et al. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 776–780 (IEEE, 2017)

[34] Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5, 44–53 (2018)

[35] Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), 5206–5210 (IEEE, 2015)

[36] Longpre, S. et al. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. Preprint at https://arxiv.org/abs/2305.13169 (2023)

[37] Chen, H.-Y., Tu, C.-H., Li, Z.-H., Shen, H. & Chao, W.-L. On the importance and applicability of pre-training for federated learning. Int Conf Learn Represent (2022). 2206.11488

[38] Liu, X. et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 1–1 (2021)

[39] He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 9726–9735 (2020)

[40] Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020)

[41] Robinson, J. D. et al. Can contrastive learning avoid shortcut solutions? Conference on Neural Information Processing Systems (2021)

[42] Tomasev, N. et al. Pushing the limits of self-supervised resnets: Can we outperform supervised learning without labels on imagenet? In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 (2022)

[43] Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. E. Big self-supervised models are strong semi-supervised learners. vol. 33, 22243–22255 (2020)

[44] Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. vol. 33, 12449–12460 (2020)

[45] Saeed, A., Grangier, D. & Zeghidour, N. Contrastive learning of general-purpose audio representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3875–3879 (2021)

[46] Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Self-supervised learning for general-purpose audio representation. In 2021 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2021)

[47] Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Exploring pre-trained general-purpose audio representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 137–151 (2023)

[48] Brown, T. et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

[49] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1, 4171–4186 (2018)

[50] Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)

[51] Hsu, W.-N. et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEEACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)

[52] Lin, C.-C., Jaech, A., Li, X., Gormley, M. R. & Eisner, J. Limitations of autoregressive models and their alternatives. In Toutanova, K. et al. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5147–5173 (Association for Computational Linguistics, Online, 2021)

[53] Zhu, D., Hedderich, M. A., Zhai, F., Adelani, D. & Klakow, D. Is bert robust to label noise? a study on learning with noisy labels in text classification. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, 62–67 (2022)

[54] Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D. & Makedon, F. A survey on contrastive self-supervised learning. Technologies (Basel) 9, 2 (2020)

[55] Chung, Y.-A. et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 244–250 (IEEE, 2021)

[56] Gong, Y., Chung, Y.-A. & Glass, J. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, 571–575 (2021)

[57] Wyatt, S. et al. Environmental sound classification with tiny transformers in noisy edge environments. In 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), 309–314 (IEEE, 2021)

[58] Wolters, P., Sizemore, L., Daw, C., Hutchinson, B. & Phillips, L. Proposal-based few-shot sound event detection for speech and environmental sounds with perceivers. Preprint at https://arxiv.org/abs/2107.13616 (2021)

[59] You, L., Coyotl, E. P., Gunturu, S. & Van Segbroeck, M. Transformer-based bioacoustic sound event detection on few-shot learning tasks. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023)

[60] Robinson, D., Robinson, A. & Akrapongpisak, L. Transferable models for bioacoustics with human language supervision. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1316–1320 (IEEE, 2024)

[61] Gu, N. et al. Positive transfer of the whisper speech transformer to human and animal voice activity detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7505–7509 (IEEE, 2024)

[62] Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30 (2017)

[63] Grill, J.-B. et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)

[64] Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

[65] Baevski, A. et al. data2vec: A general framework for self-supervised learning in speech, vision and language. In Chaudhuri, K. et al. (eds.) Proceedings of the 39th International Conference on Machine Learning, vol. 162 of Proceedings of Machine Learning Research, 1298–1312 (PMLR, 2022)

[66] Baevski, A., Babu, A., Hsu, W.-N. & Auli, M. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In International Conference on Machine Learning, 1416–1429 (PMLR, 2023)

[67] Song, H., Kim, M., Park, D., Shin, Y. & Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 34, 8135–8153 (2023)

[68] Ravanelli, M. & Bengio, Y. Speaker recognition from raw waveform with SincNet. In 2018 IEEE spoken language technology workshop (SLT), 1021–1028 (IEEE, 2018)

[69] The Kalahari Research Centre KRC. https://kalahariresearchcentre.org. Accessed: 2024-04-25

[70] The MIT license. https://opensource.org/license/mit. Accessed: 2024-04-25

[71] Official GitHub repository for animal2vec. https://github.com/livingingroups/animal2vec. Accessed: 2024-04-25

[72] Creative Commons Attribution-NonCommercial 4.0 International. https://creativecommons.org/licenses/by-nc/4.0. Accessed: 2024-04-25

[73] Schäfer-Zimmermann, J. C. et al. MeerKAT: Meerkat Kalahari Audio Transcripts (2024). URL https://doi.org/10.17617/3.0J0DYB

[74] Manser, M. B. The evolution of auditory communication in suricates, Suricata suricatta. Ph.D. thesis, University of Cambridge (1998)

[75] Manser, M. B., Jansen, D. A. W. A., Graw, B. & le Roux, A. Vocal complexity in meerkats and other mongoose species. Adv. Stud. Behav. 46, 281–310 (2014)

[76] Townsend, S. W., Charlton, B. D. & Manser, M. B. Acoustic cues to identity and predator context in meerkat barks. Anim. Behav. 94, 143–149 (2014)

[77] Townsend, S. W., Hollén, L. I. & Manser, M. B. Meerkat close calls encode group-specific signatures, but receivers fail to discriminate. Anim. Behav. 80, 133–138 (2010)

[78] Collier, K., Townsend, S. W. & Manser, M. B. Call concatenation in wild meerkats. Anim. Behav. 134, 257–269 (2017)

[79] Demartsev, V., Strandburg-Peshkin, A., Ruffner, M. & Manser, M. Vocal turn-taking in meerkat group calling sessions. Curr. Biol. 28, 3661–3666.e3 (2018)

[80] Manser, M. B. The acoustic structure of suricates’ alarm calls varies with predator type and the level of response urgency. Proc. Biol. Sci. 268, 2315–2324 (2001)