
arXiv: 2406.01253 · v3 · submitted 2024-06-03 · cs.SD · cs.AI · eess.AS · q-bio.QM · stat.AP

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Pith reviewed 2026-05-24 00:29 UTC · model grok-4.3

classification cs.SD · cs.AI · eess.AS · q-bio.QM · stat.AP
keywords bioacoustics · self-supervised learning · transformer · animal vocalizations · meerkat dataset · rare event detection · few-shot learning · audio classification

The pith

A self-supervised transformer learns rare animal vocalizations from unlabeled audio before refining with limited labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents animal2vec, a large transformer model built for bioacoustic recordings where vocalizations occur infrequently amid long stretches of silence. It employs a self-supervised training process that first extracts patterns from unlabeled audio and then adjusts using sparse labeled examples. The work also releases MeerKAT, a new dataset of meerkat calls annotated at millisecond resolution that is the largest such collection for non-human terrestrial mammals. Tests show the model surpasses prior methods on both MeerKAT and an existing birdsong benchmark, and it maintains strong results when only a few labeled samples are available. The combination supplies a practical route for handling the scale of modern bioacoustic archives despite limited ground-truth data.

Core claim

animal2vec is an interpretable large transformer model with a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data, outperforming existing methods on the MeerKAT meerkat vocalization dataset and the NIPS4Bplus birdsong dataset while performing well even with limited labeled data in few-shot settings.

What carries the argument

animal2vec, the self-supervised transformer for rare-event raw audio input that first learns representations from unlabeled bioacoustic recordings before supervised refinement on sparse labels.

If this is right

  • Large bioacoustic archives can be processed effectively even when animal sounds are rare and labels are scarce.
  • The model supports few-shot adaptation across different species and recording conditions.
  • MeerKAT provides a public benchmark for evaluating future methods on terrestrial mammal vocalizations at high temporal resolution.
  • Performance gains observed on both meerkat and bird data indicate the training scheme generalizes beyond a single taxon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-supervised pretraining plus sparse-label refinement pattern could transfer to other domains that face rare acoustic events, such as environmental sound monitoring.
  • Wider adoption might reduce the labeling burden in long-term field studies and allow faster turnaround from raw recordings to behavioral insights.
  • Testing the model on unlabeled corpora orders of magnitude larger than MeerKAT would clarify how much additional performance comes from scale alone.

Load-bearing premise

The self-supervised training scheme tailored for sparse and unbalanced bioacoustic data enables effective learning from unlabeled audio followed by refinement with labeled data.

What would settle it

A direct comparison on a fresh bioacoustic dataset in which animal2vec fails to exceed the accuracy of standard supervised transformers or other self-supervised audio models.

Figures

Figures reproduced from arXiv: 2406.01253 by Ariana Strandburg-Peshkin, Baptiste Averly, Dan Stowell, Gabriella Gall, Julian C. Schäfer-Zimmermann, Kiran Dhanjal-Adams, Lily Johnson-Ulrich, Marie A. Roch, Marius Faiß, Marta B. Manser, Mathieu Duteil, Vlad Demartsev.

Figure 1. The statistics of the ...
Figure 2. Example Mel spectrograms for a representative audio snippet ...
Figure 3.
Figure 4. Globally averaged attention map of a four-second segment ...
Figure 5. Mask length distributions of the baseline (solid red line) ...
Figure 6. Cumulative frequency response (CFR) of the SincNet filters ...
Original abstract

Bioacoustic research, vital for understanding animal behavior, conservation, and ecology, faces a monumental challenge: analyzing vast datasets where animal vocalizations are rare. While deep learning techniques are becoming standard, adapting them to bioacoustics remains difficult. We address this with animal2vec, an interpretable large transformer model, and a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data. Furthermore, we introduce and publicly release MeerKAT: Meerkat Kalahari Audio Transcripts, a dataset of meerkat (Suricata suricatta) vocalizations with millisecond-resolution annotations, the largest labeled dataset on non-human terrestrial mammals currently available. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset. Moreover, animal2vec performs well even with limited labeled data (few-shot learning). animal2vec and MeerKAT provide a new reference point for bioacoustic research, enabling scientists to analyze large amounts of data even with scarce ground truth information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces animal2vec, a self-supervised transformer model with a training scheme tailored for sparse and unbalanced raw bioacoustic audio, and releases the MeerKAT dataset of meerkat vocalizations with millisecond-resolution annotations (the largest labeled dataset on non-human terrestrial mammals). The central claims are that animal2vec outperforms existing methods on MeerKAT and the public NIPS4Bplus birdsong dataset and that it supports effective few-shot learning with limited labeled data after self-supervised pretraining on unlabeled audio.

Significance. If the performance claims hold after proper evaluation, the work would provide a useful reference point for bioacoustics by addressing the challenge of rare vocalizations through self-supervised learning on unbalanced data and by releasing a large-scale annotated dataset that enables community benchmarking. The public dataset release and focus on few-shot regimes constitute concrete strengths that could facilitate progress in analyzing large unlabeled audio corpora common to the field.

major comments (1)
  1. [Abstract] Abstract: the assertion that the model 'outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset' and 'performs well even with limited labeled data (few-shot learning)' supplies no information on baselines, statistical tests, error bars, data splits, or potential confounds; the results section must supply these details to allow evaluation of the central performance claims.
minor comments (1)
  1. [Methods] Clarify the exact self-supervised objective and any hyperparameters specific to the sparse-event regime in the methods section to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the model 'outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset' and 'performs well even with limited labeled data (few-shot learning)' supplies no information on baselines, statistical tests, error bars, data splits, or potential confounds; the results section must supply these details to allow evaluation of the central performance claims.

    Authors: The abstract is a concise summary and does not include methodological details by design. The Results section of the manuscript supplies the requested information, including the specific baselines used, statistical tests performed, error bars, data splits, and discussion of potential confounds for both the MeerKAT and NIPS4Bplus evaluations as well as the few-shot experiments.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper is an empirical ML contribution: it introduces a transformer model (animal2vec) with a tailored self-supervised pretraining scheme for sparse bioacoustic audio, releases the MeerKAT dataset, and reports outperformance on MeerKAT plus the external NIPS4Bplus benchmark, including few-shot regimes. No derivation chain, equations, or predictions are presented that reduce by construction to fitted inputs or self-citations. Claims rest on standard train/fine-tune/evaluate protocols against public baselines and a new public dataset release, which constitute independent external support. No self-definitional, fitted-prediction, or load-bearing self-citation patterns appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no details on model hyperparameters, training objectives, or background assumptions, precluding identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5809 in / 1179 out tokens · 33114 ms · 2026-05-24T00:29:22.466758+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AVEX: What Matters for Animal Vocalization Encoding

    cs.SD 2025-08 unverdicted novelty 5.0

    Large empirical study finds self-supervised pre-training then supervised post-training on mixed bioacoustics and general audio data produces the strongest encoders across 26 datasets for species classification, detect...

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    The final model has 315M trainable parameters. We train for 100 epochs using the decoupled Adam optimizer (weight decay of 0.01) [110], a cosine learning rate schedule [111], linear warmup for 10000 steps, a final learning rate of 1 × 10^-4, gradient clipping of 1 [112], and a batch size of 1020 s on four NVIDIA A100-SXM4-80GB GPUs for 20 d. The code is written in PyTorch

  2. [2]

    using the fairseq framework [114]. The pretraining parameters for all settings can be found in table S1 in the supplemental material. We estimate that our setup consumed 3200 kWh (the typical yearly consumption of a German household [115]) with a carbon footprint of approximately 1400 kg CO2eq. (average emission factor of 2023 for Germany is 400 g CO2eq. kW−...

  3. [3]

    (BCL) with our modified window length for augmenting the input audio, and we mask parts of the input using the same stochastic masking strategy but with fewer masked spans, depending on the finetuning setting (table S1). For the MeerKAT (100%) setting, we use 𝑝 = 0.0825 and 𝑀 = 4, which sets the mode of the mask distribution to 22 ms while 60% of all time steps are masked...

  4. [4]

    Event boundary prediction: We slide a fixed-length average-pooling window (filter width is 100 ms) across the model’s likelihood output to predict event onsets and offsets within a continuous audio stream. A fixed threshold is applied to binarize the output, generating a step function representing our event boundary estimates

  5. [5]

    Intersection-over-union (IOU) calculation: Using the IOU metric, we measure the overlap between the ground truth event spans and our predictions. Predicted spans without corresponding ground truth events are assigned an IOU of zero

  6. [6]

    Final likelihood assignment: If the IOU for a predicted event exceeds 0.5, the average model likelihood within the predicted span is used as the final scalar likelihood. All reported metrics utilize these final likelihood values. TABLE III. Results on the evaluation split for the holdout generalizability study. The average precision scores (AP) [80] for...

  7. [7]

    Error identification: Ground truth events without predicted boundaries are considered false negatives. Predicted spans lacking a ground truth counterpart, or those with insufficient IOU, are considered false positives. A schematic of this process can be found in figure S14 in the supplemental material. 8: animal2vec generalizes well For analyzing the generalizability of anim...

  8. [8]

    Bradbury, J. W. & Vehrencamp, S. L. Principles of animal communication (Sinauer Associates Sunderland, MA, 1998)

  9. [9]

    Demartsev, V. et al. Signalling in groups: New tools for the integration of animal communication and collective movement. Methods Ecol. Evol. (2022)

  10. [10]

    Rutz, C. et al. Using machine learning to decode animal communication. Science 381, 152–155 (2023)

  11. [11]

    Penar, W., Magiera, A. & Klocek, C. Applications of bioacoustics in animal ecology. Ecol. Complex. 43, 100847 (2020)

  12. [12]

    Fleishman, E. et al. Ecological inferences about marine mammals from passive acoustic data. Biol. Rev. Camb. Philos. Soc. 98, 1633–1647 (2023)

  13. [13]

    Rasmussen, J. H., Stowell, D. & Briefer, E. F. Sound evidence for biodiversity monitoring. Science 385, 138–140 (2024)

  14. [14]

    Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143, 1635–1645 (2010)

  15. [15]

    Mcloughlin, M. P., Stewart, R. & McElligott, A. G. Automated bioacoustics: methods in ecology and conservation and their potential for animal welfare monitoring. J. R. Soc. Interface 16, 20190225 (2019)

  16. [16]

    Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W., Jr & Llusia, D. Terrestrial passive acoustic monitoring: Review and perspectives. Bioscience 69, 15–25 (2019)

  17. [17]

    Lindseth, A. V. & Lobel, P. S. Underwater soundscape monitoring and fish bioacoustics: A review. Fishes 3, 36–30 (2018)

  18. [18]

    Madhusudhana, S. et al. Choosing equipment for animal bioacoustic research. Exploring Animal Behavior Through Sound: Volume 37 (2022)

  19. [19]

    Allen, A. N. et al. A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Front. Mar. Sci. 8 (2021)

  20. [20]

    Lostanlen, V., Salamon, J., Farnsworth, A., Kelling, S. & Bello, J. P. Birdvox-full-night: A dataset and benchmark for avian flight call detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 266–270 (IEEE, 2018)

  21. [21]

    Morfi, V., Bas, Y., Pamuła, H., Glotin, H. & Stowell, D. NIPS4Bplus: a richly annotated birdsong audio dataset. PeerJ Comput. Sci. 5, e223 (2019)

  22. [22]

    Ness, S., Symonds, H., Spong, P. & Tzanetakis, G. The orchive: Data mining a massive bioacoustic archive. Preprint at https://arxiv.org/abs/1307.0589 (2013)

  23. [23]

    Wall, C. C. et al. The next wave of passive acoustic data management: How centralized access can enhance science. Front. Mar. Sci. 8, 703682 (2021)

  24. [24]

    LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015)

  25. [25]

    Vaswani, A. et al. Transformer: Attention is all you need. Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

  26. [26]

    Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. AI Open 3, 111–132 (2022)

  27. [27]

    Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 10, e13152 (2022)

  28. [28]

    Wyse, L. Audio spectrogram representations for processing with convolutional neural networks. In Proceedings of the First International Conference on Deep Learning and Music, 37–41 (2017)

  29. [29]

    Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020)

  30. [30]

    Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. (2021)

  31. [31]

    Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, 28492–28518 (2023)

  32. [32]

    Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255 (IEEE, 2009)

  33. [33]

    Gemmeke, J. F. et al. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 776–780 (IEEE, 2017)

  34. [34]

    Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5, 44–53 (2018)

  35. [35]

    Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), 5206–5210 (IEEE, 2015)

  36. [36]

    Longpre, S. et al. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. Preprint at https://arxiv.org/abs/2305.13169 (2023)

  37. [37]

    Chen, H.-Y., Tu, C.-H., Li, Z.-H., Shen, H. & Chao, W.-L. On the importance and applicability of pre-training for federated learning. Int Conf Learn Represent (2022). 2206.11488

  38. [38]

    Liu, X. et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 1–1 (2021)

  39. [39]

    He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 9726–9735 (2020)

  40. [40]

    Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020)

  41. [41]

    Robinson, J. D. et al. Can contrastive learning avoid shortcut solutions? Conference on Neural Information Processing Systems (2021)

  42. [42]

    Tomasev, N. et al. Pushing the limits of self-supervised resnets: Can we outperform supervised learning without labels on imagenet? In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 (2022)

  43. [43]

    Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. E. Big self-supervised models are strong semi-supervised learners. vol. 33, 22243–22255 (2020)

  44. [44]

    Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. vol. 33, 12449–12460 (2020)

  45. [45]

    Saeed, A., Grangier, D. & Zeghidour, N. Contrastive learning of general-purpose audio representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3875–3879 (2021)

  46. [46]

    Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Self-supervised learning for general-purpose audio representation. In 2021 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2021)

  47. [47]

    Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Byol for audio: Exploring pre-trained general-purpose audio representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 137–151 (2023)

  48. [48]

    Brown, T. et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  49. [49]

    Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1, 4171–4186 (2018)

  50. [50]

    Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)

  51. [51]

    Hsu, W.-N. et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)

  52. [52]

    Lin, C.-C., Jaech, A., Li, X., Gormley, M. R. & Eisner, J. Limitations of autoregressive models and their alternatives. In Toutanova, K. et al. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5147–5173 (Association for Computational Linguistics, Online, 2021)

  53. [53]

    Zhu, D., Hedderich, M. A., Zhai, F., Adelani, D. & Klakow, D. Is bert robust to label noise? a study on learning with noisy labels in text classification. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, 62–67 (2022)

  54. [54]

    Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D. & Makedon, F. A survey on contrastive self-supervised learning. Technologies (Basel) 9, 2 (2020)

  55. [55]

    Chung, Y.-A. et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 244–250 (IEEE, 2021)

  56. [56]

    Gong, Y., Chung, Y.-A. & Glass, J. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, 571–575 (2021)

  57. [57]

    Wyatt, S. et al. Environmental sound classification with tiny transformers in noisy edge environments. In 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), 309–314 (IEEE, 2021)

  58. [58]

    Wolters, P., Sizemore, L., Daw, C., Hutchinson, B. & Phillips, L. Proposal-based few-shot sound event detection for speech and environmental sounds with perceivers. Preprint at https://arxiv.org/abs/2107.13616 (2021)

  59. [59]

    You, L., Coyotl, E. P., Gunturu, S. & Van Segbroeck, M. Transformer-based bioacoustic sound event detection on few-shot learning tasks. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023)

  60. [60]

    Robinson, D., Robinson, A. & Akrapongpisak, L. Transferable models for bioacoustics with human language supervision. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1316–1320 (IEEE, 2024)

  61. [61]

    Gu, N. et al. Positive transfer of the whisper speech transformer to human and animal voice activity detection. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7505–7509 (IEEE, 2024)

  62. [62]

    Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30 (2017)

  63. [63]

    Grill, J.-B. et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)

  64. [64]

    Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

  65. [65]

    Baevski, A. et al. data2vec: A general framework for self-supervised learning in speech, vision and language. In Chaudhuri, K. et al. (eds.) Proceedings of the 39th International Conference on Machine Learning, vol. 162 of Proceedings of Machine Learning Research, 1298–1312 (PMLR, 2022)

  66. [66]

    Baevski, A., Babu, A., Hsu, W.-N. & Auli, M. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In International Conference on Machine Learning, 1416–1429 (PMLR, 2023)

  67. [67]

    Song, H., Kim, M., Park, D., Shin, Y. & Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 34, 8135–8153 (2023)

  68. [68]

    Ravanelli, M. & Bengio, Y. Speaker recognition from raw waveform with sincnet. In 2018 IEEE spoken language technology workshop (SLT), 1021–1028 (IEEE, 2018)

  69. [69]

    The kalahari research centre KRC. https://kalahariresearchcentre.org. Accessed: 2024-04-25

  70. [70]

    The MIT license. https://opensource.org/license/mit. Accessed: 2024-04-25

  71. [71]

    Official GitHub repository for animal2vec. https://github.com/livingingroups/animal2vec. Accessed: 2024-04-25

  72. [72]

    Creative Commons Attribution-NonCommercial 4.0 International. https://creativecommons.org/licenses/by-nc/4.0. Accessed: 2024-04-25

  73. [73]

    Schäfer-Zimmermann, J. C. et al. MeerKAT: Meerkat Kalahari Audio Transcripts (2024). URL https://doi.org/10.17617/3.0J0DYB

  74. [74]

    Manser, M. B. The evolution of auditory communication in suricates, Suricata suricatta. Ph.D. thesis, University of Cambridge (1998)

  75. [75]

    Manser, M. B., Jansen, D. A. W. A., Graw, B. & le Roux, A. Vocal complexity in meerkats and other mongoose species. Adv. Stud. Behav. 46, 281–310 (2014)

  76. [76]

    Townsend, S. W., Charlton, B. D. & Manser, M. B. Acoustic cues to identity and predator context in meerkat barks. Anim. Behav. 94, 143–149 (2014)

  77. [77]

    Townsend, S. W., Hollén, L. I. & Manser, M. B. Meerkat close calls encode group-specific signatures, but receivers fail to discriminate. Anim. Behav. 80, 133–138 (2010)

  78. [78]

    Collier, K., Townsend, S. W. & Manser, M. B. Call concatenation in wild meerkats. Anim. Behav. 134, 257–269 (2017)

  79. [79]

    Demartsev, V., Strandburg-Peshkin, A., Ruffner, M. & Manser, M. Vocal turn-taking in meerkat group calling sessions. Curr. Biol. 28, 3661–3666.e3 (2018)

  80. [80]

    Manser, M. B. The acoustic structure of suricates’ alarm calls varies with predator type and the level of response urgency. Proc. Biol. Sci.268, 2315–2324 (2001)

Showing first 80 references.
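
Worked sketch of the event-scoring steps

The four steps quoted in entries 4 to 7 of the reference graph (a 100 ms average-pooling window, a fixed binarization threshold, IOU matching at 0.5, then false-positive and false-negative assignment) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code. The function names, the 100 Hz frame rate, the binarization threshold of 0.5, the best-IOU matching rule, and the choice to average the unsmoothed likelihood inside a span are all assumptions made here for the example.

```python
import numpy as np


def predict_spans(likelihood, frame_rate_hz, window_ms=100.0, threshold=0.5):
    """Average-pool the per-frame likelihood, binarize, return (onset, offset) frame pairs."""
    width = max(1, int(round(window_ms * frame_rate_hz / 1000.0)))
    smoothed = np.convolve(likelihood, np.ones(width) / width, mode="same")
    active = (smoothed >= threshold).astype(int)  # step function of event activity
    edges = np.diff(np.concatenate(([0], active, [0])))
    onsets = np.flatnonzero(edges == 1)
    offsets = np.flatnonzero(edges == -1)  # exclusive end frame
    return [(int(on), int(off)) for on, off in zip(onsets, offsets)]


def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def score_events(likelihood, truth_spans_s, frame_rate_hz, threshold=0.5, min_iou=0.5):
    """Return (matches, false_positives, false_negatives) for one recording.

    Assumption: each predicted span is compared with its best-overlapping
    ground truth event; a ground truth event with no match above min_iou
    counts as a false negative.
    """
    likelihood = np.asarray(likelihood, dtype=float)
    matches, false_pos, hit = [], [], set()
    for on, off in predict_spans(likelihood, frame_rate_hz, threshold=threshold):
        span_s = (on / frame_rate_hz, off / frame_rate_hz)
        ious = [interval_iou(span_s, t) for t in truth_spans_s]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] > min_iou:
            hit.add(best)
            # Final scalar likelihood: mean model output inside the predicted span.
            matches.append((span_s, ious[best], float(likelihood[on:off].mean())))
        else:
            false_pos.append(span_s)  # no counterpart, or insufficient IOU
    false_neg = [t for i, t in enumerate(truth_spans_s) if i not in hit]
    return matches, false_pos, false_neg


# Toy input (all values hypothetical): one likelihood value every 10 ms.
frame_rate_hz = 100
likelihood = np.zeros(100)
likelihood[20:40] = 0.9  # a detection that overlaps an annotated call
likelihood[70:80] = 0.9  # a detection with no annotation
truth = [(0.20, 0.40), (0.50, 0.55)]  # the second event is never detected
matches, false_pos, false_neg = score_events(likelihood, truth, frame_rate_hz)
```

In this toy input the first detection is matched to the first annotated event, the second detection has no ground truth counterpart and becomes a false positive, and the undetected second annotation becomes a false negative. The smoothing window trims roughly one frame from the edge of each detected span, which is why the matched IOU is slightly below 1.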