Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

Houtan Ghaffari; Lukas Rauch; Paul Devos

arxiv: 2511.12158 · v3 · pith:UZGBOAACnew · submitted 2025-11-15 · 💻 cs.LG

Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

Houtan Ghaffari , Lukas Rauch , Paul Devos This is my paper

Pith reviewed 2026-05-21 19:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-supervised learningbirdsong syllable detectiondata-efficient annotationbioacousticsCanary songBengalese Finchmasked predictionsemi-supervised refinement

0 comments

The pith

A three-stage self-supervised pipeline produces reliable syllable detectors for complex birdsong from very few labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Residual Multi-Layer Perceptron Recurrent Neural Network for syllable-level birdsong annotation. It describes a three-stage process that begins with self-supervised pretraining on unlabeled audio through masked prediction or online clustering. Supervised training with data augmentation then builds a frame-level detector from limited labels. A final semi-supervised refinement step improves the model using additional unlabeled recordings. The pipeline succeeds on the demanding Canary song and generalizes to Bengalese Finch, lowering annotation demands for bioacoustic research.

Core claim

The authors establish that self-supervised pretraining on unlabeled Canary and Finch audio, followed by supervised training with augmentation and semi-supervised post-training, yields accurate frame-level syllable detectors even under extreme label scarcity, succeeding on songs marked by rapid vocalizations, brief intervals, broadband sweeps, and spectrally similar syllables that demand fine-grained distinctions.

What carries the argument

The three-stage training pipeline of self-supervised pretraining via masked prediction or online clustering, supervised training with augmentation, and semi-supervised refinement applied to a Residual Multi-Layer Perceptron Recurrent Neural Network.

If this is right

Accurate syllable-level annotation of individual birds becomes practical with far smaller labeled datasets than previously required.
The pipeline supplies a working baseline for annotating songs from additional bird species that share rapid and spectrally complex patterns.
Self-supervised embeddings produced in the first stage support both linear probing for specific tasks and fully unsupervised exploration of song structure.
Annotation costs drop sharply for studies in bioacoustics, neuroscience, and linguistics that depend on detailed syllable parsing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged approach could lower labeling needs when analyzing vocalizations of other animals such as whales or primates.
Adaptations of the pipeline might aid low-resource audio classification tasks beyond birdsong.
Testing alternative self-supervised objectives could reveal which pretraining method best isolates fine spectral differences.

Load-bearing premise

Self-supervised pretraining on unlabeled audio captures the fine-grained spectro-temporal features needed to tell apart spectrally similar syllables in rapid sequences with brief gaps.

What would settle it

A controlled test on Canary song showing that the three-stage model achieves no higher syllable detection accuracy than a purely supervised baseline trained on the identical small labeled set would falsify the value of the self-supervised stages.

Figures

Figures reproduced from arXiv: 2511.12158 by Houtan Ghaffari, Lukas Rauch, Paul Devos.

**Figure 1.** Figure 1: An example of syllable prediction using the proposed model and the three-stage training framework. repetitions. To be more precise, notice that all syllable types are not present in a single recording or song. Additionally, the recordings have variable durations, ranging from 1 to 40 seconds. This few-shot subset is the minimum number of recording files that ensures each syllable is vocalized at least once… view at source ↗

**Figure 2.** Figure 2: The proposed Res-MLP-RNN neural network architecture with an input spectrogram example. The model has roughly 10 M parameters. The Masked Prediction and Online Clustering heads are used in two separate self-supervised pretraining tasks. The Classifier head is used for supervised and post-training semi-supervised syllable detection tasks. The first linear layer of the first block projects 256 frequency bins… view at source ↗

**Figure 3.** Figure 3: An example from the birdsong MAE model for the masked prediction task. The model is used in different ways for experiments and ablation studies. However, the proposed training framework has three clear stages. First, pretrain the model on all available species datasets, three Canaries in this work, using either MAE or OSC. Second, after pretraining, replace the SSL head with a linear classifier for supervi… view at source ↗

**Figure 4.** Figure 4: Online Syllable Clustering loss. The model is trained for 200 epochs using the AdamW optimizer61 with a weight decay of 5e−2, a linear learning rate warmup from 1e−6 to 5e−4 in 20 epochs, then, a constant rate for 10 epochs, and finally, a cosine decay for the remaining 170 epochs to the minimum learning rate of 1e−6. The experiment is conducted once, and the resulting SSL model is used in all subsequent … view at source ↗

**Figure 5.** Figure 5: Two syllables from each bird with their true and predicted distribution of duration across training sizes and models. MAE means the model was pretrained by the SSL masked prediction prior to the supervised finetuning. Equivalently, OSC refers to the Online Syllable Clustering pretraining task. The predictions are faithful, even with such small training sizes. dimensions using Principal Component Analysis (… view at source ↗

**Figure 6.** Figure 6: Syllable transition matrices for the llb3 canary across different SSL pretraining tasks and train set sizes. Even with few-shot finetuning, the main structure of the syllable transition is visible. The True Matrix is the same in both rows, and it is calculated based on human annotation. datasets, can be open-sourced for off-the-shelf usage in the future. Then, the expert can perform a similar clustering an… view at source ↗

**Figure 7.** Figure 7: T-SNE plots of the clustered SSL embeddings of syllables after re-labeling via majority vote. To avoid cluttering, the plots show the density contours of each cluster, estimated from 2000 random syllables at maximum. and established the proposed model and three-stage training framework as a suitable method for birdsong analysis. The Application section further expanded on the utility of SSL models. As alre… view at source ↗

read the original abstract

Research in bioacoustics, neuroscience, and linguistics often uses birdsong as a proxy to acquire knowledge across diverse areas. This requires audio models to annotate and parse the birdsong. Developing such models requires precise, syllable-level annotated training data. Therefore, automated methods that reduce annotation costs are in demand. This work presents a data-efficient birdsong annotator called Residual Multi-Layer Perceptron Recurrent Neural Network. It then presents a three-stage training pipeline for developing reliable birdsong syllable detectors with minimal annotation. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentation to produce a robust frame-level syllable detector for each individual. The third stage is a semi-supervised post-training step that refines each individual's model using unlabeled data. The effectiveness of this approach is demonstrated for the Canary song in extreme label-scarcity scenarios. From a signal-processing perspective, the Canary song exhibits one of the most challenging spectro-temporal patterns for algorithmic time-series annotation: rapid vocalizations, brief inter-syllabic intervals, fast and broadband frequency sweeps, and spectrally similar syllables that require fine-grained features to distinguish. Hence, a successful syllable detection algorithm for Canary also establishes a robust baseline for other birds. This methodological generalization is validated in a case study of Bengalese Finch song annotation. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The three-stage pipeline with Residual MLP-RNN delivers usable low-label syllable detection on hard Canary data and checks out on Finch, with quantitative metrics that support the claims.

read the letter

The colleague should know that this paper's three-stage pipeline—self-supervised pretraining via masked prediction or online clustering, then augmented supervised training, then semi-supervised refinement—produces measurable syllable detectors for Canary song even with single-digit minutes of labels, and it generalizes in a case study to Bengalese Finch. The architecture is a Residual MLP-RNN tuned to fine-grained spectro-temporal features like rapid sweeps and similar syllables. Results include frame-level F1, boundary error, and direct comparisons to supervised baselines plus prior birdsong tools. Self-supervised embeddings lift linear probing accuracy, and the refinement step improves held-out performance. The stress-test details confirm the methods section spells out objectives, augmentations, and controls with no internal inconsistencies. That is concrete evidence for the central claim once you move past the abstract. The work is limited to two species, so transfer to birds with very different vocal patterns stays untested. Hyperparameters for pretraining and fine-tuning are left open, which is standard but means new users will tune. This is for bioacoustics and neuroscience groups that annotate birdsong under label constraints. Readers working on self-supervised audio or animal vocalization tasks would pick up a practical pipeline and a solid stress-test case. The combination of domain challenge, quantitative results, and clear motivation makes it worth a full referee process. I would send it to peer review. The reported data line up with what the authors claim.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Residual MLP-RNN architecture together with a three-stage training pipeline for data-efficient syllable-level birdsong annotation. Stage 1 performs self-supervised pretraining on unlabeled audio via masked prediction or online clustering; Stage 2 trains a frame-level detector with data augmentation; Stage 3 applies semi-supervised refinement on additional unlabeled segments. Effectiveness is shown for Canary song under extreme label scarcity (down to single-digit annotated minutes) and validated on Bengalese Finch, with quantitative metrics including frame-level F1 and syllable boundary error.

Significance. If the reported gains hold, the work supplies a practical route to lower annotation costs in bioacoustics and neuroscience while handling one of the most demanding vocalization patterns (rapid sweeps, spectrally similar syllables). Credit is due for the explicit stress-test design on Canary, the provision of exact label counts, augmentation details, and linear-probing results that demonstrate the value of the self-supervised embeddings over random initialization.

major comments (2)

[§4] §4 (Results): the manuscript states that the semi-supervised refinement yields measurable improvement on held-out unlabeled segments, yet the precise procedure for generating and filtering pseudo-labels (threshold, selection criterion) is not specified; this detail is load-bearing for assessing whether error propagation is controlled in the extreme-scarcity regime.
[Table 2] Table 2 (Canary scarcity experiments): while comparisons to supervised baselines and prior tools are presented, the exact number of annotated minutes corresponding to each reported F1 score must be stated explicitly in the table caption or a dedicated column so that the 'extreme label-scarcity' claim can be evaluated quantitatively.

minor comments (2)

[Abstract] Abstract: no numerical results (F1, boundary error, or label counts) are given, which weakens the summary of the central claim.
[§3.1] §3.1: the description of the Residual MLP-RNN would benefit from an explicit equation or diagram showing the residual connections and RNN integration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses

Referee: [§4] §4 (Results): the manuscript states that the semi-supervised refinement yields measurable improvement on held-out unlabeled segments, yet the precise procedure for generating and filtering pseudo-labels (threshold, selection criterion) is not specified; this detail is load-bearing for assessing whether error propagation is controlled in the extreme-scarcity regime.

Authors: We agree that the precise procedure for generating and filtering pseudo-labels must be specified to allow proper evaluation of error propagation control. The current manuscript describes Stage 3 at a high level but does not detail the threshold or selection criterion. In the revised version we will add an explicit description of the pseudo-labeling process, including the confidence threshold applied and the filtering rule used to select reliable segments. revision: yes
Referee: [Table 2] Table 2 (Canary scarcity experiments): while comparisons to supervised baselines and prior tools are presented, the exact number of annotated minutes corresponding to each reported F1 score must be stated explicitly in the table caption or a dedicated column so that the 'extreme label-scarcity' claim can be evaluated quantitatively.

Authors: We acknowledge that placing the exact annotated-minute counts directly in Table 2 would make the scarcity experiments easier to evaluate at a glance. Although these quantities appear in the surrounding text, we will revise the table by adding a dedicated column (or updating the caption) to list the precise number of annotated minutes for each F1 score. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical pipeline

full rationale

The paper presents a three-stage empirical pipeline (self-supervised pretraining via masked prediction or online clustering on unlabeled audio, followed by supervised training with augmentation, then semi-supervised refinement) for syllable detection. No equations, fitted parameters, or derivations are described that reduce the reported metrics (frame-level F1, boundary error) to inputs by construction. Claims rest on quantitative comparisons to supervised baselines and prior tools, with explicit label-scarcity experiments and cross-species validation. No self-citation load-bearing steps or ansatz smuggling appear in the provided methods description; the architecture and objectives follow standard ML practices without internal reduction to the target results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard self-supervised learning assumptions plus domain-specific expectations about birdsong spectro-temporal structure; limited information available from abstract alone.

free parameters (1)

Pretraining and fine-tuning hyperparameters
Learning rates, masking ratios, cluster counts, and augmentation strengths are chosen or fitted during the three stages.

axioms (1)

domain assumption Unlabeled birdsong recordings contain sufficient structure for masked prediction or clustering to learn features useful for syllable discrimination.
Invoked by the first stage of the pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5807 in / 1169 out tokens · 70529 ms · 2026-05-21T19:27:41.841405+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three-stage training pipeline... self-supervised pretraining via masked prediction or online clustering... Residual Multi-Layer Perceptron Recurrent Neural Network
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

linear probing results... clustering results... syllable transition matrices

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

[1]

C., Okanoya, K., Beckers, G

Berwick, R. C., Okanoya, K., Beckers, G. J. & Bolhuis, J. J. Songs to syntax: the linguistics of birdsong.Trends cognitive sciences15, 113–121 (2011)

work page 2011
[2]

Mets, D. G. & Brainard, M. S. Learning is enhanced by tailoring instruction to individual genetic differences.Elife8, e47216 (2019)

work page 2019
[3]

D., Day, N

Burkett, Z. D., Day, N. F., Peñagarikano, O., Geschwind, D. H. & White, S. A. V oice: A semi-automated pipeline for standardizing vocal analysis across models.Sci. reports5, 10237 (2015)

work page 2015
[4]

Cohen, Y .et al.Automated annotation of birdsong with a neural network that segments spectrograms.Elife11, e63853 (2022)

work page 2022
[5]

& Tachibana, R

Morita, T., Koda, H., Okanoya, K. & Tachibana, R. O. Measuring context dependency in birdsong using artificial neural networks.PLoS computational biology17, e1009707 (2021)

work page 2021
[6]

& Webster, M

Podos, J. & Webster, M. S. Ecology and evolution of bird sounds.Curr. Biol.32, R1100–R1104, DOI: https://doi.org/10. 1016/j.cub.2022.07.073 (2022)

work page 2022
[7]

& Edeline, J.-M

Huetz, C., Del Negro, C., Lehongre, K., Tarroux, P. & Edeline, J.-M. The selectivity of canary hvc neurons for the bird’s own song: Rate coding, temporal coding, or both?J. Physiol.98, 395–406, DOI: https://doi.org/10.1016/j.jphysparis.2005. 09.011 (2004). Decoding and interfacing the brain: from neuronal assemblies to cyborgs

work page doi:10.1016/j.jphysparis.2005 2005
[8]

& Devos, P

Ghaffari, H. & Devos, P. Consistent birdsong syllable segmentation using deep semi-supervised learning. InProceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, 6277–6284, DOI: 10.61782/fa.2023. 0897 (2023)

work page doi:10.61782/fa.2023 2023
[9]

& Plumbley, M

Stowell, D. & Plumbley, M. D. Birdsong and c4dm: A survey of uk birdsong and machine recognition for music researchers. Centre for Digit. Music. Queen Mary Univ. London, Tech. Rep. C4DM-TR-09-12(2010)

work page 2010
[10]

& Gentner, T

Sainburg, T., Thielk, M. & Gentner, T. Q. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires.PLoS computational biology16, e1008228 (2020)

work page 2020
[11]

& Saino, N

Boncoraglio, G. & Saino, N. Habitat structure and the evolution of bird song: a meta-analysis of the evidence for the acoustic adaptation hypothesis.Funct. Ecol.134–142 (2007)

work page 2007
[12]

Morfi, V ., Lachlan, R. F. & Stowell, D. Deep perceptual embeddings for unlabelled animal sound events.The J. Acoust. Soc. Am.150, 2–11 (2021)

work page 2021
[13]

& Devos, P

Ghaffari, H. & Devos, P. Robust weakly supervised bird species detection via peak aggregation and pie.IEEE Transactions on Audio, Speech Lang. Process.33, 1427–1439, DOI: 10.1109/TASLPRO.2025.3552983 (2025). 15.Rauch, L.et al.Can masked autoencoders also listen to birds?Transactions on Mach. Learn. Res.(2025). 16.Rauch, L.et al.Unmute the patch tokens: Re...

work page doi:10.1109/taslpro.2025.3552983 2025
[14]

& Toutanova, K

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 4171–4186 (2019)

work page 2019
[15]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009 (2022)

He, K.et al.Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009 (2022). 20.Huang, P.-Y .et al.Masked autoencoders that listen.Adv. Neural Inf. Process. Syst.35, 28708–28720 (2022). 21.Vaswani, A.et al.Attention is all you need.Adv. neural information processing ...

work page 2022
[16]

InInternational Conference on Learning Representations(2021)

Dosovitskiy, A.et al.An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations(2021)

work page 2021
[17]

& Douze, M

Caron, M., Bojanowski, P., Joulin, A. & Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), 132–149 (2018)

work page 2018
[18]

neural information processing systems33, 9912–9924 (2020)

Caron, M.et al.Unsupervised learning of visual features by contrasting cluster assignments.Adv. neural information processing systems33, 9912–9924 (2020)

work page 2020
[19]

InProceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

Caron, M.et al.Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

work page 2021
[20]

YM., A., C., R. & A., V . Self-labelling via simultaneous clustering and representation learning. InInternational Conference on Learning Representations(2020)

work page 2020
[21]

InEuropean conference on computer vision, 456–473 (Springer, 2022)

Assran, M.et al.Masked siamese networks for label-efficient learning. InEuropean conference on computer vision, 456–473 (Springer, 2022)

work page 2022
[22]

In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol

Chen, S.et al.BEATs: Audio pre-training with acoustic tokenizers. In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, 5178–5193 (PMLR, 2023)

work page 2023
[23]

Oquab, M.et al.DINOv2: Learning robust visual features without supervision.Transactions on Mach. Learn. Res.(2024). Featured Certification. 30.Murphy, K. P.Probabilistic machine learning: an introduction(MIT press, 2022)

work page 2024
[24]

& Bengio, Y

Grandvalet, Y . & Bengio, Y . Semi-supervised learning by entropy minimization.Adv. neural information processing systems17(2004)

work page 2004
[25]

In Workshop on challenges in representation learning, ICML, vol

Lee, D.-H.et al.Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, vol. 3, 896 (Atlanta, 2013)

work page 2013
[26]

neural information processing systems33, 596–608 (2020)

Sohn, K.et al.Fixmatch: Simplifying semi-supervised learning with consistency and confidence.Adv. neural information processing systems33, 596–608 (2020)

work page 2020
[27]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[28]

& Valpola, H

Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Adv. neural information processing systems30(2017)

work page 2017
[29]

& Garamszegi, L

Soma, M. & Garamszegi, L. Z. Rethinking birdsong evolution: meta-analysis of the relationship between song complexity and reproductive success.Behav. Ecol.22, 363–371 (2011)

work page 2011
[30]

J., Bolhuis, J

Terpstra, N. J., Bolhuis, J. J. & den Boer-Visser, A. M. An analysis of the neural representation of birdsong memory.J. Neurosci.24, 4971–4977 (2004). 38.Williams, H. Birdsong and singing behavior.Annals New York Acad. Sci.1016, 1–30 (2004)

work page 2004
[31]

& Tchernichovski, O

Lipkind, D. & Tchernichovski, O. Quantification of developmental birdsong learning from the subsyllabic scale to cultural evolution.Proc. Natl. Acad. Sci.108, 15572–15579 (2011)

work page 2011
[32]

Sober, S. J. & Brainard, M. S. Adult birdsong is actively maintained by error correction.Nat. neuroscience12, 927–931 (2009)

work page 2009
[33]

neural information processing systems33, 21271–21284 (2020)

Grill, J.-B.et al.Bootstrap your own latent-a new approach to self-supervised learning.Adv. neural information processing systems33, 21271–21284 (2020)

work page 2020
[34]

& Hinton, G

Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, 1597–1607 (PmLR, 2020)

work page 2020
[35]

InInternational Conference on Learning Representations (ICLR)(2025)

Rauch, L.et al.BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics. InInternational Conference on Learning Representations (ICLR)(2025)

work page 2025
[36]

Sinkhorn distances: Lightspeed computation of optimal transport.Adv

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport.Adv. neural information processing systems 26(2013)

work page 2013
[37]

A Convex Relaxation for Weakly Supervised Classifiers

He, K., Fan, H., Wu, Y ., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738 (2020). 46.Zhou, J.et al.ibot: Image bert pre-training with online tokenizer.Int. Conf. on Learn. Represent. (ICLR)(2022). 16/17 47.Joulin, A. & ...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[38]

& Bertram, R

Daou, A., Johnson, F., Wu, W. & Bertram, R. A computational tool for automated large-scale analysis and measurement of bird-song syntax.J. neuroscience methods210, 147–160 (2012)

work page 2012
[39]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16133–16142 (2023)

Woo, S.et al.Convnext v2: Co-designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16133–16142 (2023)

work page 2023
[40]

& Devos, P

Ghaffari, H. & Devos, P. On the role of audio frontends in bird species recognition.Ecol. Informatics81, 102573, DOI: https://doi.org/10.1016/j.ecoinf.2024.102573 (2024)

work page doi:10.1016/j.ecoinf.2024.102573 2024
[41]

M., Felix, L

Ferreira-Paiva, L., Alfaro-Espinoza, E., Almeida, V . M., Felix, L. B. & Neves, R. V . A survey of data augmentation for audio classification. InCongresso Brasileiro de Automática-CBA, vol. 3 (2022)

work page 2022
[42]

Zollinger, S. A. & Brumm, H. Why birds sing loud songs and why they sometimes don’t.Animal Behav.105, 289–295, DOI: https://doi.org/10.1016/j.anbehav.2015.03.030 (2015). 53.Park, D. S.et al.Specaugment: A simple data augmentation method for automatic speech recognition.Interspeech 2019 DOI: 10.21437/interspeech.2019-2680 (2019)

work page doi:10.1016/j.anbehav.2015.03.030 2015
[43]

Layer Normalization

Hochreiter, S. & Schmidhuber, J. Long short-term memory.Neural Comput.9, 1735–1780, DOI: 10.1162/neco.1997.9.8. 1735 (1997). 55.Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization.arXiv preprint arXiv:1607.06450(2016)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/neco.1997.9.8 1997
[44]

Ramachandran, P., Zoph, B. & Le, Q. V . Swish: a self-gated activation function.arXiv preprint arXiv:1710.059417, 5 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

& Salakhutdinov, R

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting.The journal machine learning research15, 1929–1958 (2014)

work page 1929
[46]

Adam: A Method for Stochastic Optimization

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016). 59.Kingma, D. P. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980(2014)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[47]

Breiman, L., Friedman, J., Olshen, R. A. & Stone, C. J.Classification and regression trees(Chapman and Hall/CRC, 2017)

work page 2017
[48]

& Hutter, F

Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. InInternational Conference on Learning Representations (2019)

work page 2019
[49]

X., Epps, J

Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? InProceedings of the 26th annual international conference on machine learning, 1073–1080 (2009)

work page 2009
[50]

& Jarvis, E

Tchernichovski, O., Eisenberg-Edidin, S. & Jarvis, E. D. Balanced imitation sustains song culture in zebra finches.Nat. communications12, 2562 (2021). Author contributions statement H.G. conceptualized the work, reviewed the literature, conducted the experiments, analyzed the results, made the figures, validated the results and arguments, and wrote the or...

work page 2021

[1] [1]

C., Okanoya, K., Beckers, G

Berwick, R. C., Okanoya, K., Beckers, G. J. & Bolhuis, J. J. Songs to syntax: the linguistics of birdsong.Trends cognitive sciences15, 113–121 (2011)

work page 2011

[2] [2]

Mets, D. G. & Brainard, M. S. Learning is enhanced by tailoring instruction to individual genetic differences.Elife8, e47216 (2019)

work page 2019

[3] [3]

D., Day, N

Burkett, Z. D., Day, N. F., Peñagarikano, O., Geschwind, D. H. & White, S. A. V oice: A semi-automated pipeline for standardizing vocal analysis across models.Sci. reports5, 10237 (2015)

work page 2015

[4] [4]

Cohen, Y .et al.Automated annotation of birdsong with a neural network that segments spectrograms.Elife11, e63853 (2022)

work page 2022

[5] [5]

& Tachibana, R

Morita, T., Koda, H., Okanoya, K. & Tachibana, R. O. Measuring context dependency in birdsong using artificial neural networks.PLoS computational biology17, e1009707 (2021)

work page 2021

[6] [6]

& Webster, M

Podos, J. & Webster, M. S. Ecology and evolution of bird sounds.Curr. Biol.32, R1100–R1104, DOI: https://doi.org/10. 1016/j.cub.2022.07.073 (2022)

work page 2022

[7] [7]

& Edeline, J.-M

Huetz, C., Del Negro, C., Lehongre, K., Tarroux, P. & Edeline, J.-M. The selectivity of canary hvc neurons for the bird’s own song: Rate coding, temporal coding, or both?J. Physiol.98, 395–406, DOI: https://doi.org/10.1016/j.jphysparis.2005. 09.011 (2004). Decoding and interfacing the brain: from neuronal assemblies to cyborgs

work page doi:10.1016/j.jphysparis.2005 2005

[8] [8]

& Devos, P

Ghaffari, H. & Devos, P. Consistent birdsong syllable segmentation using deep semi-supervised learning. InProceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, 6277–6284, DOI: 10.61782/fa.2023. 0897 (2023)

work page doi:10.61782/fa.2023 2023

[9] [9]

& Plumbley, M

Stowell, D. & Plumbley, M. D. Birdsong and c4dm: A survey of uk birdsong and machine recognition for music researchers. Centre for Digit. Music. Queen Mary Univ. London, Tech. Rep. C4DM-TR-09-12(2010)

work page 2010

[10] [10]

& Gentner, T

Sainburg, T., Thielk, M. & Gentner, T. Q. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires.PLoS computational biology16, e1008228 (2020)

work page 2020

[11] [11]

& Saino, N

Boncoraglio, G. & Saino, N. Habitat structure and the evolution of bird song: a meta-analysis of the evidence for the acoustic adaptation hypothesis.Funct. Ecol.134–142 (2007)

work page 2007

[12] [12]

Morfi, V ., Lachlan, R. F. & Stowell, D. Deep perceptual embeddings for unlabelled animal sound events.The J. Acoust. Soc. Am.150, 2–11 (2021)

work page 2021

[13] [13]

& Devos, P

Ghaffari, H. & Devos, P. Robust weakly supervised bird species detection via peak aggregation and pie.IEEE Transactions on Audio, Speech Lang. Process.33, 1427–1439, DOI: 10.1109/TASLPRO.2025.3552983 (2025). 15.Rauch, L.et al.Can masked autoencoders also listen to birds?Transactions on Mach. Learn. Res.(2025). 16.Rauch, L.et al.Unmute the patch tokens: Re...

work page doi:10.1109/taslpro.2025.3552983 2025

[14] [14]

& Toutanova, K

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 4171–4186 (2019)

work page 2019

[15] [15]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009 (2022)

He, K.et al.Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009 (2022). 20.Huang, P.-Y .et al.Masked autoencoders that listen.Adv. Neural Inf. Process. Syst.35, 28708–28720 (2022). 21.Vaswani, A.et al.Attention is all you need.Adv. neural information processing ...

work page 2022

[16] [16]

InInternational Conference on Learning Representations(2021)

Dosovitskiy, A.et al.An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations(2021)

work page 2021

[17] [17]

& Douze, M

Caron, M., Bojanowski, P., Joulin, A. & Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), 132–149 (2018)

work page 2018

[18] [18]

neural information processing systems33, 9912–9924 (2020)

Caron, M.et al.Unsupervised learning of visual features by contrasting cluster assignments.Adv. neural information processing systems33, 9912–9924 (2020)

work page 2020

[19] [19]

InProceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

Caron, M.et al.Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (2021)

work page 2021

[20] [20]

YM., A., C., R. & A., V . Self-labelling via simultaneous clustering and representation learning. InInternational Conference on Learning Representations(2020)

work page 2020

[21] [21]

InEuropean conference on computer vision, 456–473 (Springer, 2022)

Assran, M.et al.Masked siamese networks for label-efficient learning. InEuropean conference on computer vision, 456–473 (Springer, 2022)

work page 2022

[22] [22]

In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol

Chen, S.et al.BEATs: Audio pre-training with acoustic tokenizers. In Krause, A.et al.(eds.)Proceedings of the 40th International Conference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, 5178–5193 (PMLR, 2023)

work page 2023

[23] [23]

Oquab, M.et al.DINOv2: Learning robust visual features without supervision.Transactions on Mach. Learn. Res.(2024). Featured Certification. 30.Murphy, K. P.Probabilistic machine learning: an introduction(MIT press, 2022)

work page 2024

[24] [24]

& Bengio, Y

Grandvalet, Y . & Bengio, Y . Semi-supervised learning by entropy minimization.Adv. neural information processing systems17(2004)

work page 2004

[25] [25]

In Workshop on challenges in representation learning, ICML, vol

Lee, D.-H.et al.Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, vol. 3, 896 (Atlanta, 2013)

work page 2013

[26] [26]

neural information processing systems33, 596–608 (2020)

Sohn, K.et al.Fixmatch: Simplifying semi-supervised learning with consistency and confidence.Adv. neural information processing systems33, 596–608 (2020)

work page 2020

[27] [27]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[28] [28]

& Valpola, H

Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Adv. neural information processing systems30(2017)

work page 2017

[29] [29]

& Garamszegi, L

Soma, M. & Garamszegi, L. Z. Rethinking birdsong evolution: meta-analysis of the relationship between song complexity and reproductive success.Behav. Ecol.22, 363–371 (2011)

work page 2011

[30] [30]

J., Bolhuis, J

Terpstra, N. J., Bolhuis, J. J. & den Boer-Visser, A. M. An analysis of the neural representation of birdsong memory.J. Neurosci.24, 4971–4977 (2004). 38.Williams, H. Birdsong and singing behavior.Annals New York Acad. Sci.1016, 1–30 (2004)

work page 2004

[31] [31]

& Tchernichovski, O

Lipkind, D. & Tchernichovski, O. Quantification of developmental birdsong learning from the subsyllabic scale to cultural evolution.Proc. Natl. Acad. Sci.108, 15572–15579 (2011)

work page 2011

[32] [32]

Sober, S. J. & Brainard, M. S. Adult birdsong is actively maintained by error correction.Nat. neuroscience12, 927–931 (2009)

work page 2009

[33] [33]

neural information processing systems33, 21271–21284 (2020)

Grill, J.-B.et al.Bootstrap your own latent-a new approach to self-supervised learning.Adv. neural information processing systems33, 21271–21284 (2020)

work page 2020

[34] [34]

& Hinton, G

Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, 1597–1607 (PmLR, 2020)

work page 2020

[35] [35]

InInternational Conference on Learning Representations (ICLR)(2025)

Rauch, L.et al.BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics. InInternational Conference on Learning Representations (ICLR)(2025)

work page 2025

[36] [36]

Sinkhorn distances: Lightspeed computation of optimal transport.Adv

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport.Adv. neural information processing systems 26(2013)

work page 2013

[37] [37]

A Convex Relaxation for Weakly Supervised Classifiers

He, K., Fan, H., Wu, Y ., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738 (2020). 46.Zhou, J.et al.ibot: Image bert pre-training with online tokenizer.Int. Conf. on Learn. Represent. (ICLR)(2022). 16/17 47.Joulin, A. & ...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[38] [38]

& Bertram, R

Daou, A., Johnson, F., Wu, W. & Bertram, R. A computational tool for automated large-scale analysis and measurement of bird-song syntax.J. neuroscience methods210, 147–160 (2012)

work page 2012

[39] [39]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16133–16142 (2023)

Woo, S.et al.Convnext v2: Co-designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16133–16142 (2023)

work page 2023

[40] [40]

& Devos, P

Ghaffari, H. & Devos, P. On the role of audio frontends in bird species recognition.Ecol. Informatics81, 102573, DOI: https://doi.org/10.1016/j.ecoinf.2024.102573 (2024)

work page doi:10.1016/j.ecoinf.2024.102573 2024

[41] [41]

M., Felix, L

Ferreira-Paiva, L., Alfaro-Espinoza, E., Almeida, V . M., Felix, L. B. & Neves, R. V . A survey of data augmentation for audio classification. InCongresso Brasileiro de Automática-CBA, vol. 3 (2022)

work page 2022

[42] [42]

Zollinger, S. A. & Brumm, H. Why birds sing loud songs and why they sometimes don’t.Animal Behav.105, 289–295, DOI: https://doi.org/10.1016/j.anbehav.2015.03.030 (2015). 53.Park, D. S.et al.Specaugment: A simple data augmentation method for automatic speech recognition.Interspeech 2019 DOI: 10.21437/interspeech.2019-2680 (2019)

work page doi:10.1016/j.anbehav.2015.03.030 2015

[43] [43]

Layer Normalization

Hochreiter, S. & Schmidhuber, J. Long short-term memory.Neural Comput.9, 1735–1780, DOI: 10.1162/neco.1997.9.8. 1735 (1997). 55.Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization.arXiv preprint arXiv:1607.06450(2016)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/neco.1997.9.8 1997

[44] [44]

Ramachandran, P., Zoph, B. & Le, Q. V . Swish: a self-gated activation function.arXiv preprint arXiv:1710.059417, 5 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[45] [45]

& Salakhutdinov, R

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting.The journal machine learning research15, 1929–1958 (2014)

work page 1929

[46] [46]

Adam: A Method for Stochastic Optimization

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016). 59.Kingma, D. P. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980(2014)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[47] [47]

Breiman, L., Friedman, J., Olshen, R. A. & Stone, C. J.Classification and regression trees(Chapman and Hall/CRC, 2017)

work page 2017

[48] [48]

& Hutter, F

Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. InInternational Conference on Learning Representations (2019)

work page 2019

[49] [49]

X., Epps, J

Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? InProceedings of the 26th annual international conference on machine learning, 1073–1080 (2009)

work page 2009

[50] [50]

& Jarvis, E

Tchernichovski, O., Eisenberg-Edidin, S. & Jarvis, E. D. Balanced imitation sustains song culture in zebra finches.Nat. communications12, 2562 (2021). Author contributions statement H.G. conceptualized the work, reviewed the literature, conducted the experiments, analyzed the results, made the figures, validated the results and arguments, and wrote the or...

work page 2021