Recognition: 2 theorem links
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
Pith reviewed 2026-05-15 05:41 UTC · model grok-4.3
The pith
For fine-grained bioacoustic classification with limited labels, pretraining on large general audio datasets beats additional domain-specific masked autoencoder training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models. Selective data filtering offers a negligible advantage when the overall data scale is limited. In moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.
What carries the argument
Masked autoencoder pretraining applied to audio spectrograms, with systematic variation of pretraining data scale and domain for downstream species classification on iNatSounds.
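To make the machinery concrete, here is a minimal PyTorch sketch of one masked-spectrogram pretraining step. The patch size, mask ratio, widths, and depth are illustrative assumptions, not the paper's reported configuration, and real MAE implementations drop masked tokens from the encoder rather than zeroing them.

```python
# Minimal sketch of one masked-autoencoder step on a log-mel spectrogram.
# Patch size, mask ratio, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

patchify = nn.Conv2d(1, 192, kernel_size=16, stride=16)  # 16x16 spectrogram patches
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(192, 16 * 16)                     # predict raw pixels of each patch

spec = torch.randn(8, 1, 128, 512)                    # (batch, 1, mel bins, frames)
tokens = patchify(spec).flatten(2).transpose(1, 2)    # (batch, N=256 patches, 192)
mask = torch.rand(tokens.shape[:2]) < 0.75            # hide 75% of patches

# Real MAE drops masked tokens before the encoder; zeroing keeps the sketch short.
latent = encoder(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
pred = decoder(latent)                                # (batch, N, 256)

target = nn.functional.unfold(spec, kernel_size=16, stride=16).transpose(1, 2)
loss = ((pred - target) ** 2).mean(dim=-1)[mask].mean()  # loss on masked patches only
loss.backward()
```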
If this is right
- Off-the-shelf general-audio models should be the default starting point for bioacoustic tasks with moderate data.
- Additional masked autoencoder pretraining on limited domain data is not worth the compute when general pretraining already exists.
- Data curation steps such as filtering add little value once overall pretraining volume is constrained.
- Similar patterns are likely in other fine-grained audio domains that rely on weakly labeled recordings.
Where Pith is reading between the lines
- The finding suggests that for other weakly labeled audio tasks, simply scaling general pretraining may be more reliable than designing new self-supervised objectives.
- It raises the question of whether masked reconstruction loses fine species distinctions that supervised signals from broad audio corpora preserve.
- Practitioners could test whether the same scale-over-objective pattern appears when the target domain has even fewer total hours of audio.
Load-bearing premise
Observed performance gaps arise mainly from differences in pretraining data volume rather than from variations in model capacity, optimizer settings, or label noise in the weakly annotated recordings.
What would settle it
A controlled experiment in which model size, optimizer, and training schedule are matched exactly yet domain-specific masked autoencoder pretraining on a larger bioacoustic corpus still underperforms general-audio pretraining.
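A hedged sketch of what such a controlled grid could look like; every name and value below is a hypothetical placeholder, not the paper's setup:

```python
# Hold architecture, optimizer, schedule, and budget fixed; vary only the
# pretraining corpus and the random seed. All values are placeholders.
from itertools import product

fixed = {"arch": "vit_base", "optimizer": "adamw", "schedule": "cosine",
         "pretrain_epochs": 100, "batch_size": 256}
corpora = ["general_audio_large", "bioacoustic_large", "bioacoustic_filtered"]
seeds = [0, 1, 2]

runs = [{**fixed, "corpus": c, "seed": s} for c, s in product(corpora, seeds)]
for run in runs:
    print(run["corpus"], run["seed"])  # stand-in for launching one pretraining job
```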
Original abstract
Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic empirical study of masked autoencoder (MAE) pretraining for fine-grained species classification on the iNatSounds dataset. It analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategies, reporting that off-the-shelf models pretrained on large-scale general audio achieve the best transfer performance. Additional domain-specific MAE pretraining provides limited benefits or can degrade results, and selective filtering yields negligible gains when overall scale is limited. The central claim is that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.
Significance. If the central attribution to scale holds after addressing controls, the work supplies actionable guidance for self-supervised audio models under weak supervision and limited domain data. It helps delineate when general large-scale pretraining is preferable to further domain-specific MAE efforts, which is useful for bioacoustics practitioners facing weakly labeled repositories like iNaturalist.
major comments (2)
- [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.
- [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.
minor comments (2)
- [§3.3] The transfer-strategy ablation would be clearer with an explicit diagram showing the frozen vs. fine-tuned stages and the exact linear-probe protocol (a minimal probe sketch follows this list).
- [Table 2] Table 2 caption should explicitly state the number of parameters and the exact checkpoint sources for each general-audio baseline to facilitate replication.
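For reference, a minimal sketch of the frozen-encoder linear probe the first minor comment asks to see spelled out; the backbone, feature width, and class count are placeholders, not the paper's protocol.

```python
import torch
import torch.nn as nn

# Stand-in backbone; in practice this would be a loaded pretrained MAE encoder.
encoder = nn.Sequential(nn.Conv2d(1, 192, kernel_size=16, stride=16), nn.Flatten(2))
for p in encoder.parameters():
    p.requires_grad = False          # linear probe: the encoder stays frozen

num_species = 5000                   # placeholder class count
head = nn.Linear(192, num_species)   # the only trainable parameters
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

spec = torch.randn(8, 1, 128, 512)   # (batch, 1, mel bins, frames)
labels = torch.randint(0, num_species, (8,))

with torch.no_grad():                             # no gradients through the backbone
    feats = encoder(spec).mean(dim=-1)            # (batch, 192) pooled features
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
opt.step()
# Full fine-tuning is the same loop with requires_grad left True on the encoder
# and its parameters included in the optimizer.
```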
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional controls and statistical reporting where feasible.
Point-by-point responses
Referee: [Abstract and §4] The comparison between off-the-shelf general-audio checkpoints and custom domain-specific MAE runs does not report matched model capacity, optimizer, learning-rate schedule, or total training budget. Without these controls, performance differences on iNatSounds cannot be cleanly attributed to pretraining data scale rather than architectural or optimization mismatches. This directly affects the load-bearing claim in the abstract that scale dominates objective design.
Authors: We appreciate the referee highlighting this issue. The off-the-shelf general-audio models refer to publicly released checkpoints (e.g., AudioMAE variants pretrained on AudioSet) whose architectures, capacities, and training protocols are documented in the original publications. Our domain-specific MAE runs used the identical ViT-based architecture and followed the standard MAE optimization settings (AdamW, cosine schedule, etc.) as closely as possible given the available compute. To strengthen the attribution to scale, we will add a supplementary table in the revised manuscript that explicitly lists model capacity, optimizer, learning-rate schedule, and total training budget for every pretraining regime. This will allow readers to evaluate comparability directly. revision: yes
Referee: [§4.2] No variance estimates, statistical significance tests, or multiple random seeds are described for the reported transfer accuracies across pretraining regimes. This weakens confidence that the observed ordering (general > domain-specific MAE) is robust rather than an artifact of single-run variability or dataset-specific biases in the single-label iNatSounds recordings.
Authors: We agree that variance estimates and statistical tests would increase confidence in the reported ordering. Our experiments were run with single random seeds owing to the substantial compute required for MAE pretraining on large audio corpora. In the revision we will re-evaluate the key transfer results using at least three independent random seeds, report means and standard deviations, and include paired statistical significance tests (e.g., t-tests) between the general-audio and domain-specific regimes. This directly addresses concerns about single-run variability. revision: yes
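A minimal sketch of the promised reporting, with made-up accuracy numbers purely for illustration; note that three seeds give a paired t-test very little statistical power.

```python
import numpy as np
from scipy import stats

# Hypothetical top-1 accuracies over three matched seeds; values are made up.
general = np.array([61.2, 60.8, 61.5])   # off-the-shelf general-audio pretraining
domain = np.array([58.9, 59.4, 58.7])    # additional domain-specific MAE pretraining

t, p = stats.ttest_rel(general, domain)  # paired across the same seeds
print(f"general: {general.mean():.2f} +/- {general.std(ddof=1):.2f}")
print(f"domain:  {domain.mean():.2f} +/- {domain.std(ddof=1):.2f}")
print(f"paired t = {t:.2f}, p = {p:.4f}")
```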
Circularity Check
No significant circularity in this empirical study
Full rationale
This is a purely empirical paper that reports experimental comparisons of MAE pretraining strategies on the iNatSounds dataset. There are no derivations, equations, fitted parameters, or mathematical claims that reduce to their own inputs by construction. Central claims rest on held-out transfer performance metrics rather than any self-referential definitions, uniqueness theorems, or ansatzes imported via self-citation. Results are grounded in direct experimentation against external benchmarks and off-the-shelf models, making the study self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Masked autoencoder pretraining produces transferable representations across audio domains when data scale is sufficient.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: the pretraining objective minimizes the reconstruction loss over the masked patches, $\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|M|} \sum_{k \in M} \lVert p_k - \hat{p}_k \rVert_2^2$.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alan Baade, Puyuan Peng, and David Harwath. Mae-ast: Masked autoencoding audio spectrogram transformer. In Proc. Interspeech 2022, pages 2438–2442, 2022.
- [2] Junho Bae, Mingu Kang, and Youngmin Choo. Entropy-based analysis of influential factors for underwater acoustic target recognition in passive sonar data. Ocean Engineering, 342:122908, 2025.
- [3] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- [4] Stuart HM Butchart, Matt Walpole, Ben Collen, Arco Van Strien, Jörn PW Scharlemann, Rosamunde EA Almond, Jonathan EM Baillie, Bastian Bomhard, Claire Brown, John Bruno, et al. Global biodiversity: indicators of recent declines. Science, 328(5982):1164–1168, 2010.
- [5] Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The inaturalist sounds dataset. Advances in Neural Information Processing Systems, 37:132524–132544, 2024.
- [6] Mustafa Chasmai, Wuao Liu, Subhransu Maji, and Grant Van Horn. Audio geolocation: A natural sounds benchmark. arXiv preprint arXiv:2505.18726, 2025.
- [7] Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning, pages 5178–5193, 2023.
- [8] Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, and Xie Chen. Eat: self-supervised pre-training with efficient audio transformer. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3807–3815, 2024.
- [9] Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.
- [10] Sandra Díaz, Josef Settele, Eduardo S Brondízio, Hien T Ngo, John Agard, Almut Arneth, Patricia Balvanera, Kate A Brauman, Stuart HM Butchart, Kai MA Chan, et al. Pervasive human-driven decline of life on earth points to the need for transformative change. Science, 366(6471):eaax3100, 2019.
- [11] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
- [12] Marius Faiß, Burooj Ghani, and Dan Stowell. Insectset459: an open dataset of insect sounds for bioacoustic machine learning. arXiv preprint arXiv:2503.15074, 2025.
- [13] Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in Neural Information Processing Systems, 35:35946–35958, 2022.
- [14] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.
- [15] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4438–4446, 2017.
- [16] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
- [17] Stefan Vasilev Genev. Classification of Anuran Species Using High Efficiency CNNs. PhD thesis, Tilburg University, 2024.
- [18] Rory Gibb, Ella Browning, Paul Glover-Kapfer, and Kate E Jones. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods in Ecology and Evolution, 10(2):169–185, 2019.
- [19] Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.
- [20] Masato Hagiwara. Aves: Animal vocalization encoder based on self-supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
- [21] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [22] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
- [23] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
- [24] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
- [25] iNaturalist. https://www.inaturalist.org, 2026. Accessed: 2026-02-23.
- [26] Stefan Kahl, Connor M Wood, Maximilian Eibl, and Holger Klinck. Birdnet: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61:101236, 2021.
- [27] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- [28] Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, and Hisham Cholakkal. Mavis: A multimodal conversational assistant for avian species. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28601–28627, 2025.
- [29] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
- [30] Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, and Ming-Hsuan Yang. Bev-mae: Bird's eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4):3531–3539, 2024.
- [31] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [32] Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, et al. What matters for bioacoustic encoding. arXiv preprint arXiv:2508.11845, 2025.
- [33] Ilyass Moummad, Nicolas Farrugia, Romain Serizel, Jeremy Froidevaux, and Vincent Lostanlen. Mixture of mixups for multi-label classification of rare anuran sounds. In 2024 32nd European Signal Processing Conference (EUSIPCO), pages 1282–1286. IEEE, 2024.
- [34] Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, and Gerard Medioni. Lv-mae: Learning long video representations through masked-embedding autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21398–21407, 2025.
- [35] Paul Pedersen. The mel scale. Journal of Music Theory, 9(2):295–308, 1965.
- [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [37] Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, and Christoph Scholz. Can masked autoencoders also listen to birds? Transactions on Machine Learning Research Journal, 2025.
- [38] Lukas Rauch, Raphael Schwinger, Moritz Wirth, René Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Stefan Kahl, Bernhard Sick, Sven Tomforde, et al. Birdset: A large-scale dataset for audio classification in avian bioacoustics. In International Conference on Learning Representations, pages 29482–29520, 2025.
- [39] Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023.
- [40] David Robinson, Adelaide Robinson, and Lily Akrapongpisak. Transferable models for bioacoustics with human language supervision. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1316–1320. IEEE, 2024.
- [41] David Robinson, Marius Miron, Masato Hagiwara, and Olivier Pietquin. NatureLM-audio: an audio-language foundation model for bioacoustics. In The Thirteenth International Conference on Learning Representations, 2025.
- [42] Samuel RP-J Ross, Darren P O'Connell, Jessica L Deichmann, Camille Desjonquères, Amandine Gasc, Jennifer N Phillips, Sarab S Sethi, Connor M Wood, and Zuzana Burivalova. Passive acoustic monitoring provides a fresh perspective on fundamental ecological questions. Functional Ecology, 37(4):959–975, 2023.
- [43] Dirk S Schmeller, Romain Julliard, Peter J Bellingham, Monika Böhm, Neil Brummitt, Alessandro Chiarucci, Denis Couvet, Sarah Elmendorf, David M Forsyth, Jaime García Moreno, et al. Towards a global terrestrial species monitoring program. Journal for Nature Conservation, 25:51–57, 2015.
- [44] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pages 3465–3469, 2019.
- [45] Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, and Sven Tomforde. Foundation models for bioacoustics–a comparative review. Ecological Informatics, page 103765, 2026.
- [46] Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, et al. Sam audio: Segment anything in audio. arXiv preprint arXiv:2512.18099, 2025.
- [47] Larissa Sayuri Moreira Sugai, Thiago Sanna Freire Silva, José Wagner Ribeiro Jr, and Diego Llusia. Terrestrial passive acoustic monitoring: review and perspectives. BioScience, 69(1):15–25, 2019.
- [48] Aaron Sun, Subhransu Maji, and Grant Van Horn. Merlin l48 spectrogram dataset. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [49] Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, and Hairong Qi. Cross-scale mae: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems, 36:20054–20066, 2023.
- [50] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.
- [51] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.
- [52] Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, and Serge Belongie. Exploring fine-grained audiovisual categorization with the SSW60 dataset. In European Conference on Computer Vision, pages 271–289. Springer, 2022.
- [53] Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, and Tom Denton. Perch 2.0: The bittern lesson for bioacoustics. arXiv preprint arXiv:2508.04665, 2025.
- [54] Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, Andrei Goncharov, Alberto Paderno, Maximilian Miller, Leander Maerkisch, Paul Jaeger, and Klaus Maier-Hein. Revisiting mae pre-training for 3d medical image segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5186–5196, 2025.
- [55] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
- [56] Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16284–16294, 2023.
- [57] Xeno-Canto Foundation. Xeno-canto: Sharing bird sounds from around the world. https://xeno-canto.org. Accessed: 2026-02-23.