Time-frequency localization of bird calls in dense soundscapes
Pith reviewed 2026-06-27 11:58 UTC · model grok-4.3
The pith
Treating bird calls as objects in spectrograms lets YOLO models localize them in time and frequency, nearly doubling baseline performance in dense soundscapes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training YOLO11 models to detect bounding boxes on spectrograms yields an IoMin@50 F1-score of 81.8 percent on Singapore soundscapes versus 42.1 percent for the baseline, and 58.6 percent on out-of-distribution Hawaii recordings versus 48.6 percent.
What carries the argument
YOLO object detection applied directly to spectrogram images, with bird calls represented as time-frequency bounding boxes and evaluated using the IoMin metric that measures overlap relative to the smaller of two regions.
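As a minimal sketch of the metric described above, the snippet below computes IoMin (intersection area divided by the area of the smaller box) next to standard IoU for two time-frequency boxes. The `Box` layout, the function names, and the example coordinates are illustrative assumptions, not taken from the paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical time-frequency box: seconds on one axis, Hz on the other."""
    t0: float  # start time (s)
    f0: float  # lower frequency (Hz)
    t1: float  # end time (s)
    f1: float  # upper frequency (Hz)


def area(b: Box) -> float:
    return max(0.0, b.t1 - b.t0) * max(0.0, b.f1 - b.f0)


def intersection(a: Box, b: Box) -> float:
    dt = min(a.t1, b.t1) - max(a.t0, b.t0)
    df = min(a.f1, b.f1) - max(a.f0, b.f0)
    return max(0.0, dt) * max(0.0, df)


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a: Box, b: Box) -> float:
    # Overlap relative to the smaller of the two regions.
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


# A loosely drawn annotation and a tighter prediction nested inside it.
loose = Box(1.0, 2000.0, 2.0, 6000.0)
tight = Box(1.2, 3000.0, 1.8, 5000.0)
iou_score = iou(loose, tight)
iomin_score = iomin(loose, tight)
```

For this nested pair the IoU is about 0.3 while the IoMin is 1.0, which illustrates why a score normalized by the smaller region is less sensitive to how generously a boundary was drawn.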
If this is right
- Precise time-frequency maps become available for analyses that need the exact timing and frequency content of each vocalization rather than aggregate presence.
- The same pipeline can be applied to other dense acoustic environments once a modest amount of annotated spectrograms is collected.
- IoMin provides a more stable scoring method than standard IoU when acoustic boundaries are inherently ambiguous.
- Generalization from Singapore to Hawaii indicates that the learned visual features on spectrograms capture transferable properties of bird calls.
Where Pith is reading between the lines
- The method could be tested on other vocalizing taxa such as insects or marine mammals by retraining only the final layers.
- Real-time versions on embedded hardware would allow immediate localization during field deployments rather than post-processing.
- Combining the detector with source separation or multi-channel arrays might further reduce errors in extremely overlapping choruses.
Load-bearing premise
That converting audio to spectrograms and treating calls as visual objects preserves enough acoustic information for accurate localization without introducing systematic errors from windowing, overlap, or model priors on box shapes.
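To make the premise concrete, here is a hedged sketch of how a time-frequency annotation could be mapped to a YOLO-style label line (class, x-center, y-center, width, height, all normalized to the image). It assumes a linear frequency axis from 0 Hz to `f_max`, a spectrogram image covering exactly `clip_dur` seconds, and image row 0 at the top. The paper's actual rendering choices are not described in the text above, and the function name and parameters are hypothetical.

```python
def to_yolo_label(t0, t1, f_lo, f_hi, clip_dur, f_max, class_id=0):
    """Map a call annotated in seconds and Hz to normalized YOLO box coordinates.

    Assumption: linear frequency axis, image spans [0, clip_dur] x [0, f_max],
    and the top image row corresponds to f_max (so the y axis is flipped).
    """
    x_center = (t0 + t1) / 2.0 / clip_dur
    width = (t1 - t0) / clip_dur
    y_center = 1.0 - (f_lo + f_hi) / 2.0 / f_max
    height = (f_hi - f_lo) / f_max
    return class_id, x_center, y_center, width, height


# Illustrative values only: a 1 s call between 2 and 6 kHz in a 10 s clip.
label = to_yolo_label(t0=1.0, t1=2.0, f_lo=2000.0, f_hi=6000.0,
                      clip_dur=10.0, f_max=16000.0)
label_line = " ".join(
    str(label[0]) if i == 0 else f"{v:.6f}" for i, v in enumerate(label)
)
```

Any nonlinear frequency scaling (for example a mel axis) would need the corresponding transform applied to `f_lo` and `f_hi` before normalizing, which is one place where the premise could introduce systematic box errors.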
What would settle it
Performance on the same calls drops sharply when spectrogram window length or hop size is changed, or when the evaluation set contains only calls whose shapes deviate markedly from rectangular boxes.
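The test above hinges on how STFT settings quantize box edges. A small sketch, assuming the analysis window length equals the FFT size: the hop sets the spacing between spectrogram columns and the FFT size sets the spacing between rows. The sample rate and settings below are illustrative and are not the paper's values.

```python
def stft_resolution(sample_rate, n_fft, hop):
    """Return the time and frequency granularity of an STFT spectrogram.

    Assumption: window length == n_fft (no zero padding).
    """
    return {
        "time_step_s": hop / sample_rate,    # spacing between columns
        "freq_bin_hz": sample_rate / n_fft,  # spacing between rows
        "window_s": n_fft / sample_rate,     # duration smeared into one column
    }


# Hypothetical grid of settings to compare.
settings = [(32000, 512, 128), (32000, 1024, 256), (32000, 2048, 512)]
table = {cfg: stft_resolution(*cfg) for cfg in settings}

for (sr, n_fft, hop), res in table.items():
    print(f"n_fft={n_fft:5d} hop={hop:4d} "
          f"dt={res['time_step_s'] * 1000:.1f} ms "
          f"df={res['freq_bin_hz']:.2f} Hz")
```

Longer windows sharpen frequency edges but blur onsets and offsets over a longer interval, so a detector trained at one setting may place box edges differently at another.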
Original abstract
Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
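One plausible reading of the "IoMin@50 F1-score" quoted in the abstract is an F1 computed after matching predictions to annotations at an IoMin threshold of 0.5. The sketch below uses greedy one-to-one matching and ignores class labels; both are assumptions, since the matching protocol is not described in the text here. Boxes are plain `(t0, f0, t1, f1)` tuples and all names are hypothetical.

```python
def iomin(a, b):
    """Intersection over the smaller box area; boxes are (t0, f0, t1, f1)."""
    dt = min(a[2], b[2]) - max(a[0], b[0])
    df = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, dt) * max(0.0, df)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller > 0 else 0.0


def f1_at_threshold(preds, truths, thr=0.5):
    """Greedy one-to-one matching by descending IoMin, then F1."""
    pairs = sorted(
        ((iomin(p, g), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(truths)),
        reverse=True,
    )
    used_p, used_g, tp = set(), set(), 0
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs are all below the threshold
        if i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        tp += 1
    fp = len(preds) - tp
    fn = len(truths) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative data: one prediction matches, one is spurious, one call is missed.
truths = [(0.0, 1000.0, 1.0, 3000.0), (2.0, 2000.0, 3.0, 4000.0)]
preds = [(0.1, 1200.0, 0.9, 2800.0), (5.0, 1000.0, 6.0, 2000.0)]
score = f1_at_threshold(preds, truths)
```

With one true positive, one false positive, and one false negative, the example yields an F1 of 0.5.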
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates bird vocalization detection and localization in dense soundscapes as a 2D object detection task on spectrograms, trains YOLO11 models on Singapore recordings, introduces an open-source browser-based annotation tool, and proposes the Intersection over Minimum (IoMin) metric as an alternative to IoU for handling ambiguous acoustic boundaries. It reports that the best YOLO model achieves 81.8% IoMin@50 F1 on in-distribution Singapore data (vs. 42.1% baseline) and 58.6% on out-of-distribution Hawaii recordings (vs. 48.6% baseline).
Significance. If the reported performance gains prove robust under proper validation, the work would demonstrate that visual object-detection frameworks can be adapted for precise time-frequency localization of animal vocalizations, addressing a clear limitation of existing presence-only bioacoustic classifiers. The open-source annotation tool and IoMin metric are concrete contributions that could be adopted more broadly.
major comments (2)
- [Abstract, §4] Abstract and §4 (results): the headline claims of nearly doubled performance (81.8% vs 42.1% IoMin@50 F1 in-distribution; 58.6% vs 48.6% OOD) are presented without any description of training/validation/test splits, hyperparameter selection, baseline implementation details, or error bars/statistical tests. These omissions make the central empirical claim unverifiable from the provided text.
- [§3] §3 (method): the conversion of audio to spectrograms via STFT is treated as a fixed preprocessing step with no analysis of window length, hop size, or overlap effects on call boundary accuracy. Because the headline performance rests on the assumption that rectangular YOLO boxes on spectrograms faithfully represent acoustic events, the lack of sensitivity analysis to these parameters is load-bearing for the localization claim.
minor comments (2)
- [§3] The manuscript refers to “YOLO11” without clarifying whether this denotes an official YOLOv11 release or a custom variant; a citation or version number would improve reproducibility.
- [§4, figures] Figure captions and §4 should explicitly state the number of annotated calls per dataset and the class distribution to allow readers to assess whether the reported F1 scores are driven by a few dominant species.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of verifiability and methodological robustness. We address each major comment point by point below and will revise the manuscript accordingly where appropriate.
- Referee: [Abstract, §4] Abstract and §4 (results): the headline claims of nearly doubled performance (81.8% vs 42.1% IoMin@50 F1 in-distribution; 58.6% vs 48.6% OOD) are presented without any description of training/validation/test splits, hyperparameter selection, baseline implementation details, or error bars/statistical tests. These omissions make the central empirical claim unverifiable from the provided text.
Authors: We agree that the current manuscript text does not provide sufficient detail on the experimental protocol to allow independent verification of the reported performance numbers. In the revised version we will expand §4 (and the abstract if space permits) to explicitly describe the train/validation/test splits used for the Singapore data, the hyperparameter search procedure and final values for the YOLO11 models, the exact implementation of the baseline detector, and the computation of error bars together with any statistical significance tests performed. These additions will directly address the verifiability concern.
Revision: yes
- Referee: [§3] §3 (method): the conversion of audio to spectrograms via STFT is treated as a fixed preprocessing step with no analysis of window length, hop size, or overlap effects on call boundary accuracy. Because the headline performance rests on the assumption that rectangular YOLO boxes on spectrograms faithfully represent acoustic events, the lack of sensitivity analysis to these parameters is load-bearing for the localization claim.
Authors: We acknowledge that the manuscript presents the STFT parameters as fixed without accompanying sensitivity analysis. While the chosen parameters follow common practice in bioacoustic spectrogram generation for bird vocalizations, we agree that demonstrating robustness to reasonable variations in window length and hop size would strengthen the localization claims. In the revision we will add a short sensitivity study (either in §3 or as an appendix) that reports IoMin@50 F1 under a small grid of STFT settings on the Singapore validation set, thereby quantifying the impact on boundary accuracy.
Revision: yes
Circularity Check
No circularity detected; empirical results self-contained
The paper presents an empirical study converting audio to spectrograms and applying YOLO object detection, reporting F1 scores on in-distribution and OOD data. No equations, fitted parameters, or derivation chain are described that reduce the performance metrics to inputs by construction. The IoMin metric is newly proposed but does not create self-referential predictions. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. The central claims rest on experimental outcomes rather than tautological reductions.