Time-frequency localization of bird calls in dense soundscapes

Fanghui Tong; Hari Vishnu; Mandar Chitre; Simen Hexeberg

arxiv: 2606.10407 · v1 · pith:RKBLX3KMnew · submitted 2026-06-09 · 💻 cs.SD · cs.CV · q-bio.QM

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

Pith reviewed 2026-06-27 11:58 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · q-bio.QM

keywords bird call localization · spectrogram object detection · YOLO · passive acoustic monitoring · time-frequency localization · bioacoustics · IoMin metric

The pith

Treating bird calls as objects in spectrograms lets YOLO models localize them in time and frequency, nearly doubling baseline performance in dense soundscapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that bird vocalization detection can be reframed as an object detection problem on spectrogram images rather than simple presence classification in time windows. By training YOLO11 models on recordings from Singapore, the approach achieves substantially higher localization accuracy than prior methods while still improving on unseen data from Hawaii. The work also supplies a browser-based annotation tool and introduces the IoMin metric to handle the fuzzy boundaries typical of acoustic events. If correct, this shifts passive acoustic monitoring from coarse species detection toward precise time-frequency maps of individual calls.

Core claim

Training YOLO11 models to detect bounding boxes on spectrograms yields an IoMin@50 F1-score of 81.8 percent on Singapore soundscapes versus 42.1 percent for the baseline, and 58.6 percent on out-of-distribution Hawaii recordings versus 48.6 percent.
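To make the headline metric concrete, the following is a minimal sketch of how an F1-score at an IoMin threshold of 0.5 could be computed for one spectrogram. The greedy one-to-one matching, the box layout `(t_start, f_low, t_end, f_high)`, and the function names are assumptions made for illustration; the paper's exact matching protocol is not described in the text above.

```python
from typing import List, Tuple

# Assumed box layout: (t_start, f_low, t_end, f_high), e.g. seconds and Hz.
Box = Tuple[float, float, float, float]


def iomin(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # no overlap (also covers zero-area boxes)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (w * h) / min(area_a, area_b)


def f1_at_threshold(preds: List[Box], truths: List[Box], thr: float = 0.5) -> float:
    """F1 after greedy one-to-one matching (the matching rule is an assumption).

    `preds` is assumed to be sorted by descending confidence. A prediction
    counts as a true positive if its best unmatched ground-truth box has
    IoMin strictly greater than `thr`.
    """
    unmatched = list(range(len(truths)))
    tp = 0
    for p in preds:
        best, best_score = None, thr
        for j in unmatched:
            score = iomin(p, truths[j])
            if score > best_score:
                best, best_score = j, score
        if best is not None:
            unmatched.remove(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(truths) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical boxes: one prediction matches, one is spurious, one call is missed.
preds = [(0.0, 1000.0, 1.0, 2000.0), (3.0, 500.0, 4.0, 900.0)]
truths = [(0.1, 1000.0, 0.9, 2000.0), (5.0, 100.0, 6.0, 200.0)]
print(f1_at_threshold(preds, truths))
```

A dataset-level score would pool true positives, false positives, and false negatives over all spectrograms before forming the ratio; whether the paper pools or averages is not stated here.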

What carries the argument

YOLO object detection applied directly to spectrogram images, with bird calls represented as time-frequency bounding boxes and evaluated using the IoMin metric that measures overlap relative to the smaller of two regions.
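The difference between IoU and IoMin is easiest to see when one box sits inside another, as happens when two annotators draw a tight and a loose box around the same call. The sketch below assumes axis-aligned boxes given as `(t0, f0, t1, f1)` and takes IoMin to be intersection area divided by the smaller box area, following the description above; the example coordinates are invented.

```python
def overlap_scores(a, b):
    """Return (IoU, IoMin) for two axis-aligned boxes (t0, f0, t1, f1)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    smaller = min(area_a, area_b)
    iou = inter / union if union > 0 else 0.0
    iomin = inter / smaller if smaller > 0 else 0.0
    return iou, iomin


# Hypothetical tight and loose boxes around the same call (seconds, Hz).
tight = (1.0, 2000.0, 2.0, 4000.0)
loose = (0.5, 1500.0, 2.5, 4500.0)
iou, iomin = overlap_scores(tight, loose)
print(f"IoU={iou:.3f}  IoMin={iomin:.3f}")
```

Here the tight box lies entirely inside the loose one, so IoMin is 1 while IoU is only one third and would fail a 0.5 threshold. The same property means a very small box nested inside a large one also scores 1 under IoMin, which is worth keeping in mind when reading IoMin-based scores.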

If this is right

Precise time-frequency maps become available for analyses that need the exact timing and frequency content of each vocalization rather than aggregate presence.
The same pipeline can be applied to other dense acoustic environments once a modest amount of annotated spectrograms is collected.
IoMin provides a more stable scoring method than standard IoU when acoustic boundaries are inherently ambiguous.
Generalization from Singapore to Hawaii indicates that the learned visual features on spectrograms capture transferable properties of bird calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on other vocalizing taxa such as insects or marine mammals by retraining only the final layers.
Real-time versions on embedded hardware would allow immediate localization during field deployments rather than post-processing.
Combining the detector with source separation or multi-channel arrays might further reduce errors in extremely overlapping choruses.

Load-bearing premise

That converting audio to spectrograms and treating calls as visual objects preserves enough acoustic information for accurate localization without introducing systematic errors from windowing, overlap, or model priors on box shapes.

What would settle it

Performance on the same calls drops sharply when spectrogram window length or hop size is changed, or when the evaluation set contains only calls whose shapes deviate markedly from rectangular boxes.
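The sensitivity this test probes follows from simple arithmetic: the hop size sets the time step between spectrogram columns, and the window length sets the width of each frequency bin. The sketch below tabulates both for a 6-second clip (the clip length mentioned in the Figure 4 caption). The sample rate and the window and hop values are illustrative assumptions, since the paper's STFT settings are not given in the text above.

```python
def stft_grid(sample_rate: int, n_fft: int, hop: int, clip_seconds: float = 6.0):
    """Return (seconds per column, Hz per row, number of columns) for one clip.

    Assumes a linear-frequency STFT with centered, padded framing, so the
    frame count is 1 + n_samples // hop. The argument values used below are
    illustrative assumptions, not the paper's settings.
    """
    n_samples = int(clip_seconds * sample_rate)
    time_step = hop / sample_rate      # time resolution of one column
    freq_step = sample_rate / n_fft    # frequency width of one row
    n_frames = 1 + n_samples // hop
    return time_step, freq_step, n_frames


# Longer windows sharpen frequency resolution but coarsen time resolution.
for n_fft, hop in [(512, 128), (1024, 256), (2048, 512)]:
    dt, df, cols = stft_grid(32000, n_fft, hop)
    print(f"n_fft={n_fft:4d} hop={hop:3d}: {dt * 1e3:.1f} ms/col, {df:.2f} Hz/row, {cols} cols")
```

A box edge can only be placed to within roughly one column or one row, so changing these settings shifts the finest boundary a detector can express, independent of how well the model is trained.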

Figures

Figures reproduced from arXiv: 2606.10407 by Fanghui Tong, Hari Vishnu, Mandar Chitre, Simen Hexeberg.

**Figure 1.** Two common failure modes of the TFE detector on a test set recording from Singapore (left), and the corresponding …

**Figure 2.** Interface of the open-source annotation tool BirdWatch

**Figure 3.** The two recording locations in the Singapore Botanic …

**Figure 4.** Illustration of the data split strategy for the Singapore dataset. Recordings are converted to 6-second spectrograms

**Figure 5.** Example illustrating where the standard definition of …

**Figure 6.** Precision-recall curves. YOLO models show significant …

**Figure 7.** Examples from the Hawaii dataset illustrating how annotation discrepancies affect performance metrics.

**Figure 8.** F1 scores using IoMin > 0.5 for different YOLO architectures over 5 training runs on the Singapore and Hawaii test sets. Note that the y-axis ranges differ.

Original abstract

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
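Framing detection as object detection on spectrograms means each annotated call has to be expressed as an image-space box. The sketch below converts a time-frequency annotation into a YOLO-style label line (`class x_center y_center width height`, all normalized to [0, 1] with the origin at the top-left of the image). The linear frequency axis, the maximum frequency `f_max`, the single class id, and the function name are assumptions made for illustration; only the 6-second clip length comes from the Figure 4 caption.

```python
def to_yolo_label(t0: float, t1: float, f_lo: float, f_hi: float,
                  clip_seconds: float = 6.0, f_max: float = 16000.0,
                  class_id: int = 0) -> str:
    """Convert a call annotation (seconds, Hz) to a normalized YOLO label line.

    Assumes time runs left to right and frequency runs bottom to top on a
    linear axis from 0 to f_max, so the vertical coordinate is flipped
    relative to image rows. f_max and class_id are hypothetical defaults.
    """
    x_c = (t0 + t1) / 2.0 / clip_seconds
    w = (t1 - t0) / clip_seconds
    y_c = 1.0 - (f_lo + f_hi) / 2.0 / f_max   # flip: high frequency is at the top
    h = (f_hi - f_lo) / f_max
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


# A hypothetical call from 1.5 s to 3.0 s spanning 2 kHz to 6 kHz.
label = to_yolo_label(1.5, 3.0, 2000.0, 6000.0)
print(label)
```

If the spectrograms use a mel or logarithmic frequency axis instead, the two vertical terms would need the corresponding frequency warping before normalization.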

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

YOLO on spectrograms nearly doubles in-distribution localization F1 and still beats baseline out-of-distribution, with IoMin as a reasonable new metric for fuzzy boundaries.

The main thing here is that treating spectrograms as images and running YOLO11 produces clear localization gains over whatever baseline they used, moving from 42.1% to 81.8% IoMin@50 F1 on the Singapore data and from 48.6% to 58.6% on the Hawaii recordings. That OOD result is the part worth paying attention to.

The paper applies an off-the-shelf detector to this domain and adds the IoMin metric, which looks like a sensible adjustment when boundaries in time-frequency are inherently soft. The browser annotation tool is a practical addition that others can use. Cross-site testing is also done right; too many bioacoustics papers stay within one recording setup.

The numbers suggest the method is picking up real structure rather than just site-specific artifacts. If the full methods section shows reasonable training splits, hyperparameter reporting, and some error bars, this becomes a usable tool for people who need call positions instead of just presence.

The soft spot is the conversion step itself. Spectrogram parameters can split or smear calls, and YOLO's rectangular priors plus NMS may not match the irregular shapes of many vocalizations. The abstract does not spell out how the STFT window, hop, or overlap were chosen or tested, so the full paper needs to show those choices were not driving the gains. If they were, the reported improvement could partly be an artifact of the image representation. The OOD lift makes a pure artifact less likely, but it still needs explicit checking.

This is for bioacoustics groups or passive monitoring teams that already work with spectrograms and want better localization for downstream rate or interaction studies. It is coherent on its own terms and the empirical comparison is direct enough to justify referee time, even if the methods section will need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript formulates bird vocalization detection and localization in dense soundscapes as a 2D object detection task on spectrograms, trains YOLO11 models on Singapore recordings, introduces an open-source browser-based annotation tool, and proposes the Intersection over Minimum (IoMin) metric as an alternative to IoU for handling ambiguous acoustic boundaries. It reports that the best YOLO model achieves 81.8% IoMin@50 F1 on in-distribution Singapore data (vs. 42.1% baseline) and 58.6% on out-of-distribution Hawaii recordings (vs. 48.6% baseline).

Significance. If the reported performance gains prove robust under proper validation, the work would demonstrate that visual object-detection frameworks can be adapted for precise time-frequency localization of animal vocalizations, addressing a clear limitation of existing presence-only bioacoustic classifiers. The open-source annotation tool and IoMin metric are concrete contributions that could be adopted more broadly.

major comments (2)

[Abstract, §4] Abstract and §4 (results): the headline claims of nearly doubled performance (81.8% vs 42.1% IoMin@50 F1 in-distribution; 58.6% vs 48.6% OOD) are presented without any description of training/validation/test splits, hyperparameter selection, baseline implementation details, or error bars/statistical tests. These omissions make the central empirical claim unverifiable from the provided text.
[§3] §3 (method): the conversion of audio to spectrograms via STFT is treated as a fixed preprocessing step with no analysis of window length, hop size, or overlap effects on call boundary accuracy. Because the headline performance rests on the assumption that rectangular YOLO boxes on spectrograms faithfully represent acoustic events, the lack of sensitivity analysis to these parameters is load-bearing for the localization claim.

minor comments (2)

[§3] The manuscript refers to “YOLO11” without clarifying whether this denotes an official YOLOv11 release or a custom variant; a citation or version number would improve reproducibility.
[§4, figures] Figure captions and §4 should explicitly state the number of annotated calls per dataset and the class distribution to allow readers to assess whether the reported F1 scores are driven by a few dominant species.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of verifiability and methodological robustness. We address each major comment point by point below and will revise the manuscript accordingly where appropriate.

Referee: [Abstract, §4] Abstract and §4 (results): the headline claims of nearly doubled performance (81.8% vs 42.1% IoMin@50 F1 in-distribution; 58.6% vs 48.6% OOD) are presented without any description of training/validation/test splits, hyperparameter selection, baseline implementation details, or error bars/statistical tests. These omissions make the central empirical claim unverifiable from the provided text.

Authors: We agree that the current manuscript text does not provide sufficient detail on the experimental protocol to allow independent verification of the reported performance numbers. In the revised version we will expand §4 (and the abstract if space permits) to explicitly describe the train/validation/test splits used for the Singapore data, the hyperparameter search procedure and final values for the YOLO11 models, the exact implementation of the baseline detector, and the computation of error bars together with any statistical significance tests performed. These additions will directly address the verifiability concern.

Revision: yes

Referee: [§3] §3 (method): the conversion of audio to spectrograms via STFT is treated as a fixed preprocessing step with no analysis of window length, hop size, or overlap effects on call boundary accuracy. Because the headline performance rests on the assumption that rectangular YOLO boxes on spectrograms faithfully represent acoustic events, the lack of sensitivity analysis to these parameters is load-bearing for the localization claim.

Authors: We acknowledge that the manuscript presents the STFT parameters as fixed without accompanying sensitivity analysis. While the chosen parameters follow common practice in bioacoustic spectrogram generation for bird vocalizations, we agree that demonstrating robustness to reasonable variations in window length and hop size would strengthen the localization claims. In the revision we will add a short sensitivity study (either in §3 or as an appendix) that reports IoMin@50 F1 under a small grid of STFT settings on the Singapore validation set, thereby quantifying the impact on boundary accuracy.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical results self-contained

The paper presents an empirical study converting audio to spectrograms and applying YOLO object detection, reporting F1 scores on in-distribution and OOD data. No equations, fitted parameters, or derivation chain are described that reduce the performance metrics to inputs by construction. The IoMin metric is newly proposed but does not create self-referential predictions. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. The central claims rest on experimental outcomes rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond standard supervised learning assumptions.

pith-pipeline@v0.9.1-grok · 5712 in / 1006 out tokens · 18711 ms · 2026-06-27T11:58:45.549342+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 8 canonical work pages · 1 internal anchor

[1] R. Schwinger, P. V. Zadeh, L. Rauch, M. Kurz, T. Hauschild, S. Lapp, and S. Tomforde, “Foundation models for bioacoustics – a comparative review,” 2025. [Online]. Available: https://arxiv.org/abs/2508.01277

[2] S. Kahl, C. M. Wood, M. Eibl, and H. Klinck, “Birdnet: A deep learning solution for avian diversity monitoring,” Ecological Informatics, vol. 61, p. 101236, 2021

[3] B. van Merriënboer, V. Dumoulin, J. Hamer, L. Harrell, A. Burns, and T. Denton, “Perch 2.0: The bittern lesson for bioacoustics,” 2026. [Online]. Available: https://arxiv.org/abs/2508.04665

[4] L. Rauch, R. Schwinger, M. Wirth, R. Heinrich, D. Huseljic, M. Herde, J. Lange, S. Kahl, B. Sick, S. Tomforde, and C. Scholz, “Birdset: A large-scale dataset for audio classification in avian bioacoustics,” 2025. [Online]. Available: https://arxiv.org/abs/2403.10380

[5] “Xeno-canto: Sharing wildlife sounds from around the world,” Xeno-canto Foundation for Nature Sounds, 2026, accessed: 2026-05-17. [Online]. Available: https://xeno-canto.org/

[6] C. Both and T. Grant, “Biological invasions and the acoustic niche: the effect of bullfrog calls on the acoustic signals of white-banded tree frogs,” Biology Letters, vol. 8, no. 5, pp. 714–716, 2012

[7] C. I. Medeiros, C. Both, T. Grant, and S. M. Hartz, “Invasion of the acoustic niche: variable responses by native species to invasive american bullfrog calls,” Biological Invasions, vol. 19, no. 2, pp. 675–690, 2017. [Online]. Available: https://doi.org/10.1007/s10530-016-1327-7

[8] J. M. Hopkins, D. S. Bower, W. Edwards, and L. Schwarzkopf, “Effects of invasive toad calls and synthetic tones on call properties of native australian toadlets,” Journal of Herpetology, vol. 57, pp. 437–446, 2023

[9] A. Farina, N. Pieretti, and N. Morganti, “Acoustic patterns of an invasive species: the red-billed leiothrix (leiothrix lutea scopoli 1786) in a mediterranean shrubland,” Bioacoustics, vol. 22, no. 3, pp. 175–194, 2013. [Online]. Available: https://doi.org/10.1080/09524622.2012.761571

[10] S. Hexeberg, H. Vishnu, K. T. Beng, A. Ho, W. Yusong, M. Chitre, K. Tun, and K. Lim, “Acoustic detector for multiple vocalizing marine mammal individuals,” in OCEANS 2023 - Limerick, 2023, pp. 1–8

[11] F. Briggs, B. Lakshminarayanan, L. Neal, X. Fern, R. Raich, S. Frey, A. Hadley, and M. Betts, “Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach,” The Journal of the Acoustical Society of America, vol. 131, pp. 4640–50, 2012

[12] S. Hexeberg, M. Chitre, M. Hoffmann-Kuhnt, and B. W. Low, “Semi-supervised classification of bird vocalizations,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13440

[13] T. S. Brandes, “Feature vector selection and use with hidden markov models to identify frequency-modulated bioacoustic signals amidst noise,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, pp. 1173–1180, 2008

[14] G. Jocher and J. Qiu, “Ultralytics yolo11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics

[15] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” 2015. [Online]. Available: https://arxiv.org/abs/1405.0312

[16] A. Navine, S. Kahl, A. Tanimoto-Johnson, H. Klinck, and P. Hart, “A collection of fully-annotated soundscape recordings from the island of hawai‘i,” https://doi.org/10.5281/zenodo.7078499, 2022

[17] Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, “Trainable frontend for robust and far-field keyword spotting,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5670–5674
