Recognition: no theorem link
Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation
Pith reviewed 2026-05-16 09:07 UTC · model grok-4.3
The pith
Text distillation aligns audio and image embeddings for bird species retrieval without any paired audio-image data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By distilling the text embedding space of BioCLIP-2 into the audio encoder of BioLingual through contrastive fine-tuning on audio-text pairs alone, the resulting audio representations become aligned with image embeddings from BioCLIP-2. This alignment supports effective audio-to-image retrieval on bioacoustic benchmarks such as SSW60, outperforming zero-shot model combinations and direct text-embedding mappings, while preserving the audio encoder's original discriminative capability on focal and soundscape data.
What carries the argument
Contrastive distillation of text embeddings from a pretrained image-text model into a pretrained audio encoder, which transfers visual semantics without ever using images.
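To make the mechanism concrete, here is a minimal sketch of such a distillation step, assuming PyTorch-style encoders and precomputed BioCLIP-2 text embeddings for each clip's species label; the function names, batching, and temperature value are illustrative, not the paper's implementation.

```python
# Sketch: contrastive distillation of a frozen text space into an audio encoder.
# Assumptions (not from the paper): a trainable `audio_encoder` mapping audio
# inputs to d-dim vectors, and precomputed (frozen) text embeddings per clip.
import torch
import torch.nn.functional as F

def distillation_step(audio_encoder, batch_audio, batch_text_emb, optimizer, tau=0.07):
    """One symmetric InfoNCE step pulling audio embeddings toward frozen text targets."""
    audio_emb = F.normalize(audio_encoder(batch_audio), dim=-1)   # (B, d), trainable
    text_emb = F.normalize(batch_text_emb, dim=-1)                # (B, d), frozen targets

    logits = audio_emb @ text_emb.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal

    # Symmetric cross-entropy over both retrieval directions (audio->text, text->audio).
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the audio encoder receives gradients, any alignment with BioCLIP-2 image embeddings at inference time is an emergent consequence of matching the shared text space.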
If this is right
- The distilled audio encoder continues to perform well on standard audio classification tasks.
- Audio-to-image retrieval becomes feasible on any bioacoustic dataset that lacks image pairings.
- Audio-text alignment improves on both focal recordings and soundscape recordings.
- Indirect transfer through text offers a scalable route to visually grounded species recognition when direct multimodal pairs are unavailable.
Where Pith is reading between the lines
- Similar distillation steps could align audio with other visual or environmental modalities using the same text bridge.
- The approach may lower the barrier to building multimodal tools for field ecology by removing the need for expensive paired data collection.
- If the text space proves sufficiently rich, the same pattern could extend to other animal vocalizations or non-bird bioacoustic tasks.
- Performance on noisier, real-world soundscapes could be tested to check whether the transferred alignment holds under variable conditions.
Load-bearing premise
The text embedding space of the image-text model already encodes enough visual and species-specific structure that contrastive fine-tuning can move it into the audio space.
What would settle it
If audio-to-image retrieval accuracy on SSW60 dropped below the level achieved by simple zero-shot combinations of the same base models, the claim that text distillation induces meaningful alignment would be falsified.
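A minimal sketch of the two retrieval routes being compared, assuming precomputed, L2-normalized numpy embeddings (`audio_emb`, `text_emb`, `img_emb`) and species-label arrays; all names are placeholders rather than the paper's evaluation code.

```python
# Sketch: direct audio-to-image retrieval vs. a zero-shot text-bridged baseline.
# Assumes L2-normalized numpy arrays, so dot products are cosine similarities.
import numpy as np

def direct_top1_accuracy(audio_emb, img_emb, audio_labels, img_labels):
    """Fraction of audio queries whose nearest image shares the species label."""
    sims = audio_emb @ img_emb.T
    nearest = sims.argmax(axis=1)
    return float(np.mean(img_labels[nearest] == audio_labels))

def zero_shot_chain_accuracy(audio_emb, text_emb, img_emb, audio_labels, img_labels):
    """Baseline: audio -> species via text similarity, then species text -> image."""
    pred_species = (audio_emb @ text_emb.T).argmax(axis=1)            # audio to species name
    retrieved = (text_emb[pred_species] @ img_emb.T).argmax(axis=1)   # species name to image
    return float(np.mean(img_labels[retrieved] == audio_labels))
```

The falsification test amounts to checking whether the direct route (distilled audio embeddings against image embeddings) beats or at least matches the chained zero-shot route built from the same base models.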
Original abstract
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a text-distillation method to enable audio-to-image retrieval for bird species without any paired audio-image training data. It fine-tunes only the audio encoder of a pretrained audio-text model (BioLingual) against the text embeddings of a pretrained image-text model (BioCLIP-2) using a contrastive objective, thereby transferring visually grounded semantics to produce audio embeddings that align with image embeddings at inference time. The central empirical claim is that the resulting model achieves strong audio-to-image retrieval on the SSW60 benchmark, outperforming zero-shot model combinations and learned text-embedding mappings, while preserving audio-only discriminative performance on focal and soundscape datasets.
Significance. If the quantitative results and ablations hold, the work demonstrates a practical, data-efficient route to cross-modal alignment in bioacoustics by leveraging existing image-text and audio-text models rather than requiring scarce paired audio-image data. This could meaningfully expand the set of visually interpretable tools available for species recognition in data-scarce settings and highlights the utility of text as a semantic bridge between modalities.
major comments (3)
- [Abstract, §4] Abstract and §4 (evaluation): the headline claim that the method 'achieves strong audio-to-image retrieval performance exceeding baselines' on SSW60 is unsupported by any numerical results (recall@K, mAP, dataset size, number of classes, or error bars). Without these metrics the central claim cannot be assessed and the comparison to zero-shot and text-mapping baselines remains unverifiable.
- [§3.2] §3.2 (method): the contrastive fine-tuning step is described at a high level but supplies no ablation that isolates the contribution of BioCLIP-2's image-derived text space from generic effects of audio-text contrastive training. The reported gains could therefore arise from improved audio discriminability alone rather than emergent audio-image alignment, directly undermining the 'visually grounded semantics' transfer claim.
- [§3.1, §4] §3.1 and §4: the assumption that BioCLIP-2 text embeddings encode sufficiently rich visual/taxonomic structure for transfer is load-bearing yet untested; no analysis (e.g., nearest-neighbor inspection or controlled text-only vs. image-text ablation) is provided to show that the alignment is driven by visual content rather than generic semantic overlap.
minor comments (2)
- [§3.2] Notation for the contrastive loss and temperature parameter is introduced without an equation number or explicit definition, making the training objective difficult to reproduce from the text alone.
- [§4] The manuscript refers to 'multiple bioacoustic benchmarks' but does not list them or provide per-dataset statistics in the main text or a table; this should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our paper. We address each major comment below and have made revisions to incorporate additional results and analyses as suggested.
Point-by-point responses
- Referee: [Abstract, §4] Abstract and §4 (evaluation): the headline claim that the method 'achieves strong audio-to-image retrieval performance exceeding baselines' on SSW60 is unsupported by any numerical results (recall@K, mAP, dataset size, number of classes, or error bars). Without these metrics the central claim cannot be assessed and the comparison to zero-shot and text-mapping baselines remains unverifiable.
Authors: We agree that the abstract and evaluation section would benefit from explicit inclusion of the quantitative metrics. In the revised manuscript, we have updated the abstract and §4 to report the specific recall@K, mAP values, the size of the SSW60 dataset (60 classes), and error bars computed over multiple runs. The comparisons to the zero-shot and text-mapping baselines are now presented with numerical differences. revision: yes
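To make the requested metrics concrete, here is one plausible way to compute recall@K and mAP for audio-to-image retrieval, assuming label-level relevance (an image is relevant if it shares the query's species) and precomputed, L2-normalized embeddings; this is a sketch, not the authors' evaluation code.

```python
# Sketch: recall@K and mean average precision for audio-to-image retrieval.
# Assumes L2-normalized numpy arrays and integer species-label arrays.
import numpy as np

def recall_at_k(audio_emb, img_emb, audio_labels, img_labels, k=5):
    sims = audio_emb @ img_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]                      # top-k images per query
    hits = (img_labels[topk] == audio_labels[:, None]).any(axis=1)
    return float(hits.mean())

def mean_average_precision(audio_emb, img_emb, audio_labels, img_labels):
    sims = audio_emb @ img_emb.T
    order = np.argsort(-sims, axis=1)                            # full ranking per query
    rel = img_labels[order] == audio_labels[:, None]             # relevance at each rank
    cum_hits = np.cumsum(rel, axis=1)
    ranks = np.arange(1, rel.shape[1] + 1)
    precision_at_hit = np.where(rel, cum_hits / ranks, 0.0)
    ap = precision_at_hit.sum(axis=1) / np.maximum(rel.sum(axis=1), 1)
    return float(ap.mean())
```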
- Referee: [§3.2] §3.2 (method): the contrastive fine-tuning step is described at a high level but supplies no ablation that isolates the contribution of BioCLIP-2's image-derived text space from generic effects of audio-text contrastive training. The reported gains could therefore arise from improved audio discriminability alone rather than emergent audio-image alignment, directly undermining the 'visually grounded semantics' transfer claim.
Authors: To address this, we have added an ablation study in the revised version of §4. This ablation compares the performance when distilling from BioCLIP-2 text embeddings versus using a generic audio-text contrastive fine-tuning without the image-derived semantics. The results demonstrate that the gains in audio-to-image retrieval are specifically attributable to the visually grounded text space from BioCLIP-2, rather than generic improvements in audio discriminability. revision: yes
- Referee: [§3.1, §4] §3.1 and §4: the assumption that BioCLIP-2 text embeddings encode sufficiently rich visual/taxonomic structure for transfer is load-bearing yet untested; no analysis (e.g., nearest-neighbor inspection or controlled text-only vs. image-text ablation) is provided to show that the alignment is driven by visual content rather than generic semantic overlap.
Authors: We have incorporated additional analyses in the revised manuscript to validate this assumption. Specifically, we include a nearest-neighbor analysis of the BioCLIP-2 text embeddings, showing that they cluster according to visual and taxonomic similarities among bird species. Furthermore, we provide a controlled ablation comparing transfer using BioCLIP-2 (image-text) embeddings versus embeddings from a text-only model, confirming that the visual component drives the effective alignment. revision: yes
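A sketch of the kind of nearest-neighbor inspection described here, assuming precomputed, L2-normalized text embeddings for a list of species names; whether the printed neighbors cluster by family or visual similarity is the informal check, not a result from the paper.

```python
# Sketch: nearest-neighbor inspection of species-name text embeddings.
# Assumes `text_emb` is an (n_species, d) L2-normalized numpy array produced by
# the image-text model's text encoder, parallel to `species_names`.
import numpy as np

def print_text_neighbors(text_emb, species_names, k=5):
    sims = text_emb @ text_emb.T
    np.fill_diagonal(sims, -np.inf)                  # exclude self-matches
    for i, name in enumerate(species_names):
        nn = np.argsort(-sims[i])[:k]
        neighbors = ", ".join(species_names[j] for j in nn)
        print(f"{name}: {neighbors}")
```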
Circularity Check
No significant circularity; derivation relies on independent pretrained models and empirical evaluation
Full rationale
The paper's central procedure fine-tunes a pretrained audio encoder (BioLingual) via contrastive loss against text embeddings from an independent pretrained image-text model (BioCLIP-2). The resulting audio representations are then evaluated for audio-to-image retrieval on the SSW60 benchmark using held-out paired data that was never seen during training. No equation or step reduces the reported retrieval metric to a fitted parameter defined on the same data, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled through prior work. The performance claim is therefore an empirical outcome of the distillation rather than a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The text embedding space of BioCLIP-2 encodes rich visual and taxonomic structure transferable to audio via contrastive alignment.
Reference graph
Works this paper leans on
- [1] F. E. Jaimi, W. Rabhi, W. Amara, Z. Charouh, H. Benaboud, and M. B. Saindou, "Lasbird: Large Scale Bird Recognition Dataset."
- [2] [Online]. Available: https://dx.doi.org/10.21227/s3xd-2s66
- [3] S. Stevens, J. Wu, M. J. Thompson, E. G. Campolongo, C. H. Song, D. E. Carlyn, L. Dong, W. M. Dahdul, C. Stewart, T. Berger-Wolf, W.-L. Chao, and Y. Su, "TreeOfLife-10M," 2023. [Online]. Available: https://huggingface.co/datasets/imageomics/TreeOfLife-10M
- [4] J. Gu, S. Stevens, E. G. Campolongo, M. J. Thompson, N. Zhang, J. Wu, A. Kopanev, Z. Mai, A. E. White, J. Balhoff, W. M. Dahdul, D. Rubenstein, H. Lapp, T. Berger-Wolf, W.-L. Chao, and Y. Su, "TreeOfLife-200M (revision a8f38b4)," 2025. [Online]. Available: https://huggingface.co/datasets/imageomics/TreeOfLife-200M
- [5] G. V. Horn and macaodha, "iNat Challenge 2021 - FGVC8," https://kaggle.com/competitions/inaturalist-2021, 2021. Kaggle.
- [6] E. Vendrow, O. Pantazis, A. Shepard, G. Brostow, K. Jones, O. Mac Aodha, S. Beery, and G. Van Horn, "INQUIRE: A Natural World Text-to-Image Retrieval Benchmark," Advances in Neural Information Processing Systems, vol. 37, pp. 126500–126514, 2024.
- [7] S. Stevens, J. Wu, M. J. Thompson, E. G. Campolongo, C. H. Song, D. E. Carlyn, L. Dong, W. M. Dahdul, C. Stewart, T. Berger-Wolf et al., "BioCLIP: A Vision Foundation Model for the Tree of Life," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19412–19424.
- [8] J. Gu, S. Stevens, E. G. Campolongo, M. J. Thompson, N. Zhang, J. Wu, A. Kopanev, Z. Mai, A. E. White, J. Balhoff et al., "BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning," arXiv preprint arXiv:2505.23883, 2025.
- [9] M. Clapp, S. Kahl, E. Meyer, M. McKenna, H. Klinck, and G. Patricelli, "A collection of fully-annotated soundscape recordings from the southern Sierra Nevada mountain range," 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7525805
- [10] Á. Vega-Hidalgo, S. Kahl, L. B. Symes, V. Ruiz-Gutiérrez, I. Molina-Mora, F. Cediel, L. Sandoval, and H. Klinck, "A collection of fully-annotated soundscape recordings from neotropical coffee farms in Colombia and Costa Rica," 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7525349
- [11] W. A. Hopping, S. Kahl, and H. Klinck, "A collection of fully-annotated soundscape recordings from the southwestern Amazon basin," Oct. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7079124
- [13] S. Kahl, R. Charif, and H. Klinck, "A collection of fully-annotated soundscape recordings from the northeastern United States," 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7079380
- [14] A. Navine, S. Kahl, A. Tanimoto-Johnson, H. Klinck, and P. Hart, "A collection of fully-annotated soundscape recordings from the island of Hawaii," 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7078499
- [15] G. Van Horn, R. Qian, K. Wilber, H. Adam, O. Mac Aodha, and S. Belongie, "Exploring fine-grained audiovisual categorization with the SSW60 dataset," in European Conference on Computer Vision. Springer, 2022, pp. 271–289.
- [16] D. Robinson, A. Robinson, and L. Akrapongpisak, "Transferable Models for Bioacoustics with Human Language Supervision," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1316–1320.
- [17] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, "ImageBind: One Embedding Space To Bind Them All," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15180–15190.
- [18] B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, W. HongFa, Y. Pang, W. Jiang, J. Zhang, Z. Li, C. W. Zhang, Z. Li, W. Liu, and L. Yuan, "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/for...
- [19] S. Sastry, S. Khanal, A. Dhakal, A. Ahmad, and N. Jacobs, "TaxaBind: A Unified Embedding Space for Ecological Applications," in Winter Conference on Applications of Computer Vision. IEEE/CVF, 2025.
- [20] Z. Wang, Y. Zhao, X. Cheng, H. Huang, J. Liu, A. Yin, L. Tang, L. Li, Y. Wang, Z. Zhang, and Z. Zhao, "Connecting Multi-modal Contrastive Representations," in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=IGTbT9P1ti
- [21] L. Picek, C. Botella, M. Servajean, C. Leblanc, R. Palard, T. Larcher, B. Deneu, D. Marcos, P. Bonnet, and A. Joly, "GeoPlant: Spatial Plant Species Prediction Dataset," in NeurIPS 2024 Datasets and Benchmarks Track, 2024.
- [22] M. Chasmai, A. Shepard, S. Maji, and G. Van Horn, "The iNaturalist Sounds Dataset," Advances in Neural Information Processing Systems, vol. 37, pp. 132524–132544, 2024.
- [23] A. Howard, H. Klinck, S. Dane, S. Kahl, tom denton, and T. Denton, "Cornell Birdcall Identification," https://kaggle.com/competitions/birdsong-recognition, 2020. Kaggle.