Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

Aayush Dhakal; Abby Stylianou; Adeel Ahmad; Nathan Jacobs; Srikumar Sastry; Subash Khanal

arxiv: 2505.13777 · v2 · submitted 2025-05-19 · 💻 cs.CV · cs.AI· cs.SD

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

Subash Khanal , Srikumar Sastry , Aayush Dhakal , Adeel Ahmad , Abby Stylianou , Nathan Jacobs This is my paper

Pith reviewed 2026-05-22 13:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.SD

keywords soundscape mappingsatellite imagerycross-modal retrievalmultimodal learningzero-shot predictiongeospatial audiovision-language modelssound synthesis

0 comments

The pith

Sat2Sound learns shared soundscape concepts from satellite images, audio and text to map sounds at any Earth location without paired recordings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that augments limited paired satellite-audio datasets with soundscape descriptions generated by vision-language models. It trains by aligning audio, text, satellite images and synthetic captions through contrastive and codebook-aligned learning. This process identifies soundscape concepts that link the modalities and support mapping from images alone. The approach also retrieves captions that text-to-audio models can turn into sound. A reader would care because it removes the need for dense audio collections to understand acoustic environments across the globe.

Core claim

Sat2Sound is a unified multimodal framework that jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of soundscape concepts shared across modalities and thereby enabling hyper-localized, explainable soundscape mapping plus location-conditioned soundscape synthesis.

What carries the argument

soundscape concepts: shared representations discovered across audio, text, satellite images and synthetic captions via contrastive and codebook-aligned learning.

If this is right

State-of-the-art cross-modal retrieval performance between satellite images and audio on the GeoSound and SoundingEarth benchmarks.
Hyper-localized and explainable soundscape mapping directly from satellite imagery.
Location-conditioned soundscape synthesis by retrieving detailed captions for rendering through text-to-audio models.
Effective operation for immersive and educational applications even with limited computational resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could reduce dependence on scarce geotagged audio by substituting model-generated descriptions for many locations.
Historical satellite archives might allow reconstruction of past soundscapes using the same learned concepts.
The concept-discovery mechanism could transfer to other geospatial multimodal tasks such as mapping visual or olfactory features.

Load-bearing premise

Vision-language models generate soundscape descriptions from satellite images that accurately broaden the diversity of ambient sounds at each location without introducing systematic biases.

What would settle it

Measuring cross-modal retrieval accuracy on a fresh set of real paired satellite-audio samples from unseen regions and finding that performance falls below prior methods that avoid generated descriptions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2505.13777 by Aayush Dhakal, Abby Stylianou, Adeel Ahmad, Nathan Jacobs, Srikumar Sastry, Subash Khanal.

**Figure 1.** Figure 1: Sat2Sound framework learns a shared multimodal embedding space between satellite images, audio, audio captions, and image [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Comparison of original audio captions and synthetic [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Soundscape mapping framework using Sat2Sound’s encoders. (b) A landcover map for the United States for comparison to [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Alignment between patches in a single image and sound [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of Top-1 retrieved LLaVA captions for a Bing image by Sat2Sound from our gallery, which is the [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Some example groups from the GeoSound test set are shown, where each group shares a common set of highly activated codebook [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

read the original abstract

We present Sat2Sound, a unified multimodal framework for geospatial soundscape understanding, designed to predict and map the distribution of sounds across the Earth's surface. Existing methods for this task rely on paired satellite images and geotagged audio samples, which often fail to capture the full diversity of sound at a location. Sat2Sound overcomes this limitation by augmenting datasets with semantically rich, vision-language model-generated soundscape descriptions, which broaden the range of possible ambient sounds represented at each location. Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of "soundscape concepts" shared across modalities, enabling hyper-localized, explainable soundscape mapping. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on the GeoSound and SoundingEarth benchmarks. Finally, by retrieving detailed soundscape captions that can be rendered through text-to-audio models, Sat2Sound enables location-conditioned soundscape synthesis for immersive and educational applications, even with limited computational resources. Our code and models are available at https://github.com/mvrl/sat2sound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sat2Sound augments scarce satellite-audio pairs with VLM-generated descriptions then aligns them to audio and text via contrastive losses plus a codebook of shared concepts, but the SOTA retrieval claim sits on thin evidence.

read the letter

The main thing here is that they generate extra soundscape descriptions from satellite images using a vision-language model to expand the range of sounds tied to each location, then train a joint model that pulls satellite images, real audio, text descriptions, and synthetic captions into alignment through contrastive learning and a learned codebook of common concepts. This lets them do cross-modal retrieval and even turn satellite views into rendered soundscapes via text-to-audio models. The code is public, which helps.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sat2Sound, a unified multimodal framework for geospatial soundscape understanding and zero-shot mapping. It augments limited paired satellite-image and geotagged-audio datasets with semantically rich soundscape descriptions generated by vision-language models, then jointly optimizes contrastive losses and codebook alignment across satellite images, real audio, text descriptions, and synthetic captions to discover shared 'soundscape concepts'. The framework claims state-of-the-art cross-modal retrieval performance between satellite images and audio on the GeoSound and SoundingEarth benchmarks and demonstrates downstream use for location-conditioned soundscape synthesis via text-to-audio rendering.

Significance. If the empirical claims hold, the work offers a practical route to increase acoustic diversity in geospatial audio datasets without additional field collection, enabling more explainable and hyper-localized soundscape mapping. The open release of code and models at the cited GitHub repository is a clear strength that supports reproducibility.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): the claim of state-of-the-art retrieval performance on GeoSound and SoundingEarth is stated without any reported metrics (e.g., Recall@K, mAP), baselines, ablation tables, or statistical significance tests. This absence prevents assessment of whether the VLM-augmented descriptions actually drive the reported gains or whether the result is driven by implementation details.
[§3.2 and §4.3] §3.2 (Method) and §4.3 (Ablations): no explicit ablation isolates the contribution of VLM-generated captions versus real audio-text pairs, nor is there a bias audit measuring how often generated descriptions introduce visually salient but acoustically implausible events. Without such controls, it remains possible that the learned soundscape codebook exploits VLM priors rather than genuine cross-modal correspondence, directly affecting the central claim of broadened acoustic diversity.

minor comments (2)

[§3] Notation for the codebook and contrastive losses is introduced without an explicit equation numbering or diagram; a single consolidated figure showing the four-modality alignment would improve readability.
[§2] The paper should cite prior work on vision-language augmentation for audio (e.g., recent CLAP or AudioCLIP extensions) to better situate the novelty of the joint image-audio-text codebook.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results and strengthen the supporting evidence for our claims. We respond to each major comment below and indicate the corresponding revisions.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of state-of-the-art retrieval performance on GeoSound and SoundingEarth is stated without any reported metrics (e.g., Recall@K, mAP), baselines, ablation tables, or statistical significance tests. This absence prevents assessment of whether the VLM-augmented descriptions actually drive the reported gains or whether the result is driven by implementation details.

Authors: We agree that the abstract would benefit from explicit numerical results. The full manuscript already reports Recall@K, mAP, and baseline comparisons in Tables 1–3 of §4, together with ablation tables. To address the concern directly, we will update the abstract to include the key quantitative gains (e.g., Recall@5 on both benchmarks) and add a short paragraph in §4 discussing statistical significance of the improvements attributable to the VLM-augmented captions versus implementation choices. revision: yes
Referee: [§3.2 and §4.3] §3.2 (Method) and §4.3 (Ablations): no explicit ablation isolates the contribution of VLM-generated captions versus real audio-text pairs, nor is there a bias audit measuring how often generated descriptions introduce visually salient but acoustically implausible events. Without such controls, it remains possible that the learned soundscape codebook exploits VLM priors rather than genuine cross-modal correspondence, directly affecting the central claim of broadened acoustic diversity.

Authors: We acknowledge the value of an isolated comparison. Section 4.3 already ablates loss terms and modality combinations; we will add a dedicated table that directly contrasts performance when VLM-generated captions are included versus when only real audio-text pairs are used. For the bias audit, we will insert a new qualitative analysis subsection that examines a sample of generated descriptions for acoustically implausible events and explains how joint contrastive training with real audio mitigates over-reliance on VLM priors. These additions will more rigorously support the diversity claim. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical multimodal training pipeline

full rationale

The paper describes an empirical framework that augments paired satellite-audio data with VLM-generated text descriptions and trains a shared embedding space via contrastive and codebook losses. No derivation chain, first-principles equations, or 'predictions' are presented that reduce to fitted parameters or self-referential definitions by construction. Performance claims are evaluated on external benchmarks (GeoSound, SoundingEarth) rather than internal construction, and the method remains falsifiable against real audio distributions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the quality of externally generated text descriptions and the assumption that contrastive alignment will produce meaningful shared concepts; no free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption Vision-language models produce soundscape descriptions that are sufficiently accurate and diverse to augment real audio data effectively.
This is invoked to overcome the limitation of paired satellite-audio datasets.

invented entities (1)

soundscape concepts no independent evidence
purpose: Reusable, explainable representations shared across audio, text, and image modalities.
Discovered via codebook-aligned learning; no independent external validation is described.

pith-pipeline@v0.9.0 · 5759 in / 1270 out tokens · 40076 ms · 2026-05-22T13:34:49.353608+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of 'soundscape concepts' shared across modalities
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ri_m = max_j (p_i^j · C_m); w_i^m = Softmax(Sparsemax(r_i))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
cs.MM 2026-04 unverdicted novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Freesound, https://freesound.org. 6

work page
[2]

Radio aporee: Maps - sounds of the world, https://aporee.org. 6

work page
[3]

inaturalist, https://www.inaturalist.org. 6

work page
[4]

Associations between soundscape experience and self-reported wellbeing in open public urban spaces: a field study.The Lancet, 394:S17, 2019

Francesco Aletta, Tin Oberman, Andrew Mitchell, Mercede Erfanian, Matteo Lionello, Magdalena Kachlicka, and Jian Kang. Associations between soundscape experience and self-reported wellbeing in open public urban spaces: a field study.The Lancet, 394:S17, 2019. 1

work page 2019
[5]

Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens

Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N Metaxas, and Hongxia Yang. Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15095–15104, 2023. 1, 2, 4

work page 2023
[6]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shil- iang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023. 3, 6, 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Improved probabilistic image-text rep- resentations

Sanghyuk Chun. Improved probabilistic image-text rep- resentations. InThe Twelfth International Conference on Learning Representations, 2024. 2, 4

work page 2024
[8]

Probabilistic embeddings for cross-modal retrieval

Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2021. 2

work page 2021
[9]

Lobell, and Stefano Ermon

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, and Stefano Ermon. SatMAE: Pre-training transformers for tem- poral and multi-spectral satellite imagery. InAdvances in Neural Information Processing Systems, 2022. 1, 3, 7

work page 2022
[10]

Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090–18108, 2023

Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090–18108, 2023. 3, 6, 1

work page 2023
[11]

Audio–visual representation learning for anomaly events de- tection in crowds.Neurocomputing, 582:127489, 2024

Junyu Gao, Hao Yang, Maoguo Gong, and Xuelong Li. Audio–visual representation learning for anomaly events de- tection in crowds.Neurocomputing, 582:127489, 2024. 2

work page 2024
[12]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023. 2, 3, 4, 7

work page 2023
[13]

Vec- tor quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 10696–10706, 2022. 2

work page 2022
[14]

Self- supervised audiovisual representation learning for remote sensing data.International Journal of Applied Earth Ob- servation and Geoinformation, 116:103130, 2023

Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, and Xiao Xiang Zhu. Self- supervised audiovisual representation learning for remote sensing data.International Journal of Applied Earth Ob- servation and Geoinformation, 116:103130, 2023. 3

work page 2023
[15]

Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization

Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. arXiv preprint arXiv:2412.21037, 2024. 6, 8, 1

work page arXiv 2024
[16]

Taming visually guided sound generation

Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. InThe 32st British Machine Vision Vir- tual Conference. BMV A Press, 2021. 2

work page 2021
[17]

Learning tri-modal embeddings for zero-shot soundscape mapping

Subash Khanal, Srikumar Sastry, Aayush Dhakal, and Nathan Jacobs. Learning tri-modal embeddings for zero-shot soundscape mapping. InBritish Machine Vision Conference (BMVC), 2023. 1, 2, 5, 8

work page 2023
[18]

Psm: Learning probabilistic embeddings for multi-scale zero-shot soundscape mapping

Subash Khanal, Xing Eric, Srikumar Sastry, Aayush Dhakal, Xiong Zhexiao, Adeel Ahmad, and Nathan Jacobs. Psm: Learning probabilistic embeddings for multi-scale zero-shot soundscape mapping. InACM Multimedia, 2024. 1, 2, 3, 4, 5, 6, 8

work page 2024
[19]

Measuring biodiversity with sound: How effective are acoustic indices for quantifying biodiversity in a tropical dry forest?Conservation Science and Practice, 6(6):e13133, 2024

Mayuri Kotian, Siddharth Biniwale, Pravar Mourya, Zuzana Burivalova, and Pooja Choksi. Measuring biodiversity with sound: How effective are acoustic indices for quantifying biodiversity in a tropical dry forest?Conservation Science and Practice, 6(6):e13133, 2024. 1

work page 2024
[20]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 2

work page 2021
[21]

Advancing multi-grained alignment for contrastive language-audio pre-training

Yiming Li, Zhifang Guo, Xiangdong Wang, and Hong Liu. Advancing multi-grained alignment for contrastive language-audio pre-training. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7356– 7365, 2024. 2, 4, 1, 3, 7

work page 2024
[22]

Cross-modal dis- crete representation learning

Alex Liu, SouYoung Jin, Cheng-I Lai, Andrew Rou- ditchenko, Aude Oliva, and James Glass. Cross-modal dis- crete representation learning. InProceedings of the 60th An- nual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 3013–3035, 2022. 2

work page 2022
[23]

Visual instruction tuning.Advances in neural information processing systems, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 3, 1

work page 2024
[24]

From softmax to sparsemax: A sparse model of attention and multi-label clas- sification

Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label clas- sification. InInternational conference on machine learning, pages 1614–1623. PMLR, 2016. 4, 2

work page 2016
[25]

Economic evaluation of the indoor environ- mental quality of buildings: The noise pollution effects on housing prices in the city of bari (italy).Buildings, 11(5): 213, 2021

Pierluigi Morano, Francesco Tajani, Felicia Di Liddo, and Michele Dar `o. Economic evaluation of the indoor environ- mental quality of buildings: The noise pollution effects on housing prices in the city of bari (italy).Buildings, 11(5): 213, 2021. 1 9

work page 2021
[26]

Rethinking transformers pre-training for multi- spectral satellite imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwar, Salman Khan, and Fahad Shah- baz Khan. Rethinking transformers pre-training for multi- spectral satellite imagery. InCVPR, 2024. 3, 7

work page 2024
[27]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 4

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 4

work page 2021
[29]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brock- man, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088– 4099, 2023. 1

work page 2023
[30]

Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023

Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023. 1

work page 2023
[31]

A multimodal approach to mapping soundscapes

Tawfiq Salem, Menghua Zhai, Scott Workman, and Nathan Jacobs. A multimodal approach to mapping soundscapes. InProceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition Workshops, pages 2524–2527,

work page
[32]

Taxabind: A unified embedding space for ecological applications

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ah- mad, and Nathan Jacobs. Taxabind: A unified embedding space for ecological applications. In2025 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 1765–1774. IEEE, 2025. 3, 4, 7

work page 2025
[33]

I hear your true colors: Im- age guided audio generation

Roy Sheffer and Yossi Adi. I hear your true colors: Im- age guided audio generation. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 2

work page 2023
[34]

Sound to visual scene genera- tion by audio-to-visual latent alignment

Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, and Tae-Hyun Oh. Sound to visual scene genera- tion by audio-to-visual latent alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6430–6440, 2023. 2

work page 2023
[35]

Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li

Bart Thomee, David A. Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 6

work page 2016
[36]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2

work page 2017
[38]

Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024

Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024. 2

work page 2024
[39]

V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foun- dation models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foun- dation models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 15492–15501, 2024. 2

work page 2024
[40]

Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Tay- lor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023. 3, 7

work page 2023
[41]

Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 3, 1

work page 2023
[42]

Achiev- ing cross modal generalization with multimodal unified rep- resentation.Advances in Neural Information Processing Sys- tems, 36, 2024

Yan Xia, Hai Huang, Jieming Zhu, and Zhou Zhao. Achiev- ing cross modal generalization with multimodal unified rep- resentation.Advances in Neural Information Processing Sys- tems, 36, 2024. 2

work page 2024
[43]

Diffsound: Discrete diffusion model for text-to-sound generation.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720–1733, 2023

Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720–1733, 2023. 2

work page 2023
[44]

Filip: Fine-grained interactive language-image pre-training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783, 2021. 1, 2

work page arXiv 2021
[45]

Coca: Contrastive captioners are image-text foundation models.Transactions on Machine Learning Research, 2022

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.Transactions on Machine Learning Research, 2022. 2

work page 2022
[46]

Donghuo Zeng, Jianming Wu, Gen Hattori, Rong Xu, and Yi Yu. Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval.ACM Transactions on Multimedia Computing, Communications and Applications, 19(2s):1–23, 2023. 2

work page 2023
[47]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4

work page 2023
[48]

Audio-synchronized visual animation

Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Mor- gado. Audio-synchronized visual animation. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2025. 2 10 Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping Supplementary Material

work page 2025
[49]

SoundingEarthwith0.2mGSDGoogle Earthsatellite im- age tiles of size (1024×1024) contains41469/3242/5801 train/validation/test samples

Experimental Details Datasets:We experiment with two datasets:GeoSoundand SoundingEarth.GeoSoundcontains294019/5000/9931 train/validation/test samples and uses both0.6mGSD (Ground Sample Distance)Bingimage tiles (1500×1500) and10mGSDSentinel-2image tiles (1280×1280). SoundingEarthwith0.2mGSDGoogle Earthsatellite im- age tiles of size (1024×1024) contains4...

work page
[50]

What types of sounds can we expect to hear from the location captured by this aerial view image? Describe in up to two sentences

or Qwen-Audio [6], with the caption selection based on the caption’s CLAP score [41] with the ground-truth audio. For theGeoSounddataset, this resulted in 58.7% of audio captions from Pengi, 23.8% from Qwen-Audio, and 17.5% from human-annotated text. Image Captions:For the cropped satellite images at each scale, we generate detailed soundscape cap- tions ...

work page
[51]

Loss Ablation We conduct an ablation study on different components of the loss to assess their impact on the overall training ob- jective (Equation 9)

Ablation Studies 9.1. Loss Ablation We conduct an ablation study on different components of the loss to assess their impact on the overall training ob- jective (Equation 9). We observe that the addition of the composite audio-based loss (L † i,a+c) slightly improves the performance of the standard audio-image cross-modal re- trieval as observed in Table 4...

work page
[52]

As shown in the image-text cross-modal retrieval results (Table 10), existing pre-trained image-text models underperform compared to Sat2Sound

Simpler Baselines In this section, we compare the performance of Sat2Sound with existing off-the-shelf multimodal em- bedding spaces. As shown in the image-text cross-modal retrieval results (Table 10), existing pre-trained image-text models underperform compared to Sat2Sound. We attribute this to the mismatch between the soundscape descriptions generated...

work page
[53]

The results presented in the main paper are for satellite imagery at scale1

Multi-scale Cross-Modal Retrieval Sat2Sound is trained on multi-scale satellite imagery for the GeoSound dataset. The results presented in the main paper are for satellite imagery at scale1. In this section, we present results for two additional scales:3and5, us- ing both Sentinel and Bing imagery from the GeoSound dataset. Additionally, for both datasets...

work page
[54]

In this section, we qualitatively explore what the codebook has learned

Analyzing codebook concepts As illustrated in Figure 4, the codebook learned by Sat2Sound can be used to generate fine-grained soundscape maps for regions covered by a single satellite image. In this section, we qualitatively explore what the codebook has learned. Specifically, for our gallery of image captions, we first obtain the corresponding codebook ...

work page
[55]

We evaluate these embeddings on two downstream tasks: audio classification and satellite image classification, using linear probing on the audio and image embeddings, respectively

Linear Probing Experiments Sat2Sound learns a multimodal embedding space between audio and satellite imagery. We evaluate these embeddings on two downstream tasks: audio classification and satellite image classification, using linear probing on the audio and image embeddings, respectively. For audio classification, we compare the Sat2Sound audio encoder a...

work page 2022

[1] [1]

Freesound, https://freesound.org. 6

work page

[2] [2]

Radio aporee: Maps - sounds of the world, https://aporee.org. 6

work page

[3] [3]

inaturalist, https://www.inaturalist.org. 6

work page

[4] [4]

Associations between soundscape experience and self-reported wellbeing in open public urban spaces: a field study.The Lancet, 394:S17, 2019

Francesco Aletta, Tin Oberman, Andrew Mitchell, Mercede Erfanian, Matteo Lionello, Magdalena Kachlicka, and Jian Kang. Associations between soundscape experience and self-reported wellbeing in open public urban spaces: a field study.The Lancet, 394:S17, 2019. 1

work page 2019

[5] [5]

Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens

Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N Metaxas, and Hongxia Yang. Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15095–15104, 2023. 1, 2, 4

work page 2023

[6] [6]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shil- iang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023. 3, 6, 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Improved probabilistic image-text rep- resentations

Sanghyuk Chun. Improved probabilistic image-text rep- resentations. InThe Twelfth International Conference on Learning Representations, 2024. 2, 4

work page 2024

[8] [8]

Probabilistic embeddings for cross-modal retrieval

Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2021. 2

work page 2021

[9] [9]

Lobell, and Stefano Ermon

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, and Stefano Ermon. SatMAE: Pre-training transformers for tem- poral and multi-spectral satellite imagery. InAdvances in Neural Information Processing Systems, 2022. 1, 3, 7

work page 2022

[10] [10]

Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090–18108, 2023

Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090–18108, 2023. 3, 6, 1

work page 2023

[11] [11]

Audio–visual representation learning for anomaly events de- tection in crowds.Neurocomputing, 582:127489, 2024

Junyu Gao, Hao Yang, Maoguo Gong, and Xuelong Li. Audio–visual representation learning for anomaly events de- tection in crowds.Neurocomputing, 582:127489, 2024. 2

work page 2024

[12] [12]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023. 2, 3, 4, 7

work page 2023

[13] [13]

Vec- tor quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 10696–10706, 2022. 2

work page 2022

[14] [14]

Self- supervised audiovisual representation learning for remote sensing data.International Journal of Applied Earth Ob- servation and Geoinformation, 116:103130, 2023

Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, and Xiao Xiang Zhu. Self- supervised audiovisual representation learning for remote sensing data.International Journal of Applied Earth Ob- servation and Geoinformation, 116:103130, 2023. 3

work page 2023

[15] [15]

Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization

Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. arXiv preprint arXiv:2412.21037, 2024. 6, 8, 1

work page arXiv 2024

[16] [16]

Taming visually guided sound generation

Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. InThe 32st British Machine Vision Vir- tual Conference. BMV A Press, 2021. 2

work page 2021

[17] [17]

Learning tri-modal embeddings for zero-shot soundscape mapping

Subash Khanal, Srikumar Sastry, Aayush Dhakal, and Nathan Jacobs. Learning tri-modal embeddings for zero-shot soundscape mapping. InBritish Machine Vision Conference (BMVC), 2023. 1, 2, 5, 8

work page 2023

[18] [18]

Psm: Learning probabilistic embeddings for multi-scale zero-shot soundscape mapping

Subash Khanal, Xing Eric, Srikumar Sastry, Aayush Dhakal, Xiong Zhexiao, Adeel Ahmad, and Nathan Jacobs. Psm: Learning probabilistic embeddings for multi-scale zero-shot soundscape mapping. InACM Multimedia, 2024. 1, 2, 3, 4, 5, 6, 8

work page 2024

[19] [19]

Measuring biodiversity with sound: How effective are acoustic indices for quantifying biodiversity in a tropical dry forest?Conservation Science and Practice, 6(6):e13133, 2024

Mayuri Kotian, Siddharth Biniwale, Pravar Mourya, Zuzana Burivalova, and Pooja Choksi. Measuring biodiversity with sound: How effective are acoustic indices for quantifying biodiversity in a tropical dry forest?Conservation Science and Practice, 6(6):e13133, 2024. 1

work page 2024

[20] [20]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 2

work page 2021

[21] [21]

Advancing multi-grained alignment for contrastive language-audio pre-training

Yiming Li, Zhifang Guo, Xiangdong Wang, and Hong Liu. Advancing multi-grained alignment for contrastive language-audio pre-training. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7356– 7365, 2024. 2, 4, 1, 3, 7

work page 2024

[22] [22]

Cross-modal dis- crete representation learning

Alex Liu, SouYoung Jin, Cheng-I Lai, Andrew Rou- ditchenko, Aude Oliva, and James Glass. Cross-modal dis- crete representation learning. InProceedings of the 60th An- nual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 3013–3035, 2022. 2

work page 2022

[23] [23]

Visual instruction tuning.Advances in neural information processing systems, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 3, 1

work page 2024

[24] [24]

From softmax to sparsemax: A sparse model of attention and multi-label clas- sification

Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label clas- sification. InInternational conference on machine learning, pages 1614–1623. PMLR, 2016. 4, 2

work page 2016

[25] [25]

Economic evaluation of the indoor environ- mental quality of buildings: The noise pollution effects on housing prices in the city of bari (italy).Buildings, 11(5): 213, 2021

Pierluigi Morano, Francesco Tajani, Felicia Di Liddo, and Michele Dar `o. Economic evaluation of the indoor environ- mental quality of buildings: The noise pollution effects on housing prices in the city of bari (italy).Buildings, 11(5): 213, 2021. 1 9

work page 2021

[26] [26]

Rethinking transformers pre-training for multi- spectral satellite imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwar, Salman Khan, and Fahad Shah- baz Khan. Rethinking transformers pre-training for multi- spectral satellite imagery. InCVPR, 2024. 3, 7

work page 2024

[27] [27]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 4

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 4

work page 2021

[29] [29]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brock- man, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088– 4099, 2023. 1

work page 2023

[30] [30]

Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023

Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023. 1

work page 2023

[31] [31]

A multimodal approach to mapping soundscapes

Tawfiq Salem, Menghua Zhai, Scott Workman, and Nathan Jacobs. A multimodal approach to mapping soundscapes. InProceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition Workshops, pages 2524–2527,

work page

[32] [32]

Taxabind: A unified embedding space for ecological applications

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ah- mad, and Nathan Jacobs. Taxabind: A unified embedding space for ecological applications. In2025 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 1765–1774. IEEE, 2025. 3, 4, 7

work page 2025

[33] [33]

I hear your true colors: Im- age guided audio generation

Roy Sheffer and Yossi Adi. I hear your true colors: Im- age guided audio generation. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 2

work page 2023

[34] [34]

Sound to visual scene genera- tion by audio-to-visual latent alignment

Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, and Tae-Hyun Oh. Sound to visual scene genera- tion by audio-to-visual latent alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6430–6440, 2023. 2

work page 2023

[35] [35]

Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li

Bart Thomee, David A. Shamma, Gerald Friedland, Ben- jamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 6

work page 2016

[36] [36]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2

work page 2017

[38] [38]

Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024

Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024. 2

work page 2024

[39] [39]

V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foun- dation models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foun- dation models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 15492–15501, 2024. 2

work page 2024

[40] [40]

Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu*, Ke Chen*, Tianyu Zhang*, Yuchen Hui*, Tay- lor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023. 3, 7

work page 2023

[41] [41]

Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale con- trastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 3, 1

work page 2023

[42] [42]

Achiev- ing cross modal generalization with multimodal unified rep- resentation.Advances in Neural Information Processing Sys- tems, 36, 2024

Yan Xia, Hai Huang, Jieming Zhu, and Zhou Zhao. Achiev- ing cross modal generalization with multimodal unified rep- resentation.Advances in Neural Information Processing Sys- tems, 36, 2024. 2

work page 2024

[43] [43]

Diffsound: Discrete diffusion model for text-to-sound generation.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720–1733, 2023

Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720–1733, 2023. 2

work page 2023

[44] [44]

Filip: Fine-grained interactive language-image pre-training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783, 2021. 1, 2

work page arXiv 2021

[45] [45]

Coca: Contrastive captioners are image-text foundation models.Transactions on Machine Learning Research, 2022

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.Transactions on Machine Learning Research, 2022. 2

work page 2022

[46] [46]

Donghuo Zeng, Jianming Wu, Gen Hattori, Rong Xu, and Yi Yu. Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval.ACM Transactions on Multimedia Computing, Communications and Applications, 19(2s):1–23, 2023. 2

work page 2023

[47] [47]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4

work page 2023

[48] [48]

Audio-synchronized visual animation

Lin Zhang, Shentong Mo, Yijing Zhang, and Pedro Mor- gado. Audio-synchronized visual animation. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2025. 2 10 Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping Supplementary Material

work page 2025

[49] [49]

SoundingEarthwith0.2mGSDGoogle Earthsatellite im- age tiles of size (1024×1024) contains41469/3242/5801 train/validation/test samples

Experimental Details Datasets:We experiment with two datasets:GeoSoundand SoundingEarth.GeoSoundcontains294019/5000/9931 train/validation/test samples and uses both0.6mGSD (Ground Sample Distance)Bingimage tiles (1500×1500) and10mGSDSentinel-2image tiles (1280×1280). SoundingEarthwith0.2mGSDGoogle Earthsatellite im- age tiles of size (1024×1024) contains4...

work page

[50] [50]

What types of sounds can we expect to hear from the location captured by this aerial view image? Describe in up to two sentences

or Qwen-Audio [6], with the caption selection based on the caption’s CLAP score [41] with the ground-truth audio. For theGeoSounddataset, this resulted in 58.7% of audio captions from Pengi, 23.8% from Qwen-Audio, and 17.5% from human-annotated text. Image Captions:For the cropped satellite images at each scale, we generate detailed soundscape cap- tions ...

work page

[51] [51]

Loss Ablation We conduct an ablation study on different components of the loss to assess their impact on the overall training ob- jective (Equation 9)

Ablation Studies 9.1. Loss Ablation We conduct an ablation study on different components of the loss to assess their impact on the overall training ob- jective (Equation 9). We observe that the addition of the composite audio-based loss (L † i,a+c) slightly improves the performance of the standard audio-image cross-modal re- trieval as observed in Table 4...

work page

[52] [52]

As shown in the image-text cross-modal retrieval results (Table 10), existing pre-trained image-text models underperform compared to Sat2Sound

Simpler Baselines In this section, we compare the performance of Sat2Sound with existing off-the-shelf multimodal em- bedding spaces. As shown in the image-text cross-modal retrieval results (Table 10), existing pre-trained image-text models underperform compared to Sat2Sound. We attribute this to the mismatch between the soundscape descriptions generated...

work page

[53] [53]

The results presented in the main paper are for satellite imagery at scale1

Multi-scale Cross-Modal Retrieval Sat2Sound is trained on multi-scale satellite imagery for the GeoSound dataset. The results presented in the main paper are for satellite imagery at scale1. In this section, we present results for two additional scales:3and5, us- ing both Sentinel and Bing imagery from the GeoSound dataset. Additionally, for both datasets...

work page

[54] [54]

In this section, we qualitatively explore what the codebook has learned

Analyzing codebook concepts As illustrated in Figure 4, the codebook learned by Sat2Sound can be used to generate fine-grained soundscape maps for regions covered by a single satellite image. In this section, we qualitatively explore what the codebook has learned. Specifically, for our gallery of image captions, we first obtain the corresponding codebook ...

work page

[55] [55]

We evaluate these embeddings on two downstream tasks: audio classification and satellite image classification, using linear probing on the audio and image embeddings, respectively

Linear Probing Experiments Sat2Sound learns a multimodal embedding space between audio and satellite imagery. We evaluate these embeddings on two downstream tasks: audio classification and satellite image classification, using linear probing on the audio and image embeddings, respectively. For audio classification, we compare the Sat2Sound audio encoder a...

work page 2022