pith. machine review for the scientific record.

arxiv: 2604.14707 · v1 · submitted 2026-04-16 · 💻 cs.MM · cs.SD

Recognition: unknown

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:49 UTC · model grok-4.3

classification 💻 cs.MM cs.SD
keywords: soundscape generation · satellite imagery · image-to-audio · geo-acoustic alignment · geospatial attributes · acoustic embedding · multimodal synthesis · benchmark dataset

The pith

Geo2Sound generates realistic soundscapes from satellite imagery by aligning geographic attributes with acoustic embeddings

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Geo2Sound as a framework for creating soundscapes that match the real acoustic environments of the locations shown in satellite views. Existing image-to-audio techniques work well for close-up scenes but fail on the broad semantic ambiguity of top-down satellite imagery, which otherwise offers a scalable way to map sounds across the globe. The approach first condenses overhead scenes into geographic attributes, generates multiple plausible audio candidates from expanded semantics, and then selects the best match through projection into acoustic space. If this holds, soundscapes could be produced for any location using only existing satellite coverage, supporting applications from environmental mapping to virtual environments. Experiments report a new benchmark dataset and performance gains over baselines in both objective metrics and human ratings of realism and alignment.
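Read as a pipeline, the flow is: classify the scene into attributes, expand them into sound-oriented hypotheses, synthesize one audio candidate per hypothesis, and select by geo-acoustic consistency. The sketch below illustrates that flow with hypothetical interfaces (`classifier`, `expander`, `tta_model`, `aligner`); the paper's actual models, prompts, and attribute schema are not specified here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Candidate:
    prompt: str            # expanded sound-oriented description
    audio: np.ndarray      # generated waveform
    embedding: np.ndarray  # acoustic embedding of the generated audio


def generate_soundscape(satellite_image, classifier, expander, tta_model, aligner):
    """Attributes -> hypotheses -> candidates -> geo-acoustic selection (illustrative only)."""
    # 1. Condense the overhead scene into compact geographic attributes
    #    (land cover, built density, proximity to water, ...) with a lightweight classifier.
    attributes = classifier.predict(satellite_image)

    # 2. Expand the attributes into several sound-oriented semantic hypotheses and
    #    synthesize one audio candidate per hypothesis with a text-to-audio model.
    candidates = []
    for prompt in expander.hypotheses(attributes):
        audio = tta_model.generate(prompt)
        candidates.append(Candidate(prompt, audio, aligner.embed_audio(audio)))

    # 3. Project the attributes into the acoustic embedding space and keep the
    #    candidate whose embedding is most consistent with that projection.
    query = aligner.project_attributes(attributes)
    best = max(candidates, key=lambda c: aligner.similarity(query, c.embedding))
    return best.audio
```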

Core claim

Geo2Sound establishes a unified framework for generating geographically realistic soundscapes from satellite imagery by combining structural geospatial attribute modeling, semantic hypothesis expansion to produce diverse audio candidates, and a geo-acoustic alignment module that projects attributes into acoustic embedding space to pick the most consistent candidate. The work also releases SatSound-Bench, a dataset of over 20,000 paired satellite images, text descriptions, and field recordings collected across more than 10 countries and complemented by three public datasets. This yields a state-of-the-art Fréchet Audio Distance (FAD) of 1.765, a 50% improvement over the strongest baseline, plus a 26.5% gain in human-rated realism and further gains in semantic alignment.
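For reference, Fréchet Audio Distance is the Fréchet distance between Gaussians fit to embeddings of reference and generated audio clips. The snippet below is the generic definition as a reading aid; the paper's embedding backbone and evaluation split are not given in the abstract. Under this metric, a score of 1.765 against a strongest baseline near 3.53 would be consistent with the reported 50% improvement.

```python
# Generic Fréchet Audio Distance between two sets of clip-level embeddings.
import numpy as np
from scipy import linalg


def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (N, D) arrays of clip-level audio embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary noise.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```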

What carries the argument

The geo-acoustic alignment module, which summarizes satellite scenes into compact geographic attributes, projects them into acoustic embedding space, and selects the most consistent sound candidate from multiple semantic hypotheses
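One plausible realization of the selection step, assuming a learned linear projection from attribute vectors into the acoustic embedding space and cosine similarity as the consistency score; neither detail is confirmed by the abstract, and the projection weights `W` are simply assumed to be trained elsewhere.

```python
import numpy as np


def select_candidate(attributes: np.ndarray, candidate_embs: np.ndarray, W: np.ndarray) -> int:
    """attributes: (A,), candidate_embs: (K, D), W: (A, D) assumed learned projection."""
    query = attributes @ W                                    # (D,) geo query in acoustic space
    query = query / (np.linalg.norm(query) + 1e-8)
    cands = candidate_embs / (np.linalg.norm(candidate_embs, axis=1, keepdims=True) + 1e-8)
    scores = cands @ query                                    # cosine similarity per candidate
    return int(np.argmax(scores))                             # index of the selected soundscape
```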

If this is right

  • Soundscapes can be generated at global scale from any satellite-covered location without requiring local audio capture.
  • Resolution of visual-to-audio ambiguity in top-down views improves both objective fidelity and human-perceived realism and semantic fit.
  • A large-scale paired benchmark dataset becomes available to standardize evaluation for future geo-aligned audio generation methods.
  • The attribute-based selection process offers a general mechanism for handling multiple plausible interpretations in cross-modal generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The framework could support real-time audio overlays for satellite-based mapping tools or digital environmental twins.
  • Integration with ongoing satellite monitoring might allow sound generation to track ecological or urban changes visible from space.
  • Alignment techniques developed here could extend to other cross-modal tasks, such as location-specific scent or texture synthesis from imagery.

Load-bearing premise

That geographic attributes extracted from wide-area satellite views can be mapped reliably to a unique acoustic signature despite inherent visual ambiguity in complex scenes.

What would settle it

Blind listening tests on a held-out set of satellite images from regions with overlapping sound sources, such as dense urban zones, comparing the framework's selected audio against independent field recordings for human-rated realism and alignment.
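If such a test were run, per-clip realism ratings for the framework's selected audio and for the comparison recordings could be analyzed with a paired nonparametric test. The sketch below assumes a simple Wilcoxon signed-rank protocol on listener-averaged ratings; this protocol is a hypothetical reading aid, not something the paper specifies.

```python
import numpy as np
from scipy.stats import wilcoxon


def paired_listening_test(ratings_ours: np.ndarray, ratings_other: np.ndarray) -> dict:
    """Both arrays: (n_clips,) realism ratings averaged over listeners, paired by clip."""
    stat, p_value = wilcoxon(ratings_ours, ratings_other)
    return {
        "mean_ours": float(ratings_ours.mean()),
        "mean_other": float(ratings_other.mean()),
        "wilcoxon_p": float(p_value),
    }
```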

Figures

Figures reproduced from arXiv: 2604.14707 by Boyi Chen, Haofeng Tan, Kunlin Wu, Teng Fei, Xianping Ma, Xiaofeng Liu, Yang Yue, Yanning Wang, Zan Zhou.

Figure 1. Overview of Geo2Sound. Given satellite imagery, Geo2Sound generates geographically plausible soundscapes through …
Figure 2. Overview of Geo2Sound. The framework combines structural geospatial attributes, semantic hypothesis expansion, …
Figure 3. Overview of SatSound-Bench. (a) Global spatial distribution of audio recordings from both field-collected data and …
Figure 4. (a) Qualitative examples of Geo2Sound, showing paired satellite imagery, generated textual descriptions, and mel …
Original abstract

Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attributes modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes overhead scenes into compact geographic attributes, multiple sound-oriented semantic hypotheses are used to generate diverse acoustically plausible candidates, and a geo-acoustic alignment module projects geographic attributes into the acoustic embedding space and identifies the candidate most consistent with the candidate sets. Moreover, we establish SatSound-Bench, the first benchmark comprising over 20k high-quality paired satellite images, text descriptions, and real-world audio recordings, collected from the field across more than 10 countries and complemented by three public datasets. Experiments show that Geo2Sound achieves a SOTA FAD of 1.765, outperforming the strongest baseline by 50.0%. Human evaluations further confirm substantial gains in both realism (26.5%) and semantic alignment, validating our high-fidelity synthesis on scale. Project page and source code: https://github.com/Blanketzzz/Geo2Sound

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Geo2Sound, a framework for generating geographically realistic soundscapes from satellite imagery. It combines a lightweight classifier for geospatial attributes, semantic hypothesis expansion to produce diverse audio candidates, and a geo-acoustic alignment module that projects attributes into acoustic embedding space to select the most consistent candidate. The work also presents SatSound-Bench, a new dataset of over 20k paired satellite images, text descriptions, and real-world audio recordings collected across more than 10 countries (plus three public datasets). Experiments report a state-of-the-art Fréchet Audio Distance (FAD) of 1.765, a 50% improvement over the strongest baseline, plus human evaluation gains of 26.5% in realism and improvements in semantic alignment.

Significance. If the performance claims hold after proper validation, the framework could enable scalable soundscape synthesis from ubiquitous satellite imagery, with potential uses in environmental monitoring, urban acoustics, and immersive media. The introduction of SatSound-Bench as the first large-scale geo-aligned benchmark is a clear positive contribution that could support future research in this area.

major comments (2)
  1. [Abstract] The central SOTA claim (FAD = 1.765, 50% better than the strongest baseline) and human preference gains are presented without any description of baseline implementations, data splits, statistical tests, or controls for selection effects in the 20k-pair collection. These omissions make the quantitative results impossible to verify and are load-bearing for the performance narrative.
  2. [Method] Geo-acoustic alignment module: The module is described as projecting geographic attributes into acoustic embedding space to select the most consistent candidate from semantic hypotheses. However, no ablation study, selection-precision metric, or error analysis on ambiguous wide-area scenes is reported, leaving open the possibility that the reported gains arise from hypothesis generation or benchmark curation rather than reliable alignment.
minor comments (1)
  1. [Dataset] The abstract states that SatSound-Bench is 'complemented by three public datasets' but does not name them or describe integration details; this should be clarified in the dataset section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to enhance the verifiability and completeness of our work.

Point-by-point responses
  1. Referee: [Abstract] The central SOTA claim (FAD = 1.765, 50% better than the strongest baseline) and human preference gains are presented without any description of baseline implementations, data splits, statistical tests, or controls for selection effects in the 20k-pair collection. These omissions make the quantitative results impossible to verify and are load-bearing for the performance narrative.

    Authors: We agree that the abstract, due to its brevity, does not include these details. However, the full paper provides extensive information: baseline implementations are detailed in Sections 4.1 and 4.2, data collection and splits in Section 3, and human evaluation protocols in Section 4.3. Statistical significance is reported via standard deviations and p-values in the results tables. For the 20k-pair benchmark, we followed standard practices for geo-aligned data collection across diverse countries to mitigate selection bias, with details in the dataset section. To make the abstract more self-contained, we will revise it to briefly reference the evaluation setup and direct readers to the relevant sections for full verification. This addresses the verifiability concern without altering the core claims. revision: yes

  2. Referee: [Method] Geo-acoustic alignment module: The module is described as projecting geographic attributes into acoustic embedding space to select the most consistent candidate from semantic hypotheses. However, no ablation study, selection-precision metric, or error analysis on ambiguous wide-area scenes is reported, leaving open the possibility that the reported gains arise from hypothesis generation or benchmark curation rather than reliable alignment.

    Authors: The geo-acoustic alignment is a key component, and while we compared the full model against ablated versions without it in our experiments (showing consistent improvements in Table 3), we did not include a dedicated ablation study or precision metrics in the main text due to space limitations. We will add an ablation study focusing on the alignment module, including quantitative selection precision (e.g., accuracy of chosen hypothesis matching ground-truth audio) and qualitative error analysis on challenging wide-area scenes with high semantic ambiguity. This will demonstrate that the gains are attributable to the alignment rather than other factors. We believe this will strengthen the methodological contribution. revision: yes
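A minimal sketch of the selection-precision metric the rebuttal commits to, under the assumption that "correct" means the chosen candidate is the one closest to the ground-truth field recording in the same acoustic embedding space. Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np


def selection_precision(chosen_idx, candidate_embs, gt_embs) -> float:
    """chosen_idx: (N,) selected candidate per scene, candidate_embs: (N, K, D), gt_embs: (N, D)."""
    # Candidate nearest to the ground-truth recording, per scene, under cosine similarity.
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=-1, keepdims=True)
    gts = gt_embs / np.linalg.norm(gt_embs, axis=-1, keepdims=True)
    sims = np.einsum("nkd,nd->nk", cands, gts)
    best_idx = sims.argmax(axis=1)
    # Fraction of scenes where the alignment module picked that nearest candidate.
    return float((np.asarray(chosen_idx) == best_idx).mean())
```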

Circularity Check

0 steps flagged

No significant circularity; framework and metrics are independently evaluated

Full rationale

The paper presents a modular framework (geospatial attribute modeling via lightweight classifier, semantic hypothesis expansion, and geo-acoustic alignment via embedding projection) trained and evaluated on a newly collected SatSound-Bench dataset of 20k+ paired satellite-image/audio samples from multiple countries. No equations, derivations, or self-citations are shown that reduce the reported FAD score, realism gains, or alignment performance to a fitted parameter or prior result by construction. Standard metrics (FAD, human ratings) are applied externally to generated vs. real audio, and the benchmark curation is described as independent field collection rather than a tautological reuse of training inputs. The central claims rest on empirical comparison to baselines rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard assumptions of trained neural networks and embedding spaces.

pith-pipeline@v0.9.0 · 5594 in / 1086 out tokens · 18023 ms · 2026-05-10T09:49:27.474358+00:00 · methodology

discussion (0)

