Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Pith reviewed 2026-05-10 09:49 UTC · model grok-4.3
The pith
Geo2Sound generates realistic soundscapes from satellite imagery by aligning geographic attributes with acoustic embeddings
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geo2Sound establishes a unified framework for generating geographically realistic soundscapes from satellite imagery by combining structural geospatial attribute modeling, semantic hypothesis expansion that produces diverse audio candidates, and a geo-acoustic alignment module that projects the attributes into acoustic embedding space to pick the most consistent candidate. The work also releases SatSound-Bench, a dataset of over 20,000 paired satellite images, text descriptions, and field recordings collected across more than 10 countries and complemented by three public datasets. This yields a state-of-the-art Fréchet Audio Distance of 1.765, a 50% improvement over the strongest baseline, plus a 26.5% uplift in human-rated realism and gains in semantic alignment.
What carries the argument
The geo-acoustic alignment module, which summarizes satellite scenes into compact geographic attributes, projects them into acoustic embedding space, and selects the most consistent sound candidate from multiple semantic hypotheses.
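To make the selection step concrete, here is a minimal sketch (not the authors' code) under stated assumptions: the geographic attributes arrive as a fixed-length vector, the projection into acoustic embedding space is a learned linear map, and consistency is scored by cosine similarity against the candidates' acoustic embeddings. All names are hypothetical.

```python
# Minimal sketch of the alignment-based selection; assumptions, not the paper's code:
# `attr_vec`       - geographic attributes predicted from the satellite image
# `projection`     - learned linear map from attribute space into acoustic space
# `candidate_embs` - one acoustic embedding per generated sound candidate
import numpy as np

def select_candidate(attr_vec: np.ndarray,
                     projection: np.ndarray,
                     candidate_embs: np.ndarray) -> int:
    """Return the index of the candidate most consistent with the attributes."""
    query = projection @ attr_vec                   # attributes -> acoustic space
    query /= np.linalg.norm(query)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(cands @ query))            # cosine similarity, best match

# Toy usage: 5 candidates with 128-dim acoustic embeddings, 16 geographic attributes.
rng = np.random.default_rng(0)
print(select_candidate(rng.normal(size=16),
                       rng.normal(size=(128, 16)),
                       rng.normal(size=(5, 128))))
```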
If this is right
- Soundscapes can be generated at global scale from any satellite-covered location without requiring local audio capture.
- Resolution of visual-to-audio ambiguity in top-down views improves both objective fidelity and human-perceived realism and semantic fit.
- A large-scale paired benchmark dataset becomes available to standardize evaluation for future geo-aligned audio generation methods.
- The attribute-based selection process offers a general mechanism for handling multiple plausible interpretations in cross-modal generation tasks.
Where Pith is reading between the lines
- The framework could support real-time audio overlays for satellite-based mapping tools or digital environmental twins.
- Integration with ongoing satellite monitoring might allow sound generation to track ecological or urban changes visible from space.
- Alignment techniques developed here could extend to other cross-modal tasks, such as location-specific scent or texture synthesis from imagery.
Load-bearing premise
That geographic attributes extracted from wide-area satellite views can be mapped reliably to a unique acoustic signature despite inherent visual ambiguity in complex scenes.
What would settle it
Blind listening tests on a held-out set of satellite images from regions with overlapping sound sources, such as dense urban zones, comparing the framework's selected audio against independent field recordings for human-rated realism and alignment.
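A minimal sketch of how such a blind test could be scored, assuming a forced-choice protocol in which each rater hears the framework's selected audio and the matched field recording in random order and picks the more realistic one; the protocol, counts, and function name are assumptions, not details from the paper.

```python
# Hedged sketch: win rate of generated audio in a two-alternative forced-choice test,
# with a two-sided binomial test against chance (0.5). The numbers below are made up.
from scipy.stats import binomtest

def listening_test_summary(wins_for_generated: int, n_trials: int):
    """Return the generated-audio win rate and a p-value against the 50% chance level."""
    result = binomtest(wins_for_generated, n_trials, p=0.5, alternative="two-sided")
    return wins_for_generated / n_trials, result.pvalue

rate, p = listening_test_summary(wins_for_generated=132, n_trials=300)
print(f"win rate {rate:.2f}, p = {p:.3g}")
```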
Original abstract
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attributes modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes overhead scenes into compact geographic attributes, multiple sound-oriented semantic hypotheses are used to generate diverse acoustically plausible candidates, and a geo-acoustic alignment module projects geographic attributes into the acoustic embedding space and identifies the candidate most consistent with the candidate sets. Moreover, we establish SatSound-Bench, the first benchmark comprising over 20k high-quality paired satellite images, text descriptions, and real-world audio recordings, collected from the field across more than 10 countries and complemented by three public datasets. Experiments show that Geo2Sound achieves a SOTA FAD of 1.765, outperforming the strongest baseline by 50.0%. Human evaluations further confirm substantial gains in both realism (26.5%) and semantic alignment, validating our high-fidelity synthesis on scale. Project page and source code: https://github.com/Blanketzzz/Geo2Sound
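For reference, the headline metric can be read against the standard Fréchet Audio Distance definition: the distance between Gaussian fits of embedding statistics for real and generated audio. The sketch below gives only this generic formula (embeddings would typically come from a pretrained audio classifier); the paper's exact embedding model and evaluation protocol are not stated here.

```python
# Generic Fréchet Audio Distance over precomputed audio embeddings (not the paper's
# exact evaluation code): FAD = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_embs: np.ndarray, gen_embs: np.ndarray) -> float:
    mu_r, mu_g = real_embs.mean(axis=0), gen_embs.mean(axis=0)
    c_r = np.cov(real_embs, rowvar=False)
    c_g = np.cov(gen_embs, rowvar=False)
    covmean = sqrtm(c_r @ c_g)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(c_r + c_g - 2.0 * covmean))
```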
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Geo2Sound, a framework for generating geographically realistic soundscapes from satellite imagery. It combines a lightweight classifier for geospatial attributes, semantic hypothesis expansion to produce diverse audio candidates, and a geo-acoustic alignment module that projects attributes into acoustic embedding space to select the most consistent candidate. The work also presents SatSound-Bench, a new dataset of over 20k paired satellite images, text descriptions, and real-world audio recordings collected across more than 10 countries (plus three public datasets). Experiments report a state-of-the-art Fréchet Audio Distance (FAD) of 1.765, a 50% improvement over the strongest baseline, plus human evaluation gains of 26.5% in realism and improvements in semantic alignment.
Significance. If the performance claims hold after proper validation, the framework could enable scalable soundscape synthesis from ubiquitous satellite imagery, with potential uses in environmental monitoring, urban acoustics, and immersive media. The introduction of SatSound-Bench as the first large-scale geo-aligned benchmark is a clear positive contribution that could support future research in this area.
major comments (2)
- [Abstract] The central SOTA claim (FAD = 1.765, 50% better than the strongest baseline) and the human preference gains are presented without any description of baseline implementations, data splits, statistical tests, or controls for selection effects in the 20k-pair collection. These omissions make the quantitative results, which are load-bearing for the performance narrative, impossible to verify.
- [Method] Geo-acoustic alignment module: The module is described as projecting geographic attributes into acoustic embedding space to select the most consistent candidate from the semantic hypotheses. However, no ablation study, selection-precision metric, or error analysis on ambiguous wide-area scenes is reported, leaving open the possibility that the reported gains arise from hypothesis generation or benchmark curation rather than from reliable alignment.
minor comments (1)
- [Dataset] The abstract states that SatSound-Bench is 'complemented by three public datasets' but does not name them or describe integration details; this should be clarified in the dataset section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to enhance the verifiability and completeness of our work.
Point-by-point responses
- Referee: [Abstract] The central SOTA claim (FAD = 1.765, 50% better than the strongest baseline) and the human preference gains are presented without any description of baseline implementations, data splits, statistical tests, or controls for selection effects in the 20k-pair collection. These omissions make the quantitative results, which are load-bearing for the performance narrative, impossible to verify.
Authors: We agree that the abstract, given its brevity, does not include these details. However, the full paper provides extensive information: baseline implementations are detailed in Sections 4.1 and 4.2, data collection and splits in Section 3, and human evaluation protocols in Section 4.3. Statistical significance is reported via standard deviations and p-values in the results tables. For the 20k-pair benchmark, we followed standard practices for geo-aligned data collection across diverse countries to mitigate selection bias, with details in the dataset section. To make the abstract more self-contained, we will revise it to briefly reference the evaluation setup and direct readers to the relevant sections for full verification. This addresses the verifiability concern without altering the core claims. revision: yes
- Referee: [Method] Geo-acoustic alignment module: The module is described as projecting geographic attributes into acoustic embedding space to select the most consistent candidate from the semantic hypotheses. However, no ablation study, selection-precision metric, or error analysis on ambiguous wide-area scenes is reported, leaving open the possibility that the reported gains arise from hypothesis generation or benchmark curation rather than from reliable alignment.
Authors: The geo-acoustic alignment is a key component, and while we compared the full model against ablated versions without it in our experiments (showing consistent improvements in Table 3), we did not include a dedicated ablation study or precision metrics in the main text due to space limitations. We will add an ablation study focusing on the alignment module, including quantitative selection precision (e.g., accuracy of chosen hypothesis matching ground-truth audio) and qualitative error analysis on challenging wide-area scenes with high semantic ambiguity. This will demonstrate that the gains are attributable to the alignment rather than other factors. We believe this will strengthen the methodological contribution. revision: yes
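One way the promised selection-precision metric could look, sketched under the assumption that the "correct" candidate for each image is the one nearest to the ground-truth field recording in acoustic embedding space; this nearest-neighbour criterion and all names are illustrative, not the authors' definition.

```python
# Hedged sketch of selection precision: fraction of test images where the candidate
# chosen by the alignment module is also the candidate closest to the ground truth.
import numpy as np

def selection_precision(selected_idx: np.ndarray,    # (N,) index chosen per image
                        candidate_embs: np.ndarray,  # (N, K, D) candidate embeddings
                        gt_embs: np.ndarray          # (N, D) ground-truth embeddings
                        ) -> float:
    dists = np.linalg.norm(candidate_embs - gt_embs[:, None, :], axis=-1)  # (N, K)
    best = dists.argmin(axis=1)                                            # (N,)
    return float((selected_idx == best).mean())
```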
Circularity Check
No significant circularity; framework and metrics are independently evaluated
Full rationale
The paper presents a modular framework (geospatial attribute modeling via lightweight classifier, semantic hypothesis expansion, and geo-acoustic alignment via embedding projection) trained and evaluated on a newly collected SatSound-Bench dataset of 20k+ paired satellite-image/audio samples from multiple countries. No equations, derivations, or self-citations are shown that reduce the reported FAD score, realism gains, or alignment performance to a fitted parameter or prior result by construction. Standard metrics (FAD, human ratings) are applied externally to generated vs. real audio, and the benchmark curation is described as independent field collection rather than a tautological reuse of training inputs. The central claims rest on empirical comparison to baselines rather than self-referential definitions.