Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response
Pith reviewed 2026-05-21 09:21 UTC · model grok-4.3
The pith
Optimizing the description-generation prompt on a 100k-tile proxy subset aligns language queries with frozen visual embeddings, so that text queries can retrieve relevant satellite images for disasters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoQuery achieves zero-shot retrieval by optimizing a description-generation prompt on a proxy subset so that text embeddings correlate with visual embeddings from CLAY, enabling two-stage search that identifies relevant satellite images for disaster locations worldwide.
What carries the argument
Prompt optimization on language descriptions of a 100k proxy subset to align text-embedding distances with those in the frozen CLAY visual-embedding space for two-stage text-then-visual retrieval.
Load-bearing premise
That distances in the text-embedding space after prompt optimization on the 100k proxy subset will reliably correspond to distances in the frozen CLAY visual-embedding space for unseen global queries and disaster types.
What would settle it
A new test set of disaster-location queries from regions or disaster types outside the original UK floods, US wildfires, and US droughts evaluation showing retrieval accuracy well below 31.6 percent within 50 km.
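The "within 50 km" criterion can be made concrete with a short sketch. The paper does not state which distance formula it uses, so the haversine great-circle distance below is an assumption, and the function names and coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(predictions, truths, radius_km=50.0):
    """Fraction of queries whose retrieved location lies within radius_km of the truth."""
    hits = sum(haversine_km(*p, *t) <= radius_km for p, t in zip(predictions, truths))
    return hits / len(truths)


# Illustrative (made-up) retrieved and ground-truth coordinates.
predicted = [(52.19, -1.71), (10.0, 10.0)]
actual = [(52.19, -1.70), (0.0, 0.0)]
score = accuracy_within(predicted, actual)
```

A new test set would be scored the same way and compared against the reported 31.6 percent.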
Original abstract
Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called ECHO, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
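The two-stage resolution described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity, a single best proxy tile from stage one, and that the proxy tile's visual embedding serves as the stage-two query. All names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def two_stage_search(query_text_vec, proxy_text_vecs, proxy_visual_vecs,
                     global_visual_vecs, k=3):
    # Stage 1: find the proxy tile whose description embedding best matches the query.
    best_proxy = max(range(len(proxy_text_vecs)),
                     key=lambda i: cosine(query_text_vec, proxy_text_vecs[i]))
    # Stage 2 (assumed): use that tile's visual embedding as the query
    # for a nearest-neighbour search over the global visual embeddings.
    anchor = proxy_visual_vecs[best_proxy]
    ranked = sorted(range(len(global_visual_vecs)),
                    key=lambda j: cosine(anchor, global_visual_vecs[j]),
                    reverse=True)
    return best_proxy, ranked[:k]


# Toy data: 2-d text embeddings, 3-d visual embeddings.
proxy_text = [[1.0, 0.0], [0.0, 1.0]]
proxy_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
global_visual = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0]]
proxy_idx, neighbours = two_stage_search([0.1, 0.9], proxy_text, proxy_visual,
                                         global_visual, k=2)
```

At the scale of a worldwide archive, the brute-force sort would normally be replaced by an approximate nearest-neighbour index; the paper excerpt here does not say which one is used.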
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoQuery, a zero-shot retrieval system for satellite imagery that generates language descriptions for a 100k proxy subset of global Sentinel-2 tiles, optimizes the description-generation prompt to align text-embedding distances with frozen CLAY visual embeddings, and resolves queries via text-similarity search over the proxy followed by visual nearest-neighbor search over worldwide CLAY embeddings. It reports 31.6% accuracy within 50 km on 76 disaster-location queries covering UK floods, US wildfires, and US droughts (with 50% on floods), and demonstrates deployment in the ECHO crisis response system for Brisbane's 2025 Cyclone Alfred.
Significance. If the prompt-optimized alignment generalizes reliably to unseen global locations and disaster types, the approach offers a practical, low-resource bridge between visual foundation models and natural-language querying of EO archives without full contrastive training or global paired data. The two-stage proxy-plus-visual design and the reported crisis-response application are potentially useful, though the strength of the contribution depends on demonstrating robust transfer beyond the optimization set.
major comments (2)
- [Abstract] Abstract: The headline result of 31.6% accuracy within 50 km (50% on floods) on 76 disaster queries provides no information on query selection criteria, definition of a positive match, error bars, statistical significance, or ablation of the prompt-optimization step. These omissions make the central performance claim difficult to evaluate.
- [Method] Method section (prompt optimization and two-stage retrieval): The prompt is optimized on the 100k proxy subset so that text-embedding distances correlate with frozen CLAY visual distances, yet no quantitative check is reported for correlation strength, retrieval quality, or generalization on a held-out portion of the proxy or on queries involving unseen locations and disaster types. This directly affects the validity of the zero-shot transfer assumption.
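The quantitative check requested in the second major comment could be sketched as below: compute pairwise distances in both embedding spaces and report their Pearson and Spearman correlation, ideally on a held-out portion of the proxy. Euclidean distance and all function names are assumptions for illustration; the paper's actual distance metric and objective are not given in this excerpt.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def alignment_scores(text_vecs, visual_vecs):
    """Correlation between text-space and visual-space pairwise distances."""
    dt, dv = pairwise_distances(text_vecs), pairwise_distances(visual_vecs)
    return {"pearson": pearson(dt, dv), "spearman": spearman(dt, dv)}
```

Because the number of pairs grows quadratically, a check over the full 100k proxy would in practice be run on a random sample of pairs.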
minor comments (2)
- [Abstract] Abstract: Clarify whether the final visual nearest-neighbor search is performed over the complete worldwide CLAY embedding collection or a filtered subset.
- [Abstract] Ensure the first use of the acronym ECHO is accompanied by its full expansion.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional clarity and analysis will strengthen the manuscript. We address each major comment point by point below and indicate the revisions we will make.
-
Referee: [Abstract] Abstract: The headline result of 31.6% accuracy within 50 km (50% on floods) on 76 disaster queries provides no information on query selection criteria, definition of a positive match, error bars, statistical significance, or ablation of the prompt-optimization step. These omissions make the central performance claim difficult to evaluate.
Authors: We agree that the abstract lacks sufficient supporting details for rigorous evaluation of the headline result. In the revised manuscript we will expand the abstract to briefly describe the query selection criteria (publicly reported disaster events for UK floods, US wildfires and US droughts), define a positive match as retrieval within 50 km of the documented ground-truth location, and reference the addition of error bars (via bootstrap resampling of the 76 queries), statistical significance testing, and an ablation of the prompt-optimization step. These elements will also be elaborated in the main text.
revision: yes
-
Referee: [Method] Method section (prompt optimization and two-stage retrieval): The prompt is optimized on the 100k proxy subset so that text-embedding distances correlate with frozen CLAY visual distances, yet no quantitative check is reported for correlation strength, retrieval quality, or generalization on a held-out portion of the proxy or on queries involving unseen locations and disaster types. This directly affects the validity of the zero-shot transfer assumption.
Authors: The referee correctly notes the absence of direct quantitative diagnostics for the prompt-optimization procedure. While the reported end-to-end accuracy on the 76 disaster queries (which involve locations and event types outside the proxy) already provides indirect evidence of transfer, we acknowledge that explicit metrics are needed. In the revision we will add (i) correlation coefficients (Pearson and Spearman) between text-embedding and CLAY visual distances on the proxy set, (ii) retrieval-quality metrics on a held-out portion of the proxy, and (iii) explicit discussion of generalization to the unseen disaster queries. These additions will be placed in the Method section with supporting figures or tables.
revision: yes
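The bootstrap error bars promised in the first response can be illustrated with a percentile bootstrap over per-query outcomes. The count of 24 hits is an inference, not a figure from the paper: 24 of 76 is the only whole number of hits that rounds to the reported 31.6%. The function name, resample count, and seed are arbitrary choices for this sketch.

```python
import random


def bootstrap_ci(outcomes, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the queries with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Assumed per-query outcomes: 24 hits and 52 misses (inferred from 31.6% of 76).
outcomes = [1] * 24 + [0] * 52
point_estimate = sum(outcomes) / len(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)
```

With only 76 queries the interval is wide, which is why the referee's request for error bars bears directly on how the headline number should be read.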
Circularity Check
No significant circularity: empirical prompt optimization validated on held-out queries
The paper presents an empirical two-stage retrieval method: language descriptions are generated for a 100k proxy subset of Sentinel-2 tiles, a prompt is optimized so that text-embedding distances correlate with frozen CLAY visual distances, and queries are handled via text search on the proxy followed by visual nearest-neighbor search globally. Performance is reported on 76 separate disaster-location queries (UK floods, US wildfires, US droughts) that are distinct from the proxy optimization set. No derivation, prediction, or result reduces to its inputs by construction, no self-citations or uniqueness theorems are invoked as load-bearing, and no ansatz or renaming is smuggled in. The central claim rests on measured accuracy rather than tautological equivalence, making the approach self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- prompt template for description generation
axioms (1)
- domain assumption: CLAY visual embeddings capture terrain and land-cover features relevant to flood, fire, and drought location queries
A GeoQuery Ablation Study
A.1 Experimental Setup
We evaluated GeoQuery’s disaster location identification capability using 76 queries across three categories: 40 UK flood queries (testing 10 major 2024 flooding locations including Stratford-upon-Avon, Birmingham, and Portsmouth), 20 US wild...
- Risk identification via external monitoring (e.g. meteorological alerts for severe rainfall).
- The risk is developed into a “project” defined spatially and temporally. These extents define the bounds for digital twinning of infrastructure and topography, a core foundation for downstream simulation and scenario building. For example, a national meteorological agency might flag a possible flood event triggered by 48 hours of intense rainfall in Australia.
- Once enough information is collected on a given project, experts may begin to define the nature of the inquiry. ECHO supports requests to specify which real-time data streams must be monitored first, simulate crisis events, and finally define alerting procedures as information is ingested. For example, five-metre digital elevation maps are downloaded alon...
- These highly granular assets are then accessible to an expert to rapidly define the line of geospatial inquiry and identify risks unknown to the automated system. For example, an expert might request a flood model and an evaluation of which buildings may be suitable for sheltering at-risk individuals in place.
- A crisis responder or member of the public may then request hyper-localised information from the contextually aware agent. For example, they might ask which roads are likely to be inaccessible to a particular vehicle, such as an ambulance or a family car, when planning a safe route.
For any of the steps above to be possible, we require a means to construc...
Disaster Risk Analysis
For requests about assessing disaster risks (fire, floods, earthquakes, etc.), ensure the query includes:
- Location of interest
- Time horizon
- Type of disaster
Example 1:
Previous context: Take me to valencia
Current state variables available: {"data": bbox"}
User Input: Can you determine if this area is flood prone over the next...
Satellite Image Search
For general satellite image queries that don’t involve disaster risk (e.g., "Show me images of oceans near deserts"). These queries do not require a time horizon, nor a specific location. Feel confident to pass on such queries to the planner as long as no disasters are mentioned.
Example 1:
User input: show me forests
Output: {’stat...
[46]
Start with OSM_Geocode for location queries
-
[47]
Use ’after’ for dependencies
-
[48]
Empty ’after’ means step can start immediately
-
[49]
Input/output must match tool definitions exactly
-
[50]
Use only listed tools
-
[51]
OSM Points of Interest should only be used when looking for specific physical infrastructure tags **{examples}** Return only valid JSON matching this format using listed tools. B.5 Planner User Prompt Create a logical tool sequence plan for: ‘‘‘{query}‘‘‘ Here are all previous messages between the user and the planner: **{conversation_history}** Here are ...
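The planner rules above (steps linked through 'after', with an empty 'after' meaning a step can start immediately) imply a dependency-ordered schedule. The sketch below shows one way such a plan could be grouped into batches that can run in parallel. The plan schema and every tool name other than OSM_Geocode are hypothetical, since the actual JSON format is truncated in the source.

```python
from graphlib import TopologicalSorter


def execution_batches(plan):
    """Group step ids into batches; each batch only depends on earlier batches."""
    sorter = TopologicalSorter({step["id"]: set(step["after"]) for step in plan})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the order stable
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: field names and all tools except OSM_Geocode are assumptions.
plan = [
    {"id": "geocode", "tool": "OSM_Geocode", "after": []},
    {"id": "image_search", "tool": "Image_Search", "after": ["geocode"]},
    {"id": "poi", "tool": "OSM_POI", "after": ["geocode"]},
    {"id": "report", "tool": "Summarise", "after": ["image_search", "poi"]},
]
batches = execution_batches(plan)
```

Here the geocoding step runs first, the two steps that depend only on it can run together, and the final step waits for both.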