Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems
Pith reviewed 2026-05-16 18:25 UTC · model grok-4.3
The pith
Agentic AI systems bring sequential planning and active tool orchestration to remote sensing, going beyond what current vision foundation models offer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents the first comprehensive review of agentic AI in remote sensing. It introduces a unified taxonomy distinguishing single-agent copilots from multi-agent systems, analyzes core architectural elements (planning mechanisms, retrieval-augmented generation, and memory structures), and surveys emerging benchmarks that evaluate trajectory-aware reasoning correctness rather than pixel-level accuracy.
What carries the argument
Unified taxonomy separating single-agent copilots from multi-agent systems, supported by planning mechanisms, retrieval-augmented generation, and memory structures.
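The single-agent copilot pattern the taxonomy describes can be sketched as a loop over a planner, tools, and memory. Everything below is an illustrative stub with hypothetical names (`plan`, `TOOLS`, `Memory`), not any system from the survey; a real copilot would back the planner with an LLM and ground each step with retrieval.

```python
# Minimal sketch of a single-agent geospatial copilot: a planner decomposes
# the query, tools execute each step, and memory carries the trajectory.
# All components are illustrative stubs.
from dataclasses import dataclass, field

@dataclass
class Memory:
    episodes: list = field(default_factory=list)  # trajectory of (step, result)

    def recall(self):
        return self.episodes

def plan(query: str) -> list[str]:
    # A real planner would be an LLM; here we hard-code a two-step decomposition.
    return ["retrieve_tiles", "classify_landcover"]

TOOLS = {
    "retrieve_tiles": lambda ctx: {"tiles": 4},                    # stub data access
    "classify_landcover": lambda ctx: {"classes": ["urban", "water"]},
}

def run_copilot(query: str, memory: Memory) -> Memory:
    for step in plan(query):
        result = TOOLS[step](memory.recall())  # tool call conditioned on memory
        memory.episodes.append((step, result))
    return memory

mem = run_copilot("map land cover around the flooded district", Memory())
print([step for step, _ in mem.episodes])  # the executed trajectory
```

A multi-agent system would replace the single loop with several such agents coordinated by an orchestrator; the memory and tool interfaces stay structurally similar.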
If this is right
- Remote sensing evaluation must shift from pixel-level accuracy to trajectory-aware reasoning correctness.
- Architectures should incorporate planning mechanisms, retrieval-augmented generation, and memory structures.
- Limitations in grounding, safety, and orchestration require targeted solutions before deployment.
- A strategic roadmap can guide development of robust autonomous geospatial intelligence systems.
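The evaluation shift in the first bullet can be made concrete with a toy metric: instead of scoring only the final mask (pixel accuracy), a trajectory-aware score also checks that the agent's sequence of tool calls follows a reference plan. Both functions are illustrative sketches; the benchmarks the survey reviews define their own variants.

```python
# Contrast between output-only and trajectory-aware evaluation (toy versions).

def pixel_accuracy(pred: list[int], gold: list[int]) -> float:
    # Output-only: fraction of pixels whose predicted label matches the gold mask.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def trajectory_score(pred_steps: list[str], gold_steps: list[str]) -> float:
    # Trajectory-aware: fraction of reference steps the agent executed in order.
    matched, i = 0, 0
    for step in pred_steps:
        while i < len(gold_steps) and gold_steps[i] != step:
            i += 1
        if i < len(gold_steps):
            matched += 1
            i += 1
    return matched / len(gold_steps)

gold_plan = ["load_scene", "detect_change", "summarize"]
agent_plan = ["load_scene", "summarize"]       # skipped the reasoning step
print(round(trajectory_score(agent_plan, gold_plan), 2))  # → 0.67
```

An agent can score well on the final pixels while still skipping required reasoning steps; the trajectory score penalizes exactly that gap.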
Where Pith is reading between the lines
- The taxonomy could be tested for transfer to sequential decision domains such as autonomous navigation or medical image analysis.
- Integration with live satellite streams might enable real-time adaptive response in disaster monitoring.
- Safety constraints could lead to new verification methods for agent trajectories in Earth observation.
Load-bearing premise
Current vision foundation models and multimodal large language models inherently lack the sequential planning and active tool orchestration needed for complex geospatial workflows.
What would settle it
A demonstration that an unmodified multimodal large language model achieves comparable trajectory-aware reasoning scores on remote-sensing benchmarks without added planning or tool orchestration layers.
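The settling experiment amounts to an ablation: run the same tasks through a bare MLLM and through the same model wrapped in a planning/tool layer, and compare trajectory-aware scores. The sketch below uses stubbed model calls and a hypothetical `score` function; no real benchmark or API is invoked.

```python
# Hypothetical ablation harness for the settling experiment (all stubs).

def bare_mllm(task: str) -> list[str]:
    # A bare model answers in one shot: a length-1 "trajectory".
    return ["answer"]

def agentic_mllm(task: str) -> list[str]:
    # The wrapped model interleaves planning and tool calls.
    return ["plan", "call_tool", "answer"]

def score(trajectory: list[str], reference: list[str]) -> float:
    # Position-wise agreement with the reference trajectory (toy metric).
    return sum(s == r for s, r in zip(trajectory, reference)) / len(reference)

reference = ["plan", "call_tool", "answer"]
tasks = ["flood extent", "crop change"]
bare = sum(score(bare_mllm(t), reference) for t in tasks) / len(tasks)
agentic = sum(score(agentic_mllm(t), reference) for t in tasks) / len(tasks)
print(bare, agentic)  # comparable scores would undercut the survey's premise
```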
Original abstract
The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on agentic AI in remote sensing. It argues that vision foundation models and multimodal LLMs advance representation learning but lack sequential planning and tool orchestration for complex geospatial workflows. The central contribution is a unified taxonomy distinguishing single-agent copilots from multi-agent systems, together with analysis of architectural foundations (planning mechanisms, retrieval-augmented generation, memory structures), a review of emerging benchmarks that shift evaluation from pixel-level accuracy to trajectory-aware reasoning correctness, and a critical examination of limitations in grounding, safety, and orchestration that culminates in a strategic roadmap.
Significance. If the taxonomy is coherently supported by the reviewed literature and the analysis of evaluation shifts and limitations is balanced, the work would be significant as the first organizational framework for an emerging intersection of agentic systems and Earth observation. It could help researchers navigate distinctions between system types and redirect attention toward higher-level reasoning metrics rather than isolated accuracy scores.
Major comments (2)
- [Abstract] Abstract: the motivating claim that current vision foundation models and MLLMs 'often lack the sequential planning and active tool orchestration required for complex geospatial workflows' is presented without concrete citations or failure-mode examples; because this premise justifies the entire survey, the introduction or taxonomy section must supply a short, referenced enumeration of documented shortcomings in existing models on representative remote-sensing tasks.
- [Taxonomy section] Taxonomy and architectural foundations: the distinction between single-agent copilots and multi-agent systems is introduced at a high level, but the manuscript must explicitly map at least a representative sample of the cited systems onto the taxonomy categories (with a summary table) so that the taxonomy functions as an analytical lens rather than a purely descriptive partition.
Minor comments (2)
- All acronyms (RAG, MLLM, etc.) should be defined at first use and a glossary or footnote list added for readers outside the immediate subfield.
- [Benchmarks section] The discussion of benchmark shifts would be strengthened by a comparative table listing existing benchmarks, their primary metrics, and the new trajectory-aware criteria proposed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We agree that strengthening the motivation with concrete examples and providing an explicit mapping table will improve the clarity and analytical value of the taxonomy. Both changes will be incorporated in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract: the motivating claim that current vision foundation models and MLLMs 'often lack the sequential planning and active tool orchestration required for complex geospatial workflows' is presented without concrete citations or failure-mode examples; because this premise justifies the entire survey, the introduction or taxonomy section must supply a short, referenced enumeration of documented shortcomings in existing models on representative remote-sensing tasks.
Authors: We agree that the motivating premise benefits from concrete support. In the revised manuscript we will add a concise, referenced enumeration of documented shortcomings (e.g., failures in multi-step change detection, trajectory planning for disaster response, and tool-use errors on satellite imagery benchmarks) to the Introduction section, citing representative studies that illustrate these limitations. revision: yes
-
Referee: [Taxonomy section] Taxonomy and architectural foundations: the distinction between single-agent copilots and multi-agent systems is introduced at a high level, but the manuscript must explicitly map at least a representative sample of the cited systems onto the taxonomy categories (with a summary table) so that the taxonomy functions as an analytical lens rather than a purely descriptive partition.
Authors: We accept this recommendation. The revised version will include a summary table in the Taxonomy section that explicitly maps a representative sample of the cited systems (e.g., single-agent copilots such as GeoChat and multi-agent frameworks such as those using hierarchical planning) onto the taxonomy categories, thereby making the distinctions operational and analytically useful. revision: yes
Circularity Check
No significant circularity: survey synthesizes external literature without internal derivations or self-referential predictions
Full rationale
This is a survey paper whose core contribution is organizational: a taxonomy of single-agent vs. multi-agent systems, review of planning/RAG/memory components, and shift in evaluation benchmarks. No equations, fitted parameters, predictions, or derivations appear in the provided abstract or description. All claims rest on synthesis of prior external work rather than reduction to the paper's own inputs. Self-citations, if present, are not load-bearing for any technical result because no new technical result is derived. The motivational statement about limitations of current vision models is presented as context, not a falsifiable claim proven inside the paper. This matches the default expectation for non-circular survey work.
Forward citations
Cited by 2 Pith papers
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.