pith. machine review for the scientific record.

arxiv: 2604.07765 · v2 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords Earth Observation · Multi-modal Large Language Models · Reinforcement Learning · Agentic Frameworks · Vague Queries · Intent Recognition · Tool Orchestration

The pith

RemoteAgent uses reinforcement learning on a dataset of vague queries to align an MLLM as a cognitive core that handles image-level and sparse-region Earth Observation tasks itself while routing only dense predictions to external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to connect everyday vague human requests about satellite imagery with the right granularity of analysis, from broad scene understanding to pixel-level mapping. It does so by building VagueEO, a collection of Earth Observation tasks paired with simulated fuzzy natural-language instructions, then applying reinforcement fine-tuning to teach the multimodal model when its own capabilities are sufficient. This matters because current systems either waste resources by always invoking external tools or produce unreliable results on spatial tasks that exceed an MLLM's native output format. A reader would care if the result is an agent that feels natural to non-experts yet remains computationally efficient by avoiding unnecessary tool calls. If the claim holds, EO systems could respond directly to imprecise queries without forcing every request through separate dense-prediction modules.
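
To make the data side concrete, here is a minimal sketch of what one VagueEO-style training pair might look like, assuming the dataset stores a precise task specification alongside its simulated vague query. The schema, field names, and example below are illustrative guesses, not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical schema for one VagueEO-style training pair: a precise EO task
# specification stored alongside the simulated vague query that stands in for
# a real user's request. Field names and the example are illustrative only.
@dataclass
class VagueEOPair:
    vague_query: str     # what a non-expert might actually type
    task_type: str       # e.g. "scene_classification", "referring_segmentation"
    granularity: str     # "image", "sparse_region", or "dense_pixel"
    requires_tool: bool  # True only when the output must be dense

example = VagueEOPair(
    vague_query="Is there much farmland around the river in this shot?",
    task_type="scene_classification",
    granularity="image",
    requires_tool=False,  # an aligned MLLM should answer this one directly
)
```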

Core claim

RemoteAgent is an agentic framework that respects the intrinsic capability boundaries of MLLMs. By constructing VagueEO and using it for reinforcement fine-tuning, the system turns an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks, while orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions.

What carries the argument

VagueEO-guided reinforcement fine-tuning that aligns an MLLM into a cognitive core deciding between internal resolution and Model Context Protocol tool orchestration.
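
A minimal sketch of the decision loop this sentence describes, assuming the aligned model emits a routing tag alongside its answer. The `<route>` tag convention, the tool names, and the `mllm`/`mcp_client` interfaces are all assumptions made for illustration, not the paper's published mechanism.

```python
import re

# Sketch of the internal-vs-tool decision an aligned cognitive core might
# perform. Everything about the tag format and interfaces here is assumed.
DENSE_TOOLS = {"segmentation": "segment_image", "change_detection": "diff_images"}

def answer_query(mllm, mcp_client, image, query: str) -> str:
    response = mllm.generate(image=image, prompt=query)  # assumed interface
    match = re.search(r"<route>(\w+)</route>", response)
    route = match.group(1) if match else "internal"

    if route == "internal":
        # Image-level and sparse region-level tasks: text output (labels,
        # captions, a handful of boxes) is within the MLLM's native format.
        return response

    # Dense predictions exceed the text interface, so delegate via MCP.
    tool = DENSE_TOOLS.get(route, "segment_image")
    result = mcp_client.call_tool(tool, {"image": image, "query": query})
    return mllm.generate(image=image, prompt=f"{query}\nTool output: {result}")
```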

If this is right

  • RemoteAgent demonstrates robust recognition of vague human intents in Earth Observation scenarios.
  • The framework achieves competitive performance on a range of image-, region-, and pixel-level EO tasks.
  • Suitable tasks are processed internally, limiting tool calls to only those requiring dense spatial output.
  • The Model Context Protocol enables selective orchestration of specialized tools without indiscriminate invocation.
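
On the last point: the Model Context Protocol is an open protocol with an official Python SDK, so a dense-prediction tool could plausibly be exposed to the cognitive core as sketched below. The segmentation backend is a placeholder; only the FastMCP scaffolding reflects the actual SDK, and the paper's own tool servers may look different.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("eo-dense-tools")

def run_segmentation_model(image_path: str, query: str) -> str:
    # Placeholder for a real dense-prediction backend (e.g. a referring
    # segmentation model); returns the path of the mask it would write.
    return image_path.rsplit(".", 1)[0] + "_mask.png"

@mcp.tool()
def segment_image(image_path: str, query: str) -> str:
    """Return a pixel-level mask for the region the query describes."""
    return run_segmentation_model(image_path, query)

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an agent can invoke the tool on demand
```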

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement approach for learning self-limits could apply to MLLM agents in medical imaging or autonomous driving where query precision varies.
  • Replacing simulated queries with logs of actual user interactions would test whether the learned boundaries generalize beyond the training distribution.
  • Extending the protocol to more tool types might allow the core model to stay small while covering additional precision levels.
  • The design illustrates a broader pattern: train models first to recognize what they cannot do well before building surrounding agent infrastructure.

Load-bearing premise

The VagueEO dataset of simulated vague queries accurately captures real user intents, and reinforcement learning can reliably teach the MLLM its own capability boundaries without post-hoc adjustments.

What would settle it

Run RemoteAgent on a fresh collection of real human-provided vague EO queries and measure whether its rate of correct internal handling plus overall task accuracy exceeds both a version that always delegates to tools and a version that always answers directly.
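
A sketch of how that deciding experiment could be scored, under assumed interfaces: each policy maps a query to a routing decision plus an answer, and each query carries gold routing and answer labels. Every name here is a placeholder, not an artifact from the paper.

```python
# Sketch of scoring the three policies on one shared set of real vague
# queries. A policy maps a query to ("internal" | "tool", answer); each query
# carries gold_route and gold_answer labels.
def evaluate(policy, queries, is_correct):
    route_hits = answer_hits = 0
    for q in queries:
        decision, answer = policy(q)
        route_hits += decision == q.gold_route
        answer_hits += is_correct(answer, q.gold_answer)
    n = len(queries)
    return route_hits / n, answer_hits / n

def compare(policies, queries, is_correct):
    # The claim survives only if the learned router beats both degenerate
    # policies (always delegate, always answer directly) on task accuracy.
    for name, policy in policies.items():
        routing_acc, task_acc = evaluate(policy, queries, is_correct)
        print(f"{name:>14}: routing={routing_acc:.2%}  task={task_acc:.2%}")
```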

Figures

Figures reproduced from arXiv: 2604.07765 by Bishun Yao, Chaoqian Ouyang, Chuanyi Zhang, Fan Liu, Liang Yao, Min-Ling Zhang, Rui Min, Shengxiang Xu, Shimin Di, Yongjun Li.

Figure 1: (a) The usability gap between vague user intents and …
Figure 2: VagueEO Benchmark Overview. We construct ten diverse Earth Observation tasks that pair vague, human-centric queries with …
Figure 3: Overview of RemoteAgent. During training, the model is aligned via GRPO, guided by a unified multi-task reward that evaluates …
Figure 4: Intent recognition performance across diverse EO tasks.
Figure 5: Qualitative results of RemoteAgent. The agent accurately interprets free-form queries and dynamically routes them to specialized …
Original abstract

Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RemoteAgent, an agentic framework for Multi-modal Large Language Models (MLLMs) in Earth Observation (EO). It introduces the VagueEO dataset, constructed by pairing EO tasks with simulated vague natural-language queries, and applies reinforcement fine-tuning to align an MLLM as a cognitive core. The framework claims to respect intrinsic MLLM capability boundaries by handling image- and sparse region-level tasks internally while delegating only dense predictions to specialized tools via the Model Context Protocol (MCP), achieving robust intent recognition and competitive performance on diverse EO tasks.

Significance. If the claims hold, this approach could meaningfully improve efficiency in EO systems by reducing unnecessary external tool calls and better exploiting native MLLM strengths for multi-granularity analysis. The VagueEO dataset and RL-based boundary alignment represent a concrete contribution to human-centric agentic designs in remote sensing, with potential applicability to other domains requiring selective delegation.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (VagueEO Construction): The dataset is built by 'pairing EO tasks with simulated vague natural-language queries,' yet the manuscript supplies no quantitative validation (e.g., distributional statistics on temporal qualifiers, sensor references, or precision tolerances) showing that these simulations match real expert query distributions. This assumption is load-bearing for the central claim that RL fine-tuning produces reliable self-assessment of capability boundaries; divergence would cause the learned policy to misclassify tasks on authentic inputs.
  2. [§5] §5 (Experiments): The abstract asserts that 'extensive experiments demonstrate robust intent recognition capabilities while delivering highly competitive performance,' but the text provides no specific metrics, baselines, error bars, ablation studies, or details on how intent recognition accuracy versus task performance was quantified across EO scenarios. This prevents independent verification of the robustness and competitiveness claims.
minor comments (2)
  1. [§2] The Model Context Protocol (MCP) is referenced without an initial definition or citation; a short explanatory sentence or pointer to its specification would aid readers unfamiliar with the term.
  2. [Figures] Ensure any workflow diagrams clearly label the decision points where the MLLM elects internal processing versus MCP tool invocation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (VagueEO Construction): The dataset is built by 'pairing EO tasks with simulated vague natural-language queries,' yet the manuscript supplies no quantitative validation (e.g., distributional statistics on temporal qualifiers, sensor references, or precision tolerances) showing that these simulations match real expert query distributions. This assumption is load-bearing for the central claim that RL fine-tuning produces reliable self-assessment of capability boundaries; divergence would cause the learned policy to misclassify tasks on authentic inputs.

    Authors: We agree that explicit validation of the simulated queries is important to support the RL fine-tuning claims. The VagueEO queries were generated by EO domain experts to reflect typical vagueness, but the initial manuscript omitted comparative statistics. In revision we will add to §3 quantitative distributional comparisons (e.g., frequencies of temporal qualifiers, sensor references, and precision terms) against a held-out set of real expert queries, along with any noted limitations. revision: yes

  2. Referee: [§5] §5 (Experiments): The abstract asserts that 'extensive experiments demonstrate robust intent recognition capabilities while delivering highly competitive performance,' but the text provides no specific metrics, baselines, error bars, ablation studies, or details on how intent recognition accuracy versus task performance was quantified across EO scenarios. This prevents independent verification of the robustness and competitiveness claims.

    Authors: We acknowledge the experimental section lacked sufficient detail for independent verification. We will revise §5 to include concrete metrics (intent recognition accuracy and F1, task success rates), explicit baselines, error bars from repeated runs, ablation studies on RL and MCP components, and a clear protocol separating intent recognition evaluation from downstream task performance. revision: yes
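
In the same spirit, here is one way the two promised additions could be operationalized. Everything below is a sketch under assumed inputs, not the authors' code. First, the distributional comparison from response 1, using hypothetical vagueness-marker lists (the categories follow the referee's examples) and a total-variation distance between the simulated and real query sets:

```python
from collections import Counter

# Hypothetical surface markers for the categories named in the referee's
# first comment; the keyword lists are illustrative, not from the paper.
MARKERS = {
    "temporal": ["recent", "latest", "before", "after", "last year"],
    "sensor": ["satellite", "radar", "optical", "sar", "band"],
    "precision": ["roughly", "exact", "outline", "pixel", "approximately"],
}

def marker_distribution(queries):
    counts = Counter({c: 0 for c in MARKERS})
    for q in queries:
        text = q.lower()
        for category, words in MARKERS.items():
            counts[category] += any(w in text for w in words)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in MARKERS}

def total_variation(p, q):
    # 0 means simulated and real queries use vagueness markers identically.
    return 0.5 * sum(abs(p[c] - q[c]) for c in MARKERS)
```

Second, the error bars promised in response 2, as a simple bootstrap confidence interval over per-query intent-recognition outcomes; again purely illustrative of the protocol, not the authors' tooling:

```python
import random

def bootstrap_ci(correct_flags, n_boot=10_000, alpha=0.05):
    # correct_flags: 0/1 per-query intent-recognition outcomes from one run.
    n = len(correct_flags)
    stats = sorted(
        sum(random.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct_flags) / n, (lo, hi)
```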

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper constructs VagueEO by pairing EO tasks with simulated vague queries, then applies standard RL fine-tuning to align an MLLM for deciding internal handling versus tool delegation via Model Context Protocol. No equations, derivations, or self-referential definitions appear in the abstract or described claims that reduce any prediction or result to its inputs by construction. The approach relies on established RL/MLLM methods plus a new dataset without load-bearing self-citations, imported uniqueness theorems, or smuggled ansatzes. Central claims about intent recognition and capability boundaries remain independent of the inputs and are presented as empirically validated rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the effectiveness of RL fine-tuning to learn capability boundaries and the representativeness of the simulated VagueEO dataset for real vague queries.

axioms (2)
  • domain assumption: MLLMs possess learnable intrinsic capability boundaries separating tasks they can resolve internally from those requiring external tools.
    Invoked when stating that RemoteAgent processes suitable tasks internally while delegating dense predictions.
  • domain assumption: the VagueEO dataset of simulated vague queries paired with EO tasks is representative of real human intents.
    Central to the reinforcement fine-tuning step described in the abstract.
invented entities (2)
  • RemoteAgent (no independent evidence)
    purpose: Agentic framework for bridging vague intents and EO tasks
    New proposed system architecture.
  • VagueEO (no independent evidence)
    purpose: Human-centric instruction dataset for RL fine-tuning
    New dataset introduced to train the model.

pith-pipeline@v0.9.0 · 5596 in / 1587 out tokens · 62030 ms · 2026-05-10T17:56:48.830800+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  2. RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

    cs.CV 2026-04 unverdicted novelty 6.0

    RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.

Reference graph

Works this paper leans on

88 extracted references · 35 canonical work pages · cited by 2 Pith papers · 7 internal anchors
