Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
Pith reviewed 2026-05-08 15:42 UTC · model grok-4.3
The pith
A lightweight meta-planner separates planning from execution in Earth observation agents by grounding decisions in images, task semantics, and remote-sensing expertise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Lightweight Multimodal Meta-Planner (LMMP) decouples strategic planning from low-level execution, grounds plans through dual awareness of image features and task semantics, and injects domain logic via a Meta Task Library so that generated plans remain physically feasible. The planner is first initialized by supervised fine-tuning on expert trajectories and then aligned by Direct Preference Optimization on execution outcomes, yielding measurable gains in tool-calling accuracy and mission completion across diverse backbones and unseen Earth-observation tasks.
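The two-stage alignment in this claim (SFT initialization, then DPO on execution outcomes) can be pictured with a toy DPO objective. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation; the function name and the beta value are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Here the 'chosen' plan would be one whose execution succeeded and the
    'rejected' plan one that failed; log-probs are summed token
    log-likelihoods under the planner and a frozen SFT reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the planner prefers the successful plan more
# strongly than the reference model does.
easy = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # planner already prefers chosen
hard = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # planner prefers rejected
assert easy < hard
```

When the policy and reference agree exactly, the margin is zero and the loss reduces to log 2, which is the usual sanity check for this objective.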
What carries the argument
The Meta Task Library, which injects remote-sensing expert knowledge to standardize domain logic and produce physically feasible plans.
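One concrete reading of this premise is a registry of tool-sequence templates guarded by feasibility predicates. The sketch below is entirely hypothetical — the task name, tool names, and rules are invented for illustration and are not the paper's actual library.

```python
# Hypothetical Meta Task Library: each meta task pairs a tool template
# with domain-rule predicates that any candidate plan must satisfy.
META_TASK_LIBRARY = {
    "flood_extent_mapping": {
        "tools": ["load_sar_image", "water_segmentation", "change_detection"],
        "constraints": [
            # SAR is required because optical sensors fail under cloud cover.
            lambda plan: "load_sar_image" in plan,
            # Water must be segmented before change detection can run.
            lambda plan: plan.index("water_segmentation")
                         < plan.index("change_detection"),
        ],
    },
}

def feasible(task_name, plan):
    """Return True if a candidate tool plan satisfies the task's domain rules."""
    entry = META_TASK_LIBRARY[task_name]
    return all(check(plan) for check in entry["constraints"])

assert feasible("flood_extent_mapping",
                ["load_sar_image", "water_segmentation", "change_detection"])
assert not feasible("flood_extent_mapping",
                    ["load_sar_image", "change_detection", "water_segmentation"])
```

Under this reading, "physically feasible" means a plan is rejected at generation time if any predicate fails, rather than being corrected after execution.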
If this is right
- Tool-calling accuracy and overall task success rates rise on EarthBench- and ThinkGeo-derived datasets.
- The same planner module improves performance when attached to multiple different executor backbones.
- Gains persist on Earth-observation missions that were not seen during training.
- The two-stage training pipeline first distills expert plans, then refines them from execution feedback.
Where Pith is reading between the lines
- The clean separation of planning from execution may limit error accumulation across long action sequences in other robotic or autonomous systems.
- Adding real-time sensor feedback loops into the Meta Task Library could further tighten the link between perception and feasible action.
- Deployment on physical platforms such as drones or satellites would expose whether the expert-injected plans remain robust under unmodeled environmental noise.
Load-bearing premise
The Meta Task Library successfully injects remote-sensing expert knowledge to standardize domain logic and guarantee physically feasible plans, and the two-stage training pipeline generalizes beyond the specific datasets used.
What would settle it
A controlled test on a fresh collection of Earth-observation missions in which LMMP produces no measurable increase in tool-calling accuracy or task success compared with an integrated single-model baseline would falsify the central claim.
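The proposed falsification test amounts to a paired comparison of per-mission metrics. A minimal sketch, with invented numbers standing in for real results:

```python
def tool_call_accuracy(predicted, reference):
    """Fraction of plan steps whose predicted tool call matches the reference."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

# Toy per-mission accuracies for LMMP vs. an integrated single-model
# baseline on the same missions (hypothetical values).
lmmp_acc = [0.9, 0.8, 1.0, 0.7]
baseline_acc = [0.8, 0.8, 0.9, 0.6]
mean_gain = sum(a - b for a, b in zip(lmmp_acc, baseline_acc)) / len(lmmp_acc)
# The central claim fails if mean_gain is statistically indistinguishable
# from zero on a fresh collection of Earth-observation missions.
```

The pairing matters: comparing the two systems on the same missions removes per-mission difficulty as a confound.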
Original abstract
Autonomous Earth Observation (EO) agents are transitioning from passive perception to complex, multi-step task execution. However, current architectures that integrate planning and execution within a single model often struggle with combinatorial complexity and reasoning errors in dynamic EO scenarios. To resolve these challenges, we propose the Lightweight Multimodal Meta-Planner (LMMP) framework. LMMP incorporates a dual-awareness mechanism that grounds strategic plans in both multimodal image features and high-level task semantics. Crucially, we introduce a Meta Task Library to inject remote sensing expert knowledge directly into the workflow, which standardizes domain logic and ensures plans are physically feasible. We further implement a two-stage training pipeline, initializing the Meta-Planner via expert-distilled Supervised Fine-Tuning and refining it through Direct Preference Optimization based on execution feedback. Extensive experiments on a dataset derived from EarthBench and ThinkGeo demonstrate that LMMP significantly improves tool-calling accuracy and task success rates. Moreover, the framework exhibits strong "plug-and-play" versatility, consistently enhancing the performance of diverse executor backbones across previously unseen EO missions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Lightweight Multimodal Meta-Planner (LMMP) framework for autonomous Earth Observation agents. It introduces a dual-awareness mechanism to ground strategic plans in multimodal image features and high-level task semantics, a Meta Task Library that directly injects remote sensing expert knowledge to standardize domain logic and guarantee physically feasible plans, and a two-stage training pipeline (expert-distilled supervised fine-tuning followed by execution-feedback direct preference optimization). The central claims are that LMMP yields significant gains in tool-calling accuracy and task success rates on datasets derived from EarthBench and ThinkGeo, while exhibiting plug-and-play versatility that improves diverse executor backbones on previously unseen EO missions.
Significance. If the experimental results and feasibility guarantees are substantiated, the separation of a lightweight meta-planner from execution, combined with explicit domain-knowledge injection and preference-based refinement, could offer a practical route to more robust multimodal planning in dynamic EO settings. The plug-and-play design would be particularly valuable for integrating new backbones without retraining the planner. At present, however, the absence of concrete validation for the Meta Task Library and the experimental protocol limits any assessment of broader impact.
Major comments (3)
- [Meta Task Library description] Framework description (Meta Task Library subsection): The central claim that the Meta Task Library 'standardizes domain logic and ensures plans are physically feasible' is load-bearing for the entire contribution, yet the manuscript supplies no construction details, explicit rule set, constraint checker, or reference to physical models (e.g., orbital mechanics or sensor coverage). Without these elements it is impossible to determine whether feasibility is enforced at generation time or merely approximated post-hoc by the DPO stage.
- [Experiments and evaluation] Experiments and evaluation section: The abstract asserts 'significant improvements' in tool-calling accuracy and task success rates together with generalization to unseen missions, but reports neither baselines, concrete metrics, statistical tests, dataset construction procedure, nor ablation isolating the Meta Task Library's contribution. This evidentiary gap directly undermines verification of the strongest empirical claims.
- [Two-stage training pipeline] Training pipeline description: The two-stage pipeline is presented as enabling generalization beyond the training distribution, yet no cross-validation, out-of-distribution mission splits, or failure-case analysis is described that would distinguish library-driven feasibility from dataset-specific artifacts or backbone improvements.
Minor comments (1)
- [Framework overview] The dual-awareness mechanism would benefit from an explicit diagram or pseudocode showing how multimodal features and task semantics are fused before plan generation.
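In lieu of the requested pseudocode, one plausible fusion scheme (invented here, not taken from the paper) mean-pools image patch features and concatenates the result with the task-semantics embedding before plan generation:

```python
def fuse_dual_awareness(image_feats, task_embedding, w_img=0.5, w_task=0.5):
    """Toy dual-awareness fusion: mean-pool patch features over the image,
    then concatenate with the task embedding so the planner conditions on
    both visual content and task semantics. Weights are illustrative."""
    dim = len(image_feats[0])
    pooled = [sum(patch[i] for patch in image_feats) / len(image_feats)
              for i in range(dim)]
    return [w_img * v for v in pooled] + [w_task * v for v in task_embedding]

patches = [[1.0, 2.0], [3.0, 4.0]]  # two image patch feature vectors
task = [0.5, -0.5]                  # e.g. an embedding of "map flood extent"
fused = fuse_dual_awareness(patches, task)
assert len(fused) == 4              # pooled image dims + task dims
```

A real implementation would more likely use learned projections and cross-attention, but the concatenation sketch makes the "dual" in dual awareness explicit.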
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review of our manuscript on the LMMP framework. The comments have helped us identify areas where additional clarity and detail are needed to strengthen the presentation. We address each major comment below and have made corresponding revisions to the manuscript.
Point-by-point responses
Referee: [Meta Task Library description] Framework description (Meta Task Library subsection): The central claim that the Meta Task Library 'standardizes domain logic and ensures plans are physically feasible' is load-bearing for the entire contribution, yet the manuscript supplies no construction details, explicit rule set, constraint checker, or reference to physical models (e.g., orbital mechanics or sensor coverage). Without these elements it is impossible to determine whether feasibility is enforced at generation time or merely approximated post-hoc by the DPO stage.
Authors: We acknowledge that the original manuscript did not provide sufficient construction details for the Meta Task Library. In the revised version, we have substantially expanded the relevant subsection to describe the library's construction from remote sensing expert knowledge, including the explicit rule set for standardizing domain logic, the constraint checker implementation, and references to physical models such as orbital mechanics and sensor coverage constraints. These elements are integrated to enforce physical feasibility directly at plan generation time within the meta-planner, prior to any refinement in the DPO stage. We have added pseudocode, examples, and a diagram to illustrate the process. revision: yes
Referee: [Experiments and evaluation] Experiments and evaluation section: The abstract asserts 'significant improvements' in tool-calling accuracy and task success rates together with generalization to unseen missions, but reports neither baselines, concrete metrics, statistical tests, dataset construction procedure, nor ablation isolating the Meta Task Library's contribution. This evidentiary gap directly undermines verification of the strongest empirical claims.
Authors: We agree that the experimental reporting in the initial submission required greater explicitness to allow full verification of the claims. The revised manuscript expands the Experiments and evaluation section to clearly list all baselines, report concrete metrics and statistical test results, provide a detailed account of the dataset construction procedure derived from EarthBench and ThinkGeo, and include an ablation study that isolates the Meta Task Library's specific contribution to the observed gains in tool-calling accuracy, task success rates, and generalization performance. revision: yes
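One standard way to back the promised "statistical test results" is a percentile bootstrap over per-mission gains. The sketch below is a generic illustration with made-up numbers, not the authors' protocol.

```python
import random

def bootstrap_gain_ci(gains, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-mission gain of LMMP over
    a baseline; a CI that excludes zero supports 'significant improvement'."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(gains) for _ in gains) / len(gains)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-mission gains in task success rate.
gains = [0.10, 0.05, 0.12, 0.08, 0.02, 0.07, 0.09, 0.04]
lo, hi = bootstrap_gain_ci(gains)
```

With only eight missions the interval would be wide; the point of reporting it is that readers can see how wide.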
Referee: [Two-stage training pipeline] Training pipeline description: The two-stage pipeline is presented as enabling generalization beyond the training distribution, yet no cross-validation, out-of-distribution mission splits, or failure-case analysis is described that would distinguish library-driven feasibility from dataset-specific artifacts or backbone improvements.
Authors: We appreciate the referee's point regarding the need for stronger evidence of generalization. In the revised manuscript, we have augmented the Training pipeline description with details on the cross-validation procedure, the explicit out-of-distribution mission splits drawn from the ThinkGeo dataset for testing on previously unseen EO missions, and a failure-case analysis. This analysis supports that the feasibility guarantees and performance improvements arise primarily from the Meta Task Library and dual-awareness mechanism rather than dataset artifacts or backbone-specific effects. revision: yes
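The out-of-distribution splits described in this response can be made concrete by holding out entire mission types rather than individual samples. A minimal sketch with hypothetical mission records:

```python
def ood_split(missions, held_out_types):
    """Split missions so whole mission types are unseen during training,
    separating genuine generalization from per-type memorization."""
    train = [m for m in missions if m["type"] not in held_out_types]
    test = [m for m in missions if m["type"] in held_out_types]
    return train, test

# Hypothetical EO mission records (ids and types invented for illustration).
missions = [
    {"id": 1, "type": "flood_mapping"},
    {"id": 2, "type": "ship_detection"},
    {"id": 3, "type": "flood_mapping"},
    {"id": 4, "type": "crop_classification"},
]
train, test = ood_split(missions, {"crop_classification"})
assert all(m["type"] != "crop_classification" for m in train)
```

A sample-level random split would leak mission-type regularities into training and overstate generalization; the type-level hold-out avoids that.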
Circularity Check
No circularity: framework and results are independently constructed and empirically validated
Full rationale
The paper introduces a new LMMP architecture, Meta Task Library, and two-stage training pipeline as explicit contributions, then validates them via experiments on derived external datasets (EarthBench/ThinkGeo). No equations, parameters, or predictions reduce by construction to the inputs; the Meta Task Library is presented as an injected knowledge source rather than a self-defined output, and performance gains are measured against baselines rather than fitted to the same quantities. The derivation chain is self-contained against external benchmarks with no load-bearing self-citation or renaming of known results.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Meta Task Library: no independent evidence