GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
Pith reviewed 2026-05-10 13:02 UTC · model grok-4.3
The pith
The Plan-and-React architecture outperforms traditional frameworks for LLM-based agents on dynamic geospatial analysis tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that decoupling global orchestration from step-wise reactive execution in the Plan-and-React architecture delivers the best trade-off between logical rigor and execution robustness for tool-augmented GIS agents, as measured by improved multi-step reasoning and error recovery on the 53-task benchmark.
What carries the argument
The Plan-and-React agent architecture, which separates high-level planning from immediate reactive tool calls to correct parameter errors and recover from runtime anomalies during spatial workflows.
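To make the decoupling concrete, the sketch below shows what a Plan-and-React control loop could look like: the plan is produced once, and each step gets a local retry loop that repairs parameters based on the runtime error. The helper names, toy tools, and default values are illustrative assumptions, not the paper's implementation; in the real system, both the planner and the repair step would be LLM calls.

```python
# Minimal sketch of a Plan-and-React style control loop, as described in the
# review: a global planner fixes the step sequence once, while a reactive
# executor retries each step with repaired parameters on runtime failure.
# The planner, tools, and repair logic are toy stand-ins, not the paper's code.

from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str                                  # name of an atomic GIS tool
    params: dict = field(default_factory=dict)


def plan_workflow(task: str) -> list[Step]:
    # Global orchestration: in the real system an LLM would produce this plan.
    return [Step("buffer", {"input": "roads.shp"}),          # distance missing
            Step("clip", {"input": "buffered", "mask": "city.shp"})]


def run_tool(step: Step) -> str:
    # Sandbox execution: raise on a missing required parameter.
    required = {"buffer": ["input", "distance"], "clip": ["input", "mask"]}
    for key in required[step.tool]:
        if key not in step.params:
            raise KeyError(f"{step.tool}: missing parameter '{key}'")
    return f"{step.tool} ok"


def repair_parameters(step: Step, error: Exception) -> Step:
    # Reactive repair: in the real system an LLM would read the error message;
    # here we just patch in a default for the reported parameter.
    missing = str(error).split("'")[1]
    step.params[missing] = 100      # e.g. a default buffer distance in metres
    return step


def plan_and_react(task: str, max_retries: int = 3) -> list[str]:
    results = []
    for step in plan_workflow(task):           # plan is fixed up front
        for _ in range(max_retries):           # reaction is local to the step
            try:
                results.append(run_tool(step))
                break
            except KeyError as err:
                step = repair_parameters(step, err)
        else:
            raise RuntimeError(f"step '{step.tool}' failed after retries")
    return results


print(plan_and_react("buffer roads then clip to city boundary"))
# -> ['buffer ok', 'clip ok'] after one reactive parameter repair
```

The design point the paper stresses is visible in the loop structure: the outer loop never replans, so logical rigor is preserved, while the inner loop absorbs parameter errors and runtime anomalies locally.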
If this is right
- LLM agents can achieve higher success rates on multi-step spatial tasks when planning and execution are explicitly decoupled.
- Parameter misconfiguration becomes the dominant failure mode in dynamic GIS environments, and parameter-level accuracy can be quantified separately from final output correctness.
- VLM verification provides an additional signal for assessing spatial fidelity and cartographic quality that text matching alone misses.
- Current LLMs still exhibit clear capability boundaries on tasks requiring repeated parameter adjustment and runtime adaptation.
- A standardized dynamic benchmark with atomic tools can serve as a repeatable testbed for measuring progress in autonomous GeoAI.
Where Pith is reading between the lines
- The same decoupling principle could be tested in other tool-heavy domains such as scientific simulation or laboratory automation where intermediate outputs must be inspected.
- VLM verification might transfer to visual reasoning benchmarks in fields like remote sensing or medical image analysis that also produce map-like outputs.
- Expanding the benchmark to include larger-scale or user-generated workflows would test whether the reported performance gap persists beyond the curated 53 tasks.
- Hybrid agents that switch between Plan-and-React and other frameworks depending on task length could combine the strengths observed here.
Load-bearing premise
The 53 chosen tasks and 117 atomic tools, together with the Parameter Execution Accuracy metric and VLM verification, are sufficient to represent the full complexity of real-world geospatial workflows.
What would settle it
A follow-up study that applies the same seven LLMs to a fresh set of 100 real-world GIS projects outside the benchmark and finds that Plan-and-React no longer shows a statistically significant advantage in completion rate or error recovery.
Original abstract
The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoAgentBench (GABench), a dynamic execution benchmark for tool-augmented GIS agents. It integrates 117 atomic GIS tools across 53 tasks in 6 core domains, proposes the Parameter Execution Accuracy (PEA) metric using a Last-Attempt Alignment strategy to evaluate implicit parameter inference, and introduces VLM-based verification for spatial and cartographic accuracy. The work also presents a Plan-and-React agent architecture that separates global planning from step-wise reactive execution and reports that this paradigm outperforms traditional frameworks (e.g., ReAct-style) across experiments with seven LLMs, particularly in multi-step reasoning and error recovery.
Significance. If the empirical results hold, this benchmark fills an important gap by enabling dynamic, runtime-aware evaluation of GeoAI agents rather than relying on static text or code matching. The PEA metric and VLM verification offer more realistic proxies for execution success in spatial workflows. The Plan-and-React architecture, if shown to be robust, provides a concrete design pattern for balancing planning rigor with runtime adaptability, with potential to guide development of autonomous spatial analysis systems. The sandbox artifact itself is a reusable contribution for the community.
Major comments (3)
- [Benchmark Construction] Benchmark construction (53 tasks, 117 tools): the central claim that the benchmark captures realistic multi-step geospatial workflows rests on task and tool selection, yet the manuscript provides limited justification for coverage of variability, edge cases, or domain representativeness; this directly affects whether the reported outperformance generalizes beyond the sandbox.
- [Evaluation Metrics] PEA metric definition: the Last-Attempt Alignment strategy is described as quantifying parameter inference fidelity, but without a formal equation, pseudocode, or handling rules for multiple attempts and partial matches, it is unclear whether the metric avoids bias or circularity in success measurement; this is load-bearing for all quantitative claims.
- [Experiments] Experiments section: the claim that Plan-and-React 'significantly outperforms' traditional frameworks across seven LLMs lacks reported statistical tests, run-to-run variance, or explicit baseline implementations (e.g., exact ReAct or Plan-and-Execute variants), making it difficult to assess the magnitude and reliability of the gains.
Minor comments (3)
- [Abstract] Abstract: the acronym GABench is introduced without immediate expansion on first use; consistent parenthetical definition would improve readability.
- [Benchmark Construction] The description of the six core GIS domains would benefit from an accompanying table listing representative tasks per domain to aid quick assessment of coverage.
- [Verification Methods] Notation for the VLM verification step is introduced but not contrasted explicitly with PEA; a short comparison paragraph would clarify their complementary roles (a toy sketch of how the two signals could divide the work appears after this list).
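As a concrete version of that comparison, the toy sketch below separates the two signals: a PEA-style check scores the agent's last attempt at a tool call against a gold parameter record, while a VLM-based check would judge the rendered map for spatial accuracy and cartographic style. The function names are hypothetical, and vlm_judge is a stub standing in for a real vision-language model call.

```python
# Toy sketch contrasting the two verification signals described in the paper:
# PEA-style checking compares tool-call parameters against a gold record,
# while VLM verification inspects the rendered map image. The `vlm_judge`
# helper is hypothetical; a real system would call a vision-language model.

def parameter_check(last_attempt: dict, gold: dict) -> float:
    """PEA-style signal: fraction of gold parameters matched exactly
    by the agent's last attempt at the tool call."""
    matched = sum(1 for k, v in gold.items() if last_attempt.get(k) == v)
    return matched / len(gold)


def vlm_judge(image_path: str, criteria: list[str]) -> dict[str, bool]:
    """VLM-style signal: ask a vision-language model whether each
    cartographic criterion holds for the rendered output. Stubbed here."""
    raise NotImplementedError("would send the image and criteria to a VLM")


gold = {"input": "roads.shp", "distance": 500, "units": "meters"}
attempt = {"input": "roads.shp", "distance": 500, "units": "feet"}
print(parameter_check(attempt, gold))   # 0.666...: 'units' is wrong

# vlm_judge("output_map.png", ["legend present", "buffer visually ~500 m",
#                              "projection looks correct"])
```

In this framing the two signals are complementary by construction: PEA reads only the logged tool calls, while the VLM check needs the rendered artifact, which is exactly the division the requested paragraph would spell out.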
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback and positive evaluation of our work. The comments highlight important areas for improvement in clarity and rigor, which we will address in the revised manuscript. We respond to each major comment below.
Point-by-point responses
Referee: [Benchmark Construction] Benchmark construction (53 tasks, 117 tools): the central claim that the benchmark captures realistic multi-step geospatial workflows rests on task and tool selection, yet the manuscript provides limited justification for coverage of variability, edge cases, or domain representativeness; this directly affects whether the reported outperformance generalizes beyond the sandbox.
Authors: We appreciate this observation. While the task and tool selection was guided by standard references in GIS literature and common workflows in spatial analysis (as briefly noted in Section 3), we agree that more detailed justification is warranted to strengthen claims of representativeness. In the revised manuscript, we will expand the benchmark construction section to include explicit criteria for task selection, coverage of variability across domains, discussion of included edge cases, and references to domain standards. This will better substantiate the generalizability of our findings. revision: yes
Referee: [Evaluation Metrics] PEA metric definition: the Last-Attempt Alignment strategy is described as quantifying parameter inference fidelity, but without a formal equation, pseudocode, or handling rules for multiple attempts and partial matches, it is unclear whether the metric avoids bias or circularity in success measurement; this is load-bearing for all quantitative claims.
Authors: The referee correctly identifies a gap in the presentation of the PEA metric. To address this, we will include a formal mathematical definition of the Parameter Execution Accuracy metric in the revised version, along with pseudocode for the Last-Attempt Alignment strategy. We will also specify rules for handling multiple attempts (e.g., using the final attempt for alignment) and partial matches (with a defined similarity threshold). This addition will clarify the metric's computation and mitigate concerns about bias or circularity. revision: yes
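Pending the promised formal definition, the sketch below gives one plausible reading of the PEA computation with Last-Attempt Alignment: only the final attempt per tool call is scored, with a string-similarity threshold deciding partial matches. The function name, threshold value, and matching rules are assumptions, not the paper's published definition.

```python
# Hedged sketch of one plausible reading of Last-Attempt Alignment: for each
# tool call, only the agent's final attempt is aligned against the gold
# parameters, with a similarity threshold deciding partial string matches.
# This anticipates the promised definition; it is not the paper's code.

from difflib import SequenceMatcher


def pea(attempts: list[dict], gold: dict, threshold: float = 0.9) -> float:
    """Parameter Execution Accuracy over one tool call: fraction of gold
    parameters matched by the LAST attempt (earlier attempts are ignored)."""
    last = attempts[-1]                        # Last-Attempt Alignment
    hits = 0
    for key, want in gold.items():
        got = last.get(key)
        if got == want:
            hits += 1                          # exact match
        elif isinstance(got, str) and isinstance(want, str):
            # partial match: string similarity above a fixed threshold
            if SequenceMatcher(None, got, want).ratio() >= threshold:
                hits += 1
    return hits / len(gold)


attempts = [{"input": "road.shp", "distance": 50},     # first attempt, wrong
            {"input": "roads.shp", "distance": 500}]   # final attempt, fixed
gold = {"input": "roads.shp", "distance": 500}
print(pea(attempts, gold))   # 1.0: only the last attempt is scored
```

Under this reading, earlier failed attempts do not penalize the score, which would make the metric a measure of eventual parameter inference rather than first-shot accuracy; spelling out exactly this choice is what would resolve the circularity concern.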
Referee: [Experiments] Experiments section: the claim that Plan-and-React 'significantly outperforms' traditional frameworks across seven LLMs lacks reported statistical tests, run-to-run variance, or explicit baseline implementations (e.g., exact ReAct or Plan-and-Execute variants), making it difficult to assess the magnitude and reliability of the gains.
Authors: We acknowledge the need for greater statistical rigor in the experimental results. In the revision, we will report standard deviations from multiple independent runs (e.g., 5 runs per configuration), include statistical significance tests such as paired t-tests or Wilcoxon signed-rank tests to support the 'significantly outperforms' claims, and provide more explicit descriptions of the baseline implementations, including code-level details for ReAct and Plan-and-Execute variants. These changes will enhance the reliability and interpretability of the performance comparisons. revision: yes
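On the statistical rigor point, the snippet below shows how the proposed paired tests could be run over per-task scores with SciPy. The per-task success rates here are synthetic placeholders generated for illustration, not numbers from the paper.

```python
# Sketch of the paired significance tests the rebuttal proposes: compare
# per-task success of Plan-and-React against a ReAct baseline on the same
# 53 benchmark tasks. The scores below are synthetic, for illustration only.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tasks = 53

# Hypothetical per-task success rates, averaged over repeated runs.
react = rng.uniform(0.3, 0.8, n_tasks)
plan_and_react = np.clip(react + rng.normal(0.1, 0.05, n_tasks), 0.0, 1.0)

# Paired t-test: assumes roughly normal per-task differences.
t_res = stats.ttest_rel(plan_and_react, react)

# Wilcoxon signed-rank: non-parametric alternative on the same pairs.
w_res = stats.wilcoxon(plan_and_react, react)

print(f"paired t-test:        t={t_res.statistic:.2f}, p={t_res.pvalue:.4f}")
print(f"wilcoxon signed-rank: W={w_res.statistic:.1f}, p={w_res.pvalue:.4f}")
```

Pairing by task is the key design choice: it controls for per-task difficulty, so the tests measure whether the architecture helps on the same problems rather than on easier ones.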
Circularity Check
No significant circularity in benchmark construction or empirical claims
Full rationale
The paper introduces new artifacts: a sandbox with 53 tasks and 117 atomic GIS tools, the Parameter Execution Accuracy (PEA) metric with its Last-Attempt Alignment strategy, VLM-based verification, and the Plan-and-React agent architecture. It then evaluates these empirically across seven LLMs. No equations, fitted parameters, or derivations appear in the manuscript that reduce any claimed prediction or result to the inputs by construction. The central performance claims rest on direct experimental comparisons within the explicitly scoped sandbox rather than on self-citations, uniqueness theorems, or ansatzes imported from prior author work. This structure is self-contained and follows standard practice for benchmark papers, with independent content in task design, metric definition, and architecture description.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the 53 selected tasks and 117 atomic tools adequately represent typical spatial analysis workflows across six GIS domains.