Towards LLM Agents for Earth Observation
5 Pith papers cite this work.
Citing papers by year: 2026 (5)
citing papers explorer
- TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
  TerraBench is a new benchmark of 403 tasks across Earth-science domains. It evaluates how well LLM agents coordinate heterogeneous data, using executable ReAct-style workflows and process-level metrics.
- GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
  GeoNatureAgent Benchmark tests seven LLMs on 93 tasks via a production geospatial API. Claude Sonnet 4 scores 60.8%, DeepSeek V3.2 comes close at 11x lower cost, and all models fail on close-value comparisons.
- Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
  DORA is the first end-to-end agentic benchmark for LLM-based disaster response. It covers perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data; evaluations of 13 frontier models reveal tool-use and composition failures.
- EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
  EO-Gym supplies an executable multimodal environment and a 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task. It shows that current VLMs struggle on temporal and cross-sensor workflows, while fine-tuning lifts Pass@3 from 0.49 to 0.74.
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions
  This position paper identifies structural challenges in applying generic agentic AI to Earth Observation and outlines design principles for EO-native agents focused on geospatial state and validity.