EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain
2 Pith papers cite this work.
years: 2026 (2)
verdicts: UNVERDICTED (2)
citing papers explorer
- EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
  EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
- Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
  Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.