GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

Krishna P. Gummadi; Kyeongjin Ahn; Meeyoung Cha; Seungeon Lee

arxiv: 2605.20006 · v1 · pith:7CXCT2E3new · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-20 05:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords: geospatial reasoning, self-play, vision-language models, reinforcement learning, verifiable rewards, executable programs, spatial logic

The pith

A self-play loop lets one vision model generate spatial problems as code, solve them, and train itself via execution-based rewards without large labeled datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoX as a framework in which a single multimodal policy creates spatial reasoning tasks by writing executable programs and then attempts to solve those same tasks using abduction, deduction, and induction over image primitives. A verifier runs the programs against the actual satellite or aerial image to produce reward signals that reinforce the policy through reinforcement learning. This closed loop allows the model to develop geospatial understanding from its own generated data rather than from millions of human-annotated examples. If the approach holds, it points toward training complex spatial reasoning at far lower annotation cost while still reaching or surpassing performance of heavily supervised baselines.
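
Going by the paper's figure captions, the three reasoning modes can be read as three ways of hiding one element of a triple (program p, argument a, output o): deduction predicts o from (p, a), abduction recovers an argument that reproduces o, and induction synthesizes a program consistent with visible input-output pairs. The Python sketch below shows how each kind of answer could be checked by execution. It is an illustration under stated assumptions, not the GeoX implementation: the toy `scene` dictionary stands in for segmenter output, and every function name is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: object phrase -> number of detected instances.
# In the paper this information would come from an image understanding tool.
Scene = Dict[str, int]
Program = Callable[[str, Scene], int]


def count_program(phrase: str, scene: Scene) -> int:
    """Hypothetical program p: count instances of `phrase` in the scene."""
    return scene.get(phrase, 0)


def check_deduction(p: Program, a: str, scene: Scene, predicted_o: int) -> bool:
    """Deduction: the solver sees (p, a) and must predict o = p(a; I)."""
    return p(a, scene) == predicted_o


def check_abduction(p: Program, o: int, scene: Scene, predicted_a: str) -> bool:
    """Abduction: the solver sees (p, o) and must supply an a with p(a; I) = o."""
    return p(predicted_a, scene) == o


def check_induction(pairs: List[Tuple[str, int]], scene: Scene,
                    predicted_p: Program) -> bool:
    """Induction: the solver sees (a_t, o_t) pairs and must supply a consistent p."""
    return all(predicted_p(a, scene) == o for a, o in pairs)


scene = {"vehicle": 3, "bridge": 2}  # toy values, chosen for illustration
deduction_ok = check_deduction(count_program, "vehicle", scene, 3)
abduction_ok = check_abduction(count_program, 2, scene, "bridge")
induction_ok = check_induction([("vehicle", 3), ("bridge", 2)], scene, count_program)
print(deduction_ok, abduction_ok, induction_ok)
```

In all three cases the check reduces to running a program on the scene and comparing outputs, which is what makes the reward verifiable without a human label.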

Core claim

GeoX uses one policy both to propose spatial problems as executable programs and to solve them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image tool. A verifier executes each program to supply a reward that jointly optimizes the proposal and solution roles via reinforcement learning. The reported result is an average gain of up to 5.5 points on geospatial tasks, matching or exceeding models trained on millions of curated examples.

What carries the argument

A self-play loop in which the same multimodal policy generates and solves spatial problems as executable programs, with a verifier supplying rewards from direct program execution on the image.
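
One step of such a loop can be sketched as follows. This is a minimal sketch under assumptions: the summary here does not give GeoX's reward formulas, so the binary solver reward and the "solved sometimes, but not always" proposer shaping are illustrative choices, and `propose`, `solve`, and `self_play_step` are hypothetical names. The policy roles are replaced by simple stubs so that the example runs on its own.

```python
from typing import Any, Callable, Dict, List, Tuple

Image = Dict[str, int]                      # toy image: phrase -> instance count
Program = Callable[[str, Image], Any]


def solver_reward(program: Program, argument: str, image: Image, predicted: Any) -> float:
    """Binary reward: 1.0 only if the prediction equals the executed label."""
    try:
        label = program(argument, image)    # the verifier runs the program on the image
    except Exception:
        return 0.0                          # a program that fails to execute earns nothing
    return 1.0 if predicted == label else 0.0


def proposer_reward(solver_rewards: List[float]) -> float:
    """Assumed shaping: favour problems that are solved sometimes, not always or never."""
    if not solver_rewards:
        return 0.0
    success_rate = sum(solver_rewards) / len(solver_rewards)
    if success_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - success_rate


def self_play_step(propose: Callable[[Image], Tuple[Program, str]],
                   solve: Callable[[Program, str, Image], Any],
                   image: Image, num_attempts: int = 4) -> Dict[str, Any]:
    """One proposer turn followed by several solver attempts on the same problem."""
    program, argument = propose(image)
    attempts = [solve(program, argument, image) for _ in range(num_attempts)]
    rewards = [solver_reward(program, argument, image, guess) for guess in attempts]
    return {"solver_rewards": rewards, "proposer_reward": proposer_reward(rewards)}


# Stubs standing in for the two roles of the shared policy.
def count_program(phrase: str, image: Image) -> int:
    return image.get(phrase, 0)


guesses = iter([3, 3, 2, 3])                # scripted solver outputs, one of them wrong
result = self_play_step(
    propose=lambda image: (count_program, "vehicle"),
    solve=lambda program, argument, image: next(guesses),
    image={"vehicle": 3},
)
print(result)
```

In a real training run both reward streams would feed a policy-gradient update of the shared model; that part is omitted because the optimizer is not described here.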

If this is right

Base vision-language models gain up to 5.5 points on average across geospatial reasoning benchmarks.
Performance reaches or exceeds that of models trained on millions of human-curated examples.
A new benchmark of geospatial problems is generated and released through the self-play process itself.
Spatial reasoning can be acquired with far less reliance on large-scale human annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verifiable-program self-play pattern could be tested on other grounded reasoning domains such as physics simulation or diagram interpretation.
If program execution reliably captures spatial primitives, the released benchmark could become a reusable testbed for measuring interpretable spatial logic in future models.
Extending the verifier to handle noisy real-world images might reveal whether the current gains depend on clean satellite data.

Load-bearing premise

The verifier must correctly run the generated programs and return reward signals that truly reflect spatial understanding instead of just rewarding syntactically valid code.

What would settle it

Train the model with the verifier disabled so that only syntactic correctness of programs is rewarded, then measure whether the 5.5-point gains on held-out geospatial questions disappear.
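
The two reward conditions in that ablation can be contrasted in a few lines of Python. This is a hedged sketch rather than the paper's verifier: it assumes programs arrive as Python source defining a function named `program`, which is a convention invented for this example, and it uses only the standard-library `ast` module and `exec`.

```python
import ast
from typing import Any, Dict


def syntax_only_reward(program_source: str) -> float:
    """Ablated condition: reward any program that parses, ignoring the image."""
    try:
        ast.parse(program_source)
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0


def execution_reward(program_source: str, argument: str,
                     image: Dict[str, int], predicted: Any) -> float:
    """Grounded condition: reward only answers that match the executed label."""
    namespace: Dict[str, Any] = {}
    try:
        # NOTE: exec on model-written code is unsafe; a real verifier would sandbox it.
        exec(program_source, namespace)
        label = namespace["program"](argument, image)
    except Exception:
        return 0.0
    return 1.0 if predicted == label else 0.0


source = (
    "def program(phrase, image):\n"
    "    return image.get(phrase, 0) > 0\n"
)
image = {"building": 4}                     # toy image: phrase -> instance count
wrong_answer = False                        # the solver wrongly claims no buildings

ablated = syntax_only_reward(source)
grounded = execution_reward(source, "building", image, wrong_answer)
print(ablated, grounded)
```

The same wrong answer is fully rewarded in the ablated condition and not rewarded in the grounded one, which is the gap the proposed experiment is designed to expose.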

Figures

Figures reproduced from arXiv: 2605.20006 by Krishna P. Gummadi, Kyeongjin Ahn, Meeyoung Cha, Seungeon Lee.

**Figure 1.** Standing and motivation. (a) GeoX outperforms prior VLMs on VQA benchmarks while using zero curated training data; (b) Conventional work derives questions from human-designed templates with answers externally annotated, confining them to predefined patterns. Our framework replaces this paradigm with autonomous self-play, where a single model proposes questions whose answers are programmatically verified, b…

**Figure 2.** Method overview. A single multimodal policy πθ alternates between a proposer (middle) and a solver (right) that share a call interface (left) of spatial primitives F and tools T (instantiated here with an open-vocabulary segmenter; greyed entries left for future work). Conditioned on image I, the proposer constructs a problem by composing calls into an executable program p paired with an argument a, forming…

**Figure 2.** The policy first acts as a proposer, constructing an executable problem that realizes a spatial question over an input image. It then acts as a solver, attempting to find the solution for constructed problems. Both roles are driven by verifiable rewards: the proposer is rewarded for devising challenging yet learnable problems, while the solver is rewarded for answering correctly under the three reasoning m…

**Figure 3.** Seed problem. A segmenter call with the phrase "building", followed by a presence check on the returned mask. The template is instantiated by pairing random object phrases with images, populating each bank with Nseed problems. From this warm start, the proposer moves toward increasingly compositional problems, such as comparing areas across object categories, without further human intervention.

**Figure 4.** Pairwise dimension compositions across datasets. Each node represents one of nine question dimensions; node size reflects how often a dimension appears in a single problem, and edge thickness reflects how often two dimensions co-occur within a single problem.

**Figure 5.** Qualitative analysis of deduction. Given image I, argument a = "ship", and program p, the solver's Chain-of-Thought follows the program's control flow and predicts ô = "TR", which exactly matches the program-executed label o = p(a; I).

**Figure 6.** Usage frequency of primitives in GeoX. Each bar denotes the number of constructed programs that invoke a given operator in F (log scale), grouped by primitive type. The proposer constructed roughly 6,500 programs over training.

**Figure 7.** Training dynamics during self-play. Task accuracy of GeoX over training steps on representative geospatial reasoning subtasks drawn from three remote sensing VQA benchmarks: Comparison (RSVQA-HR), Reasoning-based Counting (EarthVQA), and Spatial Relation Classification (GEOBench-VLM). Beyond the VQA results in Section 3.2, the paper further evaluates GeoX on object counting (Section E.2).

**Figure 8.** Examples of rule-based mapping of natural language problems.

**Figure 9.** Examples of rule-based mapping of problems generated in

**Figure 10.** Qualitative analysis of abduction. Given (I, p, o), the solver infers â by simulating p forward and searching for arguments whose execution reproduces the observed output o.

**Figure 11.** Qualitative analysis of deduction (reproduced from

**Figure 12.** Qualitative analysis of induction. Given input-output pairs {(a_t, o_t)}_{t∈V}, the solver synthesizes a program p̂ consistent with the visible pairs.
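
To make the captions for Figures 3 and 5 concrete, the sketch below writes out a seed-style program (a segmenter call followed by a presence check) and a more compositional one (the quadrant containing the centroid of the largest instance). It is an illustration only: `ToyImage`, `segment`, `box`, and the `"NONE"` fallback label are hypothetical stand-ins, masks are plain sets of pixel coordinates, and the real call interface F and segmenter tool are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]                     # (row, col)
Mask = Set[Pixel]                           # pixels of one object instance


@dataclass
class ToyImage:
    height: int
    width: int
    instances: Dict[str, List[Mask]] = field(default_factory=dict)


def segment(image: ToyImage, phrase: str) -> List[Mask]:
    """Hypothetical stand-in for the open-vocabulary segmenter tool."""
    return [mask for mask in image.instances.get(phrase, []) if mask]


def seed_program(phrase: str, image: ToyImage) -> bool:
    """Seed-style problem: segment the phrase, then check for presence."""
    return len(segment(image, phrase)) > 0


def largest_quadrant_program(phrase: str, image: ToyImage) -> str:
    """Compositional problem: quadrant (TL, TR, BL, BR) of the largest instance's centroid."""
    masks = segment(image, phrase)
    if not masks:
        return "NONE"                       # assumed label for "no such object"
    largest = max(masks, key=len)           # area measured as pixel count
    row_c = sum(r for r, _ in largest) / len(largest)
    col_c = sum(c for _, c in largest) / len(largest)
    vertical = "T" if row_c < image.height / 2 else "B"
    horizontal = "L" if col_c < image.width / 2 else "R"
    return vertical + horizontal


def box(top: int, left: int, bottom: int, right: int) -> Mask:
    """Helper that builds a filled rectangular mask (inclusive bounds)."""
    return {(r, c) for r in range(top, bottom + 1) for c in range(left, right + 1)}


# Toy 10x10 harbor: a large ship near the top right, a small one near the bottom left.
harbor = ToyImage(10, 10, {"ship": [box(1, 6, 2, 9), box(7, 1, 7, 2)]})
print(seed_program("ship", harbor), seed_program("building", harbor))
print(largest_quadrant_program("ship", harbor))
```

In the deduction setting the solver would be shown such a program together with the argument and the image, and rewarded when its predicted label matches what the program returns.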

Original abstract

Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program to produce a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated examples. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoX's self-play with executable programs for rewards is a clean idea for bootstrapping geospatial VLMs without big labeled sets, but the abstract leaves the verifier's actual signal too opaque to trust the 5.5-point gains yet.

The main thing to know is that GeoX puts one multimodal policy in charge of both inventing spatial problems as executable programs and solving them, then uses program execution success as the RL reward. That closed loop is the actual new piece, and it sidesteps the usual cost of curating millions of examples for satellite or aerial images. They report that the base VLMs pick up as much as 5.5 points on average and match or exceed heavily supervised baselines, and they release the benchmark built the same way.

The framing of abduction, deduction, and induction over spatial primitives plus an image tool is a reasonable way to structure the reasoning modes. Releasing the benchmark is also a concrete contribution that others can use to test similar ideas.

The soft spot is exactly the one the stress-test note flags: nothing shown so far confirms that successful program execution reflects genuine spatial understanding rather than the model learning to emit programs whose syntax or simple patterns just pass the verifier. Because the benchmark itself comes from the same loop, it is possible the measured gains stay inside that closed distribution instead of proving transferable logic. The abstract gives no program examples, no ablation on verifier strictness, and no external test sets, so the central claim is still unverified.

This is for people already working on RL or self-play methods for vision-language models, especially those focused on geospatial or scene-understanding tasks. A reader who wants concrete frameworks for reducing annotation cost would get value from the setup even if the results need more scrutiny. It deserves a serious referee because the core mechanism is distinct from standard VLM fine-tuning and the claims are falsifiable once code and program traces are available. I would send it to review and ask the referees to check the generated programs for depth and to run the model on held-out external benchmarks.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces GeoX, a self-play framework for geospatial reasoning in vision-language models. A single multimodal policy proposes spatial problems as executable programs and solves them via abduction, deduction, and induction over spatial primitives plus an image tool. A verifier executes the programs to supply reward signals that jointly optimize the policy through reinforcement learning. The work claims consistent improvements of up to 5.5 points on base VLMs, matching or exceeding conventional baselines trained on millions of curated examples, and releases a benchmark accumulated through the same self-play process.

Significance. If the reported gains prove robust and the rewards demonstrably capture genuine spatial understanding rather than program syntax, the framework offers a scalable alternative to large-scale human annotation for complex visual reasoning. The self-play plus verifiable-execution design and the public benchmark release would be concrete strengths for the field.

major comments (3)

[Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.
[Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.
[Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.

minor comments (1)

[Abstract] The abstract states that rewards come from 'program execution' but does not specify the execution environment or failure modes; a short paragraph clarifying this would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes planned for the revised manuscript.

Referee: [Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.

Authors: We agree that the abstract requires additional context to support the reported gains. In the revised manuscript we will expand the abstract to name the evaluation datasets (standard geospatial VQA and spatial-reasoning benchmarks), the metric (accuracy), the baselines (models trained on millions of curated examples), the number of runs, and error bars. These details will be added to the abstract within its length constraints.
revision: yes

Referee: [Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.

Authors: This concern is well-founded. The verifier executes programs against actual image primitives, so only programs that correctly query scene geometry receive positive reward; purely syntactic or tautological programs yield zero or negative reward on varied images. The three modes further constrain the policy: abduction requires recovering an argument whose execution reproduces an observed output, deduction requires predicting the output of a given program and argument, and induction requires synthesizing a program consistent with observed input-output pairs. In the revision we will add a dedicated paragraph with program-complexity statistics and an ablation that removes individual modes, showing increased degeneracy when any mode is absent.
revision: yes

Referee: [Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.

Authors: We accept that transfer to external distributions must be shown explicitly. Although the self-play benchmark supplies scalable training data, the revised manuscript will include new evaluation results on held-out external geospatial benchmarks (distinct from the self-play distribution) to confirm that the observed improvements generalize. We will report these numbers alongside the internal held-out splits.
revision: yes
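
The rebuttal's claim that tautological programs earn no reward "on varied images" suggests one simple check, sketched below in Python. The mechanism is an assumption made for illustration and is not described in the source: a program is flagged as degenerate when its output is identical across a set of different images. The names `is_degenerate`, `tautology`, and `count` are hypothetical.

```python
from typing import Any, Callable, Dict, List

Image = Dict[str, int]                      # toy image: phrase -> instance count
Program = Callable[[str, Image], Any]


def is_degenerate(program: Program, argument: str, images: List[Image]) -> bool:
    """Flag programs whose output never changes across the given images."""
    outputs = set()
    for image in images:
        try:
            outputs.add(repr(program(argument, image)))
        except Exception:
            continue                        # a crash gives no evidence of grounding
    return len(outputs) <= 1


def tautology(phrase: str, image: Image) -> bool:
    """Always true: a count can never be negative, so the image is never tested."""
    return image.get(phrase, 0) >= 0


def count(phrase: str, image: Image) -> int:
    """Depends on the scene: the answer changes with the image content."""
    return image.get(phrase, 0)


images = [{"ship": 0}, {"ship": 2}, {"ship": 5}]
tautology_flagged = is_degenerate(tautology, "ship", images)
count_flagged = is_degenerate(count, "ship", images)
print(tautology_flagged, count_flagged)
```

A check like this is only a heuristic: a well-grounded program can still return the same value on a small or homogeneous image set, so the choice of probe images matters.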

Circularity Check

0 steps flagged

Self-play benchmark is internally generated but central performance gains remain empirically reported without definitional reduction

The paper presents GeoX as a self-play RL loop in which a single policy generates executable spatial programs (as problems) and solves them via abduction/deduction/induction, with rewards obtained by external program execution on an image tool. The abstract explicitly states that the released benchmark is 'accumulated through self-play,' which creates a potential closed distribution. However, no equations, fitted parameters, or self-citations are shown that would make the reported 5.5-point gains equivalent to the input distribution by construction. The improvement is described as an observed outcome after RL optimization rather than a quantity defined in terms of the verifier success rate or the generated problems themselves. No load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text. The derivation therefore stays self-contained against external benchmarks even if the evaluation distribution overlaps with the training loop; this warrants only a minor (non-load-bearing) circularity flag.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that generated programs can be executed to yield accurate, non-trivial reward signals for spatial reasoning. No free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: A single multimodal policy can generate valid executable programs that capture meaningful spatial relations over image primitives.
This premise enables the self-play loop described in the abstract.
domain assumption: The verifier produces reward signals that improve genuine spatial understanding rather than rewarding program syntax alone.
Required for the RL optimization to translate into the claimed performance gains.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 2 internal anchors

[1]

Fine-grained socioeconomic prediction from satellite images with distributional adjustment

Donghyun Ahn, Minhyuk Song, Seungeon Lee, Yubin Choi, Jihee Kim, Sangyoon Park, Hyun- joo Yang, and Meeyoung Cha. Fine-grained socioeconomic prediction from satellite images with distributional adjustment. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3717–3721, 2023

work page 2023
[2]

Georeg: Weight-constrained few-shot regression for socio-economic estimation using llm.arXiv preprint arXiv:2507.13323, 2025

Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Georeg: Weight-constrained few-shot regression for socio-economic estimation using llm.arXiv preprint arXiv:2507.13323, 2025

work page arXiv 2025
[3]

Generalizable disaster damage assessment via change detection with vision foundation model

Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Generalizable disaster damage assessment via change detection with vision foundation model. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27784–27792, 2025

work page 2025
[4]

Mapping reduced accessibility to wash facilities in rohingya refugee camps with sub-meter imagery.arXiv preprint arXiv:2511.07231, 2025

Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes TaubenbÃ k, ck, and Meeyoung Cha. Mapping reduced accessibility to wash facilities in rohingya refugee camps with sub-meter imagery.arXiv preprint arXiv:2511.07231, 2025

work page arXiv 2025
[5]

Vision- language models can self-improve reasoning via reflection

Kanzhi Cheng, Li YanTao, Fangzhi Xu, Jianbing Zhang, Hao Zhou, and Yang Liu. Vision- language models can self-improve reasoning via reflection. InProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8876–8892, 2025

work page 2025
[6]

GEOBench-VLM: Benchmarking vision-language models for geospatial tasks

Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan. GEOBench-VLM: Benchmarking vision-language models for geospatial tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7132–7142, 2025

work page 2025
[7]

Enhancing large vision language models with self-training on image comprehension.Advances in Neural Information Processing Systems, 37:131369–131397, 2024

Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, and Wei Wang. Enhancing large vision language models with self-training on image comprehension.Advances in Neural Information Processing Systems, 37:131369–131397, 2024

work page 2024
[8]

Egenhofer and Robert D

Max J. Egenhofer and Robert D. Franzosa. Point-set topological spatial relations.International Journal of Geographical Information Systems, 5(2):161–174, 1991

work page 1991
[9]

Creating xbd: A dataset for assessing building damage from satellite imagery

Ritwik Gupta, Bryce Goodman, Nirav Patel, Ricky Hosfelt, Sandra Sajeev, Eric Heim, Jigar Doshi, Keane Lucas, Howie Choset, and Matthew Gaston. Creating xbd: A dataset for assessing building damage from satellite imagery. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 10–17, 2019

work page 2019
[10]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

work page 2025
[11]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27831–27840, 2024

work page 2024
[12]

Generalizable slum detection from satellite imagery with mixture-of-experts

Sumin Lee, Sungwon Park, Jeasurk Yang, Jihee Kim, and Meeyoung Cha. Generalizable slum detection from satellite imagery with mixture-of-experts. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38826–38834, 2026. 10

work page 2026
[13]

SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

Kaiyu Li, Shengqi Zhang, Yujie Wang, Yupeng Deng, Zhi Wang, Deyu Meng, and Xiangyong Cao. Segearth-ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images.arXiv preprint arXiv:2512.08730, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

work page 2024
[15]

Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models

Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https: //openreview.net/forum?id=lJ7zecny2e

work page 2026
[16]

RSVQA: Visual question answer- ing for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12): 8555–8566, 2020

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. RSVQA: Visual question answer- ing for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12): 8555–8566, 2020. doi: 10.1109/TGRS.2020.2988782

work page doi:10.1109/tgrs.2020.2988782 2020
[17]

Accurate object localization in remote sensing images based on convolutional neural networks.IEEE Transactions on Geoscience and Remote Sensing, 55(5):2486–2498, 2017

Yang Long, Yiping Gong, Zhifeng Xiao, and Qing Liu. Accurate object localization in remote sensing images based on convolutional neural networks.IEEE Transactions on Geoscience and Remote Sensing, 55(5):2486–2498, 2017

work page 2017
[18]

Lhrs-bot: Em- powering remote sensing with vgi-enhanced large multimodal language model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Em- powering remote sensing with vgi-enhanced large multimodal language model. InEuropean Conference on Computer Vision, pages 440–457. Springer, 2024

work page 2024
[19]

Vhm: Versatile and honest vision language model for remote sensing image analysis

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6381–6388, 2025

work page 2025
[20]

Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3: 0078, 2023

Qian Shi, Da He, Zhengyu Liu, Xiaoping Liu, and Jingqian Xue. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3: 0078, 2023

work page 2023
[21]

Measuring fine-grained urban air temperature with satellite imagery

Minhyuk Song, Sungwon Han, Seungeon Lee, Donghyun Ahn, Jihee Kim, and Meeyoung Cha. Measuring fine-grained urban air temperature with satellite imagery. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28397–28404, 2025

work page 2025
[22]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shah- baz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

work page 2025
[23]

Samrs: Scaling-up remote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023

Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up remote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023

work page 2023
[24]

Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation

Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Process- ing Systems Track on Datasets and Benchmarks, volume 1. Curran Associates, Inc.,

work page
[25]

URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/ paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf

work page 2021
[26]

EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering

Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5481–5489,

work page
[27]

doi: 10.1609/aaai.v38i6.28357

work page doi:10.1609/aaai.v38i6.28357
[28]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Assessing climate risks from satellite imagery with machine learning: A case study of flood risks in jakarta.Climate Risk Management, 46:100651, 2024

Jeasurk Yang, Donghyun Ahn, Junbeom Bahk, Sungwon Park, Nurrokhmah Rizqihandari, and Meeyoung Cha. Assessing climate risks from satellite imagery with machine learning: A case study of flood risks in jakarta.Climate Risk Management, 46:100651, 2024

work page 2024
[30]

Star: Self-taught reasoner bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. InProc. the 36th International Conference on Neural Information Processing Systems, volume 1126, pages 0–55, 2024

work page 2024
[31]

Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain.IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain.IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024

work page 2024
[32]

Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection.IEEE Transactions on Geoscience and Remote Sensing, 57(8):5535–5548, 2019

work page 2019
[33]

Absolute zero: Reinforced self-play reasoning with zero data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 12 A Related Work A.1 Vision-Language Models for remote sensing Recent progress has extended ...

work page 2025
[34]

Primitive coverage.Every primitive in the call interface F (Section 2.3) is assigned to exactly one dimension, ensuring that any program constructed by the proposer is representable in the nine dimensions

work page
[35]

which class occupies the largest area?

Benchmark coverage.Every question in our evaluation benchmarks (RSVQA-HR, EarthVQA, GEOBench-VLM) is assigned to at least one dimension, ensuring a fair side-by-side comparison. Each problem is mapped to asetof dimensions rather than a single one, since real spatial questions routinely combine multiple atomic geospatial concepts. For instance, “which clas...

[36]

For each operational dimension d ∈ {EXISTENCE, QUANTITY, COVERAGE, COMPARISON, RELATION, OVERLAP}, add d to C(p) if any associated regular expression matches a substring of s.

[37]

Add CATEGORY if p contains a string literal naming one of the 19 object classes (e.g., vehicle, ship, building, tree).

[38]

Add SCENE if p contains a string literal naming one of the 11 scene classes (road, water, parking_lot, vegetation, forest, grass, cropland, land, bareland, intersection, roundabout) or the identifier scene_classes.

[39]

Are there more buildings than roads in this image?

Add DIRECTION if p contains a string literal drawn from the set of 18 direction tokens (cardinal directions and quadrant abbreviations).

Verification. We verified the procedure by manually annotating a held-out sample of 100 problems (25 from each of RSVQA-HR, EarthVQA, GEOBench-VLM, and GeoX) and observed that the manual and automatic labels agreed on dimension membership for every problem. ...
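The tagging rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the regular expressions, the direction tokens, and the names `classify_dimensions`, `OPERATIONAL_PATTERNS`, `OBJECT_CLASSES`, `SCENE_CLASSES`, and `DIRECTION_TOKENS` are hypothetical, the object-class set holds only the four classes named in the text, and s is assumed to be the text of the problem while p is the program source.

```python
import re

# Hypothetical patterns; the actual regular expressions are not given in the text.
OPERATIONAL_PATTERNS = {
    "EXISTENCE": [r"\bis there\b", r"\bexists?\b"],
    "QUANTITY": [r"\bhow many\b", r"\bcount\w*\b"],
    "COVERAGE": [r"\barea\b", r"\bcoverage\b"],
    "COMPARISON": [r"\bmore\b", r"\blargest\b", r"\bsmallest\b"],
    "RELATION": [r"\bnearest\b", r"\bdistance\b"],
    "OVERLAP": [r"\boverlap\w*\b", r"\bintersect\w*\b"],
}

# Only the four object classes named in the text; the full vocabulary has 19.
OBJECT_CLASSES = {"vehicle", "ship", "building", "tree"}
# The 11 scene classes listed in the text.
SCENE_CLASSES = {"road", "water", "parking_lot", "vegetation", "forest", "grass",
                 "cropland", "land", "bareland", "intersection", "roundabout"}
# Assumed subset of the 18 direction tokens.
DIRECTION_TOKENS = {"north", "south", "east", "west", "TL", "TR", "BL", "BR"}

# Matches single- or double-quoted string literals (no escape handling).
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def classify_dimensions(program: str, text: str) -> set[str]:
    """Return the set of dimensions C(p) for a program p and its text s."""
    dims = set()
    lowered = text.lower()
    # Operational dimensions: any associated pattern matching a substring of s.
    for dim, patterns in OPERATIONAL_PATTERNS.items():
        if any(re.search(pat, lowered) for pat in patterns):
            dims.add(dim)
    # Entity dimensions: membership of string literals found in p.
    literals = {m.group(2) for m in STRING_LITERAL.finditer(program)}
    if literals & OBJECT_CLASSES:
        dims.add("CATEGORY")
    if literals & SCENE_CLASSES or re.search(r"\bscene_classes\b", program):
        dims.add("SCENE")
    if literals & DIRECTION_TOKENS:
        dims.add("DIRECTION")
    return dims


program = 'def f(img):\n    return count(img, "building") > count(img, "road")'
question = "Are there more buildings than roads in this image?"
print(sorted(classify_dimensions(program, question)))
```

Because a problem maps to a set rather than a single label, one question can contribute to several dimensions at once, which matches the set-valued mapping described under benchmark coverage.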

[40]

The input class is "ship"

[41]

In the image, there are multiple ships docked at the harbor

[42]

The task is to find the largest ship and determine the quadrant where its center of mass lies

[43]

From the image, the long vessel docked at the top of the image appears to be the largest ship

[44]

In which quadrant (TL, TR, BL, BR) does the centroid of the largest ship lie in the image?

The center of mass calculation will be applied to this ship to determine the quadrant it falls into. Let's calculate the center of mass for the largest ship. The coordinates of the ship in the image are approximately: - Top-left corner: (260, 60) / Bottom-right corner: (540, 130) The center of mass calculation for a rectangle is at the midpoint of the top...
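The arithmetic in this reasoning trace can be checked with a short sketch. It assumes image coordinates with the origin at the top-left and y increasing downward; the function names `bbox_center` and `quadrant` are hypothetical, and the 1024 x 1024 image size is a placeholder because the excerpt does not state the image dimensions. The midpoint of a bounding box equals the centroid only for a rectangular region, so a mask-based centroid could differ.

```python
def bbox_center(x_min, y_min, x_max, y_max):
    """Midpoint of an axis-aligned bounding box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def quadrant(cx, cy, width, height):
    """Map a point to TL, TR, BL, or BR relative to the image center."""
    vertical = "T" if cy < height / 2.0 else "B"
    horizontal = "L" if cx < width / 2.0 else "R"
    return vertical + horizontal


# Box corners quoted in the trace: top-left (260, 60), bottom-right (540, 130).
cx, cy = bbox_center(260, 60, 540, 130)
print(cx, cy)  # 400.0 95.0

# Placeholder image size (assumption); the quadrant depends on the true size.
print(quadrant(cx, cy, width=1024, height=1024))
```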

[45]

**Input 0**: `vehicle` - **Output 0**: `3` - From the image, we can see three vehicles on the road

[46]

**Input 1**: `bridge` - **Output 1**: `2` - From the image, we can see two bridges spanning over the water

[47]

**Input 2**: `football-field` - **Output 2**: `1` - From the image, we can see one football field in the top-right corner. Based on these observations, the function needs to count the number of instances of the given class. The class names are `vehicle`, `bridge`, and `football-field`, and the outputs are the counts of instances for each class. The functi...

