GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards
Pith reviewed 2026-05-20 05:29 UTC · model grok-4.3
The pith
A self-play loop lets one vision model generate spatial problems as code, solve them, and train itself via execution-based rewards without large labeled datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoX uses one policy both to propose spatial problems as executable programs and to solve them in three reasoning modes over spatial primitives and an image tool. A verifier executes each program and returns a reward that jointly optimizes the proposer and solver roles via reinforcement learning. The reported result is an average gain of up to 5.5 points over the base VLMs on geospatial tasks, matching or exceeding models trained on millions of curated examples.
What carries the argument
A self-play loop in which the same multimodal policy generates and solves spatial problems as executable programs, with a verifier supplying rewards from direct program execution on the image.
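A minimal sketch of that loop, assuming a toy scene of labeled boxes in place of a real image and a fixed proposal in place of a sampled one. Every name here (`Scene`, `propose`, `run_program`, `verify`, `self_play_step`) is hypothetical; the paper's actual interfaces, primitives, and reward shaping are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for an image: (class_name, x, y, w, h) boxes.
Box = Tuple[str, float, float, float, float]


@dataclass
class Scene:
    boxes: List[Box]


def count(scene: Scene, cls: str) -> int:
    """Toy spatial primitive: number of instances of a class."""
    return sum(1 for b in scene.boxes if b[0] == cls)


# The only names a proposed program may call (assumed whitelist).
PRIMITIVES = {"count": count}


def propose() -> str:
    """Proposer role: emit a problem as program source.
    Fixed here; a real policy would sample it."""
    return "def f(scene, cls):\n    return count(scene, cls)\n"


def run_program(src: str, scene: Scene, arg: str):
    """Execute the program against the scene.
    Emptying __builtins__ is NOT a real sandbox; it only keeps the toy small."""
    env = {"__builtins__": {}, **PRIMITIVES}
    exec(src, env)
    return env["f"](scene, arg)


def verify(src: str, scene: Scene, arg: str, answer) -> float:
    """Reward 1.0 iff the answer equals the executed result, else 0.0.
    Programs that fail to parse or run also earn 0.0."""
    try:
        truth = run_program(src, scene, arg)
    except Exception:
        return 0.0
    return 1.0 if answer == truth else 0.0


def self_play_step(scene: Scene, arg: str,
                   solver: Callable[[Scene, str, str], object]) -> float:
    """One round: propose a program, let the solver answer, score by execution."""
    src = propose()
    answer = solver(scene, src, arg)
    return verify(src, scene, arg, answer)
```

In a real system the reward would feed a policy-gradient update for both roles; here it is only returned, so the sketch shows where the signal comes from rather than how it is optimized.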
If this is right
- Base vision-language models gain up to 5.5 points on average across geospatial reasoning benchmarks.
- Performance reaches or exceeds that of models trained on millions of human-curated examples.
- A new benchmark of geospatial problems is generated and released through the self-play process itself.
- Spatial reasoning can be acquired with far less reliance on large-scale human annotation.
Where Pith is reading between the lines
- The same verifiable-program self-play pattern could be tested on other grounded reasoning domains such as physics simulation or diagram interpretation.
- If program execution reliably captures spatial primitives, the released benchmark could become a reusable testbed for measuring interpretable spatial logic in future models.
- Extending the verifier to handle noisy real-world images might reveal whether the current gains depend on clean satellite data.
Load-bearing premise
The verifier must correctly run the generated programs and return reward signals that truly reflect spatial understanding instead of just rewarding syntactically valid code.
What would settle it
Train the model with the verifier disabled so that only syntactic correctness of programs is rewarded, then measure whether the 5.5-point gains on held-out geospatial questions disappear.
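The two arms of that ablation can be sketched as two reward functions. This is an illustration under assumptions: the function names are hypothetical, the program is assumed to define a one-argument function `f`, and the paper's actual reward is not specified on this page.

```python
import ast


def syntax_only_reward(program_src: str) -> float:
    """Ablation arm: reward any program that parses, without running it."""
    try:
        ast.parse(program_src)
    except SyntaxError:
        return 0.0
    return 1.0


def execution_reward(program_src: str, scene, answer) -> float:
    """Control arm: reward only when the answer matches the executed program.
    Assumes the program defines f(scene); unsandboxed exec, for illustration only."""
    env = {}
    try:
        exec(program_src, env)
        truth = env["f"](scene)
    except Exception:
        return 0.0
    return 1.0 if answer == truth else 0.0
```

The point of the contrast: a wrong answer to a well-formed program still earns full reward under `syntax_only_reward`, but nothing under `execution_reward`.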
Original abstract
Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program to obtain a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated examples. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.
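The abstract names the three modes without defining them operationally. The sketch below assumes the convention used in program-based self-play work such as Absolute Zero (deduction predicts an output from a program and input, abduction recovers an input from a program and output, induction recovers a program from input/output pairs). GeoX may define the modes differently, and all function names are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Tuple

# Hypothetical program type: maps one input to one hashable answer.
Program = Callable[[Hashable], Hashable]


def check_deduction(program: Program, x: Hashable, predicted_output: Hashable) -> bool:
    """Deduction: given program and input, the solver predicts the output."""
    return program(x) == predicted_output


def check_abduction(program: Program, target_output: Hashable,
                    predicted_input: Hashable) -> bool:
    """Abduction: given program and output, the solver proposes an input.
    Any input that reproduces the output is accepted."""
    return program(predicted_input) == target_output


def check_induction(examples: Iterable[Tuple[Hashable, Hashable]],
                    predicted_program: Program) -> bool:
    """Induction: given input/output pairs, the solver proposes a program
    that must reproduce every pair."""
    return all(predicted_program(x) == y for x, y in examples)
```

Each checker reduces to running a program and comparing values, which is what makes all three modes verifiable by execution.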
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoX, a self-play framework for geospatial reasoning in vision-language models. A single multimodal policy proposes spatial problems as executable programs and solves them via abduction, deduction, and induction over spatial primitives plus an image tool. A verifier executes the programs to supply reward signals that jointly optimize the policy through reinforcement learning. The work claims consistent improvements of up to 5.5 points on base VLMs, matching or exceeding conventional baselines trained on millions of curated examples, and releases a benchmark accumulated through the same self-play process.
Significance. If the reported gains prove robust and the rewards demonstrably capture genuine spatial understanding rather than program syntax, the framework offers a scalable alternative to large-scale human annotation for complex visual reasoning. The self-play plus verifiable-execution design and the public benchmark release would be concrete strengths for the field.
major comments (3)
- [Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.
- [Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.
- [Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.
minor comments (1)
- [Abstract] The abstract states that rewards come from 'program execution' but does not specify the execution environment or failure modes; a short paragraph clarifying this would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes planned for the revised manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.
  Authors: We agree that the abstract requires additional context to support the reported gains. In the revised manuscript we will expand the abstract to name the evaluation datasets (standard geospatial VQA and spatial-reasoning benchmarks), the metric (accuracy), the baselines (models trained on millions of curated examples), the number of runs, and error bars. These details will be placed in the abstract while preserving its length constraints.
  revision: yes
- Referee: [Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.
  Authors: This concern is well-founded. The verifier executes programs against actual image primitives, so only programs that correctly query scene geometry receive positive reward; purely syntactic or tautological programs yield zero or negative reward on varied images. The three modes further constrain the policy: abduction requires inferring unobserved spatial facts, deduction applies logical rules to the observed primitives, and induction generalizes across scenes. In the revision we will add a dedicated paragraph with program-complexity statistics and an ablation that removes individual modes, showing increased degeneracy when any mode is absent.
  revision: yes
- Referee: [Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.
  Authors: We accept that transfer to external distributions must be shown explicitly. Although the self-play benchmark supplies scalable training data, the revised manuscript will include new evaluation results on held-out external geospatial benchmarks (distinct from the self-play distribution) to confirm that the observed improvements generalize. We will report these numbers alongside the internal held-out splits.
  revision: yes
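The degeneracy concern in the second exchange can be made concrete with one possible guard: run a proposed program on several scenes and reject it if its output never changes. This check is an illustration only; nothing on this page says GeoX uses it, and `is_degenerate` and `min_distinct` are hypothetical names.

```python
from typing import Callable, Hashable, Sequence


def is_degenerate(program: Callable[[object], Hashable],
                  scenes: Sequence[object],
                  min_distinct: int = 2) -> bool:
    """Flag a program whose output does not depend on the scene.

    A tautological query returns the same value everywhere, so it yields
    fewer than `min_distinct` distinct outputs across varied scenes.
    Scenes on which the program crashes contribute no output.
    """
    outputs = set()
    for scene in scenes:
        try:
            outputs.add(program(scene))
        except Exception:
            continue
    return len(outputs) < min_distinct
```

A filter like this catches constant programs but not programs that vary for trivial reasons (for example, returning the image width), so it is a necessary check rather than a sufficient one.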
Circularity Check
Self-play benchmark is internally generated but central performance gains remain empirically reported without definitional reduction
full rationale
The paper presents GeoX as a self-play RL loop in which a single policy generates executable spatial programs (as problems) and solves them via abduction/deduction/induction, with rewards obtained by external program execution on an image tool. The abstract explicitly states that the released benchmark is 'accumulated through self-play,' which creates a potential closed distribution. However, no equations, fitted parameters, or self-citations are shown that would make the reported 5.5-point gains equivalent to the input distribution by construction. The improvement is described as an observed outcome after RL optimization rather than a quantity defined in terms of the verifier success rate or the generated problems themselves. No load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text. The derivation therefore stays self-contained against external benchmarks even if the evaluation distribution overlaps with the training loop; this warrants only a minor (non-load-bearing) circularity flag.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A single multimodal policy can generate valid executable programs that capture meaningful spatial relations over image primitives.
- domain assumption: The verifier produces reward signals that improve genuine spatial understanding rather than rewarding program syntax alone.
Reference graph
Works this paper leans on
- [1] Donghyun Ahn, Minhyuk Song, Seungeon Lee, Yubin Choi, Jihee Kim, Sangyoon Park, Hyunjoo Yang, and Meeyoung Cha. Fine-grained socioeconomic prediction from satellite images with distributional adjustment. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3717–3721, 2023.
- [2] Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Georeg: Weight-constrained few-shot regression for socio-economic estimation using llm. arXiv preprint arXiv:2507.13323, 2025.
- [3] Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Generalizable disaster damage assessment via change detection with vision foundation model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27784–27792, 2025.
- [4] Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, and Meeyoung Cha. Mapping reduced accessibility to wash facilities in rohingya refugee camps with sub-meter imagery. arXiv preprint arXiv:2511.07231, 2025.
- [5] Kanzhi Cheng, Li YanTao, Fangzhi Xu, Jianbing Zhang, Hao Zhou, and Yang Liu. Vision-language models can self-improve reasoning via reflection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8876–8892, 2025.
- [6] Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan. GEOBench-VLM: Benchmarking vision-language models for geospatial tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7132–7142, 2025.
- [7] Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, and Wei Wang. Enhancing large vision language models with self-training on image comprehension. Advances in Neural Information Processing Systems, 37:131369–131397, 2024.
- [8] Max J. Egenhofer and Robert D. Franzosa. Point-set topological spatial relations. International Journal of Geographical Information Systems, 5(2):161–174, 1991.
- [9] Ritwik Gupta, Bryce Goodman, Nirav Patel, Ricky Hosfelt, Sandra Sajeev, Eric Heim, Jigar Doshi, Keane Lucas, Howie Choset, and Matthew Gaston. Creating xbd: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 10–17, 2019.
- [10] Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025.
- [11] Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27831–27840, 2024.
- [12] Sumin Lee, Sungwon Park, Jeasurk Yang, Jihee Kim, and Meeyoung Cha. Generalizable slum detection from satellite imagery with mixture-of-experts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38826–38834, 2026.
- [13] Kaiyu Li, Shengqi Zhang, Yujie Wang, Yupeng Deng, Zhi Wang, Deyu Meng, and Xiangyong Cao. Segearth-ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images. arXiv preprint arXiv:2512.08730, 2025.
- [14] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.
- [15] Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=lJ7zecny2e
- [16] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. RSVQA: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020. doi: 10.1109/TGRS.2020.2988782
- [17] Yang Long, Yiping Gong, Zhifeng Xiao, and Qing Liu. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55(5):2486–2498, 2017.
- [18] Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024.
- [19] Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6381–6388, 2025.
- [20] Qian Shi, Da He, Zhengyu Liu, Xiaoping Liu, and Jingqian Xue. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping. Journal of Remote Sensing, 3:0078, 2023.
- [21] Minhyuk Song, Sungwon Han, Seungeon Lee, Donghyun Ahn, Jihee Kim, and Meeyoung Cha. Measuring fine-grained urban air temperature with satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28397–28404, 2025.
- [22] Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025.
- [23] Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up remote sensing segmentation dataset with segment anything model. Advances in Neural Information Processing Systems, 36:8815–8827, 2023.
- [24] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1. Curran Associates, Inc., 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf
- [26] Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5481–5489, 2024. doi: 10.1609/aaai.v38i6.28357
- [28] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [29] Jeasurk Yang, Donghyun Ahn, Junbeom Bahk, Sungwon Park, Nurrokhmah Rizqihandari, and Meeyoung Cha. Assessing climate risks from satellite imagery with machine learning: A case study of flood risks in jakarta. Climate Risk Management, 46:100651, 2024.
- [30] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Proc. the 36th International Conference on Neural Information Processing Systems, volume 1126, pages 0–55, 2024.
- [31] Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024.
- [32] Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57(8):5535–5548, 2019.
- [33] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
A Related Work
A.1 Vision-Language Models for remote sensing
Recent progress has extended ...
- [34] Primitive coverage. Every primitive in the call interface F (Section 2.3) is assigned to exactly one dimension, ensuring that any program constructed by the proposer is representable in the nine dimensions.
- [35] Benchmark coverage. Every question in our evaluation benchmarks (RSVQA-HR, EarthVQA, GEOBench-VLM) is assigned to at least one dimension, ensuring a fair side-by-side comparison. Each problem is mapped to a set of dimensions rather than a single one, since real spatial questions routinely combine multiple atomic geospatial concepts. For instance, “which class occupies the largest area?” ...
- [36] For each operational dimension d ∈ {EXISTENCE, QUANTITY, COVERAGE, COMPARISON, RELATION, OVERLAP}, add d to C(p) if any associated regular expression matches a substring of s.
- [37] Add CATEGORY if p contains a string literal naming one of the 19 object classes (e.g. vehicle, ship, building, tree).
- [38] Add SCENE if p contains a string literal naming one of the 11 scene classes (road, water, parking_lot, vegetation, forest, grass, cropland, land, bareland, intersection, roundabout) or the identifier scene_classes.
- [39] “Are there more buildings than roads in this image?”
  Add DIRECTION if p contains a string literal drawn from the set of 18 direction tokens (cardinal directions and quadrant abbreviations).
  Verification. We verified the procedure by manually annotating a held-out sample of 100 problems (25 from each of RSVQA-HR, EarthVQA, GEOBench-VLM, and GeoX) and observed that every problem agreed on dimension membership. ...
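The tagging steps in items [36] to [39] can be sketched as a small classifier over program source. This is a sketch under loud assumptions: the regular expressions are invented placeholders (the excerpt says each operational dimension has associated patterns but does not list them), only 4 of the 19 object classes are named in the excerpt, and the direction tokens are a guessed subset of the 18. Only the 11 scene classes are taken verbatim from the text.

```python
import re
from typing import Dict, List, Set

# HYPOTHETICAL patterns; the excerpt does not give the real ones.
OPERATIONAL_PATTERNS: Dict[str, List[str]] = {
    "EXISTENCE": [r"\bexists\("],
    "QUANTITY": [r"\bcount\("],
    "COVERAGE": [r"\barea\("],
    "COMPARISON": [r"[<>]=?", r"\bmax\(", r"\bmin\("],
    "RELATION": [r"\bdistance\("],
    "OVERLAP": [r"\biou\(", r"\bintersects\("],
}

# Only the four examples named in the excerpt (the full list has 19 classes).
OBJECT_CLASSES = {"vehicle", "ship", "building", "tree"}

# The 11 scene classes, as listed in the excerpt.
SCENE_CLASSES = {
    "road", "water", "parking_lot", "vegetation", "forest", "grass",
    "cropland", "land", "bareland", "intersection", "roundabout",
}

# ASSUMED subset of the 18 direction tokens.
DIRECTION_TOKENS = {"north", "south", "east", "west", "TL", "TR", "BL", "BR"}

# Matches single- or double-quoted literals (no escape handling).
STRING_LITERAL = re.compile(r"""(['"])(.*?)\1""")


def classify(source: str) -> Set[str]:
    """Return the set of dimensions C(p) for a program given as source text."""
    dims = {
        dim
        for dim, patterns in OPERATIONAL_PATTERNS.items()
        if any(re.search(p, source) for p in patterns)
    }
    literals = {m.group(2) for m in STRING_LITERAL.finditer(source)}
    if literals & OBJECT_CLASSES:
        dims.add("CATEGORY")
    if literals & SCENE_CLASSES or re.search(r"\bscene_classes\b", source):
        dims.add("SCENE")
    if literals & DIRECTION_TOKENS:
        dims.add("DIRECTION")
    return dims
```

Because the result is a set, one program can carry several dimensions at once, which matches the excerpt's point that real spatial questions combine multiple concepts.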
- [40] The input class is "ship".
- [41] In the image, there are multiple ships docked at the harbor.
- [42] The task is to find the largest ship and determine the quadrant where its center of mass lies.
- [43] From the image, the long vessel docked at the top of the image appears to be the largest ship.
- [44] In which quadrant (TL, TR, BL, BR) does the centroid of the largest ship lie in the image?
  The center of mass calculation will be applied to this ship to determine the quadrant it falls into. Let's calculate the center of mass for the largest ship. The coordinates of the ship in the image are approximately: top-left corner (260, 60), bottom-right corner (540, 130). The center of mass calculation for a rectangle is at the midpoint of the top...
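The arithmetic in this trace can be checked with a short sketch. The function names are hypothetical, and the image size is not given in the excerpt, so `width` and `height` must be supplied by the caller. The convention that a point exactly on the midline falls to the bottom or right is also an assumption.

```python
from typing import Tuple


def box_center(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float]:
    """Midpoint of an axis-aligned box given its top-left and bottom-right corners."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def quadrant(cx: float, cy: float, width: float, height: float) -> str:
    """Quadrant label in image coordinates (origin top-left, y grows downward)."""
    vertical = "T" if cy < height / 2.0 else "B"
    horizontal = "L" if cx < width / 2.0 else "R"
    return vertical + horizontal


# Box from the trace: top-left (260, 60), bottom-right (540, 130).
center = box_center(260, 60, 540, 130)  # → (400.0, 95.0)
```

For the box in the trace the midpoint is (400.0, 95.0). Which quadrant that lands in depends on the image dimensions, and the box midpoint equals the true centroid only when the object fills its bounding box.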
- [45] **Input 0**: `vehicle` - **Output 0**: `3` - From the image, we can see three vehicles on the road.
- [46] **Input 1**: `bridge` - **Output 1**: `2` - From the image, we can see two bridges spanning over the water.
- [47] **Input 2**: `football-field` - **Output 2**: `1` - From the image, we can see one football field in the top-right corner. Based on these observations, the function needs to count the number of instances of the given class. The class names are `vehicle`, `bridge`, and `football-field`, and the outputs are the counts of instances for each class. The functi...
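The function this trace is converging on can be sketched as follows, together with a check against the three input/output pairs from items [45] to [47]. The detection format and the names `Detection`, `count_instances`, and `fits_examples` are hypothetical; the excerpt is truncated before the model's own program appears.

```python
from typing import Callable, List, Tuple

# Hypothetical detection format: (class_name, (x1, y1, x2, y2)).
Detection = Tuple[str, Tuple[float, float, float, float]]


def count_instances(detections: List[Detection], class_name: str) -> int:
    """Number of detections whose class matches `class_name`."""
    return sum(1 for cls, _ in detections if cls == class_name)


def fits_examples(candidate: Callable[[str], int],
                  examples: List[Tuple[str, int]]) -> bool:
    """Induction check: the candidate must reproduce every input/output pair."""
    return all(candidate(x) == y for x, y in examples)


# Input/output pairs quoted in the trace.
EXAMPLES = [("vehicle", 3), ("bridge", 2), ("football-field", 1)]
```

A verifier for the induction mode would call `fits_examples` with the candidate bound to the scene's detections, for example `lambda c: count_instances(detections, c)`.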