
arxiv: 2605.20006 · v1 · pith:7CXCT2E3new · submitted 2026-05-19 · 💻 cs.AI

GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

Pith reviewed 2026-05-20 05:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords geospatial reasoning · self-play · vision-language models · reinforcement learning · verifiable rewards · executable programs · spatial logic

The pith

A self-play loop lets one vision model generate spatial problems as code, solve them, and train itself via execution-based rewards without large labeled datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoX as a framework in which a single multimodal policy creates spatial reasoning tasks by writing executable programs and then attempts to solve those same tasks using abduction, deduction, and induction over image primitives. A verifier runs the programs against the actual satellite or aerial image to produce reward signals that reinforce the policy through reinforcement learning. This closed loop allows the model to develop geospatial understanding from its own generated data rather than from millions of human-annotated examples. If the approach holds, it points toward training complex spatial reasoning at far lower annotation cost while still reaching or surpassing performance of heavily supervised baselines.

Core claim

GeoX uses one policy to propose spatial problems as executable programs and to solve them across three reasoning modes over spatial primitives and an image tool; the verifier then executes each program to supply a reward that jointly optimizes both proposal and solution roles via reinforcement learning, yielding up to 5.5-point average gains on geospatial tasks that match or exceed models trained on millions of curated examples.

What carries the argument

A self-play loop in which the same multimodal policy generates and solves spatial problems as executable programs, with a verifier supplying rewards from direct program execution on the image.
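The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the scene is a plain dictionary of object counts instead of an image, the proposer and solver are simple functions instead of one vision-language policy, and the names `propose`, `solve`, `verify`, and `self_play_round` are hypothetical. The reinforcement-learning update that would consume the reward is omitted.

```python
import random

# Hypothetical stand-in: a "scene" is a dict mapping class names to instance
# counts. The paper works on satellite or aerial images with a segmenter tool.
Scene = dict

def count_program(arg: str, scene: Scene) -> int:
    """An executable 'problem': count the instances of a class in the scene."""
    return scene.get(arg, 0)

def propose(scene: Scene, rng: random.Random):
    """Proposer role: choose a program and an argument for this scene."""
    arg = rng.choice(sorted(scene))
    return count_program, arg

def solve(program, arg: str, scene: Scene, rng: random.Random) -> int:
    """Solver role: predict the program's output. This toy solver is a noisy
    guesser standing in for the model's prediction."""
    return scene.get(arg, 0) + rng.choice([0, 0, 1])

def verify(program, arg: str, scene: Scene, prediction: int) -> float:
    """Verifier: execute the program and compare its output to the prediction."""
    return 1.0 if program(arg, scene) == prediction else 0.0

def self_play_round(scene: Scene, rng: random.Random):
    """One propose -> solve -> verify step; returns (argument, prediction, reward)."""
    program, arg = propose(scene, rng)
    prediction = solve(program, arg, scene, rng)
    reward = verify(program, arg, scene, prediction)
    return arg, prediction, reward
```

The point of the sketch is that the reward comes from running the program, so no human-written answer key is needed for any generated problem.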

If this is right

  • Base vision-language models gain up to 5.5 points on average across geospatial reasoning benchmarks.
  • Performance reaches or exceeds that of models trained on millions of human-curated examples.
  • A new benchmark of geospatial problems is generated and released through the self-play process itself.
  • Spatial reasoning can be acquired with far less reliance on large-scale human annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verifiable-program self-play pattern could be tested on other grounded reasoning domains such as physics simulation or diagram interpretation.
  • If program execution reliably captures spatial primitives, the released benchmark could become a reusable testbed for measuring interpretable spatial logic in future models.
  • Extending the verifier to handle noisy real-world images might reveal whether the current gains depend on clean satellite data.

Load-bearing premise

The verifier must correctly run the generated programs and return reward signals that truly reflect spatial understanding instead of just rewarding syntactically valid code.

What would settle it

Train the model with the verifier disabled so that only syntactic correctness of programs is rewarded, then measure whether the 5.5-point gains on held-out geospatial questions disappear.
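To make the contrast in this ablation concrete, here is a minimal sketch of the two reward signals. It assumes programs are Python source strings that define a function named `program(arg, scene)`; that entry-point name, the dictionary scene, and both function names are hypothetical and not taken from the paper.

```python
import ast

def syntax_reward(source: str) -> float:
    """Ablated reward: 1.0 if the source parses, regardless of what it computes."""
    try:
        ast.parse(source)
        return 1.0
    except SyntaxError:
        return 0.0

def execution_reward(source: str, scene: dict, arg: str, prediction) -> float:
    """Execution-based reward: 1.0 only if the program runs and its output
    equals the solver's prediction."""
    namespace = {}
    try:
        # Unsandboxed exec is acceptable only in a toy; real use needs isolation.
        exec(source, namespace)
        return 1.0 if namespace["program"](arg, scene) == prediction else 0.0
    except Exception:
        return 0.0
```

Under `syntax_reward`, a wrong prediction about a valid program still earns full reward, which is the failure the proposed experiment is designed to expose.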

Figures

Figures reproduced from arXiv: 2605.20006 by Krishna P. Gummadi, Kyeongjin Ahn, Meeyoung Cha, Seungeon Lee.

Figure 1: Standing and motivation. (a) GeoX outperforms prior VLMs on VQA benchmarks while using zero curated training data; (b) Conventional work derives questions from human-designed templates with answers externally annotated, confining them to predefined patterns. Our framework replaces this paradigm with autonomous self-play, where a single model proposes questions whose answers are programmatically verified, b…
Figure 2: Method overview. A single multimodal policy πθ alternates between a proposer (middle) and a solver (right) that share a call interface (left) of spatial primitives F and tools T (instantiated here with an open-vocabulary segmenter; greyed entries left for future work). Conditioned on image I, the proposer constructs a problem by composing calls into an executable program p paired with an argument a, forming…
Figure 2: The policy first acts as a proposer, constructing an executable problem that realizes a spatial question over an input image. It then acts as a solver, attempting to find the solution for constructed problems. Both roles are driven by verifiable rewards: the proposer is rewarded for devising challenging yet learnable problems, while the solver is rewarded for answering correctly under the three reasoning m…
Figure 3: Seed problem. A segmenter call with the phrase "building", followed by a presence check on the returned mask. The template is instantiated by pairing random object phrases with images, populating each bank with Nseed problems. From this warm start, the proposer moves toward increasingly compositional problems, such as comparing areas across object categories, without further human intervention. 2.3 Progra…
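The seed problem in this caption (a segmenter call followed by a presence check) might look like the sketch below. The `segment` function is a hypothetical stand-in that reads labels from a small grid; the paper uses an open-vocabulary segmenter on real imagery, and its actual call interface is not shown in this excerpt.

```python
def segment(image, phrase):
    """Hypothetical segmenter stub: returns a binary mask (list of rows) that
    marks the cells whose label equals the phrase."""
    return [[1 if cell == phrase else 0 for cell in row] for row in image]

def seed_program(phrase, image):
    """Seed problem: is there at least one pixel of the requested class?"""
    mask = segment(image, phrase)
    return any(any(row) for row in mask)

# Toy "image": a grid of class labels instead of pixels.
toy_image = [
    ["road", "building"],
    ["road", "road"],
]
```

Calling `seed_program("building", toy_image)` returns `True` for this grid, while a phrase that never appears returns `False`.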
Figure 4: Pairwise dimension compositions across datasets. Each node represents one of nine question dimensions; node size reflects how often a dimension appears in a single problem, and edge thickness reflects how often two dimensions co-occur within a single problem.
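The node and edge statistics behind a plot like this can be computed with two counters. The sketch below assumes each problem has already been mapped to a set of dimension labels; the function name `dimension_stats` and the sample problems are hypothetical.

```python
from collections import Counter
from itertools import combinations

def dimension_stats(problems):
    """Return (node_counts, edge_counts) for a list of per-problem dimension sets.

    node_counts[d]      = number of problems that involve dimension d
    edge_counts[(a, b)] = number of problems that involve both a and b (a < b)
    """
    node_counts = Counter()
    edge_counts = Counter()
    for dims in problems:
        unique = sorted(set(dims))  # sort so each pair has one canonical order
        node_counts.update(unique)
        edge_counts.update(combinations(unique, 2))
    return node_counts, edge_counts

# Hypothetical sample: three problems tagged with dimension labels.
sample = [
    {"QUANTITY", "CATEGORY"},
    {"QUANTITY", "COMPARISON", "CATEGORY"},
    {"EXISTENCE"},
]
```

Node size would then scale with `node_counts` and edge thickness with `edge_counts`.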
Figure 5: Qualitative analysis of deduction. Given image I, argument a = "ship", and program p, the solver's Chain-of-Thought follows the program's control flow and predicts ô = "TR", which exactly matches the program-executed label o = p(a; I).
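A program of the kind this deduction example describes (find the largest instance of a class, then report the quadrant of its centroid) could be sketched as follows. The instance representation, the function name, and the toy coordinates are assumptions for illustration; the paper's program p and its primitives are not reproduced here.

```python
def centroid_quadrant(instances, width, height):
    """Return 'TL', 'TR', 'BL', or 'BR' for the centroid of the largest instance.

    instances: list of objects, each a list of (x, y) pixel coordinates.
    Image coordinates are assumed to have y increasing downward.
    """
    largest = max(instances, key=len)  # largest by pixel count
    cx = sum(x for x, _ in largest) / len(largest)
    cy = sum(y for _, y in largest) / len(largest)
    vertical = "T" if cy < height / 2 else "B"
    horizontal = "L" if cx < width / 2 else "R"
    return vertical + horizontal

# Toy data: a small object at the bottom left, a larger one at the top right.
small_ship = [(1, 8), (2, 8)]
large_ship = [(6, 1), (7, 1), (8, 1), (7, 2)]
```

In deduction the solver must predict this output from the image and program text alone; the verifier then runs the program and compares the two labels.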
Figure 6: Usage frequency of primitives in GeoX. Each bar denotes the number of constructed programs that invoke a given operator in F (log scale), grouped by primitive type. Q. Which primitives are used by the proposer during self-play? Across roughly 6,500 programs constructed by the proposer over training, the usage frequency of each primitive in library F reveals which operations are called during self-play, a…
Figure 7: Training dynamics during self-play. Task accuracy of GeoX over training steps on representative geospatial reasoning subtasks drawn from three remote sensing VQA benchmarks: Comparison (RSVQA-HR), Reasoning-based Counting (EarthVQA), and Spatial Relation Classification (GEOBench-VLM). E.2 Evaluation on Object Counting Beyond the VQA results in Section 3.2, we further evaluate GeoX on object counting. Data…
Figure 8: Examples of rule-based mapping of natural language problems.
Figure 9: Examples of rule-based mapping of problems generated in
Figure 10: Qualitative analysis of abduction. Given (I, p, o), the solver infers â by simulating p forward and searching for arguments whose execution reproduces the observed output o.
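The search this caption describes can be illustrated with a brute-force version: run the program on each candidate argument and keep those that reproduce the observed output. Everything here is a toy under stated assumptions. The scene is a dictionary, the candidate list is given explicitly, and scoring a guess by re-executing the program (instead of matching the proposer's original argument) is an assumption about how such a reward could be defined, not a detail confirmed by this excerpt.

```python
def count_instances(arg, scene):
    """Toy program p: number of instances of a class in the scene."""
    return scene.get(arg, 0)

def abduce(program, scene, observed, candidates):
    """Return every candidate argument whose execution reproduces the output."""
    return [a for a in candidates if program(a, scene) == observed]

def abduction_reward(program, scene, observed, guess):
    """Assumed scoring rule: a guess is correct if executing the program on it
    yields the observed output, even if it differs from the original argument."""
    return 1.0 if program(guess, scene) == observed else 0.0

toy_scene = {"ship": 3, "bridge": 2, "vehicle": 3}
```

Because several arguments can map to the same output (here both `ship` and `vehicle` give 3), checking by execution avoids penalizing a valid alternative answer.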
Figure 11: Qualitative analysis of deduction (reproduced from
Figure 12: Qualitative analysis of induction. Given input-output pairs {(a_t, o_t)}, t ∈ V, the solver synthesizes a program p̂ consistent with the visible pairs.
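Checking an induced program comes down to executing it on each pair. The sketch below uses a dictionary scene and two candidate programs; the function names are hypothetical, and the use of a held-out pair to catch memorization is an assumption suggested by the phrase "visible pairs", not a confirmed detail of the paper.

```python
def consistent(program, scene, pairs):
    """True if the program reproduces every (argument, output) pair."""
    return all(program(arg, scene) == out for arg, out in pairs)

def counting_candidate(arg, scene):
    """Candidate that generalizes: count instances of the class."""
    return scene.get(arg, 0)

def memorizing_candidate(arg, scene):
    """Candidate that only memorizes the visible pairs."""
    return {"vehicle": 3, "bridge": 2}.get(arg, 0)

toy_scene = {"vehicle": 3, "bridge": 2, "football-field": 1}
visible_pairs = [("vehicle", 3), ("bridge", 2)]
hidden_pairs = [("football-field", 1)]  # assumed held-out check
```

Both candidates agree with the visible pairs, but only the counting program also reproduces the held-out pair.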
Original abstract

Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program to produce a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated examples. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces GeoX, a self-play framework for geospatial reasoning in vision-language models. A single multimodal policy proposes spatial problems as executable programs and solves them via abduction, deduction, and induction over spatial primitives plus an image tool. A verifier executes the programs to supply reward signals that jointly optimize the policy through reinforcement learning. The work claims consistent improvements of up to 5.5 points on base VLMs, matching or exceeding conventional baselines trained on millions of curated examples, and releases a benchmark accumulated through the same self-play process.

Significance. If the reported gains prove robust and the rewards demonstrably capture genuine spatial understanding rather than program syntax, the framework offers a scalable alternative to large-scale human annotation for complex visual reasoning. The self-play plus verifiable-execution design and the public benchmark release would be concrete strengths for the field.

major comments (3)
  1. [Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.
  2. [Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.
  3. [Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.
minor comments (1)
  1. [Abstract] The abstract states that rewards come from 'program execution' but does not specify the execution environment or failure modes; a short paragraph clarifying this would improve reproducibility.
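The degeneracy concern in major comment 2 suggests a simple diagnostic: run a proposed program on several different scenes and flag it if the output never changes. This is a sketch of one possible check, not a mechanism the paper is known to use; `is_degenerate` and the toy scenes are hypothetical.

```python
def is_degenerate(program, arg, scenes):
    """Flag a program whose output is identical on every scene provided.

    A constant output across varied scenes suggests the program does not
    depend on scene content (for example, a tautological query).
    """
    outputs = {repr(program(arg, scene)) for scene in scenes}
    return len(outputs) <= 1

def tautology(arg, scene):
    """Degenerate example: ignores the scene entirely."""
    return True

def count_instances(arg, scene):
    """Scene-dependent example: number of instances of a class."""
    return scene.get(arg, 0)

varied_scenes = [{"ship": 1}, {"ship": 4}, {}]
```

The check is only as good as the scene sample: a legitimate program can look constant if the scenes happen to agree, so the scenes need to be diverse.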

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of a 5.5-point average improvement that matches million-example baselines is presented without any description of the evaluation datasets, metrics, baselines, number of runs, or error bars. This information is load-bearing for the data-efficiency argument and must be supplied with concrete numbers and controls.

    Authors: We agree that the abstract requires additional context to support the reported gains. In the revised manuscript we will expand the abstract to name the evaluation datasets (standard geospatial VQA and spatial-reasoning benchmarks), the metric (accuracy), the baselines (models trained on millions of curated examples), the number of runs, and error bars. These details will be placed in the abstract while preserving its length constraints. revision: yes

  2. Referee: [Method] Method description (self-play loop): nothing rules out the policy converging on degenerate but easily executable programs (e.g., tautological queries or syntax patterns that pass the verifier without testing scene geometry). The paper must show how the verifier and the three reasoning modes prevent such collapse; otherwise the reward signal may not reflect spatial understanding.

    Authors: This concern is well-founded. The verifier executes programs against actual image primitives, so only programs that correctly query scene geometry receive positive reward; purely syntactic or tautological programs yield zero or negative reward on varied images. The three modes further constrain the policy: abduction requires inferring unobserved spatial facts, deduction applies logical rules to the observed primitives, and induction generalizes across scenes. In the revision we will add a dedicated paragraph with program-complexity statistics and an ablation that removes individual modes, showing increased degeneracy when any mode is absent. revision: yes

  3. Referee: [Experiments / Benchmark] Benchmark release: because the benchmark is itself accumulated through the same self-play loop, the manuscript needs to demonstrate that performance gains transfer to held-out external geospatial benchmarks rather than remaining internal to the generated distribution.

    Authors: We accept that transfer to external distributions must be shown explicitly. Although the self-play benchmark supplies scalable training data, the revised manuscript will include new evaluation results on held-out external geospatial benchmarks (distinct from the self-play distribution) to confirm that the observed improvements generalize. We will report these numbers alongside the internal held-out splits. revision: yes

Circularity Check

0 steps flagged

Self-play benchmark is internally generated but central performance gains remain empirically reported without definitional reduction

full rationale

The paper presents GeoX as a self-play RL loop in which a single policy generates executable spatial programs (as problems) and solves them via abduction, deduction, and induction, with rewards obtained by external program execution on an image tool. The abstract explicitly states that the released benchmark is 'accumulated through self-play,' which creates a potentially closed distribution. However, no equations, fitted parameters, or self-citations are shown that would make the reported 5.5-point gains equivalent to the input distribution by construction. The improvement is described as an observed outcome after RL optimization rather than a quantity defined in terms of the verifier success rate or the generated problems themselves. No load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text. The derivation therefore remains non-circular with respect to external benchmarks, even if the evaluation distribution overlaps with the training loop; this warrants only a minor (non-load-bearing) circularity flag.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that generated programs can be executed to yield accurate, non-trivial reward signals for spatial reasoning. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption A single multimodal policy can generate valid executable programs that capture meaningful spatial relations over image primitives.
    This premise enables the self-play loop described in the abstract.
  • domain assumption The verifier produces reward signals that improve genuine spatial understanding rather than rewarding program syntax alone.
    Required for the RL optimization to translate into the claimed performance gains.


