pith. machine review for the scientific record.

arxiv: 2511.22396 · v2 · submitted 2025-11-27 · 💻 cs.CV · cs.AI

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Pith reviewed 2026-05-17 04:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensing, vision-language models, evidence-based reasoning, iterative reasoning, multi-agent systems, reinforcement learning, visual question answering, grounding

The pith

Remote sensing vision models overcome pseudo-reasoning by iteratively seeking visual evidence in large images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models tend to describe a reasoning process without truly examining remote sensing images, relying instead on language consistency due to a coarse initial view of vast scenes. The paper proposes RS-EoT to change this by creating an iterative process where the model reasons, checks visual evidence, and repeats. This is achieved using a self-play system called SocraticAgent to generate example reasoning traces and then applying reinforcement learning in stages to strengthen the behavior. If the claim holds, these models would produce answers backed by actual image details rather than plausible but unexamined narratives, improving reliability on tasks like visual question answering and object grounding in satellite imagery.

Core claim

The paper establishes that RS-EoT, a language-driven iterative visual evidence-seeking paradigm, when instilled via SocraticAgent's self-play multi-agent synthesis of alternating reasoning and inspection cycles and refined through two-stage progressive reinforcement learning on grounding followed by VQA, enables genuine evidence-grounded reasoning that mitigates the Glance Effect in remote sensing tasks.

What carries the argument

RS-EoT, the iterative paradigm of alternating reasoning steps with visual inspections to build evidence-based conclusions instead of linguistic self-consistency.
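
The alternation described here can be sketched as a small control loop. This is an illustrative sketch under stated assumptions, not the paper's implementation: `evidence_loop`, `toy_reasoner`, and `toy_perceiver` are hypothetical names, and in a real system both roles would be backed by a vision-language model rather than stubs.

```python
from typing import Callable, List, Optional, Tuple

Evidence = List[Tuple[str, str]]  # (visual query, observation) pairs

def evidence_loop(
    question: str,
    reasoner: Callable[[str, Evidence], Tuple[str, bool]],
    perceiver: Callable[[str], str],
    max_rounds: int = 4,
) -> Tuple[Optional[str], Evidence]:
    """Alternate reasoning and visual inspection until an answer is committed.

    The reasoner returns (text, is_final). When is_final is False, the text
    is treated as a query to send to the perceiver, which inspects the image.
    """
    evidence: Evidence = []
    for _ in range(max_rounds):
        text, is_final = reasoner(question, evidence)
        if is_final:
            return text, evidence
        evidence.append((text, perceiver(text)))
    # Inspection budget exhausted without a committed answer.
    return None, evidence

# Hypothetical stubs standing in for model calls.
def toy_reasoner(question: str, evidence: Evidence) -> Tuple[str, bool]:
    if not evidence:
        return "How many ships are at the pier?", False
    return f"Answer: {evidence[-1][1]}", True

def toy_perceiver(query: str) -> str:
    return "3"

answer, trace = evidence_loop("How many ships are docked?", toy_reasoner, toy_perceiver)
```

The point of the structure is that the final answer is only produced after at least one observation has been appended to the trace, which is what separates it from a single-pass narration.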

If this is right

  • RS-EoT models reach state-of-the-art accuracy on several remote sensing visual question answering benchmarks.
  • The approach produces observable iterative cycles of reasoning and evidence checking in model outputs.
  • Training first on fine-grained grounding tasks builds the core capability before generalizing to broader questions.
  • Models shift from pseudo-reasoning to answers that depend on specific visual details in the imagery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might apply to other vision domains involving wide-area or high-resolution images where initial glances miss critical details.
  • Similar self-play agents could be used to bootstrap evidence-seeking in non-visual reasoning tasks.
  • Future systems may default to multi-step visual verification as a core training objective rather than post-hoc prompting.

Load-bearing premise

The reasoning traces generated by the self-play multi-agent system actually come from looking at image content rather than reproducing common language patterns.

What would settle it

Edit the remote sensing image to alter a key visual feature that should change the correct answer, then check if the model updates its response based on the new evidence or sticks to the original output.
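
One way to run this check is to measure how often the answer moves to the new ground truth after the edit. The following is a minimal sketch, assuming a model is any callable taking an image and a question; `counterfactual_flip_rate`, the case tuple layout, and the two toy models are hypothetical, and images are represented by stand-in dictionaries.

```python
import math

def counterfactual_flip_rate(model, cases):
    """Fraction of initially correct cases where the answer follows the edit.

    Each case is (image, edited_image, question, answer, edited_answer).
    Cases the model gets wrong before the edit are excluded from the score.
    """
    eligible = 0
    updated = 0
    for image, edited_image, question, answer, edited_answer in cases:
        if model(image, question) != answer:
            continue
        eligible += 1
        if model(edited_image, question) == edited_answer:
            updated += 1
    return updated / eligible if eligible else math.nan

# Toy stand-ins: one model reads the image, the other ignores it.
def grounded_model(image, question):
    return image["ships"]

def prior_model(image, question):
    return 3

CASES = [
    ({"ships": 3}, {"ships": 0}, "How many ships are docked?", 3, 0),
    ({"ships": 3}, {"ships": 5}, "How many ships are docked?", 3, 5),
]

grounded_rate = counterfactual_flip_rate(grounded_model, CASES)
prior_rate = counterfactual_flip_rate(prior_model, CASES)
```

A model that relies on language priors would keep its original answer after the edit and score near zero, while an evidence-driven model would score near one.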

Figures

Figures reproduced from arXiv: 2511.22396 by Bolei He, Haifeng Li, Hongyuan Yuan, Linrui Xu, Run Shao, Wang Guo, Xinran He, Yijun Chen, Yiming Yan, Yongxing Dai, Zhaoyang Zhang, Ziyu Li.

Figure 1: Illustration of the pseudo reasoning problem and our RS-EoT solution. (a) Existing models show pseudo reasoning: explicit think...
Figure 2: Overview of our method to instill the RS-EoT paradigm.
Figure 3: Case studies comparing RS-EoT-7B with prior multimodal reasoning models on (top) Remote Sensing General QA and (bottom)...
Figure 4: Token-wise attention visualization on eight randomly sampled cases. The y-axis represents the proportion of attention allocated...
Figure 5: The reward curve for the VQA RL stage. The stable...
Figure 6: The system prompt for the Reasoner in SocraticAgent.
Figure 7: The system prompt for the Perceiver in SocraticAgent.
Figure 8: The system prompt for the Verifier in SocraticAgent.
Figure 10: Ablation comparing reinforcement learning on the...
Figure 11: Reasoning cases of RS-EoT-7B (Part 1).
Figure 12: Reasoning cases of RS-EoT-7B (Part 2).
Figure 13: Reasoning cases of RS-EoT-7B (Part 3).
Figure 14: Reasoning cases of RS-EoT-7B (Part 4).
Figure 15: Reasoning cases of RS-EoT-7B (Part 5).

Original abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that VLMs exhibit pseudo-reasoning on remote sensing tasks due to the Glance Effect (coarse perception of large-scale imagery leading to linguistic self-consistency rather than visual evidence). It proposes RS-EoT, a language-driven iterative evidence-seeking paradigm, implemented via SocraticAgent (a self-play multi-agent system synthesizing traces through alternating reasoning and visual inspection cycles) and a two-stage progressive RL strategy (RL on fine-grained grounding followed by RL on RS VQA). Experiments report SOTA performance on multiple RS VQA and grounding benchmarks, with analyses showing iterative cycles that mitigate the Glance Effect and enable evidence-grounded reasoning; code, data, and models are released.

Significance. If the central claim holds and the synthesized traces reflect genuine visual evidence-seeking rather than linguistic patterns, RS-EoT could meaningfully advance multimodal reasoning for remote sensing by providing a controllable way to enforce iterative inspection and reduce the Glance Effect. The open release of code, data, and models is a clear strength for reproducibility. However, the significance is currently limited by insufficient verification that the self-play and RL stages produce behavior driven by actual visual evidence rather than self-consistent narration.

major comments (2)
  1. [Abstract] Abstract: the claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking' confirming mitigation of the Glance Effect is load-bearing for the central contribution, yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.
  2. [Experiments] Experiments section: SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.
minor comments (1)
  1. The abstract and method description introduce several new terms (SocraticAgent, RS-EoT paradigm, two-stage progressive RL) without a concise summary table or diagram early in the paper that would help readers track the relationships between components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking' confirming mitigation of the Glance Effect is load-bearing for the central contribution, yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.

    Authors: We agree that the abstract would be strengthened by explicit reference to quantitative evidence supporting the central claim. In the revised manuscript we will add a concise clause noting key ablation results on the visual-inspection branch together with grounding-fidelity metrics that differentiate evidence-driven cycles from language-only self-play. These additions will remain within abstract length limits while directing readers to the detailed analyses in Section 4. revision: yes

  2. Referee: [Experiments] Experiments section: SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.

    Authors: We acknowledge that clearer reporting of experimental controls would improve transparency. The revised experiments section will include error bars on all reported metrics, paired statistical significance tests against baselines, and an explicit language-only self-play control to isolate the contribution of the visual-inspection component. These details were partially present but will be expanded and highlighted for clarity. revision: yes
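
The error bars and paired significance tests discussed in this exchange could take the form of a paired bootstrap over per-example correctness. This is a minimal sketch of that general technique, not a procedure taken from the paper; `paired_bootstrap` and its parameters are hypothetical names, and the inputs are assumed to be 0/1 correctness flags for two systems on the same examples.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Mean accuracy difference (A minus B) with a 95% percentile interval.

    Resampling is done over examples, keeping each A/B pair together so that
    per-example difficulty is shared between the two systems.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[n_resamples * 25 // 1000]
    high = means[n_resamples * 975 // 1000 - 1]
    return observed, (low, high)
```

If the interval excludes zero, the accuracy gap is unlikely to be an artifact of which examples happened to be in the benchmark; an interval that straddles zero would weaken a state-of-the-art claim.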

Circularity Check

0 steps flagged

No significant circularity in RS-EoT derivation chain

full rationale

The paper proposes SocraticAgent for synthesizing reasoning traces via self-play multi-agent cycles and applies a two-stage RL procedure (grounding, then VQA) before reporting SOTA results on external RS VQA and grounding benchmarks. No equations, predictions, or central claims reduce by construction to the method's own inputs; the iterative cycles are described as produced outputs whose presence is confirmed by post-hoc analyses rather than presupposed by definition. The derivation chain does not rely on load-bearing self-citations or on renaming fitted inputs as predictions, and evaluation is performed on separate benchmark tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumption that iterative visual evidence-seeking can be instilled through self-generated traces and staged RL without introducing new pseudo-reasoning artifacts; no explicit free parameters or invented physical entities are described.

axioms (2)
  • domain assumption: Self-play multi-agent interaction can produce high-quality reasoning traces that reflect genuine visual grounding for remote sensing imagery.
    Invoked in the description of SocraticAgent synthesizing traces via alternating cycles of reasoning and visual inspection.
  • domain assumption: Progressive RL first on grounding then on VQA will generalize the RS-EoT capability to broader understanding scenarios.
    Stated as the two-stage strategy to enhance and generalize the patterns.
invented entities (2)
  • SocraticAgent (no independent evidence)
    purpose: Self-play multi-agent system to synthesize reasoning traces for RS-EoT
    Newly proposed component that alternates reasoning and visual inspection to generate training data.
  • RS-EoT paradigm (no independent evidence)
    purpose: Language-driven iterative visual evidence-seeking process
    Core new framework introduced to address the Glance Effect.

pith-pipeline@v0.9.0 · 5577 in / 1530 out tokens · 43900 ms · 2026-05-17T04:50:48.248369+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...

Supplementary material (excerpts)

System Prompts Details

In this section, we provide the exact system prompts used in our SocraticAgent framework to synthesize the RS-EoT-4K dataset. As described in the main paper, SocraticAgent operates as a self-play multi-agent system consisting of three distinct roles: the Reasoner, the Perceiver, and the Verifier. • The Reasoner (Fig. 6) serves as the...

SFT Training Settings

We perform SFT on the base model Qwen2.5-VL-7B-Instruct using the RS-EoT-4K dataset. The training is implemented based on the LLaMA-Factory framework. We train the model for 5 epochs with a learning rate of 3×10⁻⁵, using the AdamW optimizer and a cosine learning rate scheduler. The global batch size is set to 64, and the maximum ...
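
The settings stated above can be collected into a small configuration object, together with the shape of a cosine learning-rate schedule. This is a sketch for orientation only: `SFTSettings` and `cosine_lr` are hypothetical names rather than LLaMA-Factory options, the schedule assumes no warmup and decay to zero (neither is stated in the excerpt), and the truncated maximum-length setting is deliberately left out.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SFTSettings:
    # Values below are the ones stated in the excerpt.
    base_model: str = "Qwen2.5-VL-7B-Instruct"
    dataset: str = "RS-EoT-4K"
    epochs: int = 5
    peak_lr: float = 3e-5
    optimizer: str = "AdamW"
    scheduler: str = "cosine"
    global_batch_size: int = 64

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr to 0 (assumption: no warmup, floor of 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

settings = SFTSettings()
midpoint_lr = cosine_lr(50, 100, settings.peak_lr)
```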

RL Training Settings

All reinforcement learning experiments are conducted using the EasyR1 framework, which provides a production-ready implementation of GRPO with KL regularization. We fix the KL coefficient to β = 1.0×10⁻². For each input, the model generates 4 rollout samples using sampling temperature 1.0, with a maximum response length of 4096 token...
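
GRPO scores each rollout relative to the other rollouts sampled for the same input, so the 4 samples per input form one normalization group. The sketch below illustrates that normalization and where the KL coefficient enters; it is not EasyR1 code. `group_relative_advantages` and `regularized_objective` are hypothetical names, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

KL_BETA = 1.0e-2        # KL coefficient stated in the excerpt
ROLLOUTS_PER_INPUT = 4  # rollout samples per input stated in the excerpt

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

def regularized_objective(advantage_term, kl_estimate, beta=KL_BETA):
    """Policy objective with the KL penalty subtracted (to be maximized)."""
    return advantage_term - beta * kl_estimate

# Example group: two rollouts earned reward 1.0, two earned 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rollouts in a group receive the same reward, every advantage is zero and that input contributes no policy gradient, which is one reason reward design matters for training signal.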

RL Reward Function

9.1. Grounding Reward

For the grounding task, the model is required to output a bounding box in the form [x1, y1, x2, y2] after a complete <think></think> block. Our reward contains two components: an IoU-based accuracy term and a lightweight format term.

Format reward. For the grounding task, we apply a lightweight format reward to encourag...
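
A reward with these two components might look like the sketch below. The excerpt is truncated before it states how the terms are combined, so the weighting here is an assumption: `format_weight=0.1`, the function names, and the choice to return 0 for malformed outputs are all hypothetical.

```python
import re

_NUM = r"(-?\d+(?:\.\d+)?)"
# A complete <think>...</think> block followed by [x1, y1, x2, y2] at the end.
_BOX_AFTER_THINK = re.compile(
    r"<think>.*?</think>\s*\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]\s*$",
    re.DOTALL,
)

def parse_box(response):
    """Return (x1, y1, x2, y2) if the response follows the format, else None."""
    match = _BOX_AFTER_THINK.search(response.strip())
    return tuple(float(g) for g in match.groups()) if match else None

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(response, gt_box, format_weight=0.1):
    """IoU accuracy term plus a small format term (weights are assumptions)."""
    box = parse_box(response)
    if box is None:
        return 0.0  # malformed output earns neither term in this sketch
    return (1.0 - format_weight) * iou(box, gt_box) + format_weight

reward = grounding_reward("<think>check the pier</think> [0, 0, 10, 10]", (0, 0, 10, 10))
```

Keeping the format term small relative to the IoU term means a well-formatted but badly localized box still scores low.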

RL Training Dynamics Curves

Figure 9 visualizes the evolution of key optimization statistics during the two RL stages in our pipeline: RL-Grounding and RL-VQA. The top block corresponds to the RL-Grounding stage and the bottom block to the RL-VQA stage; in both cases we plot the same set of metrics, including mean advantage, actor gradient norm, entr...

Difference Between Multiple-Choice VQA and Standard VQA

To assess the effectiveness of our proposed multiple-choice reformulation of VQA, we additionally perform an ablation study using the original dataset and model settings, but applying reinforcement learning directly on the standard free-form VQA answers. This experiment allows us to isolate and comp...

Case Study

We provide additional qualitative examples to further demonstrate the effectiveness of RS-EoT-7B in complex remote sensing reasoning scenarios. Specifically, we present extended case studies covering both Remote Sensing General VQA tasks (Fig. 11, Fig. 12, and Fig. 13) and Fine-grained Grounding tasks (Fig. 14 and Fig. 15). These visuali...