arxiv: 2511.22396 · v2 · submitted 2025-11-27 · 💻 cs.CV · cs.AI

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li


Pith reviewed 2026-05-17 04:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: remote sensing, vision-language models, evidence-based reasoning, iterative reasoning, multi-agent systems, reinforcement learning, visual question answering, grounding


The pith

Remote sensing vision models overcome pseudo-reasoning by iteratively seeking visual evidence in large images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models tend to describe a reasoning process without truly examining remote sensing images, relying instead on language consistency due to a coarse initial view of vast scenes. The paper proposes RS-EoT to change this by creating an iterative process where the model reasons, checks visual evidence, and repeats. This is achieved using a self-play system called SocraticAgent to generate example reasoning traces and then applying reinforcement learning in stages to strengthen the behavior. If the claim holds, these models would produce answers backed by actual image details rather than plausible but unexamined narratives, improving reliability on tasks like visual question answering and object grounding in satellite imagery.

Core claim

The paper establishes that RS-EoT, a language-driven iterative visual evidence-seeking paradigm, when instilled via SocraticAgent's self-play multi-agent synthesis of alternating reasoning and inspection cycles and refined through two-stage progressive reinforcement learning on grounding followed by VQA, enables genuine evidence-grounded reasoning that mitigates the Glance Effect in remote sensing tasks.

What carries the argument

RS-EoT, the iterative paradigm of alternating reasoning steps with visual inspections to build evidence-based conclusions instead of linguistic self-consistency.
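
As a rough illustration of the alternating pattern described here, the Python sketch below runs a reason-then-inspect loop. The `reasoner` and `perceiver` callables, the `Step` record, and the stopping rule are hypothetical stand-ins chosen for this example; they are not the paper's implementation or prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    question: str   # sub-question posed by the reasoner
    evidence: str   # what the perceiver reported after inspecting the image


# Hypothetical interfaces (assumptions for illustration only):
#   reasoner(task, history) -> (next_sub_question or None, final_answer or None)
#   perceiver(image, sub_question) -> textual description of visual evidence
Reasoner = Callable[[str, List[Step]], Tuple[Optional[str], Optional[str]]]
Perceiver = Callable[[object, str], str]


def evidence_loop(image, task: str, reasoner: Reasoner, perceiver: Perceiver,
                  max_rounds: int = 4) -> Tuple[Optional[str], List[Step]]:
    """Alternate reasoning and visual inspection until the reasoner answers."""
    history: List[Step] = []
    for _ in range(max_rounds):
        sub_question, answer = reasoner(task, history)
        if answer is not None:            # reasoner judges the evidence sufficient
            return answer, history
        evidence = perceiver(image, sub_question)
        history.append(Step(sub_question, evidence))
    return None, history                  # round budget exhausted without an answer


# Toy stand-ins so the sketch runs end to end.
def toy_reasoner(task, history):
    if len(history) < 2:
        return f"inspect region {len(history)}", None
    return None, "answer after 2 inspections"


def toy_perceiver(image, sub_question):
    return f"observed: {sub_question}"


answer, trace = evidence_loop(None, "count ships", toy_reasoner, toy_perceiver)
```

The point of the structure is that the answer is only produced after the history contains inspection results, so each conclusion can be traced back to a recorded piece of evidence.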

If this is right

RS-EoT models reach state-of-the-art accuracy on several remote sensing visual question answering benchmarks.
The approach produces observable iterative cycles of reasoning and evidence checking in model outputs.
Training first on fine-grained grounding tasks builds the core capability before generalizing to broader questions.
Models shift from pseudo-reasoning to answers that depend on specific visual details in the imagery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might apply to other vision domains involving wide-area or high-resolution images where initial glances miss critical details.
Similar self-play agents could be used to bootstrap evidence-seeking in non-visual reasoning tasks.
Future systems may default to multi-step visual verification as a core training objective rather than post-hoc prompting.

Load-bearing premise

The reasoning traces generated by the self-play multi-agent system actually come from looking at image content rather than reproducing common language patterns.

What would settle it

Edit the remote sensing image to alter a key visual feature that should change the correct answer, then check if the model updates its response based on the new evidence or sticks to the original output.
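
One way such a check could be scored is sketched below in Python. The `model(image, question)` interface, the case tuple layout, and the toy models are assumptions made for illustration; a real test would substitute actual images, an image editor, and the released model.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: model(image, question) -> answer string.
Model = Callable[[object, str], str]


def counterfactual_sensitivity(
    model: Model,
    cases: Iterable[Tuple[object, object, str, str, str]],
) -> dict:
    """Each case is (original_image, edited_image, question,
    expected_original_answer, expected_edited_answer)."""
    scored = followed = stuck = 0
    for original, edited, question, ans_orig, ans_edit in cases:
        before = model(original, question)
        after = model(edited, question)
        if before != ans_orig:
            continue           # only score cases answered correctly before the edit
        scored += 1
        if after == ans_edit:
            followed += 1      # answer tracked the changed visual evidence
        elif after == before:
            stuck += 1         # answer ignored the edit
    return {"scored": scored, "followed_edit": followed, "stuck_on_original": stuck}


# Toy data: "images" are dicts, and the edit removes all ships.
cases = [({"ships": 3}, {"ships": 0}, "How many ships?", "3", "0")]
looks = lambda image, question: str(image["ships"])   # reads the image
narrates = lambda image, question: "3"                # ignores the image

looks_result = counterfactual_sensitivity(looks, cases)
narrates_result = counterfactual_sensitivity(narrates, cases)
```

A model that reads the image should land in `followed_edit`, while one that repeats a plausible narrative should land in `stuck_on_original`.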

Figures

Figures reproduced from arXiv: 2511.22396 by Bolei He, Haifeng Li, Hongyuan Yuan, Linrui Xu, Run Shao, Wang Guo, Xinran He, Yijun Chen, Yiming Yan, Yongxing Dai, Zhaoyang Zhang, Ziyu Li.

**Figure 1.** Illustration of the pseudo reasoning problem and our RS-EoT solution. (a) Existing models show pseudo reasoning: explicit think...

**Figure 2.** Overview of our method to instill the RS-EoT paradigm.

**Figure 3.** Case studies comparing RS-EoT-7B with prior multimodal reasoning models on (top) Remote Sensing General QA and (bottom)...

**Figure 4.** Token-wise attention visualization on eight randomly sampled cases. The y-axis represents the proportion of attention allocated...

**Figure 5.** The reward curve for the VQA RL stage. The stable...

**Figure 6.** The system prompt for the Reasoner in SocraticAgent.

**Figure 7.** The system prompt for the Perceiver in SocraticAgent.

**Figure 8.** The system prompt for the Verifier in SocraticAgent.

**Figure 10.** Ablation comparing reinforcement learning on the...

**Figure 11.** Reasoning cases of RS-EoT-7B (Part 1).

**Figure 12.** Reasoning cases of RS-EoT-7B (Part 2).

**Figure 13.** Reasoning cases of RS-EoT-7B (Part 3).

**Figure 14.** Reasoning cases of RS-EoT-7B (Part 4).

**Figure 15.** Reasoning cases of RS-EoT-7B (Part 5).

Original abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Socratic self-play plus staged RL for remote sensing VLMs targets the glance effect with a fresh angle, but the proof that traces reflect real visual evidence rather than language patterns still needs tighter controls.


The punchline here is that the authors have built a Socratic self-play agent to generate iterative visual evidence-seeking traces for VLMs on remote sensing images, then used a two-stage RL schedule to train on grounding followed by VQA, reporting SOTA results while claiming to fix the glance effect that leads to pseudo-reasoning. What is new is the tailored application to RS with the RS-EoT paradigm and the specific self-play multi-agent setup for synthesizing those traces. This does not look like a straightforward extension of general multimodal reasoning work.

The paper does well in releasing code, data, and models publicly, which supports reproducibility and lets the community test the claims. The idea of alternating reasoning and inspection cycles to encourage genuine evidence-grounded behavior is a direct response to a documented limitation in current VLMs for large imagery.

The soft spots center on how well the method ensures the traces come from actual visual inspection rather than learned linguistic patterns. The stress-test note highlights that without quantitative controls, such as ablating the visual branch or comparing to language-only self-play, or metrics on grounding fidelity, it's difficult to fully rule out pseudo-reasoning. The abstract's mention of analyses confirming iterative cycles is a start, but the full paper would need to show those details clearly to make the central claim solid. Baselines and statistical significance are also key to check in the experiments section.

This paper is for researchers working on vision-language models for remote sensing and earth observation applications. Readers interested in improving multimodal reasoning for monitoring and planning tasks would find the method and the open resources useful. It has enough substance and addresses a practical issue, so it deserves a serious referee even if some aspects of the evaluation could be tightened. I would recommend sending this to peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that VLMs exhibit pseudo-reasoning on remote sensing tasks due to the Glance Effect (coarse perception of large-scale imagery leading to linguistic self-consistency rather than visual evidence). It proposes RS-EoT, a language-driven iterative evidence-seeking paradigm, implemented via SocraticAgent (a self-play multi-agent system synthesizing traces through alternating reasoning and visual inspection cycles) and a two-stage progressive RL strategy (RL on fine-grained grounding followed by RL on RS VQA). Experiments report SOTA performance on multiple RS VQA and grounding benchmarks, with analyses showing iterative cycles that mitigate the Glance Effect and enable evidence-grounded reasoning; code, data, and models are released.

Significance. If the central claim holds and the synthesized traces reflect genuine visual evidence-seeking rather than linguistic patterns, RS-EoT could meaningfully advance multimodal reasoning for remote sensing by providing a controllable way to enforce iterative inspection and reduce the Glance Effect. The open release of code, data, and models is a clear strength for reproducibility. However, the significance is currently limited by insufficient verification that the self-play and RL stages produce behavior driven by actual visual evidence rather than self-consistent narration.

major comments (2)

[Abstract] The claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking', and that these cycles confirm mitigation of the Glance Effect, is load-bearing for the central contribution. Yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.
[Experiments] SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.
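
For readers unfamiliar with the kind of reporting requested here, the following Python sketch shows one common option, a paired percentile bootstrap over per-item correctness. It is a generic illustration, not an analysis from the paper; the function name, the 0/1 encoding, and the resample count are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap(correct_a: List[int], correct_b: List[int],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy difference A - B, 95% CI low, 95% CI high).

    correct_a and correct_b hold 1 or 0 per benchmark item, in the same order,
    so every resample keeps the two models paired on identical items.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length result lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample items with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]            # 2.5th percentile
    high = diffs[int(0.975 * n_resamples) - 1]       # 97.5th percentile
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, low, high
```

If the interval excludes zero, the accuracy gap is unlikely to be an artifact of which benchmark items happened to be sampled; the interval itself can be drawn as the error bar.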

minor comments (1)

The abstract and method description introduce several new terms (SocraticAgent, RS-EoT paradigm, two-stage progressive RL) without a concise summary table or diagram early in the paper that would help readers track the relationships between components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.


Referee: [Abstract] The claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking', and that these cycles confirm mitigation of the Glance Effect, is load-bearing for the central contribution. Yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.

Authors: We agree that the abstract would be strengthened by explicit reference to quantitative evidence supporting the central claim. In the revised manuscript we will add a concise clause noting key ablation results on the visual-inspection branch together with grounding-fidelity metrics that differentiate evidence-driven cycles from language-only self-play. These additions will remain within abstract length limits while directing readers to the detailed analyses in Section 4. revision: yes

Referee: [Experiments] SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.

Authors: We acknowledge that clearer reporting of experimental controls would improve transparency. The revised experiments section will include error bars on all reported metrics, paired statistical significance tests against baselines, and an explicit language-only self-play control to isolate the contribution of the visual-inspection component. These details were partially present but will be expanded and highlighted for clarity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RS-EoT derivation chain


The paper proposes SocraticAgent for synthesizing reasoning traces via self-play multi-agent cycles and applies a two-stage RL procedure (grounding then VQA) before reporting SOTA results on external RS VQA and grounding benchmarks. No equations, predictions, or central claims reduce by construction to the method's own inputs; the iterative cycles are described as produced outputs whose presence is confirmed by post-hoc analyses rather than presupposed definitions. The derivation chain remains independent of self-citation load-bearing or fitted-input renaming, with evaluation performed on separate benchmark tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumption that iterative visual evidence-seeking can be instilled through self-generated traces and staged RL without introducing new pseudo-reasoning artifacts; no explicit free parameters or invented physical entities are described.

axioms (2)

domain assumption: Self-play multi-agent interaction can produce high-quality reasoning traces that reflect genuine visual grounding for remote sensing imagery.
Invoked in the description of SocraticAgent synthesizing traces via alternating cycles of reasoning and visual inspection.
domain assumption: Progressive RL first on grounding then on VQA will generalize the RS-EoT capability to broader understanding scenarios.
Stated as the two-stage strategy to enhance and generalize the patterns.
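
To make the second assumption concrete, the Python sketch below shows what a staged reward schedule might look like. The source does not state the reward functions, so the IoU reward for grounding, the exact-match reward for VQA, and the `STAGES` list are all assumptions introduced for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Box, gold: Box) -> float:
    # Stage 1 (assumed form): reward is box overlap with the reference box.
    return iou(pred, gold)


def vqa_reward(pred: str, gold: str) -> float:
    # Stage 2 (assumed form): exact match after trimming and lowercasing.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


# Hypothetical schedule: the reward that would be active in each stage, in order.
STAGES = [("grounding", grounding_reward), ("vqa", vqa_reward)]
```

Under this reading, the first stage rewards localizing the right region, and the second checks only the final answer, so any evidence-seeking habit has to carry over from the first stage rather than being rewarded directly.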

invented entities (2)

SocraticAgent (no independent evidence)
purpose: Self-play multi-agent system to synthesize reasoning traces for RS-EoT
Newly proposed component that alternates reasoning and visual inspection to generate training data.
RS-EoT paradigm (no independent evidence)
purpose: Language-driven iterative visual evidence-seeking process
Core new framework introduced to address the Glance Effect.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-05 unverdicted novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Seed 1.6 — doubao (seed) 1.6

ByteDance / V olcengine. Seed 1.6 — doubao (seed) 1.6. Online, 2025. 1, 4, 6

work page 2025
[3]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models.arXiv preprint arXiv:2504.11468,

work page internal anchor Pith review arXiv
[4]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. 3

work page 2023
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Missing premise exacerbates overthinking: Are reason- ing models losing critical thinking skill?arXiv preprint arXiv:2504.06514, 2025

Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Missing premise exacerbates overthinking: Are reason- ing models losing critical thinking skill?arXiv preprint arXiv:2504.06514, 2025. 2

work page arXiv 2025
[7]

Thinkless: Llm learns when to think.Advances in neural information processing systems, 2025

Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: Llm learns when to think.Advances in neural information processing systems, 2025. 2

work page 2025
[8]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025. 1, 2, 3

work page 2025
[9]

Skysense: A multi- modal remote sensing foundation model towards universal interpretation for earth observation imagery

Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, and Yansheng Li. Skysense: A multi- modal remote sensing foundation model towards universal interpretation for earth observation imagery. InProceedings of the IEEE/CVF Confere...

work page 2024
[10]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

A survey on remote sens- ing foundation models: From vision to multimodality.arXiv preprint arXiv:2503.22081, 2025

Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Ming- ming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, and Yunhong Wang. A survey on remote sens- ing foundation models: From vision to multimodality.arXiv preprint arXiv:2503.22081, 2025. 2

work page arXiv 2025
[12]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Few-shot vision- language reasoning for satellite imagery via verifiable re- wards

Aybora K ¨oksal and A Aydın Alatan. Few-shot vision- language reasoning for satellite imagery via verifiable re- wards. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 6901–6910, 2025. 3, 6

work page 2025
[14]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831– 27840, 2024. 2

work page 2024
[15]

Ddfav: Re- mote sensing large vision language models dataset and eval- uation benchmark.Remote Sensing, 17(4):719, 2025

Haodong Li, Xiaofeng Zhang, and Haicheng Qu. Ddfav: Re- mote sensing large vision language models dataset and eval- uation benchmark.Remote Sensing, 17(4):719, 2025. 2

work page 2025
[16]

Hrvqa: A visual question answering benchmark for high-resolution aerial images.ISPRS Journal of Photogrammetry and Re- mote Sensing, 214:65–81, 2024

Kun Li, George V osselman, and Michael Ying Yang. Hrvqa: A visual question answering benchmark for high-resolution aerial images.ISPRS Journal of Photogrammetry and Re- mote Sensing, 214:65–81, 2024. 5

work page 2024
[17]

Vrsbench: A versatile vision-language benchmark dataset for remote sens- ing image understanding.Advances in Neural Information Processing Systems, 37:3229–3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sens- ing image understanding.Advances in Neural Information Processing Systems, 37:3229–3242, 2024. 6

work page 2024
[18]

Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024. 2

work page 2024
[19]

Remote sensing spa- tiotemporal vision–language models: A comprehensive sur- vey.IEEE Geoscience and Remote Sensing Magazine, 2025

Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, and Zhenwei Shi. Remote sensing spa- tiotemporal vision–language models: A comprehensive sur- vey.IEEE Geoscience and Remote Sensing Magazine, 2025. 2

work page 2025
[20]

Rsvqa: Visual question answering for remote sensing data

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58 (12):8555–8566, 2020. 6

work page 2020
[21]

Rsvqa meets bigearthnet: A new, large-scale, visual question answering dataset for remote sensing

Sylvain Lobry, Beg ¨um Demir, and Devis Tuia. Rsvqa meets bigearthnet: A new, large-scale, visual question answering dataset for remote sensing. In2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pages 1218–1221, 2021. 5, 6

work page 2021
[22]

Skysensegpt: A fine- grained instruction tuning dataset and model for remote sens- ing vision-language understanding, 2024

Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, and Yansheng Li. Skysensegpt: A fine- grained instruction tuning dataset and model for remote sens- ing vision-language understanding, 2024. 5, 6

work page 2024
[23]

Mm-eureka: Ex- ploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, 9 Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Ex- ploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. 2, 3, 6

work page 2025
[24]

Gpt-5 system card.https://cdn.openai

OpenAI. Gpt-5 system card.https://cdn.openai. com/gpt-5-system-card.pdf, 2025. 4

work page 2025
[25]

Uav-vln: End-to-end vision language guided navigation for uavs

Pranav Saxena, Nishant Raghuvanshi, and Neena Goveas. Uav-vln: End-to-end vision language guided navigation for uavs. In2025 European Conference on Mobile Robots (ECMR), page 1–6. IEEE, 2025. 2

work page 2025
[26]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reason- ing in open language models, 2024. 3, 6

work page 2024
[28]

Satori- r1: Incentivizing multimodal reasoning with spatial ground- ing and verifiable rewards, 2025

Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori- r1: Incentivizing multimodal reasoning with spatial ground- ing and verifiable rewards, 2025. 2

work page 2025
[29]

Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.IEEE Transactions on Cir- cuits and Systems for Video Technology, pages 1–1, 2022

Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning.IEEE Transactions on Cir- cuits and Systems for Video Technology, pages 1–1, 2022. 6

work page 2022
[30]

Advancements in vision– language models for remote sensing: Datasets, capabilities, and enhancement techniques.Remote Sensing, 17(1):162,

Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Dawei Yan, Guoting Wei, and Xizhe Xue. Advancements in vision– language models for remote sensing: Datasets, capabilities, and enhancement techniques.Remote Sensing, 17(1):162,

work page
[31]

Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025

GLM-V Team. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025. 6

work page 2025
[32]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Qwen3-vl: Sharper vision, deeper thought, broader action.Qwen Blog

Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action.Qwen Blog. Accessed, pages 10–04, 2025. 6

work page 2025
[34]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via re- lational reasoning-based remote sensing visual question an- swering.Proceedings of the AAAI Conference on Artificial Intelligence, 38(6):5481–5489, 2024. 6

work page 2024
[36]

Ringmogpt: A unified remote sensing foundation model for vision, language, and grounded tasks.IEEE Transactions on Geoscience and Re- mote Sensing, 63:1–20, 2025

Peijin Wang, Huiyang Hu, Boyuan Tong, Ziqi Zhang, Fang- long Yao, Yingchao Feng, Zining Zhu, Hao Chang, Wenhui Diao, Qixiang Ye, and Xian Sun. Ringmogpt: A unified remote sensing foundation model for vision, language, and grounded tasks.IEEE Transactions on Geoscience and Re- mote Sensing, 63:1–20, 2025. 2

work page 2025
[37]

Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 3

work page 2022
[38]

Sarlang-1m: A benchmark for vision-language modeling in sar image un- derstanding, 2025

Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruix- uan Chen, Junshi Xia, and Naoto Yokoya. Sarlang-1m: A benchmark for vision-language modeling in sar image un- derstanding, 2025. 6

work page 2025
[39]

Light-r1: Curriculum SFT, DPO and RL for long COT from scratch and beyond

Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Tanglifu Tanglifu, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xi- angzheng Zhang. Light-r1: Curriculum SFT, DPO and RL for long COT from scratch and beyond. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

work page 2025
[40]

Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, and Jinjun Xiong. Geo-r1: Unlock- ing vlm geospatial reasoning with cross-view reinforcement learning, 2025. 3, 6

work page 2025
[41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Wethink: To- ward general-purpose vision-language reasoning via rein- forcement learning, 2025

Jie Yang, Feipeng Ma, Zitian Wang, Dacheng Yin, Kang Rong, Fengyun Rao, and Ruimao Zhang. Wethink: To- ward general-purpose vision-language reasoning via rein- forcement learning, 2025. 2, 3, 6

work page 2025
[43]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross- modal formalization.arXiv preprint arXiv:2503.10615,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Dapo: An open-source llm reinforcement learning system at scale, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xi- angpeng Wei, Hao Zhou, Jingjing Li...

[45]

Vl-cogito: Progressive curriculum reinforcement learning for advanced multimodal reasoning

Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, et al. Vl-cogito: Progressive curriculum reinforcement learning for advanced multimodal reasoning. arXiv preprint arXiv:2507.22607, 2025. 2

[46]

Rsvg: Exploring data and models for visual grounding on remote sensing data

Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61: 1–13, 2023. 6

[47]

Grounded vision-language navigation for uavs with open-vocabulary goal understanding, 2025

Yuhang Zhang, Haosheng Yu, Jiaping Xiao, and Mir Feroskhan. Grounded vision-language navigation for uavs with open-vocabulary goal understanding, 2025. 2

[48]

Llamafactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...

[49]

Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025

Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025. 6

[50]

Least-to-most prompting enables complex reasoning in large language models, 2023

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models, 2023. 3

Asking like Socrates: Socrates helps VLMs understand remote sensing images
Supplementary Material

System Prompts Details

In this section, we provide the exact system prompts used in our SocraticAgent framework to synthesize the RS-EoT-4K dataset. As described in the main paper, SocraticAgent operates as a self-play multi-agent system consisting of three distinct roles: the Reasoner, the Perceiver, and the Verifier.

• The Reasoner (Fig. 6) serves as the...
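The three-role self-play loop can be pictured with a small sketch. The source text is truncated before it describes each role in full, so the division of labour below is an assumption: the Reasoner proposes either a visual query or a final answer, the Perceiver answers the query from the image, and the Verifier accepts or rejects the final answer against a reference. All names (`Trace`, `synthesize_trace`, `max_rounds`) and the three callables are hypothetical stand-ins for model calls, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    """One synthesized reasoning trace: (query, observation) turns plus an answer."""
    question: str
    turns: List[Tuple[str, str]] = field(default_factory=list)
    answer: str = ""


def synthesize_trace(
    question: str,
    reasoner: Callable[[str, List[Tuple[str, str]]], Tuple[str, str]],
    perceiver: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    reference_answer: str,
    max_rounds: int = 8,  # hypothetical cap, not taken from the paper
) -> Optional[Trace]:
    """Alternate reasoning and inspection until the Reasoner commits to an answer.

    Assumed protocol: reasoner returns ("ask", query) or ("answer", text).
    Returns the trace only if the Verifier accepts the answer, otherwise None.
    """
    trace = Trace(question)
    for _ in range(max_rounds):
        kind, text = reasoner(question, trace.turns)
        if kind == "answer":
            trace.answer = text
            return trace if verifier(text, reference_answer) else None
        # The Reasoner asked for evidence: let the Perceiver inspect the image.
        trace.turns.append((text, perceiver(text)))
    return None  # no answer within the round budget, so the sample is discarded
```

Under these assumptions, rejected or unfinished traces are simply dropped, which is one plausible way a verified dataset such as RS-EoT-4K could be filtered.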

SFT Training Settings

We perform SFT on the base model Qwen2.5-VL-7B-Instruct using the RS-EoT-4K dataset. The training is implemented based on the LLaMA-Factory framework. We train the model for 5 epochs with a learning rate of $3 \times 10^{-5}$, using the AdamW optimizer and a cosine learning rate scheduler. The global batch size is set to 64, and the maximum ...
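To make the schedule concrete, here is a minimal sketch of a cosine learning-rate curve with the stated peak of 3e-5. The warmup length, the minimum learning rate, and the total step count are not given in the visible text, so `warmup_steps`, `min_lr`, and `total_steps` are illustrative parameters, and the function is a generic cosine decay rather than LLaMA-Factory's own scheduler.

```python
import math

PEAK_LR = 3e-5  # learning rate stated in the SFT settings


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    warmup_steps and min_lr are assumptions; the source does not state them.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With no warmup, the rate starts at the peak, falls to half the peak at the midpoint of training, and reaches `min_lr` at the final step.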

RL Training Settings

All reinforcement learning experiments are conducted using the EasyR1 framework, which provides a production-ready implementation of GRPO with KL regularization. We fix the KL coefficient to $\beta = 1.0 \times 10^{-2}$. For each input, the model generates 4 rollout samples using sampling temperature 1.0, with a maximum response length of 4096 token...
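As a rough illustration of what GRPO computes per prompt, the sketch below normalizes the rewards of one rollout group into advantages and evaluates a per-token KL penalty term. The constants 4 and 1.0e-2 come from the settings above; the use of the population standard deviation, the `eps` value, and the choice of KL estimator are assumptions, and none of this is EasyR1's actual code.

```python
import math
from statistics import mean, pstdev
from typing import List

NUM_ROLLOUTS = 4   # rollouts per input, as stated above
KL_BETA = 1.0e-2   # KL coefficient beta, as stated above


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one rollout group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Non-negative per-token KL estimate, exp(d) - d - 1 with d = logp_ref - logp_policy.

    This estimator is commonly paired with GRPO; whether the paper's runs use
    this exact variant is not stated in the visible text.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def penalized_token_objective(advantage: float, logp_policy: float,
                              logp_ref: float, beta: float = KL_BETA) -> float:
    """Advantage minus the weighted KL penalty (clipping and ratios omitted)."""
    return advantage - beta * kl_penalty(logp_policy, logp_ref)
```

One consequence of group normalization is that a group whose rollouts all receive the same reward yields zero advantage for every sample, so it contributes no policy gradient beyond the KL term.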

RL Reward Function

9.1. Grounding Reward

For the grounding task, the model is required to output a bounding box in the form [x1, y1, x2, y2] after a complete <think></think> block. Our reward contains two components: an IoU-based accuracy term and a lightweight format term.

Format reward. For the grounding task, we apply a lightweight format reward to encourag...
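A two-part reward of this shape can be sketched as follows. The visible text does not give the weighting between the two terms or the exact format rule, so `format_weight`, the regular expression, and the weighted-sum combination are assumptions made for illustration; boxes are assumed to be axis-aligned with x1 < x2 and y1 < y2.

```python
import re
from typing import Sequence

_NUM = r"(-?\d+(?:\.\d+)?)"
# Assumed format rule: a complete <think>...</think> block followed only by one box.
THINK_THEN_BOX = re.compile(
    r"^\s*<think>.*?</think>\s*\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]\s*$",
    re.DOTALL,
)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response: str, gt_box: Sequence[float],
                     format_weight: float = 0.1) -> float:
    """Format term plus IoU term; format_weight is a hypothetical value."""
    match = THINK_THEN_BOX.match(response)
    if match is None:
        return 0.0  # malformed output earns neither term in this sketch
    pred = [float(g) for g in match.groups()]
    return format_weight + (1.0 - format_weight) * iou(pred, gt_box)
```

In this sketch a well-formed response with a badly placed box still earns the small format term, while a response without a closed `<think>` block earns nothing regardless of the box it contains.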

RL Training Dynamics Curves

Figure 9 visualizes the evolution of key optimization statistics during the two RL stages in our pipeline: RL-Grounding and RL-VQA. The top block corresponds to the RL-Grounding stage and the bottom block to the RL-VQA stage; in both cases we plot the same set of metrics, including mean advantage, actor gradient norm, entr...

Difference Between Multiple-Choice VQA and Standard VQA

To assess the effectiveness of our proposed multiple-choice reformulation of VQA, we additionally perform an ablation study using the original dataset and model settings, but applying reinforcement learning directly on the standard free-form VQA answers. This experiment allows us to isolate and comp...

Case Study

We provide additional qualitative examples to further demonstrate the effectiveness of RS-EoT-7B in complex remote sensing reasoning scenarios. Specifically, we present extended case studies covering both Remote Sensing General VQA tasks (Fig. 11, Fig. 12, and Fig. 13) and Fine-grained Grounding tasks (Fig. 14 and Fig. 15). These visuali...
