Asking like Socrates: Socrates helps VLMs understand remote sensing images
Pith reviewed 2026-05-17 04:50 UTC · model grok-4.3
The pith
Remote sensing vision models overcome pseudo-reasoning by iteratively seeking visual evidence in large images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that RS-EoT, a language-driven paradigm for iteratively seeking visual evidence, enables genuine evidence-grounded reasoning that mitigates the Glance Effect in remote sensing tasks. The paradigm is instilled by SocraticAgent, a self-play multi-agent system that synthesizes traces of alternating reasoning and inspection cycles, and is then refined by two-stage progressive reinforcement learning, first on grounding and then on VQA.
What carries the argument
RS-EoT, the iterative paradigm of alternating reasoning steps with visual inspections to build evidence-based conclusions instead of linguistic self-consistency.
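The alternation described here can be pictured as a small control loop. The sketch below is an illustration under stated assumptions, not the paper's implementation: `reason`, `inspect`, the `Step` and `Trace` containers, and the `max_rounds` budget are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Step:
    question: str  # what the reasoning step asked about the image
    evidence: str  # what the visual inspection reported back


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


# Hypothetical interfaces (assumptions, not from the paper):
# reason(task, steps) returns ("ask", sub_question) or ("answer", final_answer)
Reason = Callable[[str, List[Step]], Tuple[str, str]]
# inspect(image, sub_question) returns a textual description of what is visible
Inspect = Callable[[Any, str], str]


def evidence_loop(task: str, image: Any, reason: Reason, inspect: Inspect,
                  max_rounds: int = 4) -> Trace:
    """Alternate reasoning and inspection until an answer or the budget is hit."""
    trace = Trace()
    for _ in range(max_rounds):
        kind, text = reason(task, trace.steps)
        if kind == "answer":
            trace.answer = text
            return trace
        # The reasoner asked for more evidence: look at the image again.
        trace.steps.append(Step(question=text, evidence=inspect(image, text)))
    # Budget exhausted: give the reasoner one last chance to conclude.
    kind, text = reason(task, trace.steps)
    trace.answer = text if kind == "answer" else None
    return trace
```

The point of the structure is that every conclusion is preceded by recorded inspection results, so a trace can be audited for whether the answer actually depends on them.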
If this is right
- RS-EoT models reach state-of-the-art accuracy on several remote sensing visual question answering benchmarks.
- The approach produces observable iterative cycles of reasoning and evidence checking in model outputs.
- Training first on fine-grained grounding tasks builds the core capability before generalizing to broader questions.
- Models shift from pseudo-reasoning to answers that depend on specific visual details in the imagery.
Where Pith is reading between the lines
- This method might apply to other vision domains involving wide-area or high-resolution images where initial glances miss critical details.
- Similar self-play agents could be used to bootstrap evidence-seeking in non-visual reasoning tasks.
- Future systems may default to multi-step visual verification as a core training objective rather than post-hoc prompting.
Load-bearing premise
The reasoning traces generated by the self-play multi-agent system actually come from looking at image content rather than reproducing common language patterns.
What would settle it
Edit the remote sensing image to alter a key visual feature that should change the correct answer, then check if the model updates its response based on the new evidence or sticks to the original output.
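This counterfactual check could be scored roughly as follows. It is a minimal sketch under assumptions: `model` is a hypothetical callable taking an image and a question and returning an answer string, and each case supplies an original image, an edited image, the question, and the expected answer before and after the edit. None of these names come from the paper.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# (original_image, edited_image, question, answer_before_edit, answer_after_edit)
Case = Tuple[Any, Any, str, str, str]


def counterfactual_report(model: Callable[[Any, str], str],
                          cases: Sequence[Case]) -> Dict[str, float]:
    """Measure whether answers follow the edited visual evidence."""
    eligible = followed = stuck = 0
    for original, edited, question, before, after in cases:
        if model(original, question) != before:
            continue  # only score cases the model got right before the edit
        eligible += 1
        new_answer = model(edited, question)
        if new_answer == after:
            followed += 1  # answer tracked the changed evidence
        elif new_answer == before:
            stuck += 1     # answer ignored the edit
    if eligible == 0:
        return {"eligible": 0, "followed": 0.0, "stuck": 0.0}
    return {"eligible": eligible,
            "followed": followed / eligible,
            "stuck": stuck / eligible}
```

A high `stuck` rate would point toward answers driven by language priors, while a high `followed` rate would support the evidence-grounding claim.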
Original abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLMs exhibit pseudo-reasoning on remote sensing tasks due to the Glance Effect (coarse perception of large-scale imagery leading to linguistic self-consistency rather than visual evidence). It proposes RS-EoT, a language-driven iterative evidence-seeking paradigm, implemented via SocraticAgent (a self-play multi-agent system synthesizing traces through alternating reasoning and visual inspection cycles) and a two-stage progressive RL strategy (RL on fine-grained grounding followed by RL on RS VQA). Experiments report SOTA performance on multiple RS VQA and grounding benchmarks, with analyses showing iterative cycles that mitigate the Glance Effect and enable evidence-grounded reasoning; code, data, and models are released.
Significance. If the central claim holds and the synthesized traces reflect genuine visual evidence-seeking rather than linguistic patterns, RS-EoT could meaningfully advance multimodal reasoning for remote sensing by providing a controllable way to enforce iterative inspection and reduce the Glance Effect. The open release of code, data, and models is a clear strength for reproducibility. However, the significance is currently limited by insufficient verification that the self-play and RL stages produce behavior driven by actual visual evidence rather than self-consistent narration.
major comments (2)
- [Abstract] Abstract: the claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking' confirming mitigation of the Glance Effect is load-bearing for the central contribution, yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.
- [Experiments] Experiments section: SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.
minor comments (1)
- The abstract and method description introduce several new terms (SocraticAgent, RS-EoT paradigm, two-stage progressive RL) without a concise summary table or diagram early in the paper that would help readers track the relationships between components.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.
- Referee: [Abstract] Abstract: the claim that 'analyses reveal clear iterative cycles of reasoning and evidence seeking' confirming mitigation of the Glance Effect is load-bearing for the central contribution, yet the abstract provides no quantitative controls such as ablation of the visual-inspection branch, comparison to language-only self-play, or metrics of visual grounding fidelity; without these, the distinction between genuine evidence-seeking and linguistic pattern matching cannot be verified.
  Authors: We agree that the abstract would be strengthened by explicit reference to quantitative evidence supporting the central claim. In the revised manuscript we will add a concise clause noting key ablation results on the visual-inspection branch together with grounding-fidelity metrics that differentiate evidence-driven cycles from language-only self-play. These additions will remain within abstract length limits while directing readers to the detailed analyses in Section 4.
  revision: yes
- Referee: [Experiments] Experiments section: SOTA results on RS VQA and grounding benchmarks are asserted without reported details on baselines, statistical significance tests, error bars, or controls for confounding factors such as prompt engineering; this undermines the strength of the performance claims that support the RS-EoT paradigm.
  Authors: We acknowledge that clearer reporting of experimental controls would improve transparency. The revised experiments section will include error bars on all reported metrics, paired statistical significance tests against baselines, and an explicit language-only self-play control to isolate the contribution of the visual-inspection component. These details were partially present but will be expanded and highlighted for clarity.
  revision: yes
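The rebuttal does not say which paired test or error-bar method the authors intend to use. As one possible illustration, a paired bootstrap over per-question correctness gives both an accuracy difference and a confidence interval. The function name, the 0/1 correctness encoding, the resample count, and the seed below are all assumptions made for this sketch.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(correct_a: Sequence[int], correct_b: Sequence[int],
                     n_resamples: int = 2000, seed: int = 0
                     ) -> Tuple[float, float, float]:
    """Return (accuracy difference A - B, 95% CI low, 95% CI high).

    correct_a[i] and correct_b[i] are 1 if the respective model answered
    question i correctly and 0 otherwise (same questions, same order).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    observed = sum(a - b for a, b in zip(correct_a, correct_b)) / n
    return observed, low, high
```

If the interval excludes zero, the gap between the two systems is unlikely to be a sampling artifact of the benchmark questions. Comparing the full model against a language-only self-play control in this way would speak directly to the referee's concern.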
Circularity Check
No significant circularity in RS-EoT derivation chain
The paper proposes SocraticAgent for synthesizing reasoning traces via self-play multi-agent cycles and applies a two-stage RL procedure (grounding, then VQA) before reporting SOTA results on external RS VQA and grounding benchmarks. No equations, predictions, or central claims reduce by construction to the method's own inputs; the iterative cycles are described as produced outputs whose presence is confirmed by post-hoc analyses rather than presupposed by definition. The derivation chain does not rest on load-bearing self-citations or on fitted inputs renamed as predictions, and evaluation is performed on separate benchmark tasks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Self-play multi-agent interaction can produce high-quality reasoning traces that reflect genuine visual grounding for remote sensing imagery.
- domain assumption: Progressive RL first on grounding then on VQA will generalize the RS-EoT capability to broader understanding scenarios.
invented entities (2)
- SocraticAgent: no independent evidence
- RS-EoT paradigm: no independent evidence
Forward citations
Cited by 1 Pith paper
- GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
  GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] ByteDance / Volcengine. Seed 1.6 - doubao (seed) 1.6. Online, 2025.
- [3] Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468,
- [4] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023.
- [5] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [6] Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Missing premise exacerbates overthinking: Are reasoning models losing critical thinking skill? arXiv preprint arXiv:2504.06514, 2025.
- [7] Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: Llm learns when to think. Advances in neural information processing systems, 2025.
- [8] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [9] Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, and Yansheng Li. Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Confere...
- [10] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,
- [11] Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, and Yunhong Wang. A survey on remote sensing foundation models: From vision to multimodality. arXiv preprint arXiv:2503.22081, 2025.
- [12] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [13] Aybora Köksal and A Aydın Alatan. Few-shot vision-language reasoning for satellite imagery via verifiable rewards. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6901–6910, 2025.
- [14] Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831–27840, 2024.
- [15] Haodong Li, Xiaofeng Zhang, and Haicheng Qu. Ddfav: Remote sensing large vision language models dataset and evaluation benchmark. Remote Sensing, 17(4):719, 2025.
- [16] Kun Li, George Vosselman, and Michael Ying Yang. Hrvqa: A visual question answering benchmark for high-resolution aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 214:65–81, 2024.
- [17] Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems, 37:3229–3242, 2024.
- [18] Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024.
- [19] Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, and Zhenwei Shi. Remote sensing spatiotemporal vision–language models: A comprehensive survey. IEEE Geoscience and Remote Sensing Magazine, 2025.
- [20] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020.
- [21] Sylvain Lobry, Begüm Demir, and Devis Tuia. Rsvqa meets bigearthnet: A new, large-scale, visual question answering dataset for remote sensing. In 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pages 1218–1221, 2021.
- [22] Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, and Yansheng Li. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding, 2024.
- [23] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025.
- [24] OpenAI. Gpt-5 system card. https://cdn.openai.com/gpt-5-system-card.pdf, 2025.
- [25] Pranav Saxena, Nishant Raghuvanshi, and Neena Goveas. Uav-vln: End-to-end vision language guided navigation for uavs. In 2025 European Conference on Mobile Robots (ECMR), page 1–6. IEEE, 2025.
- [26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [28] Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-r1: Incentivizing multimodal reasoning with spatial grounding and verifiable rewards, 2025.
- [29] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2022.
- [30] Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Dawei Yan, Guoting Wei, and Xizhe Xue. Advancements in vision–language models for remote sensing: Datasets, capabilities, and enhancement techniques. Remote Sensing, 17(1):162,
- [31] GLM-V Team. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025.
- [32] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
- [33] Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. Qwen Blog. Accessed, pages 10–04, 2025.
- [34] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025.
- [35] Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6):5481–5489, 2024.
- [36] Peijin Wang, Huiyang Hu, Boyuan Tong, Ziqi Zhang, Fanglong Yao, Yingchao Feng, Zining Zhu, Hao Chang, Wenhui Diao, Qixiang Ye, and Xian Sun. Ringmogpt: A unified remote sensing foundation model for vision, language, and grounded tasks. IEEE Transactions on Geoscience and Remote Sensing, 63:1–20, 2025.
- [37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- [38] Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, and Naoto Yokoya. Sarlang-1m: A benchmark for vision-language modeling in sar image understanding, 2025.
- [39] Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Tanglifu Tanglifu, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-r1: Curriculum SFT, DPO and RL for long COT from scratch and beyond. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...
- [40] Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, and Jinjun Xiong. Geo-r1: Unlocking vlm geospatial reasoning with cross-view reinforcement learning, 2025.
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [42] Jie Yang, Feipeng Ma, Zitian Wang, Dacheng Yin, Kang Rong, Fengyun Rao, and Ruimao Zhang. Wethink: Toward general-purpose vision-language reasoning via reinforcement learning, 2025.
- [43] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615,
- [44] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Li... Dapo: An open-source llm reinforcement learning system at scale, 2025.
- [45] Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, et al. Vl-cogito: Progressive curriculum reinforcement learning for advanced multimodal reasoning. arXiv preprint arXiv:2507.22607, 2025.
- [46] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023.
- [47] Yuhang Zhang, Haosheng Yu, Jiaping Xiao, and Mir Feroskhan. Grounded vision-language navigation for uavs with open-vocabulary goal understanding, 2025.
- [48] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...
- [49] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025.
- [50] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models, 2023.
Supplementary Material
- [51] System Prompts Details. In this section, we provide the exact system prompts used in our SocraticAgent framework to synthesize the RS-EoT-4K dataset. As described in the main paper, SocraticAgent operates as a self-play multi-agent system consisting of three distinct roles: the Reasoner, the Perceiver, and the Verifier. • The Reasoner (Fig. 6) serves as the...
- [52] SFT Training Settings. We perform SFT on the base model Qwen2.5-VL-7B-Instruct using the RS-EoT-4K dataset. The training is implemented based on the LLaMA-Factory framework. We train the model for 5 epochs with a learning rate of 3×10⁻⁵, using the AdamW optimizer and a cosine learning rate scheduler. The global batch size is set to 64, and the maximum ...
- [53] RL Training Settings. All reinforcement learning experiments are conducted using the EasyR1 framework, which provides a production-ready implementation of GRPO with KL regularization. We fix the KL coefficient to β = 1.0×10⁻². For each input, the model generates 4 rollout samples using sampling temperature 1.0, with a maximum response length of 4096 token...
- [54] RL Reward Function. 9.1. Grounding Reward. For the grounding task, the model is required to output a bounding box in the form [x1, y1, x2, y2] after a complete <think></think> block. Our reward contains two components: an IoU-based accuracy term and a lightweight format term. Format reward. For the grounding task, we apply a lightweight format reward to encourag...
- [55] RL Training Dynamics Curves. Figure 9 visualizes the evolution of key optimization statistics during the two RL stages in our pipeline: RL-Grounding and RL-VQA. The top block corresponds to the RL-Grounding stage and the bottom block to the RL-VQA stage; in both cases we plot the same set of metrics, including mean advantage, actor gradient norm, entr...
- [56] Difference Between Multiple-Choice VQA and Standard VQA. To assess the effectiveness of our proposed multiple-choice reformulation of VQA, we additionally perform an ablation study using the original dataset and model settings, but applying reinforcement learning directly on the standard free-form VQA answers. This experiment allows us to isolate and comp...
- [57] Case Study. We provide additional qualitative examples to further demonstrate the effectiveness of RS-EoT-7B in complex remote sensing reasoning scenarios. Specifically, we present extended case studies covering both Remote Sensing General VQA tasks (Fig. 11, Fig. 12, and Fig. 13) and Fine-grained Grounding tasks (Fig. 14 and Fig. 15). These visuali...
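Entries [53] and [54] describe GRPO rollouts scored by a grounding reward with an IoU term and a format term. The sketch below shows one way those pieces could fit together. It is hedged on several points because the excerpts are truncated: the additive combination, the `format_weight` value of 0.1, the regular expression, and the use of a population standard deviation are assumptions made for illustration, and the function names are hypothetical rather than taken from EasyR1 or the paper's code. Only the box format `[x1, y1, x2, y2]`, the `<think></think>` requirement, and the group size of 4 come from the excerpts.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]

_NUM = r"(-?\d+(?:\.\d+)?)"
# A complete <think>...</think> block followed by exactly one [x1, y1, x2, y2].
_RESPONSE = re.compile(
    r"^\s*<think>.*?</think>\s*\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]\s*$",
    re.DOTALL,
)


def parse_box(response: str) -> Optional[Box]:
    """Return the predicted box if the response follows the format, else None."""
    m = _RESPONSE.match(response)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    return (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response: str, gt_box: Box, format_weight: float = 0.1) -> float:
    """IoU accuracy term plus a small format bonus (combination is an assumption)."""
    box = parse_box(response)
    if box is None:
        return 0.0  # malformed output: nothing to score
    return iou(box, gt_box) + format_weight


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: rewards normalized within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With 4 rollouts per input, `group_advantages` would be applied to the four `grounding_reward` values for that input, so a rollout is rewarded only relative to its siblings. The KL term with β = 1.0×10⁻² is applied by the training framework and is not modeled here.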