RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
Pith reviewed 2026-05-10 06:12 UTC · model grok-4.3
The pith
A hybrid reward system with three targeted components overcomes perceptual inertia in remote sensing vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RS-HyRe-R1 mitigates perceptual inertia through a hybrid reward framework that combines a spatial reasoning activation reward to enforce structured visual reasoning, a perception correctness reward for adaptive geometric and semantic alignment, and a visual-semantic path evolution reward to penalize repetitive paths and promote complementary evidence chains. The result is deeper, more diverse reasoning that improves performance across tasks and enables strong zero-shot generalization.
What carries the argument
The RS-HyRe-R1 hybrid reward framework, which integrates spatial reasoning activation, perception correctness, and visual-semantic path evolution rewards to guide models toward comprehensive visual evidence mining and flexible focus shifting.
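The authors' rebuttal further down describes the composite reward as a weighted sum of the three component rewards. A minimal sketch of that combination follows, assuming each component has already been scored as a float. The class `RewardWeights`, the function `hybrid_reward`, and every numeric value are hypothetical illustrations; the source does not report the actual weights or normalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not report the actual values.
    spatial: float = 1.0
    correct: float = 1.0
    evolve: float = 1.0


def hybrid_reward(r_spatial: float, r_correct: float, r_evolve: float,
                  w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the three component rewards."""
    return w.spatial * r_spatial + w.correct * r_correct + w.evolve * r_evolve


# Illustrative component scores (assumed to lie in [0, 1]), not paper values.
total = hybrid_reward(0.5, 1.0, 0.25,
                      RewardWeights(spatial=0.5, correct=1.0, evolve=2.0))
print(total)
```

Keeping the weights in one immutable object makes it easy to log the exact balance used in a run, which is the detail the referee report asks the authors to disclose.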
If this is right
- Models construct richer evidence chains instead of relying on quick outcome fitting.
- A 3B-parameter model outperforms systems with up to 7B parameters on REC, OVD, and VQA tasks.
- Zero-shot performance improves by margins of 3.16% on VQA, 3.97% on OVD, and 2.72% on REC over the prior best.
- Reasoning becomes more diverse and less repetitive across varied remote sensing tasks.
Where Pith is reading between the lines
- The same reward structure could reduce shortcut biases in other vision-language reinforcement learning settings outside remote sensing.
- Tuning the balance among the three rewards might further improve generalization on unseen imagery types.
- Integration with existing RL post-training pipelines would require only modest changes to the reward computation step.
Load-bearing premise
The three rewards will successfully enforce thorough visual evidence mining and flexible focus shifting in complex remote sensing imagery without creating new exploitable biases or loopholes.
What would settle it
A test set of complex remote sensing images where the trained model still defaults to localized salient cues and produces incomplete evidence chains despite the hybrid rewards.
Original abstract
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RS-HyRe-R1, a hybrid reward framework for RL post-training of remote sensing vision-language models (RS-VLMs) to mitigate 'perceptual inertia', the bias toward localized salient cues for rapid inference. It defines three reward components (spatial reasoning activation, perception correctness, and visual-semantic path evolution) intended to enforce structured reasoning, accurate alignment, and exploration of complementary cues. Experiments on a 3B-parameter model report state-of-the-art results on REC, OVD, and VQA tasks (outperforming models up to 7B parameters) along with zero-shot gains of 3.16% on VQA, 3.97% on OVD, and 2.72% on REC; code and datasets are linked.
Significance. If the central claims hold after verification, the work would be significant for improving reasoning depth in domain-specific VLMs with modest model size, addressing a plausible RL-induced bias in remote sensing imagery. Open-sourcing of code and datasets supports reproducibility and could influence post-training practices for efficient RS-VLMs.
major comments (3)
- [§3] §3 (Hybrid Reward Framework): The three reward formulations are described at a high level without explicit equations, weighting coefficients, or normalization details. This prevents assessment of whether the composite reward can be gamed by superficial patterns (e.g., repetitive spatial phrases satisfying the activation term or cycling similar cues for the evolution term) rather than enforcing exhaustive evidence mining, directly undermining the central claim that perceptual inertia is mitigated.
- [§4.3] §4.3 (Ablation Studies): No ablation isolates the contribution of each reward component or tests robustness against exploitation on ambiguous RSI; the reported gains could arise from hyperparameter tuning or data-specific fitting rather than the proposed mechanism, which is load-bearing for the SOTA and zero-shot claims.
- [§4.1] §4.1 (Experimental Setup): Training details, data splits, and statistical significance for the zero-shot improvements (3.16% VQA, 3.97% OVD, 2.72% REC) are insufficiently specified, making it impossible to rule out post-hoc tuning or evaluation leakage as alternative explanations for outperformance over larger models.
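The component ablation requested in the §4.3 comment amounts to training once for each subset of the three rewards. A minimal sketch of enumerating those configurations follows. The identifiers in `REWARD_TERMS` and the function `ablation_configs` are hypothetical names chosen for illustration, not taken from the paper's code.

```python
from itertools import combinations

# Hypothetical identifiers for the three reward terms named in the paper.
REWARD_TERMS = ("spatial_activation", "perception_correctness", "path_evolution")


def ablation_configs(terms=REWARD_TERMS):
    """Yield every non-empty subset of reward terms as an on/off mapping."""
    for k in range(1, len(terms) + 1):
        for active in combinations(terms, k):
            yield {t: (t in active) for t in terms}


configs = list(ablation_configs())
for cfg in configs:
    print(cfg)
```

With three terms this gives seven runs: three single-reward runs, three pairs, and the full hybrid. Comparing the full hybrid against each pair is what would isolate a single component's contribution.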
minor comments (2)
- [§2] The introduction of the term 'perceptual inertia' would benefit from explicit comparison to related concepts such as reward hacking or shortcut learning in prior VLM RL literature.
- [§4.4] Figure captions and axis labels in the qualitative results could be clarified to directly link observed behaviors to specific reward terms.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed review of our manuscript on RS-HyRe-R1. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of the hybrid reward framework, ablations, and experimental details without altering the core contributions.
-
Referee: [§3] §3 (Hybrid Reward Framework): The three reward formulations are described at a high level without explicit equations, weighting coefficients, or normalization details. This prevents assessment of whether the composite reward can be gamed by superficial patterns (e.g., repetitive spatial phrases satisfying the activation term or cycling similar cues for the evolution term) rather than enforcing exhaustive evidence mining, directly undermining the central claim that perceptual inertia is mitigated.
Authors: We agree that explicit formulations are necessary for rigorous evaluation. In the revised manuscript, we will expand Section 3 to include the full mathematical definitions of the three reward components: the spatial reasoning activation reward (with its structured reasoning enforcement term), the perception correctness reward (with adaptive anchors and normalization), and the visual-semantic path evolution reward (with its penalty on repetitive paths). We will specify the weighting coefficients (λ_spatial, λ_correct, λ_evolve) and normalization procedures used in the composite reward R = λ_spatial·R_spatial + λ_correct·R_correct + λ_evolve·R_evolve. Additionally, we will add analysis showing how the evolution reward explicitly penalizes repetitive cue cycling and superficial phrase repetition, thereby supporting the claim that perceptual inertia is mitigated rather than gamed. These additions will enable direct assessment of robustness. revision: yes
-
Referee: [§4.3] §4.3 (Ablation Studies): No ablation isolates the contribution of each reward component or tests robustness against exploitation on ambiguous RSI; the reported gains could arise from hyperparameter tuning or data-specific fitting rather than the proposed mechanism, which is load-bearing for the SOTA and zero-shot claims.
Authors: We acknowledge the value of isolating each component's contribution. In the revised Section 4.3, we will add targeted ablations that systematically disable or scale individual rewards (e.g., training with only spatial activation, only correctness, or only evolution) and report the resulting change in performance on REC, OVD, and VQA. We will also include experiments on ambiguous RSI cases (e.g., low-contrast or multi-object scenes) to test for exploitation vulnerabilities. These results will demonstrate that the hybrid combination is responsible for the observed gains beyond hyperparameter effects, directly addressing the load-bearing nature of the mechanism for our SOTA and zero-shot claims. revision: yes
-
Referee: [§4.1] §4.1 (Experimental Setup): Training details, data splits, and statistical significance for the zero-shot improvements (3.16% VQA, 3.97% OVD, 2.72% REC) are insufficiently specified, making it impossible to rule out post-hoc tuning or evaluation leakage as alternative explanations for outperformance over larger models.
Authors: We appreciate the emphasis on reproducibility. In the revised Section 4.1, we will provide complete training hyperparameters (learning rate schedules, batch sizes, RL-specific settings like PPO clip range), exact train/validation/test splits for all datasets, and the full protocol for zero-shot evaluation. For the reported improvements, we will include results from multiple independent runs with different random seeds, along with standard deviations and statistical significance tests (e.g., paired t-tests or confidence intervals) to rule out post-hoc tuning or leakage. We will also clarify that all comparisons used the same evaluation metrics and held-out test sets as prior work. revision: yes
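The last response promises multi-seed results with paired t-tests. A minimal sketch of the paired t statistic follows, using only the Python standard library. The function `paired_t_statistic` is a hypothetical helper, and the per-seed scores are placeholder values invented for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder accuracies per random seed, NOT results from the paper.
model_scores = [0.70, 0.72, 0.71, 0.73, 0.74]
baseline_scores = [0.68, 0.69, 0.70, 0.70, 0.71]

t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

The statistic is then compared against a critical value of the t distribution with the returned degrees of freedom (2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom). Where SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.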
Circularity Check
No circularity: empirical reward design with independent evaluation
The paper introduces three new reward terms (spatial reasoning activation, perception correctness, visual-semantic path evolution) as an engineering solution to perceptual inertia in RL post-training of RS-VLMs. These are defined directly from the problem description rather than derived from prior fitted quantities or self-citations that reduce to the target result. Performance claims rest on standard task benchmarks (REC, OVD, VQA) with reported zero-shot gains and public code/datasets, not on any equation that equates a prediction to its own input by construction. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the provided abstract or description. The central claim therefore remains an independent empirical proposal rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Reinforcement learning post-training substantially improves remote sensing vision-language models
- domain assumption: Models tend to rely on localized salient cues for rapid inference when handling complex RSI
invented entities (1)
- perceptual inertia: no independent evidence