pith. sign in

arxiv: 2604.17504 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.AI

RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding

Pith reviewed 2026-05-10 06:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensingvision-language modelsreinforcement learningperceptual inertiahybrid rewardimage understandingzero-shot generalization
0
0 comments X

The pith

A hybrid reward system with three targeted components overcomes perceptual inertia in remote sensing vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RS-HyRe-R1 to fix a bias in reinforcement learning post-training for remote sensing vision-language models. These models often latch onto quick salient features in complex satellite imagery instead of building full visual evidence. The approach adds three rewards to push structured reasoning, accurate alignment, and exploration of varied cues. With a 3B-parameter model, it reaches state-of-the-art results on referring expression comprehension, open-vocabulary detection, and visual question answering while beating larger models. This matters for reliable AI interpretation of remote sensing data without scaling up compute.

Core claim

RS-HyRe-R1 mitigates perceptual inertia through a hybrid reward framework that combines a spatial reasoning activation reward to enforce structured visual reasoning, a perception correctness reward for adaptive geometric and semantic alignment, and a visual-semantic path evolution reward to penalize repetitive paths and promote complementary evidence chains. The result is deeper, more diverse reasoning that improves performance across tasks and enables strong zero-shot generalization.

What carries the argument

The RS-HyRe-R1 hybrid reward framework, which integrates spatial reasoning activation, perception correctness, and visual-semantic path evolution rewards to guide models toward comprehensive visual evidence mining and flexible focus shifting.

If this is right

  • Models construct richer evidence chains instead of relying on quick outcome fitting.
  • Smaller 3B-parameter models outperform systems up to 7B parameters on REC, OVD, and VQA tasks.
  • Zero-shot performance improves by margins of 3.16% on VQA, 3.97% on OVD, and 2.72% on REC over the prior best.
  • Reasoning becomes more diverse and less repetitive across varied remote sensing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward structure could reduce shortcut biases in other vision-language reinforcement learning settings outside remote sensing.
  • Tuning the balance among the three rewards might further improve generalization on unseen imagery types.
  • Integration with existing RL post-training pipelines would require only modest changes to the reward computation step.

Load-bearing premise

The three rewards will successfully enforce thorough visual evidence mining and flexible focus shifting in complex remote sensing imagery without creating new exploitable biases or loopholes.

What would settle it

A test set of complex remote sensing images where the trained model still defaults to localized salient cues and produces incomplete evidence chains despite the hybrid rewards.

Figures

Figures reproduced from arXiv: 2604.17504 by Gaozhi Zhou, Haifeng Li, Hu He, Jipeng Zhang, Linrui Xu, Liujue Zhang, Peng Shen, Wang Guo, Xuezhi Cui, Zeyuan Wang, Ziyu Li.

Figure 1
Figure 1. Figure 1: Illustration of the “perceptual inertia” phenomenon and our proposed RS-HyRe-R1 solution. (a) Existing RL-driven models (e.g., TinyRS-R1, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed RS-HyRe-R1 framework. The pipeline begins with the construction of a RS-Task Dataset [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison results with six models (including five RL training models). (a) Visualization of REC Task [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of response length trajectories. Unlike [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The overall reward curve for the RL training. The [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study for different rewards. The comparison curves of (a) Acc@0.5 for REC, (b) mAP@0.5 for OVD, and [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
read the original abstract

Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RS-HyRe-R1, a hybrid reward framework for RL post-training of remote sensing vision-language models (RS-VLMs) to mitigate 'perceptual inertia'—the bias toward localized salient cues for rapid inference. It defines three reward components (spatial reasoning activation, perception correctness, and visual-semantic path evolution) intended to enforce structured reasoning, accurate alignment, and exploration of complementary cues. Experiments on a 3B-parameter model report state-of-the-art results on REC, OVD, and VQA tasks (outperforming models up to 7B parameters) along with zero-shot gains of 3.16%, 3.97%, and 2.72% respectively; code and datasets are linked.

Significance. If the central claims hold after verification, the work would be significant for improving reasoning depth in domain-specific VLMs with modest model size, addressing a plausible RL-induced bias in remote sensing imagery. Open-sourcing of code and datasets supports reproducibility and could influence post-training practices for efficient RS-VLMs.

major comments (3)
  1. [§3] §3 (Hybrid Reward Framework): The three reward formulations are described at a high level without explicit equations, weighting coefficients, or normalization details. This prevents assessment of whether the composite reward can be gamed by superficial patterns (e.g., repetitive spatial phrases satisfying the activation term or cycling similar cues for the evolution term) rather than enforcing exhaustive evidence mining, directly undermining the central claim that perceptual inertia is mitigated.
  2. [§4.3] §4.3 (Ablation Studies): No ablation isolates the contribution of each reward component or tests robustness against exploitation on ambiguous RSI; the reported gains could arise from hyperparameter tuning or data-specific fitting rather than the proposed mechanism, which is load-bearing for the SOTA and zero-shot claims.
  3. [§4.1] §4.1 (Experimental Setup): Training details, data splits, and statistical significance for the zero-shot improvements (3.16% VQA, 3.97% OVD, 2.72% REC) are insufficiently specified, making it impossible to rule out post-hoc tuning or evaluation leakage as alternative explanations for outperformance over larger models.
minor comments (2)
  1. [§2] The introduction of the term 'perceptual inertia' would benefit from explicit comparison to related concepts such as reward hacking or shortcut learning in prior VLM RL literature.
  2. [§4.4] Figure captions and axis labels in the qualitative results could be clarified to directly link observed behaviors to specific reward terms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our manuscript on RS-HyRe-R1. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of the hybrid reward framework, ablations, and experimental details without altering the core contributions.

read point-by-point responses
  1. Referee: [§3] §3 (Hybrid Reward Framework): The three reward formulations are described at a high level without explicit equations, weighting coefficients, or normalization details. This prevents assessment of whether the composite reward can be gamed by superficial patterns (e.g., repetitive spatial phrases satisfying the activation term or cycling similar cues for the evolution term) rather than enforcing exhaustive evidence mining, directly undermining the central claim that perceptual inertia is mitigated.

    Authors: We agree that explicit formulations are necessary for rigorous evaluation. In the revised manuscript, we will expand Section 3 to include the full mathematical definitions of the three reward components: the spatial reasoning activation reward (with its structured reasoning enforcement term), the perception correctness reward (with adaptive anchors and normalization), and the visual-semantic path evolution reward (with its penalty on repetitive paths). We will specify the weighting coefficients (λ_spatial, λ_correct, λ_evolve) and normalization procedures used in the composite reward R = λ1*R1 + λ2*R2 + λ3*R3. Additionally, we will add analysis showing how the evolution reward explicitly penalizes repetitive cue cycling and superficial phrase repetition, thereby supporting the claim that perceptual inertia is mitigated rather than gamed. These additions will enable direct assessment of robustness. revision: yes

  2. Referee: [§4.3] §4.3 (Ablation Studies): No ablation isolates the contribution of each reward component or tests robustness against exploitation on ambiguous RSI; the reported gains could arise from hyperparameter tuning or data-specific fitting rather than the proposed mechanism, which is load-bearing for the SOTA and zero-shot claims.

    Authors: We acknowledge the value of isolating each component's contribution. In the revised Section 4.3, we will add targeted ablations that systematically disable or scale individual rewards (e.g., training with only spatial activation, only correctness, or only evolution) and report delta performance on REC, OVD, and VQA. We will also include experiments on ambiguous RSI cases (e.g., low-contrast or multi-object scenes) to test for exploitation vulnerabilities. These results will demonstrate that the hybrid combination is responsible for the observed gains beyond hyperparameter effects, directly addressing the load-bearing nature of the mechanism for our SOTA and zero-shot claims. revision: yes

  3. Referee: [§4.1] §4.1 (Experimental Setup): Training details, data splits, and statistical significance for the zero-shot improvements (3.16% VQA, 3.97% OVD, 2.72% REC) are insufficiently specified, making it impossible to rule out post-hoc tuning or evaluation leakage as alternative explanations for outperformance over larger models.

    Authors: We appreciate the emphasis on reproducibility. In the revised Section 4.1, we will provide complete training hyperparameters (learning rate schedules, batch sizes, RL-specific settings like PPO clip range), exact train/validation/test splits for all datasets, and the full protocol for zero-shot evaluation. For the reported improvements, we will include results from multiple independent runs with different random seeds, along with standard deviations and statistical significance tests (e.g., paired t-tests or confidence intervals) to rule out post-hoc tuning or leakage. We will also clarify that all comparisons used the same evaluation metrics and held-out test sets as prior work. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical reward design with independent evaluation

full rationale

The paper introduces three new reward terms (spatial reasoning activation, perception correctness, visual-semantic path evolution) as an engineering solution to perceptual inertia in RL post-training of RS-VLMs. These are defined directly from the problem description rather than derived from prior fitted quantities or self-citations that reduce to the target result. Performance claims rest on standard task benchmarks (REC, OVD, VQA) with reported zero-shot gains and public code/datasets, not on any equation that equates a prediction to its own input by construction. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the provided abstract or description. The central claim therefore remains an independent empirical proposal rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on domain assumptions about how RL induces visual biases in RSI and that the three hand-designed rewards will promote comprehensive evidence chains; no free parameters are explicitly listed in the abstract, and the new term perceptual inertia is introduced without external falsifiable evidence.

axioms (2)
  • domain assumption Reinforcement learning post-training substantially improves remote sensing vision-language models
    Stated as background fact in the opening sentence of the abstract.
  • domain assumption Models tend to rely on localized salient cues for rapid inference when handling complex RSI
    Presented as an observed RL-induced behavior leading to the definition of perceptual inertia.
invented entities (1)
  • perceptual inertia no independent evidence
    purpose: To name and frame the RL-induced bias of overreliance on specific features that impedes complete evidence construction
    New term coined in the abstract to describe the identified limitation; no independent evidence provided outside the paper's observations.

pith-pipeline@v0.9.0 · 5634 in / 1444 out tokens · 39580 ms · 2026-05-10T06:12:11.247818+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

  1. [1]

    Vision-Language Models in Remote Sensing: Current progress and future trends,

    X. Li, C. Wen, Y . Hu, Z. Yuan, and X. X. Zhu, “Vision-Language Models in Remote Sensing: Current progress and future trends,”IEEE Geoscience and Remote Sensing Magazine, vol. 12, no. 2, pp. 32–66, 2024

  2. [2]

    RemoteCLIP: A Vision Language Foundation Model for Remote Sensing,

    F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhuet al., “RemoteCLIP: A Vision Language Foundation Model for Remote Sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  3. [3]

    EarthVQA: towards queryable earth via relational reasoning-based remote sensing visual question answering,

    J. Wang, Z. Zheng, Z. Chen, A. Ma, and Y . Zhong, “EarthVQA: towards queryable earth via relational reasoning-based remote sensing visual question answering,” inProceedings of the Thirty-Eighth AAAI Confer- ence on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on IEEE TRA...

  4. [4]

    Vision-Language Model for Remote Sensing Images Object Detection,

    D. Liu, T. Li, Y . Qi, Y . Xi, J. Jin, and J. Zhang, “Vision-Language Model for Remote Sensing Images Object Detection,” inIGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium, 2025, pp. 6069–6072

  5. [5]

    AddressCLIP: Empowering Vision-Language Models for City-Wide Image Address Localization,

    S. Xu, C. Zhang, L. Fan, G. Meng, S. Xiang, and J. Ye, “AddressCLIP: Empowering Vision-Language Models for City-Wide Image Address Localization,” inComputer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025, pp. 76–92

  6. [6]

    Allspark: A multimodal spatio-temporal general intelligence model with ten modalities via language as a reference framework,

    R. Shao, C. Yang, Q. Li, L. Xu, X. Yanget al., “Allspark: A multimodal spatio-temporal general intelligence model with ten modalities via language as a reference framework,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, 2025

  7. [7]

    Rethinking domain- agnostic continual learning via frequency completeness learning,

    J. Peng, H. Zhang, J. Shen, Z. Li, J. Ma, and H. Li, “Rethinking domain- agnostic continual learning via frequency completeness learning,”Infor- mation Fusion, vol. 129, p. 103961, 2026

  8. [8]

    Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding,

    R. Shao, Z. Zhang, C. Tao, Y . Zhang, C. Peng, and H. Li, “Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 218, 2024

  9. [9]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wanget al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,”Nature, vol. 645, no. 8081, pp. 633–638, 2025

  10. [10]

    Qwen3 Technical Report,

    A. Yang, A. Li, B. Yang, B. Zhang, B. Huiet al., “Qwen3 Technical Report,” 2025

  11. [11]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning,

    F. Meng, L. Du, Z. Liu, Z. Zhou, Q. Luet al., “MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning,” 2025

  12. [12]

    Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond,

    L. Wen, Y . Cai, F. Xiao, X. He, Q. Anet al., “Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 6, Vienna, Austria, jul 2025, pp. 318–327

  13. [13]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models,

    W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Yeet al., “Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models,” 2025

  14. [14]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model,

    H. Shen, P. Liu, J. Li, C. Fang, Y . Maet al., “VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model,” 2025

  15. [15]

    SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models,

    H. Chen, H. Tu, F. Wang, H. Liu, X. Tanget al., “SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models,” 2025

  16. [16]

    WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learn- ing,

    J. Yang, F. Ma, Z. Wang, D. Yin, K. Ronget al., “WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learn- ing,” 2025

  17. [17]

    Let’s Verify Step by Step,

    H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Bakeret al., “Let’s Verify Step by Step,” 2023

  18. [18]

    Solving math word problems with process- and outcome-based feedback,

    J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegelet al., “Solving math word problems with process- and outcome-based feedback,” 2022

  19. [19]

    STaR: Bootstrapping Reasoning With Reasoning,

    E. Zelikman, Y . Wu, J. Mu, and N. Goodman, “STaR: Bootstrapping Reasoning With Reasoning,” inAdvances in Neural Information Pro- cessing Systems, vol. 35. Curran Associates, Inc., 2022, pp. 15 476– 15 488

  20. [20]

    Process reward models that think,

    M. Khalifa, R. Agarwal, L. Logeswaran, J. Kim, H. Penget al., “Process reward models that think,” 2025

  21. [21]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization,

    J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhanget al., “R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization,” 2025

  22. [22]

    Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Rein- forcement Fine-Tuning,

    Z. Zhang, Z. Guan, T. Zhao, H. Shen, T. Liet al., “Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Rein- forcement Fine-Tuning,” 2025

  23. [23]

    Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning,

    C. Xu, F. Yu, M. J. Bianco, J. Kovarskiy, R. Tanget al., “Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning,” 2025

  24. [24]

    TinyRS-R1: Compact Vision Language Model for Remote Sensing,

    A. K ¨oksal and A. A. Alatan, “TinyRS-R1: Compact Vision Language Model for Remote Sensing,”IEEE Geoscience and Remote Sensing Letters, vol. 22, p. 1–5, 2025

  25. [25]

    GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning,

    W. Li, X. Xiang, Z. Wen, G. Zhou, B. Niuet al., “GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning,” 2026

  26. [26]

    J. A. Richards and X. Jia,Remote Sensing Digital Image Analysis: An Introduction, 6th ed. Berlin, Heidelberg: Springer, 2022

  27. [27]

    Remote Sensing Image Scene Classifi- cation: Benchmark and State of the Art,

    G. Cheng, J. Han, and X. Lu, “Remote Sensing Image Scene Classifi- cation: Benchmark and State of the Art,”Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017

  28. [28]

    Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion,

    J. Qu, Z. Tang, L. Zhang, Y . Zhang, and Z. Zhang, “Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion,”Remote Sensing, vol. 15, no. 11, p. 2728, 2023

  29. [29]

    Adversarial examples for vehicle detection with projection transformation,

    J. Cui, W. Guo, H. Huang, X. Lv, H. Cao, and H. Li, “Adversarial examples for vehicle detection with projection transformation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, p. 5632418, 2024

  30. [30]

    Trimem: Tri-fold memory framework for continual learning of vlms in remote sensing,

    W. Guo, J. Cui, X. Cui, J. Li, Z. Zhanget al., “Trimem: Tri-fold memory framework for continual learning of vlms in remote sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, p. 5648115, 2025

  31. [31]

    Rs deepreason: Llm-driven deep reasoning for multigranularity remote sensing scene interpretation,

    C. Yang, J. Zhang, Q. Li, W. Guo, and H. Li, “Rs deepreason: Llm-driven deep reasoning for multigranularity remote sensing scene interpretation,”IEEE Geoscience and Remote Sensing Letters, vol. 23, p. 6005305, 2026

  32. [32]

    R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018

  33. [33]

    Reinforcement learning: A survey,

    L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,”Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 237–285, 1996

  34. [34]

    Human-level control through deep reinforcement learning,

    V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Venesset al., “Human-level control through deep reinforcement learning,”Nature, vol. 518, no. 7540, pp. 529–533, 2015

  35. [35]

    A Survey on Remote Sensing Foundation Models: From Vision to Multimodality,

    Z. Huang, H. Yan, Q. Zhan, S. Yang, M. Zhanget al., “A Survey on Remote Sensing Foundation Models: From Vision to Multimodality,” 2025

  36. [36]

    Open-ended remote sensing visual question answering with transformers,

    M. M. A. Rahhal, Y . Bazi, S. O. Alsaleh, M. Al-Razgan, M. L. Mekhalfi et al., “Open-ended remote sensing visual question answering with transformers,”International Journal of Remote Sensing, vol. 43, no. 18, pp. 6809–6823, 2022

  37. [37]

    Visual Grounding in Remote Sensing Images,

    Y . Sun, S. Feng, X. Li, Y . Ye, J. Kang, and X. Huang, “Visual Grounding in Remote Sensing Images,” inProceedings of the 30th ACM International Conference on Multimedia. New York, NY , USA: Association for Computing Machinery, 2022, p. 404–412

  38. [38]

    Context-driven and sparse decoding for Remote Sensing Visual Grounding,

    Y . Zhao, Y . Chen, R. Yao, S. Xiong, and X. Lu, “Context-driven and sparse decoding for Remote Sensing Visual Grounding,”Information Fusion, vol. 123, p. 103296, 2025

  39. [39]

    GeoChat: Grounded Large Vision-Language Model for Remote Sensing,

    K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, “GeoChat: Grounded Large Vision-Language Model for Remote Sensing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 27 831–27 840

  40. [40]

    SkySense: A Multi- Modal Remote Sensing Foundation Model Towards Universal Interpre- tation for Earth Observation Imagery,

    X. Guo, J. Lao, B. Dang, Y . Zhang, L. Yuet al., “SkySense: A Multi- Modal Remote Sensing Foundation Model Towards Universal Interpre- tation for Earth Observation Imagery,”2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27 662–27 673, 2024

  41. [41]

    RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks,

    P. Wang, H. Hu, B. Tong, Z. Zhang, F. Yaoet al., “RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–20, 2025

  42. [42]

    SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding,

    J. Luo, Z. Pang, Y . Zhang, T. Wang, L. Wanget al., “SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding,”arXiv, 2024

  43. [43]

    EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain,

    W. Zhang, M. Cai, T. Zhang, Y . Zhuang, and X. Mao, “EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024

  44. [44]

    Train- ing language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwrightet al., “Train- ing language models to follow instructions with human feedback,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, vol. 35, Red Hook, NY , USA, 2022, pp. 27 730– 27 744

  45. [45]

    The Flan Collection: Designing Data and Methods for Effective Instruction Tun- ing,

    S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chunget al., “The Flan Collection: Designing Data and Methods for Effective Instruction Tun- ing,” inProceedings of the 40th International Conference on Machine Learning (ICML), vol. 202. JMLR.org, 2023, pp. 22 631–22 648

  46. [46]

    RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data,

    Y . Zhan, Z. Xiong, and Y . Yuan, “RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023

  47. [47]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces,

    J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 10 632–10 643

  48. [48]

    OpenAI o1 System Card,

    OpenAI, A. Jaech, A. Kalai, A. Lerer, A. Richardsonet al., “OpenAI o1 System Card,” 2024

  49. [49]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Songet al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 13

  50. [50]

    Mulberry: Em- powering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search,

    H. Yao, J. Huang, W. Wu, J. Zhang, Y . Wanget al., “Mulberry: Em- powering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search,” 2024

  51. [51]

    VisualPRM: An Effective Process Reward Model for Multimodal Reasoning,

    W. Wang, Z. Gao, L. Chen, Z. Chen, J. Zhuet al., “VisualPRM: An Effective Process Reward Model for Multimodal Reasoning,” 2025

  52. [52]

    VRSBench: A Versatile Vision- Language Benchmark Dataset for Remote Sensing Image Understand- ing,

    X. Li, J. Ding, and M. Elhoseiny, “VRSBench: A Versatile Vision- Language Benchmark Dataset for Remote Sensing Image Understand- ing,” inProceedings of the 38th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2024

  53. [53]

    Multi-class geospatial object detection and geographic image classification based on collection of part detectors,

    G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, pp. 119–132, 2014

  54. [54]

    RSVQA: Visual Question Answering for Remote Sensing Data,

    S. Lobry, D. Marcos, J. Murray, and D. Tuia, “RSVQA: Visual Question Answering for Remote Sensing Data,”IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, pp. 8555–8566, 2020

  55. [55]

    Qwen2.5-VL Technical Report,

    S. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2.5-VL Technical Report,” 2025

  56. [56]

    EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework,

    Z. Yaowei, L. Junting, W. Shenzhi, F. Zhangchi, K. Dongdong, and X. Yuwen, “EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework,” https://github.com/hiyouga/EasyR1, 2025

  57. [57]

    Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks,

    Y . Long, Y . Gong, Z. Xiao, and Q. Liu, “Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2486–2498, 2017

  58. [58]

    Improved Baselines with Visual Instruction Tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved Baselines with Visual Instruction Tuning,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26 286–26 296

  59. [59]

    Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models,

    Y . Li, Y . Zhang, C. Wang, Z. Zhong, Y . Chenet al., “Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models,” 2024

  60. [60]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning,

    J. Chen, D. Zhu, X. Shen, X. Li, Z. Liuet al., “MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning,” 2023

  61. [61]

    Pcir: An open-world remote sensing image representation learning method from a causal perspective,

    L. Zhao, M. Li, R. Shao, and H. Li, “Pcir: An open-world remote sensing image representation learning method from a causal perspective,”IEEE Transactions on Geoscience and Remote Sensing, vol. 64, p. 5601916, 2026

  62. [62]

    Stdcformer: A transformer- based model with a spatial-temporal causal de-confounding strategy for crowd flow prediction,

    S. He, P. Shen, P. Xu, Q. Luo, and H. Li, “Stdcformer: A transformer- based model with a spatial-temporal causal de-confounding strategy for crowd flow prediction,”Information Fusion, vol. 126, p. 103645, 2026

  63. [63]

    Causal invariant geographic network representations with feature and structural distribu- tion shifts,

    Y . Wang, S. He, Q. Luo, H. Yuan, L. Zhaoet al., “Causal invariant geographic network representations with feature and structural distribu- tion shifts,”Future Generation Computer Systems, vol. 169, p. 107814, 2025. Gaozhi Zhoureceived the Master’s degree in Elec- tronic Information from Kunming University of Sci- ence and Technology, Kunming, China, in 2...

  64. [64]

    degree in the School of Geosciences and Info-Physics, Central South University

    He is currently pursuing the Ph.D. degree in the School of Geosciences and Info-Physics, Central South University. His research interests include time series and spatio-temporal data mining. Jipeng Zhangis a Ph.D. student in the Department of Geomatics and remote sensing at Central South Universitty. His research interests include embodied agents in UA V ...