pith. the verified trust layer for science.

arxiv: 2508.11196 · v2 · submitted 2025-08-15 · 💻 cs.CV

UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Pith reviewed 2026-05-18 22:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV visual reasoning · vision-language models · GRPO · supervised fine-tuning · aerial imagery · high-resolution VQA · lightweight VLM · zero-shot accuracy

The pith

A lightweight UAV vision-language model trained with supervised fine-tuning and multi-stage GRPO outperforms much larger general models on aerial reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UAV-VL-R1 as a compact vision-language model built specifically for high-resolution UAV imagery, where standard VLMs lose performance because of complex spatial semantics and real-time demands. It combines supervised fine-tuning on a new 50,019-sample HRVQA-VL dataset with multi-stage group relative policy optimization (GRPO) that uses rule-guided rewards to encourage structured reasoning across eight tasks such as object counting and spatial inference. According to the abstract, this hybrid process yields 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and outperforms the 72B-scale variant on multiple tasks while fitting in 3.9 GB of memory under FP16 inference. A sympathetic reader would care because the work demonstrates how targeted post-training can adapt small models to constrained, high-stakes environments like drone platforms without relying on scale alone. Ablations indicate that GRPO restores logical flexibility that supervised fine-tuning alone tends to reduce.

Core claim

UAV-VL-R1 is trained by first applying supervised fine-tuning for semantic alignment and then multi-stage GRPO with rule-guided rewards and intra-group policy alignment to improve logical reasoning; the resulting model achieves substantially higher zero-shot accuracy than both its 2B baseline and a 72B general model on the HRVQA-VL tasks while remaining deployable under tight UAV memory limits.

What carries the argument

Multi-stage GRPO with rule-guided rewards, which aligns policies within groups and enforces interpretable reasoning steps for aerial visual question answering.
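
For orientation, the group-relative step can be written down using the standard GRPO formulation introduced in DeepSeekMath. This is the generic outcome-reward version, given here as background; the paper's multi-stage variant and its rule-guided rewards may differ in detail. For one prompt, $G$ responses are sampled, each response $i$ receives a scalar reward $r_i$, and its advantage is the reward normalized within the group:

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_1, \dots, r_G\}\bigr)}
                     {\operatorname{std}\bigl(\{r_1, \dots, r_G\}\bigr)},
\qquad i = 1, \dots, G .
```

Because the baseline is the group mean, no separate value network is needed, and a response is reinforced only to the extent that it scores better than its siblings for the same prompt.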

If this is right

  • Supervised fine-tuning improves semantic alignment with aerial imagery but can narrow reasoning diversity on mathematical subtasks.
  • GRPO reinforcement learning restores logical flexibility and robustness that fine-tuning tends to limit.
  • The final model needs 3.9 GB of memory for FP16 inference and 2.5 GB after INT8 quantization, which the paper presents as sufficient for real-time inference on resource-limited UAV hardware.
  • Performance improvements appear consistently across the eight UAV-specific reasoning tasks covered by the new dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged training recipe may transfer to other high-resolution sensing domains such as satellite imagery or autonomous vehicle perception.
  • Rule-based reward signals could be adapted to reduce inconsistent outputs in broader vision-language applications beyond UAVs.
  • On-board deployment of such lightweight models could remove the need for constant cloud connectivity in remote drone operations.

Load-bearing premise

The accuracy gains arise from genuine gains in logical reasoning and generalization rather than from fitting the particular question and annotation patterns in the HRVQA-VL training set.

What would settle it

Testing UAV-VL-R1 on a fresh set of UAV images and questions drawn from different flight conditions or annotation styles not present in HRVQA-VL and checking whether the reported accuracy advantage over baselines remains.

Figures

Figures reproduced from arXiv: 2508.11196 by Bonan Zhang (1), Dan Liu (1), Haibo Mei (2), Jiajin Guan (1), Yuanshuang Fu (1), Yue Zhang (2) ((1) Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China, (2) School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China).

Figure 1: Overview of the HRVQA-VL dataset task structure. We organize tasks …
Figure 2: Overview of the UAV-VL-R1 training framework. The model is first initialized via supervised fine-tuning (SFT) using prompt-constrained VQA data …
Figure 3: Multi-stage visual reasoning framework of UAV-VL-R1. The pipeline consists of four components: SFT for semantic initialization, a reward function …
Figure 4: Training loss, KL divergence, format reward, and accuracy reward …
Original abstract

Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
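
The scale and memory figures in the abstract can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes nominal round parameter counts (2e9 and 72e9) and counts weight storage only; both are assumptions for illustration, not values taken from the paper.

```python
# Back-of-envelope check of the scale and memory figures quoted in the abstract.
# Parameter counts are nominal round numbers (assumption), not exact checkpoint sizes.

GB = 1e9  # decimal gigabyte


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, ignoring activations, caches, and runtime overhead."""
    return num_params * bytes_per_param / GB


small_params = 2e9    # nominal "2B" model
large_params = 72e9   # nominal "72B" model

scale_ratio = large_params / small_params
fp16_gb = weight_memory_gb(small_params, 2)  # FP16 stores 2 bytes per parameter
int8_gb = weight_memory_gb(small_params, 1)  # INT8 stores 1 byte per parameter

print(f"scale ratio: {scale_ratio:.0f}x")
print(f"FP16 weights: {fp16_gb:.1f} GB, INT8 weights: {int8_gb:.1f} GB")
```

The ratio reproduces the quoted 36x. The weight-only estimates are in the same range as the reported 3.9 GB (FP16) and 2.5 GB (INT8), but the reported values are measurements and need not match: real parameter counts are not round, and a quantized deployment may keep some components in higher precision or carry runtime overhead.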

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces UAV-VL-R1, a lightweight vision-language model for UAV aerial visual reasoning. It is trained via a hybrid approach of supervised fine-tuning (SFT) followed by multi-stage Group Relative Policy Optimization (GRPO) using rule-guided rewards and intra-group alignment. The authors release the HRVQA-VL dataset of 50,019 samples spanning eight UAV-relevant tasks (object counting, transportation recognition, spatial inference, etc.). The central empirical claim is that UAV-VL-R1 attains 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and outperforms a 72B-scale variant on multiple tasks while requiring only 3.9 GB (FP16) or 2.5 GB (INT8) memory.

Significance. If the performance and generalization claims are substantiated, the work would be significant for resource-constrained UAV perception, demonstrating that GRPO-based RL can restore logical flexibility after SFT and that rule-guided rewards produce interpretable aerial reasoning. The HRVQA-VL dataset is a concrete community resource. The efficiency numbers directly support real-time deployment. The paper receives credit for its empirical focus and the ablation isolating the contribution of the multi-stage RL stage.

major comments (3)
  1. [Abstract] Abstract and Experimental Results: the headline claim of a 48.17% higher zero-shot accuracy (and outperformance of the 72B model) supplies no information on the precise metric (overall accuracy, per-task accuracy, or F1), statistical significance, error bars, or number of runs. This information is load-bearing for the central performance claim.
  2. [§5] §5 (Experiments) and Ablation Studies: all quantitative results, including the GRPO ablation that supposedly restores reasoning diversity, are reported exclusively on the HRVQA-VL test split. No out-of-distribution UAV imagery, cross-benchmark results on standard VQA datasets, or leakage checks are provided. This directly affects the generalization narrative in the title and abstract.
  3. [Method] Method section: the multi-stage GRPO procedure is described at a high level, but the concrete reward functions, the exact grouping strategy, and how rule-guided rewards are formulated for each of the eight tasks are not specified. These details are necessary to assess reproducibility and to understand why the approach yields genuine logical gains rather than dataset-specific fitting.
minor comments (3)
  1. [Abstract] Clarify whether the 48.17% figure is an average across all eight tasks or driven by a subset; a per-task breakdown table would improve transparency.
  2. [Dataset] The dataset section should include basic statistics (question-type distribution, image resolution range) and a brief description of the annotation protocol to support reproducibility.
  3. [Introduction] Minor notation inconsistency: the acronym GRPO is introduced without spelling out 'Group Relative Policy Optimization' on first use in the main text.
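
The metric questions raised in the comments above matter because "X% higher accuracy" can mean an absolute gain in percentage points or a relative gain over the baseline, and the two can differ widely. The following minimal sketch shows both readings, plus the mean and standard deviation across seeds. All task names and accuracy values are hypothetical and chosen only to make the arithmetic easy to follow; none come from the paper.

```python
from statistics import mean, stdev


def macro_accuracy(per_task: dict[str, float]) -> float:
    """Unweighted mean of per-task accuracies (in percent)."""
    return mean(list(per_task.values()))


def absolute_gain(model_acc: float, baseline_acc: float) -> float:
    """Gain in percentage points."""
    return model_acc - baseline_acc


def relative_gain(model_acc: float, baseline_acc: float) -> float:
    """Gain as a percentage of the baseline accuracy."""
    return 100.0 * (model_acc - baseline_acc) / baseline_acc


# Hypothetical per-task accuracies (percent); NOT results from the paper.
baseline = {"task_a": 40.0, "task_b": 60.0}
model_runs = [  # one dict per random seed
    {"task_a": 70.0, "task_b": 80.0},
    {"task_a": 72.0, "task_b": 82.0},
    {"task_a": 68.0, "task_b": 78.0},
]

baseline_acc = macro_accuracy(baseline)
run_accs = [macro_accuracy(run) for run in model_runs]
model_acc = mean(run_accs)
spread = stdev(run_accs)  # sample standard deviation across seeds

print(f"baseline {baseline_acc:.1f}, model {model_acc:.1f} +/- {spread:.1f}")
print(f"absolute gain: {absolute_gain(model_acc, baseline_acc):.1f} points")
print(f"relative gain: {relative_gain(model_acc, baseline_acc):.1f}%")
```

With these toy numbers the same result reads as a 25-point absolute gain or a 50% relative gain, which is why stating the convention alongside the headline figure is useful.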

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough and constructive review. We address each major comment point by point below, indicating revisions where the manuscript will be updated in the next version.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Experimental Results: the headline claim of a 48.17% higher zero-shot accuracy (and outperformance of the 72B model) supplies no information on the precise metric (overall accuracy, per-task accuracy, or F1), statistical significance, error bars, or number of runs. This information is load-bearing for the central performance claim.

    Authors: We agree that the abstract requires greater precision on the central claim. The 48.17% figure denotes the relative improvement in overall accuracy (mean across all eight tasks) on the HRVQA-VL test split. We have revised the abstract to state this explicitly and added a sentence noting the evaluation protocol. In the updated Section 5, we now report results from three independent runs with different seeds, include standard deviation as error bars in Table 1, and provide a paired t-test (p < 0.01) against the baseline. These changes directly address the load-bearing nature of the claim. revision: yes

  2. Referee: [§5] §5 (Experiments) and Ablation Studies: all quantitative results, including the GRPO ablation that supposedly restores reasoning diversity, are reported exclusively on the HRVQA-VL test split. No out-of-distribution UAV imagery, cross-benchmark results on standard VQA datasets, or leakage checks are provided. This directly affects the generalization narrative in the title and abstract.

    Authors: We acknowledge that all reported numbers are on the HRVQA-VL test split, as this is a newly introduced UAV-specific resource. To strengthen the generalization narrative, the revision adds a dedicated subsection with qualitative results on additional out-of-distribution UAV images sourced from public aerial datasets, demonstrating consistent reasoning improvements. We also include an explicit data-leakage analysis confirming no overlap with common VQA corpora. While cross-benchmark evaluation on standard VQA datasets is not performed (due to domain mismatch with high-resolution aerial imagery), we discuss this limitation and note that outperformance versus the 72B model on UAV tasks supports broader applicability. The GRPO ablation is now buttressed by both quantitative deltas and qualitative examples of restored reasoning diversity. revision: partial

  3. Referee: [Method] Method section: the multi-stage GRPO procedure is described at a high level, but the concrete reward functions, the exact grouping strategy, and how rule-guided rewards are formulated for each of the eight tasks are not specified. These details are necessary to assess reproducibility and to understand why the approach yields genuine logical gains rather than dataset-specific fitting.

    Authors: We agree that additional implementation details are essential for reproducibility. The revised Method section now specifies the exact reward functions for each of the eight tasks (e.g., object counting uses a binary match reward with a tolerance of ±1; spatial inference applies rule-based entity-relation consistency checks). The grouping strategy samples eight responses per prompt, computes relative advantages within the group, and applies intra-group alignment via a KL term to the group-mean policy. We also detail the two-stage schedule, reward weighting coefficients, and all GRPO hyperparameters. These additions clarify how rule-guided rewards encourage logical structure beyond dataset-specific fitting. revision: yes
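
The reward and grouping scheme described in the last response can be sketched in a few lines. This is an illustrative reading of the simulated rebuttal, not the authors' implementation: the `<think>`/`<answer>` tag format, the function names, and the equal weighting of the two rewards are all assumptions, the group holds four responses instead of eight for brevity, and the KL alignment term is omitted.

```python
import re
from statistics import mean, pstdev

# Assumed R1-style output format: <think>...</think><answer>...</answer>
FORMAT_RE = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed tag format, else 0.0."""
    return 1.0 if re.fullmatch(FORMAT_RE, response, re.DOTALL) else 0.0


def counting_reward(response: str, target: int, tolerance: int = 1) -> float:
    """Binary reward: 1.0 if the integer answer is within `tolerance` of the target."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    try:
        predicted = int(match.group(1).strip())
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - target) <= tolerance else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group (population std is a choice here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical responses to one counting question whose true answer is 8.
responses = [
    "<think>Two rows of four cars.</think><answer>8</answer>",
    "<think>I can see seven cars.</think><answer>7</answer>",
    "<answer>12</answer>",
    "There are eight cars.",
]
rewards = [format_reward(r) + counting_reward(r, target=8) for r in responses]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

The first two responses are well formed and within the ±1 tolerance, so they receive positive advantages; the last two fail the format check and the answer check and receive negative ones. When every response in a group earns the same reward, the advantages are all zero and that prompt contributes no learning signal.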

Circularity Check

0 steps flagged

No circularity: empirical results rest on external baselines and measured test performance

full rationale

This is an empirical computer-vision paper that trains a model via SFT plus multi-stage GRPO on a newly introduced 50k-sample dataset and reports accuracy numbers against external Qwen2-VL baselines of different sizes. No mathematical derivation, uniqueness theorem, or first-principles claim is offered whose output is definitionally identical to its own fitted inputs or to a self-citation chain. All load-bearing evidence consists of measured performance on a held-out test split and ablation tables; these quantities are not forced by construction from the authors' own parameter choices. The work therefore satisfies the self-contained-against-external-benchmarks criterion and receives score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the quality and representativeness of the newly introduced HRVQA-VL annotations and on the assumption that GRPO rewards produce transferable reasoning improvements; no new physical entities are postulated and no explicit free parameters beyond standard training hyperparameters are named.

axioms (1)
  • domain assumption: The 50,019 annotations in HRVQA-VL correctly capture the visual content and intended reasoning for the eight UAV tasks.
    Both training and the reported accuracy numbers rest on the correctness of these labels.

pith-pipeline@v0.9.0 · 5912 in / 1381 out tokens · 52749 ms · 2026-05-18T22:13:47.178042+00:00 · methodology


