UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
Pith reviewed 2026-05-18 22:13 UTC · model grok-4.3
The pith
A lightweight UAV vision-language model trained with supervised fine-tuning and multi-stage GRPO outperforms much larger general models on aerial reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UAV-VL-R1 is trained in two steps: supervised fine-tuning for semantic alignment, then multi-stage GRPO with rule-guided rewards and intra-group policy alignment to improve logical reasoning. The resulting model reports 48.17% higher zero-shot accuracy than its 2B baseline (Qwen2-VL-2B-Instruct), outperforms the 72B variant on multiple HRVQA-VL tasks, and remains deployable under tight UAV memory limits.
What carries the argument
Multi-stage GRPO with rule-guided rewards, which aligns policies within groups and enforces interpretable reasoning steps for aerial visual question answering.
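As a rough illustration of the group-relative step, the sketch below normalizes rule-based rewards within one group of responses sampled for the same prompt, following the advantage definition commonly given for GRPO (reward minus group mean, divided by group standard deviation). The function name, the example rewards, the use of the population standard deviation, and the epsilon guard are assumptions for illustration; the paper's multi-stage schedule and intra-group alignment term are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for eight sampled responses (1.0 = rule satisfied).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group mean receive positive advantages and the rest negative ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.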
If this is right
- Supervised fine-tuning improves semantic alignment with aerial imagery but can narrow reasoning diversity on mathematical subtasks.
- GRPO reinforcement learning restores logical flexibility and robustness that fine-tuning tends to limit.
- The final model needs 3.9 GB of memory under FP16 and 2.5 GB after INT8 quantization, which supports real-time inference on resource-limited UAV hardware.
- Performance improvements appear consistently across the eight UAV-specific reasoning tasks covered by the new dataset.
Where Pith is reading between the lines
- The same staged training recipe may transfer to other high-resolution sensing domains such as satellite imagery or autonomous vehicle perception.
- Rule-based reward signals could be adapted to reduce inconsistent outputs in broader vision-language applications beyond UAVs.
- On-board deployment of such lightweight models could remove the need for constant cloud connectivity in remote drone operations.
Load-bearing premise
The accuracy gains reflect genuine improvements in logical reasoning and generalization rather than fitting to the particular question and annotation patterns in the HRVQA-VL training set.
What would settle it
Testing UAV-VL-R1 on a fresh set of UAV images and questions drawn from different flight conditions or annotation styles not present in HRVQA-VL and checking whether the reported accuracy advantage over baselines remains.
Original abstract
Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
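A back-of-the-envelope check on the footprint and scale figures: the sketch below estimates weight-only memory from a parameter count and the bytes stored per parameter. The 2-billion parameter count is a nominal assumption read off the "2B" model name, not a figure from the paper, and the estimate ignores activations, the KV cache, and any layers left unquantized, which may be why the reported INT8 figure (2.5 GB) sits above the weight-only estimate.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 2e9  # assumption: taken from the "2B" name, not measured

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # 2 bytes per FP16 weight
int8_gb = weight_memory_gb(NOMINAL_PARAMS, 1)  # 1 byte per INT8 weight

# Nominal size ratio between the 72B variant and the 2B model.
scale_ratio = 72e9 / NOMINAL_PARAMS

print(f"FP16 weights: ~{fp16_gb:.1f} GB, INT8 weights: ~{int8_gb:.1f} GB")
print(f"72B vs 2B: {scale_ratio:.0f}x")
```

Under these assumptions the FP16 estimate lands close to the reported 3.9 GB, and the nominal size ratio matches the "36x larger" statement in the abstract.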
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UAV-VL-R1, a lightweight vision-language model for UAV aerial visual reasoning. It is trained via a hybrid approach of supervised fine-tuning (SFT) followed by multi-stage Group Relative Policy Optimization (GRPO) using rule-guided rewards and intra-group alignment. The authors release the HRVQA-VL dataset of 50,019 samples spanning eight UAV-relevant tasks (object counting, transportation recognition, spatial inference, etc.). The central empirical claim is that UAV-VL-R1 attains 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and outperforms a 72B-scale variant on multiple tasks while requiring only 3.9 GB (FP16) or 2.5 GB (INT8) memory.
Significance. If the performance and generalization claims are substantiated, the work would be significant for resource-constrained UAV perception, demonstrating that GRPO-based RL can restore logical flexibility after SFT and that rule-guided rewards produce interpretable aerial reasoning. The HRVQA-VL dataset is a concrete community resource. The efficiency numbers directly support real-time deployment. The paper receives credit for its empirical focus and the ablation isolating the contribution of the multi-stage RL stage.
major comments (3)
- [Abstract] Abstract and Experimental Results: the headline claim of a 48.17% higher zero-shot accuracy (and outperformance of the 72B model) supplies no information on the precise metric (overall accuracy, per-task accuracy, or F1), statistical significance, error bars, or number of runs. This information is load-bearing for the central performance claim.
- [§5] §5 (Experiments) and Ablation Studies: all quantitative results, including the GRPO ablation that supposedly restores reasoning diversity, are reported exclusively on the HRVQA-VL test split. No out-of-distribution UAV imagery, cross-benchmark results on standard VQA datasets, or leakage checks are provided. This directly affects the generalization narrative in the title and abstract.
- [Method] Method section: the multi-stage GRPO procedure is described at a high level, but the concrete reward functions, the exact grouping strategy, and how rule-guided rewards are formulated for each of the eight tasks are not specified. These details are necessary to assess reproducibility and to understand why the approach yields genuine logical gains rather than dataset-specific fitting.
minor comments (3)
- [Abstract] Clarify whether the 48.17% figure is an average across all eight tasks or driven by a subset; a per-task breakdown table would improve transparency.
- [Dataset] The dataset section should include basic statistics (question-type distribution, image resolution range) and a brief description of the annotation protocol to support reproducibility.
- [Introduction] Minor notation inconsistency: the acronym GRPO is introduced without spelling out 'Group Relative Policy Optimization' on first use in the main text.
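The referee's questions about what the 48.17% figure measures matter numerically, because a relative gain and an absolute gain computed from the same accuracies can differ widely. The sketch below computes a macro-averaged accuracy over tasks and contrasts the two readings. All per-task accuracies are hypothetical placeholders, not results from the paper, and the helper names are illustrative.

```python
def macro_accuracy(per_task):
    """Unweighted mean of per-task accuracies."""
    return sum(per_task.values()) / len(per_task)


def absolute_gain(new, base):
    """Difference in accuracy, in fractional points."""
    return new - base


def relative_gain(new, base):
    """Improvement as a fraction of the baseline accuracy."""
    return (new - base) / base


# Hypothetical accuracies for three of the eight task types.
baseline = {"object_counting": 0.40, "transportation": 0.50, "scene_inference": 0.60}
tuned = {"object_counting": 0.70, "transportation": 0.75, "scene_inference": 0.80}

base_acc = macro_accuracy(baseline)
tuned_acc = macro_accuracy(tuned)
abs_gain = absolute_gain(tuned_acc, base_acc)
rel_gain = relative_gain(tuned_acc, base_acc)
```

With these placeholder numbers the same result reads as a 25-point absolute gain or a 50% relative gain, which is why a headline percentage needs its definition stated alongside it.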
Simulated Author's Rebuttal
We are grateful to the referee for the thorough and constructive review. We address each major comment point by point below, indicating revisions where the manuscript will be updated in the next version.
- Referee: [Abstract] Abstract and Experimental Results: the headline claim of a 48.17% higher zero-shot accuracy (and outperformance of the 72B model) supplies no information on the precise metric (overall accuracy, per-task accuracy, or F1), statistical significance, error bars, or number of runs. This information is load-bearing for the central performance claim.
  Authors: We agree that the abstract requires greater precision on the central claim. The 48.17% figure denotes the relative improvement in overall accuracy (mean across all eight tasks) on the HRVQA-VL test split. We have revised the abstract to state this explicitly and added a sentence noting the evaluation protocol. In the updated Section 5, we now report results from three independent runs with different seeds, include standard deviation as error bars in Table 1, and provide a paired t-test (p < 0.01) against the baseline. These changes directly address the load-bearing nature of the claim.
  Revision: yes
- Referee: [§5] §5 (Experiments) and Ablation Studies: all quantitative results, including the GRPO ablation that supposedly restores reasoning diversity, are reported exclusively on the HRVQA-VL test split. No out-of-distribution UAV imagery, cross-benchmark results on standard VQA datasets, or leakage checks are provided. This directly affects the generalization narrative in the title and abstract.
  Authors: We acknowledge that all reported numbers are on the HRVQA-VL test split, as this is a newly introduced UAV-specific resource. To strengthen the generalization narrative, the revision adds a dedicated subsection with qualitative results on additional out-of-distribution UAV images sourced from public aerial datasets, demonstrating consistent reasoning improvements. We also include an explicit data-leakage analysis confirming no overlap with common VQA corpora. While cross-benchmark evaluation on standard VQA datasets is not performed (due to domain mismatch with high-resolution aerial imagery), we discuss this limitation and note that outperformance versus the 72B model on UAV tasks supports broader applicability. The GRPO ablation is now buttressed by both quantitative deltas and qualitative examples of restored reasoning diversity.
  Revision: partial
- Referee: [Method] Method section: the multi-stage GRPO procedure is described at a high level, but the concrete reward functions, the exact grouping strategy, and how rule-guided rewards are formulated for each of the eight tasks are not specified. These details are necessary to assess reproducibility and to understand why the approach yields genuine logical gains rather than dataset-specific fitting.
  Authors: We agree that additional implementation details are essential for reproducibility. The revised Method section now specifies the exact reward functions for each of the eight tasks (e.g., object counting uses a binary match reward with a tolerance of ±1; spatial inference applies rule-based entity-relation consistency checks). The grouping strategy samples eight responses per prompt, computes relative advantages within the group, and applies intra-group alignment via a KL term to the group-mean policy. We also detail the two-stage schedule, reward weighting coefficients, and all GRPO hyperparameters. These additions clarify how rule-guided rewards encourage logical structure beyond dataset-specific fitting.
  Revision: yes
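To make the reward description in the last response concrete, here is a minimal sketch of a rule-based reward for the counting task that combines a format check with a tolerance-based answer check. The `<think>`/`<answer>` template, the function names, and the equal weighting of the two terms are assumptions for illustration (an R1-style convention); only the ±1 tolerance comes from the simulated rebuttal text, and none of this is the authors' released code.

```python
import re

# Assumed output template: reasoning in <think>, final answer in <answer>.
ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def format_reward(response):
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if ANSWER_PATTERN.match(response.strip()) else 0.0


def counting_reward(response, target, tolerance=1):
    """1.0 if the predicted count is within `tolerance` of the target."""
    match = ANSWER_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    numbers = re.findall(r"-?\d+", match.group(1))
    if not numbers:
        return 0.0
    predicted = int(numbers[-1])  # use the last integer inside <answer>
    return 1.0 if abs(predicted - target) <= tolerance else 0.0


def total_reward(response, target):
    """Sum of format and correctness terms (equal weights assumed)."""
    return format_reward(response) + counting_reward(response, target)


example = "<think>I see four cars and one van.</think><answer>5</answer>"
print(total_reward(example, 5))
```

A response that ignores the template earns nothing from either term, so the policy is pushed toward structured output before correctness can pay off. Note that a ±1 tolerance also rewards near misses, which is a design choice worth reporting alongside exact-match accuracy.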
Circularity Check
No circularity: empirical results rest on external baselines and measured test performance
This is an empirical computer-vision paper that trains a model via SFT plus multi-stage GRPO on a newly introduced 50k-sample dataset and reports accuracy numbers against external Qwen2-VL baselines of different sizes. No mathematical derivation, uniqueness theorem, or first-principles claim is offered whose output is definitionally identical to its own fitted inputs or to a self-citation chain. All load-bearing evidence consists of measured performance on a held-out test split and ablation tables; these quantities are not forced by construction from the authors' own parameter choices. The work therefore satisfies the self-contained-against-external-benchmarks criterion and receives score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 50,019 annotations in HRVQA-VL correctly capture the visual content and intended reasoning for the eight UAV tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm... dual-objective reward function combining format compliance and answer correctness"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference... three-phase training process (Stage-A/B/C)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.