ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Pith reviewed 2026-05-18 07:20 UTC · model grok-4.3
The pith
ViSurf unifies supervised fine-tuning and reinforcement learning with verifiable rewards into one training stage for large vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViSurf is a single-stage framework that integrates supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) by injecting ground-truth labels directly into RLVR rollouts, so that external supervision and internal reinforcement occur together. Three reward control strategies support this integration and are meant to keep training stable and well optimized.
What carries the argument
The unified objective that places ground-truth labels inside RLVR rollouts together with three reward control strategies that balance supervision and reinforcement signals.
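The label-injection idea can be illustrated with a toy calculation. This is a minimal sketch, not the paper's method: it assumes a GRPO-style group-normalized advantage, which the text above does not confirm, and the names `group_advantages`, `advantages_with_label`, and the value `label_reward=1.0` are hypothetical. The three reward control strategies are not described in enough detail here to reproduce, so they are left out.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style normalization over one group of rollouts.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def advantages_with_label(rollout_rewards, label_reward=1.0):
    """Append the ground-truth label as one extra rollout, then normalize.

    label_reward is a hypothetical choice for illustration only.
    Returns (advantages of the sampled rollouts, advantage of the label).
    """
    rewards = list(rollout_rewards) + [label_reward]
    adv = group_advantages(rewards)
    return adv[:-1], adv[-1]


# Hard prompt: every self-generated rollout fails the verifier.
failed = [0.0, 0.0, 0.0, 0.0]

# Without the label, all advantages are zero and there is no learning signal.
plain_adv = group_advantages(failed)

# With the label in the group, the label gets a positive advantage
# and the failed rollouts get negative ones.
rollout_adv, label_adv = advantages_with_label(failed)
```

Under these assumptions, the injected label is the only source of gradient signal when every sampled rollout fails, which is one way to read the claim that supervision and reinforcement happen in the same update.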
If this is right
- Models reach higher scores on diverse benchmarks than when using SFT or RLVR in isolation.
- The single-stage process removes the extra compute required by running SFT followed by RLVR.
- Catastrophic forgetting that appears in two-stage pipelines is avoided.
- The same label-injection and reward-control pattern applies across multiple vision-language evaluation sets.
Where Pith is reading between the lines
- The method could extend to other multimodal or language-only models by reusing the same label-injection pattern.
- Training time and memory use might drop enough to allow larger batch sizes or longer context lengths in post-training.
- Different label-injection schedules or reward weighting could be tested to further reduce any residual instability.
Load-bearing premise
Directly adding ground-truth labels to reinforcement learning rollouts plus the three reward controls will keep optimization stable and avoid new instabilities or forgetting.
What would settle it
On standard vision-language benchmarks, ViSurf produces lower scores than a sequential SFT-then-RLVR pipeline or shows clear training divergence or forgetting.
Original abstract
Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT → RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ViSurf, a unified single-stage fine-tuning method for Large Vision-and-Language Models that integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). It does so by injecting ground-truth labels directly into RLVR rollouts and introducing three novel reward control strategies to maintain stability. The central claim is that this approach simultaneously provides external supervision and internal reinforcement, outperforming standalone SFT, standalone RLVR, and the traditional sequential SFT→RLVR pipeline on diverse benchmarks while avoiding catastrophic forgetting and reducing computational overhead. In-depth analysis is said to validate the derivation and design.
Significance. If the stability and performance claims hold under the proposed controls, ViSurf would offer a practical simplification of LVLM post-training pipelines. The single-stage unification addresses real limitations of current sequential methods, such as overhead and forgetting, and could influence how future work combines supervised and reinforcement objectives in vision-language settings. The absence of free parameters or invented entities in the core derivation is a positive structural feature.
major comments (3)
- [§3] Unified Objective: The derivation that injects ground-truth labels into RLVR rollouts to create a joint objective provides no explicit analysis or bound on gradient interference between the supervised term and the RL term. This is load-bearing for the stability claim, as the three reward controls are asserted to guarantee stable optimization without supporting math or empirical isolation of interference effects.
- [§4.3] Experiments and Ablations: The reported consistent outperformance across benchmarks lacks ablations that isolate each of the three reward control strategies, and no variance is reported across random seeds or model scales. Without these, it is unclear whether gains are attributable to the unified framework or to hyperparameter tuning, directly affecting the central empirical claim.
- [§5] Analysis of Forgetting: The in-depth analysis asserts avoidance of catastrophic forgetting, yet provides no quantitative retention metrics on prior tasks or representation drift measurements after the joint update. This leaves the claim that the single-stage method prevents forgetting insufficiently supported.
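The gradient-interference concern in the first major comment could be probed empirically by measuring how well the supervised and RL gradients align. The following is a minimal sketch under stated assumptions: gradients are represented as flat lists of floats, the function name `cosine_similarity` and the toy vectors are hypothetical, and nothing here comes from the paper's own analysis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors.

    Near +1: the two terms push parameters in the same direction.
    Near -1: the two terms conflict (interference).
    Returns 0.0 if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins for the supervised-term and RL-term gradients.
g_supervised = [1.0, 0.0, 2.0]
g_rl_aligned = [2.0, 0.0, 4.0]
g_rl_conflicting = [-1.0, 0.0, -2.0]

aligned = cosine_similarity(g_supervised, g_rl_aligned)
conflicting = cosine_similarity(g_supervised, g_rl_conflicting)
```

Tracking this quantity over training steps, with and without each reward control, would be one way to isolate interference effects; whether it is the right diagnostic for ViSurf is a judgment call rather than something the source establishes.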
minor comments (2)
- [Abstract] The abstract and §1 could more explicitly list the specific benchmarks and model scales used in the 'extensive experiments' to allow immediate assessment of scope.
- [§3.3] Notation for the three reward control strategies is introduced without a compact summary table; adding one would improve readability of the design principles.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we will make to strengthen the paper.
- Referee: [§3] Unified Objective: The derivation that injects ground-truth labels into RLVR rollouts to create a joint objective provides no explicit analysis or bound on gradient interference between the supervised term and the RL term. This is load-bearing for the stability claim, as the three reward controls are asserted to guarantee stable optimization without supporting math or empirical isolation of interference effects.
  Authors: We appreciate the referee pointing out the need for more rigorous support for the stability of the joint objective. The reward controls are intended to balance the supervised and RL terms by scaling rewards based on label consistency and rollout variance, which empirically reduces interference. However, we agree that an explicit bound would be valuable. In the revised manuscript, we will add a section providing a preliminary analysis of gradient norms and interference under the proposed controls, including a simple bound derived from the reward scaling factors. We will also include empirical plots showing gradient alignment before and after applying the controls.
  Revision: yes
- Referee: [§4.3] Experiments and Ablations: The reported consistent outperformance across benchmarks lacks ablations that isolate each of the three reward control strategies, and no variance is reported across random seeds or model scales. Without these, it is unclear whether gains are attributable to the unified framework or to hyperparameter tuning, directly affecting the central empirical claim.
  Authors: We acknowledge that the current ablations do not fully isolate the individual contributions of each reward control strategy. To address this, we will expand the experimental section with new ablations that disable one control at a time while keeping the others active, reporting the performance drop on key benchmarks. Furthermore, we will rerun the main experiments with at least three different random seeds and report the mean and standard deviation to quantify variance. For model scales, we will add results on a smaller model variant to demonstrate consistency across scales, subject to computational availability.
  Revision: yes
- Referee: [§5] Analysis of Forgetting: The in-depth analysis asserts avoidance of catastrophic forgetting, yet provides no quantitative retention metrics on prior tasks or representation drift measurements after the joint update. This leaves the claim that the single-stage method prevents forgetting insufficiently supported.
  Authors: We thank the referee for this observation. Our current analysis relies on maintained performance on diverse benchmarks post-training as indirect evidence against forgetting. To provide more direct support, we will include quantitative retention metrics by measuring accuracy on a set of tasks from the pre-training or SFT phase before and after ViSurf training. Additionally, we will compute representation drift using metrics such as the average cosine distance between embeddings of the same inputs extracted from intermediate layers at different training stages. These additions will be incorporated into the revised §5.
  Revision: yes
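The two measurements promised in the last response, accuracy retention and average cosine distance between embeddings, can be sketched as follows. This is an illustrative sketch only: the function names, the ratio-based definition of retention, and the toy embeddings are assumptions, and the rebuttal does not specify which layers, inputs, or tasks would be used.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_cosine_drift(before, after):
    """Average cosine distance between embeddings of the same inputs,
    taken from the same layer at two different checkpoints."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty, equally sized embedding lists")
    total = sum(cosine_distance(b, a) for b, a in zip(before, after))
    return total / len(before)


def retention(acc_before, acc_after):
    """Fraction of prior-task accuracy kept after training
    (hypothetical definition: acc_after / acc_before)."""
    return acc_after / acc_before


# Toy embeddings for two inputs at two checkpoints.
emb_before = [[1.0, 0.0], [0.0, 1.0]]
emb_after = [[1.0, 0.0], [1.0, 0.0]]  # first unchanged, second rotated 90 degrees

drift = mean_cosine_drift(emb_before, emb_after)
kept = retention(0.80, 0.76)
```

A drift near 0 together with retention near 1 would support the no-forgetting claim; a comparison against the sequential SFT → RLVR pipeline on the same inputs would be needed to make the result meaningful.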
Circularity Check
No circularity: ViSurf proposes an explicit new integration of SFT and RLVR objectives
The paper derives its unified single-stage objective by directly analyzing and combining the standard SFT and RLVR loss formulations, then injecting ground-truth labels into rollouts as a design choice rather than a fitted or self-referential step. No equations reduce to prior results by construction, no self-citations are load-bearing for the central claim, and the three reward controls are presented as novel additions whose stability is asserted via experiment rather than definition. The derivation chain remains self-contained against external benchmarks and does not rename or smuggle in known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ground-truth labels and verifiable rewards are available and can be directly injected into RLVR rollouts without destabilizing training.