arxiv: 2510.10606 · v4 · submitted 2025-10-12 · 💻 cs.CV

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Pith reviewed 2026-05-18 07:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · supervised fine-tuning · reinforcement learning · unified fine-tuning · reward control · post-training · large multimodal models · RLVR

The pith

ViSurf unifies supervised fine-tuning and reinforcement learning with verifiable rewards into one training stage for large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that large vision-language models can be post-trained more effectively by merging supervised fine-tuning and reinforcement learning into a single stage rather than running them separately or in sequence. Current approaches either cap performance because they stay within the model's existing knowledge or incur high compute costs and lose prior capabilities when switching stages. ViSurf addresses this by feeding ground-truth labels straight into the reinforcement learning rollouts and adding three reward control mechanisms that keep the combined process stable. If the approach holds, training pipelines become simpler while models reach higher accuracy across vision and language tasks without extra overhead or forgetting.

Core claim

ViSurf creates a single-stage framework that integrates supervised fine-tuning and reinforcement learning with verifiable rewards by directly injecting ground-truth labels into RLVR rollouts, allowing external supervision and internal reinforcement to occur together, and supports this integration with three reward control strategies that maintain training stability and optimization.

What carries the argument

The unified objective that places ground-truth labels inside RLVR rollouts together with three reward control strategies that balance supervision and reinforcement signals.
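The label-injection idea can be illustrated with a minimal sketch. It assumes a GRPO-style group-normalized advantage, an exact-match verifier, and a fixed reward of 1.0 for the injected label; none of these details are confirmed by the text above, and the paper's three reward control strategies are not modeled. All function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps). Assumed, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def exact_match(output, target):
    """Hypothetical verifiable reward: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == target.strip() else 0.0


def build_group(rollouts, ground_truth, reward_fn, gt_reward=1.0):
    """Append the ground-truth label to the sampled rollouts and score the group.

    gt_reward is an assumed fixed reward for the injected label.
    """
    samples = list(rollouts) + [ground_truth]
    rewards = [reward_fn(o, ground_truth) for o in rollouts] + [gt_reward]
    return samples, rewards, group_advantages(rewards)


# Every sampled rollout is wrong here, so without injection all rewards would be
# equal and the group would carry no learning signal.
samples, rewards, advantages = build_group(
    rollouts=["cat", "bird"], ground_truth="dog", reward_fn=exact_match
)
for sample, reward, adv in zip(samples, rewards, advantages):
    print(f"{sample!r}: reward={reward}, advantage={adv:+.3f}")
```

In this toy case only the injected label receives a positive advantage, which is one way to read "external supervision" inside a reinforcement update: the ground truth acts as a guaranteed high-reward sample when the model's own rollouts fail.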

If this is right

  • Models reach higher scores on diverse benchmarks than when using SFT or RLVR in isolation.
  • The single-stage process removes the extra compute required by running SFT followed by RLVR.
  • Catastrophic forgetting that appears in two-stage pipelines is avoided.
  • The same label-injection and reward-control pattern applies across multiple vision-language evaluation sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other multimodal or language-only models by reusing the same label-injection pattern.
  • Training time and memory use might drop enough to allow larger batch sizes or longer context lengths in post-training.
  • Different label-injection schedules or reward weighting could be tested to further reduce any residual instability.

Load-bearing premise

Directly adding ground-truth labels to reinforcement learning rollouts plus the three reward controls will keep optimization stable and avoid new instabilities or forgetting.

What would settle it

On standard vision-language benchmarks, ViSurf produces lower scores than a sequential SFT-then-RLVR pipeline or shows clear training divergence or forgetting.

Figures

Figures reproduced from arXiv: 2510.10606 by Bei Yu, Jiaya Jia, Jiazhen Liu, Liangyu Chen, Mingkang Zhu, Yuqi Liu, Zhisheng Zhong.

Figure 1: (a) Examples of vision-and-language tasks. (b) For tasks …
Figure 2: Radar Chart: ViSurf achieves superior performance …
Figure 3: (a) Illustration on Non-Object Segmentation and Vision …
Figure 4: ViSurf Framework. Upper: The integration of external guidance …
Figure 6: Performance on gRefCOCO in different training steps.
Figure 7: Visualization of ViSurf on various tasks.
Original abstract

Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT → RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ViSurf, a unified single-stage fine-tuning method for Large Vision-and-Language Models that integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). It does so by injecting ground-truth labels directly into RLVR rollouts and introducing three novel reward control strategies to maintain stability. The central claim is that this approach simultaneously provides external supervision and internal reinforcement, outperforming standalone SFT, standalone RLVR, and the traditional sequential SFT→RLVR pipeline on diverse benchmarks while avoiding catastrophic forgetting and reducing computational overhead. In-depth analysis is said to validate the derivation and design.

Significance. If the stability and performance claims hold under the proposed controls, ViSurf would offer a practical simplification of LVLM post-training pipelines. The single-stage unification addresses real limitations of current sequential methods, such as overhead and forgetting, and could influence how future work combines supervised and reinforcement objectives in vision-language settings. The absence of free parameters or invented entities in the core derivation is a positive structural feature.

major comments (3)
  1. [§3] §3 (Unified Objective): The derivation that injects ground-truth labels into RLVR rollouts to create a joint objective provides no explicit analysis or bound on gradient interference between the supervised term and the RL term. This is load-bearing for the stability claim, as the three reward controls are asserted to guarantee stable optimization without supporting math or empirical isolation of interference effects.
  2. [§4.3] §4.3 (Experiments and Ablations): The reported consistent outperformance across benchmarks lacks ablations that isolate each of the three reward control strategies, and no variance is reported across random seeds or model scales. Without these, it is unclear whether gains are attributable to the unified framework or to hyperparameter tuning, directly affecting the central empirical claim.
  3. [§5] §5 (Analysis of Forgetting): The in-depth analysis asserts avoidance of catastrophic forgetting, yet provides no quantitative retention metrics on prior tasks or representation drift measurements after the joint update. This leaves the claim that the single-stage method prevents forgetting insufficiently supported.
minor comments (2)
  1. [Abstract] The abstract and §1 could more explicitly list the specific benchmarks and model scales used in the 'extensive experiments' to allow immediate assessment of scope.
  2. [§3.3] Notation for the three reward control strategies is introduced without a compact summary table; adding one would improve readability of the design principles.
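The ablation and variance reporting requested in major comment 2 can be organized as in the sketch below. It is only an illustration of the protocol: the three control names are placeholders (the report does not name the strategies), and the numbers passed to `summarize` are dummy values, not benchmark scores.

```python
from statistics import mean, stdev

# Placeholder names; the actual reward control strategies are not named in this report.
CONTROLS = ("control_a", "control_b", "control_c")


def leave_one_out(controls):
    """Yield the full configuration, then each variant with exactly one control disabled."""
    yield {c: True for c in controls}
    for disabled in controls:
        yield {c: (c != disabled) for c in controls}


def summarize(scores):
    """Mean and sample standard deviation of one metric across random seeds."""
    return mean(scores), stdev(scores)


configs = list(leave_one_out(CONTROLS))
for cfg in configs:
    print(cfg)

# Dummy per-seed values, used only to show the reporting format.
avg, sd = summarize([1.0, 2.0, 3.0])
print(f"{avg:.2f} ± {sd:.2f}")
```

Running each of the four configurations over the same set of seeds separates the contribution of each control from seed-to-seed noise, which is the distinction the referee asks for.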

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we will make to strengthen the paper.

point-by-point responses
  1. Referee: [§3] §3 (Unified Objective): The derivation that injects ground-truth labels into RLVR rollouts to create a joint objective provides no explicit analysis or bound on gradient interference between the supervised term and the RL term. This is load-bearing for the stability claim, as the three reward controls are asserted to guarantee stable optimization without supporting math or empirical isolation of interference effects.

    Authors: We appreciate the referee pointing out the need for more rigorous support for the stability of the joint objective. The reward controls are intended to balance the supervised and RL terms by scaling rewards based on label consistency and rollout variance, which empirically reduces interference. However, we agree that an explicit bound would be valuable. In the revised manuscript, we will add a section providing a preliminary analysis of gradient norms and interference under the proposed controls, including a simple bound derived from the reward scaling factors. We will also include empirical plots showing gradient alignment before and after applying the controls. revision: yes

  2. Referee: [§4.3] §4.3 (Experiments and Ablations): The reported consistent outperformance across benchmarks lacks ablations that isolate each of the three reward control strategies, and no variance is reported across random seeds or model scales. Without these, it is unclear whether gains are attributable to the unified framework or to hyperparameter tuning, directly affecting the central empirical claim.

    Authors: We acknowledge that the current ablations do not fully isolate the individual contributions of each reward control strategy. To address this, we will expand the experimental section with new ablations that disable one control at a time while keeping others active, reporting the performance drop on key benchmarks. Furthermore, we will rerun the main experiments with at least three different random seeds and report mean and standard deviation to quantify variance. For model scales, we will add results on a smaller model variant to demonstrate consistency across scales, subject to computational availability. revision: yes

  3. Referee: [§5] §5 (Analysis of Forgetting): The in-depth analysis asserts avoidance of catastrophic forgetting, yet provides no quantitative retention metrics on prior tasks or representation drift measurements after the joint update. This leaves the claim that the single-stage method prevents forgetting insufficiently supported.

    Authors: We thank the referee for this observation. Our current analysis relies on maintained performance on diverse benchmarks post-training as indirect evidence against forgetting. To provide more direct support, we will include quantitative retention metrics by measuring accuracy on a set of tasks from the pre-training or SFT phase before and after ViSurf training. Additionally, we will compute representation drift using metrics such as the average cosine distance between embeddings of the same inputs extracted from intermediate layers at different training stages. These additions will be incorporated into the revised §5. revision: yes
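The drift metric mentioned in the third response, the average cosine distance between embeddings of the same inputs at two training stages, could be computed as in the sketch below. This is a plain-Python illustration with hypothetical function names; it assumes embeddings are already extracted as equal-length, non-zero vectors and says nothing about which layers or inputs the authors would use.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero and equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_drift(before, after):
    """Average cosine distance between paired embeddings of the same inputs."""
    if len(before) != len(after):
        raise ValueError("before and after must contain the same number of embeddings")
    distances = [cosine_distance(u, v) for u, v in zip(before, after)]
    return sum(distances) / len(distances)


# Toy vectors: the first embedding is unchanged, the second rotates by 90 degrees.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0]]
print(representation_drift(before, after))  # 0.5
```

A value near 0 indicates that representations barely moved; larger values indicate more drift, though drift alone does not show that task performance was lost.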

Circularity Check

0 steps flagged

No circularity: ViSurf proposes an explicit new integration of SFT and RLVR objectives

full rationale

The paper derives its unified single-stage objective by directly analyzing and combining the standard SFT and RLVR loss formulations, then injecting ground-truth labels into rollouts as a design choice rather than a fitted or self-referential step. No equations reduce to prior results by construction, no self-citations are load-bearing for the central claim, and the three reward controls are presented as novel additions whose stability is asserted via experiment rather than definition. The derivation chain remains self-contained against external benchmarks and does not rename or smuggle in known results.
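For orientation, the two standard objectives that the rationale refers to can be written in their textbook forms. The notation below (input $x$, ground-truth label $y^*$, sampled rollout $o$, verifiable reward $r$, policy $\pi_\theta$) is chosen here for illustration and is not taken from the paper, whose actual objective may include group normalization, clipping, or other terms.

```latex
% Supervised fine-tuning: negative log-likelihood of the ground-truth label
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^*)}\big[\log \pi_\theta(y^* \mid x)\big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^*)}\big[\nabla_\theta \log \pi_\theta(y^* \mid x)\big]

% Reinforcement learning with a verifiable reward: expected reward of sampled rollouts
J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x,\; o \sim \pi_\theta(\cdot \mid x)}\big[r(x, o)\big],
\qquad
\nabla_\theta J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x,\; o \sim \pi_\theta(\cdot \mid x)}\big[r(x, o)\,\nabla_\theta \log \pi_\theta(o \mid x)\big]
```

Both gradients are weighted sums of $\nabla_\theta \log \pi_\theta$ terms. Descending $\mathcal{L}_{\mathrm{SFT}}$ has the same form as a policy-gradient ascent step evaluated on the fixed sample $y^*$ with weight 1, which is why treating the ground-truth label as one more rollout is a natural way to combine the two signals.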

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard domain assumptions about the availability of ground-truth labels and verifiable rewards in LVLM training; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Ground-truth labels and verifiable rewards are available and can be directly injected into RLVR rollouts without destabilizing training.
    The unified framework and reward control strategies presuppose these elements exist and function as described.

