arxiv: 2601.06993 · v2 · submitted 2026-01-11 · 💻 cs.CV

Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Pith reviewed 2026-05-16 14:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained visual classification · chain-of-thought reasoning · multi-modal large language models · reasoning length · multi-reward optimization · FGVC · MLLMs · cost of thinking

The pith

Longer textual reasoning lowers accuracy for MLLMs on fine-grained visual classification

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-modal large language models handle many tasks well yet fall short on fine-grained visual classification that needs precise discrimination of subtle visual cues. Chain-of-thought prompting, which aids math and coding, here reduces accuracy. Systematic tests across zero-shot and trained settings show the drop traces mainly to the length of the generated reasoning text. The work names this the Cost of Thinking and counters it with MRN, a normalization that balances multiple reward signals, plus the ReFine-RFT framework that limits reasoning length while rewarding classification accuracy. The result is state-of-the-art performance on standard FGVC benchmarks.

Core claim

Across zero-shot and multiple training settings, the degradation induced by CoT is largely driven by reasoning length: longer textual reasoning consistently lowers classification accuracy. The authors term this the Cost of Thinking and introduce MRN, a plug-and-play normalization for multi-reward optimization that balances heterogeneous signals, together with ReFine-RFT, a framework that combines ensemble rewards with MRN to constrain reasoning length while supplying dense accuracy-oriented feedback.

What carries the argument

The Cost of Thinking phenomenon, in which longer textual reasoning lowers classification accuracy, addressed through MRN normalization that balances heterogeneous reward signals during multi-reward optimization.
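
The exact MRN formula is not reproduced on this page. The following is a minimal sketch, assuming MRN amounts to standardizing each reward function's scores across a group of sampled responses before summing them into a GRPO-style advantage. The function names, the `eps` constant, and the toy scores are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize a list of scores to zero mean and (roughly) unit scale."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def naive_advantages(rewards_by_fn):
    """Baseline: sum raw rewards per response, then normalize the totals."""
    totals = [sum(scores) for scores in zip(*rewards_by_fn.values())]
    return zscore(totals)


def per_reward_normalized_advantages(rewards_by_fn):
    """Assumed MRN-style variant: normalize each reward separately, then sum."""
    normalized = [zscore(scores) for scores in rewards_by_fn.values()]
    return [sum(column) for column in zip(*normalized)]


# Hypothetical group of 4 sampled responses scored by two reward functions
# whose value ranges differ by more than an order of magnitude.
group = {
    "format": [1.0, 1.0, 0.0, 1.0],          # binary reward in [0, 1]
    "similarity": [40.0, 10.0, 45.0, 20.0],  # wide range, dominates a raw sum
}
naive = naive_advantages(group)
balanced = per_reward_normalized_advantages(group)
```

With these made-up scores, the raw sum gives the highest advantage to the third response even though it fails the format check, because the wide-range score swamps the binary one. Normalizing each reward first ranks the first response highest instead.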

If this is right

  • Constraining reasoning length prevents accuracy losses on fine-grained visual tasks.
  • MRN enables stable optimization when combining rewards of different types.
  • ReFine-RFT reaches state-of-the-art results across FGVC benchmarks.
  • The length-control approach works in both zero-shot and fine-tuning regimes.
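
As an illustration of how a length constraint and an accuracy signal could coexist, here is a minimal sketch of three rule-based rewards. It assumes responses follow a `<think>…</think><answer>…</answer>` template; the word budget, the linear decay, and every function name are hypothetical. The model-based rewards mentioned in the paper (MLLM-based accuracy and embedding similarity) are not sketched here.

```python
import re

RESPONSE_RE = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(response):
    """1.0 if the response matches the assumed think/answer template, else 0.0."""
    return 1.0 if RESPONSE_RE.match(response) else 0.0


def accuracy_reward(response, label):
    """1.0 if the extracted answer equals the label (case-insensitive), else 0.0."""
    m = RESPONSE_RE.match(response)
    if m is None:
        return 0.0
    return 1.0 if m.group("answer").strip().lower() == label.strip().lower() else 0.0


def length_reward(response, budget=32):
    """Hypothetical length term: 1.0 within the word budget, then a linear
    decay that reaches 0.0 at twice the budget."""
    m = RESPONSE_RE.match(response)
    if m is None:
        return 0.0
    n_words = len(m.group("think").split())
    if n_words <= budget:
        return 1.0
    return max(0.0, 1.0 - (n_words - budget) / budget)


def ensemble_rewards(response, label, budget=32):
    """Return each reward separately so a downstream normalizer can balance them."""
    return {
        "format": format_reward(response),
        "accuracy": accuracy_reward(response, label),
        "length": length_reward(response, budget),
    }


# Toy responses; the labels are placeholders, not results from the paper.
short = "<think>Four engines, narrow fuselage.</think><answer>Boeing 707-320</answer>"
long_ = "<think>" + "word " * 48 + "</think><answer>DC-8</answer>"
print(ensemble_rewards(short, "Boeing 707-320"))
print(ensemble_rewards(long_, "Boeing 707-320"))
```

Keeping the rewards as separate entries rather than one scalar is a deliberate choice in this sketch: it leaves room for a per-reward normalization step before they are combined.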

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt design for vision-language models could routinely include explicit length limits on reasoning steps.
  • Similar length-based performance costs may appear in other perception-heavy tasks such as medical imaging or autonomous driving.
  • Dynamic length control that adapts to input difficulty could further improve results.
  • Pairing the method with model-distillation techniques might support efficient real-world deployment.

Load-bearing premise

The accuracy drop is caused by reasoning length rather than by correlated factors such as prompt style or total token budget.

What would settle it

A controlled experiment that holds total output length fixed while varying only the amount of reasoning content, or that matches lengths between CoT and direct-answer prompts, to test whether length alone accounts for the accuracy change.
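
One way to read such an experiment is to compare prompt conditions inside matched length buckets. The sketch below assumes per-sample records with a prompt condition, an output length, and a correctness flag; the record schema, bucket edges, and toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_length_bucket(records, edges):
    """Group records by (condition, length bucket) and return accuracy per group.

    records: iterable of dicts with keys "condition", "length", "correct".
    edges: ascending bucket upper bounds; one open-ended bucket is appended.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in records:
        # Index of the first edge that the length fits under, else the last bucket.
        bucket = next((i for i, e in enumerate(edges) if r["length"] <= e), len(edges))
        key = (r["condition"], bucket)
        counts[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / counts[key] for key in counts}


# Hypothetical toy records, not results from the paper.
records = [
    {"condition": "direct", "length": 5, "correct": True},
    {"condition": "direct", "length": 8, "correct": True},
    {"condition": "cot", "length": 9, "correct": True},
    {"condition": "cot", "length": 60, "correct": False},
    {"condition": "cot", "length": 75, "correct": True},
    {"condition": "cot", "length": 200, "correct": False},
]
table = accuracy_by_length_bucket(records, edges=[16, 128])
```

If the accuracy gap between conditions disappears within a bucket, that is consistent with the length explanation. A gap that persists at matched lengths would point toward prompt style or another correlated factor.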

Figures

Figures reproduced from arXiv: 2601.06993 by Jie Zhu, Xiaoming Liu, Yiyang Su.

Figure 1
Figure 1: Performance degradation with CoT and reasoning collapse in RFT. In zero-shot evaluation (top), MLLMs predict the correct label directly, but adding CoT reasoning leads to a wrong answer. During RFT (bottom), reasoning length steadily shrinks while accuracy improves, indicating a reasoning collapse.
Figure 2
Figure 2: Dynamics of reasoning length during RFT across FGVC datasets. The dark green lines denote the running average of completion lengths throughout RFT FGVC tasks. Across all datasets, the reasoning content length rapidly decreases and stabilizes at a shorter range, suggesting that RFT discourages excessive reasoning generation and promotes concise, decision-focused responses. [Zero-shot: average content length…
Figure 3
Figure 3: Impact of reasoning length on FGVC performance. We analyze the relationship between average reasoning (thinking) length and classification accuracy across FGVC datasets. As the average thinking length increases, performance consistently declines, indicating that excessive reasoning generation introduces noise or distracts the model from key discriminative visual cues.
Figure 4
Figure 4: Overview of ReFine-RFT. Given a question, the model generates multiple candidate responses, each evaluated using an ensemble reward that combines rule-based rewards and model-based rewards like MLLM-based accuracy reward and embedding similarity reward. The proposed MRN then normalizes the rewards for each function to compute the final advantages used to update the MLLM.
Figure 5
Figure 5: Differences among rewards during training. Each reward exhibits distinct convergence speed, value range, and saturation point, reflecting the heterogeneity of different rewards.
Figure 7
Figure 7: Training reward and its standard deviation comparison on Aircrafts-102. MRN + GRPO achieves consistently higher reward values and lower variance throughout training, indicating improved stability and optimization efficiency.
Figure 8
Figure 8: Comparison of responses. SFT-CoT and Visual-RFT produce long reasoning with incorrect answers, while ReFine-RFT achieves concise reasoning and higher accuracy. More results and analyses are in the supplementary.
Figure 6
Figure 6: Reward curves of ReFine-RFT on Flowers-102. Rewards consistently increase over training, demonstrating the effectiveness of our reward design.
Original abstract

Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) MRN, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with MRN to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Project page: https://refine-rft.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the effect of Chain-of-Thought (CoT) reasoning on Multi-modal Large Language Models (MLLMs) for Fine-Grained Visual Classification (FGVC). It reports that CoT degrades performance across zero-shot and multiple training settings, attributes this degradation primarily to increased reasoning length (termed the 'Cost of Thinking'), and introduces MRN (a normalization method for balancing heterogeneous rewards) together with the ReFine-RFT framework (ensemble rewards plus MRN to constrain length while supplying dense accuracy feedback). Extensive experiments are claimed to yield state-of-the-art results on FGVC benchmarks.

Significance. If the causal attribution to reasoning length is substantiated, the work supplies a concrete explanation for why CoT harms perception-heavy tasks and supplies two practical, plug-and-play components (MRN and ReFine-RFT) that could be adopted in other multi-reward MLLM training pipelines. The empirical trends across settings and the reported SOTA gains would constitute a useful contribution to the FGVC and MLLM literature.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (experimental analysis): the central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.
  2. [§4.3 and Table 2] §4.3 and Table 2: the MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'multiple training paradigms' is used without enumeration; explicitly listing the paradigms (e.g., RFT, supervised fine-tuning, etc.) would improve readability.
  2. [§3] Figure captions and §3: the definition and measurement procedure for 'reasoning length' (token count, sentence count, or model-generated steps) should be stated once in the main text rather than only in supplementary material.
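
To make the measurement question concrete, the sketch below shows three candidate definitions of reasoning length, computed on the text inside an assumed `<think>…</think>` span. The tag format and function names are assumptions for illustration. A token count would additionally require the model's own tokenizer, which is not shown.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reasoning_text(response):
    """Return the text inside the first think span, or "" if there is none."""
    m = THINK_RE.search(response)
    return m.group(1).strip() if m else ""


def length_in_words(response):
    """Whitespace-delimited word count of the reasoning span."""
    return len(reasoning_text(response).split())


def length_in_sentences(response):
    """Rough sentence count: split after '.', '!' or '?' followed by whitespace."""
    text = reasoning_text(response)
    return len([s for s in re.split(r"(?<=[.!?])\s+", text) if s])


def length_in_characters(response):
    """Character count of the reasoning span, excluding the tags."""
    return len(reasoning_text(response))


example = "<think>The wings are swept. Four engines sit under them.</think><answer>DC-8</answer>"
```

The three measures can rank the same responses differently, which is one reason to fix a single definition before reporting a trend against accuracy.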

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current experiments show a strong correlation between reasoning length and performance degradation but do not fully isolate length from other factors. We will add the requested controlled ablation to strengthen the causal claim and clarify the contributions of MRN and ReFine-RFT.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental analysis): the central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.

    Authors: We acknowledge that our existing analyses in §4 demonstrate consistent negative trends with longer reasoning chains but do not isolate length via a fully controlled setup. In the revision we will add a new controlled ablation that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (and post-hoc truncation for comparison). Results will be reported in §4 with updated figures and discussion to directly support causality for the 'Cost of Thinking'. revision: yes

  2. Referee: [§4.3 and Table 2] §4.3 and Table 2: the MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.

    Authors: We agree the attribution of gains requires confirming length as the primary driver. After adding the controlled ablation, we will revise §4.3 and the Table 2 analysis to explicitly show how MRN balances the length-related penalty against accuracy rewards and how the ensemble supplies dense feedback. Additional component ablations will be included to separate the effects of length constraint from other training changes, clarifying that improvements arise from mitigating the identified cost while preserving task performance. revision: yes
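
The post-hoc truncation arm of such an ablation could be sketched as follows, assuming reasoning is delimited by `<think>…</think>` tags and that length is counted in whitespace-delimited words. Both assumptions and all names here are illustrative, not the authors' protocol.

```python
import re

THINK_RE = re.compile(r"(<think>)(.*?)(</think>)", re.DOTALL)


def cap_reasoning(response, max_words):
    """Truncate the first reasoning span to max_words words; leave the rest as is."""
    def _truncate(match):
        words = match.group(2).split()
        return match.group(1) + " ".join(words[:max_words]) + match.group(3)
    return THINK_RE.sub(_truncate, response, count=1)


def ablation_grid(responses, caps):
    """Build one truncated copy of every response per length cap."""
    return {cap: [cap_reasoning(r, cap) for r in responses] for cap in caps}


responses = ["<think>a b c d e</think><answer>X</answer>"]
grid = ablation_grid(responses, caps=[1, 3, 8])
```

Truncating text after generation does not change an answer that was already produced. To measure an effect, the answer would have to be regenerated conditioned on the shortened reasoning, with prompt template and sampling parameters held fixed.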

Circularity Check

0 steps flagged

No circularity: empirical observation of CoT length effect is not derived by construction

Full rationale

The paper presents an empirical finding that CoT-induced accuracy drops on FGVC tasks correlate with reasoning length across zero-shot and training regimes, labeling it the 'Cost of Thinking'. This rests on experimental measurements rather than any closed-form identity, fitted parameter renamed as prediction, or self-citation chain. MRN normalization and ReFine-RFT are introduced as practical interventions motivated by the observed pattern, without equations that reduce the central claim to its own inputs or ansatzes smuggled via prior self-work. The derivation chain consists of direct experimental comparisons and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central empirical claim rests on the assumption that reasoning length is the primary causal variable and that the chosen reward signals are sufficient to control it; no new mathematical axioms or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5573 in / 1140 out tokens · 16634 ms · 2026-05-16T14:48:43.957403+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

    cs.CV 2026-05 unverdicted novelty 7.0

    A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

  2. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  3. LASER: Learning Active Sensing for Continuum Field Reconstruction

    cs.LG 2026-04 unverdicted novelty 5.0

    LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.
