Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Pith reviewed 2026-05-16 14:48 UTC · model grok-4.3
The pith
Longer textual reasoning lowers accuracy for MLLMs on fine-grained visual classification
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across zero-shot and multiple training settings, the degradation induced by CoT is largely driven by reasoning length: longer textual reasoning consistently lowers classification accuracy. The authors term this the Cost of Thinking and introduce two components. MRN is a plug-and-play normalization for multi-reward optimization that balances heterogeneous signals. ReFine-RFT is a framework that combines ensemble rewards with MRN to constrain reasoning length while supplying dense accuracy-oriented feedback.
What carries the argument
The Cost of Thinking phenomenon, in which longer textual reasoning lowers classification accuracy, addressed through MRN normalization that balances heterogeneous reward signals during multi-reward optimization.
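The summary describes MRN only as a normalization that balances heterogeneous reward signals; it does not give the formula. As a rough illustration of why some normalization helps when rewards live on different scales, the sketch below standardizes each reward type across a group of sampled responses before summing. This is a generic per-reward z-score, not the authors' MRN, and the names `normalize_rewards`, `accuracy`, and `length_penalty` are hypothetical.

```python
from statistics import mean, pstdev


def normalize_rewards(reward_groups, eps=1e-8):
    """Standardize each reward type across sampled responses, then sum.

    reward_groups maps a reward name to a list of raw scores, one per
    sampled response (all lists must have the same length).
    NOTE: generic z-score illustration, not the paper's MRN formula.
    """
    n = len(next(iter(reward_groups.values())))
    combined = [0.0] * n
    for scores in reward_groups.values():
        mu = mean(scores)
        sigma = pstdev(scores)  # population std over the sampled group
        for i, s in enumerate(scores):
            # eps keeps the division defined when every score is identical
            combined[i] += (s - mu) / (sigma + eps)
    return combined


if __name__ == "__main__":
    # Hypothetical raw rewards: a 0/1 accuracy signal and a token-count penalty.
    raw = {
        "accuracy": [0, 1, 1, 0],
        "length_penalty": [-300, -50, -60, -280],
    }
    print(normalize_rewards(raw))
```

Without the standardization step, the length penalty (hundreds of units) would swamp the 0/1 accuracy signal in a plain sum; after it, both contribute on a comparable scale.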
If this is right
- Constraining reasoning length prevents accuracy losses on fine-grained visual tasks.
- MRN enables stable optimization when combining rewards of different types.
- ReFine-RFT reaches state-of-the-art results across FGVC benchmarks.
- The length-control approach works in both zero-shot and fine-tuning regimes.
Where Pith is reading between the lines
- Prompt design for vision-language models could routinely include explicit length limits on reasoning steps.
- Similar length-based performance costs may appear in other perception-heavy tasks such as medical imaging or autonomous driving.
- Dynamic length control that adapts to input difficulty could further improve results.
- Pairing the method with model-distillation techniques might support efficient real-world deployment.
Load-bearing premise
The accuracy drop is caused by reasoning length rather than by correlated factors such as prompt style or total token budget.
What would settle it
A controlled experiment that holds total output length fixed while varying only the amount of reasoning content, or that matches lengths between CoT and direct-answer prompts, to test whether length alone accounts for the accuracy change.
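One way such an experiment could be tabulated is sketched below, assuming each evaluation run is logged as an (arm, is_correct) pair, where the arm label identifies the only factor that was varied (for example an explicit reasoning-length cap). The function name and the record layout are assumptions made for illustration; the paper's actual evaluation code is not described in this summary.

```python
from collections import defaultdict


def accuracy_by_arm(records):
    """Return accuracy per experimental arm.

    records: iterable of (arm, is_correct) pairs. All arms are assumed to
    share the same prompt template and sampling settings, so the arm label
    (e.g. a length cap in tokens) is the only thing that differs.
    """
    totals = defaultdict(lambda: [0, 0])  # arm -> [num_correct, num_total]
    for arm, is_correct in records:
        totals[arm][0] += int(is_correct)
        totals[arm][1] += 1
    return {arm: correct / total
            for arm, (correct, total) in sorted(totals.items())}


if __name__ == "__main__":
    # Hypothetical log entries: (length cap, whether the prediction was right).
    log = [(32, True), (32, True), (256, True), (256, False)]
    print(accuracy_by_arm(log))
```

If accuracy falls monotonically as the cap grows while everything else is held fixed, that would support length as the driver; a flat curve would point to one of the correlated factors instead.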
Original abstract
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) MRN, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with MRN to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Project page: ReFine-RFT (https://refine-rft.github.io/).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the effect of Chain-of-Thought (CoT) reasoning on Multi-modal Large Language Models (MLLMs) for Fine-Grained Visual Classification (FGVC). It reports that CoT degrades performance across zero-shot and multiple training settings, attributes this degradation primarily to increased reasoning length (termed the 'Cost of Thinking'), and introduces MRN (a normalization method for balancing heterogeneous rewards) together with the ReFine-RFT framework (ensemble rewards plus MRN to constrain length while supplying dense accuracy feedback). Extensive experiments are claimed to yield state-of-the-art results on FGVC benchmarks.
Significance. If the causal attribution to reasoning length is substantiated, the work supplies a concrete explanation for why CoT harms perception-heavy tasks and supplies two practical, plug-and-play components (MRN and ReFine-RFT) that could be adopted in other multi-reward MLLM training pipelines. The empirical trends across settings and the reported SOTA gains would constitute a useful contribution to the FGVC and MLLM literature.
major comments (2)
- Abstract and §4 (experimental analysis): the central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.
- §4.3 and Table 2: the MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.
minor comments (2)
- Abstract: the phrase 'multiple training paradigms' is used without enumeration; explicitly listing the paradigms (e.g., RFT and supervised fine-tuning) would improve readability.
- Figure captions and §3: the definition and measurement procedure for 'reasoning length' (token count, sentence count, or model-generated steps) should be stated once in the main text rather than only in supplementary material.
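The last comment matters because the candidate units can disagree. The sketch below computes two of them for a piece of reasoning text, under loud assumptions: whitespace-separated words stand in for tokens (a real token count would need the model's own tokenizer), and sentences are split on terminal punctuation. The function name `reasoning_length` is hypothetical and is not taken from the paper.

```python
import re


def reasoning_length(text):
    """Measure a reasoning string in two simple units.

    ASSUMPTIONS: 'words' is a whitespace split used as a rough proxy for
    tokens; 'sentences' splits after '.', '!' or '?' followed by whitespace.
    """
    words = text.split()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {"words": len(words), "sentences": len(sentences)}


if __name__ == "__main__":
    sample = "The wing is blue. The beak is short!"
    print(reasoning_length(sample))
```

Two responses with the same word count can differ in sentence count (and in tokens under a subword tokenizer), so any length-versus-accuracy curve depends on which unit is fixed up front.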
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current experiments show a strong correlation between reasoning length and performance degradation but do not fully isolate length from other factors. We will add the requested controlled ablation to strengthen the causal claim and clarify the contributions of MRN and ReFine-RFT.
- Referee: Abstract and §4 (experimental analysis): the central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.
  Authors: We acknowledge that our existing analyses in §4 demonstrate consistent negative trends with longer reasoning chains but do not isolate length via a fully controlled setup. In the revision we will add a new controlled ablation that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (and post-hoc truncation for comparison). Results will be reported in §4 with updated figures and discussion to directly support causality for the 'Cost of Thinking'.
  Revision: yes
- Referee: §4.3 and Table 2: the MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.
  Authors: We agree the attribution of gains requires confirming length as the primary driver. After adding the controlled ablation, we will revise §4.3 and the Table 2 analysis to explicitly show how MRN balances the length-related penalty against accuracy rewards and how the ensemble supplies dense feedback. Additional component ablations will be included to separate the effects of length constraint from other training changes, clarifying that improvements arise from mitigating the identified cost while preserving task performance.
  Revision: yes
Circularity Check
No circularity: empirical observation of CoT length effect is not derived by construction
The paper presents an empirical finding that CoT-induced accuracy drops on FGVC tasks correlate with reasoning length across zero-shot and training regimes, labeling it the 'Cost of Thinking'. This rests on experimental measurements rather than any closed-form identity, fitted parameter renamed as prediction, or self-citation chain. MRN normalization and ReFine-RFT are introduced as practical interventions motivated by the observed pattern, without equations that reduce the central claim to its own inputs or ansatzes smuggled via prior self-work. The derivation chain consists of direct experimental comparisons and is therefore self-contained against external benchmarks.
Forward citations
Cited by 3 Pith papers
- DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
  A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
  PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
- LASER: Learning Active Sensing for Continuum Field Reconstruction
  LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.