Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Jie Zhu; Xiaoming Liu; Yiyang Su

arXiv: 2601.06993 · v2 · submitted 2026-01-11 · 💻 cs.CV

Pith reviewed 2026-05-16 14:48 UTC · model grok-4.3

keywords: fine-grained visual classification · chain-of-thought reasoning · multi-modal large language models · reasoning length · multi-reward optimization · FGVC · MLLMs · cost of thinking

The pith

Longer textual reasoning lowers accuracy for MLLMs on fine-grained visual classification

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-modal large language models handle many tasks well yet fall short on fine-grained visual classification that needs precise discrimination of subtle visual cues. Chain-of-thought prompting, which aids math and coding, here reduces accuracy. Systematic tests across zero-shot and trained settings show the drop traces mainly to the length of the generated reasoning text. The work names this the Cost of Thinking and counters it with MRN, a normalization that balances multiple reward signals, plus the ReFine-RFT framework that limits reasoning length while rewarding classification accuracy. The result is state-of-the-art performance on standard FGVC benchmarks.

Core claim

Across zero-shot and multiple training settings, the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. The authors term this the Cost of Thinking and introduce MRN, a plug-and-play normalization for multi-reward optimization that balances heterogeneous signals, together with ReFine-RFT, a framework that combines ensemble rewards with MRN to constrain reasoning length while supplying dense accuracy-oriented feedback.
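One way to picture the length-versus-accuracy relationship in this claim is to bucket individual predictions by reasoning length and compute accuracy per bucket. The Python sketch below assumes per-sample records of the form (reasoning length, correct or not); the function name `accuracy_by_length`, the bin width, and the toy records are illustrative assumptions, not the paper's analysis code or data.

```python
from collections import defaultdict


def accuracy_by_length(records, bin_width=50):
    """Group (reasoning_length, is_correct) pairs into length bins.

    Returns a dict mapping each bin's lower edge to the accuracy in that bin,
    ordered from the shortest bin to the longest.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for length, correct in records:
        start = (length // bin_width) * bin_width  # lower edge of the bin
        counts[start] += 1
        hits[start] += int(correct)
    return {start: hits[start] / counts[start] for start in sorted(counts)}


# Made-up records for illustration only: (length in tokens, prediction correct?)
toy_records = [(10, True), (40, True), (60, True), (90, False), (120, False), (140, False)]
per_bin = accuracy_by_length(toy_records)
```

A steady decline across bins would match the reported trend, but on its own it would still be a correlation rather than evidence that length causes the drop.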

What carries the argument

The Cost of Thinking phenomenon, in which longer textual reasoning lowers classification accuracy, addressed through MRN normalization that balances heterogeneous reward signals during multi-reward optimization.
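The idea of balancing heterogeneous rewards can be illustrated with a small Python sketch. This page does not give MRN's exact formula, so the code below is only an assumption-laden illustration: it standardizes each reward separately across a group of sampled responses and then averages the standardized values, and it contrasts that with summing raw rewards first. The function names, the z-score choice, the averaging step, and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize one reward's values across the sampled group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(rewards):
    """Baseline: add raw rewards per response, then standardize the totals."""
    columns = list(rewards.values())
    totals = [sum(col) for col in zip(*columns)]
    return zscore(totals)


def normalized_then_averaged(rewards):
    """Standardize each reward on its own, then average across rewards."""
    per_reward = [zscore(vals) for vals in rewards.values()]
    return [mean(col) for col in zip(*per_reward)]


# Hypothetical group of 4 sampled responses; every number here is made up.
rewards = {
    "format": [1.0, 1.0, 1.0, 0.0],              # binary, narrow range
    "accuracy": [0.0, 1.0, 0.0, 1.0],            # binary, narrow range
    "length": [-120.0, -35.0, -300.0, -40.0],    # negative token count, wide range
}

baseline = summed_then_normalized(rewards)
advantages = normalized_then_averaged(rewards)
```

In the baseline, the wide-range length term dominates the raw totals, so the binary rewards barely influence the result. Standardizing per reward puts all three signals on a comparable scale before they are combined.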

If this is right

Constraining reasoning length prevents accuracy losses on fine-grained visual tasks.
MRN enables stable optimization when combining rewards of different types.
ReFine-RFT reaches state-of-the-art results across FGVC benchmarks.
The length-control approach works in both zero-shot and fine-tuning regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt design for vision-language models could routinely include explicit length limits on reasoning steps.
Similar length-based performance costs may appear in other perception-heavy tasks such as medical imaging or autonomous driving.
Dynamic length control that adapts to input difficulty could further improve results.
Pairing the method with model-distillation techniques might support efficient real-world deployment.

Load-bearing premise

The accuracy drop is caused by reasoning length rather than by correlated factors such as prompt style or total token budget.

What would settle it

A controlled experiment that holds total output length fixed while varying only the amount of reasoning content, or that matches lengths between CoT and direct-answer prompts, to test whether length alone accounts for the accuracy change.
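A length-matched comparison of this kind needs some way to cap reasoning at a fixed budget. The Python sketch below shows one possible post-hoc truncation step, assuming responses use the `<think>…</think><answer>…</answer>` layout and using whitespace-separated words as a rough stand-in for tokens. The function name `cap_reasoning` and the example string are hypothetical.

```python
import re

# Matches the first reasoning block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def cap_reasoning(response: str, max_words: int) -> str:
    """Keep at most max_words words inside <think>...</think>.

    Text outside the reasoning block, including the answer, is left unchanged.
    Responses without a reasoning block are returned as-is.
    """
    def _truncate(match):
        words = match.group(1).split()
        return "<think>" + " ".join(words[:max_words]) + "</think>"

    return THINK_RE.sub(_truncate, response, count=1)


example = "<think>four engines under swept wings</think><answer>DC-8</answer>"
capped = cap_reasoning(example, 2)
```

In an actual ablation the answer would have to be regenerated from the truncated reasoning prefix, with the prompt template and sampling parameters held fixed; truncating the text alone does not change the prediction.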

Figures

Figures reproduced from arXiv: 2601.06993 by Jie Zhu, Xiaoming Liu, Yiyang Su.

**Figure 1.** Performance degradation with CoT and reasoning collapse in RFT. In zero-shot evaluation (top), MLLMs predict the correct label directly, but adding CoT reasoning leads to a wrong answer. During RFT (bottom), reasoning length steadily shrinks while accuracy improves, indicating a reasoning collapse.

**Figure 2.** Dynamics of reasoning length during RFT across FGVC datasets. The dark green lines denote the running average of completion lengths throughout RFT FGVC tasks. Across all datasets, the reasoning content length rapidly decreases and stabilizes at a shorter range, suggesting that RFT discourages excessive reasoning generation and promotes concise, decision-focused responses. [Zero-shot: average content length…

**Figure 3.** Impact of reasoning length on FGVC performance. We analyze the relationship between average reasoning (thinking) length and classification accuracy across FGVC datasets. As the average thinking length increases, performance consistently declines, indicating that excessive reasoning generation introduces noise or distracts the model from key discriminative visual cues.

**Figure 4.** Overview of ReFine-RFT. Given a question, the model generates multiple candidate responses, each evaluated using an ensemble reward that combines rule-based rewards and model-based rewards like MLLM-based accuracy reward and embedding similarity reward. The proposed MRN then normalizes the rewards for each function to compute the final advantages used to update the MLLM.

**Figure 5.** Differences among rewards during training. Each reward exhibits distinct convergence speed, value range, and saturation point, reflecting the heterogeneity of different rewards.

**Figure 6.** Reward curves of ReFine-RFT on Flowers-102. Rewards consistently increase over training, demonstrating the effectiveness of our reward design.

**Figure 7.** Training reward and its standard deviation comparison on Aircrafts-102. MRN + GRPO achieves consistently higher reward values and lower variance throughout training, indicating improved stability and optimization efficiency.

**Figure 8.** Comparison of responses. SFT-CoT and Visual-RFT produce long reasoning with incorrect answers, while ReFine-RFT achieves concise reasoning and higher accuracy. More results and analyses are in the supplementary.
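The ensemble reward described in the Figure 4 caption can be pictured as several independent scoring functions applied to each candidate response. The Python sketch below is a simplified illustration under stated assumptions: it includes a rule-based format check and a rule-based exact-match check, and it uses character-level similarity from `difflib` as a stand-in for the embedding similarity reward. The MLLM-based accuracy reward is omitted, and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# A response that starts with a reasoning block and ends with an answer block.
RESPONSE_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response):
    """Return the text of the first <answer> block, or '' if there is none."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else ""


def format_reward(response):
    """Rule-based: 1.0 when the whole response follows the think/answer layout."""
    return 1.0 if RESPONSE_RE.fullmatch(response) else 0.0


def exact_match_reward(response, label):
    """Rule-based: 1.0 when the extracted answer equals the label (case-insensitive)."""
    return 1.0 if extract_answer(response).lower() == label.lower() else 0.0


def similarity_reward(response, label):
    """Assumption: character similarity stands in for an embedding similarity score."""
    return SequenceMatcher(None, extract_answer(response).lower(), label.lower()).ratio()


def ensemble_rewards(response, label):
    """Score one candidate with every reward function, keeping the signals separate."""
    return {
        "format": format_reward(response),
        "exact_match": exact_match_reward(response, label),
        "similarity": similarity_reward(response, label),
    }


good = ensemble_rewards("<think>swept wings</think><answer>Boeing 707-320</answer>", "Boeing 707-320")
bad = ensemble_rewards("DC-8", "Boeing 707-320")
```

Returning the rewards as separate named values, rather than a single sum, leaves room for a per-reward normalization step before the advantages are computed.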

Original abstract

Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) MRN, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with MRN to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Project page: https://refine-rft.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper finds that longer CoT text drives the accuracy drop on FGVC, and the authors' ReFine-RFT recipe with MRN normalization reaches SOTA, though length is not cleanly isolated from prompt and token effects.

The paper argues that CoT reasoning hurts performance on fine-grained visual classification mainly because the generated text gets longer, and it shows the pattern in both zero-shot and trained settings. The authors call this the Cost of Thinking and back it with length-versus-accuracy plots. Their fix is a training framework called ReFine-RFT, which uses ensemble rewards and a normalization method called MRN to keep reasoning short while pushing for correct answers.

What stands out is that they test the idea across multiple setups and end up with better results than previous methods on standard FGVC benchmarks. MRN is a straightforward way to handle different reward signals and could be reused in other multi-objective training for language models.

The soft spot is that length is never cleanly isolated as the cause. In the zero-shot case, reasoning length depends on how the prompt is written and how generation is run, so other factors such as error buildup or attention dilution could be at play. On the training side, length is entangled with the reward design, so it is not obvious that length alone drives the drop. A controlled experiment with fixed prompts and explicit length limits would strengthen that part.

This paper is for people who apply MLLMs to tasks that need precise visual distinctions, such as species identification or product recognition. It gives them a usable recipe for avoiding the reasoning penalty. I would send it to peer review because the empirical observation is worth discussing and the method works on the benchmarks, even if some controls are missing.

Referee Report

2 major / 2 minor

Summary. The paper investigates the effect of Chain-of-Thought (CoT) reasoning on Multi-modal Large Language Models (MLLMs) for Fine-Grained Visual Classification (FGVC). It reports that CoT degrades performance across zero-shot and multiple training settings, attributes this degradation primarily to increased reasoning length (termed the 'Cost of Thinking'), and introduces MRN (a normalization method for balancing heterogeneous rewards) together with the ReFine-RFT framework (ensemble rewards plus MRN to constrain length while supplying dense accuracy feedback). Extensive experiments are claimed to yield state-of-the-art results on FGVC benchmarks.

Significance. If the causal attribution to reasoning length is substantiated, the work supplies a concrete explanation for why CoT harms perception-heavy tasks and supplies two practical, plug-and-play components (MRN and ReFine-RFT) that could be adopted in other multi-reward MLLM training pipelines. The empirical trends across settings and the reported SOTA gains would constitute a useful contribution to the FGVC and MLLM literature.

major comments (2)

[Abstract and §4 (experimental analysis)] The central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.
[§4.3 and Table 2] The MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.

minor comments (2)

[Abstract] The phrase 'multiple training paradigms' is used without enumeration; explicitly listing the paradigms (e.g., RFT, supervised fine-tuning, etc.) would improve readability.
[Figure captions and §3] The definition and measurement procedure for 'reasoning length' (token count, sentence count, or model-generated steps) should be stated once in the main text rather than only in supplementary material.
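The three candidate definitions of reasoning length named in this comment can give different numbers for the same response. The Python sketch below computes simple proxies for each, assuming the `<think>…</think>` layout: whitespace-separated words stand in for tokens, sentence-ending punctuation marks sentence boundaries, and non-empty lines stand in for steps. The function name `reasoning_length` and these proxies are assumptions, not the paper's measurement procedure.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reasoning_length(response):
    """Return word, sentence, and step counts for the first <think> block.

    All three counts are 0 when the response has no reasoning block.
    """
    match = THINK_RE.search(response)
    text = match.group(1).strip() if match else ""
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    steps = len([line for line in text.splitlines() if line.strip()])
    return {"words": words, "sentences": sentences, "steps": steps}


example = "<think>Long fuselage. Four engines.\nLikely a 707.</think><answer>Boeing 707-320</answer>"
lengths = reasoning_length(example)
```

Because the counts diverge even on a short example, reporting which definition underlies each plot would make the length-versus-accuracy results easier to compare.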

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current experiments show a strong correlation between reasoning length and performance degradation but do not fully isolate length from other factors. We will add the requested controlled ablation to strengthen the causal claim and clarify the contributions of MRN and ReFine-RFT.

Referee: [Abstract and §4 (experimental analysis)] The central claim that 'the degradation induced by CoT is largely driven by the reasoning length' is not yet supported by a controlled ablation. The reported setups entangle length with prompt style, unconstrained generation, token budget, and reward shaping; an experiment that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (or post-hoc truncation) is required to establish causality rather than correlation.

Authors: We acknowledge that our existing analyses in §4 demonstrate consistent negative trends with longer reasoning chains but do not isolate length via a fully controlled setup. In the revision we will add a new controlled ablation that fixes the prompt template, sampling parameters, and total context length while varying only an explicit length cap (and post-hoc truncation for comparison). Results will be reported in §4 with updated figures and discussion to directly support causality for the 'Cost of Thinking'. Revision: yes.

Referee: [§4.3 and Table 2] The MRN normalization and ReFine-RFT gains rest on the premise that length is the dominant negative factor. Without the isolation experiment above, it remains unclear whether the observed improvements stem from length control, from the ensemble reward formulation itself, or from other changes in the training distribution.

Authors: We agree the attribution of gains requires confirming length as the primary driver. After adding the controlled ablation, we will revise §4.3 and the Table 2 analysis to explicitly show how MRN balances the length-related penalty against accuracy rewards and how the ensemble supplies dense feedback. Additional component ablations will be included to separate the effects of length constraint from other training changes, clarifying that improvements arise from mitigating the identified cost while preserving task performance. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical observation of CoT length effect is not derived by construction

full rationale

The paper presents an empirical finding that CoT-induced accuracy drops on FGVC tasks correlate with reasoning length across zero-shot and training regimes, labeling it the 'Cost of Thinking'. This rests on experimental measurements rather than any closed-form identity, fitted parameter renamed as prediction, or self-citation chain. MRN normalization and ReFine-RFT are introduced as practical interventions motivated by the observed pattern, without equations that reduce the central claim to its own inputs or ansatzes smuggled via prior self-work. The derivation chain consists of direct experimental comparisons and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central empirical claim rests on the assumption that reasoning length is the primary causal variable and that the chosen reward signals are sufficient to control it; no new mathematical axioms or invented physical entities are introduced.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV · 2026-05 · unverdicted · novelty 7.0
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO · 2026-01 · unverdicted · novelty 6.0
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

LASER: Learning Active Sensing for Continuum Field Reconstruction
cs.LG · 2026-04 · unverdicted · novelty 5.0
LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 3 Pith papers · 20 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Atm: Action tempo- rality modeling for video question answering

Junwen Chen, Jie Zhu, and Yu Kong. Atm: Action tempo- rality modeling for video question answering. InACM MM,

work page
[5]

On the suitability of reinforcement fine-tuning to visual tasks

Xiaxu Chen, Wei Li, Chunxu Liu, Chi Xie, Xiaoyan Hu, Chengqian Ma, Feng Zhu, and Rui Zhao. On the suitability of reinforcement fine-tuning to visual tasks. InProceedings of the Computer Vision and Pattern Recognition Conference,

work page
[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Fine-grained veri- fiers: Preference modeling as next-token prediction in vision- language alignment.arXiv preprint arXiv:2410.14148, 2024

Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, and Tat-Seng Chua. Fine-grained veri- fiers: Preference modeling as next-token prediction in vision- language alignment.arXiv preprint arXiv:2410.14148, 2024. 2

work page arXiv 2024
[8]

African or european swallow? bench- marking large vision-language models for fine-grained object classification

Gregor Geigle, Radu Timofte, and Goran Glava ˇs. African or european swallow? benchmarking large vision-language models for fine-grained object classification.arXiv preprint arXiv:2406.14496, 2024. 1, 2

work page arXiv 2024
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Analyzing and boosting the power of fine-grained visual recognition for multi-modal large language models

Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, and Yuxin Peng. Analyzing and boosting the power of fine-grained visual recognition for multi-modal large language models. arXiv preprint arXiv:2501.15140, 2025. 1, 2, 7

work page arXiv 2025
[11]

Lora: Low-rank adaptation of large language models.ICLR, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 2022. 6

work page 2022
[12]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.arXiv preprint arXiv:2502.09621, 2025

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.arXiv preprint arXiv:2502.09621, 2025. 2, 3

work page arXiv 2025
[16]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InPro- ceedings of the IEEE International Conference on Computer Vision Workshops, 2013. 2, 3

work page 2013
[17]

Re- wardbench: Evaluating reward models for language modeling

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Re- wardbench: Evaluating reward models for language modeling. InNAACL, 2025. 5

work page 2025
[18]

Think or not think: A study of explicit thinking in rule-based visual rein- forcement fine-tuning.NeurIPS, 2025

Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. Think or not think: A study of explicit thinking in rule-based visual rein- forcement fine-tuning.NeurIPS, 2025. 2, 4, 7

work page 2025
[19]

Democratizing fine-grained vi- sual recognition with large language models.arXiv preprint arXiv:2401.13837, 2024

Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, and Elisa Ricci. Democratizing fine-grained vi- sual recognition with large language models.arXiv preprint arXiv:2401.13837, 2024. 2

work page arXiv 2024
[20]

arXiv preprint arXiv:2410.21333 , year=

Ryan Liu, Jiayi Geng, Addison J Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse.arXiv preprint arXiv:2410.21333, 2024. 2, 3

work page arXiv 2024
[21]

Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.ICCV, 2025. 2, 3, 4, 7

work page 2025
[22]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual clas- sification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2013
[23]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 2008. 2, 3

work page 2008
[24]

Show your work: Scratchpads for intermediate computation with language models, 2021

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Hen- ryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sut- ton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2021. 2

work page 2021
[25]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 2, 3

work page 2012
[26]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025. 3 9

work page internal anchor Pith review arXiv 2025
[27]

Chatgpt- powered hierarchical comparisons for image classification

Zhiyuan Ren, Yiyang Su, and Xiaoming Liu. Chatgpt- powered hierarchical comparisons for image classification. NeurIPS, 36:69706–69718, 2023. 2

work page 2023
[28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuan-Jing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InEMNLP, 2023. 5

work page 2023
[31]

arXiv preprint arXiv:2409.12183 , year=

Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dong- wei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning.arXiv preprint arXiv:2409.12183, 2024. 2

work page arXiv 2024
[32]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A sur- vey on efficient reasoning for large language models.arXiv preprint arXiv:2503.16419, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chun- yuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu- Xiong Wang, Yiming Yang, et al. Aligning large multi- modal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Let me speak freely? a study on the impact of format restrictions on performance of large language models.arXiv preprint arXiv:2408.02442, 2024

Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, and Yun-Nung Chen. Let me speak freely? a study on the impact of format restrictions on performance of large language models.arXiv preprint arXiv:2408.02442,

work page arXiv
[35]

Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason- rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025. 3

work page arXiv 2025
[36]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InCVPR, pages 9568–9578, 2024. 2

work page 2024
[37]

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie.The Caltech-UCSD Birds-200-2011 Dataset. 2011. 2

work page 2011
[38]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Lin- jun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Chain-of- thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022. 2

work page 2022
[41]

Fine-grained image analysis with deep learning: A survey

Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie. Fine-grained image analysis with deep learning: A survey. TPAMI, 2021. 2

work page 2021
[42]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Min- feng Zhu, et al. R1-onevision: Advancing generalized multi- modal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025. 3

[43]

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954,

[44]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025. 3

[45]

Chain of preference optimization: Improving chain-of-thought reasoning in LLMs

Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. Chain of preference optimization: Improving chain-of-thought reasoning in llms. NeurIPS, 2024. 2

[46]

Why are visually-grounded language models bad at image classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classification? NeurIPS, 2024. 1, 2

[47]

Automatic Chain of Thought Prompting in Large Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022. 2

[48]

Multimodal chain-of-thought reasoning in language models

Zhuosheng Zhang, Aston Zhang, Mu Li, George Karypis, Alex Smola, et al. Multimodal chain-of-thought reasoning in language models. TMLR, 2023. 2

[49]

Automatic chain of thought prompting in large language models

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In ICLR, 2023. 2

[50]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 1, 3

