arxiv: 2602.07605 · v3 · submitted 2026-02-07 · 💻 cs.CV · cs.AI

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Hulingxiao He, Zijun Geng, Yuxin Peng

Pith reviewed 2026-05-16 06:10 UTC · model grok-4.3

keywords: fine-grained visual recognition · chain-of-thought reasoning · multimodal large language models · policy optimization · few-shot adaptation · open-world classification · visual sub-category discrimination · triplet augmentation

The pith

Fine-R1 adapts multimodal LLMs to fine-grained visual recognition through chain-of-thought supervised fine-tuning and triplet-augmented policy optimization, achieving superior performance on seen and unseen sub-categories with only 4-shot training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the gap between general multimodal LLMs and specialized contrastive models on fine-grained visual recognition by introducing an R1-style training process that turns the model into an open-world classifier. Standard MLLMs struggle with sub-category distinctions and overfit to training examples, while dedicated data collection for every new domain remains expensive. The authors construct a structured CoT dataset and apply intra-class and inter-class trajectory augmentations during policy optimization to build robustness and discrimination. If successful, this shows that reasoning-based adaptation can replace large-scale annotation for hierarchical visual tasks and support domains where expert labels are scarce.

Core claim

Fine-R1 adapts general MLLMs to FGVR in two stages. It first performs Chain-of-Thought Supervised Fine-tuning on a manually built FGVR CoT dataset whose rationales follow the sequence of visual analysis, candidate sub-categories, comparison, and prediction; this step converts the model into a strong open-world classifier. It then applies Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to handle intra-class variance, and Inter-class Augmentation maximizes response differences across sub-categories to sharpen discrimination. With only 4-shot training, the resulting model surpasses general MLLMs, reasoning MLLMs, and contrastive CLIP models on both seen and unseen sub-categories.

What carries the argument

An R1-style training framework that combines Chain-of-Thought Supervised Fine-tuning on a structured FGVR CoT rationale dataset with Triplet Augmented Policy Optimization, which mixes trajectories within a class and maximizes response differences across classes.
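
As a rough illustration of the intra-class idea, the sketch below pools rewards from rollouts on an anchor image and on a positive image of the same sub-category, then normalizes them as one group in the style of group-relative policy optimization. This is an assumed reading of "mixing trajectories", not the paper's stated objective; the function names, the reward values, and `eps` are all hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def intra_class_mixed_advantages(anchor_rewards, positive_rewards):
    """Hypothetical intra-class mixing: pool rollouts from the anchor and
    the positive image into one group before computing advantages."""
    pooled = list(anchor_rewards) + list(positive_rewards)
    adv = group_advantages(pooled)
    n = len(anchor_rewards)
    return adv[:n], adv[n:]


# Made-up rewards: 1.0 = correct sub-category, 0.0 = wrong.
anchor_adv, positive_adv = intra_class_mixed_advantages(
    [1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]
)
```

Under this reading, a correct answer earns the same advantage whether it came from the anchor or the positive image, because both share one baseline. Whether the paper mixes at the reward, advantage, or trajectory level is not stated on this page.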

If this is right

The method reduces the annotation burden for fine-grained tasks from thousands of labels to a small number of shots while maintaining open-world capability.
MLLMs can now match or exceed contrastive models on both seen and unseen sub-categories without task-specific pre-training.
The structured CoT format enables the model to output explicit comparison steps that support downstream knowledge-intensive applications.
Intra- and inter-class trajectory augmentation improves robustness to visual variance without requiring additional real images.
The approach scales to domains where exhaustive sub-category labeling is impractical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same CoT-plus-augmentation recipe could be applied to hierarchical tasks in other modalities such as audio or text classification.
If the rationale structure generalizes, future work could replace manual dataset construction with automated rationale generation from weaker models.
The performance gain on unseen categories suggests the training reduces reliance on spurious visual cues that CLIP models often exploit.

Load-bearing premise

The manually constructed high-quality FGVR CoT dataset with its fixed rationale structure of visual analysis, candidate sub-categories, comparison, and prediction will produce a generalizable open-world classifier that does not overfit to the seen sub-categories in the training distribution.
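
The fixed four-part rationale could be held in a simple record, which makes the premise easier to inspect. A minimal sketch, assuming hypothetical field names and an invented example; the paper's actual data schema and prompt format are not shown on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FGVRRationale:
    """One CoT rationale in the four-step order described above."""
    visual_analysis: str
    candidates: List[str]
    comparison: str
    prediction: str

    def to_text(self) -> str:
        # Render the steps in the fixed order, one step per line.
        return "\n".join([
            f"Visual analysis: {self.visual_analysis}",
            f"Candidate sub-categories: {', '.join(self.candidates)}",
            f"Comparison: {self.comparison}",
            f"Prediction: {self.prediction}",
        ])


# Invented example for illustration only, not taken from the dataset.
example = FGVRRationale(
    visual_analysis="Small songbird with a red crown and streaked flanks.",
    candidates=["Common Redpoll", "House Finch"],
    comparison="The small yellow bill and black chin fit a redpoll better.",
    prediction="Common Redpoll",
)
rationale_text = example.to_text()
```

Keeping the candidate list as its own field also makes it straightforward to audit which labels enter the rationales.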

What would settle it

Test the trained Fine-R1 model on an entirely new fine-grained dataset whose sub-categories share no overlap with the training set and measure whether its accuracy drops below that of a standard CLIP model trained on the same 4 shots.
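
The check described above can be sketched as two steps: confirm that the test sub-categories are disjoint from the training ones, then compare accuracies. The function names are hypothetical and the predictions would come from whatever inference code is in use; this is an outline of the comparison, not an evaluation protocol from the paper.

```python
def labels_disjoint(train_labels, test_labels):
    """True when no test sub-category also appears in training."""
    return set(train_labels).isdisjoint(test_labels)


def accuracy(predictions, targets):
    """Fraction of exact-match predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not targets:
        raise ValueError("empty evaluation set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def claim_survives(model_preds, baseline_preds, targets, train_labels):
    """The claim survives if the model is at least as accurate as the
    baseline on sub-categories that were never seen in training."""
    if not labels_disjoint(train_labels, targets):
        raise ValueError("test sub-categories overlap the training set")
    return accuracy(model_preds, targets) >= accuracy(baseline_preds, targets)
```

Exact string matching is a simplification; open-ended MLLM outputs usually need label normalization before they can be scored this way.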

Figures

Figures reproduced from arXiv: 2602.07605 by Hulingxiao He, Yuxin Peng, Zijun Geng.

**Figure 2.** Overview of the proposed two-stage training framework integrating CoT SFT and TAPO.

**Figure 3.** Ablation study on training methods, inference strategies, and key components of Fine-R1.

**Figure 4.** PCA projections of the last hidden state rep...

**Figure 5.** Different image-level visual concepts for objects with the same subcategory.

**Figure 6.** Case study comparing Fine-R1-3B and Qwen2.5-VL-3B on Stanford Car-196 (...

read the original abstract

Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transition the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fine-R1 gives a concrete CoT dataset structure plus intra- and inter-class augmentation for 4-shot FGVR adaptation, but the abstract supplies no numbers or ablations, so the outperformance claims on unseen categories stay uncheckable.

The main thing to know is that this paper builds a four-part chain-of-thought dataset (visual analysis, candidate sub-categories, comparison, prediction) and then applies triplet-augmented policy optimization, with intra-class trajectory mixing for robustness and inter-class augmentation for sharper distinctions. With only four shots, it claims to beat general MLLMs, reasoning MLLMs, and even CLIP on both seen and unseen fine-grained categories. That combination of structured rationale data and the specific augmentation inside the RL stage is the actual novelty; it is not just another supervised fine-tune on an existing FGVR set. The method directly targets the data-scarcity and overfitting problems that the abstract correctly identifies for MLLMs on expert-level recognition tasks. If the full experiments hold up, the recipe is practical for domains where exhaustive labeling is expensive.

The soft spot is the complete absence of any quantitative results, baselines, error bars, dataset sizes, or ablation tables in the abstract. Without those, it is impossible to tell whether the unseen-category gains come from the reasoning pipeline or from how the candidate sub-categories were populated during CoT construction. The stress-test concern about possible test-set leakage in that step is worth a close look in the methods; if the candidates draw from a pool that overlaps the unseen test distribution, the generalization story weakens.

The paper is aimed at researchers who adapt large vision-language models to low-data specialized recognition. Anyone working on few-shot FGVR or on injecting structured reasoning into MLLMs would get concrete implementation ideas from the framework. It deserves a serious referee because the training recipe is well-specified and the problem is real, even if the current evidence is still thin.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Fine-R1, an MLLM for fine-grained visual recognition (FGVR) via an R1-style framework consisting of (1) Chain-of-Thought supervised fine-tuning on a manually constructed high-quality FGVR CoT dataset whose rationales follow the structure 'visual analysis, candidate sub-categories, comparison, and prediction' and (2) Triplet Augmented Policy Optimization that applies intra-class mixing of trajectories and inter-class response distinction. The central claim is that 4-shot training suffices for Fine-R1 to outperform general MLLMs, reasoning MLLMs, and contrastive CLIP models on both seen and unseen sub-categories.

Significance. If the reported gains on unseen sub-categories are shown to be free of data-construction artifacts, the work would offer a practical route to adapting MLLMs to FGVR with minimal supervision, addressing the well-known overfitting and annotation-cost problems in the area. The public code release is a positive factor for reproducibility.

major comments (2)

[Abstract and §4] Abstract and §4 (Results): the claim of outperformance on both seen and unseen sub-categories is stated without any numerical values, baselines, dataset sizes, error bars, or ablation tables in the provided text; the full results section must supply these quantities so that the headline 4-shot result can be verified.
[§3] §3 (Dataset Construction): the generalization claim to unseen sub-categories depends on the step that populates 'candidate sub-categories' inside each CoT rationale. The manuscript must report overlap statistics between these candidates and the unseen test distribution, or an ablation that removes label information from the candidate list, to rule out leakage; without such a check the reported unseen gains risk being an artifact of the data pipeline rather than evidence of the training method.
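
The overlap statistic requested in the second major comment could be computed along these lines. This is a sketch under assumptions: `candidate_lists` holds the candidate sub-categories of each training rationale, `unseen_labels` holds the held-out test labels, and the case-insensitive matching rule is a choice made here, not one the paper specifies.

```python
def candidate_overlap_rate(candidate_lists, unseen_labels):
    """Fraction of rationales whose candidate list names at least one
    unseen test sub-category (case-insensitive, whitespace-trimmed)."""
    unseen = {label.strip().lower() for label in unseen_labels}
    if not candidate_lists:
        return 0.0
    hits = sum(
        any(c.strip().lower() in unseen for c in candidates)
        for candidates in candidate_lists
    )
    return hits / len(candidate_lists)
```

A rate near zero would support the generalization claim; a high rate would mean the model saw unseen-class names during training, even if it never saw their images.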

minor comments (2)

[Abstract] The abstract would benefit from a single sentence containing the key quantitative improvements (e.g., accuracy deltas on seen/unseen splits) to allow readers to assess impact immediately.
[§3.2] Notation for the intra-class and inter-class augmentations in the policy-optimization stage should be defined explicitly (e.g., as equations) rather than described only in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our claims regarding Fine-R1's performance and generalization. We address each major point below and will incorporate the suggested revisions in the next version of the manuscript.

Referee: [Abstract and §4] Abstract and §4 (Results): the claim of outperformance on both seen and unseen sub-categories is stated without any numerical values, baselines, dataset sizes, error bars, or ablation tables in the provided text; the full results section must supply these quantities so that the headline 4-shot result can be verified.

Authors: We agree that explicit numerical support is essential for verifying the headline claims. In the revised manuscript, we will update the abstract to include key quantitative results (e.g., accuracy gains on seen and unseen sub-categories under 4-shot training) and fully expand §4 with detailed tables. These will report comparisons against general MLLMs, reasoning MLLMs, and CLIP baselines; exact dataset sizes and shot settings; error bars computed over multiple random seeds; and comprehensive ablation tables for the CoT SFT and triplet policy optimization components. This will make the 4-shot outperformance directly verifiable from the text.

Revision: yes

Referee: [§3] §3 (Dataset Construction): the generalization claim to unseen sub-categories depends on the step that populates 'candidate sub-categories' inside each CoT rationale. The manuscript must report overlap statistics between these candidates and the unseen test distribution, or an ablation that removes label information from the candidate list, to rule out leakage; without such a check the reported unseen gains risk being an artifact of the data pipeline rather than evidence of the training method.

Authors: We recognize the critical need to demonstrate that unseen-category gains arise from the training method rather than data-construction artifacts. In the revised manuscript, we will add overlap statistics quantifying how often 'candidate sub-categories' in the CoT rationales intersect with the unseen test distribution. We will also include an ablation that removes explicit label information from the candidate lists during CoT construction and retrains the model, reporting the resulting performance on unseen sub-categories to confirm robustness against leakage.

Revision: yes
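
The error bars over multiple random seeds promised in the first response amount to a mean and a spread across repeated runs. A minimal sketch, assuming one accuracy value per seed and using the sample standard deviation; the function name and the choice of spread measure are assumptions, since the rebuttal does not say which statistic will be reported.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Made-up accuracies from three seeds, for illustration only.
avg, spread = summarize_runs([0.70, 0.72, 0.74])
```

With 4-shot training the sampled images themselves vary by seed, so the spread reflects both optimization noise and which four examples were drawn per class.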

Circularity Check

0 steps flagged

No significant circularity; empirical training pipeline with independent dataset construction

The paper presents a standard supervised fine-tuning plus policy optimization pipeline on a manually constructed FGVR CoT dataset. No equations, fitted parameters, or self-referential definitions are used to derive the performance claims. The 4-shot outperformance on seen/unseen categories is reported as an empirical outcome of the training process rather than a quantity forced by construction from the inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked that reduce the central result to prior author work. The skeptic concern about candidate sub-category selection is a methodological generalization risk, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the quality and representativeness of the constructed CoT dataset and on the assumption that the described augmentations improve generalization without introducing new biases; no explicit free parameters beyond the 4-shot regime are named.

free parameters (1)

4-shot training regime
The number of examples per category is fixed at four to demonstrate low-data capability; this choice is presented as part of the experimental protocol rather than fitted.
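
Building a k-shot subset is a per-class sampling step. The sketch below assumes the dataset is a list of `(item, label)` pairs and uses a fixed seed for repeatability; the function name, the data layout, and the seed handling are hypothetical and not taken from the released code.

```python
import random
from collections import defaultdict


def sample_k_shot(samples, k=4, seed=0):
    """Pick k examples per label from (item, label) pairs."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    rng = random.Random(seed)  # local RNG so the global state is untouched
    subset = []
    for label in sorted(by_label):  # sorted for a deterministic order
        items = by_label[label]
        if len(items) < k:
            raise ValueError(f"label {label!r} has fewer than {k} samples")
        subset.extend((item, label) for item in rng.sample(items, k))
    return subset
```

Varying `seed` yields different 4-shot draws, which is one way to produce the multiple runs needed for error bars.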

axioms (1)

Domain assumption: A structured chain-of-thought rationale consisting of visual analysis, candidate sub-categories, comparison, and prediction produces a stronger open-world classifier than standard fine-tuning.
Invoked in the description of the Chain-of-Thought Supervised Fine-tuning stage.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 conditional novelty 7.0

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
cs.CV 2026-05 unverdicted novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
