pith. machine review for the scientific record.

arxiv: 2602.07605 · v3 · submitted 2026-02-07 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Pith reviewed 2026-05-16 06:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords fine-grained visual recognition · chain-of-thought reasoning · multimodal large language models · policy optimization · few-shot adaptation · open-world classification · visual sub-category discrimination · triplet augmentation

The pith

Fine-R1 adapts multimodal LLMs to fine-grained visual recognition through chain-of-thought supervised fine-tuning and triplet-augmented policy optimization, achieving superior performance on seen and unseen sub-categories with only 4-shot training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the gap between general multimodal LLMs and specialized contrastive models on fine-grained visual recognition by introducing an R1-style training process that turns the model into an open-world classifier. Standard MLLMs struggle with sub-category distinctions and overfit to training examples, while dedicated data collection for every new domain remains expensive. The authors construct a structured CoT dataset and apply intra-class and inter-class trajectory augmentations during policy optimization to build robustness and discrimination. If successful, this shows that reasoning-based adaptation can replace large-scale annotation for hierarchical visual tasks and support domains where expert labels are scarce.

Core claim

Fine-R1 adapts general MLLMs to FGVR by first performing Chain-of-Thought Supervised Fine-tuning on a manually built FGVR CoT dataset whose rationales follow the sequence of visual analysis, candidate sub-categories, comparison, and prediction; this step converts the model into a strong open-world classifier. It then applies Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to handle intra-class variance and Inter-class Augmentation maximizes response differences across sub-categories to sharpen discrimination. With only 4-shot training, the resulting model surpasses general MLLMs, reasoning MLLMs, and even contrastive CLIP models on both seen and unseen sub-categories.

What carries the argument

R1-style training framework that combines Chain-of-Thought Supervised Fine-tuning on a structured FGVR CoT rationale dataset with Triplet Augmented Policy Optimization using intra-class and inter-class trajectory mixing
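One way to picture the intra-class mixing step is as pooling rollouts before group-relative advantage normalization. The sketch below is an illustration under that assumption, not the authors' implementation: the function name `mixed_group_advantages`, the binary rewards, and the decision to pool anchor and positive rollouts into a single group are all hypothetical.

```python
from statistics import mean, pstdev


def mixed_group_advantages(anchor_rewards, positive_rewards, eps=1e-6):
    """Normalize rewards over the pooled anchor + positive rollout group.

    Assumption: "mixing trajectories" is modeled here as pooling rollouts
    sampled for an anchor image and for a positive image of the same
    sub-category, then computing group-relative (GRPO-style) advantages.
    """
    pooled = list(anchor_rewards) + list(positive_rewards)
    if not pooled:
        raise ValueError("at least one rollout reward is required")
    mu = mean(pooled)
    sigma = pstdev(pooled)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in pooled]


# Hypothetical rewards: 1.0 = correct sub-category, 0.0 = wrong.
advantages = mixed_group_advantages([1.0, 0.0], [1.0, 1.0])
```

Under this reading, a wrong rollout on the anchor image is penalized relative to correct rollouts on the positive image as well, which is one plausible route to robustness against intra-class variance.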

If this is right

  • The method reduces the annotation burden for fine-grained tasks from thousands of labels to a small number of shots while maintaining open-world capability.
  • MLLMs can now match or exceed contrastive models on both seen and unseen sub-categories without task-specific pre-training.
  • The structured CoT format enables the model to output explicit comparison steps that support downstream knowledge-intensive applications.
  • Intra- and inter-class trajectory augmentation improves robustness to visual variance without requiring additional real images.
  • The approach scales to domains where exhaustive sub-category labeling is impractical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same CoT-plus-augmentation recipe could be applied to hierarchical tasks in other modalities such as audio or text classification.
  • If the rationale structure generalizes, future work could replace manual dataset construction with automated rationale generation from weaker models.
  • The performance gain on unseen categories suggests the training reduces reliance on spurious visual cues that CLIP models often exploit.

Load-bearing premise

The manually constructed high-quality FGVR CoT dataset with its fixed rationale structure of visual analysis, candidate sub-categories, comparison, and prediction will produce a generalizable open-world classifier that does not overfit to the seen sub-categories in the training distribution.
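The fixed four-stage rationale can be pictured as a small record type. This is a minimal sketch, assuming one text field per stage; the class name `CoTRationale`, its field names, the example bird names, and the rule that the prediction must appear among the candidates are illustrative assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoTRationale:
    """One training rationale following the four stages named in the paper."""
    visual_analysis: str
    candidate_subcategories: List[str]
    comparison: str
    prediction: str

    def is_well_formed(self) -> bool:
        # Assumed sanity rule: every stage is filled in and the final
        # prediction is one of the listed candidates.
        return (
            bool(self.visual_analysis.strip())
            and bool(self.candidate_subcategories)
            and bool(self.comparison.strip())
            and self.prediction in self.candidate_subcategories
        )


# Hypothetical example content, not drawn from the dataset.
example = CoTRationale(
    visual_analysis="Small blue bird with a conical bill.",
    candidate_subcategories=["Indigo Bunting", "Blue Grosbeak"],
    comparison="No rusty wing bars, so Blue Grosbeak is less likely.",
    prediction="Indigo Bunting",
)
```

A check like `is_well_formed` makes the premise testable at the data level: a rationale whose prediction never appears among its candidates would not exercise the comparison stage at all.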

What would settle it

Test the trained Fine-R1 model on an entirely new fine-grained dataset whose sub-categories share no overlap with the training set and measure whether its accuracy drops below that of a standard CLIP model trained on the same 4 shots.
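The test described above reduces to two checks: the category sets must be disjoint, and the two accuracies must be compared on the same held-out labels. A minimal sketch follows; the helper names and the toy predictions are hypothetical and only illustrate the bookkeeping.

```python
def accuracy(predictions, labels):
    """Fraction of exact label matches."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def categories_disjoint(train_categories, test_categories):
    """True when no test sub-category was seen during training."""
    return set(train_categories).isdisjoint(test_categories)


def fine_r1_falls_below_clip(fine_r1_preds, clip_preds, labels):
    """True when the settling test comes out against Fine-R1."""
    return accuracy(fine_r1_preds, labels) < accuracy(clip_preds, labels)


# Toy values for illustration only.
labels = ["a", "b", "c", "x"]
fine_r1_preds = ["a", "b", "c", "d"]
clip_preds = ["a", "b", "x", "y"]
verdict = fine_r1_falls_below_clip(fine_r1_preds, clip_preds, labels)
```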

Figures

Figures reproduced from arXiv: 2602.07605 by Hulingxiao He, Yuxin Peng, Zijun Geng.

Figure 1: Fine-R1 generates Chain-of-Thought (CoT) before producing the final fine-grained visual
Figure 2: Overview of the proposed two-stage training framework integrating CoT SFT and TAPO.
Figure 3: Ablation study on training methods, inference strategies, and key components of Fine-R1.
Figure 4: PCA projections of the last hidden state rep
Figure 5: Different image-level visual concepts for objects with the same subcategory.
Figure 6: Case study comparing Fine-R1-3B and Qwen2.5-VL-3B on Stanford Car-196 (
Original abstract

Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transition the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Fine-R1, an MLLM for fine-grained visual recognition (FGVR) via an R1-style framework consisting of (1) Chain-of-Thought supervised fine-tuning on a manually constructed high-quality FGVR CoT dataset whose rationales follow the structure 'visual analysis, candidate sub-categories, comparison, and prediction' and (2) Triplet Augmented Policy Optimization that applies intra-class mixing of trajectories and inter-class response distinction. The central claim is that 4-shot training suffices for Fine-R1 to outperform general MLLMs, reasoning MLLMs, and contrastive CLIP models on both seen and unseen sub-categories.

Significance. If the reported gains on unseen sub-categories are shown to be free of data-construction artifacts, the work would offer a practical route to adapting MLLMs to FGVR with minimal supervision, addressing the well-known overfitting and annotation-cost problems in the area. The public code release is a positive factor for reproducibility.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): the claim of outperformance on both seen and unseen sub-categories is stated without any numerical values, baselines, dataset sizes, error bars, or ablation tables in the provided text; the full results section must supply these quantities so that the headline 4-shot result can be verified.
  2. [§3] §3 (Dataset Construction): the generalization claim to unseen sub-categories depends on the step that populates 'candidate sub-categories' inside each CoT rationale. The manuscript must report overlap statistics between these candidates and the unseen test distribution, or an ablation that removes label information from the candidate list, to rule out leakage; without such a check the reported unseen gains risk being an artifact of the data pipeline rather than evidence of the training method.
minor comments (2)
  1. [Abstract] The abstract would benefit from a single sentence containing the key quantitative improvements (e.g., accuracy deltas on seen/unseen splits) to allow readers to assess impact immediately.
  2. [§3.2] Notation for the intra-class and inter-class augmentations in the policy-optimization stage should be defined explicitly (e.g., as equations) rather than described only in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our claims regarding Fine-R1's performance and generalization. We address each major point below and will incorporate the suggested revisions in the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): the claim of outperformance on both seen and unseen sub-categories is stated without any numerical values, baselines, dataset sizes, error bars, or ablation tables in the provided text; the full results section must supply these quantities so that the headline 4-shot result can be verified.

    Authors: We agree that explicit numerical support is essential for verifying the headline claims. In the revised manuscript, we will update the abstract to include key quantitative results (e.g., accuracy gains on seen and unseen sub-categories under 4-shot training) and fully expand §4 with detailed tables. These will report comparisons against general MLLMs, reasoning MLLMs, and CLIP baselines; exact dataset sizes and shot settings; error bars computed over multiple random seeds; and comprehensive ablation tables for the CoT SFT and triplet policy optimization components. This will make the 4-shot outperformance directly verifiable from the text. revision: yes

  2. Referee: [§3] §3 (Dataset Construction): the generalization claim to unseen sub-categories depends on the step that populates 'candidate sub-categories' inside each CoT rationale. The manuscript must report overlap statistics between these candidates and the unseen test distribution, or an ablation that removes label information from the candidate list, to rule out leakage; without such a check the reported unseen gains risk being an artifact of the data pipeline rather than evidence of the training method.

    Authors: We recognize the critical need to demonstrate that unseen-category gains arise from the training method rather than data-construction artifacts. In the revised manuscript, we will add overlap statistics quantifying how often 'candidate sub-categories' in the CoT rationales intersect with the unseen test distribution. We will also include an ablation that removes explicit label information from the candidate lists during CoT construction and retrains the model, reporting the resulting performance on unseen sub-categories to confirm robustness against leakage. revision: yes
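The overlap statistic promised in the second response could be as simple as counting how many training rationales name an unseen test label among their candidates. The following is a sketch under that assumption; `candidate_overlap`, the case-insensitive matching, and the example labels are hypothetical choices, not the authors' protocol.

```python
from collections import Counter


def candidate_overlap(candidate_lists, unseen_labels):
    """Return (share of rationales naming an unseen label, per-label hit counts)."""
    unseen = {label.strip().lower() for label in unseen_labels}
    hits = Counter()
    leaking = 0
    for candidates in candidate_lists:
        matched = {c.strip().lower() for c in candidates} & unseen
        if matched:
            leaking += 1
            hits.update(matched)
    rate = leaking / len(candidate_lists) if candidate_lists else 0.0
    return rate, hits


# Hypothetical candidate lists from three training rationales.
rate, hits = candidate_overlap(
    [
        ["Indigo Bunting", "Blue Grosbeak"],
        ["Lazuli Bunting", "Indigo Bunting"],
        ["House Wren"],
    ],
    unseen_labels=["Blue Grosbeak", "Painted Bunting"],
)
```

A non-zero rate would not prove leakage by itself, but it identifies which unseen labels the model has already encountered as text during training and therefore which results the ablation needs to re-examine.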

Circularity Check

0 steps flagged

No significant circularity; empirical training pipeline with independent dataset construction

full rationale

The paper presents a standard supervised fine-tuning plus policy optimization pipeline on a manually constructed FGVR CoT dataset. No equations, fitted parameters, or self-referential definitions are used to derive the performance claims. The 4-shot outperformance on seen/unseen categories is reported as an empirical outcome of the training process rather than a quantity forced by construction from the inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked that reduce the central result to prior author work. The skeptic concern about candidate sub-category selection is a methodological generalization risk, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the quality and representativeness of the constructed CoT dataset and on the assumption that the described augmentations improve generalization without introducing new biases; no explicit free parameters beyond the 4-shot regime are named.

free parameters (1)
  • 4-shot training regime
    The number of examples per category is fixed at four to demonstrate low-data capability; this choice is presented as part of the experimental protocol rather than fitted.
axioms (1)
  • domain assumption: A structured chain-of-thought rationale consisting of visual analysis, candidate sub-categories, comparison, and prediction produces a stronger open-world classifier than standard fine-tuning.
    Invoked in the description of the Chain-of-Thought Supervised Fine-tuning stage.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

    cs.CV 2026-05 conditional novelty 7.0

    FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

  2. ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

    cs.CV 2026-05 unverdicted novelty 5.0

    ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 2 Pith papers · 21 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923,

  3. [3]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,

  4. [4]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271,

  5. [5]

    Active prompting with chain-of-thought for large language models

    Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv:2302.12246,

  6. [6]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023a. Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to m...

  7. [7]

    African or european swallow? benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496, 2024

    Gregor Geigle, Radu Timofte, and Goran Glavaš. African or european swallow? benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    Analyzing and boosting the power of fine-grained visual recognition for multi-modal large language models

    11 Published as a conference paper at ICLR 2026 Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, and Yuxin Peng. Analyzing and boosting the power of fine-grained visual recognition for multi-modal large language models.arXiv preprint arXiv:2501.15140,

  10. [10]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  11. [11]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186,

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

  14. [14]

    Building and better understanding vision-language models: insights and future directions. arXiv preprint arXiv:2408.12637, 2024a

    Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. Building and better understanding vision-language models: insights and future directions. arXiv preprint arXiv:2408.12637, 2024a. Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024b. ...

  15. [15]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pp. 1...

  16. [16]

    Revisiting mllms: An in-depth analysis of image classification abilities. arXiv preprint arXiv:2412.16418, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 20...

  17. [17]

    Ursa: Understanding and verifying chain-of-thought reasoning in multimodal mathematics. arXiv preprint arXiv:2501.04686, 2025

    Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, and Yujiu Yang. Ursa: Understanding and verifying chain-of-thought reasoning in multimodal mathematics. arXiv:2501.04686, 2025a. Yilun Luo, HuaQing Zheng, Haoqian Meng, Wenyuan Liu, and Peng Zhang. Post-training quantization of openpangu models for efficient deployment on...

  18. [18]

    Deepperception: Advancing r1-like cognitive visual perception in mllms for knowledge-intensive visual grounding. arXiv preprint arXiv:2503.12797,

    Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F Wong, Xiaoyi Feng, and Maosong Sun. Deepperception: Advancing r1-like cognitive visual perception in mllms for knowledge-intensive visual grounding. arXiv preprint arXiv:2503.12797,

  19. [19]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151,

  20. [20]

    Sparse attention vectors: Generative multimodal model features are discriminative vision-language classifiers. arXiv preprint arXiv:2412.00142, 2024a

    Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, and Roei Herzig. Sparse attention vectors: Generative multimodal model features are discriminative vision-language classifiers. arXiv preprint arXiv:2412.00142, 2024a. Chancharik Mitra, Brandon Huang, Trevor Darrell, a...

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  22. [22]

    Enhancing cognition and explainability of multimodal foundation models with self-synthesized data. arXiv preprint arXiv:2502.14044, 2025a

    Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, and Ninghao Liu. Enhancing cognition and explainability of multimodal foundation models with self-synthesized data. arXiv preprint arXiv:2502.14044, 2025a. Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, and Ninghao Liu. Enhancing cognition and explainability of multimodal foundation models with self-synthesized ...

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  24. [24]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv:2501.06186,

  25. [25]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786,

  26. [26]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset

  27. [27]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

  28. [28]

    Perception-Aware Policy Optimization for Multimodal Reasoning

    Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448,

  29. [29]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-o...

  30. [30]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122,

  31. [31]

    Internlm-Math: Open Math Large Language Models Toward Verifiable Reasoning. arXiv preprint arXiv:2402.06332, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332,

  32. [32]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  33. [33]

    Codedpo: Aligning code models with self generated and verified source code. arXiv preprint arXiv:2410.05605, 2024a

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. Codedpo: Aligning code models with self generated and verified source code. arXiv preprint arXiv:2410.05605, 2024a. Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathem...

  34. [34]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372,

  35. [35]

    B PROMPT DESIGN Table 5: Prompt template for FGVR CoT data construction

    A QUALITATIVE RESULTS OF VISUAL CONCEPTS Figure 5: Different image-level visual concepts for objects with the same subcategory. B PROMPT DESIGN Table 5: Prompt template for FGVR CoT data construction. This is a picture of a {label} with the following visual features: {concepts}. Based on the information provided,...

  36. [36]

    as the base models, and full fine-tune them for 10 epochs using Llama-Factory framework (Zheng et al., 2024). After CoT SFT, we subsequently train 3B and 7B models for 10 and 5 epochs on the separate subset of 4-shot training data, using the proposed TAPO instantiated from DAPO baseline (Yu et al.,

  37. [37]

    visual analysis-candidate subcategories-comparison-prediction

    with clipping factors set to ϵ_l = 0.2, ϵ_h = 0.28, reference KL removed, token-level loss averaging enabled, and dynamic sampling with a maximum of 20 retries. For other RL-related hyperparameters, we adopt the default settings from PAPO (Wang et al., 2025): a global batch size of 128, a rollout batch size of...
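
The asymmetric clipping factors and token-level averaging quoted above can be illustrated with a small numeric sketch. It assumes the DAPO-style surrogate in which the importance ratio is clipped to [1 − ϵ_l, 1 + ϵ_h] and the loss is averaged over all tokens in the batch; the function name, the list-of-lists input layout, and the sample numbers are hypothetical, and the triplet augmentations themselves are not modeled.

```python
EPS_LOW = 0.2    # lower clipping factor quoted in the excerpt
EPS_HIGH = 0.28  # upper clipping factor quoted in the excerpt


def clipped_token_objective(ratios, advantages,
                            eps_low=EPS_LOW, eps_high=EPS_HIGH):
    """Token-averaged clipped surrogate objective.

    ratios: one list of per-token importance ratios per response.
    advantages: one scalar advantage per response (assumed layout).
    """
    total, n_tokens = 0.0, 0
    for token_ratios, adv in zip(ratios, advantages):
        for r in token_ratios:
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            # Pessimistic bound: take the smaller of the raw and clipped terms.
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no tokens to average over")
    # Token-level averaging: divide by the token count, not the response count.
    return total / n_tokens


# Hypothetical batch: two responses, three tokens in total.
value = clipped_token_objective([[1.5, 1.0], [0.5]], [1.0, -1.0])
```

With a positive advantage the ratio 1.5 is capped at 1.28, while with a negative advantage the ratio 0.5 is floored at 0.8, so the wider upper bound only affects tokens whose probability the update is trying to raise.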