pith. machine review for the scientific record. sign in

arxiv: 2603.03197 · v3 · submitted 2026-03-03 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Specificity-aware reinforcement learning for fine-grained open-world classification

Authors on Pith no claims yet

Pith reviewed 2026-05-15 16:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords reinforcement learningfine-grained classificationopen-world classificationlarge multimodal modelsspecificityimage classificationverifier reward
0
0 comments X

The pith

A reinforcement learning method steers large multimodal models toward both correct and specific predictions in open-world fine-grained image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reasoning large multimodal models already hold fine-grained visual knowledge but default to generic labels unless guided otherwise. It introduces SpeciaRL, a reinforcement learning framework that fine-tunes these models using a reward signal drawn from the strongest outputs among online rollouts. The reward is verified to keep predictions accurate while pushing for more detailed class names. This matters because open-world classification, where no fixed label list exists, requires models to name subtle visual differences correctly rather than fall back on broad categories like bird or car.

Core claim

SpeciaRL fine-tunes reasoning LMMs for fine-grained image classification under the open-world setting by introducing a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions.

What carries the argument

The verifier-based reward signal anchored to the best predictions within online rollouts, which encourages more specific outputs without reducing correctness.

If this is right

  • SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks.
  • The method surpasses existing approaches in out-of-domain experiments for open-world fine-grained image classification.
  • Reasoning LMMs can be steered to use their intrinsic fine-grained knowledge more effectively through this rollout-based reward design.
  • Open-world fine-grained classification advances by balancing accuracy and detail without relying on a predefined label set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rollout-verification reward pattern could be tested on tasks such as generating detailed image captions or attribute lists where specificity also matters.
  • Similar mechanisms might reduce overly broad outputs when LMMs are applied to medical or scientific image domains with subtle distinctions.
  • Extending the approach to video or 3D data would require checking whether the verifier remains reliable across temporal or spatial rollouts.
  • Combining SpeciaRL with other alignment techniques could further stabilize performance when the model encounters entirely novel fine-grained categories.

Load-bearing premise

The verifier-based reward signal derived from online rollouts can reliably promote specificity without introducing bias or reducing the model's ability to produce correct predictions on unseen fine-grained concepts.

What would settle it

An experiment in which models fine-tuned with SpeciaRL show either lower accuracy on held-out fine-grained classes or no measurable increase in specificity compared with standard fine-tuning or prompting baselines.

Figures

Figures reproduced from arXiv: 2603.03197 by Alessandro Conti, Davide Berasi, Elisa Ricci, Samuele Angheben, Yiming Wang.

Figure 1
Figure 1. Figure 1: In open-world image classification, improving predic [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Predictions distribution over categories for Qwen2.5VL-7B [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of SpeciaRL Given an input image I, the policy model generates N open-ended predictions {p1, . . . , pN }. Each prediction is categorized by a judge model (LLM verifier) as wrong or correct at different levels of specificity with respect to the ground-truth. A verifiable reward r ∗ i is then assigned according to whether the prediction’s category ci meets the adaptive reference level c ∗ , which i… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative examples of the think-answer output from the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LMM default prompt for prediction. “Be specific” LMM prompt (Pc) Classify the image, be specific. Output the thinking process in <think> </think> and the final answer in <answer> </answer> tags. The output answer format should be as follows: <think> ... </think> <answer>a single label or the word ‘None’ to abstain.</answer>. Please strictly follow the format [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: LMM prompt for prediction for the “Be specific” baseline. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt for the LLM-as-a-judge verifier categorizing a prediction given the target ground-truth. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt for generating the reasoning traces used to train the supervised fine-tuning baseline model. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: LLM-as-a-judge per-batch verification times during [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional qualitative examples of the think-answer output of the base model Qwen2.5VL-7B and SpeciaRL. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 14
Figure 14. Figure 14: Generated LMM prompt (Pc (v2)). Additional LMM prompt (Pc (v3)) Classify the image. Only care about precision/specificity: always output the most fine-grained label you can. Do not abstain. Do not output ’None’. If multiple fine-grained labels are plausible, choose the single most specific label you consider most likely. Output the thinking process in <think> </think> and the final answer in <answer> </an… view at source ↗
Figure 12
Figure 12. Figure 12: Failure cases. Qualitative examples of SpeciaRL pro￾viding a Wrong prediction (Top & Center) and of SpeciaRL unnec￾essarily using a scientific name for a generic concept (Bottom). Additional LMM prompt (Pc (v1)) Classify the image. Prioritize correctness first. Be as specific as you can ONLY when you are confident the finer-grained label is correct. If you are not confident about a fine-grained label, out… view at source ↗
Figure 13
Figure 13. Figure 13: Generated LMM prompt (Pc (v1)). B.4. Additional Prompting baselines We report the performance of three additional top￾performing variants of the Pc prompt. These variants were generated using ChatGPT by requesting three different op￾timal predictor prompts given the full task context. As Additional LMM prompt (Pc (v2)) Classify the image. Optimize for high precision: do not guess. If you are unsure, absta… view at source ↗
Figure 16
Figure 16. Figure 16: Generated Prompt for the LLM-as-a-judge verifier. [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Generated Prompt for the LLM-as-a-judge verifier. [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Generated Prompt for the LLM-as-a-judge verifier. [PITH_FULL_IMAGE:figures/full_fig_p023_18.png] view at source ↗
read the original abstract

Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SpeciaRL, a specificity-aware reinforcement learning framework for fine-tuning reasoning large multimodal models on fine-grained open-world image classification. It introduces a dynamic verifier-based reward signal derived from the best predictions among online rollouts to promote specificity while respecting model capabilities and avoiding incorrect predictions, claiming superior correctness-specificity trade-offs on out-of-domain benchmarks relative to prior methods.

Significance. If the empirical claims hold after clarification, the work would advance open-world fine-grained classification by providing a practical RL mechanism to steer LMMs toward more precise outputs without accuracy loss. The public release of code and models supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim of best trade-off on extensive out-of-domain benchmarks is asserted without any reported metrics, baselines, statistical tests, or verifier implementation details, leaving the empirical support for the main result invisible.
  2. [Method] Method (verifier reward definition): the dynamic reward anchored to the 'best' rollout prediction requires an explicit definition of the verifier (e.g., separate model, rule-based, or self-consistency) and the scoring rule for 'best' when ground truth is unavailable; without this, it is impossible to verify that specificity gains are not achieved at the expense of correctness on unseen fine-grained classes.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific fine-grained benchmarks and at least one quantitative result to ground the superiority claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and propose revisions to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of best trade-off on extensive out-of-domain benchmarks is asserted without any reported metrics, baselines, statistical tests, or verifier implementation details, leaving the empirical support for the main result invisible.

    Authors: We agree that the abstract, as a high-level summary, omits specific numbers to maintain brevity. The full paper includes comprehensive results in Section 4 with metrics, baselines, and analyses demonstrating the best trade-off. In the revised version, we will incorporate key empirical highlights into the abstract, such as the superior performance on out-of-domain benchmarks, and ensure a brief reference to the verifier approach. revision: yes

  2. Referee: [Method] Method (verifier reward definition): the dynamic reward anchored to the 'best' rollout prediction requires an explicit definition of the verifier (e.g., separate model, rule-based, or self-consistency) and the scoring rule for 'best' when ground truth is unavailable; without this, it is impossible to verify that specificity gains are not achieved at the expense of correctness on unseen fine-grained classes.

    Authors: Thank you for this important clarification request. The current manuscript introduces the verifier-based reward in Section 3 but does not provide sufficient implementation details. We will revise the method section to explicitly define the verifier (as a self-consistency mechanism across rollouts combined with a specificity scoring rule based on prediction granularity), and detail the scoring for selecting the 'best' prediction without ground truth by prioritizing consistent and specific outputs while penalizing potential inaccuracies through capability-aware clipping. This will include pseudocode and examples to allow verification that correctness is preserved. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL method with no self-referential derivations or fitted predictions

full rationale

The paper proposes SpeciaRL as an empirical reinforcement learning framework that introduces a dynamic verifier-based reward anchored to online rollouts. No equations, derivations, or parameter-fitting steps are presented that reduce the claimed trade-off between correctness and specificity to a self-definition, fitted input, or self-citation chain. The approach is described as steering LMMs via RL without invoking uniqueness theorems, ansatzes smuggled through citations, or renaming known results. The central claim rests on experimental outcomes rather than a closed mathematical loop, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the method appears to rely on standard RL components plus a custom verifier whose internals are not detailed here.

pith-pipeline@v0.9.0 · 5523 in / 1052 out tokens · 26773 ms · 2026-05-15T16:41:38.789146+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 19 internal anchors

  1. [1]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models.arXiv preprint arXiv:2311.18232,

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models.arXiv preprint arXiv:2311.18232,

  2. [2]

    Flamingo: a visual language model for few-shot learning.NeurIPS, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 2022. 2

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 4, 5, 6, 7, 16, 17

  4. [4]

    Towards open world recognition

    Abhijit Bendale and Terrance Boult. Towards open world recognition. InCVPR, pages 1893–1902, 2015. 1

  5. [5]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InECCV, 2014. 4, 16, 17, 19

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 7, 16, 17

  7. [7]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InCVPR,

  8. [8]

    V ocabulary-free image classification.NeurIPS, 2023

    Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, and Elisa Ricci. V ocabulary-free image classification.NeurIPS, 2023. 1, 6, 7, 16, 17

  9. [9]

    On large multimodal models as open-world image classifiers

    Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yim- ing Wang, Paolo Rota, and Elisa Ricci. On large multimodal models as open-world image classifiers. InICCV, 2025. 1, 2, 3, 4, 5, 6, 7, 12, 14, 19

  10. [10]

    Learning without critics? revisiting grpo in classical reinforcement learning environments.arXiv preprint arXiv:2511.03527, 2025

    Bryan LM de Oliveira, Felipe V Frujeri, Marcos PCM Queiroz, Luana GB Martins, Telma W de L Soares, and Luckeciano C Melo. Learning without critics? revisiting grpo in classical reinforcement learning environments.arXiv preprint arXiv:2511.03527, 2025. 8

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 1

  12. [12]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

  13. [13]

    Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2023

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2023. 2

  14. [14]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025. 3

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 5, 6

  16. [16]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024. 3

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 2

  18. [18]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, pages 4904–

  19. [19]

    Large language models are zero-shot reasoners.NeurIPS, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.NeurIPS, 35:22199–22213, 2022. 2

  20. [20]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV-WS, 2013. 4, 17

  21. [21]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the symposium on operating systems principles, pages 611– 626, 2023. 12, 14

  22. [22]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Mi- randa, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Push- ing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024. 3, 5

  23. [23]

    The measurement of observer agreement for categorical data.biometrics, 1977

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, 1977. 20

  24. [24]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InCVPR, 2024. 2

  25. [25]

    Llava- next: Stronger llms supercharge multimodal capabilities in the wild, 2024

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava- next: Stronger llms supercharge multimodal capabilities in the wild, 2024. 1

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2

  27. [27]

    Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, 2023. 1, 2

  28. [28]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InCVPR, 2024. 2

  29. [29]

    Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning.arXiv preprint arXiv:2503.16188, 2025

    Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, and Kaipeng Zhang. Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning.arXiv preprint arXiv:2503.16188, 2025. 3, 6

  30. [30]

    Revisiting mllms: An in- depth analysis of image classification abilities.arXiv preprint arXiv:2412.16418, 2024

    Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, and Jingdong Wang. Revisiting mllms: An in- depth analysis of image classification abilities.arXiv preprint arXiv:2412.16418, 2024. 1, 2

  31. [31]

    Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024. 2

  32. [32]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025. 20

  33. [33]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

  34. [34]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual clas- sification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 4, 17

  35. [35]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. InIndian conference on computer vision, graphics & image processing. IEEE, 2008. 4, 16, 17, 19

  36. [36]

    Training language models to follow instructions with human feedback.NeurIPS, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.NeurIPS, 35:27730–27744, 2022. 2

  37. [37]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. InCVPR, 2012. 4, 16, 17, 19

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021. 1, 2, 6

  39. [39]

    A survey on semi-, self-and unsupervised learning for image classification.IEEE Access, 9:82146– 82168, 2021

    Lars Schmarje, Monty Santarossa, Simon-Martin Schr¨oder, and Reinhard Koch. A survey on semi-, self-and unsupervised learning for image classification.IEEE Access, 9:82146– 82168, 2021. 1

  40. [40]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3, 6, 20

  41. [41]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work.arXiv preprint arXiv:2409.19256, 2024. 6, 14

  42. [42]

    Flava: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guil- laume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. InCVPR, 2022. 2

  43. [43]

    Taxonomy-aware evaluation of vision-language models

    V´esteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, and Stella Frank. Taxonomy-aware evaluation of vision-language models. In CVPR, pages 9109–9120, 2025. 2, 3

  44. [44]

    Learning to summarize with human feedback.NeurIPS, 33:3008–3021, 2020

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback.NeurIPS, 33:3008–3021, 2020. 2

  45. [45]

    Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains.arXiv preprint arXiv:2503.23829, 2025

    Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains.arXiv preprint arXiv:2503.23829, 2025. 3

  46. [46]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998. 2

  47. [47]

    Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck.arXiv preprint arXiv:2505.24840, 2025

    Yuwen Tan, Yuan Qing, and Boqing Gong. Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck.arXiv preprint arXiv:2505.24840, 2025. 2

  48. [48]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 3, 5

  49. [49]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. 6, 14

  50. [50]

    The caltech-ucsd birds-200-2011 dataset,

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset,

  51. [51]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2

  52. [52]

    Chain-of- thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022. 2

  53. [53]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2. 5-math technical report: To- ward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024. 3

  54. [54]

    Internlm-math: Open math large lan- guage models toward verifiable reasoning.arXiv preprint arXiv:2402.06332, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yun- fan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large lan- guage models toward verifiable reasoning.arXiv preprint arXiv:2402.06332, 2024. 3

  55. [55]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

  56. [56]

    Object recognition as next token prediction

    Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, and Ser-Nam Lim. Object recognition as next token prediction. InCVPR, 2024. 2

  57. [57]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. 5

  58. [58]

    Star: Self-taught reasoner bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Good- man. Star: Self-taught reasoner bootstrapping reasoning with reasoning. InNeurIPS, 2024. 7

  59. [59]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023. 1, 2

  60. [60]

    Codedpo: Aligning code models with self generated and verified source code

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. Codedpo: Aligning code models with self generated and verified source code. InACL, pages 15854–15871, 2025. 3

  61. [61]

    Improve vision language model chain-of-thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yim- ing Yang. Improve vision language model chain-of-thought reasoning. InACL, pages 1631–1662, 2025. 7

  62. [62]

    Why are visually-grounded language models bad at image classifi- cation?NeurIPS, 2024

    Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classifi- cation?NeurIPS, 2024. 1, 2

  63. [63]

    Automated generation of challenging multiple-choice questions for vision language model evaluation

    Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, et al. Automated generation of challenging multiple-choice questions for vision language model evaluation. InCVPR, pages 29580–29590, 2025. 2

  64. [64]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. InICLR,

  65. [65]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning.arXiv preprint arXiv:2504.16084, 2025. 2 Specificity-aware reinforcement learning for fine-grained open-world classification Supplementary Material In this supplementary material, we present add...

  66. [66]

    If prediction is an abstention/refusal/uncertainty (e.g., ”none”, ”cannot tell”, ”I don’t know”): outputAbstain

  67. [67]

    If prediction is malformed, nonsense, unrelated, contradictory, or gives multiple options (e.g., ”A or B”, lists): outputWrong

  68. [68]

    If prediction and ground truth denote the same entity via exact match or direct synonym: outputSpecific

  69. [69]

    • if the parent is broad/coarse (e.g., animal for dog): outputGeneric

    Ifpredictionis aparent categoryofground truth: • if the parent is close (e.g., genus for species): outputLess Specific. • if the parent is broad/coarse (e.g., animal for dog): outputGeneric

  70. [70]

    Ifpredictionis achild/subtype/instanceofground truth: outputMore Specific

  71. [71]

    ground_truth

    Otherwise: outputWrong. Input Format: {"ground_truth": "<the_ground_truth_label>", "prediction": "<the_vlm_prediction>"} Output Format: A single word from the allowed categories. Prompt: Apply the decision procedure to classify the following JSON object. Output exactly one category word. INPUT: %s Figure 16. Generated Prompt for the LLM-as-a-judge verifie...