pith. sign in

arxiv: 2606.19053 · v1 · pith:5KGSXI3Vnew · submitted 2026-06-17 · 💻 cs.CV

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

Pith reviewed 2026-06-26 21:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords large vision-language modelsfine-grained recognitionbenchmark evaluationsemantic groundingmodality alignmentvisual discriminabilitymultimodal perception
0
0 comments X

The pith

Current large vision-language models remain inadequate fine-grained recognizers due to intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FG-BMK, a benchmark with 1.01 million questions on 0.28 million images spanning common and specialized domains, to test large vision-language models on fine-grained image tasks. It combines human-oriented dialogue questions with machine-oriented feature tests to trace whether failures come from weak visual features, poor image-to-word connections, or missing category knowledge. Experiments across representative models show the models fall short overall, with problems arising from four linked issues in visual representations, semantic grounding, modality alignment, and category-level knowledge. The work also studies how training choices and input perturbations influence results to inform better data and model development.

Core claim

Through the FG-BMK benchmark, which contains 1.01 million questions and 0.28 million images, experiments on a diverse set of LVLMs show that current models remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge.

What carries the argument

FG-BMK benchmark that jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms to diagnose specific failure sources.

If this is right

  • Training design factors can be adjusted to improve fine-grained capabilities in LVLMs.
  • Visual and linguistic perturbations produce measurable effects on LVLM predictions.
  • Diagnostic insights from the benchmark guide future data construction and model design for more reliable fine-grained visual performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the four bottlenecks dominate, then isolated fixes to one area such as visual encoders alone are unlikely to close the gap.
  • The same evaluation approach could be applied to test whether other multimodal systems exhibit similar linked failure patterns.
  • Extending the benchmark to video or real-time settings would reveal whether the identified issues persist beyond static images.

Load-bearing premise

The human-oriented and machine-oriented evaluation paradigms in FG-BMK accurately isolate and diagnose the specific failure sources without introducing measurement biases or overlooking other contributing factors.

What would settle it

A new LVLM architecture that scores high on FG-BMK while showing clear separation of the four bottlenecks would challenge the claim of inherent intertwined inadequacy.

Figures

Figures reproduced from arXiv: 2606.19053 by Chen-Wei Xie, Hong-Tao Yu, Serge Belongie, Xiu-Shen Wei, Yuxin Peng.

Figure 1
Figure 1. Figure 1: Overview of FG-BMK. FG-BMK evaluates LVLMs on fine-grained visual tasks from five diagnostic dimensions: hierarchical recognition, knowledge bias estimation, attribute recognition, image classification, and image retrieval. The teaser illustrates both the task formats and representative findings, showing that current LVLMs still suffer from degraded fine-level recognition, biased category knowledge, uneven… view at source ↗
Figure 2
Figure 2. Figure 2: Our proposed benchmark: The human-oriented evaluation tests the model’s ability to handle fine-grained visual queries [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Results of InternVL3 [16] on true/false and multiple [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison between real images and fine-grained category-conditioned generated images. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of visual-text alignment on CUB under different settings. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of the original (blue dots) and fine-tuned [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Retrieval results of LVLM visual features on twelve [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Nemenyi statistical test results for fine-grained [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: t-SNE visualization of visual features on Stanford Dogs [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Patch-level correspondence visualization on CUB datasets. Green boxes in the query images indicate the selected [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Classification results with different vision encoder sizes. Bars filled with different patterns represent different models, [PITH_FULL_IMAGE:figures/full_fig_p013_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Classification results of LVLM visual features on fine-grained datasets. “Single” denotes accuracy from training on a [PITH_FULL_IMAGE:figures/full_fig_p014_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Question templates for each task in huamn-oriented evaluation. [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Results of GPT-5.4 [1], GPT-4o [1], Gemini-3.5-flash [42], Gemini-2.0-flash [42], Qwen2.5-VL [39], LLaVA [4] [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Knowledge bias estimation results of two closed-source models. True/false question accuracy for each category is [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Knowledge bias estimation results of two closed-source models. True/false question accuracy for each category is [PITH_FULL_IMAGE:figures/full_fig_p025_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Comparison of the original and fine-tuned Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p026_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Qualitative analysis of granularity inconsistencies in LVLMs’ alignment data and a constructed sample of properly [PITH_FULL_IMAGE:figures/full_fig_p027_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Classification results of LVLM visual features on fine-grained datasets. “Single” denotes accuracy from training on a [PITH_FULL_IMAGE:figures/full_fig_p028_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Visualization of aligned visual features and category text embeddings under different alignment settings. Fine-grained [PITH_FULL_IMAGE:figures/full_fig_p029_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: t-SNE visualization of visual features on CUB-200-2011 and Stanford Dogs. Features learned with contrastive paradigms (e.g., EVA-CLIP and DINOv2) form more compact and better-separated class clusters than those learned with reconstruction- or generation-based paradigms (e.g., BEiT-3 and Qwen-VL), indicating stronger fine-grained discriminability in the embedding space [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 25
Figure 25. Figure 25: Patch-level correspondence analysis on fine-grained bird images. Given selected query patches, contrastive features [PITH_FULL_IMAGE:figures/full_fig_p030_25.png] view at source ↗
read the original abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks-fundamental to computer vision-remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FG-BMK, a benchmark with 1.01 million questions across 0.28 million images spanning common and specialized domains. It jointly evaluates LVLMs via human-oriented (dialogue-level semantic recognition) and machine-oriented (feature-level visual discriminability) paradigms to diagnose failures attributable to visual representations, semantic grounding, modality alignment, or category-level knowledge. Experiments on representative LVLMs conclude that current models remain inadequate fine-grained recognizers due to these intertwined bottlenecks; additional analyses cover training design factors and effects of visual/linguistic perturbations. Code is open-sourced.

Significance. If the diagnostic attributions are validated, the work supplies concrete guidance for improving LVLMs on fine-grained tasks and highlights data-construction priorities. The scale of FG-BMK and explicit open-sourcing of code and benchmark constitute clear strengths for reproducibility.

major comments (2)
  1. [Abstract] Abstract: The central claim attributes LVLM failures to four specific intertwined bottlenecks and states that the dual paradigms 'enable diagnostic analysis' of their individual contributions. However, no ablations, controls, or isolation procedures are described that hold all but one factor fixed (e.g., varying only visual representation quality while fixing semantic labels and knowledge). Fine-grained discrimination tasks inherently couple these elements, so observed failures may reflect question design rather than separable sources; this directly undermines the attribution in the strongest claim.
  2. [Abstract] Abstract (data-construction paragraph): The manuscript reports extensive experiments yet supplies no details on question generation, statistical controls for category balance, inter-annotator validation of diagnostic labels, or safeguards against measurement bias in the 1.01 M questions. Without these, it is impossible to verify that the reported bottlenecks are not artifacts of the benchmark construction itself.
minor comments (2)
  1. [Abstract] Abstract: The enabling clause lists three diagnostic targets (visual representations, visual-to-semantic grounding, fine-grained knowledge) while the findings paragraph lists four (adding modality alignment and category-level knowledge). Standardize the enumerated set for consistency.
  2. [Experimental setup] The open-source link is provided, but the manuscript should include a brief reproducibility checklist (e.g., exact model versions, prompt templates, and hardware) in the experimental section to match the scale claimed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our diagnostic claims and benchmark construction details. We address each major comment below and outline planned revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim attributes LVLM failures to four specific intertwined bottlenecks and states that the dual paradigms 'enable diagnostic analysis' of their individual contributions. However, no ablations, controls, or isolation procedures are described that hold all but one factor fixed (e.g., varying only visual representation quality while fixing semantic labels and knowledge). Fine-grained discrimination tasks inherently couple these elements, so observed failures may reflect question design rather than separable sources; this directly undermines the attribution in the strongest claim.

    Authors: We agree that full isolation of the four bottlenecks through ablations holding all but one factor fixed is inherently difficult, given the coupled nature of fine-grained tasks. Our dual paradigms provide diagnostic value by contrasting human-oriented dialogue-level semantic recognition (probing grounding, alignment, and knowledge) against machine-oriented feature-level visual discriminability (probing representations). This comparative design reveals intertwined contributions without claiming complete separability. We will revise the abstract to more precisely articulate the diagnostic scope and limitations of the paradigms, and we will add supporting comparative analyses in the experiments section. revision: partial

  2. Referee: [Abstract] Abstract (data-construction paragraph): The manuscript reports extensive experiments yet supplies no details on question generation, statistical controls for category balance, inter-annotator validation of diagnostic labels, or safeguards against measurement bias in the 1.01 M questions. Without these, it is impossible to verify that the reported bottlenecks are not artifacts of the benchmark construction itself.

    Authors: We agree that explicit details on these aspects are essential to rule out construction artifacts. The full manuscript (Section 3) describes the multi-stage question generation pipeline, category balancing procedures, inter-annotator agreement for diagnostic labels, and bias mitigation steps including expert review for specialized domains. To address the concern, we will expand the abstract's data-construction description and include a concise summary of these controls in the main text, along with a reference to the open-sourced generation and validation code. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or self-referential reductions

full rationale

The paper introduces FG-BMK as a new benchmark with 1.01M questions and evaluates LVLMs experimentally to diagnose failure modes. The central claims rest on observed performance gaps across human-oriented and machine-oriented paradigms rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs. No equations appear in the abstract or described methodology, and the attribution of intertwined bottlenecks follows directly from the benchmark results without circular redefinition. This is a standard empirical benchmarking study whose conclusions are falsifiable by external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical benchmark built on existing image domains and model evaluations.

pith-pipeline@v0.9.1-grok · 5797 in / 1009 out tokens · 21123 ms · 2026-06-26T21:34:57.064830+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 12 linked inside Pith

  1. [1]

    GPT-4 technical report,

    OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774

  2. [2]

    Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,” 2023, arXiv:2308.12966

  3. [3]

    InternVL: Scaling up vision foundation models and aligning for generic visual- linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. li, P. Luo, T. Lu, Y . Qiao, and J. Dai, “InternVL: Scaling up vision foundation models and aligning for generic visual- linguistic tasks,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 24 185–24 198

  4. [4]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 26 296–26 306. SUBMITTED TO IEEE TPAMI 16

  5. [5]

    LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models,

    P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y . Qiao, and P. Luo, “LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1877–1893, 2025

  6. [6]

    MMBench: Is your multi-modal model an all-around player?

    L. Yuan, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin, “MMBench: Is your multi-modal model an all-around player?” inProc. Eur. Conf. Comp. Vis., 2024, pp. 216–233

  7. [7]

    DocVQA: A dataset for vqa on document images,

    M. Mathew, D. Karatzas, and C. Jawahar, “DocVQA: A dataset for vqa on document images,” inProc. Winter Conf. Applications of Comp. Vis., 2021, pp. 2200–2209

  8. [8]

    GQA: A new dataset for real-world visual reasoning and compositional question answering,

    D. A. Hudson and C. D. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019, pp. 6700–6709

  9. [9]

    African or european swallow? benchmarking large vision-language models for fine-grained object classification,

    G. Geigle, R. Timofte, and G. Glava ˇs, “African or european swallow? benchmarking large vision-language models for fine-grained object classification,” inProc. Conf. Empirical Methods in Natural Language Processing, 2024, pp. 2653–2669

  10. [10]

    Why are visually-grounded language models bad at image classification?

    Y . Zhang, A. Unell, X. Wang, D. Ghosh, Y . Su, L. Schmidt, and S. Yeung-Levy, “Why are visually-grounded language models bad at image classification?” inAdvances in Neural Inf. Process. Syst., 2024, pp. 51 727–51 753

  11. [11]

    Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck,

    Y . Tan, Y . Qing, and B. Gong, “Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck,” 2025, arXiv:2505.24840

  12. [12]

    Fine-grained image analysis with deep learning: A survey,

    X.-S. Wei, Y .-Z. Song, O. M. Aodha, J. Wu, Y . Peng, J. Tang, J. Yang, and S. Belongie, “Fine-grained image analysis with deep learning: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 8927–8948, 2022

  13. [13]

    Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation,

    H.-T. Yu, Y . Peng, S. Belongie, and X.-S. Wei, “Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation,” inProc. Int. Conf. Learn. Representations, 2026

  14. [14]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” inProc. Int. Conf. Mach. Learn., 2022, pp. 12 888–12 900

  15. [15]

    BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” inProc. Int. Conf. Mach. Learn., 2023, pp. 19 730–19 742

  16. [16]

    InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models,

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, Y . Duan, H. Tian, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y . Cao, Y . Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y . He, T. Jiang, J. Luo, Y . Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y . Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...

  17. [17]

    Image as a foreign language: BEiT pretraining for vision and vision-language tasks,

    W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign language: BEiT pretraining for vision and vision-language tasks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2023, pp. 19 175–19 186

  18. [18]

    BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset,

    J. Chen, Z. Xu, X. Pan, Y . Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, L. Xue, C. Xiong, and R. Xu, “BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset,” 2025, arXiv:2505.09568

  19. [19]

    UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation,

    B. Lin, Z. Li, X. Cheng, Y . Niu, Y . Ye, X. He, S. Yuan, W. Yu, S. Wang, Y . Geet al., “UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation,” 2025, arXiv:2506.03147

  20. [20]

    Emerging properties in unified multimodal pretraining,

    C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, G. Shi, and H. Fan, “Emerging properties in unified multimodal pretraining,” 2025, arXiv:2505.14683

  21. [21]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning,

    A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “ChartQA: A benchmark for question answering about charts with visual and logical reasoning,” inProc. Conf. Association for Computational Linguistics, 2022, pp. 2263–2279

  22. [22]

    Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness,

    Z. Liu, C.-W. Xie, B. Wen, F. Yu, P. Li, B. Zhang, N. Yang, Z. Gao, Y . Zheng, and H. Xie, “Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness,” pp. 0–11, 2026

  23. [23]

    OCRBench: On the hidden mystery of ocr in large multimodal models,

    Y . Liu, Z. Li, M. Huang, B. Yang, W. Yu, C. Li, X.-C. Yin, C.-L. Liu, L. Jin, and X. Bai, “OCRBench: On the hidden mystery of ocr in large multimodal models,”Science China Information Sciences, vol. 67, no. 12, 2024

  24. [24]

    MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” inProc. Int. Conf. Learn. Representations, 2024

  25. [25]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

    X. Yue, Y . Ni, T. Zheng, K. Zhang, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y . Liu, W. Huang, H. Sun, Y . Su, and W. Chen, “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 9556–9567

  26. [26]

    Towards deep learning models resistant to adversarial attacks,

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. Int. Conf. Learn. Representations, 2018

  27. [27]

    Dual attention networks for few-shot fine-grained recognition,

    S.-L. Xu, F. Zhang, X.-S. Wei, and J. Wang, “Dual attention networks for few-shot fine-grained recognition,” inProc. Conf. AAAI, 2022, pp. 2911–2919

  28. [28]

    MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities,

    X.-S. Wei, H.-T. Yu, A. Xu, F. Zhang, and Y . Peng, “MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities,”IEEE Trans. Image Process., vol. 33, pp. 3456–3469, 2024

  29. [29]

    FSCIL-EACA: Few-Shot Class- Incremental learning network based on embedding augmentation and classifier adaptation for image classification,

    R. Zhang, H. E, and M. Song, “FSCIL-EACA: Few-Shot Class- Incremental learning network based on embedding augmentation and classifier adaptation for image classification,”Chinese J. Electron., vol. 33, no. 1, pp. 139–152, 2024

  30. [30]

    FineCLIP: Self-distilled region-based clip for better fine-grained understanding,

    D. Jing, X. He, Y . Luo, N. Fei, G. Yang, W. Wei, H. Zhao, and Z. Lu, “FineCLIP: Self-distilled region-based clip for better fine-grained understanding,” inAdvances in Neural Inf. Process. Syst., 2024, pp. 27 896–27 918

  31. [31]

    Expression complementary disentanglement network for facial expression recognition,

    S. Wang, H. Shuai, L. Zhu, and Q. Liu, “Expression complementary disentanglement network for facial expression recognition,”Chinese J. Electron., vol. 33, no. 3, pp. 742–752, 2024

  32. [32]

    Weighted linear loss large margin distribution machine for pattern classification,

    L. Liu, M. Chu, R. Gong, L. Liu, and Y . Yang, “Weighted linear loss large margin distribution machine for pattern classification,”Chinese J. Electron., vol. 33, no. 3, pp. 753–765, 2024

  33. [33]

    FGM-SPCL: Open-set recognition network for medical images based on fine-grained data mixture and spatial position constraint loss,

    R. Zhang, H. E, L. Yuan, Y . Wang, L. Wang, and M. Song, “FGM-SPCL: Open-set recognition network for medical images based on fine-grained data mixture and spatial position constraint loss,”Chinese J. Electron., vol. 33, no. 4, pp. 1023–1033, 2024

  34. [34]

    Animal- Bench: Benchmarking multimodal video models for animal-centric video understanding,

    Y . Jing, R. Zhang, K. Liang, Y . Li, Z. He, Z. Ma, and J. Guo, “Animal- Bench: Benchmarking multimodal video models for animal-centric video understanding,” inAdvances in Neural Inf. Process. Syst., 2024, pp. 23 457–23 469

  35. [35]

    SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval,

    Y . Shen, X. Sun, X.-S. Wei, Q.-Y . Jiang, and J. Yang, “SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval,” in Proc. Eur. Conf. Comp. Vis., 2022, pp. 531–548

  36. [36]

    RPC: A large-scale and fine-grained retail product checkout dataset,

    X.-S. Wei, Q. Cui, L. Yang, P. Wang, L. Liu, and J. Yang, “RPC: A large-scale and fine-grained retail product checkout dataset,”Science China. Information Sciences, vol. 65, no. 9, p. 197101, 2022

  37. [37]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y . Huang, H. Xu, V . Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features withou...

  38. [38]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskeverothers, “Learning transferable visual models from natural language supervision,” inProc. Int. Conf. Mach. Learn., 2021, pp. 8748– 8763

  39. [39]

    Qwen2.5-vl technical report,

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” 2025, arXiv:2502.13923

  40. [40]

    EV A-CLIP: Improved training techniques for clip at scale,

    Q. Sun, Y . Fang, L. Wu, X. Wang, and Y . Cao, “EV A-CLIP: Improved training techniques for clip at scale,” 2023, arXiv:2303.15389

  41. [41]

    CoCa: Contrastive captioners are image-text foundation models,

    J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “CoCa: Contrastive captioners are image-text foundation models,” Transactions on Machine Learning Research, 2022

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,

    G. Gemini Team, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024, arXiv:2403.05530

  43. [43]

    The Caltech- UCSD birds-200-2011 dataset,

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech- UCSD birds-200-2011 dataset,” Technical report, California Institute of Technology, 2011

  44. [44]

    MetaFormer: A unified meta framework for fine-grained recognition,

    Q. Diao, Y . Jiang, B. Wen, J. Sun, and Z. Yuan, “MetaFormer: A unified meta framework for fine-grained recognition,” 2022, arXiv:2203.02751

  45. [45]

    SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization,

    A. Bera, Z. Wharton, Y . Liu, N. Bessis, and A. Behera, “SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization,”IEEE Trans. Image Process., vol. 31, pp. 6017–6031, 2022

  46. [46]

    Progressive multi-task anti-noise learning and distilling frame- works for fine-grained vehicle recognition,

    D. Liu, “Progressive multi-task anti-noise learning and distilling frame- works for fine-grained vehicle recognition,”IEEE Trans. Intell. Transp. Syst., vol. 25, no. 9, pp. 10 667–10 678, 2024. SUBMITTED TO IEEE TPAMI 17

  47. [47]

    Context-aware attentional pooling (cap) for fine-grained visual classification,

    A. Behera, Z. Wharton, P. R. Hewage, and A. Bera, “Context-aware attentional pooling (cap) for fine-grained visual classification,” inProc. Conf. AAAI, 2021, pp. 929–937

  48. [48]

    Interweaving insights: High-order feature interaction for fine-grained visual recognition,

    A. Sikdar, Y . Liu, S. Kedarisetty, Y . Zhao, A. Ahmed, and A. Behera, “Interweaving insights: High-order feature interaction for fine-grained visual recognition,” inProc. IEEE Int. Conf. Comp. Vis., 2024, pp. 1755– 1779

  49. [49]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009

  50. [50]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929

  51. [51]

    ImageNet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 248–255

  52. [52]

    Bottled wine defect detection data set,

    Tianchi, “Bottled wine defect detection data set,” 2021. [Online]. Available: https://tianchi.aliyun.com/dataset/dataDetail?dataId=110147

  53. [53]

    A benchmark data set for aircraft type recognition from remote sensing images,

    Z.-Z. Wu, S.-H. Wan, X.-F. Wang, M. Tan, L. Zou, X.-L. Li, and Y . Chen, “A benchmark data set for aircraft type recognition from remote sensing images,”Applied Soft Computing, vol. 89, pp. 106 132–106 142, 2020

  54. [54]

    DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,

    Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016, pp. 1096–1104

  55. [55]

    SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis,

    R. Daneshjou, M. Yuksekgonul, Z. R. Cai, R. Novoa, and J. Y . Zou, “SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis,” inAdvances in Neural Inf. Process. Syst., 2022, pp. 18 157–18 167

  56. [56]

    Automated flower classification over a large number of classes,

    M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” inProc. IEEE Int. Conf. Comp. Vis., 2008, pp. 722–729

  57. [57]

    Food-101–mining discriminative components with random forests,

    L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” inProc. Eur. Conf. Comp. Vis., 2014, pp. 446–461

  58. [58]

    Fine-grained visual classification of aircraft,

    S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” 2013, arXiv:1306.5151

  59. [59]

    Novel dataset for fine-grained image categorization,

    A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” inCVPR Workshop on Fine-Grained Visual Categorization, 2011, pp. 806–813

  60. [60]

    3D object representations for fine-grained categorization,

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” inProc. IEEE Int. Conf. Comp. Vis., 2013, pp. 554–561

  61. [61]

    VegFru: A domain-specific dataset for fine-grained visual categorization,

    S. Hou, Y . Feng, and Z. Wang, “VegFru: A domain-specific dataset for fine-grained visual categorization,” inProc. IEEE Int. Conf. Comp. Vis., 2017, pp. 541–549

  62. [62]

    Products-10K: A large-scale product recognition dataset,

    Y . Bai, Y . Chen, W. Yu, L. Wang, and W. Zhang, “Products-10K: A large-scale product recognition dataset,” 2020, arXiv:2008.10545

  63. [63]

    Benchmarking representation learning for natural world image collections,

    G. Van Horn, E. Cole, S. Beery, K. Wilber, S. Belongie, and O. Mac Aodha, “Benchmarking representation learning for natural world image collections,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021, pp. 12 884–12 893. SUBMITTED TO IEEE TPAMI 18 Supplementary Material of Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluati...