Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

Chen-Wei Xie; Hong-Tao Yu; Serge Belongie; Xiu-Shen Wei; Yuxin Peng

arxiv: 2606.19053 · v1 · pith:5KGSXI3Vnew · submitted 2026-06-17 · 💻 cs.CV

Pith reviewed 2026-06-26 21:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords large vision-language models · fine-grained recognition · benchmark evaluation · semantic grounding · modality alignment · visual discriminability · multimodal perception

The pith

Current large vision-language models remain inadequate fine-grained recognizers due to intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FG-BMK, a benchmark with 1.01 million questions on 0.28 million images spanning common and specialized domains, to test large vision-language models on fine-grained image tasks. It combines human-oriented dialogue questions with machine-oriented feature tests to trace whether failures come from weak visual features, poor image-to-word connections, or missing category knowledge. Experiments across representative models show the models fall short overall, with problems arising from four linked issues in visual representations, semantic grounding, modality alignment, and category-level knowledge. The work also studies how training choices and input perturbations influence results to inform better data and model development.
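
To make the human-oriented side concrete, here is a minimal sketch of how replies to closed-form true/false questions could be scored per category. The record keys (`category`, `gold`, `reply`) and the answer-normalization rule are assumptions made for illustration; they are not taken from the benchmark's released format.

```python
from collections import defaultdict


def normalize_answer(raw: str) -> str:
    """Map a free-form reply to 'true', 'false', or 'unknown' (assumed rule)."""
    text = raw.strip().lower()
    first = text.split()[0].strip(".,!:;") if text else ""
    if first in {"yes", "true"}:
        return "true"
    if first in {"no", "false"}:
        return "false"
    return "unknown"


def per_category_accuracy(records):
    """Accuracy per category; 'unknown' replies count as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if normalize_answer(rec["reply"]) == rec["gold"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Hypothetical records, for illustration only.
records = [
    {"category": "Laysan Albatross", "gold": "true", "reply": "Yes, it is."},
    {"category": "Laysan Albatross", "gold": "false", "reply": "Yes."},
    {"category": "Sooty Albatross", "gold": "false", "reply": "No, this is a different species."},
]
accuracy = per_category_accuracy(records)
print(accuracy)
```

Reporting accuracy per category rather than as one global number is what would let uneven category knowledge show up at all.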

Core claim

Through the FG-BMK benchmark, which contains 1.01 million questions and 0.28 million images, experiments on a diverse set of LVLMs show that current models remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge.

What carries the argument

FG-BMK benchmark that jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms to diagnose specific failure sources.
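
The machine-oriented side can be pictured as a test on frozen visual features alone, with no language model in the loop. The sketch below computes leave-one-out Recall@1 with cosine similarity. The choice of metric and the toy vectors are assumptions for illustration; the paper's exact retrieval and classification protocols are not described on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_1(features, labels):
    """Fraction of queries whose nearest other item shares their label."""
    hits = 0
    for i, query in enumerate(features):
        best_j = max(
            (j for j in range(len(features)) if j != i),
            key=lambda j: cosine(query, features[j]),
        )
        hits += labels[best_j] == labels[i]
    return hits / len(features)


# Toy 2-D "features": two tight clusters, one per class.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["a", "a", "b", "b"]
score = recall_at_1(features, labels)
print(score)
```

A high score here combined with a low dialogue-level score on the same categories would point away from the visual encoder, which is the kind of contrast the benchmark is built to expose.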

If this is right

Training design factors can be adjusted to improve fine-grained capabilities in LVLMs.
Visual and linguistic perturbations produce measurable effects on LVLM predictions.
Diagnostic insights from the benchmark guide future data construction and model design for more reliable fine-grained visual performance.
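
One simple way to quantify the perturbation effects mentioned above is a flip rate: the share of questions whose predicted answer changes once the image or the prompt is perturbed. This is a sketch under assumed inputs (aligned lists of predictions); it is not the paper's stated metric.

```python
def flip_rate(clean_preds, perturbed_preds):
    """Share of items whose prediction changes under perturbation."""
    if len(clean_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must be aligned")
    if not clean_preds:
        return 0.0
    flips = sum(c != p for c, p in zip(clean_preds, perturbed_preds))
    return flips / len(clean_preds)


def accuracy(preds, gold):
    """Plain accuracy over aligned prediction and gold lists."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


# Hypothetical multiple-choice answers, for illustration only.
gold = ["A", "B", "C", "D"]
clean = ["A", "B", "C", "A"]
perturbed = ["A", "C", "C", "D"]

rate = flip_rate(clean, perturbed)
clean_acc = accuracy(clean, gold)
perturbed_acc = accuracy(perturbed, gold)
print(rate, clean_acc, perturbed_acc)
```

In this toy case accuracy is the same before and after the perturbation while half the answers change, so a flip rate can reveal instability that an accuracy delta alone would hide.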

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the four bottlenecks dominate, then isolated fixes to one area such as visual encoders alone are unlikely to close the gap.
The same evaluation approach could be applied to test whether other multimodal systems exhibit similar linked failure patterns.
Extending the benchmark to video or real-time settings would reveal whether the identified issues persist beyond static images.

Load-bearing premise

The human-oriented and machine-oriented evaluation paradigms in FG-BMK accurately isolate and diagnose the specific failure sources without introducing measurement biases or overlooking other contributing factors.

What would settle it

A new LVLM architecture that scores high on FG-BMK while showing clear separation of the four bottlenecks would challenge the claim of inherent intertwined inadequacy.

Figures

Figures reproduced from arXiv: 2606.19053 by Chen-Wei Xie, Hong-Tao Yu, Serge Belongie, Xiu-Shen Wei, Yuxin Peng.

**Figure 1.** Overview of FG-BMK. FG-BMK evaluates LVLMs on fine-grained visual tasks from five diagnostic dimensions: hierarchical recognition, knowledge bias estimation, attribute recognition, image classification, and image retrieval. The teaser illustrates both the task formats and representative findings, showing that current LVLMs still suffer from degraded fine-level recognition, biased category knowledge, uneven…

**Figure 2.** Our proposed benchmark: The human-oriented evaluation tests the model’s ability to handle fine-grained visual queries…

**Figure 3.** Results of InternVL3 [16] on true/false and multiple…

**Figure 5.** Qualitative comparison between real images and fine-grained category-conditioned generated images.

**Figure 6.** Visualization of visual-text alignment on CUB under different settings.

**Figure 7.** Comparison of the original (blue dots) and fine-tuned…

**Figure 8.** Retrieval results of LVLM visual features on twelve…

**Figure 10.** Nemenyi statistical test results for fine-grained…

**Figure 12.** t-SNE visualization of visual features on Stanford Dogs…

**Figure 13.** Patch-level correspondence visualization on CUB datasets. Green boxes in the query images indicate the selected…

**Figure 14.** Classification results with different vision encoder sizes. Bars filled with different patterns represent different models,…

**Figure 15.** Classification results of LVLM visual features on fine-grained datasets. “Single” denotes accuracy from training on a…

**Figure 16.** Question templates for each task in human-oriented evaluation.

**Figure 17.** Results of GPT-5.4 [1], GPT-4o [1], Gemini-3.5-flash [42], Gemini-2.0-flash [42], Qwen2.5-VL [39], LLaVA [4]…

**Figure 18.** Knowledge bias estimation results of two closed-source models. True/false question accuracy for each category is…

**Figure 19.** Knowledge bias estimation results of two closed-source models. True/false question accuracy for each category is…

**Figure 20.** Comparison of the original and fine-tuned Qwen2.5-…

**Figure 21.** Qualitative analysis of granularity inconsistencies in LVLMs’ alignment data and a constructed sample of properly…

**Figure 22.** Classification results of LVLM visual features on fine-grained datasets. “Single” denotes accuracy from training on a…

**Figure 23.** Visualization of aligned visual features and category text embeddings under different alignment settings. Fine-grained…

**Figure 24.** t-SNE visualization of visual features on CUB-200-2011 and Stanford Dogs. Features learned with contrastive paradigms (e.g., EVA-CLIP and DINOv2) form more compact and better-separated class clusters than those learned with reconstruction- or generation-based paradigms (e.g., BEiT-3 and Qwen-VL), indicating stronger fine-grained discriminability in the embedding space…

**Figure 25.** Patch-level correspondence analysis on fine-grained bird images. Given selected query patches, contrastive features…

Original abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

FG-BMK is a large new benchmark that flags clear weaknesses in current LVLMs on fine-grained tasks, but the diagnostic attribution to four specific bottlenecks rests on unverified separation of factors.

The main point is that this paper releases FG-BMK, a benchmark with 1.01 million questions and 0.28 million images, and uses it to show that LVLMs still fall short on fine-grained recognition because of problems in visual representations, semantic grounding, modality alignment, and category knowledge.

What is new is the joint evaluation of dialogue-level semantics and feature-level discriminability through separate human-oriented and machine-oriented setups, plus the coverage of both everyday and specialized domains. The experiments cover multiple models, look at training choices, and test how visual and language perturbations change outputs. The scale and the open code stand out as practical additions to the evaluation literature.

The work is useful for mapping where models break on detail-heavy tasks. The abstract makes a direct case that existing benchmarks miss this layer, and the perturbation analysis adds a concrete angle on robustness.

The soft spot is the diagnostic part. The claim that failures come from those four intertwined bottlenecks assumes the two paradigms cleanly separate them. Fine-grained questions naturally mix visual detail, labels, and knowledge, so the observed patterns could come from how the questions were written rather than independent sources. No details appear on data construction, category validation, or ablations that fix all but one factor. That leaves the attribution open to the concern raised in the stress test.

The paper is aimed at groups building or testing LVLMs for applications that need precise visual discrimination. Readers who care about benchmark design or model limitations will find the scale and the failure patterns worth looking at.

It should go to peer review. The benchmark size and the focus on a known gap make it worth referee scrutiny, even if the methods section will need close checking on controls and validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces FG-BMK, a benchmark with 1.01 million questions across 0.28 million images spanning common and specialized domains. It jointly evaluates LVLMs via human-oriented (dialogue-level semantic recognition) and machine-oriented (feature-level visual discriminability) paradigms to diagnose failures attributable to visual representations, semantic grounding, modality alignment, or category-level knowledge. Experiments on representative LVLMs conclude that current models remain inadequate fine-grained recognizers due to these intertwined bottlenecks; additional analyses cover training design factors and effects of visual/linguistic perturbations. Code is open-sourced.

Significance. If the diagnostic attributions are validated, the work supplies concrete guidance for improving LVLMs on fine-grained tasks and highlights data-construction priorities. The scale of FG-BMK and explicit open-sourcing of code and benchmark constitute clear strengths for reproducibility.

major comments (2)

[Abstract] Abstract: The central claim attributes LVLM failures to four specific intertwined bottlenecks and states that the dual paradigms 'enable diagnostic analysis' of their individual contributions. However, no ablations, controls, or isolation procedures are described that hold all but one factor fixed (e.g., varying only visual representation quality while fixing semantic labels and knowledge). Fine-grained discrimination tasks inherently couple these elements, so observed failures may reflect question design rather than separable sources; this directly undermines the attribution in the strongest claim.
[Abstract] Abstract (data-construction paragraph): The manuscript reports extensive experiments yet supplies no details on question generation, statistical controls for category balance, inter-annotator validation of diagnostic labels, or safeguards against measurement bias in the 1.01 M questions. Without these, it is impossible to verify that the reported bottlenecks are not artifacts of the benchmark construction itself.
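
As an illustration of the kind of category-balance control the second major comment asks for, the sketch below summarizes how evenly questions are spread over categories. The two statistics (max/min imbalance ratio and normalized entropy) are one possible choice made here, not something the paper or the referee prescribes, and the category names are hypothetical.

```python
import math
from collections import Counter


def balance_report(categories):
    """Summarize how evenly items are spread across categories."""
    counts = Counter(categories)
    if not counts:
        raise ValueError("need at least one item")
    total = sum(counts.values())
    if len(counts) > 1:
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        normalized = entropy / math.log(len(counts))
    else:
        normalized = 1.0  # a single category is trivially "balanced"
    return {
        "num_categories": len(counts),
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
        "normalized_entropy": normalized,
    }


# Hypothetical question labels: nine questions on one category, one on another.
report = balance_report(["warbler"] * 9 + ["sparrow"])
print(report)
```

A normalized entropy near 1.0 and an imbalance ratio near 1.0 indicate an even spread; values far from that would suggest that per-category results should be read alongside the question counts behind them.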

minor comments (2)

[Abstract] Abstract: The enabling clause lists three diagnostic targets (visual representations, visual-to-semantic grounding, fine-grained knowledge) while the findings paragraph lists four (adding modality alignment and category-level knowledge). Standardize the enumerated set for consistency.
[Experimental setup] The open-source link is provided, but the manuscript should include a brief reproducibility checklist (e.g., exact model versions, prompt templates, and hardware) in the experimental section to match the scale claimed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our diagnostic claims and benchmark construction details. We address each major comment below and outline planned revisions to improve clarity and rigor.

Referee: [Abstract] Abstract: The central claim attributes LVLM failures to four specific intertwined bottlenecks and states that the dual paradigms 'enable diagnostic analysis' of their individual contributions. However, no ablations, controls, or isolation procedures are described that hold all but one factor fixed (e.g., varying only visual representation quality while fixing semantic labels and knowledge). Fine-grained discrimination tasks inherently couple these elements, so observed failures may reflect question design rather than separable sources; this directly undermines the attribution in the strongest claim.

Authors: We agree that full isolation of the four bottlenecks through ablations holding all but one factor fixed is inherently difficult, given the coupled nature of fine-grained tasks. Our dual paradigms provide diagnostic value by contrasting human-oriented dialogue-level semantic recognition (probing grounding, alignment, and knowledge) against machine-oriented feature-level visual discriminability (probing representations). This comparative design reveals intertwined contributions without claiming complete separability. We will revise the abstract to more precisely articulate the diagnostic scope and limitations of the paradigms, and we will add supporting comparative analyses in the experiments section.

revision: partial

Referee: [Abstract] Abstract (data-construction paragraph): The manuscript reports extensive experiments yet supplies no details on question generation, statistical controls for category balance, inter-annotator validation of diagnostic labels, or safeguards against measurement bias in the 1.01 M questions. Without these, it is impossible to verify that the reported bottlenecks are not artifacts of the benchmark construction itself.

Authors: We agree that explicit details on these aspects are essential to rule out construction artifacts. The full manuscript (Section 3) describes the multi-stage question generation pipeline, category balancing procedures, inter-annotator agreement for diagnostic labels, and bias mitigation steps including expert review for specialized domains. To address the concern, we will expand the abstract's data-construction description and include a concise summary of these controls in the main text, along with a reference to the open-sourced generation and validation code.

revision: yes
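
The comparative design described in the first response can be pictured as a per-category contrast between the two paradigms. The sketch below is an assumption-laden illustration: the thresholds `low` and `gap`, the category names, and the three coarse labels are all hypothetical, and nothing here is the authors' actual procedure.

```python
def diagnose(feature_acc, dialogue_acc, low=0.5, gap=0.2):
    """Assign a coarse label per category; thresholds are illustrative only."""
    labels = {}
    for cat in feature_acc.keys() & dialogue_acc.keys():
        f, d = feature_acc[cat], dialogue_acc[cat]
        if f < low:
            # Features alone already fail to separate the category.
            labels[cat] = "weak visual features"
        elif f - d >= gap:
            # Features separate the category, but the dialogue answer is lost.
            labels[cat] = "loss after the encoder"
        else:
            labels[cat] = "no clear gap"
    return labels


# Hypothetical per-category scores from the two paradigms.
feature_acc = {"bird_a": 0.9, "bird_b": 0.3, "bird_c": 0.8}
dialogue_acc = {"bird_a": 0.4, "bird_b": 0.2, "bird_c": 0.75}
result = diagnose(feature_acc, dialogue_acc)
print(result)
```

Note that the "loss after the encoder" label cannot tell grounding, alignment, and knowledge apart, which is exactly the limit on separability that the referee raises and the authors concede.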

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or self-referential reductions

The paper introduces FG-BMK as a new benchmark with 1.01M questions and evaluates LVLMs experimentally to diagnose failure modes. The central claims rest on observed performance gaps across human-oriented and machine-oriented paradigms rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs. No equations appear in the abstract or described methodology, and the attribution of intertwined bottlenecks follows directly from the benchmark results without circular redefinition. This is a standard empirical benchmarking study whose conclusions are falsifiable by external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical benchmark built on existing image domains and model evaluations.

Reference graph

Works this paper leans on

63 extracted references · 12 linked inside Pith

[1]

GPT-4 technical report,

OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774

Pith/arXiv arXiv 2023
[2]

Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,” 2023, arXiv:2308.12966

Pith/arXiv arXiv 2023
[3]

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 24185–24198

2024
[4]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 26296–26306

2024
[5]

LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models,

P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y. Qiao, and P. Luo, “LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1877–1893, 2025

2025
[6]

MMBench: Is your multi-modal model an all-around player?

L. Yuan, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin, “MMBench: Is your multi-modal model an all-around player?” in Proc. Eur. Conf. Comp. Vis., 2024, pp. 216–233

2024
[7]

DocVQA: A dataset for vqa on document images,

M. Mathew, D. Karatzas, and C. Jawahar, “DocVQA: A dataset for vqa on document images,” in Proc. Winter Conf. Applications of Comp. Vis., 2021, pp. 2200–2209

2021
[8]

GQA: A new dataset for real-world visual reasoning and compositional question answering,

D. A. Hudson and C. D. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019, pp. 6700–6709

2019
[9]

African or european swallow? benchmarking large vision-language models for fine-grained object classification,

G. Geigle, R. Timofte, and G. Glavaš, “African or european swallow? benchmarking large vision-language models for fine-grained object classification,” in Proc. Conf. Empirical Methods in Natural Language Processing, 2024, pp. 2653–2669

2024
[10]

Why are visually-grounded language models bad at image classification?

Y. Zhang, A. Unell, X. Wang, D. Ghosh, Y. Su, L. Schmidt, and S. Yeung-Levy, “Why are visually-grounded language models bad at image classification?” in Advances in Neural Inf. Process. Syst., 2024, pp. 51727–51753

2024
[11]

Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck,

Y. Tan, Y. Qing, and B. Gong, “Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck,” 2025, arXiv:2505.24840

arXiv 2025
[12]

Fine-grained image analysis with deep learning: A survey,

X.-S. Wei, Y.-Z. Song, O. M. Aodha, J. Wu, Y. Peng, J. Tang, J. Yang, and S. Belongie, “Fine-grained image analysis with deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 8927–8948, 2022

2022
[13]

Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation,

H.-T. Yu, Y. Peng, S. Belongie, and X.-S. Wei, “Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation,” in Proc. Int. Conf. Learn. Representations, 2026

2026
[14]

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 12888–12900

2022
[15]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 19730–19742

2023
[16]

InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models,

J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, Y. Duan, H. Tian, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...

Pith/arXiv arXiv 2025
[17]

Image as a foreign language: BEiT pretraining for vision and vision-language tasks,

W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign language: BEiT pretraining for vision and vision-language tasks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2023, pp. 19175–19186

2023
[18]

BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset,

J. Chen, Z. Xu, X. Pan, Y. Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, L. Xue, C. Xiong, and R. Xu, “BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset,” 2025, arXiv:2505.09568

Pith/arXiv arXiv 2025
[19]

UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation,

B. Lin, Z. Li, X. Cheng, Y. Niu, Y. Ye, X. He, S. Yuan, W. Yu, S. Wang, Y. Ge et al., “UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation,” 2025, arXiv:2506.03147

Pith/arXiv arXiv 2025
[20]

Emerging properties in unified multimodal pretraining,

C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, G. Shi, and H. Fan, “Emerging properties in unified multimodal pretraining,” 2025, arXiv:2505.14683

Pith/arXiv arXiv 2025
[21]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning,

A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “ChartQA: A benchmark for question answering about charts with visual and logical reasoning,” in Proc. Conf. Association for Computational Linguistics, 2022, pp. 2263–2279

2022
[22]

Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness,

Z. Liu, C.-W. Xie, B. Wen, F. Yu, P. Li, B. Zhang, N. Yang, Z. Gao, Y. Zheng, and H. Xie, “Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness,” pp. 0–11, 2026

2026
[23]

OCRBench: On the hidden mystery of ocr in large multimodal models,

Y. Liu, Z. Li, M. Huang, B. Yang, W. Yu, C. Li, X.-C. Yin, C.-L. Liu, L. Jin, and X. Bai, “OCRBench: On the hidden mystery of ocr in large multimodal models,” Science China Information Sciences, vol. 67, no. 12, 2024

2024
[24]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” in Proc. Int. Conf. Learn. Representations, 2024

2024
[25]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

X. Yue, Y. Ni, T. Zheng, K. Zhang, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen, “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024, pp. 9556–9567

2024
[26]

Towards deep learning models resistant to adversarial attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int. Conf. Learn. Representations, 2018

2018
[27]

Dual attention networks for few-shot fine-grained recognition,

S.-L. Xu, F. Zhang, X.-S. Wei, and J. Wang, “Dual attention networks for few-shot fine-grained recognition,” in Proc. Conf. AAAI, 2022, pp. 2911–2919

2022
[28]

MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities,

X.-S. Wei, H.-T. Yu, A. Xu, F. Zhang, and Y. Peng, “MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities,” IEEE Trans. Image Process., vol. 33, pp. 3456–3469, 2024

2024
[29]

FSCIL-EACA: Few-Shot Class-Incremental learning network based on embedding augmentation and classifier adaptation for image classification,

R. Zhang, H. E, and M. Song, “FSCIL-EACA: Few-Shot Class-Incremental learning network based on embedding augmentation and classifier adaptation for image classification,” Chinese J. Electron., vol. 33, no. 1, pp. 139–152, 2024

2024
[30]

FineCLIP: Self-distilled region-based clip for better fine-grained understanding,

D. Jing, X. He, Y. Luo, N. Fei, G. Yang, W. Wei, H. Zhao, and Z. Lu, “FineCLIP: Self-distilled region-based clip for better fine-grained understanding,” in Advances in Neural Inf. Process. Syst., 2024, pp. 27896–27918

2024
[31]

Expression complementary disentanglement network for facial expression recognition,

S. Wang, H. Shuai, L. Zhu, and Q. Liu, “Expression complementary disentanglement network for facial expression recognition,” Chinese J. Electron., vol. 33, no. 3, pp. 742–752, 2024

2024
[32]

Weighted linear loss large margin distribution machine for pattern classification,

L. Liu, M. Chu, R. Gong, L. Liu, and Y. Yang, “Weighted linear loss large margin distribution machine for pattern classification,” Chinese J. Electron., vol. 33, no. 3, pp. 753–765, 2024

2024
[33]

FGM-SPCL: Open-set recognition network for medical images based on fine-grained data mixture and spatial position constraint loss,

R. Zhang, H. E, L. Yuan, Y. Wang, L. Wang, and M. Song, “FGM-SPCL: Open-set recognition network for medical images based on fine-grained data mixture and spatial position constraint loss,” Chinese J. Electron., vol. 33, no. 4, pp. 1023–1033, 2024

2024
[34]

Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding,

Y. Jing, R. Zhang, K. Liang, Y. Li, Z. He, Z. Ma, and J. Guo, “Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding,” in Advances in Neural Inf. Process. Syst., 2024, pp. 23457–23469

2024
[35]

SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval,

Y. Shen, X. Sun, X.-S. Wei, Q.-Y. Jiang, and J. Yang, “SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval,” in Proc. Eur. Conf. Comp. Vis., 2022, pp. 531–548

2022
[36]

RPC: A large-scale and fine-grained retail product checkout dataset,

X.-S. Wei, Q. Cui, L. Yang, P. Wang, L. Liu, and J. Yang, “RPC: A large-scale and fine-grained retail product checkout dataset,” Science China Information Sciences, vol. 65, no. 9, p. 197101, 2022

2022
[37]

DINOv2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features withou...

Pith/arXiv arXiv 2023
[38]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763

2021
[39]

Qwen2.5-vl technical report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” 2025, arXiv:2502.13923

Pith/arXiv arXiv 2025
[40]

EVA-CLIP: Improved training techniques for clip at scale,

Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao, “EVA-CLIP: Improved training techniques for clip at scale,” 2023, arXiv:2303.15389

Pith/arXiv arXiv 2023
[41]

CoCa: Contrastive captioners are image-text foundation models,

J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “CoCa: Contrastive captioners are image-text foundation models,” Transactions on Machine Learning Research, 2022

2022
[42] Gemini Team, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024, arXiv:2403.05530.
[43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” Technical report, California Institute of Technology, 2011.
[44] Q. Diao, Y. Jiang, B. Wen, J. Sun, and Z. Yuan, “MetaFormer: A unified meta framework for fine-grained recognition,” 2022, arXiv:2203.02751.
[45] A. Bera, Z. Wharton, Y. Liu, N. Bessis, and A. Behera, “SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization,” IEEE Trans. Image Process., vol. 31, pp. 6017–6031, 2022.
[46] D. Liu, “Progressive multi-task anti-noise learning and distilling frameworks for fine-grained vehicle recognition,” IEEE Trans. Intell. Transp. Syst., vol. 25, no. 9, pp. 10 667–10 678, 2024.
[47] A. Behera, Z. Wharton, P. R. Hewage, and A. Bera, “Context-aware attentional pooling (CAP) for fine-grained visual classification,” in Proc. Conf. AAAI, 2021, pp. 929–937.
[48] A. Sikdar, Y. Liu, S. Kedarisetty, Y. Zhao, A. Ahmed, and A. Behera, “Interweaving insights: High-order feature interaction for fine-grained visual recognition,” in Proc. IEEE Int. Conf. Comp. Vis., 2024, pp. 1755–1779.
[49] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[50] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 248–255.
[52] Tianchi, “Bottled wine defect detection data set,” 2021. [Online]. Available: https://tianchi.aliyun.com/dataset/dataDetail?dataId=110147
[53] Z.-Z. Wu, S.-H. Wan, X.-F. Wang, M. Tan, L. Zou, X.-L. Li, and Y. Chen, “A benchmark data set for aircraft type recognition from remote sensing images,” Applied Soft Computing, vol. 89, pp. 106 132–106 142, 2020.
[54] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016, pp. 1096–1104.
[55] R. Daneshjou, M. Yuksekgonul, Z. R. Cai, R. Novoa, and J. Y. Zou, “SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis,” in Advances in Neural Inf. Process. Syst., 2022, pp. 18 157–18 167.
[56] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. IEEE Int. Conf. Comp. Vis., 2008, pp. 722–729.
[57] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in Proc. Eur. Conf. Comp. Vis., 2014, pp. 446–461.
[58] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” 2013, arXiv:1306.5151.
[59] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in CVPR Workshop on Fine-Grained Visual Categorization, 2011, pp. 806–813.
[60] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in Proc. IEEE Int. Conf. Comp. Vis., 2013, pp. 554–561.
[61] S. Hou, Y. Feng, and Z. Wang, “VegFru: A domain-specific dataset for fine-grained visual categorization,” in Proc. IEEE Int. Conf. Comp. Vis., 2017, pp. 541–549.
[62] Y. Bai, Y. Chen, W. Yu, L. Wang, and W. Zhang, “Products-10K: A large-scale product recognition dataset,” 2020, arXiv:2008.10545.
[63] G. Van Horn, E. Cole, S. Beery, K. Wilber, S. Belongie, and O. Mac Aodha, “Benchmarking representation learning for natural world image collections,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021, pp. 12 884–12 893.

SUBMITTED TO IEEE TPAMI 18 Supplementary Material of Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluati...
