pith. sign in

arxiv: 2506.09082 · v5 · submitted 2025-06-10 · 💻 cs.CV · cs.AI· cs.LG

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Pith reviewed 2026-05-19 11:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords AVA-BenchAtomic Visual AbilitiesVision Foundation ModelsVisual Question Answeringmodel evaluationability fingerprintsspatial understandingdepth estimation
0
0 comments X

The pith

AVA-Bench disentangles 14 atomic visual abilities to isolate exactly where vision foundation models succeed or fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional evaluations pair vision foundation models with language models and test them on broad visual question answering benchmarks, but data mismatches and the mixing of multiple skills make it unclear whether errors come from visual shortcomings or other factors. The paper introduces AVA-Bench to separate 14 foundational skills such as localization, depth estimation, and spatial understanding. It ensures that training and test data distributions match within each skill so that results reflect true ability rather than distribution shift. This produces distinct ability profiles for different models and shows that a much smaller language model can rank the vision models with comparable results while using far less computation. The approach aims to replace guesswork in model selection with targeted diagnosis of visual capabilities.

Core claim

By decoupling 14 Atomic Visual Abilities and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters, revealing distinctive ability fingerprints and demonstrating that a 0.5B LLM produces similar VFM rankings as a 7B LLM while cutting GPU hours by 8x.

What carries the argument

The decoupling of 14 Atomic Visual Abilities with per-ability matching of training and test data distributions, which isolates individual skills such as localization, depth estimation, and spatial understanding.

If this is right

  • Vision foundation models display unique ability fingerprints that allow precise identification of strengths and weaknesses.
  • A 0.5B parameter language model produces VFM rankings comparable to those from a 7B parameter model.
  • Evaluation of vision foundation models can be performed with 8 times fewer GPU hours.
  • Complex visual reasoning tasks can be diagnosed by examining performance on the individual atomic abilities rather than overall accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers could target training or fine-tuning specifically on the abilities where a given vision foundation model scores lowest.
  • The same decoupling approach could be applied to evaluate other types of foundation models beyond vision.
  • Complementary ability profiles from different models might be combined to handle tasks that require multiple skills.

Load-bearing premise

The 14 chosen abilities are independent and collectively sufficient to explain performance on complex visual reasoning tasks without significant overlap or missing interactions.

What would settle it

A finding that a model's accuracy on a composite visual reasoning task cannot be explained by the pattern of its scores across the 14 separate abilities would show the benchmark fails to isolate the sources of error.

Figures

Figures reproduced from arXiv: 2506.09082 by Arpita Chowdhury, Jiacheng Hou, Jihyung Kil, Lemeng Wang, Sooyoung Jeon, Wei-Lun Chao, Zheda Mai, Zihe Wang.

Figure 1
Figure 1. Figure 1: Vision foundation models (VFMs) trained with different data and objectives are evaluated on the proposed AVA-BENCH to assess their strengths and limitations across atomic visual abilities (AVAs). Abstract The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general￾purpose heads, followed by evaluation on broad Visual… view at source ↗
Figure 2
Figure 2. Figure 2: Visual Question Answering (VQA) often requires multiple atomic visual abilities (AVAs) [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: AVA-BENCH consists of 14 Atomic Visual Abilities (AVAs) that can be combined to address more complex visual reasoning tasks. role of language supervision in enhancing general visual capability; (2) For vision-centric AVAs such as localization, absolute depth estimation, and orientation, the SSL-based DINOv2 performs comparably or better than language-supervised counterparts; (3) Conversely, language-centri… view at source ↗
Figure 5
Figure 5. Figure 5: Overall statistics of AVA-BENCH. Another family of VFMs leverages image-text pairs. CLIP [73] uses contrastive learning to align image and textual descriptions, while SigLIP [89, 109] replaces the contrastive loss with a sigmoid one for more efficient training. Unlike the conventional language-guided VFMs that process images and text separately, AIMv2 [21] integrates visual and textual understanding into a… view at source ↗
Figure 6
Figure 6. Figure 6: Performance comparison of VFMs across all AVAs. (Left) Language-Supervised VFMs with DINOv2 as a reference. (Right) Other VFMs with the SigLIP-2 as a reference. 3 AVA-BENCH To address the blind spots discussed earlier, we introduce AVA-BENCH, the first systematic evaluation suite explicitly designed to disentangle 14 fundamental perceptual skills–Atomic Visual Abilities (AVAs)–for VFMs. We first introduce … view at source ↗
Figure 7
Figure 7. Figure 7: The ranks of VFMs across all AVAs. Detailed results in Appendix. Fine-grained. We curate 9K pairs from three diverse datasets: CUB-200-2011 [94] (birds), iNaturalist-2021 [92] (fungi, plants, animals), and FGVC-Aircraft [65] (objects). We randomly select 50 species across fungi, plants, and animals from iNaturalist [92] and 100 bird species from CUB, resulting in a total of 300 fine-grained classes. Object… view at source ↗
Figure 8
Figure 8. Figure 8: A much smaller 0.5B LLM achieves similar relative VFM rankings comparable to a 7B LLM, while dramatically reducing evaluation costs by approximately 8×. (PEFT) [64, 33, 91], specifically Low-Rank Adaptation (LoRA) [35], to mitigate potential overfitting. Subsequently, the fine-tuned model is evaluated on the corresponding AVA-specific test sets. 4.1 Is a Heavyweight LLM Evaluator Necessary? As discussed in… view at source ↗
Figure 9
Figure 9. Figure 9: Impact of bounding boxes on spatial reasoning performance. With bounding boxes, all VFMs perform perfectly in spatial AVA; without them, models with weaker localization (MiDaS, SAM) perform worse. Subgroup analyses reveal exceptions hidden by aggregate metrics. Aggregate metrics can sometimes obscure important nuanced insights. To gain a deeper understanding, we conduct detailed analyses by partitioning te… view at source ↗
Figure 10
Figure 10. Figure 10: Detail results for localization for overall and different splits based on the ground-truth’s [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Examples of Spatial Reasoning AVA Samples. Intersecting these two sources yields a concise yet crucial set of AVAs. Moreover, we focus strictly on pure perceptual tasks and exclude non-perceptual reasoning skills (e.g., historical context, mathe￾matical reasoning). We provide more related work discussion in Appendix D. A.2 Dataset Curation Spatial Reasoning [96, 111]. We curate 15.9K image pairs from NYU-… view at source ↗
Figure 12
Figure 12. Figure 12: Examples of Counting AVA Samples. What is the species of bird in the image? Choose one from below: 1. Cerulean-Warbler 2. Ivory-Gull … 30. Boat-tailed-Grackle Boat-tailed-Grackle What is the species of butterfly in the image? Choose one from below: 1. Caria ino 2. Limenitis camilla… 30. Datana ministra Limenitis camilla What is the species of fungi in the image? Choose one from below: 1. Aseroe rubra 2. L… view at source ↗
Figure 13
Figure 13. Figure 13: Examples of Fine-grained recognition AVA Samples. • For each object count, we sample a fixed range of images per object_id—between 6 and 30 depending on the dataset—to balance frequency and diversity. • Dataset-specific sampling rules are applied: – VQAv2 and LVIS: 15–30 images per object count; ≥5 count values per object id. – FSC-147: 6–12 images per object count; ≥4 count values per object id. – CARPK:… view at source ↗
Figure 14
Figure 14. Figure 14: Examples of OCR AVA Samples. Provide bounding box coordinate for the song sparrow. “[0.36, 0.25, 0.7, 0.8]” Provide bounding box coordinate for the couch. “[0.47, 0.54, 0.81, 0.8]” Provide bounding box coordinate for the Eastern cottontail. “[0.21, 0.35, 0.87, 0.63]” Provide bounding box coordinate for the Eastern gray squirrel. “[0.16, 0.18, 0.78, 0.72]” Provide bounding box coordinate for the mule deer.… view at source ↗
Figure 15
Figure 15. Figure 15: Examples of Localization AVA Samples. • A red bounding box is rendered on each image to highlight the target word location during inference. • The question for each image: “What is written in the red bounding box in the image?” Localization [97, 78]. We curate 34.8K localization-focused image–question pairs sourced from Objects365 [79] and LVIS [30] (open-domain), iNaturalist-2021 [92] (birds and animals)… view at source ↗
Figure 16
Figure 16. Figure 16: Examples of Recognition AVA Samples. Recognition [22, 113, 90]. We curate 34.8K recognition samples from four datasets spanning diverse visual domains—Objects365 [79] and LVIS [30] (open-domain objects), iNaturalist-2021 [92] (birds and animals), and DIOR [50] (remote sensing). These samples are derived from the same images and object instances used in the localization benchmark. However, instead of askin… view at source ↗
Figure 17
Figure 17. Figure 17: Examples of Action AVA Samples. Action [83, 95, 104]. We curate a total of 15K image–question pairs from the Moments in Time [69] dataset, covering a wide range of human actions and activities. Each sample is derived from a short video clip annotated with a specific action label ( [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Examples of Scene AVA Samples. What is the texture attribute of the image? Choose one from below. 1. veined, 2. marbled … 47. polka-dotted What is the texture attribute of the image? Choose one from below. 1. cork, 2. wool … 47. aluminium-foil What is the texture attribute of the image? Choose one from below. 1. cotton, 2. linen … 47. lettuce-leaf What is the texture attribute of the image? Choose one fro… view at source ↗
Figure 19
Figure 19. Figure 19: Examples of Texture AVA Samples. • For each scene category within both datasets, an 80% training and 20% testing split was established, ensuring balanced and fair evaluation conditions. • The question for each pair: “What is the scene class of the image? Choose one from below: 1. Entrance hall, 2. Lawn, ... 30. Snowy Mountain.” These 30 options were selected by randomly sampling from the complete set of s… view at source ↗
Figure 20
Figure 20. Figure 20: Examples of Orientation AVA Samples. • For EgoOrientBench, each object class was ensured to appear in multiple orientations, with at least 10 and at most 40 samples per orientation label. This encourages models to learn generalized representations rather than memorizing specific arrangements. • From CURE-OR, we selected object instances photographed against two different back￾ground conditions using three… view at source ↗
Figure 21
Figure 21. Figure 21: Examples of Absolute Depth AVA Samples. • Depth values were discretized into bins, and for each class, we selected between 20 and 60 samples per bin to ensure coverage while avoiding overrepresentation. Bins with fewer than 20 samples were discarded. When bins exceeded 60 samples, selection was sorted by bounding box area to prioritize larger, more reliable objects. • Image crops were extracted per object… view at source ↗
Figure 22
Figure 22. Figure 22: Examples of Relative Depth AVA Samples. • Pairs were formed using both intra-class (e.g., Car vs. Car) and inter-class (e.g., Car vs. Pedestrian) combinations. • Pairs were retained only if the depth difference between the two objects was at least 0.5 meters, ensuring a meaningful perceptual gap. • To avoid ambiguity and incomplete visual evidence, the following filters were applied: – Pairs with occluded… view at source ↗
Figure 23
Figure 23. Figure 23: Detail results for counting for overall and different splits based on datasets and ground [PITH_FULL_IMAGE:figures/full_fig_p030_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Detail results for emotion for overall and different splits based on emotion types. [PITH_FULL_IMAGE:figures/full_fig_p031_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Detail results for orientation for overall and different splits based on viewpoint directions. [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Detail results for absolute depth for overall and different splits based on scene type and [PITH_FULL_IMAGE:figures/full_fig_p032_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Detail results for relative depth for overall and different splits based on scene type. [PITH_FULL_IMAGE:figures/full_fig_p032_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Detail results for fine-grained for overall and different splits based on dataset type. [PITH_FULL_IMAGE:figures/full_fig_p033_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Detail results for scene for overall and different splits based on datasets. [PITH_FULL_IMAGE:figures/full_fig_p033_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Detail results for counting for overall and different splits based on dataset domain and [PITH_FULL_IMAGE:figures/full_fig_p034_30.png] view at source ↗
read the original abstract

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM' visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AVA-Bench, a benchmark for vision foundation models (VFMs) that explicitly disentangles 14 Atomic Visual Abilities (AVAs) such as localization, depth estimation, and spatial understanding. It addresses two limitations of standard VQA evaluation protocols—potential train/test distribution mismatch after instruction tuning and the confounding effect of multi-ability tasks—by decoupling the AVAs and enforcing per-ability train/test distribution matching. The resulting per-ability scores are used to produce distinctive “ability fingerprints” for leading VFMs; the work additionally reports that a 0.5 B LLM head yields essentially the same VFM ranking as a 7 B LLM while reducing GPU hours by roughly 8×.

Significance. If the claimed isolation of abilities holds, AVA-Bench would supply a more diagnostic evaluation instrument than existing broad VQA suites, enabling principled selection and targeted improvement of VFMs. The efficiency result with the smaller LLM is a concrete, immediately usable contribution for reproducible benchmarking.

major comments (2)
  1. [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that per-ability distribution matching “pinpoints exactly where a VFM excels or falters” (§1, abstract) presupposes that the 14 chosen AVAs are mutually independent and collectively sufficient with negligible residual cross-talk. Visual skills such as localization and depth estimation are known to be interdependent; without explicit verification (e.g., ablation of joint versus isolated training or correlation analysis across AVA test sets), error attribution to a single AVA remains ambiguous.
  2. [§4.3 (Ability Fingerprints) and §5 (Discussion)] The manuscript states that the 14 AVAs “collectively support complex visual reasoning tasks” yet provides no quantitative check that performance on held-out complex VQA tasks can be reconstructed from the AVA scores (or that residual variance is small). This leaves open whether the chosen set is sufficient.
minor comments (2)
  1. [§3] The exact operational definitions and data sources for each of the 14 AVAs should be listed in a single table with references to the source datasets.
  2. [§4.2] Clarify whether the reported 8× GPU-hour reduction accounts for the cost of constructing the per-ability training sets or only for the final evaluation pass.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which highlight important aspects of our benchmark design and claims. We address each major comment below and describe the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that per-ability distribution matching “pinpoints exactly where a VFM excels or falters” (§1, abstract) presupposes that the 14 chosen AVAs are mutually independent and collectively sufficient with negligible residual cross-talk. Visual skills such as localization and depth estimation are known to be interdependent; without explicit verification (e.g., ablation of joint versus isolated training or correlation analysis across AVA test sets), error attribution to a single AVA remains ambiguous.

    Authors: We agree that visual abilities exhibit interdependencies, as established in prior work on visual perception. Our design mitigates confounding by creating per-ability datasets with explicit train/test distribution alignment, which reduces the train/test mismatch and multi-ability entanglement issues of standard VQA. The current manuscript does not include correlation analysis across AVA sets or joint-versus-isolated ablations. We will add a correlation matrix of AVA test-set performances in the revised version, along with discussion of how residual cross-talk affects attribution, to make these limitations explicit. revision: yes

  2. Referee: [§4.3 (Ability Fingerprints) and §5 (Discussion)] The manuscript states that the 14 AVAs “collectively support complex visual reasoning tasks” yet provides no quantitative check that performance on held-out complex VQA tasks can be reconstructed from the AVA scores (or that residual variance is small). This leaves open whether the chosen set is sufficient.

    Authors: We acknowledge that a direct quantitative validation of sufficiency would strengthen the claim that the 14 AVAs are collectively adequate. The manuscript presents the AVAs as foundational primitives based on their coverage of core visual skills. In the revision we will add a regression-based analysis that predicts held-out complex VQA performance from the 14 AVA scores and reports the explained variance, thereby quantifying residual variance and the practical sufficiency of the set. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark is defined by construction without reducing to fitted inputs or self-citations

full rationale

The paper introduces AVA-Bench by selecting and decoupling 14 atomic visual abilities (localization, depth estimation, spatial understanding, etc.) and constructing per-ability train/test sets with matched distributions. This is an explicit methodological definition motivated by stated VQA limitations, not a derivation that reduces to prior fitted quantities or self-citations. No equations, uniqueness theorems, or ansatzes from the authors' prior work are invoked to force the central claim. Empirical findings (e.g., 0.5B vs 7B LLM rankings) are observations on external models rather than tautological outputs. The derivation chain is self-contained as benchmark engineering.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that visual reasoning decomposes cleanly into the listed atomic abilities and that distribution matching removes data-mismatch confounds.

axioms (1)
  • domain assumption The 14 atomic visual abilities are foundational skills that collectively support complex visual reasoning tasks.
    Invoked when claiming that isolating these abilities reveals where VFMs excel or falter.
invented entities (1)
  • Atomic Visual Abilities (AVAs) no independent evidence
    purpose: Disentangled foundational visual skills for precise VFM diagnosis.
    Newly defined set of 14 abilities introduced to structure the benchmark.

pith-pipeline@v0.9.0 · 5829 in / 1198 out tokens · 28328 ms · 2026-05-19T11:08:37.335768+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · 14 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  2. [2]

    Foundation models defining a new era in vision: a survey and outlook

    Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  3. [3]

    Eureka: Evaluating and understanding large foundation models

    Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models. arXiv preprint arXiv:2409.10566, 2024

  4. [4]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019

  5. [5]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  6. [6]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  7. [7]

    Decomposing complex visual comprehension into atomic visual skills for vision language models

    Hyunsik Chae, Seungwoo Yoon, Chloe Yewon Chun, Gyehun Go, Yongin Cho, Gyeongmin Lee, and Ernest K Ryu. Decomposing complex visual comprehension into atomic visual skills for vision language models. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24

  8. [8]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024

  9. [9]

    Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation

    Tianle Chen, Zheda Mai, Ruiwen Li, and Wei-lun Chao. Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.05803, 2023

  10. [10]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

  11. [11]

    On domain-specific post-training for multimodal large language models

    Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, and Zhenliang Zhang. On domain-specific post-training for multimodal large language models. arXiv preprint arXiv:2411.19930, 2024

  12. [12]

    Megacoin: Enhancing medium-grained color perception for vision-language models

    Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, and Xuezhe Ma. Megacoin: Enhancing medium-grained color perception for vision-language models. arXiv preprint arXiv:2412.03927, 2024

  13. [13]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023

  14. [14]

    Prompt-cam: A simpler interpretable transformer for fine-grained analysis

    Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G Campolongo, Daniel Rubenstein, Charles V Stewart, Anuj Karpatne, et al. Prompt-cam: A simpler interpretable transformer for fine-grained analysis. arXiv preprint arXiv:2501.09333, 2025

  15. [15]

    Cimpoi, S

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014

  16. [16]

    Humanvlm: Foundation for human- scene vision-language model

    Dawei Dai, Long Xu, Yutang Li, Yuanhui Zhang, and Shuyin Xia. Humanvlm: Foundation for human- scene vision-language model. Information Fusion, page 103271, 2025

  17. [17]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  18. [18]

    Shape and texture recognition in large vision-language models

    Sagi Eppel, Mor Bismut, and Alona Faktor. Shape and texture recognition in large vision-language models. arXiv preprint arXiv:2503.23062, 2025. 11

  19. [19]

    There is no samantics! exploring sam as a backbone for visual understanding tasks

    Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, and Elliot J Crowley. There is no samantics! exploring sam as a backbone for visual understanding tasks. arXiv preprint arXiv:2411.15288, 2024

  20. [20]

    Mllm-sul: Multimodal large language model for semantic scene understanding and localization in traffic scenarios

    Jiaqi Fan, Jianhua Wu, Jincheng Gao, Jianhao Yu, Yafei Wang, Hongqing Chu, and Bingzhao Gao. Mllm-sul: Multimodal large language model for semantic scene understanding and localization in traffic scenarios. arXiv preprint arXiv:2412.19406, 2024

  21. [21]

    Multimodal autoregressive pre-training of large vision encoders

    Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402, 2024

  22. [22]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

  23. [23]

    Are vision language models texture or shape biased and can we steer them? arXiv preprint arXiv:2403.09193, 2024

    Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Are vision language models texture or shape biased and can we steer them? arXiv preprint arXiv:2403.09193, 2024

  24. [24]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013

  25. [25]

    Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks

    Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks. Advances in Neural Information Processing Systems, 36:29343–29371, 2023

  26. [26]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  27. [27]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  28. [28]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  29. [29]

    Bioclip 2: Emergent properties from scaling hierarchical contrastive learning.arXiv preprint arXiv:2505.23883, 2025

    Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E White, James Balhoff, et al. Bioclip 2: Emergent properties from scaling hierarchical contrastive learning. arXiv preprint arXiv:2505.23883, 2025

  30. [30]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356– 5364, 2019

  31. [31]

    A survey on vision transformer

    Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022

  32. [32]

    Radio amplified: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov, et al. Radio amplified: Improved baselines for agglomerative vision foundation models. arXiv preprint arXiv:2412.07679, 2024

  33. [33]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An- drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

  34. [34]

    Drone-based object counting by spatially regularized regional proposal network

    Meng-Ru Hsieh, Yen-Liang Lin, and Winston H Hsu. Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE international conference on computer vision , pages 4145–4153, 2017

  35. [35]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  36. [36]

    A survey on evaluation of multimodal large language models

    Jiaxing Huang and Jingyi Zhang. A survey on evaluation of multimodal large language models. arXiv preprint arXiv:2408.15769, 2024. 12

  37. [37]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023

  38. [38]

    Ocr-reasoning benchmark: Unveiling the true capabilities of mllms in complex text-rich image reasoning

    Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, and Lianwen Jin. Ocr-reasoning benchmark: Unveiling the true capabilities of mllms in complex text-rich image reasoning. arXiv preprint arXiv:2505.17163, 2025

  39. [39]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, 2024

  40. [40]

    Is’ right’right? enhancing object orientation understanding in multimodal language models through egocentric instruction tuning

    Ji Hyeok Jung, Eun Tae Kim, Seo Yeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is’ right’right? enhancing object orientation understanding in multimodal language models through egocentric instruction tuning. arXiv preprint arXiv:2411.16761, 2024

  41. [41]

    Transformers in vision: A survey

    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022

  42. [42]

    Mllm-compbench: A comparative reasoning benchmark for multimodal llms

    Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Harry Chao. Mllm-compbench: A comparative reasoning benchmark for multimodal llms. Advances in Neural Information Processing Systems, 37:28798–28827, 2024

  43. [43]

    Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models

    Jeonghwan Kim and Heng Ji. Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6187–6207, 2024

  44. [44]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  45. [45]

    Diagnostics- llava: A visual language model for domain-specific diagnostics of equipment

    Aman Kumar, Mahbubul Alam, Ahmed Farahat, Maheshjabu Somineni, and Chetan Gupta. Diagnostics- llava: A visual language model for domain-specific diagnostics of equipment. In Annual Conference of the PHM Society, volume 16, 2024

  46. [46]

    Kylberg texture dataset v

    Gustaf Kylberg. Kylberg texture dataset v. 1.0 . Centre for Image Analysis, Swedish University of Agricultural Sciences and . . . , 2011

  47. [47]

    Llmcount: Enhancing stationary mmwave detection with multimodal-llm

    Boyan Li, Shengyi Ding, Deen Ma, Yixuan Wu, Hongjie Liao, and Kaiyuan Hu. Llmcount: Enhancing stationary mmwave detection with multimodal-llm. arXiv preprint arXiv:2409.16209, 2024

  48. [48]

    Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model

    Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, and Heikki Kälviäinen. Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model. arXiv preprint arXiv:2405.00574, 2024

  49. [49]

    Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark

    Haopeng Li, Lingbo Liu, Kunlin Yang, Shinan Liu, Junyu Gao, Bin Zhao, Rui Zhang, and Jun Hou. Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark. IEEE Transactions on Image Processing, 31:6032–6047, 2022

  50. [50]

    Object detection in optical remote sensing images: A survey and a new benchmark

    Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

  51. [51]

    Reliable crowdsourcing and deep locality-preserving learning for un- constrained facial expression recognition

    Shan Li and Weihong Deng. Reliable crowdsourcing and deep locality-preserving learning for un- constrained facial expression recognition. IEEE Transactions on Image Processing , 28(1):356–370, 2019

  52. [52]

    Openvision: A fully-open, cost- effective family of advanced vision encoders for multimodal learning

    Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, and Cihang Xie. Openvision: A fully-open, cost- effective family of advanced vision encoders for multimodal learning. arXiv preprint arXiv:2505.04601, 2025

  53. [53]

    Visual large language models for generalized and specialized applications

    Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, and Yu Kong. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765, 2025

  54. [54]

    Expression analysis based on face regions in real-world conditions

    Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang, and Ming-Yue Niu. Expression analysis based on face regions in real-world conditions. International Journal of Automation and Computing, 17:96–107, 2020

  55. [55]

    A comprehensive survey and guide to multimodal large language models in vision-language tasks

    Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, Wei An-Hou, Li Ming, Tianyang Wang, Ziqian Bi, and Ming Liu. A comprehensive survey and guide to multimodal large language models in vision-language tasks. arXiv preprint arXiv:2411.06284, 2024. 13

  56. [56]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  57. [57]

    Ocrbench: on the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024

  58. [58]

    Cvpr 2020 continual learning in computer vision competition: Approaches, results, current challenges and future directions

    Vincenzo Lomonaco, Lorenzo Pellegrini, Pau Rodriguez, Massimo Caccia, Qi She, Yu Chen, Quentin Jodelet, Ruiping Wang, Zheda Mai, David Vazquez, et al. Cvpr 2020 continual learning in computer vision competition: Approaches, results, current challenges and future directions. Artificial Intelligence, 303:103635, 2022

  59. [59]

    The development of the cie 2000 colour-difference formula: Ciede2000

    M Ronnier Luo, Guihua Cui, and Bryan Rigg. The development of the cie 2000 colour-difference formula: Ciede2000. Color Research & Application: Endorsed by Inter-Society Color Council, The Colour Group (Great Britain), Canadian Society for Color, Color Science Association of Japan, Dutch Society for the Study of Color, The Swedish Colour Centre Foundation,...

  60. [60]

    Fine-tuning is fine, if calibrated

    Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, et al. Fine-tuning is fine, if calibrated. Advances in Neural Information Processing Systems, 37:136084–136119, 2024

  61. [61]

    Batch-level experience replay with review for continual learning

    Zheda Mai, Hyunwoo Kim, Jihwan Jeong, and Scott Sanner. Batch-level experience replay with review for continual learning. arXiv preprint arXiv:2007.05683, 2020

  62. [62]

    Online continual learning in image classification: An empirical survey

    Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022

  63. [63]

    Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning

    Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3589–3599, 2021

  64. [64]

    Lessons and insights from a unifying study of parameter-efficient fine-tuning (peft) in visual recognition

    Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Quang-Huy Nguyen, Li Zhang, and Wei- Lun Chao. Lessons and insights from a unifying study of parameter-efficient fine-tuning (peft) in visual recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14845–14857, 2025

  65. [65]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  66. [66]

    The kth-tips2 database

    P Mallikarjuna, Alireza Targhi, Mario Fritz, Eric Hayman, Barbara Caputo, and J.-O Eklundh. The kth-tips2 database. 07 2006

  67. [67]

    Hierarchical interpretable vision reasoning driven through a multi-modal large language model for depth estimation

    Wenfeng Mi, He Chen, and Weipeng Liu. Hierarchical interpretable vision reasoning driven through a multi-modal large language model for depth estimation. In 2024 China Automation Congress (CAC), pages 829–834. IEEE, 2024

  68. [68]

    Mishra, K

    A. Mishra, K. Alahari, and C. V . Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012

  69. [69]

    Moments in time dataset: one million videos for event understanding

    Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl V ondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019

  70. [70]

    Intriguing properties of vision transformers

    Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shah- baz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021

  71. [71]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  72. [72]

    A simple interpretable transformer for fine-grained image classification and analysis

    Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya Provost, Anuj Karpatne, Bryan Carstens, Daniel I Rubenstein, et al. A simple interpretable transformer for fine-grained image classification and analysis. 2023. 14

  73. [73]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

  74. [74]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  75. [75]

    Learning to count everything

    Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3394– 3403, 2021

  76. [76]

    Am-radio: Agglomerative vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12490–12500, 2024

  77. [77]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019

  78. [78]

    Object detection with multimodal large vision-language models: An in-depth review

    Ranjan Sapkota and Manoj Karkee. Object detection with multimodal large vision-language models: An in-depth review. Available at SSRN 5233953, 2025

  79. [79]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019

  80. [80]

    Online class- incremental continual learning with adversarial shapley value

    Dongsub Shim, Zheda Mai, Jihwan Jeong, Scott Sanner, Hyunwoo Kim, and Jongseong Jang. Online class- incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9630–9638, 2021

Showing first 80 references.