AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Pith reviewed 2026-05-19 11:08 UTC · model grok-4.3
The pith
AVA-Bench disentangles 14 atomic visual abilities to isolate exactly where vision foundation models succeed or fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decoupling 14 Atomic Visual Abilities and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters, revealing distinctive ability fingerprints and demonstrating that a 0.5B LLM produces similar VFM rankings as a 7B LLM while cutting GPU hours by 8x.
What carries the argument
The decoupling of 14 Atomic Visual Abilities with per-ability matching of training and test data distributions, which isolates individual skills such as localization, depth estimation, and spatial understanding.
If this is right
- Vision foundation models display unique ability fingerprints that allow precise identification of strengths and weaknesses.
- A 0.5B parameter language model produces VFM rankings comparable to those from a 7B parameter model.
- Evaluation of vision foundation models can be performed with 8 times fewer GPU hours.
- Complex visual reasoning tasks can be diagnosed by examining performance on the individual atomic abilities rather than overall accuracy.
Where Pith is reading between the lines
- Model developers could target training or fine-tuning specifically on the abilities where a given vision foundation model scores lowest.
- The same decoupling approach could be applied to evaluate other types of foundation models beyond vision.
- Complementary ability profiles from different models might be combined to handle tasks that require multiple skills.
Load-bearing premise
The 14 chosen abilities are independent and collectively sufficient to explain performance on complex visual reasoning tasks without significant overlap or missing interactions.
What would settle it
A finding that a model's accuracy on a composite visual reasoning task cannot be explained by the pattern of its scores across the 14 separate abilities would show the benchmark fails to isolate the sources of error.
Figures
read the original abstract
The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM' visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AVA-Bench, a benchmark for vision foundation models (VFMs) that explicitly disentangles 14 Atomic Visual Abilities (AVAs) such as localization, depth estimation, and spatial understanding. It addresses two limitations of standard VQA evaluation protocols—potential train/test distribution mismatch after instruction tuning and the confounding effect of multi-ability tasks—by decoupling the AVAs and enforcing per-ability train/test distribution matching. The resulting per-ability scores are used to produce distinctive “ability fingerprints” for leading VFMs; the work additionally reports that a 0.5 B LLM head yields essentially the same VFM ranking as a 7 B LLM while reducing GPU hours by roughly 8×.
Significance. If the claimed isolation of abilities holds, AVA-Bench would supply a more diagnostic evaluation instrument than existing broad VQA suites, enabling principled selection and targeted improvement of VFMs. The efficiency result with the smaller LLM is a concrete, immediately usable contribution for reproducible benchmarking.
major comments (2)
- [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that per-ability distribution matching “pinpoints exactly where a VFM excels or falters” (§1, abstract) presupposes that the 14 chosen AVAs are mutually independent and collectively sufficient with negligible residual cross-talk. Visual skills such as localization and depth estimation are known to be interdependent; without explicit verification (e.g., ablation of joint versus isolated training or correlation analysis across AVA test sets), error attribution to a single AVA remains ambiguous.
- [§4.3 (Ability Fingerprints) and §5 (Discussion)] The manuscript states that the 14 AVAs “collectively support complex visual reasoning tasks” yet provides no quantitative check that performance on held-out complex VQA tasks can be reconstructed from the AVA scores (or that residual variance is small). This leaves open whether the chosen set is sufficient.
minor comments (2)
- [§3] The exact operational definitions and data sources for each of the 14 AVAs should be listed in a single table with references to the source datasets.
- [§4.2] Clarify whether the reported 8× GPU-hour reduction accounts for the cost of constructing the per-ability training sets or only for the final evaluation pass.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which highlight important aspects of our benchmark design and claims. We address each major comment below and describe the revisions we plan to incorporate.
read point-by-point responses
-
Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that per-ability distribution matching “pinpoints exactly where a VFM excels or falters” (§1, abstract) presupposes that the 14 chosen AVAs are mutually independent and collectively sufficient with negligible residual cross-talk. Visual skills such as localization and depth estimation are known to be interdependent; without explicit verification (e.g., ablation of joint versus isolated training or correlation analysis across AVA test sets), error attribution to a single AVA remains ambiguous.
Authors: We agree that visual abilities exhibit interdependencies, as established in prior work on visual perception. Our design mitigates confounding by creating per-ability datasets with explicit train/test distribution alignment, which reduces the train/test mismatch and multi-ability entanglement issues of standard VQA. The current manuscript does not include correlation analysis across AVA sets or joint-versus-isolated ablations. We will add a correlation matrix of AVA test-set performances in the revised version, along with discussion of how residual cross-talk affects attribution, to make these limitations explicit. revision: yes
-
Referee: [§4.3 (Ability Fingerprints) and §5 (Discussion)] The manuscript states that the 14 AVAs “collectively support complex visual reasoning tasks” yet provides no quantitative check that performance on held-out complex VQA tasks can be reconstructed from the AVA scores (or that residual variance is small). This leaves open whether the chosen set is sufficient.
Authors: We acknowledge that a direct quantitative validation of sufficiency would strengthen the claim that the 14 AVAs are collectively adequate. The manuscript presents the AVAs as foundational primitives based on their coverage of core visual skills. In the revision we will add a regression-based analysis that predicts held-out complex VQA performance from the 14 AVA scores and reports the explained variance, thereby quantifying residual variance and the practical sufficiency of the set. revision: yes
Circularity Check
No circularity: benchmark is defined by construction without reducing to fitted inputs or self-citations
full rationale
The paper introduces AVA-Bench by selecting and decoupling 14 atomic visual abilities (localization, depth estimation, spatial understanding, etc.) and constructing per-ability train/test sets with matched distributions. This is an explicit methodological definition motivated by stated VQA limitations, not a derivation that reduces to prior fitted quantities or self-citations. No equations, uniqueness theorems, or ansatzes from the authors' prior work are invoked to force the central claim. Empirical findings (e.g., 0.5B vs 7B LLM rankings) are observations on external models rather than tautological outputs. The derivation chain is self-contained as benchmark engineering.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 14 atomic visual abilities are foundational skills that collectively support complex visual reasoning tasks.
invented entities (1)
-
Atomic Visual Abilities (AVAs)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Foundation models defining a new era in vision: a survey and outlook
Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[3]
Eureka: Evaluating and understanding large foundation models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models. arXiv preprint arXiv:2409.10566, 2024
-
[4]
Scene text visual question answering
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019
work page 2019
-
[5]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021
work page 2021
-
[7]
Decomposing complex visual comprehension into atomic visual skills for vision language models
Hyunsik Chae, Seungwoo Yoon, Chloe Yewon Chun, Gyehun Go, Yongin Cho, Gyeongmin Lee, and Ernest K Ryu. Decomposing complex visual comprehension into atomic visual skills for vision language models. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24
-
[8]
Sharegpt4v: Improving large multi-modal models with better captions
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024
work page 2024
-
[9]
Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation
Tianle Chen, Zheda Mai, Ruiwen Li, and Wei-lun Chao. Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.05803, 2023
-
[10]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
On domain-specific post-training for multimodal large language models
Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, and Zhenliang Zhang. On domain-specific post-training for multimodal large language models. arXiv preprint arXiv:2411.19930, 2024
-
[12]
Megacoin: Enhancing medium-grained color perception for vision-language models
Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, and Xuezhe Ma. Megacoin: Enhancing medium-grained color perception for vision-language models. arXiv preprint arXiv:2412.03927, 2024
-
[13]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023
work page 2023
-
[14]
Prompt-cam: A simpler interpretable transformer for fine-grained analysis
Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G Campolongo, Daniel Rubenstein, Charles V Stewart, Anuj Karpatne, et al. Prompt-cam: A simpler interpretable transformer for fine-grained analysis. arXiv preprint arXiv:2501.09333, 2025
- [15]
-
[16]
Humanvlm: Foundation for human- scene vision-language model
Dawei Dai, Long Xu, Yutang Li, Yuanhui Zhang, and Shuyin Xia. Humanvlm: Foundation for human- scene vision-language model. Information Fusion, page 103271, 2025
work page 2025
-
[17]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[18]
Shape and texture recognition in large vision-language models
Sagi Eppel, Mor Bismut, and Alona Faktor. Shape and texture recognition in large vision-language models. arXiv preprint arXiv:2503.23062, 2025. 11
-
[19]
There is no samantics! exploring sam as a backbone for visual understanding tasks
Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, and Elliot J Crowley. There is no samantics! exploring sam as a backbone for visual understanding tasks. arXiv preprint arXiv:2411.15288, 2024
-
[20]
Jiaqi Fan, Jianhua Wu, Jincheng Gao, Jianhao Yu, Yafei Wang, Hongqing Chu, and Bingzhao Gao. Mllm-sul: Multimodal large language model for semantic scene understanding and localization in traffic scenarios. arXiv preprint arXiv:2412.19406, 2024
-
[21]
Multimodal autoregressive pre-training of large vision encoders
Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402, 2024
-
[22]
Blink: Multimodal large language models can see but not perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024
work page 2024
-
[23]
Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Are vision language models texture or shape biased and can we steer them? arXiv preprint arXiv:2403.09193, 2024
-
[24]
Vision meets robotics: The kitti dataset
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013
work page 2013
-
[25]
Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks
Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks. Advances in Neural Information Processing Systems, 36:29343–29371, 2023
work page 2023
-
[26]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017
work page 2017
-
[27]
Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017
work page 2017
-
[28]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E White, James Balhoff, et al. Bioclip 2: Emergent properties from scaling hierarchical contrastive learning. arXiv preprint arXiv:2505.23883, 2025
-
[30]
Lvis: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356– 5364, 2019
work page 2019
-
[31]
A survey on vision transformer
Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022
work page 2022
-
[32]
Radio amplified: Improved baselines for agglomerative vision foundation models
Greg Heinrich, Mike Ranzinger, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov, et al. Radio amplified: Improved baselines for agglomerative vision foundation models. arXiv preprint arXiv:2412.07679, 2024
-
[33]
Parameter-efficient transfer learning for nlp
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An- drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019
work page 2019
-
[34]
Drone-based object counting by spatially regularized regional proposal network
Meng-Ru Hsieh, Yen-Liang Lin, and Winston H Hsu. Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE international conference on computer vision , pages 4145–4153, 2017
work page 2017
-
[35]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022
work page 2022
-
[36]
A survey on evaluation of multimodal large language models
Jiaxing Huang and Jingyi Zhang. A survey on evaluation of multimodal large language models. arXiv preprint arXiv:2408.15769, 2024. 12
-
[37]
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation
Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023
work page 2023
-
[38]
Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, and Lianwen Jin. Ocr-reasoning benchmark: Unveiling the true capabilities of mllms in complex text-rich image reasoning. arXiv preprint arXiv:2505.17163, 2025
-
[39]
Position: The platonic representation hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[40]
Ji Hyeok Jung, Eun Tae Kim, Seo Yeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is’ right’right? enhancing object orientation understanding in multimodal language models through egocentric instruction tuning. arXiv preprint arXiv:2411.16761, 2024
-
[41]
Transformers in vision: A survey
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022
work page 2022
-
[42]
Mllm-compbench: A comparative reasoning benchmark for multimodal llms
Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Harry Chao. Mllm-compbench: A comparative reasoning benchmark for multimodal llms. Advances in Neural Information Processing Systems, 37:28798–28827, 2024
work page 2024
-
[43]
Jeonghwan Kim and Heng Ji. Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6187–6207, 2024
work page 2024
-
[44]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023
work page 2023
-
[45]
Diagnostics- llava: A visual language model for domain-specific diagnostics of equipment
Aman Kumar, Mahbubul Alam, Ahmed Farahat, Maheshjabu Somineni, and Chetan Gupta. Diagnostics- llava: A visual language model for domain-specific diagnostics of equipment. In Annual Conference of the PHM Society, volume 16, 2024
work page 2024
-
[46]
Gustaf Kylberg. Kylberg texture dataset v. 1.0 . Centre for Image Analysis, Swedish University of Agricultural Sciences and . . . , 2011
work page 2011
-
[47]
Llmcount: Enhancing stationary mmwave detection with multimodal-llm
Boyan Li, Shengyi Ding, Deen Ma, Yixuan Wu, Hongjie Liao, and Kaiyuan Hu. Llmcount: Enhancing stationary mmwave detection with multimodal-llm. arXiv preprint arXiv:2409.16209, 2024
-
[48]
Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, and Heikki Kälviäinen. Eald-mllm: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model. arXiv preprint arXiv:2405.00574, 2024
-
[49]
Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark
Haopeng Li, Lingbo Liu, Kunlin Yang, Shinan Liu, Junyu Gao, Bin Zhao, Rui Zhang, and Jun Hou. Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark. IEEE Transactions on Image Processing, 31:6032–6047, 2022
work page 2022
-
[50]
Object detection in optical remote sensing images: A survey and a new benchmark
Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020
work page 2020
-
[51]
Shan Li and Weihong Deng. Reliable crowdsourcing and deep locality-preserving learning for un- constrained facial expression recognition. IEEE Transactions on Image Processing , 28(1):356–370, 2019
work page 2019
-
[52]
Openvision: A fully-open, cost- effective family of advanced vision encoders for multimodal learning
Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, and Cihang Xie. Openvision: A fully-open, cost- effective family of advanced vision encoders for multimodal learning. arXiv preprint arXiv:2505.04601, 2025
-
[53]
Visual large language models for generalized and specialized applications
Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, and Yu Kong. Visual large language models for generalized and specialized applications. arXiv preprint arXiv:2501.02765, 2025
-
[54]
Expression analysis based on face regions in real-world conditions
Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang, and Ming-Yue Niu. Expression analysis based on face regions in real-world conditions. International Journal of Automation and Computing, 17:96–107, 2020
work page 2020
-
[55]
A comprehensive survey and guide to multimodal large language models in vision-language tasks
Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, Wei An-Hou, Li Ming, Tianyang Wang, Ziqian Bi, and Ming Liu. A comprehensive survey and guide to multimodal large language models in vision-language tasks. arXiv preprint arXiv:2411.06284, 2024. 13
-
[56]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[57]
Ocrbench: on the hidden mystery of ocr in large multimodal models
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024
work page 2024
-
[58]
Vincenzo Lomonaco, Lorenzo Pellegrini, Pau Rodriguez, Massimo Caccia, Qi She, Yu Chen, Quentin Jodelet, Ruiping Wang, Zheda Mai, David Vazquez, et al. Cvpr 2020 continual learning in computer vision competition: Approaches, results, current challenges and future directions. Artificial Intelligence, 303:103635, 2022
work page 2020
-
[59]
The development of the cie 2000 colour-difference formula: Ciede2000
M Ronnier Luo, Guihua Cui, and Bryan Rigg. The development of the cie 2000 colour-difference formula: Ciede2000. Color Research & Application: Endorsed by Inter-Society Color Council, The Colour Group (Great Britain), Canadian Society for Color, Color Science Association of Japan, Dutch Society for the Study of Color, The Swedish Colour Centre Foundation,...
work page 2000
-
[60]
Fine-tuning is fine, if calibrated
Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, et al. Fine-tuning is fine, if calibrated. Advances in Neural Information Processing Systems, 37:136084–136119, 2024
work page 2024
-
[61]
Batch-level experience replay with review for continual learning
Zheda Mai, Hyunwoo Kim, Jihwan Jeong, and Scott Sanner. Batch-level experience replay with review for continual learning. arXiv preprint arXiv:2007.05683, 2020
-
[62]
Online continual learning in image classification: An empirical survey
Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022
work page 2022
-
[63]
Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3589–3599, 2021
work page 2021
-
[64]
Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Quang-Huy Nguyen, Li Zhang, and Wei- Lun Chao. Lessons and insights from a unifying study of parameter-efficient fine-tuning (peft) in visual recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14845–14857, 2025
work page 2025
-
[65]
Fine-Grained Visual Classification of Aircraft
Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[66]
P Mallikarjuna, Alireza Targhi, Mario Fritz, Eric Hayman, Barbara Caputo, and J.-O Eklundh. The kth-tips2 database. 07 2006
work page 2006
-
[67]
Wenfeng Mi, He Chen, and Weipeng Liu. Hierarchical interpretable vision reasoning driven through a multi-modal large language model for depth estimation. In 2024 China Automation Congress (CAC), pages 829–834. IEEE, 2024
work page 2024
- [68]
-
[69]
Moments in time dataset: one million videos for event understanding
Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl V ondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019
work page 2019
-
[70]
Intriguing properties of vision transformers
Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shah- baz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021
work page 2021
-
[71]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[72]
A simple interpretable transformer for fine-grained image classification and analysis
Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya Provost, Anuj Karpatne, Bryan Carstens, Daniel I Rubenstein, et al. A simple interpretable transformer for fine-grained image classification and analysis. 2023. 14
work page 2023
-
[73]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[74]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020
work page 2020
-
[75]
Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3394– 3403, 2021
work page 2021
-
[76]
Am-radio: Agglomerative vision foundation model reduce all domains into one
Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12490–12500, 2024
work page 2024
-
[77]
Generalized intersection over union: A metric and a loss for bounding box regression
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019
work page 2019
-
[78]
Object detection with multimodal large vision-language models: An in-depth review
Ranjan Sapkota and Manoj Karkee. Object detection with multimodal large vision-language models: An in-depth review. Available at SSRN 5233953, 2025
work page 2025
-
[79]
Objects365: A large-scale, high-quality dataset for object detection
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019
work page 2019
-
[80]
Online class- incremental continual learning with adversarial shapley value
Dongsub Shim, Zheda Mai, Jihwan Jeong, Scott Sanner, Hyunwoo Kim, and Jongseong Jang. Online class- incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9630–9638, 2021
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.