pith. machine review for the scientific record.

arxiv: 2605.13193 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained · recognition · acquisition · knowledge · evidence · models · agent · closed-book

The pith

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-grained recognition in daily life often requires more than looking at a picture and naming the object. People search for details, compare features, and check evidence when they encounter something unfamiliar. Existing AI benchmarks mostly test closed-book classification from training data. The authors created FIKA-Bench with 311 public-source examples where the correct answer depends on external evidence. They filtered out cases that frontier closed-book models already knew and removed any image-answer leakage. Evaluation of recent large multimodal models and agents that can use tools found the top accuracy at 25.1 percent, with no system above 30 percent. The main problems were retrieving the wrong entities and making incorrect judgments about visual details even when tools were available. This indicates that simply adding search capabilities does not solve the task.
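
A minimal sketch of what such a leakage-aware filter could look like is below; the function names, the closed-book query interface, and the retry count are illustrative assumptions, not the authors' released pipeline.

    # Illustrative sketch of a leakage-aware filtering pass (hypothetical
    # interface, not the authors' released code). A candidate is kept only if
    # no frontier closed-book model answers it correctly and the auditors find
    # no image-answer leakage.
    def filter_candidates(candidates, closed_book_models, is_correct,
                          audit_leakage, n_trials=3):
        kept = []
        for example in candidates:
            # Drop the example if any closed-book model answers correctly
            # without external tools (treated as memorized knowledge).
            memorized = any(
                is_correct(model.answer(example["image"], example["question"]),
                           example["answer"])
                for model in closed_book_models
                for _ in range(n_trials)
            )
            if memorized:
                continue
            # Drop the example if the answer can be read off the image itself.
            if audit_leakage(example["image"], example["answer"]):
                continue
            kept.append(example)
        return kept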

Core claim

Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement.

Load-bearing premise

That the filtering against frontier closed-book models successfully removes all memorized cases and that the 311 instances have no image-answer leakage while remaining representative of real-life fine-grained recognition scenarios.

Figures

Figures reproduced from arXiv: 2605.13193 by Geng Li, Yuxin Peng.

Figure 1. Overview of FIKA-Bench: 311 evidence-grounded fine-grained instances from public-source and real-life images across Product, Nature, Transport, and Culture, evaluating whether models can acquire external fine-grained knowledge to recognize unseen categories.
Figure 2. Comparison between FIKA-Bench and existing fine-grained recognition benchmarks. FIKA-Bench evaluates evidence-grounded fine-grained knowledge acquisition rather than only closed-set or open-ended recognition from the image and model parameters. Evaluation of closed-book models and tool-enabled agents shows that FIKA-Bench remains a formidable challenge; the best-performing system achieves only 25.1% accuracy.
Figure 3. End-to-end agent runtime distribution by top-level category. Boxes show the interquartile range.
Figure 4. Error taxonomy for current agent methods. Each panel reports the distribution over non…
Figure 5. Two-level taxonomy distribution in FIKA-Bench. Inner ring: top-level categories; outer ring: mid-level categories.
Figure 6. Representative FIKA-Bench samples grouped by top-level category. Each row contains one public-data example and one real-life example from the same top-level category; each panel reports the top-level category, mid-level category, source partition, question, verified answer, and full evidence link.
Original abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FIKA-Bench, a leakage-aware benchmark of 311 public-source and real-life instances for fine-grained knowledge acquisition. Systems must actively seek, verify, and use external evidence to answer open-ended questions, rather than relying on closed-book recognition. After filtering examples against frontier closed-book models to remove memorized cases and auditing to eliminate image-answer leakage, evaluations of the latest LMMs and tool-equipped agents show that the best system reaches only 25.1% accuracy (no model exceeds 30%), with failures driven primarily by wrong entity retrieval and poor visual judgment.

Significance. If the filtering and auditing process is shown to be effective, the benchmark would be a valuable contribution by shifting evaluation from passive visual recognition to active external knowledge acquisition in everyday fine-grained scenarios. It provides concrete evidence of current LMM and agent limitations and points to specific failure modes that could guide improved agent designs.

major comments (1)
  1. [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only qualitatively (remove memorized cases, audit for image-answer leakage) with no quantitative details on the specific models used, accuracy threshold for 'memorized', number of candidates discarded, or inter-auditor agreement metrics. This is load-bearing for the central claim that the 311 retained instances require external knowledge acquisition, as residual leakage would inflate the reported 25.1% accuracy and undermine the attributed failure modes.
minor comments (2)
  1. [Abstract] Abstract and evaluation sections: the headline performance figures (25.1% best accuracy) are reported without error bars, confidence intervals, or statistical significance tests across runs or models; even a simple binomial interval over the 311 instances would help (see the sketch after this list).
  2. [Benchmark construction] The paper would benefit from an explicit table or appendix listing the frontier models used for filtering and the exact discard statistics.
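
On the error-bar point, a binomial interval over the 311 instances is easy to add. As a minimal sketch, assume the headline 25.1% corresponds to about 78 correct answers (an assumed count; the paper's exact tally is not quoted here); a Wilson 95% interval then spans roughly 20.6% to 30.2%, which would put the upper edge of the best system's uncertainty just above the 30% mark.

    # Wilson 95% confidence interval for the headline accuracy, assuming
    # ~78 of the 311 instances were answered correctly (78/311 ≈ 25.1%);
    # the exact correct count is an assumption, not a number from the paper.
    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    print(wilson_interval(78, 311))  # roughly (0.206, 0.302)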

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on benchmark construction below and will revise the manuscript accordingly to improve transparency.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only qualitatively (remove memorized cases, audit for image-answer leakage) with no quantitative details on the specific models used, accuracy threshold for 'memorized', number of candidates discarded, or inter-auditor agreement metrics. This is load-bearing for the central claim that the 311 retained instances require external knowledge acquisition, as residual leakage would inflate the reported 25.1% accuracy and undermine the attributed failure modes.

    Authors: We agree that the current description of the filtering and auditing process is insufficiently quantitative and that additional details are needed to substantiate the claim that the retained instances require external knowledge acquisition. In the revised manuscript we will expand the Benchmark Construction section with a new table and accompanying text that reports: the specific frontier closed-book models used (GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro), the exact accuracy threshold applied to flag memorized cases, the number of candidate examples discarded at each filtering stage, and inter-auditor agreement statistics (including percentage agreement and Cohen’s kappa). These additions will make the leakage-mitigation procedure fully transparent and directly address the concern that residual leakage could affect the reported accuracy and failure-mode analysis. revision: yes
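
The agreement statistics promised in this response are straightforward to compute. The sketch below assumes two auditors assigning binary keep/discard labels to each candidate; the audit design and the example labels are assumptions for illustration, not details stated in the paper.

    # Percentage agreement and Cohen's kappa for two auditors' binary
    # keep/discard labels. The label lists below are placeholders; the real
    # audit data is not released with the paper.
    def agreement_stats(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each auditor's marginal keep/discard rates.
        keep_a = labels_a.count("keep") / n
        keep_b = labels_b.count("keep") / n
        expected = keep_a * keep_b + (1 - keep_a) * (1 - keep_b)
        kappa = (observed - expected) / (1 - expected)
        return observed, kappa

    obs, kappa = agreement_stats(["keep", "keep", "discard", "keep"],
                                 ["keep", "discard", "discard", "keep"])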

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

Full rationale

The paper presents an empirical benchmark construction and evaluation study with no mathematical derivations, parameter fitting, or predictive claims that reduce to inputs by construction. All load-bearing elements are data curation steps (filtering against closed-book models and leakage auditing) and performance measurements on the resulting 311 instances; these do not invoke self-definitional loops, fitted inputs renamed as predictions, or self-citation chains that substitute for independent evidence. The central accuracy figures (25.1% best, no model >30%) are direct empirical outcomes on the curated set rather than quantities forced by prior fits or ansatzes. Per the analysis rules, this qualifies as a self-contained empirical contribution with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about effective memorization filtering and data quality auditing that are standard in benchmark papers but not independently verified here.

axioms (1)
  • domain assumption Frontier closed-book models can be used to reliably filter out memorized instances from the benchmark.
    Invoked to ensure the task tests active acquisition rather than recall.

pith-pipeline@v0.9.0 · 5504 in / 1240 out tokens · 46288 ms · 2026-05-14T20:19:47.950172+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    Fashion product images dataset

    Param Aggarwal. Fashion product images dataset. Kaggle dataset: https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset, 2026. Accessed for benchmark construction metadata

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Products-10k: A large-scale product recognition dataset

    Y Bai, Y Chen, W Yu, L Wang, and W Zhang. Products-10k: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020

  6. [6]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer, 2014

  7. [7]

    Abo: Dataset and benchmarks for real-world 3d object understanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022

  8. [8]

    The 2021 image similarity dataset and challenge

    Matthijs Douze, Giorgos Tolias, Ed Pizzi, Zoë Papakipos, Lowik Chanussot, Filip Radenovic, Tomas Jenicek, Maxim Maximov, Laura Leal-Taixé, Ismail Elezi, et al. The 2021 image similarity dataset and challenge. arXiv preprint arXiv:2106.09672, 2021

  9. [9]

    Webwatcher: Breaking new frontier of vision-language deep research agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  10. [10]

    Gemini 3.1 Flash-Lite Model Card

    Google DeepMind. Gemini 3.1 Flash-Lite Model Card. https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/, March 2026

  11. [11]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, 2024

  12. [12]

    Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

    Hulingxiao He, Zijun Geng, and Yuxin Peng. Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning. arXiv preprint arXiv:2602.07605, 2026

  13. [13]

    Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images

    Martin N Hebart, Adam H Dickter, Alexis Kidder, Wan Y Kwok, Anna Corriveau, Caitlin Van Wicklin, and Chris I Baker. Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images. PloS one, 14(10):e0223792, 2019

  14. [14]

    GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents. arXiv preprint arXiv:2604.26752, 2026

  15. [15]

    Vegfru: A domain-specific dataset for fine-grained visual categorization

    Saihui Hou, Yushan Feng, and Zilei Wang. Vegfru: A domain-specific dataset for fine-grained visual categorization. In Proceedings of the IEEE international conference on computer vision, pages 541–549, 2017

  16. [16]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024

  17. [17]

    Automatic expansion of a food image dataset leveraging existing categories with domain adaptation

    Yoshiyuki Kawano and Keiji Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In European Conference on Computer Vision, pages 3–17. Springer, 2014

  18. [18]

    Novel dataset for fine-grained image categorization: Stanford dogs

    Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR workshop on fine-grained visual categorization (FGVC), volume 2, 2011

  19. [19]

    Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models

    Jeonghwan Kim and Heng Ji. Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6187–6207, 2024

  20. [20]

    On distinguishing epistemic from pragmatic action

    David Kirsh and Paul Maglio. On distinguishing epistemic from pragmatic action. Cognitive science, 18(4):513–549, 1994

  21. [21]

    A hierarchical grocery store image dataset with visual and semantic labels

    Marcus Klasson, Cheng Zhang, and Hedvig Kjellström. A hierarchical grocery store image dataset with visual and semantic labels. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 491–500. IEEE, 2019

  22. [22]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  23. [23]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  26. [26]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

  27. [27]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  28. [28]

    MiniMax-M2.7

    MiniMax AI. MiniMax-M2.7. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026

  29. [29]

    Kimi K2.6: Advancing open-source coding

    Moonshot AI. Kimi K2.6: Advancing open-source coding. https://www.kimi.com/blog/kimi-k2-6, 2026

  30. [30]

    mineralimage5k-98

    Nech-C. mineralimage5k-98. Hugging Face dataset: https://huggingface.co/datasets/Nech-C/mineralimage5K-98, 2026. Accessed for benchmark construction metadata

  31. [31]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  32. [32]

    GPT-5 mini

    OpenAI. GPT-5 mini. https://developers.openai.com/api/docs/models/gpt-5-mini, 2025. OpenAI API model documentation

  33. [33]

    OpenClaw: Personal ai assistant

    OpenClaw Team. OpenClaw: Personal ai assistant. https://github.com/openclaw/openclaw, 2026

  34. [34]

    OpenCode: The open source ai coding agent

    OpenCode Team. OpenCode: The open source ai coding agent. https://opencode.ai/, 2026

  35. [35]

    Towards fine-grained recognition with large visual language models: Benchmark and optimization strategies

    Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, and Xin Lou. Towards fine-grained recognition with large visual language models: Benchmark and optimization strategies. arXiv preprint arXiv:2512.10384, 2025

  36. [36]

    Information foraging

    Peter Pirolli and Stuart Card. Information foraging. Psychological review, 106(4):643, 1999

  37. [37]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, February 2026

  38. [38]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents

    Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475, 2025

  39. [39]

    The metropolitan museum of art open access

    The Metropolitan Museum of Art. The metropolitan museum of art open access. https://github.com/metmuseum/openaccess, 2026. Accessed for benchmark construction metadata

  40. [40]

    Mmina: Benchmarking multihop multimodal internet agents

    Shulin Tian, Ziniu Zhang, Liang-Yu Chen, and Ziwei Liu. Mmina: Benchmarking multihop multimodal internet agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13682–13697, 2025

  41. [41]

    Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

    Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, 2015

  42. [42]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018

  43. [43]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  44. [44]

    MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

    Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, and Mohit Bansal. Merrin: A benchmark for multimodal evidence retrieval and reasoning in noisy web environments. arXiv preprint arXiv:2604.13418, 2026

  45. [45]

    Google landmarks dataset v2 – a large-scale benchmark for instance-level recognition and retrieval

    Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2 – a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2575–2584, 2020

  46. [46]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  47. [47]

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    Hong-Tao Yu, Yuxin Peng, Serge Belongie, and Xiu-Shen Wei. Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation. arXiv preprint arXiv:2504.14988, 2025

  48. [48]

    Worldvqa: Measuring atomic world knowledge in multimodal large language models

    Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, et al. Worldvqa: Measuring atomic world knowledge in multimodal large language models. arXiv preprint arXiv:2602.02537, 2026
