
arxiv: 2605.13193 · v2 · pith:F54MK2JLnew · submitted 2026-05-13 · 💻 cs.CV

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Pith reviewed 2026-05-20 21:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained recognition · knowledge acquisition · large multimodal models · AI agents · benchmarks · visual judgment · entity retrieval · evidence verification

The pith

Current AI systems top out at 25.1 percent accuracy on active fine-grained knowledge acquisition even when equipped with tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FIKA-Bench to test whether systems can actively search external sources, compare visual details, and verify evidence when answering open-ended questions about unfamiliar fine-grained objects. Existing benchmarks focus on closed-book visual recognition, but this work shows that the real-world process of knowledge acquisition remains unsolved. Evaluations of recent large multimodal models and agents find the best performer reaches only 25.1 percent accuracy, with no system above 30 percent. Failures trace mainly to incorrect entity retrieval and weak visual judgment rather than lack of tools. A reader would care because this gap affects how AI could handle everyday encounters with new objects the way people do.

Core claim

FIKA-Bench is a leakage-aware collection of 311 public-source and real-life instances where a system must seek, verify, and apply external evidence to answer open-ended fine-grained recognition questions. Every example was filtered against frontier closed-book models to exclude memorized cases and audited for image-answer leakage, keeping only samples backed by verified evidence. On this benchmark the best evaluated large multimodal model or agent reaches 25.1 percent accuracy and no system exceeds 30 percent. Agent failures arise predominantly from wrong entity retrieval and poor visual judgment, showing that simply providing tools does not close the performance gap.
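
The retention logic described above can be sketched as a three-stage filter. This is a minimal illustration, not the authors' pipeline: the `Candidate` fields, the closed-book model callables, and the rule that a single correct closed-book answer marks a case as memorized are all assumptions, since the operational thresholds are not given in the text reviewed here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Candidate:
    """Hypothetical record for one candidate benchmark instance."""
    image_id: str
    question: str
    answer: str
    evidence_verified: bool        # a reviewer confirmed the evidence supports the answer
    answer_visible_in_image: bool  # e.g. a label or watermark reveals the answer

# A closed-book model is modeled as a function from a candidate to an answer string.
ClosedBookModel = Callable[[Candidate], str]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def is_memorized(c: Candidate, models: Iterable[ClosedBookModel]) -> bool:
    """Assumed rule: memorized if ANY closed-book model answers correctly."""
    return any(normalize(m(c)) == normalize(c.answer) for m in models)

def filter_candidates(cands: Iterable[Candidate],
                      models: List[ClosedBookModel]) -> List[Candidate]:
    """Keep only candidates that pass all three checks."""
    kept = []
    for c in cands:
        if is_memorized(c, models):
            continue  # answerable without external evidence
        if c.answer_visible_in_image:
            continue  # image-answer leakage
        if not c.evidence_verified:
            continue  # no verified supporting evidence
        kept.append(c)
    return kept
```

Calling `filter_candidates(pool, [model_a, model_b])` with stand-in model functions returns the retained subset. A stricter or looser memorization rule (for example, a majority of models, or several trials per model) would change which cases survive, which is why the exact criterion matters for interpreting the headline accuracy.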

What carries the argument

FIKA-Bench, an evidence-grounded benchmark that requires active external search, visual comparison, and verification for open-ended fine-grained recognition questions.

If this is right

  • Reliable knowledge acquisition will require agent designs that explicitly strengthen fine-grained visual judgment and precise entity retrieval rather than generic tool access.
  • Scaling models alone or adding broad search tools will not suffice to close the observed performance gap.
  • Progress on this capability will be measurable only with leakage-aware, evidence-supported benchmarks that block closed-book shortcuts.
  • Systems that succeed here could handle real-world encounters with unfamiliar objects by gathering and verifying evidence on demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same failure modes may appear in other open-world tasks where models must distinguish subtle variants without prior training examples.
  • Combining stronger specialized visual encoders with retrieval systems tuned for fine-grained attributes could be a direct next step to test.
  • Extending the benchmark to dynamic, multi-turn evidence gathering might expose additional bottlenecks in current agent loops.

Load-bearing premise

Every benchmark example has been filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, so that retained samples truly require active evidence gathering.

What would settle it

A new agent or model that scores above 50 percent on the full FIKA-Bench set without any signs of memorization or leakage would falsify the claim that the task remains a formidable challenge for current approaches.
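
To put the 50 percent threshold in context, the sampling uncertainty of the reported best score can be estimated. The sketch below assumes accuracy is a simple fraction of the 311 instances scored as binary correct or incorrect, so that 25.1 percent corresponds to 78 correct answers; it then computes a 95 percent Wilson score interval. Both the binary-scoring assumption and the choice of interval are illustrative and do not come from the paper.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

N = 311                            # benchmark size stated in the paper
best_correct = round(0.251 * N)    # assumed count behind the reported 25.1%
low, high = wilson_interval(best_correct, N)

print(f"{best_correct}/{N} correct, 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 21 to 30 percent, so a system scoring above 50 percent on the full set would sit far outside the sampling noise of the current best result.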

Figures

Figures reproduced from arXiv: 2605.13193 by Geng Li, Yuxin Peng.

Figure 1: Overview of FIKA-BENCH: 311 evidence-grounded fine-grained instances from public-source and real-life images across Product, Nature, Transport, and Culture, evaluating whether models can acquire external fine-grained knowledge to recognize unseen categories.
Figure 2: Comparison between FIKA-BENCH and existing fine-grained recognition benchmarks. FIKA-BENCH evaluates evidence-grounded fine-grained knowledge acquisition rather than only closed-set or open-ended recognition from the image and model parameters.
Figure 3: End-to-end agent runtime distribution by top-level category. Boxes show the interquartile…
Figure 4: Error taxonomy for current agent methods. Each panel reports the distribution over non…
Figure 5: Two-level taxonomy distribution in FIKA-BENCH. Inner ring: top-level categories; outer ring: mid-level categories.
Figure 6: Representative FIKA-BENCH samples grouped by top-level category. Each row contains one public-data example and one real-life example from the same top-level category; each panel reports the top-level category, mid-level category, source partition, question, verified answer, and full evidence link.
Original abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FIKA-Bench, a leakage-aware benchmark of 311 public-source and real-life fine-grained recognition instances that require active external knowledge acquisition rather than closed-book memorization. Every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, and only samples with verified evidence are retained. Evaluation of the latest LMMs and tool-equipped agents shows the best system reaches only 25.1% accuracy, with no model exceeding 30%; failures are attributed primarily to wrong entity retrieval and poor visual judgement, indicating that simply adding tools is insufficient for reliable fine-grained knowledge acquisition.

Significance. If the filtering and auditing steps are effective, the work provides a useful, evidence-grounded testbed that exposes a clear performance ceiling in current LMM agents on active knowledge-acquisition tasks. The empirical demonstration that tool use alone does not close the gap, together with the failure-mode analysis, supplies concrete directions for improving retrieval and visual-judgement components in agent designs.

major comments (1)
  1. [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only at a high level (models used, exact decision thresholds for declaring a case memorized, and the concrete audit steps for image-answer leakage are not specified). Because the headline claim that the 311 examples measure active acquisition rather than residual memorization rests on this filtering being complete, the absence of these operational details is load-bearing for the central interpretation of the 25.1% ceiling.
minor comments (2)
  1. [Experiments] The exact accuracy metric (exact match vs. semantic equivalence) and question format should be stated explicitly in the experimental setup so that the 25.1% figure can be reproduced and compared with other benchmarks.
  2. [Benchmark construction] The modest post-filtering size of 311 examples would benefit from a short discussion of how the retained set relates to the original candidate pool in terms of category coverage and difficulty distribution.
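
The first minor comment matters because the choice of scorer can move the reported number. The sketch below contrasts strict exact match with a normalized match that ignores case, punctuation, and spacing and accepts optional aliases. The function names, the normalization rules, and the sample predictions are all hypothetical; the paper's actual metric is not stated in the text reviewed here.

```python
import re
from typing import Callable, Iterable, Sequence

def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    """Strict string equality."""
    return pred == gold

def normalized_match(pred: str, gold: str, aliases: Iterable[str] = ()) -> bool:
    """Equality after normalization, also accepting any listed alias."""
    targets = {normalize(gold), *(normalize(a) for a in aliases)}
    return normalize(pred) in targets

def accuracy(preds: Sequence[str], golds: Sequence[str],
             scorer: Callable[[str, str], bool]) -> float:
    """Fraction of predictions the scorer accepts."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("preds and golds must be non-empty and equal length")
    return sum(scorer(p, g) for p, g in zip(preds, golds)) / len(preds)

# Made-up predictions against a single made-up gold answer.
golds = ["Canon AE-1 Program"] * 3
preds = ["Canon AE-1 Program", "canon ae 1 program.", "Nikon FM2"]

strict = accuracy(preds, golds, exact_match)
lenient = accuracy(preds, golds, normalized_match)
```

Here the same three predictions score differently under the two scorers, which is the reproducibility concern: a single accuracy figure is only comparable across benchmarks once the matching rule is fixed.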

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the changes we will make in revision.

  1. Referee: [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only at a high level (models used, exact decision thresholds for declaring a case memorized, and the concrete audit steps for image-answer leakage are not specified). Because the headline claim that the 311 examples measure active acquisition rather than residual memorization rests on this filtering being complete, the absence of these operational details is load-bearing for the central interpretation of the 25.1% ceiling.

    Authors: We agree that the Benchmark Construction section currently provides only a high-level description of the filtering and auditing procedures. In the revised manuscript we will expand this section to specify: (1) the exact frontier closed-book models employed (including model versions and access dates), (2) the precise decision thresholds and criteria used to classify an instance as memorized (e.g., correct answer without external evidence, confidence thresholds, or number of trials), and (3) the concrete audit protocol for image-answer leakage, including the manual verification steps, inter-annotator agreement measures, and retention criteria for verified-evidence samples. These additions will make the leakage-prevention process fully reproducible and will strengthen the interpretation of the reported performance ceiling. revision: yes
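
As one example of the inter-annotator agreement measures mentioned in the response, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is a plain-Python illustration; the response does not say which measure would be reported, and the `keep`/`drop` labels are hypothetical audit decisions.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical leakage-audit decisions from two annotators on four samples.
annotator_1 = ["keep", "keep", "drop", "drop"]
annotator_2 = ["keep", "keep", "drop", "keep"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance, so reporting such a figure alongside the audit protocol would let readers judge how consistently leakage was identified.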

Circularity Check

0 steps flagged

Empirical benchmark construction with no derivation chain or self-referential predictions

The paper presents FIKA-Bench as an empirical collection of 311 instances, with filtering against external frontier closed-book models and evaluation of independent LMMs and agents. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. Central results (e.g., 25.1% accuracy) are direct measurements against external systems and data, rendering the work self-contained without any reduction of claims to their own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the described filtering successfully isolates cases requiring genuine external knowledge acquisition and that the 311 instances are free of leakage.

axioms (2)
  • domain assumption: Filtering against frontier closed-book models removes all memorized cases.
    Invoked in the abstract to justify retaining only samples that require external evidence.
  • domain assumption: Auditing eliminates image-answer leakage.
    Stated as the method to ensure instances are supported only by verified external evidence.

pith-pipeline@v0.9.0 · 5735 in / 1462 out tokens · 58265 ms · 2026-05-20T21:57:05.977049+00:00 · methodology
