pith. machine review for the scientific record.

arxiv: 2605.13193 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained · recognition · acquisition · knowledge · evidence · models · agent · closed-book

The pith

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-grained recognition in daily life often requires more than looking at a picture and naming the object. People search for details, compare features, and check evidence when they encounter something unfamiliar. Existing AI benchmarks mostly test closed-book classification from training data. The authors created FIKA-Bench with 311 public-source examples where the correct answer depends on external evidence. They filtered out cases that frontier closed-book models already knew and removed any image-answer leakage. Evaluation of recent large multimodal models and agents that can use tools found the top accuracy at 25.1 percent, with no system above 30 percent. The main problems were retrieving the wrong entities and making incorrect judgments about visual details even when tools were available. This indicates that simply adding search capabilities does not solve the task.
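
A minimal sketch of what such a leakage-aware filter could look like is below; the function names, the closed-book query interface, and the retry count are illustrative assumptions, not the authors' released pipeline.

    # Illustrative sketch of a leakage-aware filtering pass (hypothetical
    # interface, not the authors' released code). A candidate is kept only if
    # no frontier closed-book model answers it correctly and the auditors find
    # no image-answer leakage.
    def filter_candidates(candidates, closed_book_models, is_correct,
                          audit_leakage, n_trials=3):
        kept = []
        for example in candidates:
            # Drop the example if any closed-book model answers correctly
            # without external tools (treated as memorized knowledge).
            memorized = any(
                is_correct(model.answer(example["image"], example["question"]),
                           example["answer"])
                for model in closed_book_models
                for _ in range(n_trials)
            )
            if memorized:
                continue
            # Drop the example if the answer can be read off the image itself.
            if audit_leakage(example["image"], example["answer"]):
                continue
            kept.append(example)
        return kept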

Core claim

Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement.

Load-bearing premise

That the filtering against frontier closed-book models successfully removes all memorized cases and that the 311 instances have no image-answer leakage while remaining representative of real-life fine-grained recognition scenarios.

Figures

Figures reproduced from arXiv: 2605.13193 by Geng Li, Yuxin Peng.

Figure 1. Overview of FIKA-Bench: 311 evidence-grounded fine-grained instances from public-source and real-life images across Product, Nature, Transport, and Culture, evaluating whether models can acquire external fine-grained knowledge to recognize unseen categories.
Figure 2. Comparison between FIKA-Bench and existing fine-grained recognition benchmarks. FIKA-Bench evaluates evidence-grounded fine-grained knowledge acquisition rather than only closed-set or open-ended recognition from the image and model parameters. Evaluation of closed-book models and tool-enabled agents shows that FIKA-Bench remains a formidable challenge; the best-performing system achieves only 25.1% accuracy.
Figure 3. End-to-end agent runtime distribution by top-level category. Boxes show the interquartile range.
Figure 4. Error taxonomy for current agent methods. Each panel reports the distribution over non…
Figure 5. Two-level taxonomy distribution in FIKA-Bench. Inner ring: top-level categories; outer ring: mid-level categories.
Figure 6. Representative FIKA-Bench samples grouped by top-level category. Each row contains one public-data example and one real-life example from the same top-level category; each panel reports the top-level category, mid-level category, source partition, question, verified answer, and full evidence link.
Original abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FIKA-Bench, a leakage-aware benchmark of 311 public-source and real-life instances for fine-grained knowledge acquisition. Systems must actively seek, verify, and use external evidence to answer open-ended questions, rather than relying on closed-book recognition. After filtering examples against frontier closed-book models to remove memorized cases and auditing to eliminate image-answer leakage, evaluations of the latest LMMs and tool-equipped agents show that the best system reaches only 25.1% accuracy (no model exceeds 30%), with failures driven primarily by wrong entity retrieval and poor visual judgment.

Significance. If the filtering and auditing process is shown to be effective, the benchmark would be a valuable contribution by shifting evaluation from passive visual recognition to active external knowledge acquisition in everyday fine-grained scenarios. It provides concrete evidence of current LMM and agent limitations and points to specific failure modes that could guide improved agent designs.

major comments (1)
  1. [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only qualitatively (remove memorized cases, audit for image-answer leakage) with no quantitative details on the specific models used, accuracy threshold for 'memorized', number of candidates discarded, or inter-auditor agreement metrics. This is load-bearing for the central claim that the 311 retained instances require external knowledge acquisition, as residual leakage would inflate the reported 25.1% accuracy and undermine the attributed failure modes.
minor comments (2)
  1. [Abstract] Abstract and evaluation sections: the headline performance figures (25.1% best accuracy) are reported without error bars, confidence intervals, or statistical significance tests across runs or models; even a simple binomial interval over the 311 instances would help (see the sketch after this list).
  2. [Benchmark construction] The paper would benefit from an explicit table or appendix listing the frontier models used for filtering and the exact discard statistics.
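
On the error-bar point, a binomial interval over the 311 instances is easy to add. As a minimal sketch, assume the headline 25.1% corresponds to about 78 correct answers (an assumed count; the paper's exact tally is not quoted here); a Wilson 95% interval then spans roughly 20.6% to 30.2%, which would put the upper edge of the best system's uncertainty just above the 30% mark.

    # Wilson 95% confidence interval for the headline accuracy, assuming
    # ~78 of the 311 instances were answered correctly (78/311 ≈ 25.1%);
    # the exact correct count is an assumption, not a number from the paper.
    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    print(wilson_interval(78, 311))  # roughly (0.206, 0.302)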

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on benchmark construction below and will revise the manuscript accordingly to improve transparency.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only qualitatively (remove memorized cases, audit for image-answer leakage) with no quantitative details on the specific models used, accuracy threshold for 'memorized', number of candidates discarded, or inter-auditor agreement metrics. This is load-bearing for the central claim that the 311 retained instances require external knowledge acquisition, as residual leakage would inflate the reported 25.1% accuracy and undermine the attributed failure modes.

    Authors: We agree that the current description of the filtering and auditing process is insufficiently quantitative and that additional details are needed to substantiate the claim that the retained instances require external knowledge acquisition. In the revised manuscript we will expand the Benchmark Construction section with a new table and accompanying text that reports: the specific frontier closed-book models used (GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro), the exact accuracy threshold applied to flag memorized cases, the number of candidate examples discarded at each filtering stage, and inter-auditor agreement statistics (including percentage agreement and Cohen’s kappa). These additions will make the leakage-mitigation procedure fully transparent and directly address the concern that residual leakage could affect the reported accuracy and failure-mode analysis. revision: yes
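
The agreement statistics promised in this response are straightforward to compute. The sketch below assumes two auditors assigning binary keep/discard labels to each candidate; the audit design and the example labels are assumptions for illustration, not details stated in the paper.

    # Percentage agreement and Cohen's kappa for two auditors' binary
    # keep/discard labels. The label lists below are placeholders; the real
    # audit data is not released with the paper.
    def agreement_stats(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each auditor's marginal keep/discard rates.
        keep_a = labels_a.count("keep") / n
        keep_b = labels_b.count("keep") / n
        expected = keep_a * keep_b + (1 - keep_a) * (1 - keep_b)
        kappa = (observed - expected) / (1 - expected)
        return observed, kappa

    obs, kappa = agreement_stats(["keep", "keep", "discard", "keep"],
                                 ["keep", "discard", "discard", "keep"])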

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

Full rationale

The paper presents an empirical benchmark construction and evaluation study with no mathematical derivations, parameter fitting, or predictive claims that reduce to inputs by construction. All load-bearing elements are data curation steps (filtering against closed-book models and leakage auditing) and performance measurements on the resulting 311 instances; these do not invoke self-definitional loops, fitted inputs renamed as predictions, or self-citation chains that substitute for independent evidence. The central accuracy figures (25.1% best, no model >30%) are direct empirical outcomes on the curated set rather than quantities forced by prior fits or ansatzes. Per the analysis rules, this qualifies as a self-contained empirical contribution with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about effective memorization filtering and data quality auditing that are standard in benchmark papers but not independently verified here.

axioms (1)
  • domain assumption Frontier closed-book models can be used to reliably filter out memorized instances from the benchmark.
    Invoked to ensure the task tests active acquisition rather than recall.

pith-pipeline@v0.9.0 · 5504 in / 1240 out tokens · 46288 ms · 2026-05-14T20:19:47.950172+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    Fashion product images dataset

    Param Aggarwal. Fashion product images dataset. Kaggle dataset: https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset, 2026. Accessed for benchmark construction metadata

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Products-10k: A large-scale product recognition dataset

    Y Bai, Y Chen, W Yu, L Wang, and W Zhang. Products-10k: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020

  6. [6]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer, 2014

  7. [7]

    Abo: Dataset and benchmarks for real-world 3d object understanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022

  8. [8]

    The 2021 image similarity dataset and challenge

    Matthijs Douze, Giorgos Tolias, Ed Pizzi, Zoë Papakipos, Lowik Chanussot, Filip Radenovic, Tomas Jenicek, Maxim Maximov, Laura Leal-Taixé, Ismail Elezi, et al. The 2021 image similarity dataset and challenge. arXiv preprint arXiv:2106.09672, 2021

  9. [9]

    Webwatcher: Breaking new frontier of vision-language deep research agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  10. [10]

    Gemini 3.1 Flash-Lite Model Card

    Google DeepMind. Gemini 3.1 Flash-Lite Model Card. https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/, March 2026

  11. [11]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, 2024

  12. [12]

    Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

    Hulingxiao He, Zijun Geng, and Yuxin Peng. Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning. arXiv preprint arXiv:2602.07605, 2026

  13. [13]

    Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images

    Martin N Hebart, Adam H Dickter, Alexis Kidder, Wan Y Kwok, Anna Corriveau, Caitlin Van Wicklin, and Chris I Baker. Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images. PloS one, 14(10):e0223792, 2019

  14. [14]

    GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents. arXiv preprint arXiv:2604.26752, 2026

  15. [15]

    Vegfru: A domain-specific dataset for fine-grained visual categorization

    Saihui Hou, Yushan Feng, and Zilei Wang. Vegfru: A domain-specific dataset for fine-grained visual categorization. In Proceedings of the IEEE international conference on computer vision, pages 541–549, 2017

  16. [16]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024

  17. [17]

    Automatic expansion of a food image dataset leveraging existing categories with domain adaptation

    Yoshiyuki Kawano and Keiji Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In European Conference on Computer Vision, pages 3–17. Springer, 2014

  18. [18]

    Novel dataset for fine-grained image categorization: Stanford dogs

    Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR workshop on fine-grained visual categorization (FGVC), volume 2, 2011

  19. [19]

    Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models

    Jeonghwan Kim and Heng Ji. Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6187–6207, 2024

  20. [20]

    On distinguishing epistemic from pragmatic action

    David Kirsh and Paul Maglio. On distinguishing epistemic from pragmatic action. Cognitive science, 18(4):513–549, 1994

  21. [21]

    A hierarchical grocery store image dataset with visual and semantic labels

    Marcus Klasson, Cheng Zhang, and Hedvig Kjellström. A hierarchical grocery store image dataset with visual and semantic labels. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 491–500. IEEE, 2019

  22. [22]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  23. [23]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  26. [26]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

  27. [27]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  28. [28]

    MiniMax-M2.7

    MiniMax AI. MiniMax-M2.7. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026

  29. [29]

    Kimi K2.6: Advancing open-source coding

    Moonshot AI. Kimi K2.6: Advancing open-source coding. https://www.kimi.com/blog/kimi-k2-6, 2026

  30. [30]

    mineralimage5k-98

    Nech-C. mineralimage5k-98. Hugging Face dataset: https://huggingface.co/datasets/Nech-C/mineralimage5K-98, 2026. Accessed for benchmark construction metadata

  31. [31]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  32. [32]

    GPT-5 mini

    OpenAI. GPT-5 mini. https://developers.openai.com/api/docs/models/gpt-5-mini, 2025. OpenAI API model documentation

  33. [33]

    OpenClaw: Personal ai assistant

    OpenClaw Team. OpenClaw: Personal ai assistant. https://github.com/openclaw/openclaw, 2026

  34. [34]

    OpenCode: The open source ai coding agent

    OpenCode Team. OpenCode: The open source ai coding agent. https://opencode.ai/, 2026

  35. [35]

    Towards fine-grained recognition with large visual language models: Benchmark and optimization strategies

    Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, and Xin Lou. Towards fine-grained recognition with large visual language models: Benchmark and optimization strategies. arXiv preprint arXiv:2512.10384, 2025

  36. [36]

    Information foraging

    Peter Pirolli and Stuart Card. Information foraging. Psychological review, 106(4):643, 1999

  37. [37]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, February 2026

  38. [38]

    Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents

    Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475, 2025

  39. [39]

    The metropolitan museum of art open access

    The Metropolitan Museum of Art. The metropolitan museum of art open access. https://github.com/metmuseum/openaccess, 2026. Accessed for benchmark construction metadata

  40. [40]

    Mmina: Benchmarking multihop multimodal internet agents

    Shulin Tian, Ziniu Zhang, Liang-Yu Chen, and Ziwei Liu. Mmina: Benchmarking multihop multimodal internet agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13682–13697, 2025

  41. [41]

    Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

    Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, 2015

  42. [42]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018

  43. [43]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  44. [44]

    MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

    Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, and Mohit Bansal. Merrin: A benchmark for multimodal evidence retrieval and reasoning in noisy web environments. arXiv preprint arXiv:2604.13418, 2026

  45. [45]

    Google landmarks dataset v2 – a large-scale benchmark for instance-level recognition and retrieval

    Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2 – a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2575–2584, 2020

  46. [46]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  47. [47]

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    Hong-Tao Yu, Yuxin Peng, Serge Belongie, and Xiu-Shen Wei. Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation. arXiv preprint arXiv:2504.14988, 2025

  48. [48]

    Worldvqa: Measuring atomic world knowledge in multimodal large language models

    Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, et al. Worldvqa: Measuring atomic world knowledge in multimodal large language models. arXiv preprint arXiv:2602.02537, 2026
