FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Geng Li; Yuxin Peng

arxiv: 2605.13193 · v2 · pith:F54MK2JLnew · submitted 2026-05-13 · 💻 cs.CV

Pith reviewed 2026-05-20 21:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords fine-grained recognition · knowledge acquisition · large multimodal models · AI agents · benchmarks · visual judgment · entity retrieval · evidence verification

The pith

Current AI systems top out at 25.1 percent accuracy on active fine-grained knowledge acquisition even when equipped with tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FIKA-Bench to test whether systems can actively search external sources, compare visual details, and verify evidence when answering open-ended questions about unfamiliar fine-grained objects. Existing benchmarks focus on closed-book visual recognition, but this work shows that the real-world process of knowledge acquisition remains unsolved. Evaluations of recent large multimodal models and agents find the best performer reaches only 25.1 percent accuracy, with no system above 30 percent. Failures trace mainly to incorrect entity retrieval and weak visual judgment rather than lack of tools. A reader would care because this gap affects how AI could handle everyday encounters with new objects the way people do.

Core claim

FIKA-Bench is a leakage-aware collection of 311 public-source and real-life instances in which a system must seek, verify, and apply external evidence to answer open-ended fine-grained recognition questions. Every example was filtered against frontier closed-book models to exclude memorized cases and audited for image-answer leakage, keeping only samples backed by verified evidence. On this benchmark the best evaluated large multimodal model or agent reaches 25.1 percent accuracy and no system exceeds 30 percent. Agent failures arise predominantly from wrong entity retrieval and poor visual judgement, showing that simply providing tools does not close the performance gap.

What carries the argument

FIKA-Bench, an evidence-grounded benchmark that requires active external search, visual comparison, and verification for open-ended fine-grained recognition questions.

If this is right

Reliable knowledge acquisition will require agent designs that explicitly strengthen fine-grained visual judgment and precise entity retrieval rather than generic tool access.
Scaling models alone or adding broad search tools will not suffice to close the observed performance gap.
Progress on this capability will be measurable only with leakage-aware, evidence-supported benchmarks that block closed-book shortcuts.
Systems that succeed here could handle real-world encounters with unfamiliar objects by gathering and verifying evidence on demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same failure modes may appear in other open-world tasks where models must distinguish subtle variants without prior training examples.
Combining stronger specialized visual encoders with retrieval systems tuned for fine-grained attributes could be a direct next step to test.
Extending the benchmark to dynamic, multi-turn evidence gathering might expose additional bottlenecks in current agent loops.

Load-bearing premise

Every benchmark example has been filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, so that retained samples truly require active evidence gathering.

What would settle it

A new agent or model that scores above 50 percent on the full FIKA-Bench set without any signs of memorization or leakage would falsify the claim that the task remains a formidable challenge for current approaches.

Figures

Figures reproduced from arXiv: 2605.13193 by Geng Li, Yuxin Peng.

**Figure 1.** Overview of FIKA-BENCH: 311 evidence-grounded fine-grained instances from public-source and real-life images across Product, Nature, Transport, and Culture, evaluating whether models can acquire external fine-grained knowledge to recognize unseen categories.

**Figure 2.** Comparison between FIKA-BENCH and existing fine-grained recognition benchmarks. FIKA-BENCH evaluates evidence-grounded fine-grained knowledge acquisition rather than only closed-set or open-ended recognition from the image and model parameters.

**Figure 3.** End-to-end agent runtime distribution by top-level category. Boxes show the interquartile …

**Figure 4.** Error taxonomy for current agent methods. Each panel reports the distribution over non …

**Figure 5.** Two-level taxonomy distribution in FIKA-BENCH. Inner ring: top-level categories; outer ring: mid-level categories.

**Figure 6.** Representative FIKA-BENCH samples grouped by top-level category. Each row contains one public-data example and one real-life example from the same top-level category; each panel reports the top-level category, mid-level category, source partition, question, verified answer, and full evidence link.

Original abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FIKA-Bench shows a real performance gap on active fine-grained knowledge acquisition even with tools, but the small filtered set and thin filtering details leave the exact size of that gap open to some doubt.

Two things to know: on this benchmark even the strongest multimodal agents reach only 25.1% on fine-grained questions that require external evidence, and adding tools does not fix the main problems of bad retrieval and weak visual judgment. The new contribution is the leakage-aware filtering process, which removes memorized examples by testing against closed-book models, combined with a focus on open-ended, evidence-grounded questions for fine-grained recognition. This framing is distinct from prior closed-book benchmarks. The paper does a good job of evaluating recent LMMs and agents on this setup and provides an error analysis that highlights specific failure points.

The soft spots are the modest size of 311 examples after filtering and the limited information on the precise filtering procedure and audit steps. The latter makes it harder to rule out residual leakage, which could affect how much the results reflect agent limitations rather than benchmark construction. The representativeness of the sample is also not deeply tested.

This work is for researchers in multimodal AI who are interested in benchmarks that test active knowledge acquisition rather than pure recognition. A reader working on agent design or evaluation would get value from the failure-mode insights. The central argument about the need for better agent designs holds up based on the reported results. I would recommend it for peer review, as the benchmark idea is worth refining and discussing in the community.
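To put a number on the "modest size" concern, the sketch below computes a 95% Wilson score interval for a single accuracy figure on 311 items. The count of 78 correct answers is an assumption, chosen only because 78/311 rounds to the reported 25.1%; the paper text quoted here does not state raw counts, and the function name is illustrative.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


N_INSTANCES = 311      # benchmark size stated in the paper
ASSUMED_CORRECT = 78   # assumption: 78/311 rounds to the reported 25.1%

low, high = wilson_interval(ASSUMED_CORRECT, N_INSTANCES)
print(f"accuracy = {ASSUMED_CORRECT / N_INSTANCES:.1%}")
print(f"95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

Under that assumption the interval spans roughly ten percentage points, which suggests that small differences between systems on a set of this size should be read with some caution.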

Referee Report

1 major / 2 minor

Summary. The paper introduces FIKA-Bench, a leakage-aware benchmark of 311 public-source and real-life fine-grained recognition instances that require active external knowledge acquisition rather than closed-book memorization. Every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage while retaining only verified-evidence samples. Evaluation of the latest LMMs and tool-equipped agents shows the best system reaches only 25.1% accuracy with no model exceeding 30%; failures are attributed primarily to wrong entity retrieval and poor visual judgement, indicating that simply adding tools is insufficient for reliable fine-grained knowledge acquisition.

Significance. If the filtering and auditing steps are effective, the work provides a useful, evidence-grounded testbed that exposes a clear performance ceiling in current LMM agents on active knowledge-acquisition tasks. The empirical demonstration that tool use alone does not close the gap, together with the failure-mode analysis, supplies concrete directions for improving retrieval and visual-judgement components in agent designs.

major comments (1)

[Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only at a high level (models used, exact decision thresholds for declaring a case memorized, and the concrete audit steps for image-answer leakage are not specified). Because the headline claim that the 311 examples measure active acquisition rather than residual memorization rests on this filtering being complete, the absence of these operational details is load-bearing for the central interpretation of the 25.1% ceiling.

minor comments (2)

[Experiments] The exact accuracy metric (exact match vs. semantic equivalence) and question format should be stated explicitly in the experimental setup so that the 25.1% figure can be reproduced and compared with other benchmarks.
[Benchmark construction] The modest post-filtering size of 311 examples would benefit from a short discussion of how the retained set relates to the original candidate pool in terms of category coverage and difficulty distribution.
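The first minor comment turns on how an open-ended answer is scored. As a hedged illustration of why the choice matters, the sketch below compares a strict exact-match scorer with a normalized one on toy data. Both scorers, the `normalize` rules, and the example pairs are hypothetical; none of this is the paper's metric, which the report says is not stated.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, answer: str) -> bool:
    """Strict scorer: strings must agree character for character."""
    return prediction.strip() == answer.strip()


def normalized_match(prediction: str, answer: str) -> bool:
    """Lenient scorer: strings must agree after normalization."""
    return normalize(prediction) == normalize(answer)


def accuracy(pairs, match) -> float:
    """Fraction of (prediction, answer) pairs accepted by the scorer."""
    pairs = list(pairs)
    return sum(match(p, a) for p, a in pairs) / len(pairs) if pairs else 0.0


# Hypothetical (prediction, answer) pairs, for illustration only.
PAIRS = [
    ("Canon EOS R5", "Canon EOS R5"),
    ("Hawker 800XP", "hawker 800xp"),
    ("Boeing 737-800", "Boeing 737 800"),
    ("Airbus A320", "Boeing 737-800"),
]

print("exact:", accuracy(PAIRS, exact_match))
print("normalized:", accuracy(PAIRS, normalized_match))
```

On these four toy pairs the two scorers disagree on two items, so the same system output can yield quite different headline numbers depending on the matching rule.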

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the changes we will make in revision.

Referee: [Benchmark construction] Benchmark construction section: the filtering procedure against frontier closed-book models is described only at a high level (models used, exact decision thresholds for declaring a case memorized, and the concrete audit steps for image-answer leakage are not specified). Because the headline claim that the 311 examples measure active acquisition rather than residual memorization rests on this filtering being complete, the absence of these operational details is load-bearing for the central interpretation of the 25.1% ceiling.

Authors: We agree that the Benchmark Construction section currently provides only a high-level description of the filtering and auditing procedures. In the revised manuscript we will expand this section to specify: (1) the exact frontier closed-book models employed (including model versions and access dates), (2) the precise decision thresholds and criteria used to classify an instance as memorized (e.g., correct answer without external evidence, confidence thresholds, or number of trials), and (3) the concrete audit protocol for image-answer leakage, including the manual verification steps, inter-annotator agreement measures, and retention criteria for verified-evidence samples. These additions will make the leakage-prevention process fully reproducible and will strengthen the interpretation of the reported performance ceiling. revision: yes
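One way the promised memorization criterion could be made operational is sketched below: an instance is dropped if any closed-book model answers it correctly in at least `min_hits` of `trials` runs. This is only an assumption about what such a rule might look like. The `Instance` fields, the `trials` and `min_hits` parameters, the string comparison, and the stub model are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: a closed-book model maps (image_id, question) to an answer.
ClosedBookModel = Callable[[str, str], str]


@dataclass(frozen=True)
class Instance:
    image_id: str
    question: str
    answer: str


def is_memorized(inst: Instance, models: Dict[str, ClosedBookModel],
                 trials: int = 3, min_hits: int = 1) -> bool:
    """Flag an instance if any model is correct in >= min_hits of `trials` closed-book runs."""
    target = inst.answer.strip().lower()
    for model in models.values():
        hits = sum(
            model(inst.image_id, inst.question).strip().lower() == target
            for _ in range(trials)
        )
        if hits >= min_hits:
            return True
    return False


def filter_benchmark(candidates: List[Instance], models: Dict[str, ClosedBookModel],
                     trials: int = 3, min_hits: int = 1) -> List[Instance]:
    """Keep only candidates that no closed-book model appears to have memorized."""
    return [c for c in candidates if not is_memorized(c, models, trials, min_hits)]


def stub_model(image_id: str, question: str) -> str:
    """Stand-in for a real model call; it 'knows' exactly one image."""
    known = {"img-1": "Example Widget A"}
    return known.get(image_id, "unknown")


candidates = [
    Instance("img-1", "What model is this?", "Example Widget A"),
    Instance("img-2", "What model is this?", "Example Widget B"),
]
kept = filter_benchmark(candidates, {"stub": stub_model})
print([c.image_id for c in kept])
```

Reporting the equivalents of `trials`, `min_hits`, and the answer-matching rule is the kind of detail the referee asks for, since each choice changes which instances survive the filter.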

Circularity Check

0 steps flagged

Empirical benchmark construction with no derivation chain or self-referential predictions

The paper presents FIKA-Bench as an empirical collection of 311 instances, with filtering against external frontier closed-book models and evaluation of independent LMMs and agents. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. Central results (e.g., 25.1% accuracy) are direct measurements against external systems and data, rendering the work self-contained without any reduction of claims to their own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the described filtering successfully isolates cases requiring genuine external knowledge acquisition and that the 311 instances are free of leakage.

axioms (2)

domain assumption: Filtering against frontier closed-book models removes all memorized cases.
Invoked in the abstract to justify retaining only samples that require external evidence.
domain assumption: Auditing eliminates image-answer leakage.
Stated as the method to ensure instances are supported only by verified external evidence.

pith-pipeline@v0.9.0 · 5735 in / 1462 out tokens · 58265 ms · 2026-05-20T21:57:05.977049+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: agent failures are predominantly driven by wrong entity retrieval and poor visual judgement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Gemini 3.1 Flash-Lite Model Card

Google DeepMind. Gemini 3.1 Flash-Lite Model Card. https://deepmind.google/ models/model-cards/gemini-3-1-flash-lite/, March 2026

work page 2026

[11] [11]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, 2024

work page 2024

[12] [12]

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Hulingxiao He, Zijun Geng, and Yuxin Peng. Fine-r1: Make multi-modal llms excel in fine- grained visual recognition by chain-of-thought reasoning.arXiv preprint arXiv:2602.07605, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images.PloS one, 14(10):e0223792, 2019

Martin N Hebart, Adam H Dickter, Alexis Kidder, Wan Y Kwok, Anna Corriveau, Caitlin Van Wicklin, and Chris I Baker. Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images.PloS one, 14(10):e0223792, 2019

work page 2019

[14] [14]

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents.arXiv preprint arXiv:2604.26752, 2026. 10

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Vegfru: A domain-specific dataset for fine-grained visual categorization

Saihui Hou, Yushan Feng, and Zilei Wang. Vegfru: A domain-specific dataset for fine-grained visual categorization. InProceedings of the IEEE international conference on computer vision, pages 541–549, 2017

work page 2017

[16] [16]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv preprint arXiv:2409.12959, 2024

work page arXiv 2024

[17] [17]

Automatic expansion of a food image dataset leveraging existing categories with domain adaptation

Yoshiyuki Kawano and Keiji Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. InEuropean Conference on Computer Vision, pages 3–17. Springer, 2014

work page 2014

[18] [18]

Novel dataset for fine-grained image categorization: Stanford dogs

Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. InProc. CVPR workshop on fine-grained visual categorization (FGVC), volume 2, 2011

work page 2011

[19] [19]

Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models

Jeonghwan Kim and Heng Ji. Finer: Investigating and enhancing fine-grained visual concept recognition in large vision language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6187–6207, 2024

work page 2024

[20] [20]

On distinguishing epistemic from pragmatic action.Cognitive science, 18(4):513–549, 1994

David Kirsh and Paul Maglio. On distinguishing epistemic from pragmatic action.Cognitive science, 18(4):513–549, 1994

work page 1994

[21] [21]

A hierarchical grocery store image dataset with visual and semantic labels

Marcus Klasson, Cheng Zhang, and Hedvig Kjellström. A hierarchical grocery store image dataset with visual and semantic labels. In2019 IEEE winter conference on applications of computer vision (WACV), pages 491–500. IEEE, 2019

work page 2019

[22] [22]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

work page 2024

[23] [23]

3d object representations for fine- grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

work page 2013

[24] [24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023

[25] [25]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[26] [26]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

work page 2034

[27] [27]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[28] [28]

MiniMax-M2.7

MiniMax AI. MiniMax-M2.7. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026

work page 2026

[29] [29]

Kimi K2.6: Advancing open-source coding

Moonshot AI. Kimi K2.6: Advancing open-source coding. https://www.kimi.com/blog/ kimi-k2-6, 2026

work page 2026

[30] [30]

mineralimage5k-98

Nech-C. mineralimage5k-98. Hugging Face dataset: https://huggingface.co/datasets/ Nech-C/mineralimage5K-98, 2026. Accessed for benchmark construction metadata

work page 2026

[31] [31]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

work page 2008

[32] [32]

GPT-5 mini

OpenAI. GPT-5 mini. https://developers.openai.com/api/docs/models/ gpt-5-mini, 2025. OpenAI API model documentation. 11

work page 2025

[33] [33]

OpenClaw: Personal ai assistant

OpenClaw Team. OpenClaw: Personal ai assistant. https://github.com/openclaw/ openclaw, 2026

work page 2026

[34] [34]

OpenCode: The open source ai coding agent

OpenCode Team. OpenCode: The open source ai coding agent. https://opencode.ai/, 2026

work page 2026

[35] [35]

Towards fine-grained recogni- tion with large visual language models: Benchmark and optimization strategies.arXiv preprint arXiv:2512.10384, 2025

Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, and Xin Lou. Towards fine-grained recogni- tion with large visual language models: Benchmark and optimization strategies.arXiv preprint arXiv:2512.10384, 2025

work page arXiv 2025

[36] [36]

Information foraging.Psychological review, 106(4):643, 1999

Peter Pirolli and Stuart Card. Information foraging.Psychological review, 106(4):643, 1999

work page 1999

[37] [37]

Qwen3.5: Towards native multimodal agents

Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id= qwen3.5, February 2026

work page 2026

[38] [38]

arXiv preprint arXiv:2508.21475 , year=

Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong. Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents.arXiv preprint arXiv:2508.21475, 2025

work page arXiv 2025

[39] [39]

The metropolitan museum of art open access

The Metropolitan Museum of Art. The metropolitan museum of art open access. https: //github.com/metmuseum/openaccess, 2026. Accessed for benchmark construction meta- data

work page 2026

[40] [40]

Mmina: Benchmarking multihop multimodal internet agents

Shulin Tian, Ziniu Zhang, Liang-Yu Chen, and Ziwei Liu. Mmina: Benchmarking multihop multimodal internet agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13682–13697, 2025

work page 2025

[41] [41]

Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, 2015

work page 2015

[42] [42]

The inaturalist species classification and detection dataset

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018

work page 2018

[43] [43]

The caltech-ucsd birds-200-2011 dataset

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

work page 2011

[44] [44]

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, and Mohit Bansal. Merrin: A benchmark for multimodal evidence retrieval and reasoning in noisy web environments.arXiv preprint arXiv:2604.13418, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

Google landmarks dataset v2-a large- scale benchmark for instance-level recognition and retrieval

Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large- scale benchmark for instance-level recognition and retrieval. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2575–2584, 2020

work page 2020

[46] [46]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024

[47] [47]

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu, Yuxin Peng, Serge Belongie, and Xiu-Shen Wei. Benchmarking large vision- language models on fine-grained image tasks: A comprehensive evaluation.arXiv preprint arXiv:2504.14988, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Hawker 800 series (BAe 125/Hawker 800)

Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, et al. Worldvqa: Measuring atomic world knowledge in multimodal large language models.arXiv preprint arXiv:2602.02537, 2026. 12 A Additional Sample Visualizations Apparel Electronics Equipment Food Household Infrastructure Animal Mineral Pl...

work page arXiv 2026

[49] [49]

Justification: The real-life images were voluntarily contributed and privacy-redacted as part of dataset curation rather than a behavioral study

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page