{"paper":{"title":"FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Geng Li, Yuxin Peng","submitted_at":"2026-05-13T08:49:51Z","abstract_excerpt":"Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. 
Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the filtering against frontier closed-book models successfully removes all memorized cases and that the 311 instances have no image-answer leakage while remaining representative of real-life fine-grained recognition scenarios.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"f16b84be1a4c177ffdc8a8775bb42c24949025a75778bab742bee2177ef92c76"},"source":{"id":"2605.13193","kind":"arxiv","version":1},"verdict":{"id":"440b163b-323a-4b26-a134-7eeaeabd66e3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:19:47.950172Z","strongest_claim":"Our evaluation of latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. 
Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement.","one_line_summary":"FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the filtering against frontier closed-book models successfully removes all memorized cases and that the 311 instances have no image-answer leakage while remaining representative of real-life fine-grained recognition scenarios.","pith_extraction_headline":""},"references":{"count":49,"sample":[{"doi":"","year":2026,"title":"Fashion product images dataset","work_id":"ae73cde8-a094-4f14-968e-756dfc1314e7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736","work_id":"a6b311cb-8acd-497f-922e-82521ddb430a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. 2(1):1, 2023","work_id":"3e704b76-920c-4ff8-85b2-7603f3c5ddd3","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","ref_index":4,"cited_arxiv_id":"2511.21631","is_internal_anchor":true},{"doi":"","year":2020,"title":"Products-10k: A large-scale product recognition 
dataset","work_id":"620913d3-4750-4c73-b69d-883ec2058849","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":49,"snapshot_sha256":"6df06bf5ee6729f58463071fb75ed181071123185980f8f152e33b7766d72edf","internal_anchors":7},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}