Pith Number

pith:F54MK2JL

pith:2026:F54MK2JLFSRMCEUWVR5FVL3P6R
not attested · not anchored · not stored · refs resolved

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Geng Li, Yuxin Peng

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

arxiv:2605.13193 v1 · 2026-05-13 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{F54MK2JLFSRMCEUWVR5FVL3P6R}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim: open
4 Citations: open
5 Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
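
The page does not specify the merge algorithm, so the following is only a minimal sketch of what "deterministic" can mean here: a merge whose result does not depend on the order or duplication of the events a mirror receives. The `Event` fields, the last-writer-wins rule, and the sample events are all assumptions made for illustration, not the Pith format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not shown on this page.
    event_id: str   # assumed unique per event
    timestamp: str  # assumed ISO 8601 UTC in one fixed format, so strings sort by time
    field: str
    value: str


def merge(events: Iterable[Event]) -> Dict[str, str]:
    """Fold events into a state that is independent of arrival order."""
    # Drop duplicate deliveries of the same event.
    unique = {e.event_id: e for e in events}
    # Impose a total order: timestamp first, event_id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state: Dict[str, str] = {}
    for e in ordered:
        state[e.field] = e.value  # last writer wins (an assumed rule)
    return state


# Made-up events, used only to show order independence.
a = Event("e1", "2026-05-18T03:08:48Z", "author_claim", "open")
b = Event("e2", "2026-05-19T10:00:00Z", "author_claim", "claimed")
c = Event("e3", "2026-05-19T10:00:00Z", "citations", "open")

print(merge([a, b, c]) == merge([c, a, b, a]))
```

Sorting on a total order before folding is what lets two mirrors that saw the same events in different orders arrive at the same state.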

Claims

C1 · strongest claim

Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement.

C2 · weakest assumption

That the filtering against frontier closed-book models successfully removes all memorized cases and that the 311 instances have no image-answer leakage while remaining representative of real-life fine-grained recognition scenarios.

C3 · one line summary

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

Receipt and verification
First computed: 2026-05-18T03:08:48.782282Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

2f78c5692b2ca2c11296ac7a5aaf6ff46b4beb8c45de2f12574d239c1ac06fd2

Aliases

arxiv: 2605.13193 · arxiv_version: 2605.13193v1 · doi: 10.48550/arxiv.2605.13193 · pith_short_12: F54MK2JLFSRM · pith_short_16: F54MK2JLFSRMCEUW · pith_short_8: F54MK2JL
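
The identifiers on this page appear to be related. Base32-encoding the canonical hash and keeping the first 26 characters reproduces the Pith Number, and the `pith_short_*` aliases are prefixes of it. This is inferred from the values shown here rather than from a documented rule, and the function name below is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "2f78c5692b2ca2c11296ac7a5aaf6ff46b4beb8c45de2f12574d239c1ac06fd2"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode (RFC 4648 alphabet) the digest bytes and truncate.

    The 26-character length is an assumption read off this record.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)  # F54MK2JLFSRMCEUWVR5FVL3P6R
print(aliases)
```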
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/F54MK2JLFSRMCEUWVR5FVL3P6R \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2f78c5692b2ca2c11296ac7a5aaf6ff46b4beb8c45de2f12574d239c1ac06fd2
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8080024fc14d0c4aac3350b4bbdb7080db83c72c32fa88978011462d04f8a4b5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T08:49:51Z",
    "title_canon_sha256": "433781ff2bbe677de8fd0cb6b13104056a1e47f83d136da4b3d191ef5acb077f"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13193",
    "kind": "arxiv",
    "version": 1
  }
}
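
The hashing step of the shell one-liner can also be run offline against the record shown above. The sketch below mirrors that recipe (sorted keys, compact separators, no ASCII escaping, UTF-8, SHA-256); the function name `canonical_sha256` is an assumption, and the record is copied from this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "8080024fc14d0c4aac3350b4bbdb7080db83c72c32fa88978011462d04f8a4b5",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-13T08:49:51Z",
        "title_canon_sha256": "433781ff2bbe677de8fd0cb6b13104056a1e47f83d136da4b3d191ef5acb077f",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13193", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
```

The printed digest can be compared with the Canonical hash listed on this page. Because keys are sorted before hashing, the key order in the source JSON does not affect the result.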