{"paper":{"title":"CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Multimodal document models often produce correct answers while citing the wrong evidence regions.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Bin Wang, Conghui He, Dongsheng Ma, Jiahao Kong, Jiayu Li, Jie Yang, Jutao Xiao, Weijun Zeng, Wentao Zhang, Yijie Wang, Zhengren Wang","submitted_at":"2026-05-13T01:54:42Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations al"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The automated masking-ablation pipeline plus expert review produces accurate ground-truth element-level citations that correctly identify the minimal sufficient evidence regions for each question.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multimodal document models often produce correct answers while citing the wrong evidence regions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ba1154638bdfb046b4dde6fec2a0324ddadab744ee2c2f14230e0e905bbd1550"},"source":{"id":"2605.12882","kind":"arxiv","version":1},"verdict":{"id":"806fe640-f3c8-4ee9-8bb2-64c87573aafa","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:33:46.494939Z","strongest_claim":"Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5.","one_line_summary":"CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The automated masking-ablation pipeline plus expert review produces accurate ground-truth element-level citations that correctly identify the minimal sufficient evidence regions for each question.","pith_extraction_headline":"Multimodal document models often produce correct answers while citing the wrong evidence regions."},"references":{"count":76,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","ref_index":2,"cited_arxiv_id":"2511.21631","is_internal_anchor":true},{"doi":"","year":2024,"title":"Maintnorm: A corpus and benchmark model for lexical normalisation and masking of industrial maintenance short text","work_id":"50778707-6f43-48ab-8c10-8c244c58f11f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Gaps: A clinically grounded, automated benchmark for evaluating ai clinicians.arXiv preprint arXiv:2510.13734, 2025","work_id":"108d2981-bc3c-4b10-b694-1f9fef66a745","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding","work_id":"776da489-7ed1-484b-a0a8-4c975ba75e98","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":76,"snapshot_sha256":"3530f7e992d988e3d10ed15d8154c990a46b39985877adcd16bddb2efee2e7a2","internal_anchors":18},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}