Pith Number

pith:FABDGC42

pith:2026:FABDGC42UOCVILIWPJ2DHRXX7G

not attested · not anchored · not stored · refs resolved

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Bin Wang, Conghui He, Dongsheng Ma, Jiahao Kong, Jiayu Li, Jie Yang, Jutao Xiao, Weijun Zeng, Wentao Zhang, Yijie Wang, Zhengren Wang

Multimodal document models often produce correct answers while citing the wrong evidence regions.

arxiv:2605.12882 v1 · 2026-05-13 · cs.CL · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{FABDGC42UOCVILIWPJ2DHRXX7G}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5.

C2 · weakest assumption

The automated masking-ablation pipeline plus expert review produces accurate ground-truth element-level citations that correctly identify the minimal sufficient evidence regions for each question.

C3 · one-line summary

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.


Receipt and verification

First computed	2026-05-18T03:09:11.069821Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

2802330b9aa385542d167a7433c6f7f988bdff8191b1a30147780678fe5216fa

Aliases

arxiv: 2605.12882 · arxiv_version: 2605.12882v1 · doi: 10.48550/arxiv.2605.12882 · pith_short_12: FABDGC42UOCV · pith_short_16: FABDGC42UOCVILIW · pith_short_8: FABDGC42
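The identifier and its short aliases appear to be derived from the canonical hash: Base32-encoding the raw bytes of the hex digest listed above reproduces the 26-character Pith Number, and the short aliases are its first 8, 12, and 16 characters. This relationship is inferred from the values on this page rather than taken from a published specification, and the helper name `pith_from_hash` is hypothetical. A minimal Python sketch under those assumptions:

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "2802330b9aa385542d167a7433c6f7f988bdff8191b1a30147780678fe5216fa"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep a prefix.

    Assumption: the 26-character length is read off this page's identifier,
    not from a documented rule.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith = pith_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}

print(pith)  # FABDGC42UOCVILIWPJ2DHRXX7G
for name, value in aliases.items():
    print(name, value)
```

If this reading is right, any alias can be checked against the canonical hash without contacting the resolver.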


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/FABDGC42UOCVILIWPJ2DHRXX7G \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2802330b9aa385542d167a7433c6f7f988bdff8191b1a30147780678fe5216fa

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "d7abf45806e966aac184e30ced91b414828d7d4249608ffb512b9b9d741dd2c2",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T01:54:42Z",
    "title_canon_sha256": "49cb8290ff597ff16029388c750aa4d45f1f8610be6c8b5e4d294800cbc5f66c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12882",
    "kind": "arxiv",
    "version": 1
  }
}
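The same check can be sketched offline from the record printed above, without a network call. The serialization mirrors the one-liner in the verification command (sorted keys, compact separators, non-ASCII characters kept, UTF-8 encoding). The function name `canonical_sha256` is hypothetical, and the sketch assumes the record shown here is identical to the one the resolver serves:

```python
import hashlib
import json

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "d7abf45806e966aac184e30ced91b414828d7d4249608ffb512b9b9d741dd2c2",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T01:54:42Z",
        "title_canon_sha256": "49cb8290ff597ff16029388c750aa4d45f1f8610be6c8b5e4d294800cbc5f66c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12882", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and compact separators, then hash with SHA-256."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields happen to be written. A mismatch with the listed canonical hash would suggest either that the record shown differs from the served one or that the serialization assumptions above do not hold.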