{"paper":{"title":"KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"KG-ViP fuses scene graphs and commonsense graphs via a query-guided pipeline to reduce hallucination and sharpen visual detail in multi-modal LLMs for VQA.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Ao Ke, Xike Xie, Yukun Cao, Zhiyang Li","submitted_at":"2026-01-14T07:16:11Z","abstract_excerpt":"Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene gr"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the novel retrieval-and-fusion pipeline, using the query as a semantic bridge to integrate scene graphs and commonsense graphs, will produce reliable multi-modal reasoning without introducing new errors or irrelevant information.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"KG-ViP fuses scene graphs and commonsense graphs via a query-based retrieval-and-fusion pipeline to improve multi-modal LLM performance on visual question answering.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"KG-ViP fuses scene graphs and commonsense graphs via a query-guided pipeline to reduce hallucination and sharpen visual detail in multi-modal LLMs for VQA.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"90f20d79f02a239876b934b71b739ea58f19d117efb3e3de3b816baa269b1a95"},"source":{"id":"2601.11632","kind":"arxiv","version":3},"verdict":{"id":"6e874506-69f0-41b5-945c-642fe7bb6940","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T14:45:02.696687Z","strongest_claim":"Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.","one_line_summary":"KG-ViP fuses scene graphs and commonsense graphs via a query-based retrieval-and-fusion pipeline to improve multi-modal LLM performance on visual question answering.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the novel retrieval-and-fusion pipeline, using the query as a semantic bridge to integrate scene graphs and commonsense graphs, will produce reliable multi-modal reasoning without introducing new errors or irrelevant information.","pith_extraction_headline":"KG-ViP fuses scene graphs and commonsense graphs via a query-guided pipeline to reduce hallucination and sharpen visual detail in multi-modal LLMs for VQA."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2601.11632/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"936322e059e14cf6511e50dc2decb8b06bc2d2afefb9bc14dda2bebbc3ee83eb"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}