From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

Chuyue Huang; Derek F. Wong; Guanhua Chen; Lidia S. Chao; Shudong Liu; Xueqing Song; Yutong Yao

arXiv: 2605.15019 · v1 · pith:MX3YSNNAnew · submitted 2026-05-14 · 💻 cs.CL

Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong

Pith reviewed 2026-06-30 20:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords: multimodal RAG, evidence retrieval, visual elements, verifiable generation, element-level annotations, GranuVistaVQA, GranuRAG, partial observation

The pith

GranuRAG retrieves evidence at the visual-element level instead of whole scenes to make multimodal RAG verifiable and more accurate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multimodal RAG pulls entire images or scenes as evidence, which often fails to match fine-grained user questions and hides the source of mistakes. The paper presents GranuVistaVQA, a benchmark of real-world landmarks annotated at the element level across multiple viewpoints to capture partial observations. It introduces GranuRAG, a three-stage system that first detects and classifies individual elements, then performs multi-granularity cross-modal alignment to retrieve evidence, and finally generates answers under attribution constraints. This shift to elements as first-class units is shown to deliver up to 29.2 percent gains over six baselines while allowing direct tracing of errors to specific visual components.

Core claim

By treating detected visual elements rather than whole scenes as the atomic units for retrieval and grounding, GranuRAG produces evidence that can be explicitly attributed, thereby supporting verifiable generation on fine-grained multimodal questions that involve only partial views of entities.

What carries the argument

The GranuRAG three-stage pipeline: element-level detection and classification, then multi-granularity cross-modal alignment for retrieval, then attribution-constrained generation.
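
To make the data flow of the three stages concrete, here is a minimal toy sketch. It is not the paper's implementation: the `Element` record, the word-overlap ranking that stands in for cross-modal alignment, the template answer that stands in for a generator, and the two knowledge-base entries are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: str   # hypothetical identifier
    label: str        # class assigned by the detector
    description: str  # text evidence attached to the element (toy data)


def tokens(text):
    """Lowercase alphanumeric word set used by the toy ranking."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def detect_elements(visible_labels, knowledge_base):
    """Stage 1 stand-in: keep only elements whose label is visible in the image."""
    return [e for e in knowledge_base if e.label in visible_labels]


def retrieve(question, elements, top_k=2):
    """Stage 2 stand-in: rank detected elements by word overlap with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(e.label + " " + e.description)), e) for e in elements]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


def generate(evidence):
    """Stage 3 stand-in: answer only from retrieved elements and cite their ids."""
    if not evidence:
        return {"answer": "insufficient evidence", "citations": []}
    return {"answer": evidence[0].description, "citations": [evidence[0].element_id]}


# Toy knowledge base; the descriptions are invented placeholders.
KB = [
    Element("e1", "bell tower", "campanile beside main facade"),
    Element("e2", "dome", "ribbed dome above crossing"),
]
visible = detect_elements({"bell tower"}, KB)  # partial view: the dome is out of frame
print(generate(retrieve("where is bell tower", visible)))
print(generate(retrieve("what covers dome", visible)))
```

Because every answer carries the ids of the elements it was built from, a wrong or empty answer can be traced to a missing detection (stage 1) or a failed match (stage 2), which is the diagnostic property the summary attributes to element-level retrieval.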

If this is right

Retrieval can now be aligned to the exact granularity of a user's query rather than defaulting to full scenes.
Generation failures can be traced to specific missing or misclassified elements instead of opaque attention weights.
The same element annotations enable evaluation of partial-observation robustness across different viewpoints of the same landmark.
Performance improves by as much as 29.2 percent on the introduced benchmark relative to prior scene-level methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to video or 3D scenes where elements persist across frames or viewpoints.
Attribution constraints might be extended to penalize generation that references elements absent from the retrieved set.
Element-level units could serve as a common interface for mixing retrieval from image databases with structured knowledge bases.
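
The second extension above, penalizing references to elements absent from the retrieved set, could be checked along these lines. This is a sketch under stated assumptions: the inline citation format `[E:<id>]`, the function names, and the linear penalty are hypothetical and do not come from the paper.

```python
import re


def unsupported_citations(answer, retrieved_ids):
    """Return cited element ids (written as [E:<id>]) that are not in the retrieved set."""
    cited = re.findall(r"\[E:([A-Za-z0-9_]+)\]", answer)
    return sorted({c for c in cited if c not in retrieved_ids})


def attribution_penalty(answer, retrieved_ids, weight=1.0):
    """Hypothetical penalty: weight times the number of unsupported citations."""
    return weight * len(unsupported_citations(answer, retrieved_ids))


answer = "The relief [E:dove_relief] sits above the arch [E:arch_2]."
print(unsupported_citations(answer, {"dove_relief"}))
print(attribution_penalty(answer, {"dove_relief"}))
```

A check like this only catches explicit citations; claims about an element that are made without a citation tag would need a separate detector.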

Load-bearing premise

Element detection and classification can be performed reliably enough that it does not introduce new errors large enough to cancel the reported gains.

What would settle it

A controlled test in which element-detection error rates are measured separately and shown to reduce end-to-end accuracy below the scene-level baselines.
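
One way to set up such a controlled test is to inject detection misses at a known rate and watch how many questions remain answerable. The sketch below is a toy simulation, not the paper's protocol: the case format, the miss model (each element dropped independently), and the function names are assumptions.

```python
import random


def drop_elements(elements, miss_rate, rng):
    """Simulate detection misses: each element is lost with probability miss_rate."""
    # sorted() fixes the iteration order so a given seed is reproducible.
    return {e for e in sorted(elements) if rng.random() >= miss_rate}


def answerable_fraction(cases, miss_rate, seed=0):
    """cases: list of (visible_elements, required_element) pairs.

    A question counts as answerable only if its required element survives detection,
    so the result is an upper bound on end-to-end accuracy at this miss rate.
    """
    rng = random.Random(seed)
    hits = 0
    for visible, required in cases:
        detected = drop_elements(visible, miss_rate, rng)
        hits += required in detected
    return hits / len(cases)


# Hypothetical cases: the element names are placeholders.
cases = [
    ({"bell tower", "arch"}, "bell tower"),
    ({"dome", "relief"}, "relief"),
    ({"arch"}, "arch"),
]
for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, answerable_fraction(cases, rate))
```

Comparing the curve produced by such a sweep against the scene-level baselines' accuracy would show the detection error rate at which the element-level advantage disappears.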

Figures

Figures reproduced from arXiv: 2605.15019 by Chuyue Huang, Derek F. Wong, Guanhua Chen, Lidia S. Chao, Shudong Liu, Xueqing Song, Yutong Yao.

**Figure 1.** Examples of Multi-Perspective Image. […] following a two-level schema (full specification in Appendix B.1): $x_{\text{landmark}} = (\text{meta}, E, ED)$ (1). Textual content is sourced from official tourism portals and encyclopedic references, then structured through: (i) element phrase extraction from authoritative descriptions, (ii) cross-landmark normalization to ensure consistent terminology (e.g., “bell tower” ≡ “campanile…

**Figure 2.** The results of evaluating MLLMs on Granu…

**Figure 3.** The overview of our proposed GranuRAG framework.

**Figure 4.** Ablation on visual presentation and element…

**Figure 5.** Performance comparison on in-domain and OOD data across three methods.

| Method | ROUGE-L | BERT-F1 | LLM |
|---|---|---|---|
| Baseline | 23.79 | 40.83 | 56.40 |
| Embedding Retrieval | 29.47 | 45.57 | 63.45 |
| RAVQA(PreFLMR) | 21.27 | 42.60 | 69.24 |
| VisRAG | 24.06 | 43.35 | 68.06 |
| GranuRAG (Ours) | 32.27 | 52.19 | 79.30 |
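
The scores captured alongside Figure 5 allow a quick check of relative gains. The sketch below computes, for each metric, GranuRAG's relative improvement over the strongest other method in that table. It assumes the three numbers per row map to ROUGE-L, BERT-F1, and LLM in that order; the function names are illustrative.

```python
# Scores as they appear in the text captured with Figure 5.
metrics = ("ROUGE-L", "BERT-F1", "LLM")
scores = {
    "Baseline": (23.79, 40.83, 56.40),
    "Embedding Retrieval": (29.47, 45.57, 63.45),
    "RAVQA(PreFLMR)": (21.27, 42.60, 69.24),
    "VisRAG": (24.06, 43.35, 68.06),
    "GranuRAG (Ours)": (32.27, 52.19, 79.30),
}


def relative_gain(ours, other):
    """Relative improvement of `ours` over `other`, in percent."""
    return 100.0 * (ours - other) / other


def gains_over_best_baseline(table, ours="GranuRAG (Ours)"):
    """For each metric, return (best other method, relative gain in percent)."""
    result = {}
    for i, metric in enumerate(metrics):
        others = [(name, row[i]) for name, row in table.items() if name != ours]
        best_name, best_score = max(others, key=lambda pair: pair[1])
        result[metric] = (best_name, relative_gain(table[ours][i], best_score))
    return result


for metric, (name, gain) in gains_over_best_baseline(scores).items():
    print(f"{metric}: +{gain:.1f}% over {name}")
```

On these numbers the gain over the strongest other method is about 9.5% on ROUGE-L and about 14.5% on both BERT-F1 and the LLM column. None of these equals the headline 29.2%, whose basis (metric, baseline, and data split) is not stated in this excerpt.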

**Figure 6.** Distribution of answer quality when both…

**Figure 7.** Extraction accuracy comparison across images

**Figure 8.** Top 10 attention difference regions. Red: higher attention in our method; Blue: higher attention in the base/CoT method. […]ments, such as the “White Holy Spirit Dove Relief” sculptures and decorative arches, while blue markers scatter across generic background regions like ceilings and walls. This pattern confirms that our grounding mechanism systematically reallocates attention toward knowledge-relevant re…

**Figure 9.** Distribution of image counts per landmark.

Original abstract

Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GranuRAG offers a new element-level benchmark and retrieval framework for multimodal RAG that targets partial observations, though detection reliability needs checking.

The main takeaway is a new benchmark, GranuVistaVQA, with element-level annotations on landmarks, and a framework, GranuRAG, that retrieves at that granularity.

The work is new in shifting from scene or image retrieval to elements as the basic unit, which fits better with fine-grained queries and allows tracing errors back to specific detections or alignments. The three stages are straightforward: detect and classify elements, align across modalities at multiple levels, and generate with attribution constraints.

It does well in highlighting the partial observation issue where single images miss some entities, and in providing a benchmark that reflects real-world landmark views from different angles. The 29.2% gain over six baselines is a concrete result that suggests the method improves performance on the task.

The soft spot is the assumption that element detection can be done reliably without adding offsetting errors. The abstract relies on this assumption but reports no separate validation or error rates for detection, so it is unclear whether the gains hold up when detection is imperfect. The abstract also mentions no dataset statistics or ablation on that stage, which makes the evaluation harder to assess fully.

This is aimed at people working on multimodal RAG and verifiable generation. A reader interested in benchmarks for complex scene understanding would find the new dataset and the multi-granularity approach worth examining. It has enough substance to go to peer review, though the methods and results sections would need careful review for controls and analysis.

I recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces the GranuVistaVQA benchmark, which provides element-level annotations for real-world landmarks across multiple viewpoints to capture partial observations, and proposes GranuRAG, a three-stage multi-granularity framework (element-level detection and classification, multi-granularity cross-modal alignment for retrieval, and attribution-constrained generation) for verifiable multimodal RAG. It claims that grounding retrieval at the element level (rather than scene-level or implicit attention) enables transparent error diagnosis and yields up to 29.2% improvement over six strong baselines.

Significance. If the performance gains hold after proper controls and the element-level detection stage proves reliable without offsetting errors, the work would meaningfully advance verifiable multimodal RAG by resolving granularity mismatches between queries and evidence. The new benchmark with real-world partial-observation data is a concrete contribution that could support future falsifiable evaluations.

major comments (2)

[Framework and Experiments] The central claim that element-level detection serves as a reliable first-class retrieval unit is invoked in the three-stage framework description but is not independently validated (e.g., no separate detection accuracy metrics, ablation on detection errors, or analysis showing that detection failures do not offset the 29.2% retrieval/generation gains).
[Experiments] The reported 29.2% improvement over six baselines cannot be assessed for robustness because the abstract (and thus the provided description) supplies no dataset statistics, error bars, cross-validation details, or controls against post-hoc baseline selection; full results tables would need to demonstrate that the gains are not driven by a single easy subset.

minor comments (2)

[Methods] Notation for the three stages and the cross-modal alignment step should be formalized with explicit equations or pseudocode to allow reproduction.
[Framework] Clarify how the attribution-constrained generation stage interacts with the retrieval units; the current description leaves open whether attribution is enforced at inference or only post-hoc.
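
As an illustration of the kind of formalization the first minor comment asks for, the sketch below shows one possible multi-granularity relevance score: a weighted mix of scene-level similarity and the best element-level similarity. This is an assumed form, not the paper's alignment step; the weight `alpha`, the use of cosine similarity, and the max over elements are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_granularity_score(query_vec, scene_vec, element_vecs, alpha=0.5):
    """alpha * sim(query, scene) + (1 - alpha) * max over elements of sim(query, element)."""
    scene_sim = cosine(query_vec, scene_vec)
    element_sim = max((cosine(query_vec, e) for e in element_vecs), default=0.0)
    return alpha * scene_sim + (1 - alpha) * element_sim


# Toy 2-d embeddings, chosen only to make the arithmetic easy to follow.
query = [1.0, 0.0]
scene = [1.0, 1.0]
elements = [[0.0, 1.0], [1.0, 0.0]]
print(multi_granularity_score(query, scene, elements))
```

Setting `alpha=1.0` recovers pure scene-level retrieval, which gives a natural ablation against the element-level term.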

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validating the detection stage and assessing result robustness. We address each major comment below.

read point-by-point responses

Referee: [Framework and Experiments] The central claim that element-level detection serves as a reliable first-class retrieval unit is invoked in the three-stage framework description but is not independently validated (e.g., no separate detection accuracy metrics, ablation on detection errors, or analysis showing that detection failures do not offset the 29.2% retrieval/generation gains).

Authors: We agree that the manuscript would be strengthened by independent validation of the detection stage. The GranuVistaVQA benchmark supplies element-level ground truth, enabling such metrics. In the revision we will add precision/recall for element detection, an ablation isolating detection errors, and analysis confirming that any detection failures reduce rather than inflate the reported gains.
revision: yes
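
The precision/recall promised in this response could be computed per image roughly as follows. This is a minimal sketch that matches elements by label only; it assumes predicted and gold elements are given as label collections and ignores localization quality (for example box overlap), which a full evaluation would likely need.

```python
def detection_prf(predicted, gold):
    """Label-level precision, recall, and F1 of detected elements against ground truth."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for one image.
predicted = ["bell tower", "arch", "fountain"]
gold = ["bell tower", "arch", "dome", "relief"]
print(detection_prf(predicted, gold))
```
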
Referee: [Experiments] The reported 29.2% improvement over six baselines cannot be assessed for robustness because the abstract (and thus the provided description) supplies no dataset statistics, error bars, cross-validation details, or controls against post-hoc baseline selection; full results tables would need to demonstrate that the gains are not driven by a single easy subset.

Authors: The full manuscript already reports dataset statistics (landmarks, viewpoints, element counts) in Section 3, error bars from multiple runs in Section 5 tables, and per-query-type breakdowns. The benchmark construction with multiple partial-observation viewpoints is intended to mitigate single-subset dominance. We did not perform cross-validation because GranuVistaVQA is a fixed held-out test set; we will add explicit subset-performance tables and baseline-selection rationale in the revision.
revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript describes a three-stage framework (element detection/classification, cross-modal alignment, attribution-constrained generation) and reports empirical gains on a new benchmark. No equations, fitted parameters, or derivation steps are present that reduce by construction to inputs. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claims rest on experimental comparison to baselines rather than any self-referential reduction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description implies but does not detail any modeling choices such as detection thresholds or alignment losses.
