Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

2026 · cs.CV · arXiv 2606.23354

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their "black-box" nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, typically expressed as textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap where the model hallucinates coordinates that do not correspond to image evidence. In this work, we introduce Composer, an MLLM that leverages a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to treat visual regions as addressable, semantically manipulable sets. To rigorously validate our novel grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves performance parity with its coordinate-based counterpart in final answer accuracy, while improving visual grounding accuracy by +9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.
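
The idea of proxy-tokens as discrete pointers into the image latent space can be pictured with a toy sketch. This is not Composer's implementation, which the abstract does not detail. The `<pK>` token naming, the 4x4 patch grid, the one-dimensional stand-in features, and the patch-level IoU score are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical setup: the image is encoded as a 4x4 grid of patch latents,
# and each patch has exactly one proxy-token named "<pK>" (K = flat index).
GRID_H = 4
GRID_W = 4


def proxy_token(row: int, col: int) -> str:
    """Return the (hypothetical) proxy-token that points at patch (row, col)."""
    return f"<p{row * GRID_W + col}>"


def token_to_patch(token: str) -> Tuple[int, int]:
    """Decode a proxy-token back into its (row, col) patch position."""
    index = int(token[2:-1])
    return divmod(index, GRID_W)


def box_to_tokens(r0: int, c0: int, r1: int, c1: int) -> FrozenSet[str]:
    """Proxy-tokens covering patch rows r0..r1 and columns c0..c1, inclusive."""
    return frozenset(
        proxy_token(r, c)
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
    )


def gather(features: List[List[float]], region: FrozenSet[str]) -> List[List[float]]:
    """Look up the latent vectors addressed by a region, in patch-index order."""
    indices = sorted(int(token[2:-1]) for token in region)
    return [features[i] for i in indices]


def region_iou(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Patch-level intersection over union between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Stand-in latents: one scalar per patch, equal to the patch index.
features = [[float(i)] for i in range(GRID_H * GRID_W)]

region_a = box_to_tokens(0, 0, 1, 1)  # patches 0, 1, 4, 5
region_b = box_to_tokens(1, 1, 2, 2)  # patches 5, 6, 9, 10

# Because regions are sets of tokens, ordinary set algebra applies to them.
overlap = region_a & region_b
overlap_features = gather(features, overlap)
score = region_iou(region_a, region_b)
```

In this toy, every token a model could emit resolves to a real entry in the feature grid, whereas a free-form textual coordinate carries no such guarantee. That is one way to read the "semantic-spatial gap" described above, under the stated assumptions.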

representative citing papers

Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

Composer replaces textual coordinate grounding with learned proxy-tokens that index image latent space, matching answer accuracy while raising grounding accuracy 9 points on the new ComposerGCoT dataset.
