
Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their "black-box" nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, which is typically expressed as textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap in which the model hallucinates coordinates that do not correspond to evidence in the image. In this work, we introduce Composer, an MLLM that uses a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to treat visual regions as addressable, semantically manipulable sets. To rigorously validate this grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves parity with its coordinate-based counterpart in final answer accuracy while improving visual grounding accuracy by 9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.

representative citing papers

Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

Composer replaces textual coordinate grounding with learned proxy-tokens that index image latent space, matching answer accuracy while raising grounding accuracy 9 points on the new ComposerGCoT dataset.
