Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

Aboubacar Tuo; Angelique Loesch; Mohamed Chaouch; Tom Hodemon

arxiv: 2606.23354 · v1 · pith:FGCQSQRDnew · submitted 2026-06-22 · 💻 cs.CV

Pith reviewed 2026-06-26 09:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual grounding · multimodal large language models · proxy tokens · grounded visual reasoning · interpretability · visual question answering · semantic-spatial gap

The pith

Learned proxy-tokens give multimodal models a semantic link to visual features that textual coordinates lack, raising grounding accuracy by nine points at no cost to final answer accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Composer, an MLLM that replaces textual coordinate outputs with discrete learned proxy-tokens that directly index the image latent space. These tokens function as addressable symbolic pointers, letting the model treat visual regions as semantically manipulable sets rather than strings that must be guessed. On a purpose-built dataset called ComposerGCoT the model reaches the same final-answer accuracy as a coordinate-based baseline while lifting visual-grounding accuracy by nine points. The central demonstration is that a learnable semantic link between text and image latents reduces the hallucination of coordinates that do not match image evidence. If correct, the result shows that faithful grounded visual reasoning can be achieved without sacrificing task performance.

Core claim

Composer replaces the conventional textual-coordinate grounding mechanism with learned proxy-tokens that explicitly index the image latent space. These discrete symbolic pointers allow the model to manipulate visual regions as addressable, semantically manipulable sets. On the ComposerGCoT dataset the proxy-token model achieves the same final-answer accuracy as its coordinate-based counterpart while improving visual-grounding accuracy by 9.0 points, establishing that proxy-tokens capture spatial semantics more effectively than textual coordinates.

What carries the argument

Learned proxy-tokens: discrete symbolic pointers that explicitly index the image latent space and serve as addressable, semantically manipulable sets for visual regions.
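To make "addressable, semantically manipulable sets" concrete, here is a minimal sketch that treats a visual region as a set of proxy-token indices. It is an illustration only, not the authors' implementation: the `<r{i}>` surface form, the 0-based indexing, the 16 x 16 grid layout, and the helper names are all assumptions introduced for this example.

```python
GRID = 16  # assumption: 256 visual tokens laid out as a 16 x 16 patch grid


def proxy_token(index: int) -> str:
    """Hypothetical surface form of the proxy-token pointing at visual token `index`."""
    return f"<r{index}>"


def region_from_cells(row0: int, col0: int, row1: int, col1: int) -> frozenset:
    """Visual-token indices covered by an inclusive rectangle of grid cells."""
    return frozenset(
        row * GRID + col
        for row in range(row0, row1 + 1)
        for col in range(col0, col1 + 1)
    )


def render(region) -> str:
    """Serialize a region as the proxy-token string a model might emit."""
    return "".join(proxy_token(i) for i in sorted(region))


# Two illustrative regions (made up for this example, not taken from the paper).
cat = region_from_cells(2, 2, 3, 3)
sofa = region_from_cells(3, 1, 5, 6)

overlap = cat & sofa          # cells shared by both regions
either = cat | sofa           # cells covered by at least one region
print(render(overlap))        # <r50><r51>
print(len(either))            # 20
```

Because a region is a set of pointers rather than a coordinate string, intersection and union are exact operations on indices, which is the property the review describes as "semantically manipulable".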

If this is right

Grounding mechanisms that maintain a learnable semantic link to visual features reduce coordinate hallucination without harming answer accuracy.
Discrete proxy-tokens can be treated as manipulable objects inside the language model, enabling more structured visual reasoning chains.
The same token-based indexing approach can be applied to other multimodal tasks that require explicit region reference.
Performance parity between proxy-token and coordinate models indicates that improved interpretability need not trade off against task success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Proxy-token grounding may generalize to video or 3-D scene reasoning where coordinate strings become even less reliable.
The approach could be combined with existing attention-map visualization techniques to produce human-readable explanations that are also machine-verifiable.
If proxy-tokens prove stable across model scales, they offer a route to post-training alignment methods that directly supervise visual reference rather than final answers alone.

Load-bearing premise

The ComposerGCoT dataset supplies an unbiased, holistic measure of both reasoning consistency and grounding accuracy that holds outside the proxy-token model's training distribution.

What would settle it

A controlled test on a held-out real-world VQA corpus in which the proxy-token model’s grounding accuracy falls below the coordinate baseline while final-answer accuracy remains equal or lower.

Original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their "black-box" nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, which is typically expressed as textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap where the model hallucinates coordinates that do not correspond to image evidence. In this work, we introduce Composer, an MLLM that leverages a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to manipulate visual regions as addressable, semantically manipulable sets. To rigorously validate our novel grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves performance parity with its coordinate-based counterpart in final answer accuracy, while improving visual grounding accuracy by 9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Proxy-tokens are a clean idea for closing the semantic gap in grounded VQA, but all the numbers sit on a single synthesized dataset.

The paper's main move is to replace textual coordinates with learned proxy-tokens that act as discrete pointers into the image latent space. This lets the model treat visual regions as manipulable semantic sets rather than strings of numbers. They implement it in Composer and test on a new dataset they built, ComposerGCoT, where the model matches coordinate baselines on final-answer accuracy while gaining 9 points on grounding accuracy.

The proxy-token mechanism is the actual novelty. It directly targets the semantic-spatial mismatch the authors describe, and the construction of the tokens as addressable indices is a straightforward way to give the model a learnable link to visual features.

The weak point is the evaluation. Every reported number comes from ComposerGCoT, which the authors created for this paper to support holistic assessment. The stress-test concern holds: without an external benchmark or human-annotated control set, it is possible the synthesis procedure favors the proxy-token representation. The abstract supplies no error bars, baseline details, or statistical tests, so the +9 point claim is hard to assess for robustness.

The work is aimed at researchers who care about interpretability in MLLMs for VQA. Someone already thinking about grounding mechanisms will get value from the proxy-token formulation even if the experiments need tightening. The paper shows clear thinking on the problem and honest engagement with the limitation of prior coordinate methods.

Send it to peer review so the dataset construction and generalization can be checked properly.

Referee Report

3 major / 1 minor

Summary. The paper proposes Composer, an MLLM for grounded visual reasoning that replaces textual coordinate-based grounding with learned discrete proxy-tokens that index the image latent space. It introduces the ComposerGCoT dataset, synthesized to jointly evaluate reasoning consistency and grounding accuracy, and reports that Composer matches its coordinate-based counterpart in final-answer accuracy while improving grounding accuracy by 9 points.

Significance. If the central empirical claim holds after addressing evaluation details, the work would establish that proxy-tokens supply a learnable semantic link between textual rationales and visual features, offering a concrete alternative to coordinate-based grounding. The explicit construction of a dataset for holistic assessment of both reasoning and grounding is a constructive contribution that could be reused by others.

major comments (3)

[Abstract and Experiments] Abstract and Experiments section: the +9.0-point grounding-accuracy gain is stated without error bars, variance across runs, or statistical tests, and without describing the exact metric or baseline implementations; this information is required to determine whether the reported parity in final-answer accuracy and the grounding improvement are robust rather than artifacts of a single run on a synthesized dataset.
[ComposerGCoT dataset] ComposerGCoT dataset construction (referenced in abstract): the protocol for synthesizing the dataset is not specified in sufficient detail (e.g., how questions, rationales, and ground-truth regions are generated and how proxy-token compatibility is ensured), leaving open the possibility that the data-generation pipeline correlates with the proxy-token representation and inflates the grounding metric.
[Experiments] Experiments section: no results are reported on any external, human-annotated grounding benchmark or on a held-out distribution distinct from the synthesis procedure; without such controls the claim that proxy-tokens capture spatial semantics more effectively than coordinates cannot be separated from dataset-specific effects.

minor comments (1)

[Method] Clarify the precise definition and dimensionality of proxy-tokens in the method section so that readers can reproduce the indexing mechanism.
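The report asks for the exact grounding metric, which the abstract does not define. Purely as an illustration of what such a definition could look like when regions are sets of proxy-token indices, the sketch below scores each prediction by set intersection-over-union and counts it as correct above a threshold. Both the metric and the 0.5 threshold are assumptions borrowed from common box-IoU practice, not the paper's stated protocol.

```python
def set_iou(predicted, gold) -> float:
    """Intersection-over-union between two sets of visual-token indices."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        return 1.0  # both regions empty: treat as a perfect match
    return len(predicted & gold) / len(union)


def grounding_accuracy(predictions, references, threshold: float = 0.5) -> float:
    """Share of examples whose predicted region reaches the IoU threshold.

    Hypothetical metric: the threshold of 0.5 is an assumption for illustration.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(set_iou(p, g) >= threshold for p, g in zip(predictions, references))
    return hits / len(references)


# Toy data, invented for this example only.
predictions = [{1, 2, 3}, {9}]
references = [{2, 3, 4}, {10}]
print(set_iou(predictions[0], references[0]))        # 0.5
print(grounding_accuracy(predictions, references))   # 0.5
```

Stating the metric at this level of detail would let readers check whether the same scoring rule can be applied fairly to coordinate outputs, for example by rasterizing each predicted box onto the same token grid.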

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and robustness where appropriate.

Referee: [Abstract and Experiments] Abstract and Experiments section: the +9.0-point grounding-accuracy gain is stated without error bars, variance across runs, or statistical tests, and without describing the exact metric or baseline implementations; this information is required to determine whether the reported parity in final-answer accuracy and the grounding improvement are robust rather than artifacts of a single run on a synthesized dataset.

Authors: We agree that the reported results require additional statistical support. In the revised manuscript we will report grounding accuracy averaged over multiple independent runs with standard deviations, include statistical significance tests comparing Composer to the coordinate baseline, and explicitly define the grounding accuracy metric together with implementation details for all baselines. revision: yes

Referee: [ComposerGCoT dataset] ComposerGCoT dataset construction (referenced in abstract): the protocol for synthesizing the dataset is not specified in sufficient detail (e.g., how questions, rationales, and ground-truth regions are generated and how proxy-token compatibility is ensured), leaving open the possibility that the data-generation pipeline correlates with the proxy-token representation and inflates the grounding metric.

Authors: Section 4.1 already outlines the synthesis steps, but we acknowledge the need for greater detail. The revised manuscript will expand this section with the precise generation procedure for questions, rationales and ground-truth regions, as well as the checks performed to maintain compatibility with the proxy-token vocabulary. revision: yes

Referee: [Experiments] Experiments section: no results are reported on any external, human-annotated grounding benchmark or on a held-out distribution distinct from the synthesis procedure; without such controls the claim that proxy-tokens capture spatial semantics more effectively than coordinates cannot be separated from dataset-specific effects.

Authors: ComposerGCoT was deliberately constructed to jointly measure reasoning consistency and grounding accuracy, a combination not present in existing human-annotated benchmarks. We will add an explicit limitations paragraph acknowledging the absence of external-benchmark results and will discuss why the controlled synthesis isolates the effect of the grounding mechanism; we do not currently possess results on held-out human-annotated data. revision: partial
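The statistical support promised in the first response can be sketched as follows: a mean and standard deviation over independent runs, plus a paired bootstrap over per-example correctness. This is one common way to do it, not a procedure the paper commits to; the function names are hypothetical and all numbers are toy values invented for the example, not results from the paper.

```python
import random
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap(correct_a, correct_b, n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples in which system A does NOT beat system B.

    `correct_a` and `correct_b` hold 0/1 correctness for the same examples.
    A small return value suggests A's advantage is stable under resampling.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]
        delta = sum(correct_a[i] - correct_b[i] for i in sample)
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy values, invented for illustration only.
run_scores = [0.70, 0.72, 0.71]
mean, std = mean_and_std(run_scores)
print(f"{mean:.3f} +/- {std:.3f}")

system_a = [1] * 8 + [0] * 2   # hypothetical per-example hits for one model
system_b = [1] * 5 + [0] * 5   # hypothetical per-example hits for the baseline
print(paired_bootstrap(system_a, system_b))
```

Pairing the resamples matters here because both models are evaluated on the same ComposerGCoT examples, so per-example differences are more informative than comparing two independent accuracy figures.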

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

The paper introduces Composer, a new MLLM with proxy-token grounding, and reports empirical results (parity in final-answer accuracy, +9.0 grounding improvement) on the author-synthesized ComposerGCoT dataset. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The dataset construction is explicitly for evaluation and does not reduce the accuracy metrics to inputs by construction. No uniqueness theorems, ansatzes via citation, or renaming of known results are present. The central claims rest on direct experimental comparison rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified effectiveness of proxy-tokens as semantic pointers and on the assumption that the new dataset measures grounding fidelity without circularity or bias; no free parameters are named in the abstract, but the learned tokens themselves function as fitted entities.

axioms (1)

domain assumption: Learned proxy-tokens create a semantic link to visual features that textual coordinates lack, thereby reducing hallucinations.
This premise is required for the claim that the new mechanism improves grounding accuracy over the coordinate baseline.

invented entities (1)

proxy-tokens (no independent evidence)
purpose: Discrete symbolic pointers that index the image latent space and allow manipulation of visual regions as addressable semantic sets.
Introduced as the core novel mechanism; no independent evidence outside the paper is supplied in the abstract.

Reference graph

Works this paper leans on

37 extracted references · 1 canonical work pages · 1 internal anchor

[1]

INTRODUCTION. Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks [1, 2, 3], most notably in Visual Question Answering (VQA). However, despite their impressive performance, these models operate under a monolithic paradigm that directly maps visual and textual inputs to a final answer without expo...

[4]

Fig. 1. (left) Visual Grounding via Proxy-Token Indexing. We introduce a visual grounding mechanism that uses learned proxy-tokens to index the image latent space

Fig. 1. (left) Visual Grounding via Proxy-Token Indexing. We introduce a visual grounding mechanism that uses learned proxy-tokens to index the image latent space. Visual tokens V = {v_1, v_2, ..., v_N} are mapped to learnable proxy-tokens R = {r_1, r_2, ..., r_N} that serve as discrete pointers; the model references specific regions by addressing the...
[5]

Consequently, such models fail to expose their decision path, a critical deficiency for deployment in high-stakes applications [9]

RELATED WORK. Interpretability in the VQA Context. Most existing MLLMs [1, 2, 3] operate under a monolithic paradigm that directly maps visual and textual inputs to a final answer. Consequently, such models fail to expose their decision path, a critical deficiency for deployment in high-stakes applications [9]. To address this, several approaches have ...
[6]

This creates a semantic-spatial gap: the model must reason about locations using linguistic symbols that are fundamentally ungrounded in the visual latent space

METHOD. In standard GVR approaches [4, 5, 6], the LLM processes spatial coordinates as arbitrary textual coordinates, which lack an explicit, learnable correspondence to the underlying image features. This creates a semantic-spatial gap: the model must reason about locations using linguistic symbols that are fundamentally ungrounded in the visual latent...
[7]

Implementation Details Composer follows the encoder-projector-LLM paradigm

EXPERIMENTS. 4.1. Implementation Details. Composer follows the encoder-projector-LLM paradigm. We employ CLIP ViT-L/16 [28] as the visual encoder, coupled with a two-layer MLP projector that maps visual features into the embedding space of Vicuna-v1.5-7B [29]. To maintain a 1-to-1 mapping with the 256 visual tokens produced by CLIP ViT-L/16, we expanded the...
[8]

Central to our approach is a novel visual grounding mechanism based on learned proxy-tokens, which establishes an explicit link between linguistic reasoning and visual features

CONCLUSION. In this work, we introduce Composer, a GVR-capable model that bridges the gap between reasoning accuracy and faithful interpretability. Central to our approach is a novel visual grounding mechanism based on learned proxy-tokens, which establishes an explicit link between linguistic reasoning and visual features. To rigorously validate this para...
[9]

Visual instruction tuning,

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, “Visual instruction tuning,” in NeurIPS'23
[10]

Improved baselines with visual instruction tuning,

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee, “Improved baselines with visual instruction tuning,” in CVPR'24
[11]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML'23
[12]

Grounded chain-of-thought for multimodal large language models,

Qiong Wu, Xiangcong Yang, Yiyi Zhou, et al., “Grounded chain-of-thought for multimodal large language models,” arXiv, 2025

2025
[13]

Visually interpretable subtask reasoning for visual question answering,

Yu Cheng, Arushi Goel, and Hakan Bilen, “Visually interpretable subtask reasoning for visual question answering,” in CVPR'25
[14]

Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models,

Zejun Li, Ruipu Luo, Jiwen Zhang, et al., “Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models,” in NAACL'25
[15]

Groma: Localized visual tokenization for grounding multimodal large language models,

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi, “Groma: Localized visual tokenization for grounding multimodal large language models,” in ECCV'24
[16]

Grounded reinforcement learning for visual reasoning,

Gabriel Herbert Sarch, Snigdha Saha, et al., “Grounded reinforcement learning for visual reasoning,” in NeurIPS'25
[17]

Explain before you answer: A survey on compositional visual reasoning,

Fucai Ke, Joy Hsu, Zhixi Cai, et al., “Explain before you answer: A survey on compositional visual reasoning,” arXiv, 2025

2025
[18]

Chain-of-thought prompting elicits reasoning in large language models,

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” NeurIPS'22
[19]

Llava-cot: Let vision language models reason step-by-step,

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan, “Llava-cot: Let vision language models reason step-by-step,” in ICCV'25
[20]

Multimodal chain-of-thought reasoning in language models,

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola, “Multimodal chain-of-thought reasoning in language models,” TMLR'24
[21]

Visual programming: Compositional visual reasoning without training,

Tanmay Gupta and Aniruddha Kembhavi, “Visual programming: Compositional visual reasoning without training,” in CVPR'23
[22]

Vipergpt: Visual inference via python execution for reasoning,

Dídac Surís, Sachit Menon, and Carl Vondrick, “Vipergpt: Visual inference via python execution for reasoning,” in CVPR'23
[23]

Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models,

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei, “Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models,”
[24]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models,

Yushi Hu, Weijia Shi, Xingyu Fu, et al., “Visual sketchpad: Sketching as a visual chain of thought for multimodal language models,” in NeurIPS'24
[25]

Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers,

Zhaochen Su, Peng Xia, et al., “Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers,” arXiv, 2025

2025
[26]

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

“Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” in NeurIPS'24
[27]

Kosmos-2: Grounding multimodal large language models to the world,

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei, “Kosmos-2: Grounding multimodal large language models to the world,” arXiv, 2023

2023
[28]

Florence-2: Advancing a unified representation for a variety of vision tasks,

Bin Xiao, Haiping Wu, Weijian Xu, et al., “Florence-2: Advancing a unified representation for a variety of vision tasks,” in CVPR'24
[29]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in ACL'18
[30]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,

Bryan A Plummer, Liwei Wang, Chris M Cervantes, et al., “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in CVPR'15
[31]

Microsoft coco: Common objects in context,

Tsung-Yi Lin, Michael Maire, Serge Belongie, et al., “Microsoft coco: Common objects in context,” in ECCV'14
[32]

Referitgame: Referring to objects in photographs of natural scenes,

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in EMNLP'24
[33]

Visual genome: Connecting language and vision using crowdsourced dense image annotations,

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV'17
[34]

Visual spatial reasoning,

Fangyu Liu, Guy Emerson, and Nigel Collier, “Visual spatial reasoning,” in TACL'23
[35]

Gqa: A new dataset for real-world visual reasoning and compositional question answering,

Drew A Hudson and Christopher D Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR'19
[36]

Learning transferable visual models from natural language supervision,

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, et al., “Learning transferable visual models from natural language supervision,” in ICML'21
[37]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” arXiv, 2023.

A. ADDITIONAL TRAINING DETAILS. Table 4 summarizes the detailed hyperparameters used for training both Composer variants. Table 3 details the training durations for Composer variants. Both variants were trained on...

2023
