Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

Mahtab Bigverdi; Ranjay Krishna; Tianyi Zhang

arxiv: 2605.21642 · v1 · pith:LGBI7AJKnew · submitted 2026-05-20 · 💻 cs.CV


Pith reviewed 2026-05-22 09:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · latent tokens · continuous thought tokens · ablation · token replacement test · visual reasoning · information bottleneck · diagnostics


The pith

Vision-language models retain most of their accuracy gains even when the content of their continuous thought tokens is corrupted or replaced.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models are often given extra continuous or latent tokens to support visual thinking, leading to better task accuracy. The question is whether these tokens are actually being used for reasoning or if the gains come from other factors like added context. This paper introduces the Ablate-to-Validate principle and the Token Replacement Test, which keeps everything fixed but replaces the tokens with zeros, random values, repeats, or oracles. In experiments across controlled tasks and benchmarks, performance largely persists despite content corruption. This reveals that having a latent channel does not mean it is used as an information bottleneck.

Core claim

The central finding is that VLMs retain most of their accuracy improvement even when token content is corrupted or replaced, showing a persistent gap between having a latent channel and using it as an information bottleneck. This holds on a controlled relative depth reasoning task across different encoders and token budgets, and also for off-the-shelf systems on multiple visual benchmarks.

What carries the argument

The Token Replacement Test (TRT), a suite of content-replacement ablations that hold the prompt, image, token budget, and decoding fixed while replacing intermediate tokens to isolate whether performance depends on token content.
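
The replacement step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `replace_span`, its `mode` names, and the list-of-vectors representation are hypothetical; "random" is drawn from a standard normal (the paper's distribution is not specified here); and "first-repeat" is read as copying the first token across the span. The point it shows is that span length and vector dimension never change, so only content varies between conditions.

```python
import random
from typing import List, Optional

Vector = List[float]


def replace_span(
    span: List[Vector],
    mode: str,
    oracle: Optional[List[Vector]] = None,
    seed: int = 0,
) -> List[Vector]:
    """Return a new thought-token span with the same shape but replaced content.

    Hypothetical helper: the token budget (len(span)) and vector dimension are
    preserved in every mode, so only the content differs across conditions.
    """
    if not span:
        return []
    n, d = len(span), len(span[0])
    if mode == "original":
        return [list(v) for v in span]
    if mode == "zero":
        return [[0.0] * d for _ in range(n)]
    if mode == "random":
        # Assumption: standard-normal noise stands in for "random" content.
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    if mode == "first_repeat":
        # Assumption: "first-repeat" copies the first token to every position.
        return [list(span[0]) for _ in range(n)]
    if mode == "oracle":
        if oracle is None or len(oracle) != n or any(len(v) != d for v in oracle):
            raise ValueError("oracle span must match the original shape")
        return [list(v) for v in oracle]
    raise ValueError(f"unknown mode: {mode}")
```

In a real evaluation the returned span would be fed back to the model in place of the predicted tokens, with prompt, image, and decoding settings untouched.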

If this is right

Accuracy gains alone cannot confirm that latent tokens are used for reasoning.
Any new method introducing continuous thought tokens should be tested with TRT alongside accuracy metrics.
Models show a gap between possessing a latent channel and actually using the content as an information bottleneck.
The finding applies across trained and off-the-shelf visual-thinking systems on multiple benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers of latent token methods may need to add training objectives that force reliance on the token content rather than presence.
Similar replacement tests could be useful for checking if extra parameters or modules in other AI systems are genuinely utilized.
This highlights a broader challenge in AI interpretability where models may exploit superficial cues instead of intended mechanisms.

Load-bearing premise

The ablations isolate token content usage without introducing new confounds like changes in effective context or decoding dynamics that could explain retained performance.

What would settle it

A significant drop in task accuracy when replacing latent token content with random or zero values, while keeping token positions, count, and all other inputs identical.
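
One way to make "a significant drop" concrete is to report the fraction of the accuracy gain that survives replacement. The sketch below is an illustration, not a metric defined in the paper: `retained_gain` and its argument names are hypothetical, and the accuracies in the usage lines are made-up numbers, not reported results.

```python
def retained_gain(acc_base: float, acc_full: float, acc_ablated: float) -> float:
    """Fraction of the latent-token accuracy gain that survives content replacement.

    acc_base:    accuracy without latent tokens (hypothetical baseline)
    acc_full:    accuracy with the model's own latent tokens
    acc_ablated: accuracy with replaced (zero/random/repeated) token content
    """
    gain = acc_full - acc_base
    if gain <= 0:
        raise ValueError("no gain over baseline to retain")
    return (acc_ablated - acc_base) / gain


# Illustrative numbers only: a value near 1.0 means the gain survives corruption,
# a value near 0.0 means the gain depends on token content.
fraction = retained_gain(acc_base=0.50, acc_full=0.70, acc_ablated=0.68)
print(f"retained gain: {fraction:.2f}")
```

Under this reading, the paper's claim corresponds to retention staying high across replacement types, and the settling observation above corresponds to retention falling toward zero.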

Figures

Figures reproduced from arXiv: 2605.21642 by Mahtab Bigverdi, Ranjay Krishna, Tianyi Zhang.

**Figure 1.** Overview of the Token Replacement Test (TRT). TRT replaces the intermediate thought-token span while fixing the prompt, im…

Original abstract

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean diagnostic showing VLMs keep most accuracy gains even after swapping continuous token content, but the replacements may still shift attention and context dynamics.


The core takeaway is that accuracy numbers alone do not prove VLMs are using the actual content inside those added continuous tokens. The authors show that performance largely survives when the tokens are replaced by zeros, random vectors, repeats, or even oracle values, while keeping prompt, image, budget, and decoding fixed. This suggests a gap between having a latent channel and actually routing information through it as a bottleneck.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Ablate-to-Validate principle and its Token Replacement Test (TRT) to test whether VLMs use the content of continuous or latent thought tokens for reasoning. It applies controlled replacements (zero, random, first-repeat, oracle), with prompt, image, token budget, and decoding held fixed, to LLaVA-13B and Qwen2.5-VL-3B trained for relative depth reasoning with SigLIP2, CLIP, and DINOv2 encoders, and to off-the-shelf systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. The central finding is that accuracy gains largely persist under content corruption, indicating a gap between possessing a latent channel and using it as an information bottleneck.

Significance. If the results hold, the work is significant for providing a standardized diagnostic that goes beyond accuracy as a proxy for latent reasoning in VLMs. The multi-model, multi-encoder testbed and application to existing visual-thinking systems offer a practical tool that could influence evaluation standards. The emphasis on falsifiable ablations rather than fitted quantities adds methodological value.

major comments (2)

§3.2 (TRT suite): The replacements (zero/random/first-repeat) can alter the input distribution to subsequent layers and attention heads or introduce artificial repetition/positional effects, which may preserve performance through changed computation dynamics rather than demonstrating non-use of token content. This is load-bearing for the claim that retained gains reveal a gap in using the latent channel as bottleneck; additional controls (e.g., matching statistical properties while varying content) are needed.
§5 (off-the-shelf systems): For Mirage, Mull-Tokens, and CoVT, the construction of oracle replacements must be detailed to rule out information leakage; without this, the retained performance on BLINK/VSP/CV-Bench cannot cleanly isolate content utilization from replacement artifacts.

minor comments (2)

Abstract: The phrase 'retain most improvement' should be accompanied by quantitative retention percentages or ranges from the experiments to allow readers to assess the magnitude.
Notation: Early clarification is needed on how 'continuous tokens from frozen encoders' differ operationally from 'discrete depth spans' when both are used in the same testbed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the methodological rigor of the Ablate-to-Validate framework. We address each major comment below with point-by-point responses and indicate planned revisions.


Referee: §3.2 (TRT suite): The replacements (zero/random/first-repeat) can alter the input distribution to subsequent layers and attention heads or introduce artificial repetition/positional effects, which may preserve performance through changed computation dynamics rather than demonstrating non-use of token content. This is load-bearing for the claim that retained gains reveal a gap in using the latent channel as bottleneck; additional controls (e.g., matching statistical properties while varying content) are needed.

Authors: We agree that replacement strategies can introduce distributional shifts and positional artifacts that might influence downstream computation. Our design mitigates this by holding token positions, budgets, and decoding fixed across all conditions, with content as the sole variable. The consistency of retained gains across three qualitatively different replacements (zero vectors, random draws from a broad distribution, and first-token repetition) makes it less likely that results stem from any single artifact, as each method perturbs statistics and repetition patterns differently. Nevertheless, we acknowledge the value of additional controls. In the revision we will add a new paragraph in §3.2 discussing these potential confounds and include an appendix experiment that matches first- and second-order statistics (mean/variance) of the original tokens while randomizing higher-order content, to further isolate semantic utilization.
revision: partial

Referee: §5 (off-the-shelf systems): For Mirage, Mull-Tokens, and CoVT, the construction of oracle replacements must be detailed to rule out information leakage; without this, the retained performance on BLINK/VSP/CV-Bench cannot cleanly isolate content utilization from replacement artifacts.

Authors: We agree that explicit documentation of oracle construction is necessary to rule out leakage. For the off-the-shelf systems, oracle replacements were obtained by running an auxiliary forward pass on the same model using ground-truth annotations (where available) or the model's own intermediate activations from a separate non-test batch, then substituting only the continuous token vectors while keeping all other inputs identical. No test-set labels or images were used to generate these oracles. We will expand the experimental details in §5 and add a dedicated paragraph clarifying this procedure, including pseudocode, to ensure reproducibility and to confirm absence of leakage.
revision: yes
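
The statistics-matched control proposed in the first response (same mean and variance, randomized content) could look roughly like the following. This is a hedged sketch, not the promised appendix experiment: `moment_matched_noise` is a hypothetical helper, it treats the token values as one flat list of scalars (for example, a flattened span or a single coordinate across the span), and it uses Gaussian noise as the replacement content.

```python
import random
from statistics import fmean, pstdev
from typing import List


def moment_matched_noise(values: List[float], seed: int = 0) -> List[float]:
    """Return random values whose mean and population variance match `values`.

    Hypothetical control: content is replaced by Gaussian noise that is
    standardized and then rescaled to the original first and second moments.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in values]
    mu, sigma = fmean(values), pstdev(values)
    noise_mu, noise_sigma = fmean(noise), pstdev(noise)
    if noise_sigma == 0.0:
        # Degenerate case (a single value): only the mean can be matched.
        return [mu for _ in values]
    # Standardize the noise, then rescale to the original mean and std.
    return [mu + sigma * (x - noise_mu) / noise_sigma for x in noise]
```

If accuracy holds under this replacement as well, the distribution-shift explanation raised by the referee becomes harder to sustain; if it drops relative to zero or first-repeat replacement, the earlier conditions were likely confounded by their statistics.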

Circularity Check

0 steps flagged

No circularity: empirical diagnostic relies on direct measurement


The paper introduces Ablate-to-Validate as a diagnostic principle and instantiates it via the Token Replacement Test (TRT), which performs controlled content replacements (zero, random, first-repeat, oracle) while holding prompt, image, token budget, and decoding fixed. Performance is then measured directly on LLaVA-13B, Qwen2.5-VL-3B, and off-the-shelf systems across benchmarks. No equations, fitted parameters, or predictions are defined in terms of the target result; no self-citations are invoked to justify uniqueness theorems or ansatzes; and no known results are merely renamed. The central claim that accuracy gains do not demonstrate content usage follows from the experimental outcomes themselves rather than reducing to inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that token replacements isolate content dependence while holding all other factors fixed; no free parameters or invented entities are described.

axioms (1)

Domain assumption: Token replacement with zero, random, first-repeat, or oracle alternatives isolates whether performance depends on token content rather than token presence or other confounds.
This premise is invoked to interpret retained accuracy as evidence against genuine latent-token reasoning.

pith-pipeline@v0.9.0 · 5826 in / 1250 out tokens · 34171 ms · 2026-05-22T09:14:27.120390+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...

work page

[2] [2]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015. 2

work page 2015

[3] [3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Per- ception tokens enhance visual reasoning in multimodal lan- guage models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3836–3845,

work page

[5] [5]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts.arXiv preprint arXiv:2511.04655, 2025. 3

work page arXiv 2025

[6] [6]

PhD the- sis, MASSACHUSETTS INSTITUTE OF TECHNOLOGY ,

PHILOSO EPHY DO CT OR OF.MACHINE PERCEP- TION OF THREE-DIMENSIONAL, SO LIDS. PhD the- sis, MASSACHUSETTS INSTITUTE OF TECHNOLOGY ,

work page

[7] [7]

Smith, Wei-Chiu Ma, and Ranjay Krishna

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. 2024. 6

work page 2024

[8] [8]

What’s “up” with vision-language models? Investigating their strug- gle with spatial reasoning

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? Investigating their strug- gle with spatial reasoning. InProceedings of the Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), 2023. 3

work page 2023

[9] [9]

ReferItGame: Referring to objects in pho- tographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in pho- tographs of natural scenes. InProceedings of the 2014 Con- ference on Empirical Methods in Natural Language Process- ing (EMNLP), pages 787–798, 2014. 2

work page 2014

[10] [10]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Pro- cessing Systems, 2022. 2

work page 2022

[11] [11]

Latent implicit visual rea- soning.arXiv preprint arXiv:2512.21218, 2025

Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, and Roei Herzig. Latent implicit visual rea- soning.arXiv preprint arXiv:2512.21218, 2025. 3

work page arXiv 2025

[12] [12]

ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025

Linnan Li, Xiaoyu Chen, Peng Chen, et al. ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025. 3

work page arXiv 2025

[13] [13]

Visual spatial reasoning.arXiv preprint arXiv:2205.00363, 2022

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.arXiv preprint arXiv:2205.00363, 2022. 3

work page arXiv 2022

[14] [14]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023. 2, 3, 5

work page 2023

[15] [15]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. 2022. 3

work page 2022

[16] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR),

[17] Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M. de Melo, and Alan Yuille. 3DSRBench: A comprehensive 3D spatial reasoning benchmark

[18] David Marr. Vision: A computational investigation into the human representation and processing of visual information. MIT Press, 2010.

[19] Marvin Minsky and Seymour Papert. An introduction to computational geometry. Cambridge, Mass., MIT, 479(480):104, 1969.

[20] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, P... Dinov2: Learning robust visual features without supervision. 2023.

[21] Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens. 2025.

[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. 2021.

[23] Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, and Wen-Sheng Chu. Mull-tokens: Modality-agnostic latent thinking. arXiv preprint arXiv:2512.10941, 2025.

[24] Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.

[25] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Henaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense feature...

[26] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164,

[27] Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada, 2023. Association for Computa...

[28] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

[29] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2025.

[30] Kaiyu Yang, Olga Russakovsky, and Jia Deng. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. 2019.

[31] Sihan Yang, Runsen Xu, Yiman Xie, et al. MMSI-Bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025.

[32] Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. 2025.

[33] Baiqiao Yin, Qineng Wang, Pingyue Zhang, et al. Spatial mental modeling from limited views. arXiv preprint arXiv:2506.21458, 2025.

[34] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-vet: Evaluating large multimodal models for integrated capabilities. In Proceedings of the 41st International Conference on Machine Learning, pages 57730–57754. PMLR,

[35] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for...

[36] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. 2023.

[37] Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? arXiv preprint arXiv:2410.23856, 2024.

A. Additional Results

A.1. HardBlink additional runs (full blocks)

In Table 8, we present the comprehensive HardBlink average accuracy resu...

Each image is represented with $K = 100$ discrete depth tokens, where $K$ denotes the discrete depth-token budget. The depth projector/head are linear with depth LR $1 \times 10^{-5}$; $\lambda_{\text{depth}} = 1.0$. Qwen2.5-VL-3B (reported 10-epoch setting). Vision encoder frozen; LLM + visual MLP + embeddings are fine-tuned; AdamW with cosine LR; BF16; warmup ratio 0.03; effective batch si...
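As a rough illustration of the hyperparameters listed above, the following Python sketch shows one common way a warmup ratio of 0.03 and a cosine learning-rate schedule can be combined, and how a depth-loss weight of 1.0 might enter a training objective. The function names `lr_at_step` and `total_loss`, the linear warmup shape, the decay floor `min_lr`, and the additive form of the loss are all assumptions made for illustration; the text above does not specify them, and this is not the authors' training code.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    # Number of warmup steps implied by the warmup ratio (at least one).
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def total_loss(text_loss, depth_loss, lambda_depth=1.0):
    """Hypothetical additive objective: text loss plus weighted depth loss."""
    return text_loss + lambda_depth * depth_loss


if __name__ == "__main__":
    # Illustrative values only: 1e-5 is the depth LR quoted above;
    # the total step count is made up for the demonstration.
    for s in (0, 29, 30, 500, 1000):
        print(s, lr_at_step(s, total_steps=1000, base_lr=1e-5))
    print(total_loss(text_loss=2.0, depth_loss=0.5))
```

With `lambda_depth = 1.0`, the two loss terms are weighted equally under this assumed additive form.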