
arXiv: 2606.23885 · v1 · pith:YALQSBHBnew · submitted 2026-06-22 · cs.CV · cs.AI · cs.CL · cs.MM

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Pith reviewed 2026-06-26 08:45 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.MM
keywords multimodal LLMs · representation alignment · attention heads · topological alignment · visual hallucinations · contrastive objective · MKNN metric

The pith

Aligning the least-aligned attention heads via local topology matching improves MLLM vision performance and reduces hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multimodal large language models improve when alignment with a vision encoder is applied at the level of individual attention heads instead of whole layers. Alignment uses a contrastive objective that matches local neighborhood structures between modalities rather than exact vector values. The central finding is that selecting the heads with the worst initial alignment scores produces the largest gains. This change boosts results on vision-centric benchmarks and reduces the models' tendency to generate answers that ignore the image in favor of language patterns.

Core claim

Head-Wise Representation Alignment (HeRA) measures how well each attention head preserves local neighborhood relationships using the Mutual K-Nearest Neighbor metric, selects the heads with the lowest scores, and applies a contrastive loss during multimodal training to enforce matching neighborhood structure with the vision encoder. Experiments across several MLLMs and 18 benchmarks show that this targeted head alignment raises accuracy on challenging vision tasks and lowers visual hallucinations by decreasing over-reliance on linguistic priors.

What carries the argument

Mutual K-Nearest Neighbor (MKNN) metric used both to score head alignment and to define the contrastive objective that preserves local neighborhood topology across modalities.
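
As a rough illustration of what an MKNN-style score measures, the sketch below computes the average overlap between each sample's k-nearest-neighbor set in two feature spaces. It assumes cosine similarity and plain Python lists; the function names (`cosine`, `knn_sets`, `mutual_knn`) are hypothetical and are not taken from the paper's released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_sets(feats, k):
    """For each row, the index set of its k most cosine-similar other rows."""
    out = []
    for i, u in enumerate(feats):
        ranked = sorted(
            (j for j in range(len(feats)) if j != i),
            key=lambda j: cosine(u, feats[j]),
            reverse=True,
        )
        out.append(set(ranked[:k]))
    return out


def mutual_knn(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbors shared across two feature spaces.

    Row i of feats_a and row i of feats_b must describe the same sample.
    """
    assert len(feats_a) == len(feats_b) and 0 < k < len(feats_a)
    neighbors_a = knn_sets(feats_a, k)
    neighbors_b = knn_sets(feats_b, k)
    shared = sum(len(a & b) for a, b in zip(neighbors_a, neighbors_b))
    return shared / (k * len(feats_a))
```

A score of 1.0 means every sample has the same k neighbors in both spaces, and 0.0 means no neighbors are shared. Because only neighbor identity matters, the two spaces may differ in dimensionality, which is what allows a head's features to be compared with a vision encoder's.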

If this is right

  • Targeting the least aligned heads produces larger gains than targeting already well-aligned heads.
  • The method reduces visual hallucinations by limiting the model's dependence on language priors.
  • Performance improvements hold across multiple MLLM architectures on 18 different benchmarks.
  • The head-specific alignment can be added to existing multimodal training without altering model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Head-level topological alignment could be tested in other cross-modal combinations such as vision-audio or language-audio models.
  • Tracking alignment scores during training might allow dynamic selection of which heads to regularize at different stages.
  • If mismatches at specific heads cause hallucinations, then similar targeted fixes could address other failure modes like factual errors.

Load-bearing premise

That preserving local neighborhood topology through MKNN matching in the least-aligned heads is the main driver of gains rather than other training details such as loss weighting or which heads are chosen.

What would settle it

A controlled experiment that applies the same contrastive loss to the most-aligned heads or to randomly chosen heads and still matches or exceeds the reported gains on the same vision benchmarks would falsify the claim that least-aligned heads are critical.
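
The control arms described above could be set up with a small selection helper like the following sketch. It assumes per-head scores are already available as a dictionary keyed by `(layer, head)`; the name `select_heads`, the policy labels, and the seed handling are illustrative choices, not part of the paper.

```python
import random


def select_heads(scores, n, policy, seed=0):
    """Pick n (layer, head) keys from a score dict under a selection policy.

    policy is "least" (lowest score), "most" (highest score), or "random".
    """
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of heads")
    keys = sorted(scores)  # fixed base order so ties and sampling are reproducible
    if policy == "least":
        return sorted(keys, key=lambda h: scores[h])[:n]
    if policy == "most":
        return sorted(keys, key=lambda h: scores[h], reverse=True)[:n]
    if policy == "random":
        return random.Random(seed).sample(keys, n)
    raise ValueError(f"unknown policy: {policy}")
```

Running the same training recipe with each policy, and the same `n`, keeps the number of regularized heads constant so that only the selection rule varies between arms.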

Figures

Figures reproduced from arXiv: 2606.23885 by Alberto Compagnoni, Davide Caffagni, Federico Melis, Lorenzo Baraldi, Marcella Cornia, Mark Granroth-Wilding, Pier Luigi Dovesi, Sara Sarto.

Figure 1: Standard representation alignment imposes strict vision-language feature matching (…
Figure 2: Overview of HeRA. Alongside the standard language modeling objective (L_LM), HeRA employs a contrastive loss (L_HeRA) to pull representations from selected LLM attention heads closer to their k-nearest neighbors (Top-k), computed in the latent space of a frozen teacher vision encoder.
Figure 3: Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-3B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA.
Figure 4: VQA results of HeRA with different teacher vision encoders.
Figure 5: Effect of the multimodal training of LLaVA and representation alignment methods on the …
Figure 6: Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-7B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA.
Figure 7: Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-14B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA.
Figure 8: Comparison of MKNN scores of the Worst-5 and Top-5 heads after the second training …
Figure 9: VQA results on Qwen3-VL-4B after fine-tuning, with and without the HeRA objective.
Figure 10: Qualitative comparison of LLaVA [19], ROSS [38], and HeRA using Qwen3-8B and SigLIP2. We present representative samples across all Cambrian categories: General, Knowledge, OCR, and Vision-Centric.
Figure 11: Failure cases of HeRA on VQA tasks.
Original abstract

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
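
To make the idea of a contrastive proxy for neighborhood matching concrete, here is a minimal sketch of one plausible form of such a loss. It is an assumption, not the paper's confirmed formula: positives for each sample are its k nearest neighbors in the teacher space, and the student is scored by the mean log-softmax it assigns to those positives. The names `neighborhood_contrastive_loss` and `temperature` are hypothetical, and the code returns a plain Python float for readability; real training would express the same computation in an autodiff framework with the teacher frozen.

```python
import math


def _unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def _sims(feats):
    """Pairwise cosine-similarity matrix for a batch of vectors."""
    unit = [_unit(v) for v in feats]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def neighborhood_contrastive_loss(student, teacher, k, temperature=0.1):
    """Reward the student for ranking each sample's teacher-space
    k nearest neighbors above the rest of the batch."""
    n = len(student)
    assert n == len(teacher) and 0 < k < n
    s_sim, t_sim = _sims(student), _sims(teacher)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Positives are defined only by the teacher space.
        positives = sorted(others, key=lambda j: t_sim[i][j], reverse=True)[:k]
        logits = {j: s_sim[i][j] / temperature for j in others}
        m = max(logits.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Negative mean log-probability of the teacher's neighbors.
        total += -sum(logits[j] - log_z for j in positives) / k
    return total / n
```

The loss never compares student and teacher vectors directly, so the two spaces need not share a dimension or a coordinate system; only the neighbor structure is transferred.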

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Head-Wise Representation Alignment (HeRA) for Multimodal LLMs. It selects attention heads by their MKNN alignment score, applies a contrastive loss that serves as a differentiable proxy for preserving local neighborhood topology across modalities, and reports that aligning the least-aligned heads produces the largest gains. The method is evaluated on multiple MLLMs across 18 benchmarks, with claimed improvements on vision-centric tasks and reduced visual hallucinations; code is released.

Significance. If the results hold after the requested controls, the work would demonstrate that per-head topological alignment can outperform conventional layer-wise approaches and act as a regularizer against linguistic priors. The public code release and multi-model, multi-benchmark evaluation are positive factors for verifiability.

major comments (3)
  1. [§3.2 and §4] §3.2 (MKNN contrastive objective) and §4 (experimental results): the central claim that MKNN topology preservation is the operative mechanism requires an ablation that applies an equivalent-magnitude contrastive loss to the same least-aligned heads without the MKNN term. The current results do not isolate whether gains arise from the topological proxy, the selection heuristic, or loss weighting.
  2. [§4] §4 (head selection and evaluation): the exact threshold or number of heads chosen by the MKNN score is not stated, nor are statistical significance tests or confidence intervals reported for the gains across the 18 benchmarks. These omissions are load-bearing for assessing whether the counter-intuitive finding (least-aligned heads yield largest gains) is robust.
  3. [§4] §4 (baselines): the experimental setup does not clarify whether the reported baselines include recent head-wise or layer-wise alignment methods that also target internal representations; without this, the relative contribution of HeRA cannot be fully evaluated.
minor comments (2)
  1. [Abstract and §3] The abstract and method section could explicitly state the typical number of heads aligned and the value of the contrastive loss weight used in the reported experiments.
  2. [Figure 2 / results tables] Figure 2 or the corresponding results table would benefit from error bars or per-benchmark variance to support the claim of consistent gains.
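
The confidence intervals requested in major comment 2 could be obtained with a paired bootstrap over benchmarks, sketched below. This is one common choice rather than a procedure the paper or the referee specifies; `paired_bootstrap_ci` and its parameters are hypothetical names, and the inputs are expected to be per-benchmark scores listed in the same order for both systems.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, treated, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean per-benchmark gain.

    Returns (mean_gain, lower, upper) for treated minus baseline.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    # Resample benchmarks with replacement and record the mean gain each time.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lower, upper
```

An interval that excludes zero would support a consistent gain across the benchmark suite. With only 18 benchmarks the interval will be wide, so per-benchmark variance over training seeds remains a useful complement.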

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental validation and clarity.

  1. Referee: [§3.2 and §4] §3.2 (MKNN contrastive objective) and §4 (experimental results): the central claim that MKNN topology preservation is the operative mechanism requires an ablation that applies an equivalent-magnitude contrastive loss to the same least-aligned heads without the MKNN term. The current results do not isolate whether gains arise from the topological proxy, the selection heuristic, or loss weighting.

    Authors: We agree that an ablation isolating the MKNN term is necessary to substantiate the topological mechanism. In the revision we will add this control: a standard contrastive loss (InfoNCE without neighborhood matching) applied to the identical set of least-aligned heads at equivalent magnitude, allowing direct comparison of whether the MKNN-based topological proxy drives the gains beyond selection or weighting effects. revision: yes

  2. Referee: [§4] §4 (head selection and evaluation): the exact threshold or number of heads chosen by the MKNN score is not stated, nor are statistical significance tests or confidence intervals reported for the gains across the 18 benchmarks. These omissions are load-bearing for assessing whether the counter-intuitive finding (least-aligned heads yield largest gains) is robust.

    Authors: We will explicitly state the MKNN-score threshold and resulting number of heads selected in §4. We will also add paired statistical significance tests and 95% confidence intervals for all reported gains across the 18 benchmarks to demonstrate robustness of the least-aligned-heads result. revision: yes

  3. Referee: [§4] §4 (baselines): the experimental setup does not clarify whether the reported baselines include recent head-wise or layer-wise alignment methods that also target internal representations; without this, the relative contribution of HeRA cannot be fully evaluated.

    Authors: The current baselines follow the standard layer-wise alignment protocols cited in the manuscript. We will add an explicit clarification of the exact baseline implementations and, where feasible within compute limits, include additional comparisons against recent head-wise representation alignment methods in the revised §4. revision: partial
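
The control promised in the first response, a standard InfoNCE loss without neighborhood matching, might look like the sketch below. It assumes the head features have already been projected to the teacher's dimensionality (the projector is not shown), that row i of each input describes the same sample, and that the only positive for sample i is its own teacher embedding. The name `info_nce` and the `temperature` default are illustrative.

```python
import math


def _unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def info_nce(student, teacher, temperature=0.1):
    """Instance-level InfoNCE: each student row must identify its own
    teacher row among all teacher rows in the batch."""
    n = len(student)
    assert n == len(teacher) and n > 1
    s = [_unit(v) for v in student]
    t = [_unit(v) for v in teacher]
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)
        ]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / n
```

Unlike a neighborhood-based objective, this loss ties each student vector to one specific teacher vector, so comparing the two on the same heads and at matched loss magnitude would help separate the effect of topology matching from that of contrastive regularization in general.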

Circularity Check

0 steps flagged

No significant circularity: empirical gains from new MKNN objective do not reduce to fitted inputs or self-citations

full rationale

The paper defines HeRA as a new contrastive objective using the MKNN metric to align selected attention heads during training, then reports empirical performance on 18 external benchmarks. No equation shows a 'prediction' that equals a fitted parameter by construction, nor does the central claim (largest gains from least-aligned heads) reduce to the selection rule itself. The Platonic Representation Hypothesis is invoked as grounding but is not a self-citation chain that forbids alternatives or forces the result. The method remains falsifiable on held-out tasks without the reported gains being tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the Platonic Representation Hypothesis as a domain assumption and uses standard contrastive learning machinery; no new entities are postulated and free parameters appear limited to loss weighting and head count, which are typical in such methods.
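
For orientation, the two free parameters can be read off a combined objective of the following shape. This is a sketch under stated assumptions: the additive form, the symbol \lambda for the loss weight, the set \mathcal{H} of selected heads, and the averaging over heads are not confirmed by the abstract, which only states that the contrastive loss is applied alongside the language modeling objective.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{LM}}
  + \lambda \, \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \mathcal{L}_{\text{HeRA}}^{(h)}
```

Under this reading, the head count is |\mathcal{H}| and the contrastive loss weight is \lambda.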

free parameters (2)
  • number of heads to align
    Selected according to MKNN alignment score; exact threshold or count not specified in abstract but is a tunable choice.
  • contrastive loss weight
    Standard hyperparameter controlling strength of the alignment objective during multimodal training.
axioms (1)
  • domain assumption Platonic Representation Hypothesis: representations of different modalities share topological structure (local neighborhood relationships)
    Explicitly invoked to justify focusing on topology preservation via MKNN rather than other alignment metrics.

pith-pipeline@v0.9.1-grok · 5780 in / 1300 out tokens · 16191 ms · 2026-06-26T08:45:11.526331+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages · 13 internal anchors

  1. [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025
  2. [2] Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, and Rita Cucchiara. Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models. arXiv preprint arXiv:2512.15885, 2025
  3. [3] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023
  4. [4] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling Language-Free Visual Representation Learning. In ICCV, 2025
  5. [5] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394, 2023
  6. [6] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal Large Language Models Can See But Not Perceive. In ECCV, 2024
  7. [7] Yulu Gan, Kaiya Ivy Zhao, and Phillip Isola. Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations. In ICLR Workshops, 2025
  8. [8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024
  9. [9] Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the Platonic Representation Hypothesis: An Aristotelian View. In ICML, 2026
  10. [10] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. In CVPR, 2024
  11. [11] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz Grand Challenge: Answering Visual Questions From Blind People. In CVPR, 2018
  12. [12] Drew A Hudson and Christopher D Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In CVPR, 2019
  13. [13] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The Platonic Representation Hypothesis. In ICML, 2024
  14. [14] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016
  15. [15] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers. In ICCV, 2025
  16. [16] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125, 2023
  17. [17] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. In EMNLP, 2023
  18. [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014
  19. [19] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. In CVPR, 2024
  20. [20] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is Your Multi-modal Model an All-around Player? In ECCV, 2024
  21. [21] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models. Sci China Inf Sci, 67(12):220102, 2024
  22. [22] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In ICLR, 2024
  23. [23] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS, 2022
  24. [24] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL, 2022
  25. [25] Andrew Joohun Nam, Henry Conklin, Yukang Yang, Thomas L Griffiths, Jonathan D Cohen, and Sarah-Jane Leslie. Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers. In NeurIPS, 2025
  26. [26] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-Context Learning and Induction Heads. arXiv preprint arXiv:2209.11895, 2022
  27. [27] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018
  28. [28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR, pages 1–31, 2024
  29. [29] Qwen Team. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024
  30. [30] Qwen Team. Qwen3.5: Towards Native Multimodal Agents, 2026
  31. [31] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel Grounding Large Multimodal Model. In CVPR, 2024
  32. [32] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025
  33. [33] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In CVPR, 2019
  34. [34] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In NeurIPS, 2024
  35. [35] Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond Language Modeling: An Exploration of Multimodal Pretraining. arXiv preprint arXiv:2603.03276, 2026
  36. [36] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In CVPR, 2024
  37. [37] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786, 2025
  38. [38] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive Visual Instruction Tuning. In ICLR, 2025
  39. [39] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. arXiv preprint arXiv:2311.07397, 2023
  40. [40] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. In ICLR, 2023
  41. [41] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025
  42. [42] Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. FineVision: Open Data Is All You Need. arXiv preprint arXiv:2510.17269, 2025
  43. [43] Penghao Wu and Saining Xie. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. In CVPR, 2024
  44. [44] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025
  45. [45] Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual Representation Alignment for Multimodal Large Language Models. arXiv preprint arXiv:2509.07979, 2025
  46. [46] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. In ICLR, 2025
  47. [47] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In CVPR, 2024
  48. [48] Zihao Yue, Liang Zhang, and Qin Jin. Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective. In ACL, 2024

A Additional Implementation Details

Training Details. Following the two-stage training recipe of LLaVA-1.5 [19], in the first stage, we train only the projector proj, a two-layer MLP, using 558k image-caption pairs, whil...