Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Alberto Compagnoni; Davide Caffagni; Federico Melis; Lorenzo Baraldi; Marcella Cornia; Mark Granroth-Wilding; Pier Luigi Dovesi; Sara Sarto

arXiv: 2606.23885 · v1 · pith:YALQSBHBnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI · cs.CL · cs.MM


Davide Caffagni, Alberto Compagnoni, Federico Melis, Sara Sarto, Pier Luigi Dovesi, Mark Granroth-Wilding, Marcella Cornia, Lorenzo Baraldi

Pith reviewed 2026-06-26 08:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.MM

keywords multimodal LLMs · representation alignment · attention heads · topological alignment · visual hallucinations · contrastive objective · MKNN metric


The pith

Aligning the least-aligned attention heads via local topology matching improves MLLM vision performance and reduces hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multimodal large language models improve when alignment with a vision encoder is applied at the level of individual attention heads instead of whole layers. Alignment uses a contrastive objective that matches local neighborhood structures between modalities rather than exact vector values. The central finding is that selecting the heads with the worst initial alignment scores produces the largest gains. This change boosts results on vision-centric benchmarks and reduces the models' tendency to generate answers that ignore the image in favor of language patterns.
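To make "head level" concrete, the sketch below splits one token's concatenated attention output into per-head slices. It is illustrative only: the name `split_heads`, the toy sizes, and the assumption that each head owns a contiguous equal-width slice of the concatenated output are not taken from the paper, which may extract head representations differently.

```python
def split_heads(vector, n_heads):
    """Split a concatenated attention output into n_heads equal slices.

    Assumption (not from the paper): head h owns the contiguous slice
    vector[h * d_head : (h + 1) * d_head], as in standard multi-head attention.
    """
    if len(vector) % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    d_head = len(vector) // n_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(n_heads)]


# Toy example: an 8-dimensional output split across 4 hypothetical heads.
token_output = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(token_output, n_heads=4)
```

Under this reading, a layer-wise method would regularize the whole vector at once, while a head-wise method can score and regularize each slice separately.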

Core claim

Head-Wise Representation Alignment (HeRA) measures how well each attention head preserves local neighborhood relationships using the Mutual K-Nearest Neighbor metric, selects the heads with the lowest scores, and applies a contrastive loss during multimodal training to enforce matching neighborhood structure with the vision encoder. Experiments across several MLLMs and 18 benchmarks show that this targeted head alignment raises accuracy on challenging vision tasks and lowers visual hallucinations by decreasing over-reliance on linguistic priors.

What carries the argument

Mutual K-Nearest Neighbor (MKNN) metric used both to score head alignment and to define the contrastive objective that preserves local neighborhood topology across modalities.
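As a rough illustration of what a mutual k-nearest-neighbor score measures, the sketch below computes, for each sample, the fraction of its k nearest neighbors that are shared between two feature spaces, then averages over samples. The use of cosine similarity, the function names, and the toy features are assumptions made for this example; the paper's exact implementation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_sets(feats, k):
    """For each sample, the index set of its k most similar other samples."""
    result = []
    for i, f in enumerate(feats):
        ranked = sorted(
            (j for j in range(len(feats)) if j != i),
            key=lambda j: (-cosine(f, feats[j]), j),  # ties broken by index
        )
        result.append(set(ranked[:k]))
    return result


def mutual_knn(feats_a, feats_b, k):
    """Mean overlap of k-NN sets across two representations of the same samples."""
    nn_a, nn_b = knn_sets(feats_a, k), knn_sets(feats_b, k)
    return sum(len(a & b) for a, b in zip(nn_a, nn_b)) / (k * len(feats_a))


# Toy features for four samples (hypothetical, for illustration only).
vision = [[1, 0, 0], [1, 0.1, 0], [0, 0, 1], [0, 0.1, 1]]
head_same_topology = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
head_other_topology = [[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]]

score_same = mutual_knn(vision, head_same_topology, k=1)
score_other = mutual_knn(vision, head_other_topology, k=1)
```

Note that the two spaces can have different dimensionality: the score only compares which samples are neighbors, not the vectors themselves.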

If this is right

Targeting the least aligned heads produces larger gains than targeting already well-aligned heads.
The method reduces visual hallucinations by limiting the model's dependence on language priors.
Performance improvements hold across multiple MLLM architectures on 18 different benchmarks.
The head-specific alignment can be added to existing multimodal training without altering model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Head-level topological alignment could be tested in other cross-modal combinations such as vision-audio or language-audio models.
Tracking alignment scores during training might allow dynamic selection of which heads to regularize at different stages.
If mismatches at specific heads cause hallucinations, then similar targeted fixes could address other failure modes like factual errors.

Load-bearing premise

The argument assumes that preserving local neighborhood topology through MKNN matching in the least-aligned heads is the main driver of the gains, rather than other training details such as loss weighting or the head-selection rule itself.

What would settle it

A controlled experiment that applies the same contrastive loss to the most-aligned heads or to randomly chosen heads and still matches or exceeds the reported gains on the same vision benchmarks would falsify the claim that least-aligned heads are critical.
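The control described above needs three selection rules applied to the same per-head scores. A minimal sketch follows, assuming scores are stored in a dictionary keyed by `(layer, head)`; the function name `select_heads`, the strategy labels, and the toy scores are hypothetical and not taken from the paper.

```python
import random


def select_heads(scores, n, strategy, seed=0):
    """Pick n heads from a {(layer, head): mknn_score} mapping.

    strategy: "worst" (lowest scores), "top" (highest scores), or "random".
    """
    if n > len(scores):
        raise ValueError("n exceeds the number of scored heads")
    ranked = sorted(scores, key=lambda h: (scores[h], h))  # ascending score
    if strategy == "worst":
        return ranked[:n]
    if strategy == "top":
        return ranked[::-1][:n]
    if strategy == "random":
        return sorted(random.Random(seed).sample(ranked, n))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical scores, for illustration only (not values from the paper).
toy_scores = {(0, 0): 0.05, (0, 1): 0.20, (1, 0): 0.12, (1, 1): 0.02}
worst2 = select_heads(toy_scores, 2, "worst")
top2 = select_heads(toy_scores, 2, "top")
rand2 = select_heads(toy_scores, 2, "random")
```

Training one model per strategy with everything else held fixed would then isolate the effect of the selection rule.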

Figures

Figures reproduced from arXiv: 2606.23885 by Alberto Compagnoni, Davide Caffagni, Federico Melis, Lorenzo Baraldi, Marcella Cornia, Mark Granroth-Wilding, Pier Luigi Dovesi, Sara Sarto.

**Figure 2.** Overview of HeRA. Alongside the standard language modeling objective (L_LM), HeRA employs a contrastive loss (L_HeRA) to pull representations from selected LLM attention heads closer to their k-nearest neighbors (Top-k), computed in the latent space of a frozen teacher vision encoder.

**Figure 3.** Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-3B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA.

**Figure 4.** VQA results of HeRA with different teacher vision encoders.

**Figure 5.** Effect of the multimodal training of LLaVA and representation alignment methods on the …

**Figure 6.** Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-7B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA. (Plot axes: Layer Index, Score.)

**Figure 7.** Left: Alignment with DINOv2-L, measured with the MKNN metric on each layer and attention head of Qwen2.5-14B. Right: MKNN scores of the Worst-5 and Top-5 heads, computed on (i) the base LLM; (ii) after the LLaVA multimodal training; and (iii) after the addition of HeRA.

**Figure 8.** Comparison of MKNN scores of the Worst-5 and Top-5 heads after the second training …

**Figure 9.** VQA results on Qwen3-VL-4B after fine-tuning, with and without the HeRA objective.

**Figure 10.** Qualitative comparison of LLaVA [19], ROSS [38], and HeRA using Qwen3-8B and SigLIP2. We present representative samples across all Cambrian categories: General, Knowledge, OCR, and Vision-Centric.

**Figure 11.** Failure cases of HeRA on VQA tasks.

Original abstract

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
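One way to picture a contrastive objective acting as a differentiable proxy for neighborhood matching is sketched below: each sample's positives are its top-k neighbors in the teacher (vision encoder) space, and the loss rewards the student (attention head) for ranking those same samples as most similar. This is a hedged reading, not the paper's formula. The function names, the temperature `tau`, the use of cosine similarity, and the choice to compare student features with each other are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_k_neighbors(feats, k):
    """For each sample, the index set of its k most similar other samples."""
    neighbors = []
    for i, f in enumerate(feats):
        ranked = sorted(
            (j for j in range(len(feats)) if j != i),
            key=lambda j: (-cosine(f, feats[j]), j),
        )
        neighbors.append(set(ranked[:k]))
    return neighbors


def neighbor_contrastive_loss(student, teacher, k, tau=0.1):
    """Mean multi-positive contrastive loss with teacher-defined positives."""
    positives = top_k_neighbors(teacher, k)
    total = 0.0
    for i in range(len(student)):
        logits = {j: cosine(student[i], student[j]) / tau
                  for j in range(len(student)) if j != i}
        m = max(logits.values())  # subtract the max for numerical stability
        denom = sum(math.exp(l - m) for l in logits.values())
        numer = sum(math.exp(logits[j] - m) for j in positives[i])
        total += -math.log(numer / denom)
    return total / len(student)


# Toy features (hypothetical): the teacher pairs samples (0, 1) and (2, 3).
teacher = [[1, 0, 0], [1, 0.1, 0], [0, 0, 1], [0, 0.1, 1]]
student_aligned = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
student_misaligned = [[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]]

loss_aligned = neighbor_contrastive_loss(student_aligned, teacher, k=1)
loss_misaligned = neighbor_contrastive_loss(student_misaligned, teacher, k=1)
```

In this toy setup the loss is near zero when the student shares the teacher's neighborhoods and large when it does not, which is the behavior a proxy for neighborhood matching should have.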

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HeRA moves alignment to per-head MKNN topology in MLLMs and claims biggest gains from the least-aligned heads plus hallucination reduction, but the causal role of the topology term needs checking.


The main takeaway is that this paper applies representation alignment at the level of individual attention heads in the LLM backbone rather than a fixed layer, using a contrastive loss based on mutual k-nearest neighbors to preserve local neighborhood structure between modalities. They report that selecting the least-aligned heads for this treatment produces the largest improvements on vision-heavy benchmarks and acts as a regularizer against language-prior hallucinations.

What stands out as new is the head-wise selection rule and the counterintuitive result on which heads to target. The work builds on the Platonic Representation Hypothesis but narrows it to per-head topology matching, and the broad evaluation across multiple MLLMs and 18 benchmarks plus public code is a plus. Those elements give the claims some empirical weight.

The soft spots are in the missing controls. The abstract and summary do not show whether the gains survive when heads are picked by a different rule, such as random or magnitude-based selection, or when a comparable contrastive loss is added without the MKNN term. If the benefit comes mainly from the selection heuristic or the extra loss weight rather than from topology preservation, the central mechanism claim weakens. The visible material also gives little detail on the exact head-selection thresholds, statistical significance, or full baseline comparisons.

This is aimed at people already working on multimodal LLM training and alignment tricks. A reader looking for practical regularization methods might find the head-selection idea worth testing. It deserves peer review because the scope is reasonable and the code release allows verification, even though the mechanism needs tighter ablations to stand on its own.

Referee Report

3 major / 2 minor

Summary. The paper proposes Head-Wise Representation Alignment (HeRA) for Multimodal LLMs. It selects attention heads by their MKNN alignment score, applies a contrastive loss that serves as a differentiable proxy for preserving local neighborhood topology across modalities, and reports that aligning the least-aligned heads produces the largest gains. The method is evaluated on multiple MLLMs across 18 benchmarks, with claimed improvements on vision-centric tasks and reduced visual hallucinations; code is released.

Significance. If the results hold after the requested controls, the work would demonstrate that per-head topological alignment can outperform conventional layer-wise approaches and act as a regularizer against linguistic priors. The public code release and multi-model, multi-benchmark evaluation are positive factors for verifiability.

major comments (3)

[§3.2 and §4] §3.2 (MKNN contrastive objective) and §4 (experimental results): the central claim that MKNN topology preservation is the operative mechanism requires an ablation that applies an equivalent-magnitude contrastive loss to the same least-aligned heads without the MKNN term. The current results do not isolate whether gains arise from the topological proxy, the selection heuristic, or loss weighting.
[§4] §4 (head selection and evaluation): the exact threshold or number of heads chosen by the MKNN score is not stated, nor are statistical significance tests or confidence intervals reported for the gains across the 18 benchmarks. These omissions are load-bearing for assessing whether the counter-intuitive finding (least-aligned heads yield largest gains) is robust.
[§4] §4 (baselines): the experimental setup does not clarify whether the reported baselines include recent head-wise or layer-wise alignment methods that also target internal representations; without this, the relative contribution of HeRA cannot be fully evaluated.
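For reference, the kind of control requested in the first major comment could be a plain instance-level InfoNCE loss, where the only positive for each sample is its own teacher embedding and no neighborhood information is used. The sketch below is one possible reading, not the authors' planned ablation. It assumes student and teacher features already share a dimension (in practice a projection would likely be needed), and the function name and temperature are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(student, teacher, tau=0.1):
    """Mean InfoNCE loss; sample i's only positive is teacher[i]."""
    n = len(student)
    total = 0.0
    for i in range(n):
        logits = [cosine(student[i], teacher[j]) / tau for j in range(n)]
        m = max(logits)
        # log-sum-exp over all teacher embeddings, computed stably
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


# Toy check (hypothetical features): matched pairs versus swapped pairs.
student = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(student, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce(student, [[0.0, 1.0], [1.0, 0.0]])
```

Unlike a neighborhood-based objective, this loss constrains each sample's own embedding and says nothing about which other samples should be nearby, which is what makes it a useful contrast.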

minor comments (2)

[Abstract and §3] The abstract and method section could explicitly state the typical number of heads aligned and the value of the contrastive loss weight used in the reported experiments.
[Figure 2 / results tables] Figure 2 or the corresponding results table would benefit from error bars or per-benchmark variance to support the claim of consistent gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental validation and clarity.


Referee: [§3.2 and §4] §3.2 (MKNN contrastive objective) and §4 (experimental results): the central claim that MKNN topology preservation is the operative mechanism requires an ablation that applies an equivalent-magnitude contrastive loss to the same least-aligned heads without the MKNN term. The current results do not isolate whether gains arise from the topological proxy, the selection heuristic, or loss weighting.

Authors: We agree that an ablation isolating the MKNN term is necessary to substantiate the topological mechanism. In the revision we will add this control: a standard contrastive loss (InfoNCE without neighborhood matching) applied to the identical set of least-aligned heads at equivalent magnitude, allowing direct comparison of whether the MKNN-based topological proxy drives the gains beyond selection or weighting effects.

Revision: yes

Referee: [§4] §4 (head selection and evaluation): the exact threshold or number of heads chosen by the MKNN score is not stated, nor are statistical significance tests or confidence intervals reported for the gains across the 18 benchmarks. These omissions are load-bearing for assessing whether the counter-intuitive finding (least-aligned heads yield largest gains) is robust.

Authors: We will explicitly state the MKNN-score threshold and resulting number of heads selected in §4. We will also add paired statistical significance tests and 95% confidence intervals for all reported gains across the 18 benchmarks to demonstrate robustness of the least-aligned-heads result.

Revision: yes

Referee: [§4] §4 (baselines): the experimental setup does not clarify whether the reported baselines include recent head-wise or layer-wise alignment methods that also target internal representations; without this, the relative contribution of HeRA cannot be fully evaluated.

Authors: The current baselines follow the standard layer-wise alignment protocols cited in the manuscript. We will add an explicit clarification of the exact baseline implementations and, where feasible within compute limits, include additional comparisons against recent head-wise representation alignment methods in the revised §4.

Revision: partial
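The paired tests and 95% confidence intervals promised in the second response could take several forms. One simple option, sketched here as an assumption rather than the authors' stated procedure, is a percentile bootstrap over per-benchmark score differences. The function name and the toy scores are hypothetical and are not results from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (treated - baseline)."""
    if len(baseline) != len(treated):
        raise ValueError("paired scores must have equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choice(diffs) for _ in diffs)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up per-benchmark scores, for illustration only.
baseline_scores = [50, 60, 70, 55]
treated_scores = [52, 61, 73, 56]
mean_gain, ci_low, ci_high = paired_bootstrap_ci(baseline_scores, treated_scores)
```

An interval that excludes zero would support a consistent gain across benchmarks; with only a handful of benchmarks the interval is coarse, so it complements rather than replaces per-benchmark variance from repeated runs.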

Circularity Check

0 steps flagged

No significant circularity: empirical gains from new MKNN objective do not reduce to fitted inputs or self-citations


The paper defines HeRA as a new contrastive objective using the MKNN metric to align selected attention heads during training, then reports empirical performance on 18 external benchmarks. No equation shows a 'prediction' that equals a fitted parameter by construction, nor does the central claim (largest gains from least-aligned heads) reduce to the selection rule itself. The Platonic Representation Hypothesis is invoked as grounding but is not a self-citation chain that forbids alternatives or forces the result. The method remains falsifiable on held-out tasks without the reported gains being tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the Platonic Representation Hypothesis as a domain assumption and uses standard contrastive learning machinery; no new entities are postulated and free parameters appear limited to loss weighting and head count, which are typical in such methods.

free parameters (2)

number of heads to align
Selected according to MKNN alignment score; exact threshold or count not specified in abstract but is a tunable choice.
contrastive loss weight
Standard hyperparameter controlling strength of the alignment objective during multimodal training.
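The two free parameters can be read off a combined training objective. The form below is an assumption for illustration: the source states only that the contrastive loss is used alongside the language modeling objective. The symbol λ (contrastive loss weight), the set H of selected heads (whose size is the number of heads to align), and the averaging over heads are hypothetical choices.

```latex
\mathcal{L}_{\text{total}}
  \;=\; \mathcal{L}_{LM}
  \;+\; \lambda \cdot \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \mathcal{L}_{HeRA}^{(h)}
```

Under this reading, setting λ = 0 recovers standard multimodal training, and varying |H| changes how many heads are regularized.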

axioms (1)

[domain assumption] Platonic Representation Hypothesis: representations of different modalities share topological structure (local neighborhood relationships)
Explicitly invoked to justify focusing on topology preservation via MKNN rather than other alignment metrics.



Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages · 13 internal anchors

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025.

[2] Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, and Rita Cucchiara. Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models. arXiv preprint arXiv:2512.15885, 2025.

[3] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023.

[4] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling Language-Free Visual Representation Learning. In ICCV, 2025.

[5] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394, 2023.

[6] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal Large Language Models Can See But Not Perceive. In ECCV, 2024.

[7] Yulu Gan, Kaiya Ivy Zhao, and Phillip Isola. Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations. In ICLR Workshops, 2025.

[8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.

[9] Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the Platonic Representation Hypothesis: An Aristotelian View. In ICML, 2026.

[10] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. In CVPR, 2024.

[11] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz Grand Challenge: Answering Visual Questions From Blind People. In CVPR, 2018.

[12] Drew A Hudson and Christopher D Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In CVPR, 2019.

[13] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The Platonic Representation Hypothesis. In ICML, 2024.

[14] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016.

[15] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers. In ICCV, 2025.

[16] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125, 2023.

[17] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. In EMNLP, 2023.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

[19] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. In CVPR, 2024.

[20] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is Your Multi-modal Model an All-around Player? In ECCV, 2024.

[21] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models. Sci China Inf Sci, 67(12):220102, 2024.

[22] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In ICLR, 2024.

[23] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS, 2022.

[24] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL, 2022.

[25] Andrew Joohun Nam, Henry Conklin, Yukang Yang, Thomas L Griffiths, Jonathan D Cohen, and Sarah-Jane Leslie. Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers. In NeurIPS, 2025.

[26] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-Context Learning and Induction Heads. arXiv preprint arXiv:2209.11895, 2022.

[27] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.

[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR, pages 1–31, 2024.

[29] Qwen Team. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024.

[30] Qwen Team. Qwen3.5: Towards Native Multimodal Agents, 2026.

[31] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel Grounding Large Multimodal Model. In CVPR, 2024.

[32] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[33] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In CVPR, 2019.

[34] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In NeurIPS, 2024.

[35] Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond Language Modeling: An Exploration of Multimodal Pretraining. arXiv preprint arXiv:2603.03276, 2026.

[36] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In CVPR, 2024.

[37] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786, 2025.

[38] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive Visual Instruction Tuning. In ICLR, 2025.

[39] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. arXiv preprint arXiv:2311.07397, 2023.

[40] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. In ICLR, 2023.

[41] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025.

[42] Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. FineVision: Open Data Is All You Need. arXiv preprint arXiv:2510.17269, 2025.

[43] Penghao Wu and Saining Xie. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. In CVPR, 2024.

[44] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025.

[45] Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual Representation Alignment for Multimodal Large Language Models. arXiv preprint arXiv:2509.07979, 2025.

[46] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. In ICLR, 2025.

[47] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In CVPR, 2024.

[48] Zihao Yue, Liang Zhang, and Qin Jin. Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective. In ACL, 2024.

A Additional Implementation Details

Training Details. Following the two-stage training recipe of LLaVA-1.5 [19], in the first stage, we train only the projector proj, a two-layer MLP, using 558k image-caption pairs, whil...

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025.

[2] Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, and Rita Cucchiara. Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models. arXiv preprint arXiv:2512.15885, 2025.

[3] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023.

[4] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling Language-Free Visual Representation Learning. In ICCV, 2025.

[5] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394, 2023.

[6] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal Large Language Models Can See But Not Perceive. In ECCV, 2024.

[7] Yulu Gan, Kaiya Ivy Zhao, and Phillip Isola. Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations. In ICLR Workshops, 2025.

[8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.

[9] Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the Platonic Representation Hypothesis: An Aristotelian View. In ICML, 2026.

[10] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. In CVPR, 2024.

[11] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz Grand Challenge: Answering Visual Questions From Blind People. In CVPR, 2018.

[12] Drew A Hudson and Christopher D Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In CVPR, 2019.

[13] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The Platonic Representation Hypothesis. In ICML, 2024.

[14] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016.

[15] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers. In ICCV, 2025.

[16] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125, 2023.

[17] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. In EMNLP, 2023.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

[19] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. In CVPR, 2024.

[20] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is Your Multi-modal Model an All-around Player? In ECCV, 2024.

[21] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models. Sci China Inf Sci, 67(12):220102, 2024.

[22] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In ICLR, 2024.

[23] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS, 2022.

[24] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL, 2022.

[25] Andrew Joohun Nam, Henry Conklin, Yukang Yang, Thomas L Griffiths, Jonathan D Cohen, and Sarah-Jane Leslie. Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers. In NeurIPS, 2025.

[26] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-Context Learning and Induction Heads. arXiv preprint arXiv:2209.11895, 2022.

[27] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.

[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR, pages 1–31, 2024.

[29] Qwen Team. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024.

[30] Qwen Team. Qwen3.5: Towards Native Multimodal Agents, 2026.

[31] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel Grounding Large Multimodal Model. In CVPR, 2024.

[32] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[33] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In CVPR, 2019.

[34] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In NeurIPS, 2024.

[35] Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond Language Modeling: An Exploration of Multimodal Pretraining. arXiv preprint arXiv:2603.03276, 2026.

[36] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In CVPR, 2024.

[37] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786, 2025.

[38] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive Visual Instruction Tuning. In ICLR, 2025.

[39] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. arXiv preprint arXiv:2311.07397, 2023.

[40] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. In ICLR, 2023.

[41] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025.

[42] Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. FineVision: Open Data Is All You Need. arXiv preprint arXiv:2510.17269, 2025.
