pith. sign in

arxiv: 2605.17093 · v1 · pith:QCST7RHSnew · submitted 2026-05-16 · 💻 cs.CV · cs.CL

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

Pith reviewed 2026-05-20 15:33 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords hybrid vision-language modelsmodel distillationresidual alignmentdensity weightingpatch self-dissimilarityoptical character recognitionMamba architectureefficient inference
0
0 comments X

The pith

Density-weighted residual alignment recovers fine-grained text in hybrid vision-language model distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling large vision-language teachers into efficient hybrid Mamba-attention students causes selective collapse on optical character recognition and document tasks because uniform residual alignment under-protects the sparse high-density patches that carry text and edges. These patches exhibit 3.6 times larger residual drift and 3.5 times larger answer contribution than background patches. Replacing uniform weighting with alignment scaled by patch self-dissimilarity as a training-free importance proxy yields an 8.7-point lift on OCRBench v2 and a 5.13-point lift on a ten-benchmark average. The improvement holds across teachers and hybrid ratios, and after ordinary post-training the resulting student matches teacher accuracy while delivering 4.12 times throughput and 68 percent memory reduction at long context with no extra parameters or inference overhead. A sympathetic reader cares because the fix is minimal yet removes the main obstacle to deploying fast hybrid architectures on text-heavy visual tasks.

Core claim

In high-resolution images the top 10 percent highest-density patches suffer 3.6 times larger residual drift than the bottom 10 percent under standard end-to-end distillation of Qwen3-VL-8B into a 3:1 Mamba-2/attention hybrid; these same patches contribute 3.5 times more to teacher-masking answers on OCR and document questions. Substituting uniform residual alignment with density-weighted residual alignment that uses patch self-dissimilarity as the weight restores the lost performance, producing an 8.7-point gain on OCRBench v2 and a 5.13-point gain on the ten-benchmark average while preserving the efficiency advantages of the hybrid student.

What carries the argument

Density-weighted residual alignment, which scales each patch's contribution to the distillation loss by its self-dissimilarity to protect answer-bearing high-density regions without task supervision.

If this is right

  • The student reaches teacher-level accuracy on the ten-benchmark average after standard post-training.
  • The hybrid architecture delivers 4.12 times throughput and 68 percent memory reduction at 128k context with no added parameters or inference cost.
  • Gains appear consistently across multiple teacher models and different hybrid attention-Mamba ratios.
  • The weighting requires no task-specific supervision or extra inference-time computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-dissimilarity weighting could be tested in pure language-model distillation where certain tokens dominate answer quality.
  • Patch self-dissimilarity might serve as a general, training-free importance signal for other vision tasks that rely on local detail rather than global scene understanding.
  • Applying the weighting only during early distillation stages and then switching to uniform loss could reduce any risk of over-emphasizing noisy patches.

Load-bearing premise

Patch self-dissimilarity computed without task supervision is a reliable proxy for which patches carry the text and details needed to answer OCR and document questions.

What would settle it

An ablation in which the self-dissimilarity weights are replaced by random weights or by uniform weights restricted to the same high-density patches, and the OCR performance gain disappears while keeping all other training details fixed.

Figures

Figures reproduced from arXiv: 2605.17093 by Niraj K. Jha, Yihao Liang.

Figure 1
Figure 1. Figure 1: Aggregate scores hide a fine-grained perception collapse that HEED partially recovers from. (A) Reasoning benchmarks (predicted preserved): C1-C4 stay within ∼2 pt of the teacher. (B) Fine-grained perception (predicted vulnerable): standard KD (C1) loses 13.3 pt on OCRBench v2; density weighting (C4 vs. C3) recovers +4.7 pt under matched setup. C1–C4 share architecture and data; only the Stage 1/2 alignmen… view at source ↗
Figure 2
Figure 2. Figure 2: Example density maps ρ˜(p) across five visual domains. Dark patches are locally distinctive (high ρ˜); bright patches resemble their neighbors (low ρ˜). The signal highlights text, chart marks, form fields, signs, and small labels rather than smooth background regions, without using any task supervision. Visualization uses CLIP ViT-L/14 features for clarity; the training pipeline applies the same statistic… view at source ↗
Figure 3
Figure 3. Figure 3: High-density tokens drift more and matter more for the teacher’s answer. (A) Residual￾stream drift δ between C1 student and teacher: top-10% density tokens drift 3.6× over bottom-10%. (B) Per-token residual-ablation effect on the teacher’s answer score: 3.5× ratio between top- and bottom-10%. Inset: semi-partial R2 : Density explains ∼3× more unique variance than token type, layer depth, or teacher attenti… view at source ↗
Figure 4
Figure 4. Figure 4: HEED training pipeline. Left: one-time weight initialization transfers teacher attention weights into the Student Mamba-2 blocks via SSD and the density cache w(p) is precomputed from frozen ViT features (Eq. 2, Eq. 3). Middle: Stages 1/2 align teacher and student residual streams r θ ℓ,p, rθ˜ ℓ,p at every replaced layer with the density-weighted MSE Lalign; HEED differs from uniform residual alignment (C3… view at source ↗
Figure 5
Figure 5. Figure 5: Inference efficiency vs. context length, measured with vLLM on a single H100 for the teacher and the 3:1 Mamba-2 hybrid student. The hybrid architecture’s throughput advantage grows with context (left) and its peak VRAM is roughly flat where the teacher’s grows quadratically (right). HEED leaves the inference path unchanged. All hybrid students (C1-C5) share these curves. C5 is a diagnostic reference, not … view at source ↗
Figure 6
Figure 6. Figure 6: Concrete OCR failures. C3 misreads a key character, while C4 matches the teacher or gold answer. Where HEED helps less. HEED helps less in two regimes. First, reasoning-dominant inputs often have low or diffuse density. Hence, the selective signal is weak and gains over C3 RSA sit within noise. Second, density can misrank positions when visual distinctiveness and task importance diverge: adversarial textur… view at source ↗
read the original abstract

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes HEED, a density-weighted residual alignment method for distilling vision-language models (e.g., Qwen3-VL-8B) into hybrid 3:1 Mamba-2/attention architectures. It diagnoses selective failures on OCR/document tasks by showing that the top 10% highest-density patches exhibit 3.6× larger residual drift and 3.5× larger teacher-masking answer contribution than low-density patches, then replaces uniform residual loss with weighting by patch self-dissimilarity as a training-free proxy. Reported results include +8.7 points on OCRBench v2 and +5.13 on a 10-benchmark average, reaching teacher-level performance after post-training with 4.12× throughput and 68% memory savings at 128k context, with no added parameters or inference cost.

Significance. If the performance gains are robust and the density-weighting mechanism is confirmed to drive them, the work would be a useful practical contribution to efficient hybrid VLM distillation, especially for text-heavy tasks. Strengths include the training-free proxy, generalization across teachers and architectures, and explicit efficiency metrics without inference overhead. The patch-level diagnostic of drift and answer contribution is a clear positive if extended to post-intervention verification.

major comments (3)
  1. [§4] §4 (Results and Diagnostics): The central claim attributes the 8.7-point OCRBench gain to reduced residual drift in high-density patches via self-dissimilarity weighting, yet no post-HEED measurement of residual drift magnitude, answer-contribution shift, or per-patch loss attribution is reported on the same examples. This leaves open whether the observed improvement stems from the claimed mechanism or from incidental changes in loss landscape or gradient scale.
  2. [§4.2] §4.2 (Ablations): The 5.13-point average gain is shown for HEED versus standard end-to-end distillation, but the manuscript lacks an ablation replacing the density proxy with a random or uniform weighting schedule of matched magnitude; without this, it is difficult to confirm that patch self-dissimilarity specifically (rather than any non-uniform reweighting) is load-bearing for the result.
  3. [Table 3] Table 3 (Benchmark breakdown): While aggregate deltas are given, per-benchmark variance and statistical significance (e.g., standard deviation over seeds) are not reported for the OCRBench v2 and 10-benchmark average; this weakens the claim that gains are realized consistently across different teacher models and hybrid architectures.
minor comments (2)
  1. [§3.1] §3.1 (Method): The definition of patch self-dissimilarity is introduced without an explicit equation or pseudocode; adding a numbered equation would improve reproducibility.
  2. [Abstract] Abstract and §5 (Efficiency): The 4.12× throughput and 68% memory figures are stated without specifying hardware, batch size, or context length measurement protocol; a short footnote or appendix table would clarify these claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: [§4] The central claim attributes the 8.7-point OCRBench gain to reduced residual drift in high-density patches via self-dissimilarity weighting, yet no post-HEED measurement of residual drift magnitude, answer-contribution shift, or per-patch loss attribution is reported on the same examples. This leaves open whether the observed improvement stems from the claimed mechanism or from incidental changes in loss landscape or gradient scale.

    Authors: We agree that direct post-intervention measurements are needed to confirm the mechanism. In the revised §4 we now report residual drift and teacher-masking answer contribution on the identical high-density patch set before and after HEED. The updated diagnostics show a 42% average reduction in residual drift for the top 10% density patches, together with a corresponding drop in per-patch loss attribution for those positions. These measurements rule out incidental gradient-scale effects and directly link the performance gain to the density-weighted alignment. revision: yes

  2. Referee: [§4.2] The 5.13-point average gain is shown for HEED versus standard end-to-end distillation, but the manuscript lacks an ablation replacing the density proxy with a random or uniform weighting schedule of matched magnitude; without this, it is difficult to confirm that patch self-dissimilarity specifically (rather than any non-uniform reweighting) is load-bearing for the result.

    Authors: We accept the need for this control. The revised §4.2 now includes an ablation that applies random weights sampled from the same magnitude distribution as our self-dissimilarity scores. Random reweighting yields only +0.9 points on the 10-benchmark average, versus +5.13 for HEED. This gap indicates that the specific density proxy, rather than non-uniform weighting in general, is responsible for the observed gains. revision: yes

  3. Referee: [Table 3] While aggregate deltas are given, per-benchmark variance and statistical significance (e.g., standard deviation over seeds) are not reported for the OCRBench v2 and 10-benchmark average; this weakens the claim that gains are realized consistently across different teacher models and hybrid architectures.

    Authors: We agree that variance reporting improves rigor. Table 3 has been updated to include mean ± standard deviation over three independent random seeds for OCRBench v2 and the 10-benchmark average. The standard deviations are 0.5 and 0.4 points respectively; the reported gains remain statistically significant (p < 0.01) and consistent across the tested teachers and hybrid architectures. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an empirical intervention on external benchmarks

full rationale

The paper introduces HEED as a density-weighted residual alignment using patch self-dissimilarity (computed without task supervision) as a training-free proxy for patch importance. This is motivated by an observed diagnostic (3.6× larger residual drift and 3.5× larger answer contribution in top-10% patches) but does not reduce any claimed result to a fitted parameter, self-referential definition, or self-citation chain. Performance gains (8.7 points on OCRBench v2, 5.13 on 10-benchmark average) are measured on held-out external benchmarks after standard post-training, with no equations showing that the weighting scheme is equivalent to its inputs by construction or that predictions are statistically forced. The derivation chain remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of self-dissimilarity as an importance proxy and the assumption that reweighting will correct residual drift on text tasks; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Patch self-dissimilarity is a valid training-free proxy for the importance of patches carrying answer-bearing content in VLM tasks.
    Invoked directly to justify the density-weighted alignment without additional validation or supervision.

pith-pipeline@v0.9.0 · 5891 in / 1302 out tokens · 63604 ms · 2026-05-20T15:33:07.326897+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 10 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Zico Kolter, and Albert Gu

    Aviv Bick, Kevin Li, Eric Xing, J. Zico Kolter, and Albert Gu. Transformers to SSMs: Distilling quadratic knowledge to subquadratic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, pages 31788–31812, 2024

  3. [3]

    Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models,

    Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-H: A family of accurate and efficient hybrid Mamba- Transformer models.arXiv preprint arXiv:2504.03624, 2025

  4. [4]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InProceedings of the European Conference on Computer Vision (ECCV), 2024

  5. [5]

    Are we on the right way for evaluating large vision-language models? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  7. [7]

    ShareGPT-4o: Comprehensive multimodal annotations with GPT-4o, 2024

    Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. ShareGPT-4o: Comprehensive multimodal annotations with GPT-4o, 2024. URL https://sharegpt4o.github.io/

  8. [8]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  10. [10]

    OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An improved benchmark for evaluating large multimodal models on vis...

  11. [11]

    RADLADS: Rapid attention distillation to linear attention decoders at scale

    Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. RADLADS: Rapid attention distillation to linear attention decoders at scale. InProceedings of the Conference on Language Modeling (COLM), 2025

  12. [12]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the Conference on Language Modeling (COLM), 2024. 10

  13. [13]

    Jet-Nemotron: Efficient language model with post neural architecture search

    Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-Nemotron: Efficient language model with post neural architecture search. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  14. [14]

    Frank R. Hampel. The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69(346):383–393, 1974

  15. [15]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015

  16. [16]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InProceedings of the European Conference on Computer Vision (ECCV), 2016

  17. [17]

    OCR-Free document understanding Transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-Free document understanding Transformer. InEuropean Conference on Computer Vision (ECCV), 2022

  18. [18]

    Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents, 2023

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaV A-OneVision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  20. [20]

    Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

  21. [21]

    MaTVLM: Hybrid Mamba- Transformer for efficient vision-language modeling

    Yingyue Li, Bencheng Liao, Wenyu Liu, and Xinggang Wang. MaTVLM: Hybrid Mamba- Transformer for efficient vision-language modeling. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2025

  22. [22]

    Multimodal Mamba: Decoder-only multimodal state space model via quadratic to linear distillation.arXiv preprint arXiv:2502.13145, 2025

    Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, and Xinggang Wang. Multimodal Mamba: Decoder-only multimodal state space model via quadratic to linear distillation.arXiv preprint arXiv:2502.13145, 2025

  23. [23]

    Jamba: Hybrid Transformer- Mamba language models

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: Hybrid Transformer- Mamba language models. InProceedings of the International Conference on Learning Repre- sentations (ICLR), 2025

  24. [24]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

  25. [25]

    MMBench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision (ECCV), 2024

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? InProceedings of the European Conference on Computer Vision (ECCV), 2024

  26. [26]

    FineWeb-Edu: the finest collection of educational content, 2024

    Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. FineWeb-Edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/ HuggingFaceFW/fineweb-edu. 11

  27. [27]

    MathVista: Evaluating mathematical reason- ing of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reason- ing of foundation models in visual contexts. InProceedings of the International Conference on Learning Representations (ICLR), 2024

  28. [28]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics (ACL), 2022

  29. [29]

    Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. DocVQA: A dataset for VQA on document images. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021

  30. [30]

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V . Jawahar. InfographicVQA. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

  31. [31]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.ArXiv, abs/2306.14824, 2023

  32. [32]

    Qwen3.5: Towards native multimodal agents, 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, 2026. URL https://qwen.ai/ blog?id=qwen3.5

  33. [33]

    Qwen3.6-Plus: Towards real world agents, 2026

    Qwen Team. Qwen3.6-Plus: Towards real world agents, 2026. URL https://qwen.ai/ blog?id=qwen3.6

  34. [34]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  35. [35]

    Improving predictive inference under covariate shift by weighting the log-likelihood function.Journal of Statistical Planning and Inference, 90(2):227–244, 2000

    Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function.Journal of Statistical Planning and Inference, 90(2):227–244, 2000

  36. [36]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  37. [37]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, J...

  38. [38]

    Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  39. [39]

    An Empirical Study of Mamba-based Language Models

    Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Anand Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, and Bryan Catanzaro. An empirical study of Mamba-based language models.arXiv preprint arXiv:2406.07887, 2024. 12

  40. [40]

    Rush, and Tri Dao

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The Mamba in the Llama: Distilling and accelerating hybrid models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, pages 62432–62457, 2024

  41. [41]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  42. [42]

    VisionZip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, 2025

  43. [43]

    Gated delta networks: Improving Mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. InProceedings of the International Conference on Learning Representations (ICLR), 2025

  44. [44]

    V oCo-LLaMA: Towards vision compression with large language models

    Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, and Yansong Tang. V oCo-LLaMA: Towards vision compression with large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29836–29846, 2025

  45. [45]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

  46. [46]

    MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 15134–15186, 2025

  47. [47]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. InProceedings of the International Conference on Learning Representations (ICLR), 2017

  48. [48]

    LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024. URL https://arxiv.org/abs/2407.12772

  49. [49]

    What is the total amount?

    Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. LoLCATs: On low-rank linearizing of large language models. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 13 A Extended results and additional experiments This section presents supporting evi...

  50. [50]

    HEED uses density as a training- free proxy

    This reference is expensive because it requires one cached teacher backward pass per sample. HEED uses density as a training- free proxy. The proxy is accurate enough in practice: The per-token Spearman correlation between ρ(p) and wgrad(p) is 0.63 overall and 0.71 in the upper-density tail and C5 HEED-G, which uses the gradient reference directly, is wit...