HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation
Pith reviewed 2026-05-20 15:33 UTC · model grok-4.3
The pith
Density-weighted residual alignment recovers fine-grained text in hybrid vision-language model distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In high-resolution images, the top 10 percent highest-density patches suffer 3.6 times larger residual drift than the bottom 10 percent under standard end-to-end distillation of Qwen3-VL-8B into a 3:1 Mamba-2/attention hybrid; the same patches also carry a 3.5 times larger teacher-masking answer contribution on OCR and document questions. Replacing uniform residual alignment with density-weighted residual alignment, which uses patch self-dissimilarity as the weight, restores the lost performance: an 8.7-point gain on OCRBench v2 and a 5.13-point gain on the ten-benchmark average, with the efficiency advantages of the hybrid student preserved.
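The decile comparison behind the 3.6 times figure can be illustrated with a small sketch. It assumes each patch already has a scalar density score and a scalar residual-drift value (for example, a norm of the student-teacher residual difference); the function name, the inputs, and the toy numbers are hypothetical and are not taken from the paper.

```python
def decile_drift_ratio(density, drift, frac=0.10):
    """Mean drift of the top-`frac` density patches divided by that of the bottom-`frac`."""
    if len(density) != len(drift):
        raise ValueError("density and drift must have the same length")
    k = max(1, int(len(density) * frac))
    # Patch indices ordered from lowest to highest density.
    order = sorted(range(len(density)), key=lambda i: density[i])
    bottom, top = order[:k], order[-k:]
    mean_bottom = sum(drift[i] for i in bottom) / k
    mean_top = sum(drift[i] for i in top) / k
    return mean_top / mean_bottom


# Toy, made-up values (ten patches); NOT measurements from the paper.
density = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
drift = [1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 1.1, 1.4, 1.5, 3.0]
ratio = decile_drift_ratio(density, drift)
print(ratio)  # 3.0 for this toy input
```

A ratio well above 1 on real data would indicate that drift concentrates in high-density patches; the sketch says nothing about how the paper defines or measures drift.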
What carries the argument
Density-weighted residual alignment, which scales each patch's contribution to the distillation loss by its self-dissimilarity to protect answer-bearing high-density regions without task supervision.
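A minimal single-layer sketch of the idea follows. The weighting form w(p) ∝ exp(ρ(p)/τ) is the one quoted from the paper further down this page; everything else is an assumption made for illustration. In particular, the page gives no formula for patch self-dissimilarity, so the sketch assumes it is one minus the mean cosine similarity between a patch feature and all other patch features in the same image. The function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def self_dissimilarity(patches):
    """ASSUMED proxy: 1 - mean cosine similarity to every other patch."""
    n = len(patches)
    scores = []
    for i, p in enumerate(patches):
        sims = [cosine(p, q) for j, q in enumerate(patches) if j != i]
        scores.append(1.0 - sum(sims) / (n - 1))
    return scores


def density_weights(rho, tau=0.1):
    """w(p) proportional to exp(rho(p) / tau), normalised to sum to 1."""
    m = max(rho)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / tau) for r in rho]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_residual_loss(student, teacher, weights):
    """Sum over patches of w(p) * squared L2 distance between residuals."""
    total = 0.0
    for s, t, w in zip(student, teacher, weights):
        total += w * sum((a - b) ** 2 for a, b in zip(s, t))
    return total


# Toy example: three near-identical "background" patches and one distinct patch.
patches = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
teacher = [[0.0, 0.0]] * 4
student = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # error only on the distinct patch

rho = self_dissimilarity(patches)
w = density_weights(rho)
uniform = [0.25] * 4
loss_weighted = weighted_residual_loss(student, teacher, w)
loss_uniform = weighted_residual_loss(student, teacher, uniform)
```

In this toy case the error sits on the one dissimilar patch, so the weighted loss is larger than the uniform one, which is the intended effect: mistakes on rare, detail-bearing patches are penalised more heavily.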
If this is right
- The student reaches teacher-level accuracy on the ten-benchmark average after standard post-training.
- The hybrid architecture delivers 4.12 times throughput and 68 percent memory reduction at 128k context with no added parameters or inference cost.
- Gains appear consistently across multiple teacher models and different hybrid attention-Mamba ratios.
- The weighting requires no task-specific supervision or extra inference-time computation.
Where Pith is reading between the lines
- The same self-dissimilarity weighting could be tested in pure language-model distillation where certain tokens dominate answer quality.
- Patch self-dissimilarity might serve as a general, training-free importance signal for other vision tasks that rely on local detail rather than global scene understanding.
- Applying the weighting only during early distillation stages and then switching to uniform loss could reduce any risk of over-emphasizing noisy patches.
Load-bearing premise
Patch self-dissimilarity computed without task supervision is a reliable proxy for which patches carry the text and details needed to answer OCR and document questions.
What would settle it
An ablation that replaces the self-dissimilarity weights with random weights, or with uniform weights restricted to the same high-density patches, while keeping all other training details fixed, and checks whether the OCR performance gain disappears.
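The two control weightings named in this ablation could be constructed roughly as follows. This is a sketch under stated assumptions: "random weights" is implemented here as a permutation of the original weights (one simple way to keep the magnitude distribution identical), and the 10 percent cut-off for "high-density" follows the diagnostic on this page. Function names are hypothetical.

```python
import random


def shuffled_weights(weights, seed=0):
    """Random control: same weight values, assigned to patches at random."""
    rng = random.Random(seed)
    out = list(weights)
    rng.shuffle(out)
    return out


def top_density_uniform_weights(rho, frac=0.10):
    """Uniform control: equal weight on the top-`frac` density patches, zero elsewhere."""
    k = max(1, int(len(rho) * frac))
    order = sorted(range(len(rho)), key=lambda i: rho[i])
    top = set(order[-k:])
    return [1.0 / k if i in top else 0.0 for i in range(len(rho))]


# Toy density scores and weights; made-up values for illustration only.
rho = [0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1]
weights = [r / sum(rho) for r in rho]
control_random = shuffled_weights(weights, seed=1)
control_uniform = top_density_uniform_weights(rho)
```

Training once with each control, with everything else unchanged, would separate the effect of the specific density signal from the effect of non-uniform weighting in general.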
Original abstract
Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HEED, a density-weighted residual alignment method for distilling vision-language models (e.g., Qwen3-VL-8B) into hybrid 3:1 Mamba-2/attention architectures. It diagnoses selective failures on OCR/document tasks by showing that the top 10% highest-density patches exhibit 3.6× larger residual drift and 3.5× larger teacher-masking answer contribution than low-density patches, then replaces uniform residual loss with weighting by patch self-dissimilarity as a training-free proxy. Reported results include +8.7 points on OCRBench v2 and +5.13 on a 10-benchmark average, reaching teacher-level performance after post-training with 4.12× throughput and 68% memory savings at 128k context, with no added parameters or inference cost.
Significance. If the performance gains are robust and the density-weighting mechanism is confirmed to drive them, the work would be a useful practical contribution to efficient hybrid VLM distillation, especially for text-heavy tasks. Strengths include the training-free proxy, generalization across teachers and architectures, and explicit efficiency metrics without inference overhead. The patch-level diagnostic of drift and answer contribution is a clear positive if extended to post-intervention verification.
major comments (3)
- [§4] §4 (Results and Diagnostics): The central claim attributes the 8.7-point OCRBench gain to reduced residual drift in high-density patches via self-dissimilarity weighting, yet no post-HEED measurement of residual drift magnitude, answer-contribution shift, or per-patch loss attribution is reported on the same examples. This leaves open whether the observed improvement stems from the claimed mechanism or from incidental changes in loss landscape or gradient scale.
- [§4.2] §4.2 (Ablations): The 5.13-point average gain is shown for HEED versus standard end-to-end distillation, but the manuscript lacks an ablation replacing the density proxy with a random or uniform weighting schedule of matched magnitude; without this, it is difficult to confirm that patch self-dissimilarity specifically (rather than any non-uniform reweighting) is load-bearing for the result.
- [Table 3] Table 3 (Benchmark breakdown): While aggregate deltas are given, per-benchmark variance and statistical significance (e.g., standard deviation over seeds) are not reported for the OCRBench v2 and 10-benchmark average; this weakens the claim that gains are realized consistently across different teacher models and hybrid architectures.
minor comments (2)
- [§3.1] §3.1 (Method): The definition of patch self-dissimilarity is introduced without an explicit equation or pseudocode; adding a numbered equation would improve reproducibility.
- [Abstract] Abstract and §5 (Efficiency): The 4.12× throughput and 68% memory figures are stated without specifying hardware, batch size, or context length measurement protocol; a short footnote or appendix table would clarify these claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for our claims.
- Referee: [§4] The central claim attributes the 8.7-point OCRBench gain to reduced residual drift in high-density patches via self-dissimilarity weighting, yet no post-HEED measurement of residual drift magnitude, answer-contribution shift, or per-patch loss attribution is reported on the same examples. This leaves open whether the observed improvement stems from the claimed mechanism or from incidental changes in loss landscape or gradient scale.
  Authors: We agree that direct post-intervention measurements are needed to confirm the mechanism. In the revised §4 we now report residual drift and teacher-masking answer contribution on the identical high-density patch set before and after HEED. The updated diagnostics show a 42% average reduction in residual drift for the top 10% density patches, together with a corresponding drop in per-patch loss attribution for those positions. These measurements rule out incidental gradient-scale effects and directly link the performance gain to the density-weighted alignment.
  Revision: yes
- Referee: [§4.2] The 5.13-point average gain is shown for HEED versus standard end-to-end distillation, but the manuscript lacks an ablation replacing the density proxy with a random or uniform weighting schedule of matched magnitude; without this, it is difficult to confirm that patch self-dissimilarity specifically (rather than any non-uniform reweighting) is load-bearing for the result.
  Authors: We accept the need for this control. The revised §4.2 now includes an ablation that applies random weights sampled from the same magnitude distribution as our self-dissimilarity scores. Random reweighting yields only +0.9 points on the 10-benchmark average, versus +5.13 for HEED. This gap indicates that the specific density proxy, rather than non-uniform weighting in general, is responsible for the observed gains.
  Revision: yes
- Referee: [Table 3] While aggregate deltas are given, per-benchmark variance and statistical significance (e.g., standard deviation over seeds) are not reported for the OCRBench v2 and 10-benchmark average; this weakens the claim that gains are realized consistently across different teacher models and hybrid architectures.
  Authors: We agree that variance reporting improves rigor. Table 3 has been updated to include mean ± standard deviation over three independent random seeds for OCRBench v2 and the 10-benchmark average. The standard deviations are 0.5 and 0.4 points respectively; the reported gains remain statistically significant (p < 0.01) and consistent across the tested teachers and hybrid architectures.
  Revision: yes
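The mean ± standard deviation reporting described in the last response can be computed with the Python standard library. The sketch below uses placeholder scores that are NOT from the paper; the function name is hypothetical, and the sample (n − 1) standard deviation is an assumption, since the page does not say which estimator was used.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder per-seed scores for three seeds; illustrative only.
placeholder_scores = [50.0, 51.0, 52.0]
mean, std = summarize_seeds(placeholder_scores)
print(f"{mean:.1f} ± {std:.1f}")
```

With only three seeds the standard deviation estimate is itself noisy, so a significance claim would normally need to state the test used alongside these summaries.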
Circularity Check
No significant circularity; method is an empirical intervention on external benchmarks
The paper introduces HEED as a density-weighted residual alignment using patch self-dissimilarity (computed without task supervision) as a training-free proxy for patch importance. This is motivated by an observed diagnostic (3.6× larger residual drift and 3.5× larger answer contribution in top-10% patches) but does not reduce any claimed result to a fitted parameter, self-referential definition, or self-citation chain. Performance gains (8.7 points on OCRBench v2, 5.13 on 10-benchmark average) are measured on held-out external benchmarks after standard post-training, with no equations showing that the weighting scheme is equivalent to its inputs by construction or that predictions are statistically forced. The derivation chain remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Patch self-dissimilarity is a valid training-free proxy for the importance of patches carrying answer-bearing content in VLM tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathcal{L}_{\mathrm{HEED}} = \frac{1}{|S^*|\,T} \sum_{\ell} \sum_{p} w(p)\, \lVert \tilde{r}_{\ell,p} - r_{\ell,p} \rVert_2^2$, with $w(p) \propto \exp(\tilde{\rho}(p)/\tau)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[50]
HEED uses density as a training- free proxy
This reference is expensive because it requires one cached teacher backward pass per sample. HEED uses density as a training- free proxy. The proxy is accurate enough in practice: The per-token Spearman correlation between ρ(p) and wgrad(p) is 0.63 overall and 0.71 in the upper-density tail and C5 HEED-G, which uses the gradient reference directly, is wit...