
arxiv: 2605.19929 · v1 · pith:YM3MYVVAnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Pith reviewed 2026-05-20 05:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords low-bit quantization · vision-language models · post-training quantization · modality heterogeneity · channel decoupling · outlier channels · multimodal deployment

The pith

SplitQ decouples modality-specific outlier channels to enable accurate low-bit quantization of vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that the activation distributions of text and vision in VLMs differ in ways that standard quantization cannot handle well. It shows this heterogeneity concentrates in a small number of channels, and that these outlier channels are largely distinct for each modality. By isolating those channels with a dedicated module and then applying adaptive calibration to the rest, the method keeps most of the original model accuracy even when weights and activations are reduced to three bits. A sympathetic reader would care because this directly lowers the memory and energy cost of running capable multimodal models on phones and edge hardware.
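
To make the intuition concrete, here is a toy sketch (not SplitQ itself) of why one outlier value hurts 3-bit quantization and why quantizing it separately helps. The activation values, the symmetric uniform quantizer, and the helper names `fake_quantize` and `mae` are illustrative assumptions, not taken from the paper.

```python
def fake_quantize(xs, bits):
    """Symmetric uniform quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in xs)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in xs]


def mae(xs, ys):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical activations: five ordinary values and one outlier.
acts = [0.4, -0.9, 0.7, -0.2, 1.1, 30.0]

# Joint quantization: the outlier sets the scale, so small values collapse to 0.
joint = fake_quantize(acts, bits=3)

# Split quantization: the outlier gets its own scale (assumed toy scheme).
split = fake_quantize(acts[:-1], bits=3) + fake_quantize(acts[-1:], bits=3)

print(f"joint MAE: {mae(acts, joint):.3f}")  # 0.550
print(f"split MAE: {mae(acts, split):.3f}")  # 0.067
```

With a shared scale, the outlier stretches the 3-bit grid so far that every ordinary value rounds to zero. Giving the outlier its own scale restores resolution for the rest.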

Core claim

Cross-modal heterogeneity in VLM activations is unevenly distributed across channels, with most modality-specific outliers residing in different channels for text versus vision. SplitQ addresses this through a Modality-specific Outlier Channel Decoupling module that isolates the problematic channels at low overhead, followed by an Adaptive Cross-Modal Calibration module that uses dual learnable branches to reduce remaining distribution mismatches. Experiments across six multi-modal datasets show SplitQ outperforming prior PTQ methods at W4A8, W4A4, W3A3, and W3A2 settings, retaining 93.5 percent of FP16 performance at the challenging W3A3 point.
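
The claim that outliers sit in different channels for each modality can be illustrated with a small sketch. The detection rule here (per-channel absolute maximum above `ratio` times the median) is an assumption chosen for illustration. The summary does not state the criterion MOCD actually uses, and the synthetic activations and all function names are hypothetical.

```python
import random


def channel_absmax(acts):
    """Per-channel max |x| over tokens; `acts` is a list of token vectors."""
    n_ch = len(acts[0])
    return [max(abs(tok[c]) for tok in acts) for c in range(n_ch)]


def outlier_channels(acts, ratio=6.0):
    """Channels whose abs-max exceeds `ratio` x the median abs-max (assumed rule)."""
    amax = channel_absmax(acts)
    median = sorted(amax)[len(amax) // 2]
    return {c for c, v in enumerate(amax) if v > ratio * median}


def make_acts(n_tokens, n_ch, hot, scale, rng):
    """Synthetic unit-Gaussian activations with the `hot` channels scaled up."""
    return [
        [rng.gauss(0.0, 1.0) * (scale if c in hot else 1.0) for c in range(n_ch)]
        for _ in range(n_tokens)
    ]


rng = random.Random(0)
text_acts = make_acts(64, 32, hot={3, 17}, scale=40.0, rng=rng)
vision_acts = make_acts(64, 32, hot={8, 25}, scale=40.0, rng=rng)

text_out = outlier_channels(text_acts)
vision_out = outlier_channels(vision_acts)

print("text-only outliers:  ", sorted(text_out - vision_out))
print("vision-only outliers:", sorted(vision_out - text_out))
print("shared outliers:     ", sorted(text_out & vision_out))
```

In this synthetic setup the two outlier sets are disjoint by construction. On a real model they would be estimated from calibration activations, and some overlap is to be expected.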

What carries the argument

The Modality-specific Outlier Channel Decoupling (MOCD) module combined with the Adaptive Cross-Modal Calibration (ACC) module inside a channel-splitting PTQ framework.

If this is right

  • VLMs remain usable at W3A3 quantization while keeping over 93 percent of full-precision accuracy on multi-modal tasks.
  • Memory footprint and inference latency drop enough to fit advanced VLMs on resource-limited hardware.
  • The same channel-decoupling pattern works across W4A8 down to W3A2 without retraining the base model.
  • Outlier isolation adds negligible extra parameters yet removes most modality-induced quantization error.
  • Performance gains appear consistently on six different multi-modal evaluation datasets.
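
As a rough guide to the memory implication, the sketch below parses the `W<bits>A<bits>` tags used above and estimates weight storage. The 7-billion parameter count is a hypothetical round figure, and the estimate ignores quantization scales, zero points, and any overhead from separately handled outlier channels.

```python
import re


def parse_setting(tag):
    """Split a tag such as 'W4A8' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized setting: {tag}")
    return int(match.group(1)), int(match.group(2))


def weight_gib(n_params, weight_bits):
    """Weight storage in GiB, ignoring scales and other metadata."""
    return n_params * weight_bits / 8 / 2**30


n_params = 7_000_000_000  # hypothetical model size, for illustration only

print(f"FP16: {weight_gib(n_params, 16):.2f} GiB")
for tag in ["W4A8", "W4A4", "W3A3", "W3A2"]:
    w_bits, a_bits = parse_setting(tag)
    print(f"{tag}: {weight_gib(n_params, w_bits):.2f} GiB weights, {a_bits}-bit activations")
```

Under these assumptions, 3-bit weights take 3/16 of the FP16 weight storage.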

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same uneven-channel pattern may appear in other multimodal architectures such as audio-visual or video-text models.
  • Hardware accelerators could add specialized paths for the decoupled outlier channels to gain extra speed.
  • Dynamic selection of which channels to split could be explored at inference time to adapt to new inputs.
  • The calibration branches might transfer to mixed-precision quantization schemes beyond uniform low-bit settings.

Load-bearing premise

Cross-modal heterogeneity concentrates in a small subset of channels whose outliers sit in different channels for each modality.

What would settle it

The premise would be undermined if running SplitQ on a held-out VLM and dataset yielded accuracy no better than standard per-channel quantization at the same bit width.

Figures

Figures reproduced from arXiv: 2605.19929 by Guolei Sun, Haotong Qin, Lei Zhang, Xindong Zhang, Yi Zhong.

Figure 1: (a) VLM inference pipeline. (b) An overview of SplitQ framework. The two key components …

Figure 2: Distributions of text and vision activations

Figure 3: Comparison of weight quantization error Δ(P_m^{-1} W_m) across channels using activations of different modalities: text-only, vision-only, and joint vision-text. The x-axis (Channel Group) partitions all channels, sorted in ascending order of quantization error MAE, into 10 equal-sized bins (each containing 10% of channels). The results show that vision-text activations consistently amplify this error term.…

Figure 4: Construction process. … calibration sets, while fixed SVD-based components [23, 16] lack the flexibility needed to absorb quantization errors arising from modality heterogeneity. To balance these two choices, we propose a learnable constrained low-rank parameterization, as shown in …
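
The channel grouping described in the Figure 3 caption (sort channels by error, then split into ten equal-sized bins) can be sketched as follows. The error values are synthetic and the function names are hypothetical. The sketch assumes the channel count is a multiple of 10.

```python
def decile_groups(errors):
    """Return 10 equal-sized lists of channel indices, sorted by ascending error."""
    if len(errors) % 10 != 0:
        raise ValueError("this sketch assumes the channel count is a multiple of 10")
    order = sorted(range(len(errors)), key=lambda c: errors[c])
    size = len(order) // 10
    return [order[i * size:(i + 1) * size] for i in range(10)]


def group_mean_error(errors):
    """Mean error inside each of the 10 channel groups."""
    return [sum(errors[c] for c in group) / len(group) for group in decile_groups(errors)]


# Synthetic per-channel MAE values for 100 hypothetical channels.
errors = [((c * 37) % 100) / 100 for c in range(100)]
means = group_mean_error(errors)

for i, m in enumerate(means, start=1):
    print(f"group {i:2d}: mean MAE = {m:.3f}")
```

Plotting one such curve per activation source (text-only, vision-only, joint) would give a comparison of the kind the caption describes.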
Original abstract

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
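
The retention figure follows directly from the two averages quoted in the abstract. A quick check, with variable names chosen here for illustration:

```python
fp16_avg = 74.3  # FP16 average score quoted in the abstract
w3a3_avg = 69.5  # SplitQ W3A3 average score quoted in the abstract

retention = w3a3_avg / fp16_avg
print(f"retained: {retention:.1%}")                    # retained: 93.5%
print(f"absolute drop: {fp16_avg - w3a3_avg:.1f} points")  # absolute drop: 4.8 points
```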

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents SplitQ, a channel-splitting-driven post-training quantization framework for large vision-language models. It is motivated by the empirical observation that cross-modal heterogeneity in activations is concentrated in a small subset of channels, with modality-specific outliers typically residing in different channels for vision and text. The core contributions are the Modality-specific Outlier Channel Decoupling (MOCD) module to isolate these channels and the Adaptive Cross-Modal Calibration (ACC) module using dual lightweight learnable branches. Experiments on popular VLMs across 6 multi-modal datasets and bit-width settings (W4A8, W4A4, W3A3, W3A2) report consistent outperformance over baselines, including preservation of 93.5% of FP16 performance (69.5 vs. 74.3) at the challenging W3A3 setting. Code is released at the provided GitHub repository.

Significance. If the results hold, the work has clear practical significance for deploying VLMs on resource-constrained devices, where low-bit quantization is essential. The code availability is a positive strength for reproducibility. The channel-wise outlier analysis offers a useful perspective on multimodal quantization issues. However, the overall significance is tempered by the need to confirm that gains stem specifically from the proposed decoupling rather than generic improvements in calibration.

major comments (2)
  1. [§3.1] (motivation and observation): The central premise that cross-modal heterogeneity is unevenly distributed with outliers in differing channels per modality is load-bearing for introducing MOCD. The manuscript should provide quantitative evidence such as channel overlap statistics, outlier magnitude histograms, or per-channel activation plots across the tested VLMs and datasets to demonstrate that this separation is pronounced and consistent.
  2. [§4] (experiments and ablations): To establish that SplitQ's gains (e.g., the W3A3 result of 69.5) are attributable to the full framework rather than the ACC module alone, an ablation removing MOCD is required. Without it, the contribution of the channel-decoupling step remains unclear and the design choice is not fully justified.
minor comments (3)
  1. The abstract and introduction should explicitly list the specific VLMs evaluated (e.g., LLaVA, MiniGPT-4) to provide immediate context for the reported numbers.
  2. [§3.3] Notation for the dual branches in ACC could be clarified with a small diagram or explicit equations showing how the learnable parameters interact with the quantized activations.
  3. [§2] Related work should reference additional recent PTQ methods tailored to multimodal or transformer-based models beyond the current citations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Revisions will be incorporated to address the concerns raised.

Point-by-point responses
  1. Referee: [§3.1] (motivation and observation): The central premise that cross-modal heterogeneity is unevenly distributed with outliers in differing channels per modality is load-bearing for introducing MOCD. The manuscript should provide quantitative evidence such as channel overlap statistics, outlier magnitude histograms, or per-channel activation plots across the tested VLMs and datasets to demonstrate that this separation is pronounced and consistent.

    Authors: We thank the referee for highlighting the importance of strengthening the empirical foundation in §3.1. The section currently presents key observations on the uneven distribution of cross-modal heterogeneity and modality-specific outliers. To further substantiate the premise, we will add quantitative evidence in the revised manuscript, including channel overlap statistics (e.g., Jaccard index of outlier channels between vision and text modalities) and outlier magnitude histograms across the evaluated VLMs and datasets. These additions will demonstrate the consistency and pronounced nature of the channel separation. revision: yes

  2. Referee: [§4] (experiments and ablations): To establish that SplitQ's gains (e.g., the W3A3 result of 69.5) are attributable to the full framework rather than the ACC module alone, an ablation removing MOCD is required. Without it, the contribution of the channel-decoupling step remains unclear and the design choice is not fully justified.

    Authors: We agree that an explicit ablation isolating the contribution of MOCD is necessary to rigorously justify the design. While existing ablations evaluate components of the framework, we will add a dedicated experiment in the revised §4 that removes MOCD and reports performance using only the ACC module. This will directly compare against the full SplitQ results (including the W3A3 setting) to clarify the incremental benefit of the channel-decoupling step. revision: yes
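
The channel overlap statistic mentioned in the first response, the Jaccard index between the vision and text outlier-channel sets, is straightforward to compute. The channel indices below are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical outlier-channel indices for one layer.
text_outliers = {3, 17, 40}
vision_outliers = {8, 17, 25}

print(jaccard(text_outliers, vision_outliers))  # 0.2
```

A value near 0 would support the premise that the two modalities place their outliers in different channels. A value near 1 would contradict it.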

Circularity Check

0 steps flagged

No circularity: empirical observation motivates new modules with external validation

full rationale

The paper's chain begins with an empirical observation on channel-wise outlier distribution across modalities, which directly motivates the design of the MOCD and ACC modules without any fitted parameters, self-referential equations, or load-bearing self-citations. Performance results are reported as comparative accuracies on held-out datasets under multiple bit-width settings, which are externally measurable and not equivalent to the input observation by construction. No derivation reduces to its own inputs; the framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that modality heterogeneity concentrates in a small number of distinct channels per modality, plus two newly introduced modules whose parameters are learned during calibration.

free parameters (1)
  • learnable parameters in ACC dual branches
    Lightweight learnable branches are trained to mitigate remaining cross-modal discrepancies.
axioms (1)
  • domain assumption: Cross-modal activation heterogeneity is unevenly distributed across channels, with outliers concentrated in a small subset that differs by modality
    This observation directly motivates the channel-splitting design and is stated as the key finding in the abstract.
invented entities (2)
  • MOCD module (no independent evidence)
    purpose: Isolates salient modality-specific outlier channels
    Newly proposed component with minimal overhead.
  • ACC module (no independent evidence)
    purpose: Dynamically mitigates modality-induced quantization errors via dual branches
    Newly proposed adaptive calibration component.

pith-pipeline@v0.9.0 · 5811 in / 1361 out tokens · 50389 ms · 2026-05-20T05:16:53.116518+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 7 internal anchors

  1. [1]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015

  2. [2]

    Quarot: Outlier-free 4-bit inference in rotated llms

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. In NeurIPS, 2024

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  5. [5]

    Low-rank quantization-aware training for llms

    Yelysei Bondarenko, Riccardo Del Chiaro, and Markus Nagel. Low-rank quantization-aware training for llms. arXiv preprint arXiv:2406.06385, 2024

  6. [6]

    Quip: 2-bit quantization of large language models with guarantees

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization of large language models with guarantees. In NeurIPS, 2023

  7. [7]

    Efficientqat: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models. In ACL, 2025

  8. [8]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. In NeurIPS, 2023

  9. [9]

    Spqr: A sparse-quantized representation for near-lossless llm weight compression

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023

  10. [10]

    Cbq: Cross-block quantization for large language models

    Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, et al. Cbq: Cross-block quantization for large language models. arXiv preprint arXiv:2312.07950, 2023

  11. [11]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  12. [12]

    OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

  13. [13]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018

  14. [14]

    Image captioning: Transforming objects into words

    Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. Image captioning: Transforming objects into words. In NeurIPS, 2019

  15. [15]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022

  16. [16]

    Masquant: Modality-aware smoothing quantization for multimodal large language models

    Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu, Kun Li, and Yongliang Tao. Masquant: Modality-aware smoothing quantization for multimodal large language models. arXiv preprint arXiv:2603.04800, 2026

  17. [17]

    I-llm: Efficient integer-only inference for fully-quantized low-bit large language models

    Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, and Sifan Zhou. I-llm: Efficient integer-only inference for fully-quantized low-bit large language models. arXiv preprint arXiv:2405.17849, 2024

  18. [18]

    Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance

    Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804, 2025

  19. [19]

    Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

    Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987, 2025

  20. [20]

    Squeezellm: Dense-and-sparse quantization

    Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023

  21. [21]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  22. [22]

    Norm tweaking: High-performance low-bit quantization of large language models

    Liang Li, Qingyuan Li, Bo Zhang, and Xiangxiang Chu. Norm tweaking: High-performance low-bit quantization of large language models. In AAAI, 2024

  23. [23]

    Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models

    Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024

  24. [24]

    Fptq: Fine-grained post-training quantization for large language models

    Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, and Yuchen Xie. Fptq: Fine-grained post-training quantization for large language models. arXiv preprint arXiv:2308.15987, 2023

  25. [25]

    Visual question answering with question representation update (qru)

    Ruiyu Li and Jiaya Jia. Visual question answering with question representation update (qru). In NeurIPS, 2016

  26. [26]

    Mbq: Modality-balanced quantization for large vision-language models

    Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, et al. Mbq: Modality-balanced quantization for large vision-language models. In CVPR, 2025

  27. [27]

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In NeurIPS, 2024

  28. [28]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. In MLSys, 2024

  29. [29]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  30. [30]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024

  31. [31]

    Llm-qat: Data-free quantization aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In ACL, 2024

  32. [32]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024

  33. [33]

    Post-training quantization for vision transformer

    Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. In NeurIPS, 2021

  34. [34]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022

  35. [35]

    Affinequant: Affine transformation quantization for large language models

    Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024

  36. [36]

    Overcoming oscillations in quantization-aware training

    Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In ICML, 2022

  37. [37]

    Post-training quantization on diffusion models

    Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In CVPR, 2023

  38. [38]

    Omniquant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023

  39. [39]

    Post training quantization of large language models with microscaling formats

    Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Ilya Soloveychik, Xin Wang, et al. Post training quantization of large language models with microscaling formats. arXiv preprint arXiv:2405.07135, 2024

  40. [40]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019

  41. [41]

    Achieving binary weight and activation for llms using post-training quantization

    Siqing Song, Chuang Wang, Rui-Qi Wang, Yi Yang, and Xu-Yao Zhang. Achieving binary weight and activation for llms using post-training quantization. In ACL, 2025

  42. [42]

    Flatquant: Flatness matters for llm quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024

  43. [43]

    Q-vlm: Post-training quantization for large vision-language models

    Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Q-vlm: Post-training quantization for large vision-language models. In NeurIPS, 2024

  44. [44]

    WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

    Haiyu Wang, Yutong Wang, Jack Jiang, and Sai Qian Zhang. Wsvd: Weighted low-rank approximation for fast and efficient execution of low-precision vision-language models. arXiv preprint arXiv:2604.02570, 2026

  45. [45]

    Sliderquant: Accurate post-training quantization for llms

    Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, and Anbang Yao. Sliderquant: Accurate post-training quantization for llms. arXiv preprint arXiv:2603.25284, 2026

  46. [46]

    Bi-vlm: Binary post-training quantization for vision-language models

    Xijun Wang, Rayyan Abdalla, Junyun Huang, Chengyuan Zhang, Ruiqi Xian, and Dinesh Manocha. Bi-vlm: Binary post-training quantization for vision-language models. In AAAI, 2026

  47. [47]

    Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models

    Yutong Wang, Haiyu Wang, and Sai Qian Zhang. Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models. arXiv preprint arXiv:2510.16292, 2025

  48. [48]

    Outlier suppression: Pushing the limit of low-bit transformer language models

    Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. In NeurIPS, 2022

  49. [49]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In ICML, 2023

  50. [50]

    Advancing multimodal large language models with quantization-aware scale learning for efficient adaptation

    Jingjing Xie, Yuxin Zhang, Mingbao Lin, Liujuan Cao, and Rongrong Ji. Advancing multimodal large language models with quantization-aware scale learning for efficient adaptation. In ACM MM, 2024

  51. [51]

    Rwkvquant: Quantizing the rwkv family with proxy guided hybrid of scalar and vector quantization

    Chen Xu, Yuxuan Yue, Zukang Xu, Xing Hu, Jiangyong Yu, Zhixuan Chen, Sifan Zhou, Zhihang Yuan, and Dawei Yang. Rwkvquant: Quantizing the rwkv family with proxy guided hybrid of scalar and vector quantization. arXiv preprint arXiv:2505.03803, 2025

  52. [52]

    Vlmq: Efficient post-training quantization for large vision-language models via hessian augmentation

    Yufei Xue, Yushi Huang, Jiawei Shao, and Jun Zhang. Vlmq: Efficient post-training quantization for large vision-language models via hessian augmentation. arXiv preprint arXiv:2508.03351, 2025

  53. [53]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In ICCV, 2025

  54. [54]

    Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In NeurIPS, 2022

  55. [55]

    Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation

    Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation. arXiv preprint arXiv:2303.08302, 2023

  56. [56]

    Image captioning with semantic attention

    Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016

  57. [57]

    Mquant: Unleashing the inference potential of multimodal large language models via static quantization

    JiangYong Yu, Sifan Zhou, Dawei Yang, Shuoyu Li, Shuo Wang, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, and Zhihang Yuan. Mquant: Unleashing the inference potential of multimodal large language models via static quantization. In ACM MM, 2025

  58. [58]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  59. [59]

    Qqq: Quality quattuor-bit quantization for large language models

    Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, and Wei Lin. Qqq: Quality quattuor-bit quantization for large language models. arXiv preprint arXiv:2406.09904, 2024

  60. [60]

    Aser: activation smoothing and error reconstruction for large language model quantization

    Weibo Zhao, Yubin Shi, Xinyu Lyu, Wanchen Sui, Shen Li, and Yong Li. Aser: activation smoothing and error reconstruction for large language model quantization. In AAAI, 2025

  61. [61]

    MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

    Zhen Zheng, Xiaonan Song, and Chuanjie Liu. Mixllm: Llm quantization with global mixed-precision between output-features and highly-efficient system design. arXiv preprint arXiv:2412.14590, 2024

Appendix Table 8: Ablation study on the rank ratio of CWS, reported for Qwen2.5-VL-3B and Qwen2.5-VL-7B (table contents not recovered).