Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
Pith reviewed 2026-05-20 05:16 UTC · model grok-4.3
The pith
SplitQ decouples modality-specific outlier channels to enable accurate low-bit quantization of vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-modal heterogeneity in VLM activations is unevenly distributed across channels: a small subset of channels contains most modality-specific outliers, and those outliers typically sit in different channels for text and for vision. SplitQ addresses this through a Modality-specific Outlier Channel Decoupling module that isolates the problematic channels at low overhead, followed by an Adaptive Cross-Modal Calibration module that uses dual learnable branches to reduce the remaining distribution mismatches. Experiments across six multi-modal datasets show SplitQ outperforming prior PTQ methods at the W4A8, W4A4, W3A3, and W3A2 settings, retaining 93.5 percent of FP16 performance (69.5 vs. 74.3) at the challenging W3A3 setting.
What carries the argument
The Modality-specific Outlier Channel Decoupling (MOCD) module combined with the Adaptive Cross-Modal Calibration (ACC) module inside a channel-splitting PTQ framework.
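One way to picture the channel-splitting idea is the minimal Python sketch below. It is not the paper's MOCD implementation: it assumes symmetric uniform fake-quantization with one shared scale per group, and the function names (`quantize_dequantize`, `split_quantize`), the toy activation vector, and the choice of outlier channel are all hypothetical. It only illustrates why giving outlier channels their own scale can lower the quantization error of the remaining channels.

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform fake-quantization of floats with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    # Round to the integer grid, clamp to [-qmax, qmax], map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def split_quantize(token, outlier_channels, bits):
    """Quantize outlier channels and the remaining channels with separate scales."""
    outlier_set = set(outlier_channels)
    out_idx = [i for i in range(len(token)) if i in outlier_set]
    rest_idx = [i for i in range(len(token)) if i not in outlier_set]
    result = [0.0] * len(token)
    for group in (out_idx, rest_idx):
        if group:
            dequantized = quantize_dequantize([token[i] for i in group], bits)
            for i, v in zip(group, dequantized):
                result[i] = v
    return result


# Hypothetical activation vector: channel 2 carries a large outlier.
token = [0.11, -0.23, 40.0, 0.37, -0.08, 0.29, -0.31, 0.05]
shared = quantize_dequantize(token, bits=4)
split = split_quantize(token, outlier_channels=[2], bits=4)
err_shared = mean_squared_error(token, shared)
err_split = mean_squared_error(token, split)
print("shared-scale MSE:", err_shared)
print("split-scale MSE: ", err_split)
```

In this toy case the single large value in channel 2 makes the shared scale so coarse that every other channel rounds to zero; with the outlier isolated, the remaining channels get a scale suited to their own range.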
If this is right
- VLMs remain usable at W3A3 quantization while keeping over 93 percent of full-precision accuracy on multi-modal tasks.
- Memory footprint and inference latency drop enough to fit advanced VLMs on resource-limited hardware.
- The same channel-decoupling pattern works from W4A8 down to W3A2 without retraining the base model.
- Outlier isolation adds negligible extra parameters yet removes most modality-induced quantization error.
- Performance gains appear consistently on six different multi-modal evaluation datasets.
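To put the memory claim in rough numbers, the sketch below estimates weight storage at different bit widths. It is a back-of-the-envelope calculation, not a figure from the paper: the 7-billion parameter count is an assumed example, and the estimate ignores quantization scales, activations, the KV cache, and any components left in higher precision.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 2**30


num_params = 7_000_000_000  # assumed model size, for illustration only
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights divides weight storage by four, and 3-bit weights take three sixteenths of the FP16 footprint.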
Where Pith is reading between the lines
- The same uneven-channel pattern may appear in other multimodal architectures such as audio-visual or video-text models.
- Hardware accelerators could add specialized paths for the decoupled outlier channels to gain extra speed.
- Dynamic selection of which channels to split could be explored at inference time to adapt to new inputs.
- The calibration branches might transfer to mixed-precision quantization schemes beyond uniform low-bit settings.
Load-bearing premise
Cross-modal heterogeneity concentrates in a small subset of channels whose outliers sit in different channels for each modality.
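The premise can be probed empirically with a simple per-channel statistic. The sketch below is an assumption-laden illustration rather than the paper's analysis: it flags a channel as an outlier when its peak magnitude exceeds a hypothetical `ratio` times the median channel peak, and it runs on synthetic Gaussian activations in which the "hot" channels are planted by hand for each modality.

```python
import random
import statistics


def channel_peaks(activations):
    """activations: list of token vectors -> per-channel max |value|."""
    num_channels = len(activations[0])
    return [max(abs(tok[c]) for tok in activations) for c in range(num_channels)]


def outlier_channels(activations, ratio=10.0):
    """Channels whose peak exceeds `ratio` times the median channel peak."""
    peaks = channel_peaks(activations)
    threshold = ratio * statistics.median(peaks)
    return {c for c, p in enumerate(peaks) if p > threshold}


def synthetic_activations(num_tokens, num_channels, hot_channels, rng):
    """Gaussian activations with a few hand-planted high-magnitude channels."""
    tokens = []
    for _ in range(num_tokens):
        tok = [rng.gauss(0.0, 1.0) for _ in range(num_channels)]
        for c in hot_channels:
            tok[c] *= 50.0
        tokens.append(tok)
    return tokens


rng = random.Random(0)
text_acts = synthetic_activations(64, 128, hot_channels=[3, 17], rng=rng)
vision_acts = synthetic_activations(64, 128, hot_channels=[40, 99], rng=rng)
text_out = outlier_channels(text_acts)
vision_out = outlier_channels(vision_acts)
print("text outlier channels:  ", sorted(text_out))
print("vision outlier channels:", sorted(vision_out))
print("shared outlier channels:", sorted(text_out & vision_out))
```

On real models the same statistic would be computed from activations captured separately for text tokens and vision tokens; the threshold rule here is only one of several reasonable choices.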
What would settle it
If SplitQ, run on a held-out VLM and dataset, yields accuracy no better than standard per-channel quantization at the same bit width, the claim fails.
Original abstract
Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
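The headline retention figure can be checked directly from the two scores quoted in the abstract. The short sketch below does that arithmetic and also parses the setting labels, reading `WxAy` as x-bit weights and y-bit activations, which is the usual convention in the quantization literature; the helper names are hypothetical.

```python
import re


def parse_setting(setting):
    """Split a label such as 'W3A3' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", setting)
    if match is None:
        raise ValueError(f"unrecognized quantization setting: {setting!r}")
    return int(match.group(1)), int(match.group(2))


def retention_percent(quantized_score, fp16_score):
    """Quantized score as a percentage of the full-precision score."""
    return 100.0 * quantized_score / fp16_score


for label in ("W4A8", "W4A4", "W3A3", "W3A2"):
    print(label, parse_setting(label))

print(round(retention_percent(69.5, 74.3), 1))  # 93.5
```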
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SplitQ, a channel-splitting-driven post-training quantization framework for large vision-language models. It is motivated by the empirical observation that cross-modal heterogeneity in activations is concentrated in a small subset of channels, with modality-specific outliers typically residing in different channels for vision and text. The core contributions are the Modality-specific Outlier Channel Decoupling (MOCD) module to isolate these channels and the Adaptive Cross-Modal Calibration (ACC) module using dual lightweight learnable branches. Experiments on popular VLMs across 6 multi-modal datasets and bit-width settings (W4A8, W4A4, W3A3, W3A2) report consistent outperformance over baselines, including preservation of 93.5% of FP16 performance (69.5 vs. 74.3) at the challenging W3A3 setting. Code is released at the provided GitHub repository.
Significance. If the results hold, the work has clear practical significance for deploying VLMs on resource-constrained devices, where low-bit quantization is essential. The code availability is a positive strength for reproducibility. The channel-wise outlier analysis offers a useful perspective on multimodal quantization issues. However, the overall significance is tempered by the need to confirm that gains stem specifically from the proposed decoupling rather than generic improvements in calibration.
major comments (2)
- [§3.1] §3.1 (motivation and observation): The central premise that cross-modal heterogeneity is unevenly distributed with outliers in differing channels per modality is load-bearing for introducing MOCD. The manuscript should provide quantitative evidence such as channel overlap statistics, outlier magnitude histograms, or per-channel activation plots across the tested VLMs and datasets to demonstrate that this separation is pronounced and consistent.
- [§4] §4 (experiments and ablations): To establish that SplitQ's gains (e.g., the W3A3 result of 69.5) are attributable to the full framework rather than the ACC module alone, an ablation removing MOCD is required. Without it, the contribution of the channel-decoupling step remains unclear and the design choice is not fully justified.
minor comments (3)
- The abstract and introduction should explicitly list the specific VLMs evaluated (e.g., LLaVA, MiniGPT-4) to provide immediate context for the reported numbers.
- [§3.3] Notation for the dual branches in ACC could be clarified with a small diagram or explicit equations showing how the learnable parameters interact with the quantized activations.
- [§2] Related work should reference additional recent PTQ methods tailored to multimodal or transformer-based models beyond the current citations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Revisions will be incorporated to address the concerns raised.
- Referee: [§3.1] §3.1 (motivation and observation): The central premise that cross-modal heterogeneity is unevenly distributed with outliers in differing channels per modality is load-bearing for introducing MOCD. The manuscript should provide quantitative evidence such as channel overlap statistics, outlier magnitude histograms, or per-channel activation plots across the tested VLMs and datasets to demonstrate that this separation is pronounced and consistent.
  Authors: We thank the referee for highlighting the importance of strengthening the empirical foundation in §3.1. The section currently presents key observations on the uneven distribution of cross-modal heterogeneity and modality-specific outliers. To further substantiate the premise, we will add quantitative evidence in the revised manuscript, including channel overlap statistics (e.g., Jaccard index of outlier channels between vision and text modalities) and outlier magnitude histograms across the evaluated VLMs and datasets. These additions will demonstrate the consistency and pronounced nature of the channel separation.
  Revision: yes
- Referee: [§4] §4 (experiments and ablations): To establish that SplitQ's gains (e.g., the W3A3 result of 69.5) are attributable to the full framework rather than the ACC module alone, an ablation removing MOCD is required. Without it, the contribution of the channel-decoupling step remains unclear and the design choice is not fully justified.
  Authors: We agree that an explicit ablation isolating the contribution of MOCD is necessary to rigorously justify the design. While existing ablations evaluate components of the framework, we will add a dedicated experiment in the revised §4 that removes MOCD and reports performance using only the ACC module. This will directly compare against the full SplitQ results (including the W3A3 setting) to clarify the incremental benefit of the channel-decoupling step.
  Revision: yes
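The channel overlap statistic named in the first response, the Jaccard index, is simple to compute once outlier channels have been identified per modality. The sketch below is illustrative only: the channel index sets are made up, and returning 0.0 for two empty sets is a convention chosen here, not something the manuscript specifies.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of channel indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither modality has outliers
    return len(a & b) / len(union)


# Hypothetical outlier channel indices for each modality.
vision_outliers = {3, 17, 40, 99}
text_outliers = {17, 52, 120}
print(jaccard_index(vision_outliers, text_outliers))  # 1 shared of 6 total
```

A value near 0 would support the premise that the two modalities place their outliers in different channels; a value near 1 would count against it.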
Circularity Check
No circularity: empirical observation motivates new modules with external validation
The paper's chain begins with an empirical observation on channel-wise outlier distribution across modalities, which directly motivates the design of the MOCD and ACC modules without any fitted parameters, self-referential equations, or load-bearing self-citations. Performance results are reported as comparative accuracies on held-out datasets under multiple bit-width settings, which are externally measurable and not equivalent to the input observation by construction. No derivation reduces to its own inputs; the framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable parameters in ACC dual branches
axioms (1)
- domain assumption: Cross-modal activation heterogeneity is unevenly distributed across channels with outliers concentrated in a small subset that differs by modality
invented entities (2)
- MOCD module: no independent evidence
- ACC module: no independent evidence
Appendix
Table 8: Ablation study on the rank ratio of CWS. Columns: Method, Rank Ratio, bits; MMMU, OCRBench, ScienceQA, Average (Qwen2.5-VL-3B); MMMU, OCRBench, Scien... (Qwen2.5-VL-7B)