MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs
Pith reviewed 2026-06-27 04:28 UTC · model grok-4.3
The pith
MODE corrects vision-token bias in expert counts to keep MoE-MLLM quantization loss under 3 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing expert selection frequency by modality, filtering redundant vision tokens to obtain denoised visual frequency, evaluating quantization sensitivity per modality, and solving an Integer Linear Programming formulation for bit-width assignment, MODE limits average performance loss to within 2.9 percent at W3A16 on MoE-MLLMs, with larger gains at the 2-bit setting.
What carries the argument
Modality-decomposed expert importance (text frequency, denoised vision frequency, and per-modality sensitivity) fed into integer linear programming for memory-constrained bit-width allocation.
If this is right
- Average performance loss remains within 2.9 percent at W3A16.
- Relative gains increase when the target is pushed to 2-bit weights.
- The method corrects both cross-modal vision dominance and intra-vision redundancy.
- It produces lower loss than prior expert-level mixed-precision schemes on multimodal MoE models.
Where Pith is reading between the lines
- The same decomposition steps could be tested on video or audio tokens to see whether modality-specific cleaning remains useful.
- If the importance ranking holds across model sizes, the ILP step could be turned into a one-time calibration routine for new MoE-MLLMs.
- Lower memory footprints from this assignment might allow a single high-end GPU to run models that currently require multiple GPUs.
Load-bearing premise
The combined modality-decomposed frequency, denoised visual frequency, and per-modality sensitivity produce a more faithful ranking of expert importance than aggregate frequency alone.
What would settle it
A direct comparison on held-out multimodal inputs or a different model scale in which the MODE-derived bit assignment yields equal or higher performance loss than a standard aggregate-frequency baseline.
Figures
read the original abstract
Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skew frequency statistics, obscuring experts critical for informative visual content. To bridge gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE Multimodal LLMs. It identifies two biases in expert importance estimation (cross-modal vision-token dominance masking text-critical experts, and intra-vision redundant tokens skewing frequency) and addresses them by decomposing selection frequency by modality, filtering redundant vision tokens for denoised visual frequency, adding per-modality quantization sensitivity as a complementary signal, and solving an ILP to assign per-expert bit-widths under a memory budget. Experiments are reported to limit average performance loss to within 2.9% at W3A16, with larger gains at the 2-bit extreme.
Significance. If the central claim holds, the work would be significant for practical deployment of large MoE-MLLMs by enabling aggressive low-bit quantization with limited degradation. The explicit use of an ILP formulation for bit assignment under budget constraints is a strength, as it provides a clear, reproducible optimization procedure rather than heuristic assignment. The modality-specific decomposition and denoising steps target a genuine multimodal-specific issue not addressed by prior aggregate-frequency methods for text-only MoE-LLMs.
major comments (2)
- [Abstract] Abstract: The headline claim that modality-decomposed frequency + denoised visual frequency + per-modality sensitivity yields a materially more faithful expert ranking than standard aggregate frequency (and that this ranking is stable enough to deliver the reported 2.9% loss) is load-bearing for attributing gains to MODE rather than to a standard frequency-based ILP. No quantitative support is supplied in the abstract (e.g., rank correlation between the new composite score and actual per-expert quantization error, or re-ranking stability under text-heavy vs. vision-heavy prompts), so the superiority of the combined signal over the baseline cannot be verified from the given text.
- [Abstract] Abstract and experimental description: The performance numbers (2.9% average loss at W3A16, larger gains at 2-bit) are stated without any information on the datasets, models, number of runs, variance, or statistical tests used. This absence prevents assessment of whether the reported margin is robust or sensitive to input distribution shifts, directly undermining evaluation of the stability assumption in the skeptic note.
minor comments (1)
- [Abstract] The abstract would be clearer if it named the specific MoE-MLLM architectures and multimodal benchmarks on which the 2.9% figure was measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The points raised are valid regarding the need for more supporting context within the abstract itself. We will revise the abstract accordingly to improve clarity and verifiability while respecting length constraints.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim that modality-decomposed frequency + denoised visual frequency + per-modality sensitivity yields a materially more faithful expert ranking than standard aggregate frequency (and that this ranking is stable enough to deliver the reported 2.9% loss) is load-bearing for attributing gains to MODE rather than to a standard frequency-based ILP. No quantitative support is supplied in the abstract (e.g., rank correlation between the new composite score and actual per-expert quantization error, or re-ranking stability under text-heavy vs. vision-heavy prompts), so the superiority of the combined signal over the baseline cannot be verified from the given text.
Authors: We agree that the abstract does not contain explicit quantitative metrics such as rank correlation or stability measures. These are presented in detail in Sections 4.2–4.4 of the manuscript (including ablations on composite scoring, expert re-ranking under modality-specific prompts, and correlation with observed quantization error). To address the concern directly in the abstract, we will add a concise clause noting that the combined signals yield higher correlation with per-expert sensitivity than aggregate frequency alone, as validated in the experiments. revision: yes
-
Referee: [Abstract] Abstract and experimental description: The performance numbers (2.9% average loss at W3A16, larger gains at 2-bit) are stated without any information on the datasets, models, number of runs, variance, or statistical tests used. This absence prevents assessment of whether the reported margin is robust or sensitive to input distribution shifts, directly undermining evaluation of the stability assumption in the skeptic note.
Authors: The abstract is intentionally concise, but we acknowledge the referee's point that key evaluation details should be referenced. The reported figures are obtained on VQAv2, GQA, TextVQA, and POPE using LLaVA-1.5-MoE and similar MoE-MLLM backbones, with results averaged over three random seeds and standard deviations reported in the main tables. We will revise the abstract to briefly indicate the primary benchmarks and models while directing readers to the experimental section for full variance and statistical details. revision: yes
Circularity Check
No circularity: derivation uses externally measured signals in ILP without self-referential reduction
full rationale
The abstract and description describe a method that measures modality-decomposed selection frequencies, applies token filtering to obtain denoised frequencies, computes per-modality sensitivity, and feeds these as inputs into an ILP solver for bit assignment. No equations are shown that define any output quantity in terms of itself or rename a fitted parameter as a prediction. No self-citations are invoked as load-bearing uniqueness theorems. The ILP formulation treats the three signals as independent measured inputs rather than deriving them from the target performance metric. This matches the default case of a self-contained procedure whose central claim does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Masry, Ahmed and Long, Do and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul. C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.177
-
[2]
Minesh Mathew and Viraj Bagal and Rub. InfographicVQA , journal =. 2021 , url =. 2104.12756 , timestamp =
arXiv 2021
-
[3]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[4]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[5]
2025 , eprint=
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? , author=. 2025 , eprint=
2025
-
[6]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[7]
arXiv preprint arXiv:2305.10355 , year=
Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=
-
[8]
Proceedings of CVPR , year=
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI , author=. Proceedings of CVPR , year=
-
[9]
arXiv preprint arXiv:2403.20330 , year=
Are We on the Right Way for Evaluating Large Vision-Language Models? , author=. arXiv preprint arXiv:2403.20330 , year=
-
[10]
European conference on computer vision , pages=
Mmbench: Is your multi-modal model an all-around player? , author=. European conference on computer vision , pages=. 2024 , organization=
2024
-
[11]
arXiv preprint arXiv:2311.12793 , year=
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions , author=. arXiv preprint arXiv:2311.12793 , year=
-
[12]
Transactions of the Association for Computational Linguistics , volume=
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , author=. Transactions of the Association for Computational Linguistics , volume=. 2014 , publisher=
2014
-
[13]
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=
Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , month=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=
-
[14]
2025 , eprint=
Qwen3 Technical Report , author=. 2025 , eprint=
2025
-
[15]
Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Z...
-
[16]
arXiv preprint arXiv:2508.18265 , year=
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. arXiv preprint arXiv:2508.18265 , year=
-
[17]
2024 , eprint=
LLaVA-OneVision: Easy Visual Task Transfer , author=. 2024 , eprint=
2024
-
[18]
2025 , eprint=
SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization , author=. 2025 , eprint=
2025
-
[19]
2024 , eprint=
MBQ: Modality-Balanced Quantization for Large Vision-Language Models , author=. 2024 , eprint=
2024
-
[20]
2025 , eprint=
MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance , author=. 2025 , eprint=
2025
-
[21]
2025 , eprint=
Mixture Compressor for Mixture-of-Experts LLMs Gains More , author=. 2025 , eprint=
2025
-
[22]
2026 , eprint=
DynaMo: Runtime Switchable Quantization for MoE with Cross-Dataset Adaptation , author=. 2026 , eprint=
2026
-
[23]
2025 , eprint=
MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design , author=. 2025 , eprint=
2025
-
[24]
2026 , eprint=
VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models , author=. 2026 , eprint=
2026
-
[25]
LMMs-Eval: Accelerating the Development of Large Multimoal Models , url=
Bo Li* and Peiyuan Zhang* and Kaichen Zhang* and Fanyi Pu* and Xinrun Du and Yuhao Dong and Haotian Liu and Yuanhan Zhang and Ge Zhang and Chunyuan Li and Ziwei Liu , publisher =. LMMs-Eval: Accelerating the Development of Large Multimoal Models , url=
-
[26]
Proceedings of the 38th International Conference on Machine Learning , pages =
Learning Transferable Visual Models From Natural Language Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =
2021
-
[27]
2026 , eprint=
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning , author=. 2026 , eprint=
2026
-
[28]
Efficient Large Scale Language Modeling with Mixtures of Experts
Artetxe, Mikel and Bhosale, Shruti and Goyal, Naman and Mihaylov, Todor and Ott, Myle and Shleifer, Sam and Lin, Xi Victoria and Du, Jingfei and Iyer, Srinivasan and Pasunuru, Ramakanth and Anantharaman, Giridharan and Li, Xian and Chen, Shuohui and Akin, Halil and Baines, Mandeep and Martin, Louis and Zhou, Xing and Koura, Punit Singh and O ' Horo, Brian...
-
[29]
2022 , eprint=
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author=. 2022 , eprint=
2022
-
[30]
2024 , eprint=
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models , author=. 2024 , eprint=
2024
-
[31]
International Conference on Machine Learning , year=
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference , author=. International Conference on Machine Learning , year=
-
[32]
2026 , eprint=
VisionZip: Longer is Better but Not Necessary in Vision Language Models , author=. 2026 , eprint=
2026
-
[33]
2025 , eprint=
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models , author=. 2025 , eprint=
2025
-
[34]
2024 , eprint=
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models , author=. 2024 , eprint=
2024
-
[35]
2023 , eprint=
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers , author=. 2023 , eprint=
2023
-
[36]
Ashkboos, Saleh and Mohtashami, Amirkeivan and Croci, Maximilian L. and Li, Bo and Cameron, Pashmina and Jaggi, Martin and Alistarh, Dan and Hoefler, Torsten and Hensman, James , booktitle =. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs , url =. doi:10.52202/079017-3180 , editor =
-
[37]
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , year =
Lei Wang and Lingxiao Ma and Shijie Cao and Quanlu Zhang and Jilong Xue and Yining Shi and Ningxin Zheng and Ziming Miao and Fan Yang and Ting Cao and Yuqing Yang and Mao Yang , title =. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , year =
-
[38]
2023 , eprint=
Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.