GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
Pith reviewed 2026-05-25 05:31 UTC · model grok-4.3
The pith
A global linear program ranks all MoE experts by quantization error and router fine-tuning restores accuracy at lower bit widths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that casting expert bit allocation as a single global linear program derived from quantization error analysis, combined with router fine-tuning inside a progressive quantization loop, yields mixed-precision assignments that reduce memory and accelerate inference with only minimal accuracy loss compared with layer-wise baselines.
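The abstract does not state the linear program itself. As rough orientation only, an expert-level bit allocation of this kind can be written as a relaxed assignment problem. Every symbol below is an assumption introduced for illustration, not the paper's formulation: $\epsilon_{e,b}$ is the estimated error of expert $e$ at $b$ bits, $x_{e,b}$ is the assignment variable, $\mathcal{B}$ is the set of candidate bit-widths, $N$ is the total number of experts across all layers, and $\bar{b}$ is the target average bit-width.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{e=1}^{N}\sum_{b\in\mathcal{B}} \epsilon_{e,b}\, x_{e,b} \\
\text{s.t.}\quad & \sum_{b\in\mathcal{B}} x_{e,b} = 1 \quad \forall e, \\
& \sum_{e=1}^{N}\sum_{b\in\mathcal{B}} b\, x_{e,b} \le \bar{b}\, N, \\
& 0 \le x_{e,b} \le 1 .
\end{aligned}
```

Because $e$ ranges over every expert in every layer, one budget constraint couples all layers. A layer-wise scheme would instead impose a separate budget constraint per layer.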
What carries the argument
Global linear-programming formulation that scores model-wide expert importance from quantization error analysis, paired with router fine-tuning.
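To make the idea of a model-wide budget concrete, here is a minimal sketch of global bit allocation. It is not the paper's linear program: it swaps in a simple greedy heuristic that repeatedly upgrades whichever expert gains the most error reduction per extra bit. The function name, the error table, and the budget are all hypothetical.

```python
import heapq

def allocate_bits_globally(errors, budget_bits):
    """Greedy stand-in for a global allocation (NOT the paper's LP).

    errors[e][b] is an assumed error estimate for expert e at bit-width b.
    Returns a dict mapping each expert to a bit-width whose sum stays
    within budget_bits.
    """
    # Start every expert at its lowest candidate bit-width.
    bits = {e: min(table) for e, table in errors.items()}
    used = sum(bits.values())
    if used > budget_bits:
        raise ValueError("budget is below the minimum allocation")

    def next_step(e):
        # Next higher candidate for expert e, scored by error drop per bit.
        higher = sorted(b for b in errors[e] if b > bits[e])
        if not higher:
            return None
        nb = higher[0]
        gain = (errors[e][bits[e]] - errors[e][nb]) / (nb - bits[e])
        return (-gain, e, nb)  # negated because heapq is a min-heap

    heap = []
    for e in errors:
        step = next_step(e)
        if step is not None:
            heap.append(step)
    heapq.heapify(heap)

    while heap:
        _, e, nb = heapq.heappop(heap)
        cost = nb - bits[e]
        if used + cost > budget_bits:
            continue  # this upgrade does not fit; try cheaper ones
        used += cost
        bits[e] = nb
        step = next_step(e)
        if step is not None:
            heapq.heappush(heap, step)
    return bits

# Hypothetical error table keyed by (layer, expert); average budget of 2 bits.
example_errors = {
    (0, 0): {1: 90, 2: 50, 3: 40},
    (0, 1): {1: 20, 2: 15, 3: 14},
    (1, 0): {1: 200, 2: 60, 3: 25},
    (1, 1): {1: 30, 2: 22, 3: 18},
}
example_bits = allocate_bits_globally(example_errors, budget_bits=8)
```

All experts compete for the same budget regardless of layer, which is the property the global formulation is meant to capture. An exact LP solver would replace the greedy loop in a faithful implementation.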
If this is right
- MoE models reach extreme low-bit configurations with smaller accuracy cost than layer-wise methods allow.
- Memory footprint drops substantially while inference speed increases.
- Progressive iteration between importance estimation and allocation improves final bit assignments.
- Router adaptation becomes necessary once experts are quantized to different precisions.
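The memory claim can be sanity-checked with simple arithmetic. The sketch below computes the weight storage and the parameter-weighted average bit-width for a mixed-precision assignment. The expert names and sizes are hypothetical, and the estimate ignores quantization metadata such as scales and zero-points.

```python
def expert_memory_bytes(param_counts, bit_widths):
    """Bytes needed to store expert weights at the given per-expert bit-widths."""
    total_bits = sum(param_counts[e] * bit_widths[e] for e in param_counts)
    return total_bits / 8

def average_bits(param_counts, bit_widths):
    """Parameter-weighted average bit-width across experts."""
    total_params = sum(param_counts.values())
    total_bits = sum(param_counts[e] * bit_widths[e] for e in param_counts)
    return total_bits / total_params

# Hypothetical example: two equally sized experts at 2 and 3 bits.
params = {"e0": 1_000_000, "e1": 1_000_000}
bits = {"e0": 2, "e1": 3}
mem = expert_memory_bytes(params, bits)
avg = average_bits(params, bits)
```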
Where Pith is reading between the lines
- The same global ranking step may transfer to other sparsely activated networks that use learned routing.
- Co-optimizing the router appears required whenever conditional computation paths are compressed.
- Larger MoE models could fit on memory-limited hardware if the allocation and tuning steps scale.
Load-bearing premise
The linear program derived from quantization error analysis correctly ranks the relative importance of every expert across the full model, and router fine-tuning can fully offset any shifts in expert selection caused by the lower precision.
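One way to probe the second half of this premise is to measure how much expert selection changes after quantization. The sketch below computes the average top-k overlap between two sets of router logits. The function names and the choice of metric are assumptions for illustration; the paper may measure routing shift differently.

```python
def topk_indices(logits, k):
    """Indices of the k largest logits for one token."""
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return set(order[:k])

def routing_agreement(ref_logits, new_logits, k):
    """Mean fraction of top-k experts shared between two routers, per token."""
    if len(ref_logits) != len(new_logits) or not ref_logits:
        raise ValueError("need the same non-zero number of tokens")
    total = 0.0
    for ref, new in zip(ref_logits, new_logits):
        total += len(topk_indices(ref, k) & topk_indices(new, k)) / k
    return total / len(ref_logits)

# Hypothetical logits for two tokens over four experts.
full_precision = [[3.0, 2.0, 1.0, 0.0], [0.0, 1.0, 2.0, 3.0]]
quantized = [[3.0, 1.0, 2.0, 0.0], [0.0, 1.0, 2.0, 3.0]]
agreement = routing_agreement(full_precision, quantized, k=2)
```

A value near 1.0 would indicate that routing is largely preserved. A lower value would indicate the kind of shift that router fine-tuning is meant to correct.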
What would settle it
Apply the GEMQ procedure to a standard MoE LLM and check whether the final quantized model either exceeds the claimed memory savings or exhibits larger accuracy degradation than the paper reports on the same evaluation benchmarks.
Original abstract
Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
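The abstract describes a progressive framework that alternates importance estimation, allocation, and router adaptation, but gives no schedule. The skeleton below is a sketch under that reading only. The decreasing budget schedule and all four callables (`estimate_errors`, `allocate`, `apply_bits`, `tune_router`) are hypothetical placeholders, not functions from the GEMQ repository.

```python
def progressive_quantize(model, budgets, estimate_errors, allocate,
                         apply_bits, tune_router):
    """Run one estimate/allocate/quantize/tune round per budget.

    All callables are caller-supplied placeholders:
      estimate_errors(model) -> per-expert error table
      allocate(errors, budget) -> per-expert bit-widths
      apply_bits(model, bits) -> quantized model
      tune_router(model) -> model with adapted router
    """
    history = []
    for budget in budgets:
        errors = estimate_errors(model)   # re-estimate on the current model
        bits = allocate(errors, budget)   # global allocation under this budget
        model = apply_bits(model, bits)   # quantize experts accordingly
        model = tune_router(model)        # adapt routing to quantized experts
        history.append((budget, bits))
    return model, history
```

Re-estimating errors on the already quantized model in each round is what makes the loop progressive in this sketch. Whether the paper tightens the budget step by step or refines at a fixed budget is not stated in the abstract.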
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GEMQ, a mixed-precision quantization method for MoE LLMs. It uses a global linear-programming formulation derived from quantization error analysis to allocate bit-widths according to model-wide expert importance, combined with router fine-tuning to compensate for quantization-induced routing shifts, all within a progressive quantization framework. The authors claim this yields substantial memory reduction and inference speedup with minimal accuracy degradation compared to prior layer-wise approaches.
Significance. If the global LP formulation and router fine-tuning prove effective, the method could improve the accuracy-memory trade-off for large MoE models beyond existing layer-wise quantization techniques. The public release of source code at the cited GitHub repository is a clear strength for reproducibility and verification.
minor comments (2)
- [Abstract] The claims of 'minimal accuracy degradation' and 'significantly reduces memory' are stated without any quantitative results, baselines, or error metrics, which prevents evaluation of the central empirical claim.
- [Abstract] No derivation details, equations, or description of the linear-programming objective and constraints are provided, so the 'global' vs. 'layer-wise' distinction cannot be assessed from the given text.
Simulated Author's Rebuttal
We thank the referee for summarizing our GEMQ method and noting the public code release as a strength for reproducibility. The recommendation is listed as uncertain, but the major comments section contains no specific points. We remain available to address any concerns or provide additional experiments if raised.
Circularity Check
No circularity detected; derivation self-contained against external benchmarks
The abstract and available context describe a global LP formulation from quantization error analysis plus router fine-tuning, but supply no equations, fitting procedures, or self-citations that reduce any claimed result to its own inputs by construction. No load-bearing step can be exhibited as equivalent to a fitted parameter or prior self-citation. The method is presented as experimentally validated on memory/accuracy trade-offs, which constitutes independent content rather than a renaming or self-definition. This is the normal honest finding when no specific reduction is visible.
Reference graph
Works this paper leans on
-
[1]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319,
-
[2]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10081–10100, 2025a. Chen, Y., Shao, Y., Wang, P., and Cheng, J. Eac-moe: Expert-sel...
-
[3]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
-
[4]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066,
-
[5]
Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078,
-
[6]
Duanmu, H., Li, X., Yuan, Z., Zheng, S., Duan, J., Zhang, X., and Lin, D. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design. arXiv preprint arXiv:2505.05799,
-
[7]
Frantar, E. and Alistarh, D. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795,
-
[8]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,
-
[9]
Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization
Fu, Z., Ding, N., Han, K., Yu, X., Li, X., Chen, X., Tang, Y., and Wang, Y. Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization. arXiv preprint arXiv:2506.13329,
-
[10]
He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C
URL https://zenodo.org/records/12608602. He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C. Preserving llm capabilities through calibration data curation: From analysis to optimization. Advances in Neural Information Processing Systems, 38: 58531–58572,
-
[11]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[12]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
-
[13]
Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., and Yu, J. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804,
-
[14]
Mixture compressor for mixture-of-experts llms gains more. arXiv preprint arXiv:2410.06270, 2024a
Huang, W., Liao, Y., Liu, J., He, R., Tan, H., Zhang, S., Li, H., Liu, S., and Qi, X. Mixture compressor for mixture-of-experts llms gains more. arXiv preprint arXiv:2410.06270, 2024a. Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M., and Qi, X. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.0...
-
[15]
Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023a. Kim, Y. J., Fahim, R., and Awadalla, H. H. Mixture of quantized experts (moqe): Complementary effect of low-bit quantization and robustness. arXiv preprint arXiv:2310.02410, 2023b. ...
-
[16]
Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426,
-
[17]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free ...
-
[18]
Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
-
[19]
Lu, X., Liu, Q., Xu, Y., Zhou, A., Huang, S., Zhang, B., Yan, J., and Li, H. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800,
-
[20]
Pointer Sentinel Mixture Models
Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,
-
[21]
Large Language Models: A Survey
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. Large language models: A survey. arXiv preprint arXiv:2402.06196,
-
[22]
OLMoE: Open Mixture-of-Experts Language Models
Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al. Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060,
-
[23]
Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137,
-
[24]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
-
[25]
Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396,
-
[26]
Efficient large language models: A survey. arXiv preprint arXiv:2312.03863,
Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863,
-
[27]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,
-
[28]
HellaSwag: Can a Machine Really Finish Your Sentence?
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,
-
[29]
Zheng, Z., Cui, X., Zheng, S., Li, M., Chen, J., Liang, Y., and Chen, X. Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation. arXiv preprint arXiv:2503.21135,
-
[30]
A. Details on Models and Evaluation
Table 8. Details of MoE-LLMs used in evaluation. “Expert Prop.” and “Router Prop.” denote the percentage of experts and routers in the total number of parameters (“#Params”), respectively. For the “#Experts” column, we follow the convention (#Routed E...
-
[31]
Qwen3-30B-A3B undergoes both pre-training and post-training
Note that, except for Qwen3-30B-A3B, all models are only pre-trained for language modeling without supervised fine-tuning (SFT). Qwen3-30B-A3B undergoes both pre-training and post-training. In addition to evaluating perplexity on general language modeling benchmarks, we evaluate different quantization methods on seven zero-shot tasks: PIQA (Bisk et al., 2...
-
[32]
All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024)
benchmark to assess the mathematical reasoning ability of quantized models. All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024). We report acc_norm when available; otherwise, acc is reported.
B. Comparison with State-of-the-Art Methods
In this section, we present the full results from Tab. 1 and Tab. 2, along with additi...
-
[33]
and MoEQuant (Hu et al., 2025). EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization. Although EAQuant also considers router distribution shift, it adopts a layer-wise rigid alignment scheme that yields only marginal gains (e.g., <...
-
[34]
Moreover, EAQuant targets relatively high-bit regimes (≥3 bpe), whereas GEMQ focuses on more aggressive low-bit settings (≤2.5 bpe) to better address the memory footprint of expert parameters. MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quan...
-
[35]
As shown in the figures, GEMQ is relatively robust to sampling noise, as the estimated error curves largely overlap even though only 128 sequences are used for calibration, achieving an average Pearson correlation over 0.99. Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified a...
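The robustness check described here compares error estimates obtained from different calibration samples. A minimal sketch of that comparison follows, using a plain Pearson correlation. The two error lists are hypothetical values, not numbers from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-expert error estimates from two calibration subsets.
errors_subset_a = [0.10, 0.42, 0.08, 0.95, 0.12]
errors_subset_b = [0.11, 0.40, 0.09, 0.97, 0.10]
r = pearson(errors_subset_a, errors_subset_b)
```

A correlation close to 1 across subsets would suggest that the ranking of experts, and in particular the high-error peaks, does not depend strongly on which sequences were sampled.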
-
[36]
Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4)
or exploring generalization objectives like sharpness-aware minimization in future work.
Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4).
Bits Per Expert | Bit Candidates B | Opt Obj (Eq. 7) ↓ | WT2 ↓ | C4 ↓ | 0-shot⁷ ↑
2.5 | {1,2,3} | 0.0144 | 4.97 | 8.95 | 65.22
 | {0,1,2,3} | 0.0139 | 5.02 | 8.91 | 64.96
 | {1,2,3,4} | 0.0138 | 5.00 | 8.95 | 65.19
 | {0,1,2,3,4} | 0.0131 | 5....
-
[37]
As shown in the left figure, since routers contain only a small number of parameters, training converges within a single epoch in under 2 minutes. In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal. We therefore use 128 samples in all experiments. ...
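The remark that routers hold few parameters can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes the publicly released Mixtral-8x7B configuration (hidden size 4096, FFN size 14336, 8 experts, 32 layers), a bias-free linear router, and three weight matrices per expert. These are assumptions about the architecture, not figures taken from the paper's Table 8.

```python
# Assumed Mixtral-8x7B-style dimensions (not taken from the paper).
HIDDEN = 4096
FFN = 14336
NUM_EXPERTS = 8
NUM_LAYERS = 32

# Router: one bias-free linear map from hidden state to expert logits per layer.
router_params = HIDDEN * NUM_EXPERTS * NUM_LAYERS

# Expert: three projection matrices (gate, up, down), each HIDDEN x FFN in size.
expert_params = 3 * HIDDEN * FFN * NUM_EXPERTS * NUM_LAYERS

# Routers are a tiny fraction of the expert parameters under these assumptions.
router_to_expert_ratio = router_params / expert_params
```

Under these assumptions the routers hold about one million parameters against roughly 45 billion in the experts, which is consistent with router fine-tuning being cheap relative to the rest of the pipeline.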