GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Dongwei Wang; Huanrui Yang; Jianing Deng; Jingtong Hu; Song Wang; Tianlong Chen; Zijie Liu

arxiv: 2605.23078 · v1 · pith:NAQVKMFBnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu

Pith reviewed 2026-05-25 05:31 UTC · model grok-4.3

keywords: mixture of experts, mixed-precision quantization, expert importance, router fine-tuning, model compression, inference optimization, linear programming

The pith

A global linear program ranks all MoE experts by quantization error, and router fine-tuning restores accuracy at lower bit widths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts large language models deliver strong results yet store separate expert networks that drive high memory cost. Mixed-precision quantization can assign fewer bits to less critical experts, but earlier methods scored importance only inside each layer and left router decisions unchanged after quantization. GEMQ instead solves one linear program over the entire model to rank every expert from its contribution to quantization error, then fine-tunes the router on the resulting quantized experts inside an iterative loop. Experiments show the resulting allocations cut memory and speed inference while keeping accuracy close to the unquantized baseline.

Core claim

The paper claims that casting expert bit allocation as a single global linear program derived from quantization error analysis, combined with router fine-tuning inside a progressive quantization loop, yields mixed-precision assignments that reduce memory and accelerate inference with only minimal accuracy loss compared with layer-wise baselines.
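The available text does not reproduce the paper's equations, so the following is only one plausible way to write such a program, not the authors' formulation. The symbols are assumptions: $c_{e,b}$ is an estimated quantization error for expert $e$ at bit-width $b$, $x_{e,b}$ is a relaxed assignment variable, $E$ is the total number of experts across all layers, and $\bar{b}$ is the target average bits per expert. Experts are assumed to be equally sized, and the candidate set $B$ is the one named in the Figure 10 caption.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{e=1}^{E}\sum_{b\in B} c_{e,b}\,x_{e,b} \\
\text{s.t.}\quad & \sum_{b\in B} x_{e,b} = 1 \qquad \forall e \in \{1,\dots,E\}, \\
& \sum_{e=1}^{E}\sum_{b\in B} b\,x_{e,b} \;\le\; \bar{b}\,E, \\
& 0 \le x_{e,b} \le 1 .
\end{aligned}
```

Because the index $e$ runs over every expert in the model rather than over one layer at a time, a single budget constraint couples all layers, which is what distinguishes a global allocation from a layer-wise one.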

What carries the argument

Global linear-programming formulation that scores model-wide expert importance from quantization error analysis, paired with router fine-tuning.
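To make the idea of a model-wide allocation concrete, here is a minimal sketch that swaps in a greedy marginal-gain heuristic for the linear-program solver; it is not the paper's method. The function name `allocate_bits`, the expert labels, and the error values are all hypothetical, and experts are assumed to be equally sized.

```python
from typing import Dict


def allocate_bits(errors: Dict[str, Dict[int, float]], avg_bits: float) -> Dict[str, int]:
    """Greedy global allocation (a stand-in heuristic, not an LP solver).

    Every expert starts at its lowest candidate bit-width. The upgrade with
    the largest error reduction per extra bit, taken over ALL experts in the
    model, is applied repeatedly until the bit budget is spent.
    """
    alloc = {expert: min(costs) for expert, costs in errors.items()}
    budget = avg_bits * len(errors) - sum(alloc.values())
    while True:
        best = None  # (gain per extra bit, expert, new bit-width)
        for expert, costs in errors.items():
            current = alloc[expert]
            for bits in costs:
                extra = bits - current
                if extra <= 0 or extra > budget:
                    continue
                gain = (costs[current] - costs[bits]) / extra
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, expert, bits)
        if best is None:
            return alloc
        _, expert, bits = best
        budget -= bits - alloc[expert]
        alloc[expert] = bits


# Hypothetical error estimates for four experts in two layers, candidates {1, 2, 3}.
errors = {
    "L0.E0": {1: 0.90, 2: 0.30, 3: 0.10},
    "L0.E1": {1: 0.20, 2: 0.10, 3: 0.05},
    "L1.E0": {1: 0.80, 2: 0.35, 3: 0.15},
    "L1.E1": {1: 0.05, 2: 0.03, 3: 0.02},
}
alloc = allocate_bits(errors, avg_bits=2.0)
print(alloc)
```

In this toy input the two error-sensitive experts end up at 3 bits and the two robust ones at 1 bit, regardless of which layer they sit in; a per-layer budget could not express that trade across layers when the sensitive experts cluster in one layer.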

If this is right

MoE models reach extreme low-bit configurations with smaller accuracy cost than layer-wise methods allow.
Memory footprint drops substantially while inference speed increases.
Progressive iteration between importance estimation and allocation improves final bit assignments.
Router adaptation becomes necessary once experts are quantized to different precisions.
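The Figure 1 caption reports that over 40% of tokens are routed to different experts after 1.5-bit quantization. One way such a shift could be measured is sketched below; the definition used here (a token counts as changed when its top-k expert set differs) is an assumption, since the available text does not give the paper's exact metric, and the logits are invented for illustration.

```python
from typing import FrozenSet, Sequence


def top_k(logits: Sequence[float], k: int) -> FrozenSet[int]:
    """Indices of the k largest router logits for one token."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return frozenset(order[:k])


def selection_change_ratio(
    fp_logits: Sequence[Sequence[float]],
    q_logits: Sequence[Sequence[float]],
    k: int = 2,
) -> float:
    """Fraction of tokens whose top-k expert set differs between two models."""
    changed = sum(top_k(a, k) != top_k(b, k) for a, b in zip(fp_logits, q_logits))
    return changed / len(fp_logits)


# Hypothetical router logits: 4 tokens, 4 experts, before and after quantization.
fp = [[2.0, 1.0, 0.1, 0.0], [0.1, 3.0, 2.0, 0.0], [1.0, 0.0, 2.0, 0.5], [0.0, 0.2, 0.1, 3.0]]
q = [[2.0, 0.9, 0.1, 0.0], [0.1, 1.0, 2.0, 1.5], [1.0, 0.0, 2.0, 0.5], [0.0, 0.2, 0.3, 3.0]]
ratio = selection_change_ratio(fp, q, k=2)
print(ratio)  # 0.5: tokens 1 and 3 pick a different expert pair
```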

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global ranking step may transfer to other sparsely activated networks that use learned routing.
Co-optimizing the router appears required whenever conditional computation paths are compressed.
Larger MoE models could fit on memory-limited hardware if the allocation and tuning steps scale.

Load-bearing premise

The linear program derived from quantization error analysis correctly ranks the relative importance of every expert across the full model, and router fine-tuning can fully offset any shifts in expert selection caused by the lower precision.
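The Figure 1 caption says expert sensitivity is measured by squared gradients, i.e. the trace of the empirical Fisher Information Matrix. A minimal sketch of that proxy follows; the function name, expert labels, and gradient values are hypothetical, and how the paper turns this quantity into LP coefficients is not stated in the available text.

```python
from typing import Dict, List


def fisher_trace(per_sample_grads: List[List[float]]) -> float:
    """Trace of the empirical Fisher: mean over samples of the squared gradient norm."""
    n_samples = len(per_sample_grads)
    return sum(g * g for grads in per_sample_grads for g in grads) / n_samples


# Hypothetical flattened per-sample gradients for two experts in different layers.
grads_by_expert: Dict[str, List[List[float]]] = {
    "layer0.expert0": [[1.0, -2.0], [0.0, 2.0]],
    "layer5.expert3": [[0.1, 0.1], [0.2, 0.0]],
}
sensitivity = {name: fisher_trace(g) for name, g in grads_by_expert.items()}
ranking = sorted(sensitivity, key=sensitivity.get, reverse=True)
print(ranking)
```

Ranking all experts in one list, rather than one list per layer, is what allows scores from different layers to be compared directly.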

What would settle it

Apply the GEMQ procedure to a standard MoE LLM and check whether the final quantized model either falls short of the claimed memory savings or exhibits larger accuracy degradation than the paper reports on the same evaluation benchmarks.

Figures

Figures reproduced from arXiv: 2605.23078 by Dongwei Wang, Huanrui Yang, Jianing Deng, Jingtong Hu, Song Wang, Tianlong Chen, Zijie Liu.

**Figure 1.** Motivations. (a) The sensitivity of expert weights, measured by squared gradients (i.e., the trace of the empirical Fisher Information Matrix), varies not only within a layer but also across layers, indicating heterogeneous layer importance; and (b) over 40% of tokens are routed to different experts after 1.5-bit quantization, revealing substantial distortion in router distributions. Statistics are compute…

**Figure 2.** Overview of the proposed GEMQ framework for MoE-LLMs quantization.

**Figure 3.** Illustration of different cases of expert importance estimation for target weights ŵ. w⋆ denotes FP expert weights.

**Figure 4.** Analysis of global expert bit-width allocation. [Plot panels: WikiText2 perplexity and C4 perplexity versus Bits/Expert (1.5, 2.0, 2.5) for DeepSeekV2-Lite and Mixtral-8x7B; further panels…]

**Figure 5.** Ablation of the proposed techniques. “RFT” denotes global router fine-tuning, and “PQ” denotes progressive quantization. “Zero-shot Accuracy” is averaged over seven tasks.

**Figure 6.** Statistics of the expert importance proxy and corresponding bit-width allocation across four randomly sampled calibration subsets from C4 (three with 128 sequences and one with 2048 sequences; each sequence contains 2048 tokens). Note that dark red in the figures indicates overlap

**Figure 7.** Ablation study on router fine-tuning settings. Weights are initialized from the quantized Mixtral-8×7B model (1.5 bits/expert). Training data are randomly extracted from the WikiText2 training set. Times are measured on three H100 GPUs.

**Figure 8.** Comparison of router statistics from selected layers of the full-precision, quantized (1.5 bits/expert), and router fine-tuned Mixtral-8×7B models on the WikiText2 test set.

**Figure 9.** Router selection change ratio of Mixtral-8×7B computed on the WikiText2 test set.

**Figure 10.** Detailed expert bit-width allocations produced by GEMQ with candidate set B = {1, 2, 3}. [Heatmap panels, Expert Index versus Layer Index: 2.5 bits per expert (#1bit: 415; #2bit: 28; #3bit: 1273); 2.0 bits per expert (#1bit: 841; #2bit: 34; #3bit: 841); further panels…]
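The per-panel counts in the Figure 10 caption can be checked against the stated average bit-widths with a few lines of arithmetic. The sketch below assumes every expert has the same number of parameters, so the average is a simple count-weighted mean; the function and variable names are illustrative.

```python
from typing import Dict


def average_bits(counts: Dict[int, int]) -> float:
    """Count-weighted mean bit-width, assuming equally sized experts."""
    total_experts = sum(counts.values())
    total_bits = sum(bits * n for bits, n in counts.items())
    return total_bits / total_experts


# Expert counts per bit-width as listed in the Figure 10 caption.
panel_2_5 = {1: 415, 2: 28, 3: 1273}
panel_2_0 = {1: 841, 2: 34, 3: 841}
print(average_bits(panel_2_5), average_bits(panel_2_0))  # 2.5 2.0
```

Both panels cover 1716 experts, and in both the allocation is strongly bimodal: very few experts receive the middle 2-bit option.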

Abstract

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GEMQ's global LP formulation and router fine-tuning step address real gaps in layer-wise MoE quantization, but the abstract supplies no numbers or derivations to check the claims.

The main point is that this paper moves from per-layer importance scoring to a single global linear program that assigns bit widths across all experts based on model-wide quantization error, then adds a router fine-tuning pass to correct for routing changes. That combination is the technical step they highlight over prior work. The progressive framework that refines the allocation iteratively is also presented as a practical addition. Recognizing that quantizing experts can shift router behavior is a reasonable observation, and trying to compensate with fine-tuning is a direct response rather than an afterthought. The abstract frames this as fixing suboptimal allocations that layer-wise methods produce.

On the evidence side, nothing concrete appears: no accuracy numbers, no memory or speed measurements, no baseline comparisons, and no details on how the LP is actually constructed or solved. The claim of minimal accuracy degradation therefore sits unsupported for now. Without seeing the error analysis or the router adaptation procedure in action, it is difficult to judge whether the global ranking of experts actually improves on simpler methods or whether the fine-tuning fully recovers performance.

This work targets people who need to run large MoE models under tight memory or latency constraints, such as those working on inference optimization or hardware deployment. A reader already following mixed-precision quantization for sparse models would pick up the specific idea of global optimization plus router correction and could test it themselves.

The paper deserves peer review because the problem is current, the proposed distinction from layer-wise baselines is clear, and the approach is described at a level that referees can evaluate once the experiments and derivations are in front of them.

Referee Report

0 major / 2 minor

Summary. The paper proposes GEMQ, a mixed-precision quantization method for MoE LLMs. It uses a global linear-programming formulation derived from quantization error analysis to allocate bit-widths according to model-wide expert importance, combined with router fine-tuning to compensate for quantization-induced routing shifts, all within a progressive quantization framework. The authors claim this yields substantial memory reduction and inference speedup with minimal accuracy degradation compared to prior layer-wise approaches.

Significance. If the global LP formulation and router fine-tuning prove effective, the method could improve the accuracy-memory trade-off for large MoE models beyond existing layer-wise quantization techniques. The public release of source code at the cited GitHub repository is a clear strength for reproducibility and verification.

minor comments (2)

[Abstract] The claims of 'minimal accuracy degradation' and 'significantly reduces memory' are stated without any quantitative results, baselines, or error metrics, preventing evaluation of the central empirical claim.
[Abstract] No derivation details, equations, or description of the linear-programming objective/constraints are provided, so the 'global' vs. 'layer-wise' distinction cannot be assessed from the given text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for summarizing our GEMQ method and noting the public code release as a strength for reproducibility. The recommendation is listed as uncertain, but the major comments section contains no specific points. We remain available to address any concerns or provide additional experiments if raised.

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained against external benchmarks

The abstract and available context describe a global LP formulation from quantization error analysis plus router fine-tuning, but supply no equations, fitting procedures, or self-citations that reduce any claimed result to its own inputs by construction. No load-bearing step can be exhibited as equivalent to a fitted parameter or prior self-citation. The method is presented as experimentally validated on memory/accuracy trade-offs, which constitutes independent content rather than a renaming or self-definition. This is the normal honest finding when no specific reduction is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or verified.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 15 internal anchors

[1]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319.
[2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10081–10100, 2025a. Chen, Y., Shao, Y., Wang, P., and Cheng, J. Eac-moe: Expert-sel...
[3]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[4]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
[5]

Spqr: A sparse-quantized representation for near-lossless llm weight compression

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078.
[6]

Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design

Duanmu, H., Li, X., Yuan, Z., Zheng, S., Duan, J., Zhang, X., and Lin, D. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design. arXiv preprint arXiv:2505.05799.
[7]

Qmoe: Practical sub-1-bit compression of trillion-parameter models

Frantar, E. and Alistarh, D. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795.
[8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
[9]

Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization

Fu, Z., Ding, N., Han, K., Yu, X., Li, X., Chen, X., Tang, Y., and Wang, Y. Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization. arXiv preprint arXiv:2506.13329.
[10]

Preserving llm capabilities through calibration data curation: From analysis to optimization

URL https://zenodo.org/records/12608602. He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C. Preserving llm capabilities through calibration data curation: From analysis to optimization. Advances in Neural Information Processing Systems, 38: 58531–58572.
[11]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
[12]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
[13]

Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance

Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., and Yu, J. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804.
[14]

Mixture compressor for mixture-of-experts llms gains more

Huang, W., Liao, Y., Liu, J., He, R., Tan, H., Zhang, S., Li, H., Liu, S., and Qi, X. Mixture compressor for mixture-of-experts llms gains more. arXiv preprint arXiv:2410.06270, 2024a. Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M., and Qi, X. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.0...
[15]

Squeezellm: Dense-and-sparse quantization

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023a. Kim, Y. J., Fahim, R., and Awadalla, H. H. Mixture of quantized experts (moqe): Complementary effect of low-bit quantization and robustness. arXiv preprint arXiv:2310.02410, 2023b. ...
[16]

Brecq: Pushing the limit of post-training quantization by block reconstruction

Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426.
[17]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free ...
[18]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[19]

Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models

Lu, X., Liu, Q., Xu, Y., Zhou, A., Huang, S., Zhang, B., Yan, J., and Li, H. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800.
[20]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
[21]

Large Language Models: A Survey

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. Large language models: A survey. arXiv preprint arXiv:2402.06196.
[22]

OLMoE: Open Mixture-of-Experts Language Models

Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al. Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060.
[23]

Omniquant: Omnidirectionally calibrated quantization for large language models

Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137.
[24]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[25]

Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396.
[26]

Efficient large language models: A survey

Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863.
[27]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[28]

HellaSwag: Can a Machine Really Finish Your Sentence?

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
[29]

Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation

Zheng, Z., Cui, X., Zheng, S., Li, M., Chen, J., Liang, Y., and Chen, X. Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation. arXiv preprint arXiv:2503.21135.
[30]

A. Details on Models and Evaluation

Table 8. Details of MoE-LLMs used in evaluation. “Expert Prop.” and “Router Prop.” denote the percentage of experts and routers in the total number of parameters (“#Params”), respectively. For the “#Experts” column, we follow the convention (#Routed E...
[31]

Qwen3-30B-A3B undergoes both pre-training and post-training

Note that, except for Qwen3-30B-A3B, all models are only pre-trained for language modeling without supervised fine-tuning (SFT). Qwen3-30B-A3B undergoes both pre-training and post-training. In addition to evaluating perplexity on general language modeling benchmarks, we evaluate different quantization methods on seven zero-shot tasks: PIQA (Bisk et al., 2...

[32]

All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024)

benchmark to assess the mathematical reasoning ability of quantized models. All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024). We report acc_norm when available; otherwise, acc is reported. B. Comparison with State-of-the-Art Methods In this section, we present the full results from Tab. 1 and Tab. 2, along with additi...
[33]

EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization

and MoEQuant (Hu et al., 2025). EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization. Although EAQuant also considers router distribution shift, it adopts a layer-wise rigid alignment scheme that yields only marginal gains (e.g., <...

[34]

MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quantization error

Moreover, EAQuant targets relatively high-bit regimes (≥3 bpe), whereas GEMQ focuses on more aggressive low-bit settings (≤2.5 bpe) to better address the memory footprint of expert parameters. MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quan...

[35]

Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified across different samples

As shown in the figures, GEMQ is relatively robust to sampling noise, as the estimated error curves largely overlap even though only 128 sequences are used for calibration, achieving an average Pearson correlation over 0.99. Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified a...
[36]

Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4)

or exploring generalization objectives like sharpness-aware minimization in future work. Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4).

| Bits Per Expert | Bit Candidates B | Opt Obj (Eq. 7) ↓ | WT2 ↓ | C4 ↓ | 0-shot⁷ ↑ |
|---|---|---|---|---|---|
| 2.5 | {1,2,3} | 0.0144 | 4.97 | 8.95 | 65.22 |
| | {0,1,2,3} | 0.0139 | 5.02 | 8.91 | 64.96 |
| | {1,2,3,4} | 0.0138 | 5.00 | 8.95 | 65.19 |
| | {0,1,2,3,4} | 0.0131 | 5.... | | |
[37]

In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal

As shown in the left figure, since routers contain only a small number of parameters, training converges within a single epoch in under 2 minutes. In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal. We therefore use 128 samples in all experiments.

[1] [1]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y ., and Hajishirzi, H. Mathqa: Towards interpretable math 9 GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319,

work page internal anchor Pith review Pith/arXiv arXiv 1905

[2] [2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pp. 10081– 10100, 2025a. Chen, Y ., Shao, Y ., Wang, P., and Cheng, J. Eac- moe: Expert-sel...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts lan- guage models.arXiv preprint arXiv:2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Spqr: A sparse-quantized representation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078,

Dettmers, T., Svirschevski, R., Egiazarian, V ., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078,

work page arXiv

[6] [6]

Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design.arXiv preprint arXiv:2505.05799,

Duanmu, H., Li, X., Yuan, Z., Zheng, S., Duan, J., Zhang, X., and Lin, D. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design.arXiv preprint arXiv:2505.05799,

work page arXiv

[7] [7]

and Alistarh, D

Frantar, E. and Alistarh, D. Qmoe: Practical sub-1-bit compression of trillion-parameter models.arXiv preprint arXiv:2310.16795,

work page arXiv

[8] [8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre- trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Eaquant: Enhancing post-training quan- tization for moe models via expert-aware optimization

Fu, Z., Ding, N., Han, K., Yu, X., Li, X., Chen, X., Tang, Y ., and Wang, Y . Eaquant: Enhancing post-training quan- tization for moe models via expert-aware optimization. arXiv preprint arXiv:2506.13329,

work page arXiv

[10] [10]

He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C

URL https://zenodo.org/records/12608602. He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C. Preserving llm capabilities through calibration data curation: From analysis to optimization. Advances in Neural Information Processing Systems, 38: 58531–58572,

work page arXiv

[11] [11]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[12] [12]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[13]

MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance

Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., and Yu, J. MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804,

[14]

Mixture compressor for mixture-of-experts LLMs gains more

Huang, W., Liao, Y., Liu, J., He, R., Tan, H., Zhang, S., Li, H., Liu, S., and Qi, X. Mixture compressor for mixture-of-experts LLMs gains more. arXiv preprint arXiv:2410.06270, 2024a.

Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M., and Qi, X. BiLLM: Pushing the limit of post-training quantization for LLMs. arXiv preprint arXiv:2402.0...

[15]

SqueezeLLM: Dense-and-sparse quantization

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023a.

Kim, Y. J., Fahim, R., and Awadalla, H. H. Mixture of quantized experts (MoQE): Complementary effect of low-bit quantization and robustness. arXiv preprint arXiv:2310.02410, 2023b. ...

[16]

BRECQ: Pushing the limit of post-training quantization by block reconstruction

Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. BRECQ: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426,

[17]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. LLM-QAT: Data-free ...

[18]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

[19]

Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models

Lu, X., Liu, Q., Xu, Y., Zhou, A., Huang, S., Zhang, B., Yan, J., and Li, H. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800,

[20]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

[21]

Large Language Models: A Survey

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. Large language models: A survey. arXiv preprint arXiv:2402.06196,

[22]

OLMoE: Open Mixture-of-Experts Language Models

Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060,

[23]

OmniQuant: Omnidirectionally calibrated quantization for large language models

Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137,

[24]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

[25]

QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396,

[26]

Efficient large language models: A survey

Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863,

[27]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[28]

HellaSwag: Can a Machine Really Finish Your Sentence?

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

[29]

Dynamo: Runtime switchable quantization for MoE with cross-dataset adaptation

Zheng, Z., Cui, X., Zheng, S., Li, M., Chen, J., Liang, Y., and Chen, X. Dynamo: Runtime switchable quantization for MoE with cross-dataset adaptation. arXiv preprint arXiv:2503.21135,

[30]

A. Details on Models and Evaluation

Table 8. Details of MoE-LLMs used in evaluation. “Expert Prop.” and “Router Prop.” denote the percentage of experts and routers in the total number of parameters (“#Params”), respectively. For the “#Experts” column, we follow the convention (#Routed E...

[31]

Note that, except for Qwen3-30B-A3B, all models are only pre-trained for language modeling without supervised fine-tuning (SFT). Qwen3-30B-A3B undergoes both pre-training and post-training. In addition to evaluating perplexity on general language modeling benchmarks, we evaluate different quantization methods on seven zero-shot tasks: PIQA (Bisk et al., 2...

[32]

benchmark to assess the mathematical reasoning ability of quantized models. All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024). We report acc_norm when available; otherwise, acc is reported.

B. Comparison with State-of-the-Art Methods

In this section, we present the full results from Tab. 1 and Tab. 2, along with additi...
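The reporting rule stated in Appendix A (use acc_norm when a task provides it, otherwise fall back to acc) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, `task_results` is assumed to be a plain dictionary of metric name to value (the harness's actual result keys may differ), and the unweighted mean over tasks is an assumption about how a zero-shot average would be formed.

```python
def pick_accuracy(task_results):
    """Return (metric_name, value), preferring normalized accuracy.

    Assumption: `task_results` is a plain dict mapping a metric name
    to a float; this is a hypothetical layout, not the harness's own.
    """
    for name in ("acc_norm", "acc"):
        if name in task_results:
            return name, task_results[name]
    raise KeyError("neither 'acc_norm' nor 'acc' is present")


def average_zero_shot(per_task):
    """Unweighted mean of the selected accuracy over all tasks.

    Assumption: tasks are weighted equally; the source does not
    state how the average is formed.
    """
    if not per_task:
        raise ValueError("no tasks given")
    values = [pick_accuracy(results)[1] for results in per_task.values()]
    return sum(values) / len(values)
```

Keeping the preference order in a single tuple makes the fallback explicit and easy to extend if a task exposes only a different metric.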

[33]

and MoEQuant (Hu et al., 2025). EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization. Although EAQuant also considers router distribution shift, it adopts a layer-wise rigid alignment scheme that yields only marginal gains (e.g., <...

[34]

Moreover, EAQuant targets relatively high-bit regimes (≥3 bpe), whereas GEMQ focuses on more aggressive low-bit settings (≤2.5 bpe) to better address the memory footprint of expert parameters. MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quan...

[35]

As shown in the figures, GEMQ is relatively robust to sampling noise: the estimated error curves largely overlap even though only 128 sequences are used for calibration, with an average Pearson correlation above 0.99. Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified a...
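The robustness check described above, comparing error-estimation curves obtained from different calibration samples by their Pearson correlation, can be sketched as follows. This is an illustrative sketch under stated assumptions: each curve is assumed to be a list holding one estimated error per expert, the function names are hypothetical, and averaging over all pairs of curves is an assumption, since the source does not say how the average correlation is formed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_pearson(curves):
    """Average Pearson correlation over all pairs of curves.

    Assumption: each curve holds one estimated error per expert, and
    pairwise averaging is our choice, not a detail given in the source.
    """
    pairs = [(i, j) for i in range(len(curves)) for j in range(i + 1, len(curves))]
    if not pairs:
        raise ValueError("need at least two curves")
    return sum(pearson(curves[i], curves[j]) for i, j in pairs) / len(pairs)
```

A value close to 1 indicates that the curves rise and fall together, so the same experts show up as peaks regardless of which calibration subset was drawn.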

[36]

or exploring generalization objectives like sharpness-aware minimization in future work.

Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4).

Bits Per Expert | Bit Candidates B | Opt Obj (Eq. 7) ↓ | WT2 ↓ | C4 ↓ | 0-shot⁷ ↑
2.5 | {1,2,3} | 0.0144 | 4.97 | 8.95 | 65.22
 | {0,1,2,3} | 0.0139 | 5.02 | 8.91 | 64.96
 | {1,2,3,4} | 0.0138 | 5.00 | 8.95 | 65.19
 | {0,1,2,3,4} | 0.0131 | 5....

[37]

As shown in the left figure, since routers contain only a small number of parameters, training converges within a single epoch in under 2 minutes. In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal. We therefore use 128 samples in all experiments.