
arxiv: 2605.23078 · v1 · pith:NAQVKMFBnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Pith reviewed 2026-05-25 05:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords mixture of experts · mixed-precision quantization · expert importance · router fine-tuning · model compression · inference optimization · linear programming

The pith

A global linear program ranks all MoE experts by quantization error, and router fine-tuning restores accuracy at lower bit widths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts large language models deliver strong results yet store separate expert networks that drive high memory cost. Mixed-precision quantization can assign fewer bits to less critical experts, but earlier methods scored importance only inside each layer and left router decisions unchanged after quantization. GEMQ instead solves one linear program over the entire model to rank every expert from its contribution to quantization error, then fine-tunes the router on the resulting quantized experts inside an iterative loop. Experiments show the resulting allocations cut memory and speed inference while keeping accuracy close to the unquantized baseline.

Core claim

The paper claims that casting expert bit allocation as a single global linear program derived from quantization error analysis, combined with router fine-tuning inside a progressive quantization loop, yields mixed-precision assignments that reduce memory and accelerate inference with only minimal accuracy loss compared with layer-wise baselines.

What carries the argument

Global linear-programming formulation that scores model-wide expert importance from quantization error analysis, paired with router fine-tuning.
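
To make the allocation idea concrete, here is a minimal sketch under stated assumptions; it is not the paper's solver. The paper casts the choice as a linear program whose objective and constraints are not reproduced on this page, so the sketch swaps in a greedy upgrade rule. The error table `errors[e][b]`, the function `allocate_bits`, the equal-size experts, and the toy numbers are all hypothetical. What it illustrates is the "global" part: every expert from every layer competes in one ranking for the same bit budget.

```python
import heapq

def allocate_bits(errors, candidates, avg_bits):
    """Greedy global bit allocation (illustrative stand-in for an LP solver).

    errors[e][b]: assumed error estimate for expert e at b bits.
    candidates:   allowed bit-widths, e.g. [1, 2, 3].
    avg_bits:     model-wide average bits per expert (the budget).
    """
    cands = sorted(candidates)
    n = len(errors)
    level = [0] * n  # every expert starts at the lowest bit-width
    budget = round(avg_bits * n) - cands[0] * n  # bits left to hand out

    heap = []  # max-heap on error reduction per extra bit

    def push_next_upgrade(e):
        i = level[e]
        if i + 1 < len(cands):
            cost = cands[i + 1] - cands[i]
            gain = errors[e][cands[i]] - errors[e][cands[i + 1]]
            heapq.heappush(heap, (-gain / cost, e, cost))

    for e in range(n):
        push_next_upgrade(e)

    while heap and budget > 0:
        _, e, cost = heapq.heappop(heap)
        if cost > budget:
            continue  # this upgrade no longer fits the remaining budget
        level[e] += 1
        budget -= cost
        push_next_upgrade(e)

    return [cands[i] for i in level]


# Toy example: experts 0-1 sit in one layer, experts 2-3 in another.
toy_errors = [
    {1: 9.0, 2: 4.0, 3: 1.0},    # sensitive
    {1: 8.0, 2: 3.5, 3: 1.0},    # sensitive
    {1: 1.0, 2: 0.5, 3: 0.25},   # insensitive
    {1: 0.4, 2: 0.2, 3: 0.1},    # insensitive
]
toy_bits = allocate_bits(toy_errors, [1, 2, 3], avg_bits=2.0)
print(toy_bits)
```

In the toy run both sensitive experts share a layer, so that layer ends up above the average and the other below it, which a fixed per-layer budget could not produce. The greedy rule is only a heuristic in general; it should match an exact solver when every upgrade step costs the same and each expert's error shows diminishing returns in bit-width.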

If this is right

  • MoE models reach extreme low-bit configurations with smaller accuracy cost than layer-wise methods allow.
  • Memory footprint drops substantially while inference speed increases.
  • Progressive iteration between importance estimation and allocation improves final bit assignments.
  • Router adaptation becomes necessary once experts are quantized to different precisions.
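
The iteration in the third bullet can be pictured as a loop that re-estimates importance, reallocates bits, and re-tunes the router at each step. The skeleton below is a sketch of that control flow only. The hook names (`estimate_errors`, `allocate`, `tune_router`), the budget schedule, and the toy hooks are assumptions made for illustration; the page does not describe the paper's actual schedule or stopping rule.

```python
def progressive_quantize(budgets, estimate_errors, allocate, tune_router):
    """Skeleton of an estimate -> allocate -> tune-router loop.

    All hooks are hypothetical and supplied by the caller; nothing here
    reproduces the paper's actual procedure.
    """
    bits, router = None, None
    trace = []
    for avg_bits in budgets:
        errors = estimate_errors(bits, router)  # re-estimate on the current model
        bits = allocate(errors, avg_bits)       # global bit assignment
        router = tune_router(bits)              # adapt routing to the new experts
        trace.append((avg_bits, list(bits)))
    return bits, router, trace


# Toy hooks so the skeleton runs end to end.
def toy_estimate(bits, router):
    return [5.0, 1.0, 4.0, 0.5]  # fixed sensitivity proxy per expert

def toy_allocate(errors, avg_bits):
    # 3 bits for the most sensitive experts, 1 bit for the rest.
    n = len(errors)
    n_high = round((avg_bits - 1) * n / 2)
    ranked = sorted(range(n), key=lambda e: errors[e], reverse=True)
    bits = [1] * n
    for e in ranked[:n_high]:
        bits[e] = 3
    return bits

def toy_tune(bits):
    return {"tuned_for": tuple(bits)}

final_bits, final_router, trace = progressive_quantize(
    [3.0, 2.0], toy_estimate, toy_allocate, toy_tune
)
```

Passing the steps in as functions keeps the loop independent of any particular importance proxy or solver, which is convenient when comparing variants in an ablation.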

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global ranking step may transfer to other sparsely activated networks that use learned routing.
  • Co-optimizing the router appears required whenever conditional computation paths are compressed.
  • Larger MoE models could fit on memory-limited hardware if the allocation and tuning steps scale.

Load-bearing premise

The linear program derived from quantization error analysis correctly ranks the relative importance of every expert across the full model, and router fine-tuning can fully offset any shifts in expert selection caused by the lower precision.
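
One way to quantify the "shifts in expert selection" this premise depends on is a selection change ratio: the share of tokens whose chosen experts differ before and after quantization. The sketch below assumes a token counts as changed when its top-k expert set differs; the paper's exact definition is not given on this page, and the function names and logits are hypothetical.

```python
def topk_set(logits, k):
    """Indices of the k largest router logits for one token."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return frozenset(ranked[:k])

def selection_change_ratio(fp_logits, q_logits, k=2):
    """Fraction of tokens whose top-k expert set changed after quantization."""
    changed = sum(
        topk_set(fp, k) != topk_set(q, k)
        for fp, q in zip(fp_logits, q_logits)
    )
    return changed / len(fp_logits)


# Toy router logits: 3 tokens, 4 experts (made-up numbers).
fp_logits = [[2.0, 1.0, 0.1, 0.0], [0.1, 3.0, 2.0, 0.0], [1.0, 0.0, 0.5, 2.0]]
q_logits = [[2.1, 0.9, 0.2, 0.0], [0.1, 1.0, 2.0, 1.5], [1.0, 0.0, 0.4, 2.2]]
ratio = selection_change_ratio(fp_logits, q_logits, k=2)
```

Only the second token switches experts in the toy data, so one token in three is counted as changed. Measuring the same ratio again after router fine-tuning would show how much of the shift the tuning removes.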

What would settle it

Apply the GEMQ procedure to a standard MoE LLM and check whether the final quantized model falls short of the claimed memory savings or shows larger accuracy degradation than the paper reports on the same evaluation benchmarks.

Figures

Figures reproduced from arXiv: 2605.23078 by Dongwei Wang, Huanrui Yang, Jianing Deng, Jingtong Hu, Song Wang, Tianlong Chen, Zijie Liu.

Figure 1
Figure 1: Motivations. (a) The sensitivity of expert weights, measured by squared gradients (i.e., the trace of the empirical Fisher Information Matrix), varies not only within a layer but also across layers, indicating heterogeneous layer importance; and (b) over 40% of tokens are routed to different experts after 1.5-bit quantization, revealing substantial distortion in router distributions. Statistics are compute…
Figure 2
Figure 2: Overview of the proposed GEMQ framework for MoE-LLMs quantization.
Figure 3
Figure 3: Illustration of different cases of expert importance estimation for target weights ŵ. w⋆ denotes FP expert weights.
Figure 4
Figure 4: Analysis of global expert bit-width allocation. [Plot panels: WikiText2 perplexity and C4 perplexity versus bits per expert (1.5, 2.0, 2.5) for DeepSeekV2-Lite and Mixtral-8x7B; remaining panels truncated …]
Figure 5
Figure 5: Ablation of the proposed techniques. “RFT” denotes global router fine-tuning, and “PQ” denotes progressive quantization. “Zero-shot Accuracy” is averaged over seven tasks. … whereas GEMQ preserves sufficient precision for critical experts, preventing collapse and maintaining performance. Additionally, GEMQ consistently outperforms both EAQuant and MoEQuant at comparable bit budgets, demonstrating the effec…
Figure 6
Figure 6: Statistics of the expert importance proxy and corresponding bit-width allocation across four randomly sampled calibration subsets from C4 (three with 128 sequences and one with 2048 sequences; each sequence contains 2048 tokens). Note that dark red in the figures indicates overlap.
Figure 7
Figure 7: Ablation study on router fine-tuning settings. Weights are initialized from the quantized Mixtral-8×7B model (1.5 bits/expert). Training data are randomly extracted from the WikiText2 training set. Times are measured on three H100 GPUs. [Plot residue omitted: per-expert panels for Layers 3, 5, 9, 13, 19, and 29 …]
Figure 8
Figure 8: Comparison of router statistics from selected layers of the full-precision, quantized (1.5 bits/expert), and router fine-tuned Mixtral-8×7B models on the WikiText2 test set.
Figure 9
Figure 9: Router selection change ratio of Mixtral-8×7B computed on the WikiText2 test set.
Figure 10
Figure 10 shows detailed expert bit-width allocations produced by GEMQ with candidate set B = {1, 2, 3}. Panels (layer index versus expert index): 2.5 bits per expert (#1bit: 415; #2bit: 28; #3bit: 1273); 2.0 bits per expert (#1bit: 841; #2bit: 34; #3bit: 841); …
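
The expert counts quoted in the Figure 10 panels can be checked against their stated budgets with a short calculation. The sketch below uses only the counts visible in the caption and assumes all experts are the same size, so the average is a plain count-weighted mean; the helper name `average_bits` is illustrative.

```python
def average_bits(counts):
    """counts maps bit-width -> number of experts assigned that width."""
    total_experts = sum(counts.values())
    total_bits = sum(bits * n for bits, n in counts.items())
    return total_bits / total_experts


panel_25 = {1: 415, 2: 28, 3: 1273}  # panel labelled "2.5 bits per expert"
panel_20 = {1: 841, 2: 34, 3: 841}   # panel labelled "2.0 bits per expert"

avg_25 = average_bits(panel_25)
avg_20 = average_bits(panel_20)
```

Both panels cover 1716 experts and their averages match the labels. In both, only a few dozen experts receive 2 bits, so the allocation is concentrated at the two ends of the candidate set.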
Original abstract

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes GEMQ, a mixed-precision quantization method for MoE LLMs. It uses a global linear-programming formulation derived from quantization error analysis to allocate bit-widths according to model-wide expert importance, combined with router fine-tuning to compensate for quantization-induced routing shifts, all within a progressive quantization framework. The authors claim this yields substantial memory reduction and inference speedup with minimal accuracy degradation compared to prior layer-wise approaches.

Significance. If the global LP formulation and router fine-tuning prove effective, the method could improve the accuracy-memory trade-off for large MoE models beyond existing layer-wise quantization techniques. The public release of source code at the cited GitHub repository is a clear strength for reproducibility and verification.

minor comments (2)
  1. [Abstract] The claims of 'minimal accuracy degradation' and 'significantly reduces memory' are stated without any quantitative results, baselines, or error metrics, which prevents evaluation of the central empirical claim.
  2. [Abstract] No derivation details, equations, or description of the linear-programming objective and constraints are provided, so the 'global' vs. 'layer-wise' distinction cannot be assessed from the given text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for summarizing our GEMQ method and noting the public code release as a strength for reproducibility. The recommendation is listed as uncertain, but the major comments section contains no specific points. We remain available to address any concerns or provide additional experiments if raised.

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained against external benchmarks

full rationale

The abstract and available context describe a global LP formulation from quantization error analysis plus router fine-tuning, but supply no equations, fitting procedures, or self-citations that reduce any claimed result to its own inputs by construction. No load-bearing step can be exhibited as equivalent to a fitted parameter or prior self-citation. The method is presented as experimentally validated on memory/accuracy trade-offs, which constitutes independent content rather than a renaming or self-definition. This is the normal honest finding when no specific reduction is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or verified.

pith-pipeline@v0.9.0 · 5716 in / 995 out tokens · 16694 ms · 2026-05-25T05:31:32.142726+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 15 internal anchors

  1. [1]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319,

  2. [2]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10081–10100, 2025a. Chen, Y., Shao, Y., Wang, P., and Cheng, J. Eac-moe: Expert-sel...

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  4. [4]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066,

  5. [5]

    Spqr: A sparse-quantized representation for near-lossless llm weight compression

    Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078,

  6. [6]

    Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design

    Duanmu, H., Li, X., Yuan, Z., Zheng, S., Duan, J., Zhang, X., and Lin, D. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design. arXiv preprint arXiv:2505.05799,

  7. [7]

    Qmoe: Practical sub-1-bit compression of trillion-parameter models

    Frantar, E. and Alistarh, D. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795,

  8. [8]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

  9. [9]

    Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization

    Fu, Z., Ding, N., Han, K., Yu, X., Li, X., Chen, X., Tang, Y., and Wang, Y. Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization. arXiv preprint arXiv:2506.13329,

  10. [10]

    Preserving llm capabilities through calibration data curation: From analysis to optimization

    URL https://zenodo.org/records/12608602. He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C. Preserving llm capabilities through calibration data curation: From analysis to optimization. Advances in Neural Information Processing Systems, 38: 58531–58572,

  11. [11]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  12. [12]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  13. [13]

    Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance

    Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., and Yu, J. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804,

  14. [14]

    Mixture compressor for mixture-of-experts llms gains more

    Huang, W., Liao, Y., Liu, J., He, R., Tan, H., Zhang, S., Li, H., Liu, S., and Qi, X. Mixture compressor for mixture-of-experts llms gains more. arXiv preprint arXiv:2410.06270, 2024a. Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M., and Qi, X. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.0...

  15. [15]

    Squeezellm: Dense-and-sparse quantization

    Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023a. Kim, Y. J., Fahim, R., and Awadalla, H. H. Mixture of quantized experts (moqe): Complementary effect of low-bit quantization and robustness. arXiv preprint arXiv:2310.02410, 2023b. ...

  16. [16]

    Brecq: Pushing the limit of post-training quantization by block reconstruction

    Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426,

  17. [17]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free ...

  18. [18]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  19. [19]

    Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models

    Lu, X., Liu, Q., Xu, Y., Zhou, A., Huang, S., Zhang, B., Yan, J., and Li, H. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800,

  20. [20]

    Pointer Sentinel Mixture Models

    Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

  21. [21]

    Large Language Models: A Survey

    Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. Large language models: A survey. arXiv preprint arXiv:2402.06196,

  22. [22]

    OLMoE: Open Mixture-of-Experts Language Models

    Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al. Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060,

  23. [23]

    Omniquant: Omnidirectionally calibrated quantization for large language models

    Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137,

  24. [24]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

  25. [25]

    Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks

    Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396,

  26. [26]

    Efficient large language models: A survey

    Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863,

  27. [27]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  28. [28]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

  29. [29]

    Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation

    Zheng, Z., Cui, X., Zheng, S., Li, M., Chen, J., Liang, Y., and Chen, X. Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation. arXiv preprint arXiv:2503.21135,

  30. [30]

    Expert Prop

    A. Details on Models and Evaluation Table 8. Details of MoE-LLMs used in evaluation. “Expert Prop.” and “Router Prop.” denote the percentage of experts and routers in the total number of parameters (“#Params”), respectively. For the “#Experts” column, we follow the convention (#Routed E...

  31. [31]

    Qwen3-30B-A3B undergoes both pre-training and post-training

    Note that, except for Qwen3-30B-A3B, all models are only pre-trained for language modeling without supervised fine-tuning (SFT). Qwen3-30B-A3B undergoes both pre-training and post-training. In addition to evaluating perplexity on general language modeling benchmarks, we evaluate different quantization methods on seven zero-shot tasks: PIQA (Bisk et al., 2...

  32. [32]

    All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024)

    benchmark to assess the mathematical reasoning ability of quantized models. All benchmark results are obtained using LM-Evaluation-Harness (v0.4.8) (Gao et al., 2024). We report acc_norm when available; otherwise, acc is reported. B. Comparison with State-of-the-Art Methods In this section, we present the full results from Tab. 1 and Tab. 2, along with additi...

  33. [33]

    EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization

    and MoEQuant (Hu et al., 2025). EAQuant primarily focuses on outlier suppression under uniform weight-activation quantization, whereas GEMQ derives a global mixed-precision strategy for weight-only quantization. Although EAQuant also considers router distribution shift, it adopts a layer-wise rigid alignment scheme that yields only marginal gains (e.g., <...

  34. [34]

    MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quantization error

    Moreover, EAQuant targets relatively high-bit regimes (≥3 bpe), whereas GEMQ focuses on more aggressive low-bit settings (≤2.5 bpe) to better address the memory footprint of expert parameters. MoEQuant constructs optimized calibration data for uniform weight-only quantization via self-sampling and extends GPTQ with affinity-guided weighting to reduce quan...

  35. [35]

    Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified across different samples

    As shown in the figures, GEMQ is relatively robust to sampling noise, as the estimated error curves largely overlap even though only 128 sequences are used for calibration, achieving an average Pearson correlation over 0.99. Importantly, the key experts (i.e., the peaks in the error-estimation curves) with large estimated errors are consistently identified a...

  36. [36]

    Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4)

    or exploring generalization objectives like sharpness-aware minimization in future work.
    Table 18. Ablation of expert bit-width candidates on Mixtral-8×7B (attention bits = 4). Bits Per Expert: 2.5.
    Bit Candidates B | Opt Obj (Eq. 7) ↓ | WT2 ↓ | C4 ↓ | 0-shot (7) ↑
    {1,2,3} | 0.0144 | 4.97 | 8.95 | 65.22
    {0,1,2,3} | 0.0139 | 5.02 | 8.91 | 64.96
    {1,2,3,4} | 0.0138 | 5.00 | 8.95 | 65.19
    {0,1,2,3,4} | 0.0131 | 5....

  37. [37]

    In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal

    As shown in the left figure, since routers contain only a small number of parameters, training converges within a single epoch in under 2 minutes. In the right figure, we observe that using more calibration samples can further reduce perplexity on the test set, but the improvement is marginal. We therefore use 128 samples in all experiments.
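
Entry 35 above summarizes calibration robustness with a Pearson correlation between importance estimates from different calibration subsets. The sketch below shows how such a check could be computed in plain Python. The two proxy vectors are made-up numbers for five hypothetical experts, not values from the paper, and the helper name `pearson` is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical importance proxies for the same experts on two calibration subsets.
subset_a = [0.10, 0.90, 0.20, 0.75, 0.05]
subset_b = [0.12, 0.88, 0.18, 0.80, 0.06]

r = pearson(subset_a, subset_b)
same_peak = subset_a.index(max(subset_a)) == subset_b.index(max(subset_b))
```

A high correlation alone does not guarantee identical allocations, so the sketch also checks that the most important expert is the same in both subsets, mirroring the entry's remark that the peaks are consistently identified.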