
arxiv: 2604.17695 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.CL

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Pith reviewed 2026-05-10 05:09 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords KV cache compression · mixture of experts routing · long-context LLM inference · per-layer quantization · token eviction · memory-efficient attention

The pith

Per-layer mixture-of-experts routing lets each KV-cache layer pick its own eviction ratio and bit precision, matching uncompressed accuracy at 14x compression where uniform methods lose ground.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that uniform compression recipes across layers leave accuracy on the table because layers differ sharply in how they tolerate token eviction versus quantization. MoE-nD therefore assigns every layer its own (eviction ratio, K-bits, V-bits) tuple chosen by an offline greedy solver that minimizes predicted quality loss under a fixed global memory budget. At inference the chosen per-layer policies are executed together through one attention patch. On LongBench long-context tasks the resulting heterogeneous schedule reaches the full 1.9 GB baseline score at 136 MB (14x), while all tested 1-D, 2-D-uniform, and 2-D baselines at equal or lower memory stay well below that mark; the same pattern appears on AIME reasoning (+6 to +27 points).

Core claim

MoE-nD shows that routing each layer independently to a compression configuration (eviction ratio plus K- and V-bit widths) under a shared memory constraint can preserve the accuracy of an uncompressed KV cache at 14x compression on long-context and reasoning benchmarks, whereas any homogeneous per-layer or global policy at comparable memory incurs large quality drops.
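The memory side of this claim can be made concrete with a little arithmetic. The sketch below is not from the paper: it assumes a layer's KV footprint scales with the kept tokens times the summed K and V bit widths, and it ignores quantization metadata (scales, zero-points) and eviction bookkeeping. The function names and the toy shape (4 layers, 1024 tokens, 2 KV heads, head dimension 64) are hypothetical; only the 1.9 GB and 136 MB figures come from the text above.

```python
def layer_kv_bytes(tokens, kv_heads, head_dim, keep=1.0, k_bits=16, v_bits=16):
    """Approximate KV bytes for one layer under a (keep, k_bits, v_bits) policy.

    Assumption: quantization scales/zero-points and eviction bookkeeping are
    ignored, so a real footprint would be somewhat larger.
    """
    kept_tokens = int(tokens * keep)
    elems = kept_tokens * kv_heads * head_dim  # element count of K (V has the same)
    return elems * (k_bits + v_bits) / 8


def total_kv_bytes(policies, tokens, kv_heads, head_dim):
    """Sum the per-layer footprints for a list of (keep, k_bits, v_bits) tuples."""
    return sum(layer_kv_bytes(tokens, kv_heads, head_dim, *p) for p in policies)


# Hypothetical toy shape, chosen only to keep the numbers small.
layers, tokens, kv_heads, head_dim = 4, 1024, 2, 64

full = total_kv_bytes([(1.0, 16, 16)] * layers, tokens, kv_heads, head_dim)
mixed = total_kv_bytes(
    [(1.0, 8, 8), (0.5, 4, 4), (0.25, 4, 2), (0.5, 2, 2)],
    tokens, kv_heads, head_dim,
)
print(f"toy compression ratio: {full / mixed:.2f}x")

# Sanity check on the headline figures quoted above (1.9 GB vs. 136 MB).
reported_ratio = 1900 / 136
print(f"reported ratio: {reported_ratio:.1f}x")
```

Under this simple model, eviction and bit-width reduction multiply: halving the kept tokens and halving both bit widths in one layer cuts that layer's footprint to a quarter.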

What carries the argument

Offline-calibrated greedy solver that selects a per-layer (eviction-ratio, K-bits, V-bits) tuple minimizing predicted quality loss, then applies the heterogeneous choices jointly through a single attention patch at inference time.
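The page does not spell out the solver's exact procedure, so the following is only one plausible shape for it, not the authors' implementation. It assumes each layer comes with a small table of candidate options, each carrying a memory cost and a predicted loss from offline calibration. It starts every layer at its cheapest option and then repeatedly takes the upgrade with the best predicted-loss reduction per extra byte that still fits the budget. The name `greedy_route` and all numbers are hypothetical.

```python
def greedy_route(options, budget):
    """Pick one option per layer under a global memory budget.

    options[l] is a list of (cost_bytes, predicted_loss, config) for layer l.
    Hypothetical greedy rule: start from the cheapest option everywhere, then
    repeatedly apply the upgrade with the largest loss reduction per extra byte.
    """
    choice = [min(range(len(opts)), key=lambda i: opts[i][0]) for opts in options]
    used = sum(options[l][i][0] for l, i in enumerate(choice))
    if used > budget:
        raise ValueError("budget is below the cheapest possible routing")

    while True:
        best = None  # (gain_per_byte, layer, option_index, extra_bytes)
        for l, opts in enumerate(options):
            cur_cost, cur_loss, _ = opts[choice[l]]
            for i, (cost, loss, _) in enumerate(opts):
                extra, gain = cost - cur_cost, cur_loss - loss
                if extra <= 0 or gain <= 0 or used + extra > budget:
                    continue  # not an upgrade, or it would exceed the budget
                if best is None or gain / extra > best[0]:
                    best = (gain / extra, l, i, extra)
        if best is None:
            break  # no affordable upgrade improves the predicted loss
        _, l, i, extra = best
        choice[l] = i
        used += extra
    return [options[l][i][2] for l, i in enumerate(choice)], used


# Toy calibration table: layer 0 is sensitive to compression, layer 1 is robust.
# Each config is a (keep, k_bits, v_bits) tuple; costs and losses are made up.
options = [
    [(10, 9.0, (0.25, 2, 2)), (20, 3.0, (0.5, 4, 4)), (40, 0.0, (1.0, 8, 8))],
    [(10, 1.0, (0.25, 2, 2)), (20, 0.5, (0.5, 4, 4)), (40, 0.0, (1.0, 8, 8))],
]
configs, used = greedy_route(options, budget=50)
print(configs, used)
```

In this toy table the greedy result spends the whole budget protecting the sensitive layer and compresses the robust one hard, for a predicted loss of 1.0. The uniform choice of `(0.5, 4, 4)` in both layers costs less memory (40) but has a predicted loss of 3.5, which is the kind of gap the heterogeneous routing is meant to exploit.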

If this is right

  • Heterogeneous per-layer choices of eviction and quantization outperform any single global compression recipe at high compression ratios.
  • The optimal balance between token eviction and bit-width reduction varies substantially across layers within the same model.
  • On short inputs the solver often selects full retention, showing that per-layer eviction routing adds value mainly for long-context workloads.
  • The approach works through an existing attention kernel with no change to the forward pass beyond the chosen per-layer parameters.
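To illustrate what executing a per-layer policy could involve, here is a minimal pure-Python sketch of applying one (keep, K-bits, V-bits) tuple to a single layer's cache. It is an assumption-laden toy rather than the paper's attention patch: the eviction criterion (keep the highest-scoring tokens), the per-token uniform asymmetric quantizer, and the names `fake_quantize` and `apply_layer_policy` are all hypothetical.

```python
def fake_quantize(vec, bits):
    """Uniform asymmetric quantize-then-dequantize of one vector (bits >= 1)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1          # number of steps between lo and hi
    if hi == lo:
        return list(vec)              # constant vector: nothing to round
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]


def apply_layer_policy(keys, values, scores, keep, k_bits, v_bits):
    """Evict low-score tokens, then quantize the surviving K and V vectors.

    Assumption: `scores` is some per-token importance signal; the source page
    does not say which eviction criterion the paper uses.
    """
    k = max(1, round(len(keys) * keep))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)                # preserve the original token order
    new_keys = [fake_quantize(keys[i], k_bits) for i in kept]
    new_values = [fake_quantize(values[i], v_bits) for i in kept]
    return kept, new_keys, new_values


# Four tokens with a head dimension of 4; all numbers are illustrative.
keys = [[0.0, 0.3, 0.6, 1.0], [0.2, 0.1, 0.9, 0.4],
        [0.5, 0.5, 0.7, 0.1], [1.0, 0.8, 0.2, 0.0]]
values = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6],
          [0.0, 1.0, 0.0, 1.0], [0.4, 0.3, 0.2, 0.1]]
scores = [0.9, 0.1, 0.4, 0.8]

kept, new_keys, new_values = apply_layer_policy(
    keys, values, scores, keep=0.5, k_bits=2, v_bits=4
)
print(kept, new_keys)
```

A heterogeneous schedule would call `apply_layer_policy` with a different tuple per layer; the attention computation itself is unchanged, only its inputs are.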

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Layer-wise routing could be extended to additional compression axes such as low-rank head projections or cross-layer sharing if a joint predictor is trained.
  • The null results on MATH-500 and TREC illustrate a clear regime (short sequences) where heterogeneous eviction offers little headroom.
  • If the quality predictor can be updated cheaply online, the same framework might support input-dependent routing without retraining the solver.

Load-bearing premise

The argument rests on two assumptions: that the quality-loss predictor, calibrated offline on the same model family, ranks compression options correctly for new inputs, and that layer-wise sensitivity patterns stay stable enough to justify choosing different configurations per layer.

What would settle it

Measure whether the same routing decisions still recover full baseline accuracy when the predictor is applied to a different model family or to inputs whose length or task distribution differs markedly from the calibration set.
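One simple way to quantify the "ranks compression options correctly" part of this test is a rank correlation between predicted and measured loss on held-out inputs. The sketch below computes Kendall's tau-a in plain Python. The metric choice and the numbers are our assumptions for illustration, not something the paper reports.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1           # ties (s == 0) count for neither side
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative values only: predicted loss per compression option from the
# offline calibration vs. loss measured on held-out inputs.
predicted = [0.1, 0.4, 0.2, 0.9]
measured = [1.0, 3.0, 3.5, 8.0]
tau = kendall_tau(predicted, measured)
print(f"tau = {tau:.3f}")
```

A tau near 1 on a different model family or input distribution would support the premise; a tau near 0 would mean the solver is effectively routing on noise.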

Figures

Figures reproduced from arXiv: 2604.17695 by Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin.

Figure 1: LongBench-v1 accuracy vs. compressed KV memory on DeepSeek-R1-Distill-Qwen-7B. 4-task subset (NarrativeQA, HotpotQA, TREC, PassageRetrieval-en), n = 50 per task, 16k-token inputs (middle truncation), adapted reasoning-model protocol (max gen ×16, scoring after final </think>); see §5. MoE-nD's hetero variant (2dhetero) matches the uncompressed full-cache baseline at its leftmost operating point (136 MB, 1…
Figure 2: Per-layer sensitivity landscape. (A) Full 28×9 heatmap of predicted attention-output L2 error when applying each compression operation to a single layer, measured offline against the uncompressed reference (log-scale color; only headline cells are annotated to keep the panel readable). The top strip shows per-operation max_ℓ/min_ℓ ratio, directly summarizing the heterogeneity numbers in …
Figure 3: MoE-nD pipeline. (A) Offline, per-layer sensitivity to each compression operation is measured against an uncompressed reference; the heatmap shows the actual calibration table for DeepSeek-R1-Distill-Qwen-7B. (B) A greedy budget solver picks the per-layer (keep, kbits, vbits) tuple that greedily reduces predicted total sensitivity (attention-output L2 error) subject to a global memory budget M. (C) At infe…
Original abstract

KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single attention patch. On a 4-task subset of LongBench-v1 (16k inputs, n=50 per task, adapted reasoning-model protocol; see the Experiments section), MoE-nD's hetero variant matches our uncompressed 1.9 GB baseline at 14x compression (136 MB) while every other compressed baseline we tested (1d, 2d_uniform, 2d) at comparable or smaller memory stays under 8/100. The gains hold on AIME reasoning benchmarks (+6 to +27 pts over the strongest per-layer-quantization baseline across eight configurations). Two null results -- MATH-500 and LongBench's TREC -- share a principled cause (short inputs, solver picks keep=1.0 on most layers), cleanly characterizing when per-layer eviction routing has headroom to help.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MoE-nD, a per-layer mixture-of-experts routing framework for multi-axis KV cache compression. An offline-calibrated greedy solver selects heterogeneous (eviction-ratio, K-bit, V-bit) tuples per layer to minimize predicted quality loss under a global memory budget; at inference, these are applied jointly via a single attention patch. On a 4-task LongBench-v1 subset the hetero variant matches the 1.9 GB uncompressed baseline at 14× compression (136 MB) while uniform baselines remain below 8/100, with +6 to +27 pt gains on AIME reasoning tasks; null results on MATH-500 and TREC are attributed to short inputs where the solver defaults to keep=1.0.

Significance. If the results hold, the work demonstrates that exploiting stable layer-wise differences in compression sensitivity can yield substantial accuracy-memory trade-offs beyond uniform per-layer or global recipes. The concrete 14× compression result that matches the uncompressed baseline, together with the clean characterization of when per-layer eviction helps, would be a useful contribution to efficient long-context inference.

major comments (3)
  1. [Experiments section] The headline claims (14× compression matching the 1.9 GB baseline and +6–27 pt AIME gains) are presented without error bars, standard deviations, or any indication of the number of runs or random seeds, so the statistical reliability of the reported differences cannot be assessed.
  2. [Method / solver description] No direct measurement is given of the quality-loss predictor’s ranking accuracy or absolute error on inputs drawn from a distribution disjoint from the calibration set (especially long-context reasoning traces). Because the heterogeneous routing decisions are produced entirely by this offline predictor, the absence of such a test is load-bearing for the central claim that per-layer heterogeneity outperforms uniform baselines.
  3. [Experiments section] The paper provides neither an ablation isolating the predictor’s contribution nor full implementation details or hyper-parameter settings for the 1d, 2d_uniform, and 2d baselines, preventing independent verification of the reported superiority.
minor comments (2)
  1. [Experiments section] The phrase “adapted reasoning-model protocol” is used in the abstract and Experiments section without a concise definition or pointer to the exact prompt template and evaluation script.
  2. Notation for the per-layer tuples (eviction ratio, K-bits, V-bits) could be introduced once in a single equation or table to avoid repeated prose descriptions.
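Major comment 1 asks for error bars. With n = 50 examples per task, even a single deterministic run can be given a sampling-uncertainty estimate by resampling the per-example scores. The sketch below shows a percentile bootstrap over hypothetical 0/1 scores. The data and the helper name `bootstrap_ci` are illustrative and not taken from the paper.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)         # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-example correctness for one task (n = 50, 30 correct).
scores = [1] * 30 + [0] * 20
mean = sum(scores) / len(scores)
lo, hi = bootstrap_ci(scores)
print(f"accuracy {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

At this sample size the interval is wide, roughly plus or minus 14 points for an accuracy near 0.6 under a normal approximation, so differences of a few points between configurations would not be distinguishable without more examples or runs.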

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments point by point below, indicating the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [Experiments section] The headline claims (14× compression matching the 1.9 GB baseline and +6–27 pt AIME gains) are presented without error bars, standard deviations, or any indication of the number of runs or random seeds, so the statistical reliability of the reported differences cannot be assessed.

    Authors: We appreciate this observation. The reported results were obtained from single runs with a fixed seed for each configuration, as the greedy solver is deterministic and the inference procedure does not introduce randomness. The sample size is n=50 per task as stated in the Experiments section. To enhance statistical reliability, we will revise the manuscript to explicitly state the number of runs and seeds used, and we will include error bars by conducting additional runs with varied seeds where feasible, particularly for the AIME gains.
    Revision: partial

  2. Referee: [Method / solver description] No direct measurement is given of the quality-loss predictor’s ranking accuracy or absolute error on inputs drawn from a distribution disjoint from the calibration set (especially long-context reasoning traces). Because the heterogeneous routing decisions are produced entirely by this offline predictor, the absence of such a test is load-bearing for the central claim that per-layer heterogeneity outperforms uniform baselines.

    Authors: We agree that validating the predictor on disjoint distributions is crucial. While the calibration set encompasses a variety of tasks from LongBench and reasoning benchmarks, we did not provide explicit ranking accuracy or error metrics on held-out long-context reasoning traces. In the revision, we will add a dedicated evaluation of the quality-loss predictor, including its ranking performance and absolute error on inputs from distributions disjoint from the calibration set, to substantiate the effectiveness of the per-layer routing.
    Revision: yes

  3. Referee: [Experiments section] The paper provides neither an ablation isolating the predictor’s contribution nor full implementation details or hyper-parameter settings for the 1d, 2d_uniform, and 2d baselines, preventing independent verification of the reported superiority.

    Authors: We acknowledge the need for greater reproducibility. We will include a new ablation study that isolates the contribution of the quality-loss predictor by comparing against variants with uniform or random routing decisions. Additionally, we will provide complete implementation details, including all hyper-parameter settings, for the 1d, 2d_uniform, and 2d baselines in the revised Experiments section to enable independent verification.
    Revision: yes
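The random-routing ablation promised in response 3 needs a baseline that respects the same memory budget as the solver. A minimal sketch of such a sampler follows, using rejection sampling over a hypothetical per-layer option table. The table format, the function name `random_feasible_routing`, and all numbers are assumptions for illustration.

```python
import random


def random_feasible_routing(options, budget, rng, max_tries=10_000):
    """Sample one option index per layer uniformly, rejecting over-budget draws.

    options[l] is a list of (cost_bytes, predicted_loss) pairs for layer l.
    """
    for _ in range(max_tries):
        pick = [rng.randrange(len(opts)) for opts in options]
        cost = sum(options[l][i][0] for l, i in enumerate(pick))
        if cost <= budget:
            return pick, cost
    raise RuntimeError("no feasible routing found within max_tries")


# Toy table: three candidate options for each of two layers (made-up values).
options = [
    [(10, 9.0), (20, 3.0), (40, 0.0)],
    [(10, 1.0), (20, 0.5), (40, 0.0)],
]
rng = random.Random(0)
samples = [random_feasible_routing(options, budget=50, rng=rng) for _ in range(20)]
losses = [sum(options[l][i][1] for l, i in enumerate(pick)) for pick, _ in samples]
print(f"random-routing predicted loss: min {min(losses)}, max {max(losses)}")
```

Comparing the solver's routing against the spread of such random draws, on measured task accuracy rather than predicted loss, would show how much of the gain comes from the predictor and how much from heterogeneity alone.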

Circularity Check

0 steps flagged

No significant circularity; central claim rests on independent benchmark evaluation after offline calibration.

full rationale

The paper derives its routing decisions from an offline-calibrated greedy solver that minimizes a quality-loss predictor's output under a memory budget. This selection is then applied at inference and measured on external benchmarks (LongBench-v1 subset, AIME, MATH-500, TREC). No equation or step equates the reported accuracy gains to the calibration data by construction, nor does any load-bearing premise reduce to a self-citation or renamed ansatz. Layer-wise sensitivity differences are presented as measured observations rather than tautological definitions. The method remains self-contained against held-out evaluation distributions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method adds a new routing layer on top of standard transformer attention and empirical calibration; it does not introduce new physical entities or ungrounded constants.

free parameters (1)
  • per-layer eviction ratios and bit widths
    Discrete choices selected by the greedy solver to meet the global memory budget; values are not fixed a priori but discovered per model and budget.
axioms (1)
  • domain assumption: Layer-wise differences in compression sensitivity are stable and predictable from offline calibration data.
    Invoked to justify why heterogeneous routing outperforms uniform policies.
invented entities (1)
  • MoE-nD per-layer routing framework (no independent evidence)
    purpose: To map each transformer layer to a distinct (eviction, K-quant, V-quant) configuration
    New algorithmic construct proposed by the paper; no independent evidence outside the reported experiments.


