MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Libo Sun; Peixiong He; Po-Wei Harn; Xiao Qin

arxiv: 2604.17695 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.CL

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Libo Sun , Peixiong He , Po-Wei Harn , Xiao Qin This is my paper

Pith reviewed 2026-05-10 05:09 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords KV cache compressionmixture of experts routinglong-context LLM inferenceper-layer quantizationtoken evictionmemory-efficient attention

0 comments

The pith

Per-layer mixture-of-experts routing lets each KV-cache layer pick its own eviction ratio and bit precision, matching uncompressed accuracy at 14x compression where uniform methods lose ground.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that uniform compression recipes across layers leave accuracy on the table because layers differ sharply in how they tolerate token eviction versus quantization. MoE-nD therefore assigns every layer its own (eviction ratio, K-bits, V-bits) tuple chosen by an offline greedy solver that minimizes predicted quality loss under a fixed global memory budget. At inference the chosen per-layer policies are executed together through one attention patch. On LongBench long-context tasks the resulting heterogeneous schedule reaches the full 1.9 GB baseline score at 136 MB (14x), while all tested 1-D, 2-D-uniform, and 2-D baselines at equal or lower memory stay well below that mark; the same pattern appears on AIME reasoning (+6 to +27 points).

Core claim

MoE-nD shows that routing each layer independently to a compression configuration (eviction ratio plus K- and V-bit widths) under a shared memory constraint can preserve the accuracy of an uncompressed KV cache at 14x compression on long-context and reasoning benchmarks, whereas any homogeneous per-layer or global policy at comparable memory incurs large quality drops.

What carries the argument

Offline-calibrated greedy solver that selects a per-layer (eviction-ratio, K-bits, V-bits) tuple minimizing predicted quality loss, then applies the heterogeneous choices jointly through a single attention patch at inference time.

If this is right

Heterogeneous per-layer choices of eviction and quantization outperform any single global compression recipe at high compression ratios.
The optimal balance between token eviction and bit-width reduction varies substantially across layers within the same model.
On short inputs the solver often selects full retention, showing that per-layer eviction routing adds value mainly for long-context workloads.
The approach works through an existing attention kernel with no change to the forward pass beyond the chosen per-layer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Layer-wise routing could be extended to additional compression axes such as low-rank head projections or cross-layer sharing if a joint predictor is trained.
The null results on MATH-500 and TREC illustrate a clear regime (short sequences) where heterogeneous eviction offers little headroom.
If the quality predictor can be updated cheaply online, the same framework might support input-dependent routing without retraining the solver.

Load-bearing premise

The quality-loss predictor, calibrated offline on the same model family, ranks compression options correctly for new inputs and that layer-wise sensitivity patterns stay stable enough to justify choosing different configurations per layer.

What would settle it

Measure whether the same routing decisions still recover full baseline accuracy when the predictor is applied to a different model family or to inputs whose length or task distribution differs markedly from the calibration set.

Figures

Figures reproduced from arXiv: 2604.17695 by Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin.

**Figure 1.** Figure 1: LongBench-v1 accuracy vs. compressed KV memory on DeepSeek-R1-Distill-Qwen-7B. 4-task subset (NarrativeQA, HotpotQA, TREC, PassageRetrieval-en), n = 50 per task, 16ktoken inputs (middle truncation), adapted reasoning-model protocol (max gen ×16, scoring after final </think>); see §5. MoEnD’s hetero variant (2dhetero) matches the uncompressed full-cache baseline at its leftmost operating point (136 MB, 1… view at source ↗

**Figure 2.** Figure 2: Per-layer sensitivity landscape. (A) Full 28×9 heatmap of predicted attention-output L2 error when applying each compression operation to a single layer, measured offline against the uncompressed reference (log-scale color; only headline cells are annotated to keep the panel readable). The top strip shows per-operation maxℓ/minℓ ratio, directly summarizing the heterogeneity numbers in [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 3.** Figure 3: MoE-nD pipeline. (A) Offline, per-layer sensitivity to each compression operation is measured against an uncompressed reference; the heatmap shows the actual calibration table for DeepSeek-R1-Distill-Qwen-7B. (B) A greedy budget solver picks the per-layer (keep, kbits, vbits) tuple that greedily reduces predicted total sensitivity (attention-output L2 error) subject to a global memory budget M. (C) At infe… view at source ↗

read the original abstract

KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single attention patch. On a 4-task subset of LongBench-v1 (16k inputs, n=50 per task, adapted reasoning-model protocol; see section Experiments), MoE-nD's hetero variant matches our uncompressed 1.9~GB baseline at 14x compression (136~MB) while every other compressed baseline we tested (1d, 2d_uniform, 2d) at comparable or smaller memory stays under 8/100. The gains hold on AIME reasoning benchmarks (+6 to +27 pts over the strongest per-layer-quantization baseline across eight configurations). Two null results -- MATH-500 and LongBench's TREC -- share a principled cause (short inputs, solver picks keep=1.0 on most layers), cleanly characterizing when per-layer eviction routing has headroom to help.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Per-layer heterogeneous routing for KV cache compression shows promise but hinges on unverified offline predictor accuracy.

read the letter

The key point is that this paper makes a case for heterogeneous per-layer compression in KV caches using a mixture-of-experts style router over eviction and quantization choices. If the numbers hold, it offers a practical way to get high compression ratios without the accuracy drop seen in uniform approaches. What stands out as new is the joint per-layer selection of eviction ratio along with separate bit widths for keys and values, all under a single memory constraint. The offline greedy solver picks the mix that minimizes predicted loss. The work does a good job documenting that uniform methods leave accuracy on the table and explaining the null results on short-input tasks as cases where the solver correctly defaults to no eviction. The soft spots are around the empirical support. The quality-loss predictor is trained on held-out data from the same model family, but the paper gives no direct accuracy numbers for how well it ranks different configurations on unseen long-context inputs. Without that, the claim that hetero routing matches the 1.9 GB baseline at 136 MB while others lag is hard to evaluate fully. There are also no error bars on the benchmark scores and limited ablation of the predictor component itself. This is aimed at the community working on efficient LLM inference, particularly those optimizing KV cache for longer contexts on constrained hardware. Readers who care about multi-axis compression tradeoffs will find the framing useful even if they adapt the solver. I would send it for peer review. The central observation about layer differences is worth checking, and the method is simple enough that referees can focus on validating the gains and the predictor's reliability.

Referee Report

3 major / 2 minor

Summary. The paper proposes MoE-nD, a per-layer mixture-of-experts routing framework for multi-axis KV cache compression. An offline-calibrated greedy solver selects heterogeneous (eviction-ratio, K-bit, V-bit) tuples per layer to minimize predicted quality loss under a global memory budget; at inference, these are applied jointly via a single attention patch. On a 4-task LongBench-v1 subset the hetero variant matches the 1.9 GB uncompressed baseline at 14× compression (136 MB) while uniform baselines remain below 8/100, with +6 to +27 pt gains on AIME reasoning tasks; null results on MATH-500 and TREC are attributed to short inputs where the solver defaults to keep=1.0.

Significance. If the results hold, the work demonstrates that exploiting stable layer-wise differences in compression sensitivity can yield substantial accuracy-memory trade-offs beyond uniform per-layer or global recipes. The concrete 14× compression result that matches the uncompressed baseline, together with the clean characterization of when per-layer eviction helps, would be a useful contribution to efficient long-context inference.

major comments (3)

[Experiments section] Experiments section: the headline claims (14× compression matching the 1.9 GB baseline and +6–27 pt AIME gains) are presented without error bars, standard deviations, or any indication of the number of runs or random seeds, so the statistical reliability of the reported differences cannot be assessed.
[Method / solver description] Method / solver description: no direct measurement is given of the quality-loss predictor’s ranking accuracy or absolute error on inputs drawn from a distribution disjoint from the calibration set (especially long-context reasoning traces). Because the heterogeneous routing decisions are produced entirely by this offline predictor, the absence of such a test is load-bearing for the central claim that per-layer heterogeneity outperforms uniform baselines.
[Experiments section] Experiments section: the paper provides neither an ablation isolating the predictor’s contribution nor full implementation details or hyper-parameter settings for the 1d, 2d_uniform, and 2d baselines, preventing independent verification of the reported superiority.

minor comments (2)

[Experiments section] The phrase “adapted reasoning-model protocol” is used in the abstract and Experiments section without a concise definition or pointer to the exact prompt template and evaluation script.
Notation for the per-layer tuples (eviction ratio, K-bits, V-bits) could be introduced once in a single equation or table to avoid repeated prose descriptions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments point by point below, indicating the changes we will make in the revised version.

read point-by-point responses

Referee: [Experiments section] Experiments section: the headline claims (14× compression matching the 1.9 GB baseline and +6–27 pt AIME gains) are presented without error bars, standard deviations, or any indication of the number of runs or random seeds, so the statistical reliability of the reported differences cannot be assessed.

Authors: We appreciate this observation. The reported results were obtained from single runs with a fixed seed for each configuration, as the greedy solver is deterministic and the inference procedure does not introduce randomness. The sample size is n=50 per task as stated in the Experiments section. To enhance statistical reliability, we will revise the manuscript to explicitly state the number of runs and seeds used, and we will include error bars by conducting additional runs with varied seeds where feasible, particularly for the AIME gains. revision: partial
Referee: [Method / solver description] Method / solver description: no direct measurement is given of the quality-loss predictor’s ranking accuracy or absolute error on inputs drawn from a distribution disjoint from the calibration set (especially long-context reasoning traces). Because the heterogeneous routing decisions are produced entirely by this offline predictor, the absence of such a test is load-bearing for the central claim that per-layer heterogeneity outperforms uniform baselines.

Authors: We agree that validating the predictor on disjoint distributions is crucial. While the calibration set encompasses a variety of tasks from LongBench and reasoning benchmarks, we did not provide explicit ranking accuracy or error metrics on held-out long-context reasoning traces. In the revision, we will add a dedicated evaluation of the quality-loss predictor, including its ranking performance and absolute error on inputs from distributions disjoint from the calibration set, to substantiate the effectiveness of the per-layer routing. revision: yes
Referee: [Experiments section] Experiments section: the paper provides neither an ablation isolating the predictor’s contribution nor full implementation details or hyper-parameter settings for the 1d, 2d_uniform, and 2d baselines, preventing independent verification of the reported superiority.

Authors: We acknowledge the need for greater reproducibility. We will include a new ablation study that isolates the contribution of the quality-loss predictor by comparing against variants with uniform or random routing decisions. Additionally, we will provide complete implementation details, including all hyper-parameter settings, for the 1d, 2d_uniform, and 2d baselines in the revised Experiments section to enable independent verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claim rests on independent benchmark evaluation after offline calibration.

full rationale

The paper derives its routing decisions from an offline-calibrated greedy solver that minimizes a quality-loss predictor's output under a memory budget. This selection is then applied at inference and measured on external benchmarks (LongBench-v1 subset, AIME, MATH-500, TREC). No equation or step equates the reported accuracy gains to the calibration data by construction, nor does any load-bearing premise reduce to a self-citation or renamed ansatz. Layer-wise sensitivity differences are presented as measured observations rather than tautological definitions. The method remains self-contained against held-out evaluation distributions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The method adds a new routing layer on top of standard transformer attention and empirical calibration; it does not introduce new physical entities or ungrounded constants.

free parameters (1)

per-layer eviction ratios and bit widths
Discrete choices selected by the greedy solver to meet the global memory budget; values are not fixed a priori but discovered per model and budget.

axioms (1)

domain assumption Layer-wise differences in compression sensitivity are stable and predictable from offline calibration data.
Invoked to justify why heterogeneous routing outperforms uniform policies.

invented entities (1)

MoE-nD per-layer routing framework no independent evidence
purpose: To map each transformer layer to a distinct (eviction, K-quant, V-quant) configuration
New algorithmic construct proposed by the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5615 in / 1447 out tokens · 57672 ms · 2026-05-10T05:09:18.502989+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

GQA: Training generalized multi-query transformer models from multi-head check- points

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y ., Lebr´on, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head check- points. InConference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[2]

LongBench: A bilingual, multitask benchmark for long context understanding

Bai, Y ., Lv, X., Zhang, J., et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[3]

Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention, 2024

work page 2024
[4]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2024

Xiong, W., Dong, Y ., Hu, J., and Xiao, W. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2024

work page 2024
[5]

R-KV: Redundancy-aware KV cache compression for reasoning models

Cai, Z., Xiao, W., Sun, H., et al. R-KV: Redundancy-aware KV cache compression for reasoning models. InAd- vances in Neural Information Processing Systems, 2025

work page 2025
[6]

Y ., Ermon, S., Rudra, A., and R´e, C

Dao, T., Fu, D. Y ., Ermon, S., Rudra, A., and R´e, C. FlashAt- tention: Fast and memory-efficient exact attention with 7 MoE-nD: Per-Layer Routing for Multi-Axis KV Compression IO-awareness. InAdvances in Neural Information Pro- cessing Systems, 2022. DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning, 2025

work page 2022
[7]

QAQ: Quality adaptive quantization for LLM KV cache, 2024

Dong, S., Cheng, W., Qin, J., and Wang, W. QAQ: Quality adaptive quantization for LLM KV cache, 2024

work page 2024
[8]

Feng, Y ., Lv, J., Cao, Y ., Xie, X., and Zhou, S. K. Ada- KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. InAdvances in Neural Information Processing Systems, 2025

work page 2025
[9]

Measuring mathematical problem solving with the MATH dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Track on Datasets and Benchmarks, 2021

work page 2021
[10]

W., Shao, Y

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y . S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM infer- ence with KV cache quantization. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[11]

J., and Yuan, M

Li, X., Xing, Z., Li, Y ., Qu, L., Zhen, H.-L., Liu, W., Yao, Y ., Pan, S. J., and Yuan, M. KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference. InInterna- tional Conference on Machine Learning, 2025

work page 2025
[12]

SnapKV: LLM knows what you are looking for before generation

Li, Y ., Huang, Y ., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[13]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V ., Xu, Z., Kyril- lidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[14]

KIVI: A tuning-free asym- metric 2-bit quantization for KV cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V ., Chen, B., and Hu, X. KIVI: A tuning-free asym- metric 2-bit quantization for KV cache. InInternational Conference on Machine Learning, 2024

work page 2024
[15]

Triattention: Effi- cient long reasoning with trigonometric KV compression, 2026

Mao, W., Lin, X., Huang, W., et al. Triattention: Effi- cient long reasoning with trigonometric KV compression, 2026

work page 2026
[16]

MiniKV: Pushing the limits of LLM inference via 2-bit layer-discriminative KV cache, 2024

Sharma, A., Ding, H., Li, J., Dani, N., and Zhang, M. MiniKV: Pushing the limits of LLM inference via 2-bit layer-discriminative KV cache, 2024

work page 2024
[17]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017

work page 2017
[18]

RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[19]

V ., and Zhou, D

Xia, F., Chi, E., Le, Q. V ., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pp. 24824–24837, 2022

work page 2022
[20]

Effi- cient streaming language models with attention sinks

Xiao, G., Tian, Y ., Chen, B., Han, S., and Lewis, M. Effi- cient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

work page 2024
[21]

J., et al

Yuan, Z., Shang, Y ., Zhou, Y ., Dong, Z., Zhou, Z., Xue, C., Wu, B., Li, Z., Gu, Q., Lee, Y . J., et al. ASVD: Activation-aware singular value decomposition for com- pressing large language models, 2024

work page 2024
[22]

Solve the equation x2 −5x+ 6 = 0 . Think step by step and show all work

Song, Z., Tian, Y ., R´e, C., Barrett, C., et al. H2O: Heavy- hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems, volume 36, 2023. 8 MoE-nD: Per-Layer Routing for Multi-Axis KV Compression A. Absolute Scores and Instruction-Tuned Models Our 11.5 LongBench average for the uncompres...

work page 2023

[1] [1]

GQA: Training generalized multi-query transformer models from multi-head check- points

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y ., Lebr´on, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head check- points. InConference on Empirical Methods in Natural Language Processing, 2023

work page 2023

[2] [2]

LongBench: A bilingual, multitask benchmark for long context understanding

Bai, Y ., Lv, X., Zhang, J., et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024

[3] [3]

Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention, 2024

work page 2024

[4] [4]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2024

Xiong, W., Dong, Y ., Hu, J., and Xiao, W. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2024

work page 2024

[5] [5]

R-KV: Redundancy-aware KV cache compression for reasoning models

Cai, Z., Xiao, W., Sun, H., et al. R-KV: Redundancy-aware KV cache compression for reasoning models. InAd- vances in Neural Information Processing Systems, 2025

work page 2025

[6] [6]

Y ., Ermon, S., Rudra, A., and R´e, C

Dao, T., Fu, D. Y ., Ermon, S., Rudra, A., and R´e, C. FlashAt- tention: Fast and memory-efficient exact attention with 7 MoE-nD: Per-Layer Routing for Multi-Axis KV Compression IO-awareness. InAdvances in Neural Information Pro- cessing Systems, 2022. DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning, 2025

work page 2022

[7] [7]

QAQ: Quality adaptive quantization for LLM KV cache, 2024

Dong, S., Cheng, W., Qin, J., and Wang, W. QAQ: Quality adaptive quantization for LLM KV cache, 2024

work page 2024

[8] [8]

Feng, Y ., Lv, J., Cao, Y ., Xie, X., and Zhou, S. K. Ada- KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[9] [9]

Measuring mathematical problem solving with the MATH dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Track on Datasets and Benchmarks, 2021

work page 2021

[10] [10]

W., Shao, Y

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y . S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM infer- ence with KV cache quantization. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[11] [11]

J., and Yuan, M

Li, X., Xing, Z., Li, Y ., Qu, L., Zhen, H.-L., Liu, W., Yao, Y ., Pan, S. J., and Yuan, M. KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference. InInterna- tional Conference on Machine Learning, 2025

work page 2025

[12] [12]

SnapKV: LLM knows what you are looking for before generation

Li, Y ., Huang, Y ., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[13] [13]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V ., Xu, Z., Kyril- lidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023

[14] [14]

KIVI: A tuning-free asym- metric 2-bit quantization for KV cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V ., Chen, B., and Hu, X. KIVI: A tuning-free asym- metric 2-bit quantization for KV cache. InInternational Conference on Machine Learning, 2024

work page 2024

[15] [15]

Triattention: Effi- cient long reasoning with trigonometric KV compression, 2026

Mao, W., Lin, X., Huang, W., et al. Triattention: Effi- cient long reasoning with trigonometric KV compression, 2026

work page 2026

[16] [16]

MiniKV: Pushing the limits of LLM inference via 2-bit layer-discriminative KV cache, 2024

Sharma, A., Ding, H., Li, J., Dani, N., and Zhang, M. MiniKV: Pushing the limits of LLM inference via 2-bit layer-discriminative KV cache, 2024

work page 2024

[17] [17]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017

work page 2017

[18] [18]

RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024

[19] [19]

V ., and Zhou, D

Xia, F., Chi, E., Le, Q. V ., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pp. 24824–24837, 2022

work page 2022

[20] [20]

Effi- cient streaming language models with attention sinks

Xiao, G., Tian, Y ., Chen, B., Han, S., and Lewis, M. Effi- cient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

work page 2024

[21] [21]

J., et al

Yuan, Z., Shang, Y ., Zhou, Y ., Dong, Z., Zhou, Z., Xue, C., Wu, B., Li, Z., Gu, Q., Lee, Y . J., et al. ASVD: Activation-aware singular value decomposition for com- pressing large language models, 2024

work page 2024

[22] [22]

Solve the equation x2 −5x+ 6 = 0 . Think step by step and show all work

Song, Z., Tian, Y ., R´e, C., Barrett, C., et al. H2O: Heavy- hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems, volume 36, 2023. 8 MoE-nD: Per-Layer Routing for Multi-Axis KV Compression A. Absolute Scores and Instruction-Tuned Models Our 11.5 LongBench average for the uncompres...

work page 2023