LongBench: A bilingual, multitask benchmark for long context understanding
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
representative citing papers
A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory held-out evaluation.
citing papers explorer
- MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
Per-layer mixture-of-experts routing selects heterogeneous eviction-quantization tuples for KV cache compression, matching uncompressed accuracy at 14x reduction on LongBench subsets where uniform baselines degrade.
- Minimal-Intervention KV Retention via Set-Conditioned Diversity
A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory held-out evaluation.