KV cache eviction is unified under an information-capacity maximization principle derived from a linear-Gaussian attention surrogate; CapKV is proposed as a leverage-score-based implementation that outperforms prior heuristics in experiments.
4 Pith papers cite this work.
representative citing papers
Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
Learning-augmented LRU achieves 1-consistency and O(k)-robustness for GPU caching with low overhead, implemented in LCR to cut P99 TTFT by up to 28.3% on LLM workloads and raise throughput by up to 24.2% on DLRM workloads.