hub Canonical reference

arXiv preprint arXiv:2412.19442 (2024) 22 Benjamin Probst, Andreas Happe, and Jürgen Cito

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen · 2024 · arXiv 2412.19442

Canonical reference. 100% of citing Pith papers cite this work as background.

20 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 20 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

ProxyKV offloads KV cache importance scoring to a lightweight intra-family small-model proxy with HybridAxialMapper and ranking-focused loss, matching KVZip accuracy while achieving up to 3.21x prefilling speedup on models up to 32B.

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

cs.CL · 2026-05-09 · conditional · novelty 6.0

ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61x at 128k context.

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache retention on LongBench.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.

GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

cs.DB · 2026-04-09 · unverdicted · novelty 6.0

GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

cs.CL · 2024-10-17 · unverdicted · novelty 6.0

LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

cs.DC · 2026-05-19 · unverdicted · novelty 5.0

Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.

Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

cs.AR · 2026-04-19 · unverdicted · novelty 5.0

A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

cs.LG · 2026-02-11 · conditional · novelty 5.0

SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

The Pitfalls of KV Cache Compression

cs.LG · 2025-09-30 · conditional · novelty 5.0

KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple policy changes can mitigate this.

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

cs.DC · 2026-04-24 · unverdicted · novelty 3.0

A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

Comparative Characterization of KV Cache Management Strategies for LLM Inference

cs.AR · 2026-04-06 · unverdicted · novelty 3.0

Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

cs.CL · 2026-04-23

citing papers explorer

Showing 20 of 20 citing papers.

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference cs.CR · 2026-05-22 · unverdicted · none · ref 22
CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective cs.LG · 2026-04-28 · unverdicted · none · ref 13
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression cs.CL · 2026-04-06 · unverdicted · none · ref 11
TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG · 2026-05-19 · unverdicted · none · ref 28
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference cs.AR · 2026-05-17 · unverdicted · none · ref 26
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 69
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference cs.LG · 2026-05-09 · unverdicted · none · ref 10
ProxyKV offloads KV cache importance scoring to a lightweight intra-family small-model proxy with HybridAxialMapper and ranking-focused loss, matching KVZip accuracy while achieving up to 3.21x prefilling speedup on models up to 32B.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing cs.CL · 2026-05-09 · conditional · none · ref 15
ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61x at 128k context.
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache cs.LG · 2026-05-08 · unverdicted · none · ref 8
RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache retention on LongBench.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache cs.LG · 2026-05-07 · unverdicted · none · ref 29
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.
GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization cs.DB · 2026-04-09 · unverdicted · none · ref 43
GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation cs.CL · 2024-10-17 · unverdicted · none · ref 22
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles cs.DC · 2026-05-19 · unverdicted · none · ref 25
Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.
Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents cs.CR · 2026-04-29 · unverdicted · none · ref 15
Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference cs.AR · 2026-04-19 · unverdicted · none · ref 36
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG · 2026-02-11 · conditional · none · ref 18
SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.
The Pitfalls of KV Cache Compression cs.LG · 2025-09-30 · conditional · none · ref 4
KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple policy changes can mitigate this.
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities cs.DC · 2026-04-24 · unverdicted · none · ref 93
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
Comparative Characterization of KV Cache Management Strategies for LLM Inference cs.AR · 2026-04-06 · unverdicted · none · ref 3
Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.
Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall? cs.CL · 2026-04-23 · unreviewed · ref 12

arXiv preprint arXiv:2412.19442 (2024) 22 Benjamin Probst, Andreas Happe, and Jürgen Cito

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer