LongGenBench: Benchmarking Long-Form Generation in Long Context
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
  RedKnot decomposes the KV cache by attention heads to enable position-independent reuse, prefix compression, hot/cold separation, and distributed placement for long-context LLM serving without model changes.
- VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
  VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
- IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
  IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
- Language models fail at extended rule following
  LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.