FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
3 Pith papers cite this work.
Citing papers
- FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
- EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.