Title resolution pending

Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang · 2025 · arXiv 2505.19578

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

cs.LG · 2025-02-03 · unverdicted · novelty 7.0

FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

cs.CL · 2026-03-24 · unverdicted · novelty 6.0

EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.

citing papers explorer

Showing 3 of 3 citing papers.

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
cs.LG · 2025-02-03 · unverdicted · none · ref 26

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
cs.CL · 2026-03-24 · unverdicted · none · ref 15

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
cs.LG · 2026-04-08 · unverdicted · none · ref 34
