DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 3representative citing papers
CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.
citing papers explorer
-
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
-
CoRoVA: Compressed Representations for Vector-Augmented Code Completion
CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
-
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.