PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Bofei Gao; Junjie Hu; Keming Lu; Tianyu Liu; Wayne Xiong; Wen Xiao; Yichi Zhang; Yucheng Li; Yue Dong; Yuliang Liu

arxiv: 2406.02069 · v4 · submitted 2024-06-04 · 💻 cs.CL · cs.AI

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao

Pith reviewed 2026-05-12 09:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords KV cache compression · long context · attention patterns · Pyramidal Information Funneling · LLM efficiency · dynamic allocation · attention sink

The pith

LLMs funnel attention from wide scattering in lower layers to a tight focus on critical tokens in higher layers, which lets PyramidKV compress the KV cache to 12% of its full size while matching full-cache performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models process long contexts through a pyramidal funneling of attention information, scattering broadly in early layers before consolidating on key tokens later. This paper introduces PyramidKV, which dynamically sizes the KV cache larger in lower layers and smaller in upper ones to match this pattern. Experiments on LongBench show it achieves full-cache accuracy with just 12% of the cache retained. At extreme compression to 0.7%, it improves over other methods by up to 20.5 points on TREC and reaches 100% accuracy on needle-in-haystack with only 128 entries for Llama-3-70B.
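For a sense of scale, the retention figures can be turned into bytes. The sketch below is a back-of-envelope estimate and is not taken from the paper. It assumes a Llama-3-70B-like configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), 16-bit values, and a hypothetical 8,192-token prompt, and it reads "128 entries" as the number of tokens kept in every layer. All function and parameter names are illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to hold keys and values when every layer keeps num_tokens entries.

    The defaults are assumptions about a Llama-3-70B-like model, not figures
    quoted in the source.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return num_tokens * num_layers * per_token_per_layer


def retained_entries(context_len, retention_ratio):
    """Number of cache entries kept per layer at a given retention ratio."""
    return max(1, round(context_len * retention_ratio))


context_len = 8192  # hypothetical prompt length, chosen only for illustration
settings = [
    ("full cache", context_len),
    ("12% retained", retained_entries(context_len, 0.12)),
    ("0.7% retained", retained_entries(context_len, 0.007)),
    ("128 entries", 128),
]
for label, tokens in settings:
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{label:>14}: {tokens:5d} tokens per layer, {mib:8.1f} MiB")
```

Under these assumptions each cached token costs 320 KiB across the whole model, so the gap between a full cache and a 128-entry cache is the gap between gigabytes and tens of mebibytes for a single sequence.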

Core claim

Attention-based information flow in LLMs follows Pyramidal Information Funneling: attention scatters widely in lower layers, progressively consolidates, and focuses on critical tokens in higher layers. PyramidKV exploits this by dynamically adjusting KV cache sizes across layers, allocating more in lower layers and less in higher ones, rather than using uniform sizes.
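One way to picture the layer-wise allocation is a linearly decreasing budget that spends the same total as a uniform cache. The sketch below is only an illustration under that assumption: this page does not state the schedule the authors use, and `pyramid_budgets`, `avg_budget`, and `min_budget` are hypothetical names.

```python
def pyramid_budgets(num_layers, avg_budget, min_budget):
    """Linearly decreasing per-layer KV budgets (index 0 is the lowest layer).

    Assumed schedule, not the paper's: the top layer keeps min_budget entries,
    the bottom layer keeps about 2 * avg_budget - min_budget, and the total
    equals num_layers * avg_budget, the same as a uniform allocation.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0 < min_budget <= avg_budget:
        raise ValueError("need 0 < min_budget <= avg_budget")
    if num_layers == 1:
        return [avg_budget]

    max_budget = 2 * avg_budget - min_budget
    span = num_layers - 1
    # Integer arithmetic keeps both endpoints exact; // floors the interior points.
    budgets = [
        (max_budget * span - i * (max_budget - min_budget)) // span
        for i in range(num_layers)
    ]
    # Flooring can drop a few entries; hand them back to the lowest layers.
    leftover = num_layers * avg_budget - sum(budgets)
    for i in range(leftover):
        budgets[i] += 1
    return budgets


print(pyramid_budgets(4, 128, 32))  # [224, 160, 96, 32]
print([128] * 4)                    # uniform baseline with the same total of 512
```

Holding the total fixed is a deliberate choice here: it makes a comparison against a uniform baseline a comparison of shape only, not of memory spent.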

What carries the argument

Pyramidal Information Funneling: the pattern of wide attention in lower layers consolidating to critical tokens (attention sinks) in higher layers, which justifies non-uniform KV cache allocation.

Load-bearing premise

The pyramidal information funneling pattern is consistent across models, tasks, and context lengths, allowing fixed layer-wise retention ratios to work effectively without per-task retuning.

What would settle it

If attention patterns on a new model or task show no increasing focus on critical tokens in higher layers, or if uniform KV cache sizes outperform the pyramidal layer-wise allocation on LongBench or needle retrieval.
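A test of this kind needs a number for "focus". One candidate, sketched below, is the normalized entropy of each attention row averaged per layer, with a least-squares slope over depth: a clearly negative slope is consistent with funneling, a flat or positive one is not. The metric, the function names, and the synthetic input are all assumptions made for illustration. The attention rows are generated from a softmax whose temperature shrinks with depth and are not measurements from any model.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one attention row, scaled to [0, 1] by log(len(probs))."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def mean_layer_entropy(attn_rows):
    """Average normalized entropy over the attention rows of one layer."""
    return sum(normalized_entropy(row) for row in attn_rows) / len(attn_rows)


def depth_slope(values):
    """Least-squares slope of per-layer values against layer index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def softmax(scores, temperature):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Synthetic stand-in for real attention maps: sharper softmax in deeper layers.
scores = [math.sin(j) for j in range(32)]
layers = [[softmax(scores, 2.0 / (layer + 1))] for layer in range(8)]
entropies = [mean_layer_entropy(rows) for rows in layers]

print([round(e, 3) for e in entropies])
print("slope over depth:", depth_slope(entropies))
```

On a real model the rows would come from the attention weights of each layer and head, and the same slope could be compared across models, tasks, and context lengths.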

Original abstract

In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 Acc. performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PyramidKV shows how to cut KV cache to 12% with little accuracy loss by varying size per layer according to attention patterns.

The main contribution of PyramidKV is a KV cache compression scheme that allocates more space to lower layers and less to higher ones, following the pattern in which attention starts broad and narrows to key tokens. What is new is the explicit use of that pyramidal structure for dynamic sizing, rather than uniform or static budgets. The paper shows that it can match full-cache accuracy on LongBench with just 12% of the cache retained, and that at 0.7% it pulls ahead of other techniques, by up to 20.5 points on TREC. The needle-in-a-haystack results with minimal cache are also strong. Its strength is that it ties the method directly to attention observations and demonstrates practical benefits for long-context models. The soft spots are minor but real: the abstract gives no error bars, statistical details, or full specification of the retention ratios and baseline setups, which makes robustness hard to assess at first read. The consistency of the funneling pattern across models and tasks is assumed more than proven. The audience is researchers and engineers working to run long-context LLMs in less memory, who will find the technique and the numbers worth exploring. I would recommend it for peer review, to get the full experimental picture and confirm that the gains hold up under closer scrutiny.

Referee Report

3 major / 3 minor

Summary. The paper observes a 'pyramidal information funneling' pattern in LLM attention (wide scattering in lower layers, progressive consolidation, and focus on critical tokens/sinks in higher layers). Motivated by this, it introduces PyramidKV, a dynamic KV-cache compression scheme that allocates more cache entries to lower layers and fewer to higher layers. On LongBench it matches full-cache performance at 12% retention and outperforms prior compression methods by up to 20.5 points on TREC at 0.7% retention; on Needle-in-a-Haystack, 128 entries yield 100% accuracy for Llama-3-70B.

Significance. If the funneling pattern proves consistent and the layer-wise ratios generalize, PyramidKV would provide a simple, observation-driven route to substantial memory savings for long-context inference. The empirical gains on standard benchmarks are concrete and the method avoids heavy per-task tuning, which is a practical strength in the KV-compression literature.

major comments (3)

[§4.1] LongBench results: absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.
[§3.2] Method: the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.
[§4.3] Needle-in-a-Haystack: the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.

minor comments (3)

[Abstract, §2] The phrase 'massive activation or attention sink' should cite the original attention-sink literature for clarity.
[Figures] Captions: attention heat-map figures lack explicit layer indexing and token-position labels, making the funneling pattern harder to inspect.
[§4] Exact per-layer retention percentages, or the formula used to derive them, are not tabulated, impeding direct reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with plans for revisions to address the concerns raised.

Referee: [§4.1] LongBench results: absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.

Authors: We agree that the absence of error bars and multi-seed results makes it difficult to assess robustness. In the revised manuscript, we will include a discussion of this limitation and provide results from at least three random seeds for the key experiments, along with standard deviations.

Revision: yes

Referee: [§3.2] Method: the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.

Authors: The ratios are chosen to reflect the observed pyramidal funneling, with higher retention in lower layers where attention is more scattered. We will expand Section 3.2 to describe the exact procedure used to select these ratios and include additional experiments or analysis demonstrating their effectiveness on a broader range of context lengths and tasks.

Revision: yes

Referee: [§4.3] Needle-in-a-Haystack: the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.

Authors: We will revise the description in Section 4.3 to detail the needle insertion positions (randomly distributed), the number of evaluation trials, and explicitly state the full KV cache performance under the same conditions for direct comparison.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method grounded in observations and benchmarks

The paper begins with direct observations of attention patterns (pyramidal funneling) across layers in LLMs, uses these to motivate a layer-wise dynamic KV cache allocation heuristic, and validates the resulting PyramidKV method through external benchmarks (LongBench, Needle-in-a-Haystack, TREC). No equations, fitted parameters, or self-citations are presented as 'predictions' that reduce by construction to the inputs; the performance claims rest on empirical results rather than any definitional equivalence or load-bearing self-reference. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on the empirical observation of layer-wise attention consolidation and on the effectiveness of manually or heuristically chosen per-layer retention ratios.

free parameter (1)

layer-wise KV retention ratios
Specific fractions of cache kept per layer are chosen to follow the pyramidal pattern and are not derived from first principles.
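Fixing a per-layer ratio settles how many entries a layer keeps, but some rule still has to pick which ones, and this page does not say what rule the authors use. The sketch below shows one common pattern in attention-score-based eviction, offered purely as an assumption: always keep the most recent tokens, then fill the rest of the budget with the earlier tokens that have the highest importance score (for example, the attention mass they receive). `select_kv_indices`, `scores`, and `recent_window` are hypothetical names.

```python
def select_kv_indices(scores, budget, recent_window):
    """Pick which cache positions one layer keeps under a fixed budget.

    scores[i] is an importance score for token i. How it is computed is left
    open here; attention mass received is one option, assumed for illustration.
    Returns kept positions in ascending order.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))  # nothing to evict

    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))  # always keep the newest tokens
    remaining = budget - recent_window

    older = range(n - recent_window)
    top_older = sorted(older, key=lambda i: scores[i], reverse=True)[:remaining]
    return sorted(top_older) + recent


scores = [0.9, 0.1, 0.5, 0.05, 0.3, 0.2]
print(select_kv_indices(scores, budget=4, recent_window=2))  # [0, 2, 4, 5]
```

Run once per layer with that layer's budget, a rule like this turns a set of retention ratios into a concrete cache. The ratios themselves remain the free parameter.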

axioms (1)

Domain assumption: LLMs exhibit consistent pyramidal information funneling across layers
The compression policy is motivated by and depends on this observed pattern holding generally.

pith-pipeline@v0.9.0 · 5583 in / 1273 out tokens · 71799 ms · 2026-05-12T09:49:56.960440+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence · hierarchy_emergence_forces_phi · echoes

Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers."

HierarchyRealization · realized_hierarchy_forces_phi · echoes

Passage: "PyramidKV allocates more KV cache to the lower layers where information is more dispersed and each KV state contains less information, while reducing the KV cache in higher layers where information becomes concentrated in fewer key tokens."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Layer-wise Token Compression for Efficient Document Reranking
cs.IR 2026-05 conditional novelty 7.0

Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.
Layer-wise Token Compression for Efficient Document Reranking
cs.IR 2026-05 unverdicted novelty 7.0

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...
Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 7.0

HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
cs.AI 2026-05 unverdicted novelty 7.0

FibQuant is a universal fixed-rate vector quantizer for KV-cache compression that uses a radial-angular codebook matched to the spherical-Beta source after Haar rotation and strictly outperforms scalar quantization at...
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
cs.LG 2026-04 unverdicted novelty 7.0

Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
cs.LG 2026-04 unverdicted novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
Transactional Attention: Semantic Sponsorship for KV-Cache Retention
cs.CL 2026-04 unverdicted novelty 7.0

Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG 2026-04 unverdicted novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
cs.CL 2026-04 unverdicted novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
cs.DC 2026-03 unverdicted novelty 7.0

Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
cs.CL 2025-02 unverdicted novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
cs.LG 2025-02 unverdicted novelty 7.0

FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.
Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression
cs.AI 2026-05 unverdicted novelty 6.0

Meta-Soft dynamically synthesizes targeted soft tokens from a learnable orthogonal meta-library via Gumbel-Softmax selection and uses attention-flow integration to preserve semantic information during KV cache eviction.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
cs.LG 2026-05 unverdicted novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
cs.AI 2026-05 unverdicted novelty 6.0

AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
cs.CL 2026-05 unverdicted novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
cs.LG 2026-05 unverdicted novelty 6.0

Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
cs.AR 2026-05 unverdicted novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

GHOST applies geometry-hierarchical online token eviction with hierarchical scoring, privilege protection, and layer-wise budget allocation to halve KV cache size while maintaining reconstruction quality and achieving...
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
cs.LG 2026-05 unverdicted novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
cs.LG 2026-05 unverdicted novelty 6.0

Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
cs.LG 2026-05 conditional novelty 6.0

KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.
ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
cs.LG 2026-05 unverdicted novelty 6.0

ProxyKV offloads KV cache importance scoring to a lightweight intra-family small-model proxy with HybridAxialMapper and ranking-focused loss, matching KVZip accuracy while achieving up to 3.21x prefilling speedup on m...
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
cs.CL 2026-05 conditional novelty 6.0

ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
cs.MA 2026-05 unverdicted novelty 6.0

Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
cs.AI 2026-05 unverdicted novelty 6.0

SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
cs.AR 2026-04 unverdicted novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
cs.CL 2026-04 unverdicted novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
Graph-Guided Adaptive Channel Elimination for KV Cache Compression
eess.SP 2026-04 unverdicted novelty 6.0

GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
cs.LG 2026-04 unverdicted novelty 6.0

RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
cs.LG 2026-04 unverdicted novelty 6.0

eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
LightThinker++: From Reasoning Compression to Memory Management
cs.CL 2026-04 unverdicted novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
cs.LG 2026-04 unverdicted novelty 6.0

Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
cs.LG 2026-02 unverdicted novelty 6.0

CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
cs.CL 2026-01 unverdicted novelty 6.0

HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x fas...
CacheClip: Accelerating RAG with Effective KV Cache Reuse
cs.LG 2025-10 unverdicted novelty 6.0

CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.
EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments
cs.CL 2025-09 unverdicted novelty 6.0

EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
cs.LG 2025-05 conditional novelty 6.0

RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
cs.LG 2025-04 unverdicted novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a fac...
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation
cs.CL 2024-10 unverdicted novelty 6.0

LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong resul...
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
cs.LG 2024-09 conditional novelty 6.0

RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-...
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
cs.CL 2024-07 accept novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
cs.CV 2026-05 unverdicted novelty 5.0

FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting atte...
Minimal-Intervention KV Retention via Set-Conditioned Diversity
cs.LG 2026-05 unverdicted novelty 5.0

A one-function modification to the TriAttention retention scorer using greedy selection under a V-space redundancy penalty outperforms seven matched mechanisms on long-form math reasoning at budgets 64 and 128.
Minimal-Intervention KV Retention via Set-Conditioned Diversity
cs.LG 2026-05 conditional novelty 5.0

A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory he...
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
cs.LG 2026-05 unverdicted novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI
cs.AI 2026-04 unverdicted novelty 5.0

LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 62 Pith papers · 10 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508,

work page internal anchor Pith review arXiv
[3]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024a. Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fin...

work page arXiv
[4]

Magicpig: Lsh sampling for efficient llm generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. Magicpig: Lsh sampling for efficient llm generation, 2024b. URL https://arxiv.org/abs/2410.16179. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yo...

work page arXiv
[5]

A dataset of information-seeking questions and answers anchored in research papers

URL https://lmsys.org/blog/2023-03-30-vicuna/ . Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

work page 2023
[6]

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753,

work page internal anchor Pith review arXiv
[7]

Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398,

work page arXiv
[8]

Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy, July 2...

work page doi:10.18653/v1/p19-1102
[9]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Preprint. Under review. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801,

work page internal anchor Pith review arXiv
[10]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70,

work page 2019
[11]

Longcoder: A long-range pre-trained language model for code completion

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. arXiv preprint arXiv:2306.14893,

work page arXiv
[12]

Lm-infinite: Simple on-the-fly length generalization for large language models

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137,

work page arXiv
[13]

Efficient attentions for long document summarization

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436,

work page 2021
[14]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490,

work page arXiv
[16]

Learning question classifiers

URL https://arxiv.org/abs/2406.19707. Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics,

work page arXiv 2002
[17]

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469,

work page internal anchor Pith review arXiv
[18]

Code Llama: Open Foundation Models for Code

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023a. Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023b. Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Soot...

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762,

work page internal anchor Pith review arXiv
[20]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

work page internal anchor Pith review Pith/arXiv arXiv 2013
[21]

Label words are anchors: An information flow perspective for understanding in-context learning

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160,

work page arXiv
[22]

Retrieval head mechanistically explains long-context factuality

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574,

work page arXiv
[23]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference

Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. arXiv preprint arXiv:2405.12532,

work page arXiv
[25]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380,

work page 2018
[26]

Qmsum: A new benchmark for query-based multi-domain meeting summarization

Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, page...

work page 2021
[27]

Pose: Efficient context window extension of llms via positional skip-wise training

Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Pose: Efficient context window extension of llms via positional skip-wise training. arXiv preprint arXiv:2309.10400,

work page arXiv
[28]

attention sink

[Figure 6: Attention patterns of retrieval-augmented generation across layers in the Mixtral-8x7B-Instruct Mixture-of-Experts model. Panels: Attention Weights Heatmap, Layer 24 and Layer 30. Annotated patterns: Localized Attention, Attention Sink, Massive Attention.] C Related Work. Interpretation of LLMs. Prior research has shown that attention matr...

work page 2023
[29]

Figure 5 and Figure 6 demonstrate that the Pyramidal Information Funneling phenomenon is also evident in both the Mistral model and the Mixtral model

for Mistral-7B-Instruct model and Mixtral-8x7B-Instruct Mixture-of-Experts model. Figure 5 and Figure 6 demonstrate that the Pyramidal Information Funneling phenomenon is also evident in both the Mistral model and the Mixtral model. The results reveal that, akin to Llama-like models, Mistral exhibits a progressively narrowing attention focus across layers. ...

work page 2024
[30]

While Lee et al

and a single upper layer (layer 18). While Lee et al. (2024) noted that attention becomes more skewed in upper layers, it did not provide a fine-grained observation of attention patterns across all layers. In contrast, our study reveals several novel findings: • Localized Attention: We observe that attention progressively narrows its focus, targeting spec...

work page 2024
[31]

We run all the experiments on NVIDIA A100. Dataset Source Avg len Metric Language #data Single-Document QA NarrativeQA Literature, Film 18,409 F1 English 200 Qasper Science 3,619 F1 English 200 MultiFieldQA-en Multi-field 4,559 F1 English 150 Multi-Document QA HotpotQA Wikipedia 9,151 F1 English 200 2WikiMultihopQA Wikipedia 4,887 F1 English 200 MuSiQue W...

work page 2023
[32]

α Single-Document QA Multi-Document QA Summarization Few-shot Learning Synthetic Code Avg

leads to better performance than a larger alpha value (i.e., 24, 32, 40, 48). Table columns: α | Single-Document QA (NrtvQA, Qasper, MF-en) | Multi-Document QA (HotpotQA, 2WikiMQA, Musique) | Summarization (GovReport, QMSum, MultiNews) | Few-shot Learning (TREC, TriviaQA, SAMSum) | Synthetic (PCount, PRe) | Code (Lcc, RB-P) | Avg. Row α = 8: 21.40 16.92 31.62 38.45 28.72 18.59 19.96 22.49 20.96 66.50 89.35 38.43 5.92 69.00 57.86...

work page 1924
[33]

β Single-Document QA Multi-Document QA Summarization Few-shot Learning Synthetic Code Avg

The results in Table 6 show that using a relatively small value of β yields better outcomes, and PyramidKV is generally robust to the selection of β. Table columns: β | Single-Document QA (NrtvQA, Qasper, MF-en) | Multi-Document QA (HotpotQA, 2WikiMQA, Musique) | Summarization (GovReport, QMSum, MultiNews) | Few-shot Learning (TREC, TriviaQA, SAMSum) | Synthetic (PCount, PRe) | Code (Lcc, RB-P) | Avg. Row β = 20: 21.40 16.92 33.7...

work page 2024
[34]

The results demonstrated the superior performance of PyramidKV. Furthermore, we demonstrate that MInference and PyramidKV can be seamlessly integrated to achieve highly efficient inference while maintaining performance comparable to full attention. The results of MInference combined with PyramidKV, evaluated on LongBench with a KV cache size of 128, as ...

work page 2024
[35]

[Prompt length, Generation length]

Each row shows the setting of using a specific “[Prompt length, Generation length]” combination. We show the inference speed comparison between total inference time, time for allocation strategy and time for score-based selection on LlaMa-3-8B-Instruct. Each cell is the latency measured in seconds. Furthermore, our budget allocation can be calculated befo...

work page 2048
[36]

[Prompt length, Generation length]

Each row shows the setting of using a specific “[Prompt length, Generation length]” combination. Each cell is the latency measured in seconds. PyramidKV does not sacrifice speed. It provides performance improvements and memory savings while running at a speed comparable to the baselines (i.e., SnapKV (Li et al., 2024), StreamingLLM (Xiao et al.,

work page 2024
[37]

This is because the allocation strategy adds very little complexity to the inference/generation phase relative to the computation required for generation, as shown in Appendix L

and H2O (Zhang et al., 2024)). This is because the allocation strategy adds very little complexity to the inference/generation phase relative to the computation required for generation, as shown in Appendix L. N PyramidKV Excels in all KV Cache Size Limitation. The evaluation results from LongBench (Bai et al.,

work page 2024
[38]

Attention Recall Rate Experiment

for different KV cache sizes. Overall, PyramidKV consistently surpasses other methods across a range of KV cache sizes and different backbone models, with its performance advantages becoming particularly pronounced in memory-constrained environments. Upon examining specific tasks, PyramidKV demonstrates a notably superior performance on the TREC task, a ...

work page arXiv 2048
[39]

However, with a larger budget (i.e., 2k KV Cache Size), the improvement decreases

The results show that with a small budget, PyramidKV improves the attention recall rate (the percentage of attention computed using the keys retrieved by the method and the query, relative to the attention computed using all keys and the query). However, with a larger budget (i.e., 2k KV Cache Size), the improvement decreases. For 64, 128, 256, 512, 1024...

work page 2048
[40]

massive attention

Our findings indicate the absence of "massive attention" in any individual head. Figure 18: Attention patterns of retrieval-augmented generation across heads in the bottom layer in LlaMa. R PyramidKV Implementation at vLLM To help compare the vLLM implementation with the vanilla dense attention backend in terms of throughput, we perform the experiment. We...

work page 2000
