Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.
Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time.arXiv preprint arXiv:2305.17118
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adaptive-oblivious error separation.
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
Meta-Soft dynamically synthesizes targeted soft tokens from a learnable meta-library using Gumbel-Softmax and applies attention-flow integration to compress KV cache while attempting to preserve evicted context information.
Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
citing papers explorer
-
Token-Operations-Oriented Inference Optimization Techniques for Large Models
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.