Squeezed Attention: Accelerating Long Context Length LLM Inference

Squeezed Attention: Accelerating Long Context Length LLM Inference · 2024 · arXiv 2411.09688

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

cs.AR · 2026-04-20 · unverdicted · novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

OmniMem achieves 2-4% higher accuracy than training-free baselines on long video benchmarks for audio-visual LLMs by using modality-aware KV cache allocation and perturbation-aware state selection, with further gains from budget-aware fine-tuning.

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

cs.CL · 2025-05-09 · unverdicted · novelty 5.0

STARC remaps sparse KV caches by semantic clustering for PIM hardware, delivering 19-31% lower attention latency and 19-27% lower energy versus token-wise sparsity, with larger gains under tight KV budgets.

citing papers explorer

Showing 3 of 3 citing papers.

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
cs.AR · 2026-04-20 · unverdicted · none · ref 26

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs
cs.AI · 2026-05-26 · unverdicted · none · ref 14

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
cs.CL · 2025-05-09 · unverdicted · none · ref 41
