A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
citing papers explorer
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
  The paper identifies an encoding mismatch in ViT feature distillation from per-image compressibility versus dataset subspace rotations and broad spectral energy patterns, proposing Lift and WideLast remedies that improve DeiT-Tiny accuracy from 74.86% to 77.53-78.23% on ImageNet-1K.
- Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
  PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.