arXiv preprint arXiv:2508.02546 (2025)

Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri · 2025 · arXiv 2508.02546

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

cs.CV · 2026-04-01 · unverdicted · novelty 7.0

Attention sinks in LVLMs create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.

A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.

ASAP: Attention Sink Anchored Pruning

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.

Retrieval and competition: how a protein foundation model starts a protein

q-bio.BM · 2026-05-05 · unverdicted · novelty 5.0

ESM2-8M predicts N-terminal methionine via retrieval from a positional prior at the beginning-of-sequence token through distributed attention circuits rather than direct biological detection.

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

cs.CV · 2026-04-11 · unverdicted · novelty 5.0 · 2 refs

SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.

citing papers explorer

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · none · ref 17 · 2 links

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

cs.CV · 2026-04-01 · unverdicted · none · ref 32

Attention sinks in LVLMs create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.

A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation

cs.LG · 2026-04-16 · unverdicted · none · ref 4

Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

cs.LG · 2026-04-03 · unverdicted · none · ref 3

LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.

ASAP: Attention Sink Anchored Pruning

cs.LG · 2026-05-21 · unverdicted · none · ref 16

ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.

Retrieval and competition: how a protein foundation model starts a protein

q-bio.BM · 2026-05-05 · unverdicted · none · ref 16

ESM2-8M predicts N-terminal methionine via retrieval from a positional prior at the beginning-of-sequence token through distributed attention circuits rather than direct biological detection.

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

cs.CV · 2026-04-11 · unverdicted · none · ref 13 · 2 links

SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
