Value-state gated at- tention for mitigating extreme-token phenomena in trans- formers.arXiv preprint arXiv:2510.09017

Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li · 2025 · arXiv 2510.09017

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

cs.LG · 2026-03-18 · unverdicted · novelty 6.0

Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.

citing papers explorer

Showing 4 of 4 citing papers.

A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models cs.CL · 2026-05-08 · conditional · none · ref 3 · 2 links
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 44
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers cs.LG · 2026-03-18 · unverdicted · none · ref 6
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus cs.AI · 2026-04-18 · unverdicted · none · ref 27
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.

Value-state gated at- tention for mitigating extreme-token phenomena in trans- formers.arXiv preprint arXiv:2510.09017

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer