Analyzing the Structure of Attention in a Transformer Language Model

Jesse Vig; Yonatan Belinkov

arxiv: 1906.04284 · v2 · pith:5ZTLN3WAnew · submitted 2019-06-07 · cs.CL · cs.LG · stat.ML

classification cs.CL · cs.LG · stat.ML

keywords attention · model · transformer · analyze · different · find · language · layers

Abstract

The Transformer is a fully attention-based alternative to recurrent networks that has achieved state-of-the-art results across a range of NLP tasks. In this paper, we analyze the structure of attention in a Transformer language model, the GPT-2 small pretrained model. We visualize attention for individual instances and analyze the interaction between attention and syntax over a large corpus. We find that attention targets different parts of speech at different layer depths within the model, and that attention aligns with dependency relations most strongly in the middle layers. We also find that the deepest layers of the model capture the most distant relationships. Finally, we extract exemplar sentences that reveal highly specific patterns targeted by particular attention heads.
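
The finding that deeper layers capture more distant relationships can be made concrete with a simple per-head statistic: the attention-weighted mean distance between each token and the tokens it attends to. The following is a minimal sketch in plain Python, not the authors' code. The function name `mean_attention_distance` and the toy attention matrices are hypothetical, and the paper's exact metric may be defined differently.

```python
from typing import List


def mean_attention_distance(attn: List[List[float]]) -> float:
    """Attention-weighted mean distance |i - j| for a single head.

    attn[i][j] is the weight that token i places on token j.
    In a causal language model such as GPT-2, weights with j > i are zero.
    """
    total_weight = 0.0
    weighted_distance = 0.0
    for i, row in enumerate(attn):
        for j, w in enumerate(row):
            total_weight += w
            weighted_distance += w * abs(i - j)
    # Guard against an all-zero matrix.
    return weighted_distance / total_weight if total_weight else 0.0


# Toy head that attends to the previous token (position 0 attends to itself).
local_head = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

# Toy head in which every token attends to the first token.
first_token_head = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]

local_score = mean_attention_distance(local_head)
first_token_score = mean_attention_distance(first_token_head)
```

Under this sketch, a head that looks back further receives a higher score, so averaging the statistic over the heads in each layer would give one rough way to compare layers by attention span.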

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
cs.CR 2026-05 unverdicted novelty 7.0

CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus pri...
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
Why Retrieval-Augmented Generation Fails: A Graph Perspective
cs.CL 2026-05 unverdicted novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG 2026-05 unverdicted novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR 2026-04 unverdicted novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
CacheClip: Accelerating RAG with Effective KV Cache Reuse
cs.LG 2025-10 unverdicted novelty 6.0

CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.
AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
cs.CV 2025-09 unverdicted novelty 6.0

AIM-CoT enhances interleaved multimodal chain-of-thought reasoning by adding context-enhanced attention generation, active visual probing via information foraging, and dynamic attention-shift triggering.
The Generalization Ridge: Information Flow in Natural Language Generation
cs.CL 2025-07 unverdicted novelty 6.0

InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.
Steered Generation via Gradient-Based Optimization on Sparse Query Features
cs.LG 2026-05 unverdicted novelty 5.0

Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
cs.CL 2026-04 unverdicted novelty 5.0

H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
cs.LG 2026-04 unverdicted novelty 5.0

BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.