Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?

Anne Schuth; Joris Baan; Maarten de Rijke; Maartje ter Hoeve; Marlies van der Wees

arxiv: 1907.00570 · v2 · pith:WNRIE3ZMnew · submitted 2019-07-01 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Joris Baan, Maartje ter Hoeve, Marlies van der Wees, Anne Schuth, Maarten de Rijke

Pith reviewed 2026-05-25 12:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG

keywords transformer · attention heads · abstractive summarization · model transparency · interpretability · attention distributions · NLP

The pith

Transformer attention heads specialize on distinct input in summarization but the model may not rely on those distributions for its outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether attention distributions in transformer models can serve as a window into how abstractive summaries are produced. It finds that individual heads do focus on particular syntactic and semantic features of the source text. The authors introduce a way to measure how much the overall model depends on those specific learned patterns rather than other mechanisms. This matters for NLP because attention maps are widely treated as explanations, yet the work questions whether they actually reveal the decision process in summarization. The analysis concludes by discussing what limited reliance would mean for transparency claims.

Core claim

The paper shows that some attention heads specialize towards syntactically and semantically distinct input. It proposes an approach to evaluate the extent to which the Transformer model relies on specifically learned attention distributions, and it discusses what this implies for using attention distributions as a means of transparency.
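One way to make "specialization" concrete is to count how often a head's largest attention weight lands on source tokens of a given category. The following is a minimal sketch under stated assumptions, not the paper's actual metric: the function name `specialization_scores`, the tag labels, and the toy attention matrix are all hypothetical.

```python
from collections import Counter

def specialization_scores(attn, tags):
    """Fraction of target steps whose max-weight source token carries each tag.

    attn: list of rows, one per target step; each row is a distribution
          over source tokens (hypothetical toy input, not real model output).
    tags: one tag per source token, e.g. from a POS or NER tagger.
    """
    counts = Counter()
    for row in attn:
        # Index of the source token receiving the largest weight at this step.
        top = max(range(len(row)), key=row.__getitem__)
        counts[tags[top]] += 1
    total = len(attn)
    return {tag: n / total for tag, n in counts.items()}

# Toy example: four source tokens, four target steps, one head.
tags = ["O", "LOC", "O", "PER"]
head = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.6, 0.2, 0.1, 0.1],
]
scores = specialization_scores(head, tags)
print(scores)
```

A score is only informative relative to how frequent the tag is in the source: here "LOC" covers one of four tokens but attracts the maximum weight in half of the steps.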

What carries the argument

The attention distributions produced by different heads within the multi-head self-attention layers of the transformer when processing input for summary generation.
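For readers who want to see what a per-head attention distribution is, here is a small sketch of scaled dot-product attention weights computed separately for each head. It assumes the query and key vectors have already been projected and simply slices them into equal parts per head; a trained Transformer applies learned projection matrices first. All function names and inputs are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(queries, keys):
    """Scaled dot-product attention weights for a single head."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def split_heads(vec, n_heads):
    """Slice one vector into n_heads equal chunks (assumes divisibility)."""
    size = len(vec) // n_heads
    return [vec[h * size:(h + 1) * size] for h in range(n_heads)]

def multi_head_distributions(queries, keys, n_heads):
    """Return one attention matrix (target x source) per head."""
    q_heads = [split_heads(q, n_heads) for q in queries]
    k_heads = [split_heads(k, n_heads) for k in keys]
    return [
        head_attention([q[h] for q in q_heads], [k[h] for k in k_heads])
        for h in range(n_heads)
    ]

# Toy input: 2 target positions, 3 source positions, model dim 4, 2 heads.
queries = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
dists = multi_head_distributions(queries, keys, n_heads=2)
for h, matrix in enumerate(dists):
    print("head", h, matrix)
```

Each row of each matrix sums to one, which is why these weights are often read as "where the head is looking". Whether that reading is faithful to the model's behavior is the question the paper raises.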

If this is right

If the model does not rely on the specialized distributions, then attention maps cannot be assumed to explain why particular summary words were chosen.
The evaluation method can be applied to other sequence generation tasks to test similar transparency claims.
Performance may remain high even when attention patterns are altered, indicating that other components drive the output.
Transparency efforts in summarization would need mechanisms beyond inspecting attention weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention could function more as a side effect of training than as the causal pathway for summary decisions.
Similar specialization without reliance might appear in other encoder-decoder tasks such as machine translation.
Practitioners should test reliance before treating attention visualizations as faithful explanations in deployed systems.

Load-bearing premise

That observed specialization among heads together with measurements of the model's reliance on those distributions can be taken as direct evidence about whether attention provides meaningful transparency into the model's decision process.

What would settle it

Replace the learned attention distributions of the specialized heads with uniform random distributions and measure whether summary quality and content remain essentially unchanged.
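The replacement test can be sketched on a single head as follows. This is a toy illustration under assumptions: it compares the head's context vector under the learned weights with the context vector under a uniform or a randomly sampled distribution, whereas a real experiment would regenerate summaries and compare a quality measure such as ROUGE. The helper names and numbers are hypothetical.

```python
import random

def context(weights, values):
    """Attention output: the weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def uniform_like(weights):
    """Uniform distribution over the same number of source tokens."""
    n = len(weights)
    return [1.0 / n] * n

def random_like(weights, rng):
    """Randomly sampled distribution over the same number of source tokens."""
    raw = [rng.random() for _ in weights]
    total = sum(raw)
    return [r / total for r in raw]

def max_abs_diff(a, b):
    """Largest coordinate-wise change between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical learned weights and value vectors for one target step.
learned = [0.8, 0.1, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
rng = random.Random(0)  # fixed seed so the random variant is repeatable

base = context(learned, values)
uniform_shift = max_abs_diff(base, context(uniform_like(learned), values))
random_weights = random_like(learned, rng)
random_shift = max_abs_diff(base, context(random_weights, values))
print("uniform shift:", uniform_shift)
print("random shift:", random_shift)
```

A large shift at this level does not by itself show that the final summary changes, because later layers may compensate. That is why the test above is phrased in terms of summary quality and content.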

Figures

Figures reproduced from arXiv: 1907.00570 by Anne Schuth, Joris Baan, Maarten de Rijke, Maartje ter Hoeve, Marlies van der Wees.

**Figure 1.** Attention head focusing on locations.

Excerpt from the paper's adjacent text (truncated): "and (3) our input sequences (news articles) are significantly longer than the short sentences used in previous work. 3 EXPERIMENTAL SETUP. We adopt OpenNMT’s implementation [10] of the CopyGenerator Transformer [6]. Both encoder and decoder have four layers with eight heads. We use scaled dot attention, Gehrmann et al. [6]’s new summary specific coverage function, Wu et…"

**Figure 2.** Attention head that seemed to focus on named en…

**Figure 3.** Ratio of the max attention weight being assigned…

**Figure 4.** A comparison of the top 3 specialized heads.

**Figure 6.** Specialized head focusing on the location Antarc…

**Figure 7.** Specialized NE head with a low NEP. This is in…

Original abstract

Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for algorithms to be transparent is growing. In NLP tasks, attention distributions learned by attention-based deep learning models are used to gain insights in the models' behavior. To which extent is this perspective valid for all NLP tasks? We investigate whether distributions calculated by different attention heads in a transformer architecture can be used to improve transparency in the task of abstractive summarization. To this end, we present both a qualitative and quantitative analysis to investigate the behavior of the attention heads. We show that some attention heads indeed specialize towards syntactically and semantically distinct input. We propose an approach to evaluate to which extent the Transformer model relies on specifically learned attention distributions. We also discuss what this implies for using attention distributions as a means of transparency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Some heads specialize on syntax and semantics in summarization, but the work does not test whether the model actually conditions its outputs on those distributions.

The paper reports that certain attention heads in a transformer do focus on distinct syntactic and semantic parts of the input when doing abstractive summarization. It also outlines a method to measure how much the model depends on those particular attention patterns rather than other internal signals. The qualitative examples and quantitative breakdowns of head behavior are the concrete parts that stand out.

This is a direct check on one generation task and adds a practical evaluation step on top of earlier attention studies. The specialization result looks reproducible from the description and gives practitioners something they can inspect in their own models.

The main gap is the missing causal link. Specialization is shown, but there is no ablation, masking, or distribution swap that demonstrates the generated summary actually changes when those heads are altered. Without that, the transparency discussion stays at correlation and does not answer whether attention distributions reveal the model's decision process. The abstract itself flags the broader validity question, yet the experiments stay observational.

This is useful reading for people already working on interpretability of transformers in summarization or similar generation tasks. It is not a breakthrough but the empirical angle is honest enough that a referee could usefully tighten the claims and check the proposed evaluation method. I would send it to review.

Referee Report

1 major / 0 minor

Summary. The paper investigates whether attention distributions in Transformer models can serve as a source of transparency for abstractive summarization. It reports qualitative and quantitative analyses indicating that certain attention heads specialize toward syntactically and semantically distinct inputs, proposes an evaluation approach to measure the model's reliance on these specific distributions, and discusses the implications for using attention as an interpretability tool in NLP.

Significance. If the empirical findings and proposed evaluation hold after addressing causal questions, the work would add to the literature on attention interpretability by documenting head specialization in summarization and offering a method to test reliance, potentially tempering claims that attention visualizations reliably explain model decisions.

major comments (1)

[Abstract] The central claim requires evidence that the model conditions its generated summaries on the specialized attention distributions rather than on other internal representations. The abstract describes specialization and an evaluation approach, but the skeptic's concern is valid: without intervention experiments (attention masking, head ablation, or distribution replacement) that isolate the effect on output tokens while holding other factors fixed, the results remain correlational and do not establish reliance or address the transparency question posed in the abstract.
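Of the interventions the comment lists, head ablation is the simplest to illustrate. The sketch below is a toy under assumptions, not the paper's procedure: it zeroes one head's output at a time before the output projection and records how much the projected vector moves. The function names, head outputs, and projection weights are hypothetical.

```python
def concat_heads(head_outputs, head_mask):
    """Concatenate per-head output vectors, zeroing heads whose mask is 0."""
    out = []
    for vec, keep in zip(head_outputs, head_mask):
        out.extend(x * keep for x in vec)
    return out

def project(vec, weight):
    """Apply an output projection given as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def ablation_effects(head_outputs, weight):
    """Max absolute change in the projected output when each head is removed."""
    n = len(head_outputs)
    full = project(concat_heads(head_outputs, [1] * n), weight)
    effects = []
    for h in range(n):
        mask = [0 if i == h else 1 for i in range(n)]
        ablated = project(concat_heads(head_outputs, mask), weight)
        effects.append(max(abs(a - b) for a, b in zip(full, ablated)))
    return effects

# Three toy heads with two-dimensional outputs and a 2 x 6 projection.
head_outputs = [[1.0, 2.0], [0.5, 0.0], [0.0, 0.0]]
weight = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
effects = ablation_effects(head_outputs, weight)
print(effects)
```

In a full experiment the quantity of interest would be the change in generated tokens or summary quality with all other factors held fixed, not the change in one layer's output.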

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment correctly identifies a distinction between correlational evidence and causal demonstration of reliance, which we address below.

Referee: [Abstract] The central claim requires evidence that the model conditions its generated summaries on the specialized attention distributions rather than on other internal representations. The abstract describes specialization and an evaluation approach, but the skeptic's concern is valid: without intervention experiments (attention masking, head ablation, or distribution replacement) that isolate the effect on output tokens while holding other factors fixed, the results remain correlational and do not establish reliance or address the transparency question posed in the abstract.

Authors: We agree that the analyses presented are correlational and that intervention experiments would be required to establish that the model conditions its outputs on the specialized attention distributions. The proposed evaluation approach measures reliance by comparing model behavior under the observed attention distributions versus alternatives, but does not include masking, ablation, or replacement. We will revise the abstract to describe the contributions more precisely as documenting head specialization and proposing a correlational method for assessing reliance, and we will update the discussion to explicitly note the absence of causal interventions and the resulting limitations for claims about transparency. These changes will be incorporated in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

Empirical investigation with no derivation chain or fitted predictions

The paper presents a qualitative and quantitative empirical analysis of attention head specialization in a Transformer for abstractive summarization, along with a proposed evaluation approach. No equations, first-principles derivations, or predictions are claimed that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is framed as an investigation into observed patterns and their implications for transparency, without any renaming of known results or circular fitting. The central claims rest on direct observation of attention distributions rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; no information available on modeling assumptions or data handling.

pith-pipeline@v0.9.0 · 5689 in / 956 out tokens · 36806 ms · 2026-05-25T12:15:23.319992+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We show that some attention heads indeed specialize towards syntactically and semantically distinct input. We propose an approach to evaluate to which extent the Transformer model relies on specifically learned attention distributions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 14 internal anchors

[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638–1649.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473 (2014).

[3] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. 2016. Retain: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. In Advances in Neural Information Processing Systems. 3504–3512.

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).

[5] Finale Doshi-Velez and Been Kim. 2017. Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017).

[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up Abstractive Summarization. arXiv preprint arXiv:1808.10792 (2018).

[7] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning. arXiv preprint arXiv:1806.00069 (2018).

[8] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems. 1693–1701.

[9] Sarthak Jain and Byron C Wallace. 2019. Attention is not Explanation. arXiv preprint arXiv:1902.10186 (2019).

[10] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. arXiv preprint arXiv:1701.02810 (2017).

[11] Tao Lei. 2017. Interpretable Neural Models for Natural Language Processing. Ph.D. Dissertation. Massachusetts Institute of Technology.

[12] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025 (2015).

[13] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? arXiv preprint arXiv:1905.10650 (2019).

[14] Brent Mittelstadt, Chris Russell, and Sandra Wachter. 2018. Explaining Explanations in AI. arXiv preprint arXiv:1811.01439 (2018).

[15] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. arXiv preprint arXiv:1602.06023 (2016).

[16] Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A Universal Part-of-Speech Tagset. arXiv preprint arXiv:1104.2086 (2011).

[17] Alessandro Raganato, Jörg Tiedemann, et al. 2018. An Analysis of Encoder Representations in Transformer-Based Machine Translation. In 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. ACL.

[18] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.

[19] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the Point: Summarization with Pointer-Generator Networks. arXiv preprint arXiv:1704.04368 (2017).

[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems. 5998–6008.

[21] Jesse Vig. 2018. Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters. towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77. Accessed: 2019-04-29.

[22] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418 (2019).

[23] Yonghui Wu et al. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144 (2016).

Do Attention Heads Provide Transparency? Paris ’19, June 21–25, 2019, Paris, France. A APPENDIX. Figure 5: Specialized named entity head focusing on football teams. Figure 6: Specialized head fo...
