Sharing Attention Weights for Fast Transformer

Jingbo Zhu; Tongran Liu; Tong Xiao; Yinqiao Li; Zhengtao Yu

arxiv: 1906.11024 · v1 · pith:HLNAL6DHnew · submitted 2019-06-26 · 💻 cs.CL

Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, Tongran Liu

Pith reviewed 2026-05-25 15:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords attention sharing · transformer · machine translation · inference acceleration · weight sharing · auto-regressive decoding · WMT · NIST OpenMT

The pith

Sharing attention weights between adjacent Transformer layers yields 1.3 times faster inference with almost no loss in BLEU score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention weights can be shared across adjacent layers in the Transformer to enable vertical reuse of hidden states during auto-regressive decoding. This reduces the cost of repeated dot-product attention computations while the sharing policy itself is learned jointly with the translation model. Experiments across ten WMT and NIST OpenMT tasks demonstrate the resulting 1.3X speedup on top of an already-cached baseline implementation, and a larger gain when combined with the AAN model. A reader would care because the change requires no new hardware or data and keeps output quality nearly identical, making large attention models more usable at inference time.

Core claim

By sharing attention weights in adjacent layers the model reuses hidden states vertically, producing an average 1.3X speed-up with almost no decrease in BLEU on ten WMT and NIST OpenMT tasks; the same approach gives 1.8X speed-up with the AAN model and reaches 16 times the speed of an uncached baseline.
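To put the quoted factors in terms of decoding time, the short sketch below converts a speed-up factor into the fraction of wall-clock time saved. It also derives the factor that would separate the cached and uncached baselines, under the assumption (not stated explicitly in the summary) that the 1.8X and 16X figures describe the same system measured against the cached and uncached baselines respectively. The function names are illustrative only.

```python
def time_saved_fraction(speedup):
    """Fraction of decoding time removed by a given speed-up factor."""
    return 1.0 - 1.0 / speedup


def implied_baseline_factor(total_speedup, stacked_speedup):
    """Speed-up attributable to the baseline change alone.

    Assumption: total_speedup (vs. uncached) and stacked_speedup (vs. cached)
    were measured on the same system, so the two factors multiply.
    """
    return total_speedup / stacked_speedup


# 1.3X faster means roughly 23% less decoding time.
print(round(time_saved_fraction(1.3), 3))
# 1.8X faster means roughly 44% less decoding time.
print(round(time_saved_fraction(1.8), 3))
# Under the stated assumption, caching alone would account for about 8.9X.
print(round(implied_baseline_factor(16.0, 1.8), 1))
```

Read this way, most of the 16X figure would come from the attention cache itself, and the sharing method contributes the remaining factor on top of it.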

What carries the argument

Attention weight sharing across adjacent layers, which permits vertical reuse of hidden states.
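A minimal sketch of the idea, assuming single-head scaled dot-product attention over one query: the lower layer computes the attention weights once, and the layer above reuses them instead of forming its own query-key products. The function names are hypothetical, and how the paper handles the upper layer's value vectors is not specified in this summary, so the separate `values_upper` argument is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def weighted_sum(weights, values):
    """Mix the value vectors using the given attention weights."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def two_layer_shared(query, keys, values_lower, values_upper):
    """Lower layer computes the weights; the upper layer reuses them."""
    weights = attention_weights(query, keys)         # only query-key pass
    out_lower = weighted_sum(weights, values_lower)
    out_upper = weighted_sum(weights, values_upper)  # no new dot products
    return weights, out_lower, out_upper


weights, out_lower, out_upper = two_layer_shared(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values_lower=[[1.0, 0.0], [0.0, 1.0]],
    values_upper=[[2.0, 2.0], [4.0, 4.0]],
)
print(weights, out_lower, out_upper)
```

The saving in this toy version is the skipped call to `attention_weights` in the upper layer; the weighted sum over values is still performed once per layer.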

If this is right

The shared model maintains translation quality within a negligible margin on standard benchmarks.
The technique stacks on top of existing attention caching for further gains.
The sharing decision can be optimized end-to-end with the translation loss.
The approach reaches 1.8X speed-up when combined with the AAN model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sharing pattern could be tested on encoder-only or decoder-only Transformers outside machine translation.
Allowing different sharing patterns per head might recover any small accuracy gap observed in the experiments.
The learned sharing decisions may indicate which layer pairs perform redundant computations.

Load-bearing premise

Sharing attention weights between adjacent layers preserves enough model capacity to match the performance of the unshared model on the tested translation tasks.

What would settle it

Running the shared-weight model on one of the ten tasks and measuring a BLEU drop larger than 0.5 points relative to the unshared version.
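The check described above is simple to mechanize. The sketch below flags any task whose BLEU drop exceeds the 0.5-point threshold; the task names and scores in the usage lines are made-up placeholders, not results from the paper.

```python
def bleu_drops(baseline_bleu, shared_bleu):
    """Per-task BLEU drop (baseline minus shared-weight model)."""
    return {task: baseline_bleu[task] - shared_bleu[task] for task in baseline_bleu}


def falsifying_tasks(baseline_bleu, shared_bleu, threshold=0.5):
    """Tasks where the shared model loses more than `threshold` BLEU points."""
    drops = bleu_drops(baseline_bleu, shared_bleu)
    return sorted(task for task, drop in drops.items() if drop > threshold)


# Placeholder numbers for illustration only.
baseline = {"task-a": 27.0, "task-b": 30.0}
shared = {"task-a": 26.4, "task-b": 29.8}
print(falsifying_tasks(baseline, shared))  # task-a drops by about 0.6
```

A single flagged task would already contradict the load-bearing premise as stated, though a single-run comparison cannot separate a real drop from training noise.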

Figures

Figures reproduced from arXiv: 1906.11024 by Jingbo Zhu, Tongran Liu, Tong Xiao, Yinqiao Li, Zhengtao Yu.

**Figure 2.** The Jensen-Shannon divergence of the attention we… (caption truncated)

**Figure 3.** Comparison of the standard attention model and the… (caption truncated)

**Figure 4.** Joint learning of MT models and sharing policies

**Figure 5.** Translation speed (token/sec) vs beam size and BLE… (caption truncated)

**Figure 6.** JS divergence vs number of training steps
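Figures 2 and 6 report Jensen-Shannon (JS) divergence. For readers unfamiliar with the measure, here is a minimal sketch of JS divergence between two discrete distributions, such as two attention rows over the same positions. The base-2 logarithm is a choice made here so that the value lies in [0, 1]; the captions do not say which base the paper uses.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


print(js_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # identical rows: 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))            # disjoint rows: 1.0
```

A low value between the attention rows of two layers suggests the layers attend in nearly the same way, which is the kind of redundancy weight sharing would exploit.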

Original abstract

Recently, the Transformer machine translation system has shown strong results by stacking attention layers on both the source and target-language sides. But the inference of this model is slow due to the heavy use of dot-product attention in auto-regressive decoding. In this paper we speed up Transformer via a fast and lightweight attention model. More specifically, we share attention weights in adjacent layers and enable the efficient re-use of hidden states in a vertical manner. Moreover, the sharing policy can be jointly learned with the MT model. We test our approach on ten WMT and NIST OpenMT tasks. Experimental results show that it yields an average of 1.3X speed-up (with almost no decrease in BLEU) on top of a state-of-the-art implementation that has already adopted a cache for fast inference. Also, our approach obtains a 1.8X speed-up when it works with the AAN model. This is even 16 times faster than the baseline with no use of the attention cache.
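The abstract does not describe the attention cache it builds on. Assuming it is the common practice of storing the key and value projections of already-generated target positions, the toy count below shows why an uncached decoder is so much slower: without the cache, every step re-projects the whole prefix. The function name and the per-layer counting model are illustrative assumptions.

```python
def key_value_projections(num_steps, use_cache):
    """Count key/value projections in one self-attention layer over a decode.

    With a cache, each step projects only the newest position; without one,
    step t re-projects all t positions generated so far.
    """
    total = 0
    for step in range(1, num_steps + 1):
        total += 1 if use_cache else step
    return total


print(key_value_projections(10, use_cache=True))   # 10
print(key_value_projections(10, use_cache=False))  # 55
```

The cached count grows linearly with output length and the uncached count quadratically, so the gap widens on longer sentences. Actual wall-clock ratios depend on hardware and batching and are not modeled here.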

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper reports a practical 1.3X inference speedup from sharing attention weights between adjacent layers on top of an existing cache, with little BLEU loss across ten MT tasks.

The core result is straightforward: sharing attention weights vertically between adjacent layers, with hidden-state reuse and a jointly learned policy, delivers about 1.3X faster decoding than a strong cached baseline on WMT and NIST tasks, and 1.8X when combined with AAN. The work is mostly an engineering refinement rather than a conceptual shift, but it is scoped clearly to inference speed in autoregressive MT and measures the outcome directly against external baselines. What stands out is the consistent reporting across ten tasks and the additional gain when stacked on AAN, which shows the method is not limited to one setup. The joint learning of the sharing policy is a reasonable addition that avoids manual tuning. On the downside, the abstract gives no error bars, no ablation breakdowns, and no statistical tests, so the reliability of the 1.3X figure is hard to judge from the summary alone. The capacity-preservation assumption is tested only by the BLEU numbers on these particular tasks; nothing in the provided material suggests it would generalize without further checks. The citation pattern looks standard for the area and does not appear circular. This is the kind of paper that matters to groups shipping production MT systems who already use caching and want another lever for latency. It is not essential reading for core model research, but the empirical claim is narrow enough and the baseline strong enough that it should go to peer review rather than desk rejection. A referee could usefully press for ablations and variance numbers, but the central measurement is reproducible in principle from the described setup.

Referee Report

1 major / 2 minor

Summary. The paper proposes sharing attention weights between adjacent layers in the Transformer to enable vertical reuse of hidden states during auto-regressive inference for machine translation. The sharing policy is learned jointly with the model. On ten WMT and NIST OpenMT tasks, it reports an average 1.3X inference speedup (with almost no BLEU drop) on top of an already-cached state-of-the-art baseline, plus a 1.8X gain when combined with the AAN model (16X vs. uncached baseline).

Significance. If the empirical results hold, the work supplies a lightweight, learnable inference optimization for Transformers that preserves task performance on the tested MT benchmarks. This is a practical contribution given the centrality of Transformer inference speed in deployed MT systems.

major comments (1)

[Results] Results section: the central claims of 'consistent speed-ups' and 'almost no decrease in BLEU' across ten tasks are reported without error bars, variance estimates, or statistical significance tests, making it impossible to assess whether the 1.3X figure is robust or within noise of the cached baseline.
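One way to supply the missing significance estimate without retraining is paired bootstrap resampling over the test set. The sketch below works on per-sentence quality scores for two systems; this is a simplification, since BLEU is a corpus-level metric and a faithful version would recompute corpus BLEU on each resample. The function name, sample count, and seed are illustrative choices.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    Both score lists must be aligned sentence by sentence. The same resampled
    indices are used for both systems, which is what makes the test paired.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    rng = random.Random(seed)
    size = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        indices = [rng.randrange(size) for _ in range(size)]
        total_a = sum(scores_a[i] for i in indices)
        total_b = sum(scores_b[i] for i in indices)
        if total_a > total_b:
            wins += 1
    return wins / num_samples
```

A win rate near 0.5 for the baseline over the shared model would support the "almost no decrease" claim, while a rate close to 1.0 would indicate a consistent loss. The same resampling idea applies to repeated timing runs for the speed-up figure.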

minor comments (2)

[Method] The description of how the sharing policy is parameterized and jointly optimized should be expanded with explicit equations or pseudocode to allow reproduction.
[Experiments] Table or figure captions for the ten-task results should list per-task BLEU deltas and speed-up ratios rather than only averages.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below regarding the presentation of empirical results.

Referee: [Results] Results section: the central claims of 'consistent speed-ups' and 'almost no decrease in BLEU' across ten tasks are reported without error bars, variance estimates, or statistical significance tests, making it impossible to assess whether the 1.3X figure is robust or within noise of the cached baseline.

Authors: We agree that error bars, variance estimates, and statistical tests would strengthen the presentation. Each model was trained with a single run owing to the substantial computational cost of training large Transformers on the WMT and NIST corpora; multiple independent runs were not performed. The reported 1.3X average speedup (and near-zero BLEU change) is nevertheless observed uniformly across all ten tasks that differ in language pair, data size, and domain. In the revised manuscript we will add an explicit paragraph in the results section acknowledging the lack of variance estimates, justifying the single-run protocol, and emphasizing the cross-task consistency as supporting evidence for robustness.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical proposal to share attention weights between adjacent Transformer layers, jointly optimized with the MT model, and reports measured inference speed-ups (1.3X average, 1.8X with AAN) on ten external WMT/NIST tasks relative to cached baselines. No equations, predictions, or uniqueness claims are present that reduce the reported outcomes to fitted parameters or self-citations by construction; the capacity-preservation assumption is evaluated directly via BLEU scores on held-out data, and the central result is an externally falsifiable runtime measurement rather than an internal derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach rests on the empirical observation that adjacent-layer sharing preserves performance; no new mathematical axioms or invented entities are introduced.

free parameters (1)

sharing policy parameters
The policy deciding which weights to share is learned jointly from data.
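This page does not say how the policy is represented. One simple possibility is a partition of the layer stack into contiguous groups, where only the first layer of each group computes attention weights. The sketch below builds such a partition from the divergence between each pair of adjacent layers. The thresholding rule is an illustrative stand-in chosen here, not the paper's learning procedure, and the function name and threshold value are hypothetical.

```python
def group_adjacent_layers(adjacent_divergence, threshold=0.1):
    """Partition layers into contiguous sharing groups.

    adjacent_divergence[i] is a dissimilarity score between layer i and
    layer i + 1. A layer joins the current group when that score is below
    the threshold; otherwise it starts a new group.
    """
    groups = [[0]]
    for layer, divergence in enumerate(adjacent_divergence, start=1):
        if divergence < threshold:
            groups[-1].append(layer)  # reuse weights from the group's first layer
        else:
            groups.append([layer])    # this layer computes its own weights
    return groups


# Six layers, five adjacent-pair scores (placeholder values).
print(group_adjacent_layers([0.02, 0.30, 0.01, 0.04, 0.50]))
# [[0, 1], [2, 3, 4], [5]]
```

In this example three of the six layers would compute attention weights and the other three would reuse them. A learned policy would set the grouping from training signals rather than from a fixed threshold.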

pith-pipeline@v0.9.0 · 5705 in / 1029 out tokens · 23865 ms · 2026-05-25T15:47:14.051780+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

[2] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark, September 2017.

[3] Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019.

[4] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1243–1252, 2017.

[5] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018.

[6] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[7] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1317–1327, 2016.

[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[9] Gurvan L’Hostis, David Grangier, and Michael Auli. Vocabulary selection strategies for neural machine translation. CoRR, abs/1610.00072, 2016.

[10] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Information Theory, 37(1):145–151, 1991.

[11] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.

[12] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016.

[13] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.

[14] Jerry Quinn and Miguel Ballesteros. Pieces of eight: 8-bit neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 3 (Industry Papers)...

[15] Baskaran Sankaran, Markus Freitag, and Yaser Al-Onaizan. Attention-based vocabulary selection for NMT decoding. CoRR, abs/1706.03824, 2017.

[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[17] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826, 2016.

[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[19] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[20] Tong Xiao, Jingbo Zhu, Hao Zhang, and Qiang Li. NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation. In Proceedings of the ACL 2012 System Demonstrations, pages 19–24, Jeju Island, Korea, July 2012.

[21] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine translation with weight sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46–55, 2018.

[22] Biao Zhang, Deyi Xiong, and Jinsong Su. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798, 2018.
