Sharing Attention Weights for Fast Transformer
Pith reviewed 2026-05-25 15:47 UTC · model grok-4.3
The pith
Sharing attention weights between adjacent Transformer layers speeds up inference by an average of 1.3X with almost no loss in BLEU score.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By sharing attention weights in adjacent layers, the model reuses hidden states vertically. This produces an average 1.3X speed-up with almost no decrease in BLEU on ten WMT and NIST OpenMT tasks, measured on top of a baseline that already uses an attention cache. Combined with the AAN model, the same approach gives a 1.8X speed-up, which is 16 times faster than a baseline without the attention cache.
What carries the argument
Attention weight sharing across adjacent layers, which permits vertical reuse of hidden states.
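The mechanism can be illustrated with a minimal sketch, assuming a simplified single-head setting in which the first layer computes scaled dot-product attention weights and the next layer reuses them with its own value vectors. The function names (`attention_weights`, `apply_weights`, `shared_attention_stack`) and the toy inputs are hypothetical and are not taken from the paper, which may share weights in a different way.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def apply_weights(weights, values):
    """Weighted sum of the value vectors for each query position."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


def shared_attention_stack(queries, keys, values_per_layer):
    """Compute the weights once, then reuse them in every listed layer."""
    weights = attention_weights(queries, keys)  # the only softmax(QK^T) call
    outputs = [apply_weights(weights, values) for values in values_per_layer]
    return weights, outputs


# Toy inputs (hypothetical): two positions, two adjacent layers.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values_layer1 = [[1.0, 2.0], [3.0, 4.0]]
values_layer2 = [[0.5, 0.5], [1.5, -1.5]]
weights, outputs = shared_attention_stack(
    queries, keys, [values_layer1, values_layer2]
)
```

In this sketch the saving comes from skipping the query-key product and the softmax in the second layer; only the weighted sum over values is repeated.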
If this is right
- The shared model maintains translation quality within a negligible margin on standard benchmarks.
- The technique stacks on top of existing attention caching for further gains.
- The sharing decision can be optimized end-to-end with the translation loss.
- The approach reaches 1.8X speed-up when combined with the AAN model.
Where Pith is reading between the lines
- The same sharing pattern could be tested on encoder-only or decoder-only Transformers outside machine translation.
- Allowing different sharing patterns per head might recover any small accuracy gap observed in the experiments.
- The learned sharing decisions may indicate which layer pairs perform redundant computations.
Load-bearing premise
Sharing attention weights between adjacent layers preserves enough model capacity to match the performance of the unshared model on the tested translation tasks.
What would settle it
Running the shared-weight model on one of the ten tasks and measuring a BLEU drop larger than 0.5 points relative to the unshared version.
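The test above can be written as a small check, sketched here under the assumption that corpus-level BLEU has already been computed for both models on the same test set. The helper names and the example scores are hypothetical; only the 0.5-point threshold comes from the text.

```python
def bleu_drop(baseline_bleu, shared_bleu):
    """BLEU points lost by the shared-weight model (positive means worse)."""
    return baseline_bleu - shared_bleu


def falsifies_claim(baseline_bleu, shared_bleu, threshold=0.5):
    """True if the drop exceeds the threshold named in the text."""
    return bleu_drop(baseline_bleu, shared_bleu) > threshold


# Hypothetical scores for illustration only, not results from the paper.
small_drop = falsifies_claim(27.0, 26.8)
large_drop = falsifies_claim(27.0, 26.0)
```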
Original abstract
Recently, the Transformer machine translation system has shown strong results by stacking attention layers on both the source and target-language sides. But the inference of this model is slow due to the heavy use of dot-product attention in auto-regressive decoding. In this paper we speed up Transformer via a fast and lightweight attention model. More specifically, we share attention weights in adjacent layers and enable the efficient re-use of hidden states in a vertical manner. Moreover, the sharing policy can be jointly learned with the MT model. We test our approach on ten WMT and NIST OpenMT tasks. Experimental results show that it yields an average of 1.3X speed-up (with almost no decrease in BLEU) on top of a state-of-the-art implementation that has already adopted a cache for fast inference. Also, our approach obtains a 1.8X speed-up when it works with the AAN model. This is even 16 times faster than the baseline with no use of the attention cache.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes sharing attention weights between adjacent layers in the Transformer to enable vertical reuse of hidden states during auto-regressive inference for machine translation. The sharing policy is learned jointly with the model. On ten WMT and NIST OpenMT tasks, it reports an average 1.3X inference speedup (with almost no BLEU drop) on top of an already-cached state-of-the-art baseline, plus a 1.8X gain when combined with the AAN model (16X vs. uncached baseline).
Significance. If the empirical results hold, the work supplies a lightweight, learnable inference optimization for Transformers that preserves task performance on the tested MT benchmarks. This is a practical contribution given the centrality of Transformer inference speed in deployed MT systems.
major comments (1)
- [Results] Results section: the central claims of 'consistent speed-ups' and 'almost no decrease in BLEU' across ten tasks are reported without error bars, variance estimates, or statistical significance tests, making it impossible to assess whether the 1.3X figure is robust or within noise of the cached baseline.
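One way to supply the kind of evidence this comment asks for, without retraining, is paired bootstrap resampling over test segments. The sketch below is an assumption about how such a check could look, not a procedure from the paper. It uses per-segment scores as a stand-in, whereas a faithful BLEU test would recompute corpus-level BLEU on each resample. The function name and arguments are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resamples in which system A's total beats system B's.

    scores_a and scores_b are per-segment scores for the same segments,
    in the same order (for example quality scores or decoding times).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Resample segment indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples
```

A win fraction close to 1.0 or 0.0 suggests a difference that is stable under resampling of the test set; values near 0.5 suggest the two systems are hard to distinguish on that set. This addresses test-set variance only, not variance across training runs.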
minor comments (2)
- [Method] The description of how the sharing policy is parameterized and jointly optimized should be expanded with explicit equations or pseudocode to allow reproduction.
- [Experiments] Table or figure captions for the ten-task results should list per-task BLEU deltas and speed-up ratios rather than only averages.
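The per-task reporting requested in the second minor comment could be produced along the following lines. This is a minimal sketch that assumes each task has a BLEU score and a total decoding time for both the baseline and the shared-weight model. The task names and numbers are placeholders, not figures from the paper.

```python
def per_task_report(results):
    """Build one row per task with the BLEU delta and the speed-up ratio.

    results maps a task name to a tuple:
    (baseline_bleu, shared_bleu, baseline_seconds, shared_seconds).
    """
    rows = []
    for task in sorted(results):
        base_bleu, shared_bleu, base_s, shared_s = results[task]
        rows.append({
            "task": task,
            "bleu_delta": shared_bleu - base_bleu,  # negative means a loss
            "speedup": base_s / shared_s,           # above 1.0 means faster
        })
    return rows


def average_speedup(rows):
    """Arithmetic mean of the per-task speed-up ratios."""
    return sum(row["speedup"] for row in rows) / len(rows)


# Placeholder measurements for illustration only.
example = {
    "task_a": (27.0, 26.9, 130.0, 100.0),
    "task_b": (33.0, 33.1, 120.0, 100.0),
}
report = per_task_report(example)
mean_speedup = average_speedup(report)
```

Reporting the rows alongside the mean makes it visible whether an average such as 1.3X is typical of every task or driven by a few of them.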
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below regarding the presentation of empirical results.
- Referee: [Results] Results section: the central claims of 'consistent speed-ups' and 'almost no decrease in BLEU' across ten tasks are reported without error bars, variance estimates, or statistical significance tests, making it impossible to assess whether the 1.3X figure is robust or within noise of the cached baseline.
Authors: We agree that error bars, variance estimates, and statistical tests would strengthen the presentation. Each model was trained with a single run owing to the substantial computational cost of training large Transformers on the WMT and NIST corpora; multiple independent runs were not performed. The reported 1.3X average speedup (and near-zero BLEU change) is nevertheless observed uniformly across all ten tasks that differ in language pair, data size, and domain. In the revised manuscript we will add an explicit paragraph in the results section acknowledging the lack of variance estimates, justifying the single-run protocol, and emphasizing the cross-task consistency as supporting evidence for robustness.
Revision: partial
Circularity Check
No significant circularity
The paper presents an empirical proposal to share attention weights between adjacent Transformer layers, jointly optimized with the MT model, and reports measured inference speed-ups (1.3X average, 1.8X with AAN) on ten external WMT/NIST tasks relative to cached baselines. No equations, predictions, or uniqueness claims are present that reduce the reported outcomes to fitted parameters or self-citations by construction; the capacity-preservation assumption is evaluated directly via BLEU scores on held-out data, and the central result is an externally falsifiable runtime measurement rather than an internal derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- sharing policy parameters