
arxiv: 1906.09444 · v1 · pith:ACCYOB25new · submitted 2019-06-22 · 💻 cs.CL · cs.AI · cs.LG

Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation

Pith reviewed 2026-05-25 18:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords non-autoregressive translation · neural machine translation · reinforcement learning · transformer decoder · sequential information · parallel decoding · BLEU

The pith

Non-autoregressive translation recovers target word order via reinforcement training or a top-layer fused decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Non-autoregressive models generate all target words in parallel and therefore lose the sequential dependencies that autoregressive models capture step by step. The paper shows that this loss produces over- and under-translation, especially on long sentences. It introduces a reinforcement algorithm that trains NAT models at the full-sequence level and an FS-decoder that injects sequential information only into the uppermost decoder layer. Both changes improve BLEU while preserving the original parallel decoding speed.
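
The contrast in the paragraph above can be written out with the standard factorizations used in the NAT literature (for example Gu et al., 2017). The notation is an assumption made here for illustration and does not come from this page: x is the source sentence, y = (y_1, ..., y_T) is the target, and T is the target length, which a NAT model has to predict or fix before decoding. The NAT line is a simplified form that leaves out any latent variables a particular model may add.

```latex
% Autoregressive: each target word conditions on all previous target words.
p_{\mathrm{AT}}(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)

% Non-autoregressive (simplified): target words are conditionally
% independent given the source sentence and the target length.
p_{\mathrm{NAT}}(y \mid x) = p(T \mid x) \prod_{t=1}^{T} p(y_t \mid x)
```

Dropping y_{<t} from the conditioning is what lets all T positions be decoded in one parallel step. It is also why nothing in the model stops two adjacent positions from emitting the same word, or every position from skipping a word that the source requires.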

Core claim

Reinforce-NAT and the FS-decoder retrieve the target sequential information that standard NAT discards, yielding higher BLEU scores than baseline NAT without any slowdown and, for the FS-decoder, performance comparable to the autoregressive Transformer with substantial speedup.

What carries the argument

The FS-decoder, which fuses sequential information exclusively into the top decoder layer, together with a reinforcement algorithm that performs sequence-level training of NAT.
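
To make "sequence-level training" concrete, the following is a minimal sketch of a generic REINFORCE gradient estimate for a model whose output positions are sampled independently, as in NAT. It is not the paper's Reinforce-NAT algorithm, whose variance-reduction details are not given on this page. The names `toy_reward` and `reinforce_grad`, the bag-of-words reward standing in for BLEU, and the mean-reward baseline are all assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(sample, reference):
    """Hypothetical sequence-level reward: fraction of reference tokens covered.

    A stand-in for BLEU. It scores the whole sampled sequence at once.
    """
    ref_counts = {}
    for tok in reference:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    hits = 0
    for tok in sample:
        if ref_counts.get(tok, 0) > 0:
            ref_counts[tok] -= 1
            hits += 1
    return hits / len(reference)


def reinforce_grad(logits, reference, num_samples=64, rng=None):
    """Monte Carlo REINFORCE estimate of d E[reward] / d logits.

    `logits` is a list of per-position rows (one score per vocabulary id).
    Every position is sampled independently of the others. The mean reward
    of the batch is subtracted as a simple baseline (an assumption here; it
    introduces a small bias that shrinks as num_samples grows).
    """
    rng = rng or random.Random(0)
    probs = [softmax(row) for row in logits]
    vocab = range(len(logits[0]))
    samples = [[rng.choices(vocab, weights=p)[0] for p in probs]
               for _ in range(num_samples)]
    rewards = [toy_reward(s, reference) for s in samples]
    baseline = sum(rewards) / num_samples
    grad = [[0.0] * len(row) for row in logits]
    for sample, reward in zip(samples, rewards):
        advantage = (reward - baseline) / num_samples
        for t, tok in enumerate(sample):
            for v in vocab:
                # d log softmax(tok) / d logit_v = 1[v == tok] - p_v
                indicator = 1.0 if v == tok else 0.0
                grad[t][v] += advantage * (indicator - probs[t][v])
    return grad
```

With uniform logits over a three-word vocabulary and a reference made only of token 2, the estimate pushes up the logit of token 2 at every position. The per-sample reward is the only training signal, so the estimate is noisy for small `num_samples`; that noise is the variance problem the abstract says Reinforce-NAT is designed to reduce.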

If this is right

  • NAT can be trained end-to-end at the sequence level without variance explosion.
  • Sequential dependencies need only be restored at the final decoder stage to affect output quality.
  • The same parallel decoding schedule remains valid after the proposed changes.
  • Comparable quality to autoregressive models becomes reachable at NAT speeds on standard translation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement or top-layer fusion approach could be tested on other non-autoregressive generation tasks such as summarization or speech synthesis.
  • Combining the two proposed methods in one model might produce additive gains, though the paper does not report such an experiment.
  • The methods may reduce reliance on external distillation or fertility prediction tricks commonly used in NAT.

Load-bearing premise

Adding sequential information only through reinforcement or a single top decoder layer is sufficient to restore the dependencies NAT otherwise ignores, without creating new inconsistencies or requiring changes to parallel decoding.

What would settle it

On the same test sets, if Reinforce-NAT or FS-decoder still produces the same frequency of over- and under-translation errors on long sentences as the unmodified NAT baseline, the central claim is false.
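
One way to run such a check is to count crude proxies for the two error types and compare them across systems, split by sentence length. The sketch below is an assumption about how that could be done, not a procedure from the paper: `repetition_rate` treats an immediately repeated token as over-translation, `missing_rate` treats reference tokens absent from the output as under-translation, and the `long_threshold` of 30 tokens is an arbitrary illustrative cutoff.

```python
from collections import Counter


def repetition_rate(tokens):
    """Fraction of adjacent token pairs in which the token is repeated."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / (len(tokens) - 1)


def missing_rate(hypothesis, reference):
    """Fraction of reference tokens (with multiplicity) absent from the hypothesis."""
    if not reference:
        return 0.0
    remaining = Counter(hypothesis)
    missing = 0
    for tok in reference:
        if remaining[tok] > 0:
            remaining[tok] -= 1
        else:
            missing += 1
    return missing / len(reference)


def error_rates_by_length(pairs, long_threshold=30):
    """Average both proxies separately for short and long references.

    `pairs` is an iterable of (hypothesis_tokens, reference_tokens).
    Returns {"short": (repetition, missing), "long": (repetition, missing)},
    omitting a bucket that received no sentences.
    """
    buckets = {"short": [], "long": []}
    for hyp, ref in pairs:
        key = "long" if len(ref) >= long_threshold else "short"
        buckets[key].append((repetition_rate(hyp), missing_rate(hyp, ref)))
    summary = {}
    for key, rows in buckets.items():
        if rows:
            summary[key] = (
                sum(r[0] for r in rows) / len(rows),
                sum(r[1] for r in rows) / len(rows),
            )
    return summary
```

Running `error_rates_by_length` on the outputs of the NAT baseline, Reinforce-NAT, and the FS-decoder over the same test set would give one pair of numbers per system and length bucket. If the "long" numbers for the proposed systems matched the baseline, that would count against the central claim. Token-level proxies like these miss paraphrases and reordering, so they complement manual error analysis and do not replace it.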

Figures

Figures reproduced from arXiv: 1906.09444 by Chenze Shao, Fandong Meng, Jie Zhou, Jinchao Zhang, Xilin Chen, Yang Feng.

Figure 1: The architecture of FS-decoder. The decoder …
Figure 2: top- …
Figure 3: Training curves for k = 0, 1, 5 and 10.
Figure 4: The BLEU scores on the validation set of …
Original abstract

Non-Autoregressive Transformer (NAT) aims to accelerate the Transformer model through discarding the autoregressive mechanism and generating target words independently, which fails to exploit the target sequential information. Over-translation and under-translation errors often occur for the above reason, especially in the long sentence translation scenario. In this paper, we propose two approaches to retrieve the target sequential information for NAT to enhance its translation ability while preserving the fast-decoding property. Firstly, we propose a sequence-level training method based on a novel reinforcement algorithm for NAT (Reinforce-NAT) to reduce the variance and stabilize the training procedure. Secondly, we propose an innovative Transformer decoder named FS-decoder to fuse the target sequential information into the top layer of the decoder. Experimental results on three translation tasks show that the Reinforce-NAT surpasses the baseline NAT system by a significant margin on BLEU without decelerating the decoding speed and the FS-decoder achieves comparable translation performance to the autoregressive Transformer with considerable speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that Non-Autoregressive Transformer (NAT) models can be improved by retrieving target sequential information via two methods: Reinforce-NAT, a novel reinforcement learning algorithm for sequence-level training that reduces variance and stabilizes training, and FS-decoder, which fuses target sequential information into the top decoder layer. Experiments on three translation tasks show Reinforce-NAT surpassing baseline NAT on BLEU without slowing decoding speed, while FS-decoder achieves performance comparable to the autoregressive Transformer with considerable speedup.

Significance. If the results hold, this would be a meaningful contribution to efficient NMT by addressing over- and under-translation errors in NAT while retaining parallel decoding advantages. The reinforcement-based training and top-layer fusion approach offer practical techniques for incorporating dependencies without altering the core parallel generation process.

major comments (1)
  1. [FS-decoder] FS-decoder section: fusing sequential information exclusively into the top decoder layer while keeping lower layers fully parallel and position-independent means hidden states arriving at the top layer lack target dependencies. The top-layer fusion can at best post-correct outputs but cannot retroactively enforce consistency through the stack, which is load-bearing for the central claim that the method recovers discarded target dependencies without changing the parallel decoding process.
minor comments (1)
  1. [Abstract] Abstract: reports performance gains on three tasks but supplies no dataset sizes, baseline details, variance numbers, or ablation results, limiting the ability to evaluate the strength of the empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We respond to the major comment below.

Point-by-point responses
  1. Referee: [FS-decoder] FS-decoder section: fusing sequential information exclusively into the top decoder layer while keeping lower layers fully parallel and position-independent means hidden states arriving at the top layer lack target dependencies. The top-layer fusion can at best post-correct outputs but cannot retroactively enforce consistency through the stack, which is load-bearing for the central claim that the method recovers discarded target dependencies without changing the parallel decoding process.

    Authors: The lower layers of the FS-decoder are indeed kept fully parallel and position-independent to preserve the core NAT decoding speed. However, the fusion of target sequential information occurs at the top decoder layer, which directly produces the output token predictions. This design incorporates the sequential dependencies into the final hidden states used for prediction, allowing the model to mitigate over- and under-translation at the generation step. Because NAT generates all tokens in parallel, there is no sequential stack that must be traversed autoregressively; the top-layer fusion supplies the missing target-side context precisely where it is needed for the output. The experimental results across three tasks, showing FS-decoder performance comparable to the autoregressive Transformer with substantial speedup, provide evidence that this recovers the relevant dependencies without altering the parallel process. We disagree that the approach is limited to ineffective post-correction. If the manuscript description of the fusion mechanism requires additional detail for clarity, we can revise the FS-decoder section accordingly.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results only

Full rationale

The paper contains no mathematical derivations, equations, or first-principles claims. It proposes two empirical methods (Reinforce-NAT and FS-decoder) and reports BLEU scores from experiments on translation tasks. No self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external experimental comparisons rather than quantities defined in terms of the paper's own fitted values or prior self-citations. This is the expected non-finding for an applied empirical NLP paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The work relies on standard Transformer and reinforcement-learning background assumptions from prior literature.

pith-pipeline@v0.9.0 · 5714 in / 1023 out tokens · 28181 ms · 2026-05-25T18:16:10.957410+00:00 · methodology



Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 19 internal anchors

  1. [1]

    Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086

  2. [2]

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

  3. [3]

    Yoshua Bengio, Jean-Sébastien Senécal, et al. 2003. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pages 1--9

  4. [4]

    Aleksandar Botev, Bowen Zheng, and David Barber. 2017. Complementary sum sampling for likelihood approximation in large scale classification. In Artificial Intelligence and Statistics, pages 1030--1038

  5. [5]

    Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

  6. [6]

    Peter W Glynn and Donald L Iglehart. 1989. Importance sampling for stochastic simulations. Management Science, 35(11):1367--1392

  7. [7]

    Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017a. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281

  8. [8]

    Jiatao Gu, Kyunghyun Cho, and Victor OK Li. 2017b. Trainable greedy decoding for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1968--1978

  9. [9]

    Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2018. Non-autoregressive neural machine translation with enhanced decoder input. arXiv preprint arXiv:1812.09664

  10. [10]

    Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820--828

  11. [11]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  12. [12]

    Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382

  13. [13]

    Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317--1327

  14. [14]

    Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980. http://arxiv.org/abs/1412.6980

  15. [15]

    Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901

  16. [16]

    Zhuohan Li, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Hint-based training for non-autoregressive translation

  17. [17]

    Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. arXiv preprint arXiv:1805.04871

  18. [18]

    Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML

  19. [19]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311--318. Association for Computational Linguistics

  20. [20]

    Ofir Press and Noah A. Smith. 2018. You may not need attention. CoRR, abs/1810.13409. http://arxiv.org/abs/1810.13409

  21. [21]

    Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732

  22. [22]

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715--1725, Berlin, Germany. Association for Computational Linguistics. http://www.aclweb.org/anthology/P16-1162

  23. [23]

    Chenze Shao, Xilin Chen, and Yang Feng. 2018. Greedy search with probabilistic n-gram matching for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4778--4784

  24. [24]

    Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1683--1692

  25. [25]

    Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, volume 200

  26. [26]

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104--3112

  27. [27]

    Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057--1063

  28. [28]

    Richard Stuart Sutton. 1984. Temporal credit assignment in reinforcement learning

  29. [29]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000--6010

  30. [30]

    Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583

  31. [31]

    Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245

  32. [32]

    Lex Weaver and Nigel Tao. 2013. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence

  33. [33]

    Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5--32. Springer

  34. [34]

    Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270--280

  35. [35]

    Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. arXiv preprint arXiv:1808.08866

  36. [36]

    Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933

  37. [37]

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144

  38. [38]

    Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887

  39. [39]

    Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852--2858

  40. [40]

    Biao Zhang, Deyi Xiong, and Jinsong Su. 2018a. Accelerating neural transformer via an average attention network. arXiv preprint arXiv:1805.00631

  41. [41]

    Wen Zhang, Liang Huang, Yang Feng, Lei Shen, and Qun Liu. 2018b. Speeding up neural machine translation decoding by cube pruning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4284--4294
