Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation

Chenze Shao; Fandong Meng; Jie Zhou; Jinchao Zhang; Xilin Chen; Yang Feng

arxiv: 1906.09444 · v1 · pith:ACCYOB25new · submitted 2019-06-22 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-25 18:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords non-autoregressive translation · neural machine translation · reinforcement learning · transformer decoder · sequential information · parallel decoding · BLEU

The pith

Non-autoregressive translation recovers target word order via reinforcement training or a top-layer fused decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Non-autoregressive models generate all target words in parallel and therefore lose the sequential dependencies that autoregressive models capture step by step. The paper shows that this loss produces over- and under-translation, especially on long sentences. It introduces a reinforcement algorithm that trains NAT models at the full-sequence level and an FS-decoder that injects sequential information only into the uppermost decoder layer. Both changes improve BLEU while preserving the original parallel decoding speed.

Core claim

Reinforce-NAT and the FS-decoder retrieve the target sequential information that standard NAT discards, yielding higher BLEU scores than baseline NAT without any slowdown and, for the FS-decoder, performance comparable to the autoregressive Transformer with substantial speedup.

What carries the argument

The FS-decoder, which fuses sequential information exclusively into the top decoder layer, together with a reinforcement algorithm that performs sequence-level training of NAT.
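To make "sequence-level training" concrete, here is a minimal sketch of a plain REINFORCE gradient estimate with a mean-reward baseline, applied to a model whose positions are sampled independently. It is a generic textbook estimator, not the paper's Reinforce-NAT algorithm, whose details are not given in the text above. The function names, the toy vocabulary of integer ids, and the per-position match reward standing in for a sentence-level metric such as BLEU are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(probs, rng):
    """Sample every position independently, as a parallel decoder would."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in probs]


def toy_reward(seq, ref):
    """Hypothetical sequence-level reward: fraction of positions matching ref."""
    return sum(a == b for a, b in zip(seq, ref)) / len(ref)


def reinforce_gradient(logits, ref, n_samples, rng):
    """Estimate d E[reward] / d logits from sampled whole sequences.

    The batch-mean reward is subtracted as a baseline to reduce variance.
    Using the same samples for the baseline adds a small bias; this is a
    simplification for the sketch.
    """
    probs = [softmax(row) for row in logits]
    samples = [sample_sequence(probs, rng) for _ in range(n_samples)]
    rewards = [toy_reward(s, ref) for s in samples]
    baseline = sum(rewards) / len(rewards)

    grad = [[0.0] * len(row) for row in logits]
    for seq, reward in zip(samples, rewards):
        advantage = reward - baseline
        for t, tok in enumerate(seq):
            for v in range(len(logits[t])):
                # d log p(tok) / d logit_v = 1[v == tok] - p_v
                indicator = 1.0 if v == tok else 0.0
                grad[t][v] += advantage * (indicator - probs[t][v]) / n_samples
    return grad


rng = random.Random(0)
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]  # 3 positions, 4 token ids
grad = reinforce_gradient(logits, ref=[1, 2, 3], n_samples=16, rng=rng)
```

The reward is computed on the whole sampled sequence, so every position receives the same scalar signal. That shared signal is how a sequence-level objective can couple positions that the model itself predicts independently.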

If this is right

NAT can be trained end-to-end at the sequence level without variance explosion.
Sequential dependencies need only be restored at the final decoder stage to affect output quality.
The same parallel decoding schedule remains valid after the proposed changes.
Comparable quality to autoregressive models becomes reachable at NAT speeds on standard translation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reinforcement or top-layer fusion approach could be tested on other non-autoregressive generation tasks such as summarization or speech synthesis.
Combining the two proposed methods in one model might produce additive gains, though the paper does not report such an experiment.
The methods may reduce reliance on external distillation or fertility prediction tricks commonly used in NAT.

Load-bearing premise

Adding sequential information only through reinforcement or a single top decoder layer is sufficient to restore the dependencies NAT otherwise ignores, without creating new inconsistencies or requiring changes to parallel decoding.

What would settle it

On the same test sets, if Reinforce-NAT or FS-decoder still produces the same frequency of over- and under-translation errors on long sentences as the unmodified NAT baseline, the central claim is false.
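A rough way to run this check is sketched below. It assumes tokenized hypothesis and reference pairs are available, and it uses two crude proxies that are not taken from the paper: adjacent repeated tokens for over-translation and reference tokens absent from the hypothesis for under-translation. The function names and the length threshold of 20 tokens are hypothetical.

```python
from collections import Counter


def repeated_token_rate(hyp):
    """Share of adjacent token pairs that are identical (over-translation proxy)."""
    if len(hyp) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(hyp, hyp[1:]) if a == b)
    return repeats / (len(hyp) - 1)


def missing_token_rate(hyp, ref):
    """Share of reference tokens not covered by the hypothesis (under-translation proxy)."""
    if not ref:
        return 0.0
    hyp_counts = Counter(hyp)
    ref_counts = Counter(ref)
    missing = sum(max(0, count - hyp_counts[tok]) for tok, count in ref_counts.items())
    return missing / len(ref)


def error_rates_by_length(pairs, long_threshold=20):
    """Average both proxies separately for short and long references."""
    buckets = {"short": [], "long": []}
    for hyp, ref in pairs:
        key = "long" if len(ref) >= long_threshold else "short"
        buckets[key].append((repeated_token_rate(hyp), missing_token_rate(hyp, ref)))
    summary = {}
    for key, rows in buckets.items():
        if rows:
            over = sum(r[0] for r in rows) / len(rows)
            under = sum(r[1] for r in rows) / len(rows)
            summary[key] = (over, under)
    return summary
```

Running `error_rates_by_length` on the outputs of the baseline NAT and of each proposed model, over the same test set, would give the per-bucket comparison the paragraph above asks for.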

Figures

Figures reproduced from arXiv: 1906.09444 by Chenze Shao, Fandong Meng, Jie Zhou, Jinchao Zhang, Xilin Chen, Yang Feng.

**Figure 2.** top-

**Figure 3.** Training curves for k = 0, 1, 5 and 10.

**Figure 4.** The BLEU scores on the validation set of

Original abstract

Non-Autoregressive Transformer (NAT) aims to accelerate the Transformer model through discarding the autoregressive mechanism and generating target words independently, which fails to exploit the target sequential information. Over-translation and under-translation errors often occur for the above reason, especially in the long sentence translation scenario. In this paper, we propose two approaches to retrieve the target sequential information for NAT to enhance its translation ability while preserving the fast-decoding property. Firstly, we propose a sequence-level training method based on a novel reinforcement algorithm for NAT (Reinforce-NAT) to reduce the variance and stabilize the training procedure. Secondly, we propose an innovative Transformer decoder named FS-decoder to fuse the target sequential information into the top layer of the decoder. Experimental results on three translation tasks show that the Reinforce-NAT surpasses the baseline NAT system by a significant margin on BLEU without decelerating the decoding speed and the FS-decoder achieves comparable translation performance to the autoregressive Transformer with considerable speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reinforce-NAT and top-layer fusion give reported BLEU gains on NAT but rest on thin experimental reporting and an architectural choice that may not recover the lost dependencies.

The two concrete additions are a sequence-level reinforcement algorithm for training NAT and an FS-decoder that injects sequential information only into the top decoder layer. Both aim to reduce over- and under-translation while keeping parallel decoding intact, and the abstract states they deliver measurable BLEU improvement on three tasks without losing the speed edge over autoregressive baselines. That framing of the core NAT limitation is straightforward and the techniques are not just minor tweaks to prior work.

The experimental claims are the main soft spot. The abstract supplies no dataset sizes, no variance numbers, no ablation isolating each method, and no detail on how baselines were chosen or tuned, so the size and stability of the gains cannot be assessed from the given text. On the FS-decoder specifically, fusing context only at the final layer leaves the lower layers running without target-side order; the hidden states reaching the top layer are still produced by position-independent parallel computation, which makes it unclear whether the reported recovery of sequential information actually occurs or is limited to post-hoc correction.

A reader focused on latency-sensitive MT would get the most from this, since the problem it targets is practical. The work shows clear engagement with the NAT literature and the proposed fixes are falsifiable, so it deserves a serious referee even though the current evidence is incomplete and the decoder design needs closer scrutiny on propagation.

Referee Report

1 major / 1 minor

Summary. The paper claims that Non-Autoregressive Transformer (NAT) models can be improved by retrieving target sequential information via two methods: Reinforce-NAT, a novel reinforcement learning algorithm for sequence-level training that reduces variance and stabilizes training, and FS-decoder, which fuses target sequential information into the top decoder layer. Experiments on three translation tasks show Reinforce-NAT surpassing baseline NAT on BLEU without slowing decoding speed, while FS-decoder achieves performance comparable to the autoregressive Transformer with considerable speedup.

Significance. If the results hold, this would be a meaningful contribution to efficient NMT by addressing over- and under-translation errors in NAT while retaining parallel decoding advantages. The reinforcement-based training and top-layer fusion approach offer practical techniques for incorporating dependencies without altering the core parallel generation process.

major comments (1)

[FS-decoder] FS-decoder section: fusing sequential information exclusively into the top decoder layer while keeping lower layers fully parallel and position-independent means hidden states arriving at the top layer lack target dependencies. The top-layer fusion can at best post-correct outputs but cannot retroactively enforce consistency through the stack, which is load-bearing for the central claim that the method recovers discarded target dependencies without changing the parallel decoding process.

minor comments (1)

[Abstract] Abstract: reports performance gains on three tasks but supplies no dataset sizes, baseline details, variance numbers, or ablation results, limiting the ability to evaluate the strength of the empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We respond to the major comment below.

Referee: [FS-decoder] FS-decoder section: fusing sequential information exclusively into the top decoder layer while keeping lower layers fully parallel and position-independent means hidden states arriving at the top layer lack target dependencies. The top-layer fusion can at best post-correct outputs but cannot retroactively enforce consistency through the stack, which is load-bearing for the central claim that the method recovers discarded target dependencies without changing the parallel decoding process.

Authors: The lower layers of the FS-decoder are indeed kept fully parallel and position-independent to preserve the core NAT decoding speed. However, the fusion of target sequential information occurs at the top decoder layer, which directly produces the output token predictions. This design incorporates the sequential dependencies into the final hidden states used for prediction, allowing the model to mitigate over- and under-translation at the generation step. Because NAT generates all tokens in parallel, there is no sequential stack that must be traversed autoregressively; the top-layer fusion supplies the missing target-side context precisely where it is needed for the output. The experimental results across three tasks, showing FS-decoder performance comparable to the autoregressive Transformer with substantial speedup, provide evidence that this recovers the relevant dependencies without altering the parallel process. We disagree that the approach is limited to ineffective post-correction. If the manuscript description of the fusion mechanism requires additional detail for clarity, we can revise the FS-decoder section accordingly.

Revision: partial
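The disagreement between referee and authors can be pictured with a toy decoder. The sketch below is not the FS-decoder: it assumes per-position scores that stand in for the output of the parallel lower layers, and it replaces the learned top-layer fusion with a hand-set penalty on repeating the previously emitted token. The scores, the penalty value, and both function names are invented for illustration.

```python
def decode_parallel(position_scores):
    """Pure position-independent decoding: argmax at each position on its own."""
    return [max(scores, key=scores.get) for scores in position_scores]


def decode_with_top_step(position_scores, repeat_penalty=0.2):
    """Same lower-layer scores, plus a final step that sees the previous output token.

    The repeat penalty is a hypothetical stand-in for a learned fusion of
    target-side context; it can only adjust scores the lower layers produced.
    """
    output = []
    for scores in position_scores:
        adjusted = dict(scores)
        if output and output[-1] in adjusted:
            adjusted[output[-1]] -= repeat_penalty
        output.append(max(adjusted, key=adjusted.get))
    return output


# Invented scores for three target positions.
toy_scores = [
    {"thank": 0.6, "you": 0.4},
    {"thank": 0.55, "you": 0.45},
    {"you": 0.5, "much": 0.45},
]

print(decode_parallel(toy_scores))       # ['thank', 'thank', 'you']
print(decode_with_top_step(toy_scores))  # ['thank', 'you', 'much']
```

In this toy the final step removes the repeated word, which is the authors' point, but it does so only by re-ranking candidates that the position-independent scores already contained, which is the referee's point. Whether a learned top layer does more than this kind of re-ranking is the open question.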

Circularity Check

0 steps flagged

No significant circularity; empirical results only

The paper contains no mathematical derivations, equations, or first-principles claims. It proposes two empirical methods (Reinforce-NAT and FS-decoder) and reports BLEU scores from experiments on translation tasks. No self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external experimental comparisons rather than quantities defined in terms of the paper's own fitted values or prior self-citations. This is the expected non-finding for an applied empirical NLP paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The work relies on standard Transformer and reinforcement-learning background assumptions from prior literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "we propose an innovative Transformer decoder named FS-decoder to fuse the target sequential information into the top layer of the decoder"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Reinforce-NAT ... to reduce the variance and stabilize the training procedure"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

