Non-Autoregressive Neural Machine Translation

Jiatao Gu , James Bradbury , Caiming Xiong , Victor O.K. Li , Richard Socher

Authors on Pith no claims yet

classification 💻 cs.CL cs.LG

keywords autoregressivebleufertilitiesinferencemachinemodelneuralnon-autoregressive

read the original abstract

Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC 2026-05 unverdicted novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
PlayGen-MoG: Framework for Diverse Multi-Agent Play Generation via Mixture-of-Gaussians Trajectory Prediction
cs.CV 2026-04 unverdicted novelty 7.0

PlayGen-MoG uses a shared Mixture-of-Gaussians head across agents plus relative attention to generate diverse coordinated plays from a single static formation, achieving 1.68 yard ADE and 3.98 yard FDE with full mixtu...
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
cs.CL 2026-05 conditional novelty 6.0

Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.