
arxiv: 1610.10099 · v2 · pith:RRDQAK5Enew · submitted 2016-10-31 · 💻 cs.CL · cs.LG

Neural Machine Translation in Linear Time

classification 💻 cs.CL cs.LG
keywords: bytenet, network, neural, translation, decoder, sequences, time, alignment

We present a novel neural network for processing sequences. The ByteNet is a one-dimensional convolutional neural network that is composed of two parts, one to encode the source sequence and the other to decode the target sequence. The two network parts are connected by stacking the decoder on top of the encoder and preserving the temporal resolution of the sequences. To address the differing lengths of the source and the target, we introduce an efficient mechanism by which the decoder is dynamically unfolded over the representation of the encoder. The ByteNet uses dilation in the convolutional layers to increase its receptive field. The resulting network has two core properties: it runs in time that is linear in the length of the sequences and it sidesteps the need for excessive memorization. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks. The ByteNet also achieves state-of-the-art performance on character-to-character machine translation on the English-to-German WMT translation task, surpassing comparable neural translation models that are based on recurrent networks with attentional pooling and run in quadratic time. We find that the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
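The abstract's point that dilation enlarges the receptive field can be illustrated with a small sketch. The abstract does not state a kernel size or a dilation schedule, so the values below (kernel size 3, dilation doubling from 1 to 16) are assumptions chosen for illustration, and the function names `receptive_field` and `causal_dilated_conv` are hypothetical helpers rather than anything taken from the paper's code.

```python
def receptive_field(kernel_size, dilations):
    """Input positions visible to one output of a stack of dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Causal 1-D convolution: out[t] uses x[t], x[t - d], x[t - 2d], ...

    weights[0] multiplies the current position; positions before the
    start of the sequence are treated as zero.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed settings (not from the abstract): kernel size 3, doubling dilations.
dilations = [1, 2, 4, 8, 16]
rf = receptive_field(3, dilations)

# Push a unit impulse through the stack with all-ones kernels; the number of
# nonzero outputs shows how far a single input position propagates.
signal = [1.0] + [0.0] * 99
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0, 1.0], d)
span = sum(1 for v in signal if v != 0.0)

print(rf, span)
```

Under these assumed settings, five layers cover 63 positions, whereas five undilated layers of the same kernel size would cover only 11. Each layer still does a constant amount of work per position, which is consistent with the linear running time the abstract claims.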



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

    cs.NE 2026-04 unverdicted novelty 7.0

    MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.

  2. Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    cs.LG 2026-05 unverdicted novelty 6.0

    CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.

  3. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    cs.LG 2021-04 accept novelty 6.0

    Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

  4. Compressive Transformers for Long-Range Sequence Modelling

    cs.LG 2019-11 unverdicted novelty 6.0

    Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

  5. Learning to Reformulate the Queries on the WEB

    cs.IR 2019-07 unverdicted novelty 5.0

    An unsupervised character-level CNN encoder with an attention-based RNN decoder, trained on ClueWeb09 anchor phrases, generates query reformulations that improve retrieval on TREC collections.

  6. Attention Is All You Need

    cs.CL 2017-06 unverdicted novelty 5.0


  7. Hierarchical Sequence to Sequence Voice Conversion with Limited Data

    eess.AS 2019-07 unverdicted novelty 4.0

    A hierarchical seq2seq model for parallel voice conversion, pretrained as an autoencoder on single-speaker data and then adapted to limited multispeaker data, with mel spectrograms converted to audio by a WaveNet vocoder.

  8. Improving Zero-shot Translation with Language-Independent Constraints

    cs.CL 2019-06 unverdicted novelty 4.0

    Language-independent constraints and regularization in multilingual Transformer NMT yield a 2.23 BLEU average gain on zero-shot pairs from the IWSLT 2017 dataset.

  9. A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

    eess.AS 2026-05 unverdicted novelty 2.0

    A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.