Sequence to Sequence Learning with Neural Networks

Ilya Sutskever; Oriol Vinyals; Quoc V. Le

arXiv: 1409.3215 · v3 · pith:UUPZ7EDXnew · submitted 2014-09-10 · cs.CL · cs.LG

keywords: lstm, sequence, bleu, score, learning, sentences, target, dataset

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
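
The source-reversal step described at the end of the abstract can be illustrated with a minimal preprocessing sketch. It assumes sentences are already tokenized into lists of strings; the function name `reverse_source` and the sample sentence pair are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

SentencePair = Tuple[List[str], List[str]]


def reverse_source(pairs: List[SentencePair]) -> List[SentencePair]:
    """Reverse the token order of each source sentence; leave targets unchanged."""
    return [(list(reversed(src)), list(tgt)) for src, tgt in pairs]


# Hypothetical, pre-tokenized English-French pair used only for illustration.
pairs = [(["the", "cat", "sat"], ["le", "chat", "s'est", "assis"])]
reversed_pairs = reverse_source(pairs)
```

After reversal, the first source word sits immediately before the first target word in the concatenated input, which is one way to picture the short-term dependencies the abstract credits for easier optimization.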

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A document is worth a structured record: Principled inductive bias design for document recognition
cs.CV 2025-07 unverdicted novelty 8.0
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...

Adaptive Computation Time for Recurrent Neural Networks
cs.NE 2016-03 accept novelty 8.0
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

Neural Turing Machines
cs.NE 2014-10 unverdicted novelty 8.0
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.

Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
quant-ph 2026-04 conditional novelty 7.0
Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
cs.CL 2016-11 accept novelty 7.0
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.

HRM-Text: Efficient Pretraining Beyond Scaling
cs.CL 2026-05 unverdicted novelty 6.0
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design
cs.CV 2026-05 unverdicted novelty 6.0
A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
cs.RO 2026-03 conditional novelty 6.0
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

Boosting Team Modeling through Tempo-Relational Representation Learning
cs.LG 2025-07 unverdicted novelty 6.0
A tempo-relational neural architecture jointly models temporal and relational aspects of team interactions to outperform prior approaches on team performance prediction and enable efficient multi-task prediction of te...

Large Language Models for Market Research: A Data-augmentation Approach
cs.AI 2024-12 unverdicted novelty 6.0
A data-augmentation framework for conjoint analysis integrates LLM-generated data with human responses to yield consistent, asymptotically normal estimators and reported cost savings of 24.9-79.8% in two empirical studies.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Separable Convolutional LSTMs for Faster Video Segmentation
cs.CV 2019-07 unverdicted novelty 6.0
Separable convLSTMs cut parameters and FLOPs in video segmentation, delivering up to 15% faster GPU inference with similar or slightly lower accuracy.

Universal Transformers
cs.CL 2018-07 unverdicted novelty 6.0
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

Are Candidate Models Really Needed for Active Learning?
cs.CV 2026-05 unverdicted novelty 5.0
Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.

Sessa: Selective State Space Attention
cs.LG 2026-04 unverdicted novelty 5.0
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
cs.SE 2026-04 unverdicted novelty 5.0
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
cs.AR 2025-12 unverdicted novelty 5.0
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

Learning to Reformulate the Queries on the WEB
cs.IR 2019-07 unverdicted novelty 5.0
An unsupervised character-level CNN encoder with attention-based RNN decoder, trained on Clueweb09 anchor phrases, generates query reformulations that improve retrieval on TREC collections.

Exploring Vision Neural Network Pruning via Screening Methodology
cs.LG 2025-02 unverdicted novelty 4.0
A unified F-statistic screening and weighted evaluation method prunes both unstructured and structured parameters in FNNs and CNNs, claiming order-of-magnitude size reduction with competitive accuracy on vision datasets.

Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts
cs.CL 2023-04 unverdicted novelty 3.0
A Transformer augmented with a confidence score mechanism outperforms LSTM and GRU baselines on correcting OCR errors in paired Tibetan manuscript data.

Bridging Language Models and Financial Analysis
q-fin.ST 2025-03 unverdicted novelty 2.0
A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.