Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Bart van Merrienboer; Caglar Gulcehre; Dzmitry Bahdanau; Fethi Bougares; Holger Schwenk; Kyunghyun Cho; Yoshua Bengio

arXiv: 1406.1078 · v3 · submitted 2014-06-03 · cs.CL · cs.LG · cs.NE · stat.ML

Pith reviewed 2026-05-12 23:26 UTC · model grok-4.3

classification cs.CL · cs.LG · cs.NE · stat.ML

keywords RNN Encoder-Decoder, statistical machine translation, phrase representations, recurrent neural networks, conditional probability, log-linear model, sequence modeling

The pith

The RNN Encoder-Decoder computes phrase probabilities that improve statistical machine translation when added to log-linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RNN Encoder-Decoder, which consists of two recurrent neural networks: one encodes a source phrase into a fixed-length vector, and the other decodes that vector into a target phrase. The networks are jointly trained to maximize the conditional probability of the target phrase given the source phrase. When the probabilities the model computes for phrase pairs are used as an additional feature in the log-linear model of an existing statistical machine translation system, translation performance improves empirically. The model also produces vector representations of phrases that reflect semantic and syntactic properties.

Core claim

The RNN Encoder-Decoder maps a variable-length source sequence to a fixed-length vector via an encoder RNN and then generates the target sequence from that vector via a decoder RNN. The two networks are trained end-to-end to maximize the conditional probability of a target phrase given a source phrase. Incorporating the resulting phrase-pair probabilities as an extra feature in the log-linear model of a phrase-based statistical machine translation system yields improved translation quality, and the learned representations exhibit semantic and syntactic structure.

What carries the argument

RNN Encoder-Decoder architecture in which an encoder recurrent network compresses an input sequence into a fixed-length vector and a decoder recurrent network generates the output sequence from that vector, trained jointly on conditional sequence probability.
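
A minimal sketch of this encode-then-decode computation, assuming NumPy, a plain tanh recurrence, and randomly initialised weights. The class name `ToyEncoderDecoder`, its parameter names, the zero start-symbol embedding, the direct copy of the summary into the decoder state, and the omission of an end-of-sequence symbol are all illustrative assumptions, not details taken from the paper. It shows how a source phrase of any length is reduced to one fixed-length vector and how that vector yields a log-probability for a target phrase.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class ToyEncoderDecoder:
    """Hypothetical toy model: plain tanh RNNs, untrained random weights."""

    def __init__(self, src_vocab, tgt_vocab, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.E_src = rng.normal(0, s, (src_vocab, hidden))  # source embeddings
        self.E_tgt = rng.normal(0, s, (tgt_vocab, hidden))  # target embeddings
        self.W_enc = rng.normal(0, s, (hidden, hidden))     # encoder recurrence
        self.W_dec = rng.normal(0, s, (hidden, hidden))     # decoder recurrence
        self.C_dec = rng.normal(0, s, (hidden, hidden))     # feeds the summary into each decoder step
        self.W_out = rng.normal(0, s, (hidden, tgt_vocab))  # output projection

    def encode(self, src_ids):
        """Fold a variable-length source phrase into one fixed-length vector."""
        h = np.zeros(self.W_enc.shape[0])
        for i in src_ids:
            h = np.tanh(self.E_src[i] + self.W_enc @ h)
        return h

    def log_prob(self, src_ids, tgt_ids):
        """Sum of per-token log-probabilities of the target given the source."""
        c = self.encode(src_ids)
        h = c.copy()                 # assumption: decoder state starts at the summary
        prev = np.zeros_like(c)      # assumption: zero vector stands in for a start symbol
        total = 0.0
        for y in tgt_ids:
            h = np.tanh(prev + self.W_dec @ h + self.C_dec @ c)
            p = softmax(h @ self.W_out)
            total += np.log(p[y])
            prev = self.E_tgt[y]
        return total

model = ToyEncoderDecoder(src_vocab=5, tgt_vocab=4)
lp = model.log_prob([1, 2, 3], [0, 2])   # a negative number
```

With untrained weights the value of `lp` carries no meaning beyond illustrating the data flow; training would adjust all matrices to raise it on observed phrase pairs.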

If this is right

Statistical machine translation systems can be strengthened by treating the neural model's phrase probabilities as an extra scoring feature.
The encoder produces fixed-length vectors that preserve the information needed to reconstruct target phrases accurately.
The training objective leads to phrase representations that group phrases by semantic and syntactic similarity.
Phrase-based translation pipelines can incorporate neural sequence modeling without replacing the entire log-linear framework.
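
A small sketch of how an extra feature enters a log-linear score, assuming features are already in the log domain. The feature names, values, and weights below are invented for illustration and do not come from the paper or from any particular SMT toolkit; in a real system the weights would be set by a tuning procedure such as MERT.

```python
def log_linear_score(features, weights):
    """Weighted sum of feature values: the unnormalised log-score of a candidate."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical log-domain feature values for two candidate translations.
candidates = {
    "cand_a": {"phrase_table": -2.0, "language_model": -5.0, "rnn_encdec": -1.0},
    "cand_b": {"phrase_table": -1.5, "language_model": -5.2, "rnn_encdec": -4.0},
}

# Baseline ignores the neural feature; the augmented system gives it a nonzero weight.
baseline_w = {"phrase_table": 1.0, "language_model": 1.0, "rnn_encdec": 0.0}
augmented_w = {"phrase_table": 1.0, "language_model": 1.0, "rnn_encdec": 0.5}

def best(weights):
    """Return the name of the highest-scoring candidate under the given weights."""
    return max(candidates, key=lambda name: log_linear_score(candidates[name], weights))

baseline_choice = best(baseline_w)
augmented_choice = best(augmented_w)
```

In this toy setting the two systems pick different candidates, which is the only way an added feature can change translation output: by reordering hypotheses that the existing features scored closely.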

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fixed-vector encoding could support phrase similarity measures or paraphrase generation in other language tasks.
Hybrid statistical-neural scoring may prove useful for sequence problems outside translation where explicit features already exist.
If the vector representation is informationally complete, the architecture could be tested on longer contexts or non-linguistic sequences.

Load-bearing premise

The fixed-length vector from the encoder retains enough information about the source phrase for the decoder to generate accurate target phrases, and the resulting probabilities supply information that is genuinely new relative to the existing features in the log-linear model.

What would settle it

A side-by-side evaluation of a statistical machine translation system on a held-out test set that shows no improvement in standard quality metrics when the RNN Encoder-Decoder probabilities are added as a feature.
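
One way such a comparison could be made more robust is paired bootstrap resampling over the test set. The sketch below is an assumption-laden illustration: it uses per-sentence quality scores, whereas BLEU is a corpus-level metric that would have to be recomputed on each resample, and the function name `paired_bootstrap` is hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system B's total score beats system A's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw sentence indices with replacement; both systems use the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

A win fraction near 0.5 would indicate that the added feature makes no reliable difference on the held-out set, which is the outcome that would count against the paper's claim.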

Original abstract

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the RNN Encoder-Decoder architecture consisting of two RNNs: an encoder that compresses a variable-length source phrase into a fixed-length vector and a decoder that generates the corresponding target phrase from that vector. The model is trained end-to-end to maximize the conditional probability of the target phrase given the source phrase. The authors then use the log-probabilities produced by the trained model as an additional feature inside the log-linear model of a phrase-based statistical machine translation system and report improved BLEU scores on English-to-French WMT data; they also present qualitative nearest-neighbor analyses indicating that the learned vectors capture syntactic and semantic regularities.

Significance. If the reported gains are reproducible, the work provides early empirical evidence that a neural sequence model can contribute information complementary to conventional SMT features (phrase table, language model, etc.) even when restricted to short phrases. The qualitative results further demonstrate that fixed-length encodings can preserve linguistically meaningful structure for phrases, a concrete illustration of the representational power of an architecture that later influenced neural machine translation.

major comments (2)

[§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.
[§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.

minor comments (2)

[Abstract] The claim of empirical improvement is made without any numerical result (BLEU delta, data size, etc.), which reduces the abstract’s utility as a standalone summary.
[§3 (Notation)] The update equations for the RNN hidden states are given, but the symbols for the weight matrices and bias vectors are not collected in one place, making it harder to verify the parameter count and implementation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments on our work. We address each major comment below and have revised the manuscript to improve clarity and completeness.

Referee: [§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers the magnitude and reliability of the central empirical claim cannot be assessed.

Authors: We agree that absolute BLEU scores and details on statistical reliability strengthen the empirical claim. The revised manuscript now explicitly reports the baseline BLEU score and the score obtained after adding the RNN Encoder-Decoder probability as an additional log-linear feature. We also include results from multiple MERT runs with variance estimates and note that the observed improvement is consistent, although a full bootstrap significance test was not performed in the original experiments.
Revision: yes

Referee: [§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level; the paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.

Authors: We thank the referee for highlighting this lack of detail. The decoder is initialized directly with the encoder’s final hidden state (no learned projection). We have revised Section 3.2 to state this explicitly. The model was trained on phrase pairs whose lengths follow the distribution in the WMT training data (predominantly short phrases, with a maximum length of 30 tokens); we have added this information and a brief histogram to the revised manuscript.
Revision: yes
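
The two initialisation options named in this exchange can be contrasted in a few lines. This is a sketch only, assuming NumPy; the function names, the tanh squashing in the projected variant, and the dimensions are illustrative choices, and the code takes no position on which variant the paper actually uses.

```python
import numpy as np

def init_direct(c):
    """Decoder starts from the encoder summary unchanged (widths must match)."""
    return c.copy()

def init_projected(c, V, b):
    """Decoder starts from a learned affine map of the summary, squashed by tanh."""
    return np.tanh(V @ c + b)

rng = np.random.default_rng(0)
c = rng.normal(size=4)           # stand-in for an encoder's final hidden state
V = rng.normal(size=(6, 4))      # hypothetical projection; lets decoder width differ
b = np.zeros(6)

h0_direct = init_direct(c)               # shape (4,)
h0_projected = init_projected(c, V, b)   # shape (6,)
```

The practical difference is that the direct variant ties the decoder's hidden size to the encoder's, while the projected variant adds parameters but decouples the two sizes.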

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution is an empirical demonstration that phrase-pair conditional probabilities from a jointly trained RNN Encoder-Decoder improve BLEU when added as one extra feature to a standard SMT log-linear model. The RNN is trained end-to-end on an explicit maximum-likelihood objective (maximizing p(target phrase | source phrase)) that does not reference the downstream SMT weights, phrase table, or MERT procedure. No equation or claim reduces the reported performance gain to a fitted parameter by construction, and the paper contains no load-bearing self-citations that would force the result. The derivation chain is therefore self-contained against external benchmarks.
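
Written out, the maximum-likelihood objective described above has roughly the following form. The symbols are assumed notation: \theta collects the parameters of both networks, (\mathbf{x}_n, \mathbf{y}_n) are the N training phrase pairs, T is the target length, and \mathbf{c}(\mathbf{x}) is the encoder's fixed-length summary of \mathbf{x}.

```latex
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\left(\mathbf{y}_n \mid \mathbf{x}_n\right),
\qquad
p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p_{\theta}\left(y_t \mid y_1, \dots, y_{t-1}, \mathbf{c}(\mathbf{x})\right)
```

No SMT feature weight, phrase-table score, or tuning step appears in either expression, which is the basis for the no-circularity finding.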

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the standard assumption that RNNs can be trained via backpropagation on sequence data. The central claim rests on the unstated premise that a fixed-length vector suffices for phrase-level translation modeling.

axioms (1)

Domain assumption: Recurrent neural networks can be jointly trained to encode and decode variable-length sequences via maximum conditional likelihood.
Invoked by the proposal of the encoder-decoder architecture and its training objective.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PathVQA: 30000+ Questions for Medical Visual Question Answering
cs.CL 2020-03 accept novelty 8.0

PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
eess.SP 2026-05 unverdicted novelty 7.0

NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
hep-ph 2026-05 unverdicted novelty 7.0

Nested-GPT is an autoregressive Transformer that dynamically generates variable-multiplicity parton showers matching Monte Carlo references for non-global logarithm resummation in the large-Nc limit.
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
hep-ph 2026-05 unverdicted novelty 7.0

Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-globa...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
cond-mat.str-el 2026-05 conditional novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
Zero-shot Imitation Learning by Latent Topology Mapping
cs.LG 2026-05 unverdicted novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
cs.LG 2026-05 unverdicted novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
cs.LG 2026-05 unverdicted novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
quant-ph 2026-04 conditional novelty 7.0

Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
cs.LG 2026-04 unverdicted novelty 7.0

HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete mul...
Denoising Particle Filters: Learning State Estimation with Single-Step Objectives
cs.RO 2026-02 conditional novelty 7.0

Denoising particle filters train state estimators on individual transitions via score matching, then use the learned denoiser with a dynamics model to approximate Bayesian filtering step-by-step, matching end-to-end b...
MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection
cs.CR 2026-02 unverdicted novelty 7.0

MELT is the first behavioral trace dataset for high-risk memecoin launch detection on Solana, providing 122 features, risk annotations, and ML benchmarks that reduce investment loss when used for selection.
Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
cs.CL 2025-11 unverdicted novelty 7.0

CogAlpha combines LLM reasoning with code-level evolutionary search to discover financial alphas that show higher predictive accuracy and generalization than prior methods on five stock datasets.
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Human Motion Diffusion Model
cs.CV 2022-09 unverdicted novelty 7.0

MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
Brno Mobile OCR Dataset
cs.CV 2019-07 accept novelty 7.0

Introduces B-MOD dataset of 19,728 mobile device photos of documents with precise text line annotations and a neural baseline showing high error rates on harder images.
Graph Attention Networks
stat.ML 2017-10 accept novelty 7.0

Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...
Mixed Precision Training
cs.AI 2017-10 accept novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
Generative Recursive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

GRAM turns recursive latent reasoning into a generative probabilistic model via stochastic trajectories and amortized variational inference, claiming better performance on structured reasoning tasks than deterministic...
Generative Recursive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
cs.GR 2026-05 unverdicted novelty 6.0

3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
Graph Federated Unlearning for Privacy Preservation
cs.LG 2026-05 unverdicted novelty 6.0

Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.
Deep Kernel Learning for Stratifying Glaucoma Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current ...
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
cs.LG 2026-04 conditional novelty 6.0

IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
cs.CR 2026-04 unverdicted novelty 6.0

MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...
Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings
cs.CE 2026-04 unverdicted novelty 6.0

TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
cs.RO 2026-04 unverdicted novelty 6.0

Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
Upper Generalization Bounds for Neural Oscillators
cs.LG 2026-03 conditional novelty 6.0

Upper generalization bounds for neural oscillators scale polynomially with MLP size and time length, avoiding the curse of parametric complexity, with numerical validation on a Bouc-Wen nonlinear system.
Beyond Static: Related Questions Retrieval Through Conversations in Community Question Answering
cs.IR 2026-03 unverdicted novelty 6.0

TeCQR retrieves related questions in cQA by generating tag-enhanced clarifying questions, using noise-tolerant semantic matching, and two-stage training to learn fine-grained representations of queries, questions, and tags.
Attention-Based Neural-Augmented Kalman Filter for Legged Robot State Estimation
cs.RO 2026-01 unverdicted novelty 6.0

AttenNKF augments InEKF with an attention-based neural compensator trained in latent space to correct foot-slip errors in legged robot state estimation.
AsarRec: Adaptive Sequential Augmentation for Robust Self-supervised Sequential Recommendation
cs.IR 2025-12 unverdicted novelty 6.0

AsarRec learns adaptive sequence augmentations via transformation matrices and Semi-Sinkhorn projection to improve robustness of self-supervised sequential recommenders under noise.
Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
cs.CV 2025-10 conditional novelty 6.0

Cataract-LMM is a new multi-source dataset of 3000 annotated phacoemulsification videos enabling benchmarks for phase recognition, scene segmentation, interaction tracking, and automated skill assessment.
RAPTOR: A Foundation Policy for Quadrotor Control
cs.RO 2025-09 unverdicted novelty 6.0

A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
Scalable Option Learning in High-Throughput Environments
cs.LG 2025-08 unverdicted novelty 6.0

SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
cs.LG 2025-08 unverdicted novelty 6.0

Gating in RNNs couples state time-scales with parameter gradients to produce lag- and direction-dependent effective learning rates, shown via exact Jacobians and first-order expansion.
SpectraLLM: Uncovering the Ability of LLMs for Molecular Structure Elucidation from Multi-Spectral Data
q-bio.QM 2025-08 unverdicted novelty 6.0

SpectraLLM is an LLM fine-tuned to predict small-molecule structures from single or multiple spectra, reporting state-of-the-art results on four public benchmarks with gains from multi-modal input.
Chinese Cyberbullying Detection: Dataset, Method, and Validation
cs.CL 2025-05 unverdicted novelty 6.0

Introduces CHNCI, the first Chinese cyberbullying incident detection dataset with 220,676 comments across 91 incidents, created via ensemble pseudo-labeling from explanation-generating methods followed by human annotation.
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
cs.CV 2025-05 unverdicted novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
Decentralized Collective World Model for Emergent Communication and Coordination
cs.MA 2025-04 unverdicted novelty 6.0

A decentralized collective world model integrates predictive coding with bidirectional communication to achieve simultaneous symbol emergence and coordination, outperforming non-communicative baselines in a two-agent ...
Pretraining a Foundation Model for Small-Molecule Natural Products
q-bio.QM 2025-03 unverdicted novelty 6.0

NaFM is a pretrained foundation model for natural products using scaffold-focused contrastive learning and masked graph objectives that achieves SOTA on taxonomy classification, gene/microbial analysis, and virtual sc...
Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code
cs.CR 2025-03 unverdicted novelty 6.0

ByteTR recovers variable types in binary code more effectively than prior methods by decoupling unbalanced type sets, mitigating compiler optimization effects via static analysis, and modeling inter-procedural data fl...
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
cs.LG 2025-02 unverdicted novelty 6.0

TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datase...
SAM 2: Segment Anything in Images and Videos
cs.CV 2024-08 conditional novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
cs.CV 2023-12 unverdicted novelty 6.0

READ recurrent adapters with partial video-language alignment via optimal transport outperform standard fine-tuning on low-resource temporal grounding and summarization tasks.
Gated Linear Attention Transformers with Hardware-Efficient Training
cs.LG 2023-12 unverdicted novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
cs.CL 2020-02 unverdicted novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
eess.AS 2019-07 unverdicted novelty 6.0

CycleVAE optimizes non-parallel voice conversion indirectly via cyclic reconstructed spectra, yielding higher spectral accuracy, latent feature correlation, and improved converted speech quality.
R-Transformer: Recurrent Neural Network Enhanced Transformer
cs.LG 2019-07 unverdicted novelty 6.0

R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.
Time2Vec: Learning a Vector Representation of Time
cs.LG 2019-07 unverdicted novelty 6.0

Time2Vec learns a vector representation of time that improves model performance when used in place of raw time inputs across various models and problems.
A Unified Framework of Online Learning Algorithms for Training Recurrent Neural Networks
cs.LG 2019-07 accept novelty 6.0

A framework unifies recent online RNN training algorithms along four axes and demonstrates performance clustering on synthetic tasks, indicating that gradient alignment is insufficient to explain success especially fo...
Learning Blended, Precise Semantic Program Embeddings
cs.SE 2019-07 unverdicted novelty 6.0

LIGER blends symbolic and concrete traces to learn precise semantic program embeddings, outperforming syntax-based models on CoSET classification and code2seq on method name prediction while using fewer executions.
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
cs.CL 2019-06 conditional novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
cs.CV 2019-06 unverdicted novelty 6.0

The authors replace next-word log-likelihood training with word-embedding regression in an encoder-decoder captioning model and report CIDEr 125.0 and BLEU-4 50.5 on MS-COCO, exceeding prior bests of 117.1 and 48.0.
Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
cs.LG 2019-06 unverdicted novelty 6.0

RL policies decompose into information-regularized primitives that compete by requesting state information amounts, with the greediest one acting, yielding better generalization than flat or hierarchical baselines.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 102 Pith papers

[1]

[Auli et al.2013] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig

work page 2013
[2]

In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1044–1054

Joint language and translation modeling with recurrent neural net- works. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1044–1054. [Axelrod et al.2011] Amittai Axelrod, Xiaodong He, and Jianfeng Gao

work page 2011
[3]

In Proceedings of the ACL Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 355–362

Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 355–362. [Bastien et al.2012] Fr ´ed´eric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio

work page 2012
[4]

Deep Learning and Unsupervised Fea- ture Learning NIPS 2012 Workshop

Theano: new features and speed im- provements. Deep Learning and Unsupervised Fea- ture Learning NIPS 2012 Workshop. [Bengio et al.2003] Yoshua Bengio, R ´ejean Ducharme, Pascal Vincent, and Christian Janvin

work page 2012
[5]

A neu- ral probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March. [Bengio et al.2013] Y . Bengio, N. Boulanger- Lewandowski, and R. Pascanu

work page 2013
[6]

, May. [Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio

work page 2010
[7]

In Proceedings of the Python for Scientific Computing Conference (SciPy), June

Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation. [Chandar et al.2014] Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha

work page 2014
[8]

arXiv:1402.1454 [cs.CL], February

An autoencoder approach to learning bilingual word representations. arXiv:1402.1454 [cs.CL], February. [Dahl et al.2012] George E. Dahl, Dong Yu, Li Deng, and Alex Acero

work page arXiv 2012
[9]

IEEE Transactions on Audio, Speech, and Language Processing, 20(1):33–42

Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):33–42. [Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul

work page 2014
[10]

In Proceedings of the ACL 2014 Conference, ACL ’14, pages 1370–1380

Fast and robust neural network joint models for statistical machine translation. In Proceedings of the ACL 2014 Conference, ACL ’14, pages 1370–1380. [Gao et al.2013] Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng

work page 2014
[11]

Technical report, Microsoft Research

Learning semantic representations for the phrase translation model. Technical report, Microsoft Research. [Glorot et al.2011] X. Glorot, A. Bordes, and Y. Bengio

work page 2011
[12]

In AISTATS’2011

Deep sparse rectifier neural networks. In AISTATS’2011. [Goodfellow et al.2013] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio

work page 2011
[13]

In ICML’2013

Maxout networks. In ICML’2013. [Graves2012] Alex Graves

work page 2013
[14]

In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709

Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. [Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu

work page 2003
[15]

In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54

Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54. [Koehn2005] P. Koehn

work page 2003
[16]

In Machine Translation Summit X, pages 79–86, Phuket, Thailand

Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79–86, Phuket, Thailand. [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton

work page 2012
[17]

In Advances in Neural Information Processing Systems 25 (NIPS’2012)

ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012). [Marcu and Wong2002] Daniel Marcu and William Wong

work page 2012
[18]

In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 133–139

A phrase-based, joint probability model for statistical machine translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 133–139. [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean

work page 2013
[19]

In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA, USA

Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 220–224, Stroudsburg, PA, USA. [Pascanu et al.2014] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio

work page 2010
[20]

In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April

How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April. [Saxe et al.2014] Andrew M. Saxe, James L. McClelland, and Surya Ganguli

work page 2014
[21]

In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April. [Schwenk et al.2006] Holger Schwenk, Marta R. Costa-Jussà, and José A. R. Fonollosa

work page 2014
[22]

In IWSLT, pages 166–173

Continuous space language models for the IWSLT 2006 task. In IWSLT, pages 166–173. [Schwenk2007] Holger Schwenk

work page 2006
[23]

In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080

Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. [Socher et al.2011] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning

work page 2011
[24]

[Son et al.2012] Le Hai Son, Alexandre Allauzen, and François Yvon

work page 2012
[25]

Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pages 39–48, Stroudsburg, PA, USA. [van der Maaten2013] Laurens van der Maaten

work page 2012
[26]

In Proceedings of the First International Conference on Learning Representations (ICLR 2013), May

Barnes-Hut-SNE. In Proceedings of the First International Conference on Learning Representations (ICLR 2013), May. [Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang

work page 2013
[27]

ADADELTA: An Adaptive Learning Rate Method

ADADELTA: an adaptive learning rate method. Technical report, arXiv 1212.5701. [Zou et al.2013] Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning

work page Pith review arXiv 2013
