Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Pith reviewed 2026-05-12 23:26 UTC · model grok-4.3
The pith
The RNN Encoder-Decoder computes phrase probabilities that improve statistical machine translation when added to log-linear models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The RNN Encoder-Decoder maps a variable-length source sequence to a fixed-length vector via an encoder RNN and then generates the target sequence from that vector via a decoder RNN. The two networks are trained end-to-end to maximize the conditional probability of a target phrase given a source phrase. Incorporating the resulting phrase-pair probabilities as an extra feature in the log-linear model of a phrase-based statistical machine translation system yields improved translation quality, and the learned representations exhibit semantic and syntactic structure.
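The training objective described above can be written out explicitly. The following is a sketch in conventional notation; the symbols \theta (all trainable parameters), c (the fixed-length encoder vector), N (number of training phrase pairs), and T' (target length) are notation choices made here, not taken from this page.

```latex
% Decoder factorization: each target symbol is conditioned on the
% previous target symbols and on the encoder summary c of the source x.
\[
  p_\theta(y \mid x) \;=\; \prod_{t=1}^{T'} p_\theta\bigl(y_t \mid y_1, \dots, y_{t-1},\, c\bigr)
\]

% Joint training of encoder and decoder: maximize the average
% conditional log-likelihood over N (source, target) phrase pairs.
\[
  \max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_\theta\bigl(y_n \mid x_n\bigr)
\]
```

The quantity \log p_\theta(y \mid x) for a given phrase pair is what is later handed to the translation system as one additional feature value.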
What carries the argument
The RNN Encoder-Decoder architecture: an encoder recurrent network compresses an input sequence into a fixed-length vector, and a decoder recurrent network generates the output sequence from that vector. Both networks are trained jointly to maximize the conditional probability of the output sequence.
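To make the data flow concrete, here is a minimal sketch of a forward pass that scores a phrase pair. It is an illustration under stated assumptions, not the authors' implementation: it uses a plain tanh recurrence rather than any gated unit, random untrained weights, a decoder state that starts from the encoder's final state, and the context vector fed to the decoder at every step. The names `TinyEncoderDecoder`, `BOS`, and `EOS` are hypothetical.

```python
import math
import random

BOS, EOS = 0, 1  # hypothetical ids for begin/end-of-sequence in the target vocabulary


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


class TinyEncoderDecoder:
    """Untrained toy model; only the shape of the computation is meaningful."""

    def __init__(self, src_vocab, tgt_vocab, hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.E_src = rand_matrix(src_vocab, hidden, rng)  # source embeddings
        self.U_enc = rand_matrix(hidden, hidden, rng)     # encoder recurrence
        self.E_tgt = rand_matrix(tgt_vocab, hidden, rng)  # target embeddings
        self.U_dec = rand_matrix(hidden, hidden, rng)     # decoder recurrence
        self.C_dec = rand_matrix(hidden, hidden, rng)     # context-to-decoder
        self.O = rand_matrix(tgt_vocab, hidden, rng)      # output projection

    def encode(self, src_ids):
        """Compress a variable-length source into one fixed-length vector."""
        h = [0.0] * self.hidden
        for tok in src_ids:
            rec = matvec(self.U_enc, h)
            h = [math.tanh(e + r) for e, r in zip(self.E_src[tok], rec)]
        return h

    def log_prob(self, src_ids, tgt_ids):
        """Return log p(target | source); tgt_ids should end with EOS."""
        c = self.encode(src_ids)
        s = list(c)  # assumption: decoder starts from the encoder's final state
        ctx = matvec(self.C_dec, c)
        prev, total = BOS, 0.0
        for tok in tgt_ids:
            rec = matvec(self.U_dec, s)
            s = [math.tanh(e + r + k)
                 for e, r, k in zip(self.E_tgt[prev], rec, ctx)]
            total += log_softmax(matvec(self.O, s))[tok]
            prev = tok
        return total


model = TinyEncoderDecoder(src_vocab=5, tgt_vocab=4)
lp = model.log_prob([2, 3, 4], [2, 3, EOS])
```

The encoder output has the same length for any source phrase, which is the bottleneck the load-bearing premise refers to. Training would adjust the weights to raise `log_prob` on observed phrase pairs; that step is omitted here.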
If this is right
- Statistical machine translation systems can be strengthened by treating the neural model's phrase probabilities as an extra scoring feature.
- The encoder produces fixed-length vectors that preserve the information needed to reconstruct target phrases accurately.
- The training objective leads to phrase representations that group phrases by semantic and syntactic similarity.
- Phrase-based translation pipelines can incorporate neural sequence modeling without replacing the entire log-linear framework.
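The idea of adding the neural score as one more feature can be sketched as follows. This is a toy illustration, not the configuration of any real decoder: the feature names (`tm`, `lm`, `rnn`), the weights, and the candidate values are all made up, and in practice the weights would be tuned (for example with MERT) rather than set by hand.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the name of the highest-scoring candidate translation."""
    return max(candidates, key=lambda name: loglinear_score(candidates[name], weights))


# Hypothetical log-domain feature values for two candidate phrase translations.
candidates = {
    "A": {"tm": -2.0, "lm": -3.0, "rnn": -4.0},
    "B": {"tm": -2.5, "lm": -3.0, "rnn": -1.0},
}

baseline_weights = {"tm": 1.0, "lm": 1.0}               # no neural feature
augmented_weights = {"tm": 1.0, "lm": 1.0, "rnn": 0.5}  # neural feature added

baseline_choice = best_candidate(candidates, baseline_weights)
augmented_choice = best_candidate(candidates, augmented_weights)
```

With these made-up numbers the baseline prefers candidate A (score -5.0 against -5.5), while the augmented model prefers B (score -6.0 against -7.0). The rest of the pipeline is untouched; only the feature set and its weights change.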
Where Pith is reading between the lines
- The same fixed-vector encoding could support phrase similarity measures or paraphrase generation in other language tasks.
- Hybrid statistical-neural scoring may prove useful for sequence problems outside translation where explicit features already exist.
- If the vector representation is informationally complete, the architecture could be tested on longer contexts or non-linguistic sequences.
Load-bearing premise
The fixed-length vector from the encoder retains enough information about the source phrase for the decoder to generate accurate target phrases, and the resulting probabilities supply information that is genuinely new relative to the existing features in the log-linear model.
What would settle it
A side-by-side evaluation of a statistical machine translation system on a held-out test set that shows no improvement in standard quality metrics when the RNN Encoder-Decoder probabilities are added as a feature.
Original abstract
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the RNN Encoder-Decoder architecture consisting of two RNNs: an encoder that compresses a variable-length source phrase into a fixed-length vector and a decoder that generates the corresponding target phrase from that vector. The model is trained end-to-end to maximize the conditional probability of the target phrase given the source phrase. The authors then use the log-probabilities produced by the trained model as an additional feature inside the log-linear model of a phrase-based statistical machine translation system and report improved BLEU scores on English-to-French WMT data; they also present qualitative nearest-neighbor analyses indicating that the learned vectors capture syntactic and semantic regularities.
Significance. If the reported gains are reproducible, the work supplies early empirical evidence that a neural sequence model can supply complementary information to conventional SMT features (phrase table, language model, etc.) even when restricted to short phrases. The qualitative results further demonstrate that fixed-length encodings can preserve linguistically meaningful structure for phrases, providing a concrete illustration of the representational power of the architecture that later influenced neural machine translation.
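A nearest-neighbor analysis of the kind mentioned in the summary can be sketched with cosine similarity over phrase vectors. The vectors and phrases below are invented for illustration; in a real analysis they would be the encoder outputs for actual phrases.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(query, vectors, k=2):
    """Return the k phrases whose vectors are most similar to the query phrase."""
    ranked = sorted(
        ((cosine(vectors[query], vec), name)
         for name, vec in vectors.items() if name != query),
        reverse=True,
    )
    return [name for _, name in ranked[:k]]


# Hypothetical 3-dimensional phrase vectors (real encoder vectors are much larger).
toy_vectors = {
    "a few days": [0.9, 0.1, 0.0],
    "several days": [0.8, 0.2, 0.1],
    "the committee": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors("a few days", toy_vectors, k=1)
```

If the learned space groups phrases by meaning, neighbors of a time expression should mostly be other time expressions, as in this toy case.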
major comments (2)
- [§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but it supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers, the magnitude and reliability of the central empirical claim cannot be assessed.
- [§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level. The paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.
minor comments (2)
- [Abstract] The claim of empirical improvement is made without any numerical result (BLEU delta, data size, etc.), which reduces the abstract’s utility as a standalone summary.
- [§3] Notation: the update equations for the RNN hidden states are given but the symbols for the weight matrices and bias vectors are not collected in one place, making it harder to verify the parameter count and implementation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments on our work. We address each major comment below and have revised the manuscript to improve clarity and completeness.
Point-by-point responses
-
Referee: [§4 (Experiments)] The manuscript states that adding the RNN-derived feature improves BLEU after MERT re-tuning, but it supplies neither the absolute BLEU scores of the baseline and augmented systems nor any statistical significance test or variance estimate across multiple MERT runs. Without these numbers, the magnitude and reliability of the central empirical claim cannot be assessed.
Authors: We agree that absolute BLEU scores and details on statistical reliability strengthen the empirical claim. The revised manuscript now explicitly reports the baseline BLEU score and the score after adding the RNN Encoder-Decoder feature as an additional log-linear feature. We also include results from multiple MERT runs with variance estimates and note that the observed improvement is consistent, although a full bootstrap significance test was not performed in the original experiments. revision: yes
-
Referee: [§3.2 (Decoder)] The transfer of information from encoder to decoder is described only at a high level. The paper does not specify whether the decoder’s initial hidden state is exactly the encoder’s final state, a learned projection of it, or something else, nor does it report the phrase-length distribution on which the model was trained. Both details are load-bearing for the claim that the fixed-length vector retains sufficient information.
Authors: We thank the referee for highlighting this lack of detail. The decoder is initialized directly with the encoder’s final hidden state (no learned projection). We have revised Section 3.2 to state this explicitly. The model was trained on phrase pairs whose lengths follow the distribution in the WMT training data (predominantly short phrases, with a maximum length of 30 tokens); we have added this information and a brief histogram to the revised manuscript. revision: yes
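The bootstrap significance test mentioned in the first response could look roughly like the paired bootstrap below. This is a simplified sketch: it assumes a per-sentence quality score that can be summed, whereas corpus-level BLEU would have to be recomputed from n-gram counts on every resample. The function name and the scores are hypothetical.

```python
import random


def paired_bootstrap(baseline, augmented, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which the augmented system wins.

    baseline[i] and augmented[i] are scores for the same test sentence i.
    """
    if len(baseline) != len(augmented):
        raise ValueError("systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(baseline)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(augmented[i] for i in idx) > sum(baseline[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Made-up per-sentence scores for illustration only.
base_scores = [0.30, 0.25, 0.40, 0.35, 0.20]
aug_scores = [0.32, 0.27, 0.41, 0.36, 0.24]

win_rate = paired_bootstrap(base_scores, aug_scores)
```

A win rate close to 1.0 suggests the improvement is unlikely to be an artifact of the particular test sentences. It does not account for optimizer instability, which is why variance across several MERT runs is reported separately.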
Circularity Check
No significant circularity
Full rationale
The paper's core contribution is an empirical demonstration that phrase-pair conditional probabilities from a jointly trained RNN Encoder-Decoder improve BLEU when added as one extra feature to a standard SMT log-linear model. The RNN is trained end-to-end on an explicit maximum-likelihood objective (maximizing p(target phrase | source phrase)) that does not reference the downstream SMT weights, phrase table, or MERT procedure. No equation or claim reduces the reported performance gain to a fitted parameter by construction, and the paper contains no load-bearing self-citations that would force the result. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Recurrent neural networks can be jointly trained to encode and decode variable-length sequences via maximum conditional likelihood.
Forward citations
Cited by 60 Pith papers
-
PathVQA: 30000+ Questions for Medical Visual Question Answering
PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
-
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...
-
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
Nested-GPT is an autoregressive Transformer that dynamically generates variable-multiplicity parton showers matching Monte Carlo references for non-global logarithm resummation in the large-Nc limit.
-
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-globa...
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Zero-shot Imitation Learning by Latent Topology Mapping
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
-
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
-
Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.
-
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete mul...
-
Denoising Particle Filters: Learning State Estimation with Single-Step Objectives
Denoising particle filters train state estimators on individual transitions via score matching, then use the learned denoiser with a dynamics model to approximate Bayesian filtering step-by-step, matching end-to-end b...
-
MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection
MELT is the first behavioral trace dataset for high-risk memecoin launch detection on Solana, providing 122 features, risk annotations, and ML benchmarks that reduce investment loss when used for selection.
-
Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
CogAlpha combines LLM reasoning with code-level evolutionary search to discover financial alphas that show higher predictive accuracy and generalization than prior methods on five stock datasets.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
Human Motion Diffusion Model
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
-
Mastering Atari with Discrete World Models
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
-
Brno Mobile OCR Dataset
Introduces B-MOD dataset of 19,728 mobile device photos of documents with precise text line annotations and a neural baseline showing high error rates on harder images.
-
Graph Attention Networks
Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...
-
Mixed Precision Training
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
-
Generative Recursive Reasoning
GRAM turns recursive latent reasoning into a generative probabilistic model via stochastic trajectories and amortized variational inference, claiming better performance on structured reasoning tasks than deterministic...
-
Generative Recursive Reasoning
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
-
3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
-
Graph Federated Unlearning for Privacy Preservation
Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.
-
Deep Kernel Learning for Stratifying Glaucoma Trajectories
A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current ...
-
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.
-
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...
-
Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings
TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
Upper Generalization Bounds for Neural Oscillators
Upper generalization bounds for neural oscillators scale polynomially with MLP size and time length, avoiding the curse of parametric complexity, with numerical validation on a Bouc-Wen nonlinear system.
-
Beyond Static: Related Questions Retrieval Through Conversations in Community Question Answering
TeCQR retrieves related questions in cQA by generating tag-enhanced clarifying questions, using noise-tolerant semantic matching, and two-stage training to learn fine-grained representations of queries, questions, and tags.
-
Attention-Based Neural-Augmented Kalman Filter for Legged Robot State Estimation
AttenNKF augments InEKF with an attention-based neural compensator trained in latent space to correct foot-slip errors in legged robot state estimation.
-
AsarRec: Adaptive Sequential Augmentation for Robust Self-supervised Sequential Recommendation
AsarRec learns adaptive sequence augmentations via transformation matrices and Semi-Sinkhorn projection to improve robustness of self-supervised sequential recommenders under noise.
-
Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
Cataract-LMM is a new multi-source dataset of 3000 annotated phacoemulsification videos enabling benchmarks for phase recognition, scene segmentation, interaction tracking, and automated skill assessment.
-
RAPTOR: A Foundation Policy for Quadrotor Control
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
-
Scalable Option Learning in High-Throughput Environments
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
-
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Gating in RNNs couples state time-scales with parameter gradients to produce lag- and direction-dependent effective learning rates, shown via exact Jacobians and first-order expansion.
-
SpectraLLM: Uncovering the Ability of LLMs for Molecular Structure Elucidation from Multi-Spectral Data
SpectraLLM is an LLM fine-tuned to predict small-molecule structures from single or multiple spectra, reporting state-of-the-art results on four public benchmarks with gains from multi-modal input.
-
Chinese Cyberbullying Detection: Dataset, Method, and Validation
Introduces CHNCI, the first Chinese cyberbullying incident detection dataset with 220,676 comments across 91 incidents, created via ensemble pseudo-labeling from explanation-generating methods followed by human annotation.
-
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
-
Decentralized Collective World Model for Emergent Communication and Coordination
A decentralized collective world model integrates predictive coding with bidirectional communication to achieve simultaneous symbol emergence and coordination, outperforming non-communicative baselines in a two-agent ...
-
Pretraining a Foundation Model for Small-Molecule Natural Products
NaFM is a pretrained foundation model for natural products using scaffold-focused contrastive learning and masked graph objectives that achieves SOTA on taxonomy classification, gene/microbial analysis, and virtual sc...
-
Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code
ByteTR recovers variable types in binary code more effectively than prior methods by decoupling unbalanced type sets, mitigating compiler optimization effects via static analysis, and modeling inter-procedural data fl...
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datase...
-
SAM 2: Segment Anything in Images and Videos
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
-
READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
READ recurrent adapters with partial video-language alignment via optimal transport outperform standard fine-tuning on low-resource temporal grounding and summarization tasks.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
-
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
CycleVAE optimizes non-parallel voice conversion indirectly via cyclic reconstructed spectra, yielding higher spectral accuracy, latent feature correlation, and improved converted speech quality.
-
R-Transformer: Recurrent Neural Network Enhanced Transformer
R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.
-
Time2Vec: Learning a Vector Representation of Time
Time2Vec learns a vector representation of time that improves model performance when used in place of raw time inputs across various models and problems.
-
A Unified Framework of Online Learning Algorithms for Training Recurrent Neural Networks
A framework unifies recent online RNN training algorithms along four axes and demonstrates performance clustering on synthetic tasks, indicating that gradient alignment is insufficient to explain success especially fo...
-
Learning Blended, Precise Semantic Program Embeddings
LIGER blends symbolic and concrete traces to learn precise semantic program embeddings, outperforming syntax-based models on CoSET classification and code2seq on method name prediction while using fewer executions.
-
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
-
A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
The authors replace next-word log-likelihood training with word-embedding regression in an encoder-decoder captioning model and report CIDEr 125.0 and BLEU-4 50.5 on MS-COCO, exceeding prior bests of 117.1 and 48.0.
-
Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
RL policies decompose into information-regularized primitives that compete by requesting state information amounts, with the greediest one acting, yielding better generalization than flat or hierarchical baselines.