Neural Machine Translation by Jointly Learning to Align and Translate
Pith reviewed 2026-05-11 09:16 UTC · model grok-4.3
The pith
A neural translation model learns to focus on relevant source words while generating each target word.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors conjecture that a fixed-length context vector creates a performance bottleneck in basic encoder-decoder networks for machine translation. They introduce an attention mechanism that computes a distinct context vector for each target word as a weighted sum of the encoder's hidden states over the source sentence, with the weights learned jointly during training. This allows the model to softly align source and target positions without explicit segmentation. On the WMT English-to-French task the approach reaches translation performance that the authors describe as comparable to a strong phrase-based baseline, while producing alignments that agree well with human intuition.
What carries the argument
The attention-based alignment model that produces a context vector for each decoding step as a weighted combination of encoder hidden states, with weights derived from a feed-forward network trained jointly with the translation objective.
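The mechanism described above can be sketched in a few lines. This is a minimal illustration, assuming an additive scorer of the form v · tanh(W s + U h); the function names, toy dimensions, and random parameters are hypothetical choices for the example and are not taken from the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(prev_state, encoder_states, W, U, v):
    """Return (context, weights) for one decoding step.

    score_j = v . tanh(W @ prev_state + U @ h_j)   (feed-forward scorer, assumed form)
    weights = softmax(scores)                      (soft alignment over source positions)
    context = sum_j weights[j] * h_j               (weighted sum of encoder states)
    """
    projected_state = matvec(W, prev_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return context, weights


# Toy sizes and random parameters, purely illustrative; in training these are learned.
rng = random.Random(0)
state_dim, enc_dim, attn_dim, src_len = 3, 4, 5, 6
W = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(attn_dim)]
U = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(attn_dim)]
v = [rng.uniform(-1, 1) for _ in range(attn_dim)]
prev_state = [rng.uniform(-1, 1) for _ in range(state_dim)]
encoder_states = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(src_len)]

context, weights = additive_attention(prev_state, encoder_states, W, U, v)
print("alignment weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

The weights form a probability distribution over source positions, so the context vector is a convex combination of encoder states and is recomputed at every decoding step rather than fixed once per sentence.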
If this is right
- Translation systems can be trained end-to-end as a single network rather than relying on separate alignment and phrase-table components.
- Performance on longer sentences should improve because relevance can be selected dynamically instead of being compressed into one vector.
- The learned soft alignments provide an interpretable view of which source words influence each target word.
- The same joint alignment-and-generation approach can be applied to other sequence tasks where input relevance varies by output position.
Where Pith is reading between the lines
- The attention weights could serve as a starting point for extracting explicit phrase pairs or for debugging translation errors.
- Models that build on this idea might combine soft attention with hard constraints or multiple attention layers to handle very long documents.
- The removal of the fixed-vector bottleneck suggests similar gains are possible in any encoder-decoder setting where the input sequence is much longer than the output.
Load-bearing premise
That forcing the entire source sentence into a single fixed-length vector prevents the decoder from accessing the right information when generating different target words.
What would settle it
An experiment in which a basic encoder-decoder model without the soft-search mechanism reaches the same BLEU score as the attention model on the identical English-to-French test set.
Original abstract
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an attention-based extension to the encoder-decoder architecture for neural machine translation. Rather than compressing the source sentence into a single fixed-length vector, the decoder learns to compute soft alignment weights over source positions at each decoding step, allowing it to focus on relevant source words when predicting each target word. The model is trained end-to-end on parallel data. On the WMT 2014 English-to-French task the attention model reaches 28.45 BLEU, described as comparable to a strong phrase-based baseline (Moses at 33.30 BLEU), and qualitative inspection shows that the learned soft alignments are intuitive.
Significance. If the reported BLEU scores and alignment visualizations hold under scrutiny, the work is significant because it supplies the first large-scale empirical demonstration that a neural translation model can learn to perform soft alignment jointly with translation. This directly addresses the fixed-context-vector limitation that motivated the paper and introduces the attention mechanism that later became standard in sequence modeling. The combination of quantitative results on a competitive benchmark and qualitative evidence of sensible alignments gives the central claim a solid empirical footing.
major comments (2)
- [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
- [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
minor comments (3)
- [Abstract] The abstract states the main result without quoting the actual BLEU numbers or naming the test set; adding these two facts would make the abstract self-contained.
- [Figure 3] Figure 3 (alignment visualizations): the heatmaps lack explicit word labels on both axes and a color-bar scale, making it harder for readers to verify the claimed agreement with intuition.
- [§2.2] The description of the basic RNN encoder-decoder could cite the exact prior work (Sutskever et al., 2014) more explicitly when stating the fixed-vector bottleneck.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.
- Referee: [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
  Authors: We acknowledge the 4.85 BLEU gap and agree that 'comparable' may overstate the absolute performance relative to the strong Moses baseline. The intent of the claim was to emphasize that an end-to-end neural model without phrase tables or hand-crafted features could reach a level close enough to be practically relevant on a large-scale task. We did not run multiple random seeds or ensembles due to the high computational cost of training on the full WMT data at the time. We will revise the abstract and §4.1 to describe the result as 'competitive with' or 'approaching' the phrase-based system and will add a brief comparison to other neural encoder-decoder baselines available at submission time. This constitutes a partial revision.
- Referee: [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
  Authors: The alignment model indeed uses a simple feed-forward scorer followed by softmax normalization over source positions. While this can in principle dilute gradients for source sentences much longer than those seen during training, our experiments were conducted on the standard WMT splits where sentence lengths are bounded, and the attention model showed clear gains over the fixed-vector baseline. We did not include an explicit gradient analysis because the paper's focus was on the empirical demonstration of jointly learned soft alignments. We will add a short discussion in §3.2 noting the potential limitation for extremely long sequences and pointing to the empirical improvement on longer sentences in the test set. This is a partial revision.
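One way to read the dilution concern in the second exchange is through the softmax normalization alone: because the weights over source positions must sum to one, a fixed score margin buys a smaller weight as the source grows. The sketch below is only an illustration of that arithmetic under assumed toy scores (the helper `peak_weight` and the margin of 5.0 are hypothetical); it does not reproduce any experiment or gradient analysis from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def peak_weight(src_len, margin):
    """Weight given to one position whose score exceeds all others by `margin`."""
    scores = [margin] + [0.0] * (src_len - 1)
    return softmax(scores)[0]


for src_len in (10, 50, 200):
    flat = peak_weight(src_len, margin=0.0)    # no preference: exactly 1 / src_len
    peaked = peak_weight(src_len, margin=5.0)  # same score margin at every length
    print(f"T={src_len:4d}  flat={flat:.4f}  peaked={peaked:.4f}")
```

With equal scores every position receives 1/T, and with a fixed margin the favored position's weight is e^margin / (e^margin + T - 1), which shrinks as T increases unless the scorer learns to widen the margin.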
Circularity Check
No significant circularity
The paper defines an attention-augmented encoder-decoder from first principles (bidirectional RNN encoder, decoder with soft alignment probabilities computed via a feedforward network, trained end-to-end by maximizing log-likelihood on parallel sentence pairs). Performance is measured by standard BLEU on held-out WMT test data; no fitted parameter is defined in terms of BLEU, no self-citation supplies a uniqueness theorem or ansatz, and the fixed-length-vector conjecture is offered only as motivation. All load-bearing steps (model equations, training objective, alignment visualization) are self-contained and externally falsifiable.
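The training objective mentioned above can be written out explicitly. The notation here is an assumption made for illustration: $s_i$ is the decoder state, $c_i$ the per-step context vector, $g$ the output function, $T_y$ the target length, and $N$ the number of training sentence pairs.

```latex
\begin{align}
p(y \mid x) &= \prod_{i=1}^{T_y} p(y_i \mid y_1, \dots, y_{i-1}, x), \\
p(y_i \mid y_1, \dots, y_{i-1}, x) &= g(y_{i-1}, s_i, c_i), \\
\hat{\theta} &= \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\bigl(y^{(n)} \mid x^{(n)}\bigr).
\end{align}
```

BLEU does not appear anywhere in this objective, which is consistent with the observation that no fitted quantity is defined in terms of the evaluation metric.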
Axiom & Free-Parameter Ledger
free parameters (1)
- attention model parameters
axioms (1)
- domain assumption: A single neural network can be jointly tuned to maximize translation performance
invented entities (1)
- soft alignment weights: no independent evidence
Forward citations
Cited by 60 Pith papers
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
Generative Language Modeling for Automated Theorem Proving
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
-
REALM: Retrieval-Augmented Language Model Pre-Training
REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Neural Turing Machines
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
-
Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
SNMPP builds a product-form neural influence kernel from a signed class-wise interaction network and a monotonic delay-aware temporal network to enable interpretable multi-class event stream modeling.
-
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
-
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...
-
Selective Contrastive Learning For Gloss Free Sign Language Translation
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
-
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
-
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...
-
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
-
How Attentive are Graph Attention Networks?
GAT uses static attention where neighbor rankings ignore the query node and thus cannot express some graph problems; GATv2 enables dynamic attention and outperforms GAT on 11 OGB and other benchmarks with equal parameters.
-
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
-
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
-
Deep Modular Co-Attention Networks for Visual Question Answering
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
-
Understanding, Categorizing and Predicting Semantic Image-Text Relations
Introduces an eight-class taxonomy for semantic image-text relations based on three metrics and a multimodal embedding model for predicting the classes from collected data.
-
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
General-Purpose Photonic Computing Primitive for Contemporary Artificial Intelligence
DUET is a photonic tensor core paradigm that uses structural symmetry in VODICs to support arbitrary signed operands directly, experimentally tested on image classification, segmentation, and Transformer tasks.
-
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS...
-
Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
SNMPP builds a product-form neural influence kernel from a signed interaction network over event classes and a delay-aware monotonic temporal network to enable explicit discovery of inter-event relationships alongside...
-
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
-
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
-
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
GCL uses a two-stage protocol with Routing, Auditing, Public-Factor, and Aggregation Agents to mitigate modality dominance and spurious coupling in multimodal learning, achieving state-of-the-art results on CMU-MOSI, ...
-
Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions
Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
-
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
-
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
-
Neural architectures for resolving references in program code
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
-
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
-
Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts
MREF-AD applies a mixture-of-experts architecture to regional brain imaging data from multiple modalities to achieve competitive Alzheimer's diagnosis accuracy on ADNI while supplying region- and modality-level interp...
-
DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers
DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and bi...
-
Category-based Galaxy Image Generation via Diffusion Models
GalCatDiff applies category embeddings and a novel Astro-RAB block inside diffusion models to produce galaxy images whose color and size distributions match observations more closely than prior generative approaches.
-
A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG
PennyLang dataset of 3,347 PennyLane samples boosts LLM code generation success via RAG from 8.7% to 41.7% for Qwen 7B and 78.8% to 84.8% for LLaMa 4.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
An Empirical Study of Mamba-based Language Models
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
-
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
Introduces animal2vec, a self-supervised transformer for sparse bioacoustic audio, and the MeerKAT meerkat vocalization dataset, claiming outperformance over baselines including in few-shot settings.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
-
Image Captioning via Compact Bidirectional Architecture
Compact bidirectional transformer integrates L2R and R2L flows with sentence-level ensemble and two-flow self-critical training to achieve SOTA on MSCOCO without vision-language pretraining.
-
On the Opportunities and Risks of Foundation Models
Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
Reference graph
Works this paper leans on
- [1] Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.
- [2] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
- [3] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
- [4] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.
- [5] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
- [6] Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In ISMIR.
- [7] Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). To appear.
- [8] Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
- [9] Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Association for Computational Linguistics.
- [10] Forcada, M. L. and Ñeco, R. P. (1997). Recursive hetero-associative memories for translation. In J. Mira, R. Moreno-Díaz, and J. Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.
- [11] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327.
- [12] Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
- [13] Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE].
- [14] Graves, A., Jaitly, N., and Mohamed, A.-R. (2013). Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278.
- [15] Hermann, K. and Blunsom, P. (2014). Multilingual distributed representations without word alignment. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
- [16] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
- [17] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- [18] Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.
- [19] Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA.
- [20] Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [21] Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML'2013.
- [22] Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
- [23] Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
- [24] Pouget-Abadie, J., Bahdanau, D., van Merriënboer, B., Cho, K., and Bengio, Y. (2014). Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.
- [25] Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11), 2673–2681.
- [26] Schwenk, H. (2012). Continuous space translation models for phrase-based statistical machine translation. In M. Kay and C. Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. Indian Institute of Technology Bombay.
- [27] Schwenk, H., Déchelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 723–730. Association for Computational Linguistics.
- [28] Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
- [29] Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701 [cs.LG].