Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau; Kyunghyun Cho; Yoshua Bengio

arXiv: 1409.0473 · v7 · submitted 2014-09-01 · cs.CL · cs.LG · cs.NE · stat.ML

Pith reviewed 2026-05-11 09:16 UTC · model grok-4.3

classification cs.CL · cs.LG · cs.NE · stat.ML

keywords neural machine translation · attention mechanism · encoder-decoder · soft alignment · sequence-to-sequence · machine translation · end-to-end training

The pith

A neural translation model learns to focus on relevant source words while generating each target word.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the standard encoder-decoder architecture for neural machine translation is limited by forcing all source information into one fixed-length vector. It proposes extending the model so the decoder can automatically and softly search for the most relevant parts of the source sentence when predicting each target word. This joint learning of alignment and translation is meant to remove the bottleneck and produce end-to-end trainable systems. A reader would care because it offers a path to simpler, fully neural translation pipelines that might scale without hand-engineered phrase tables. The authors report that the resulting model reaches translation quality comparable to the best phrase-based systems on English-to-French.

Core claim

The authors conjecture that a fixed-length context vector creates a performance bottleneck in basic encoder-decoder networks for machine translation. They introduce an attention mechanism that computes a distinct context vector for each target word as a weighted sum of the source sentence's hidden states, with the weights learned jointly during training. This allows the model to softly align source and target positions without explicit segmentation. On the WMT English-to-French task the authors report translation performance comparable to a strong phrase-based baseline, and the learned soft alignments agree well with human intuition.

What carries the argument

The attention-based alignment model that produces a context vector for each decoding step as a weighted combination of encoder hidden states, with weights derived from a feed-forward network trained jointly with the translation objective.
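
As an illustration of that mechanism, the following is a minimal pure-Python sketch of one decoding step. It assumes an additive scoring form, v · tanh(W s + U h), for the feed-forward alignment network. The function name `additive_attention`, the parameter names, and the toy values are hypothetical; in a trained model the matrices would be learned jointly with the rest of the network rather than set by hand.

```python
import math


def additive_attention(prev_state, encoder_states, W, U, v):
    """Return (weights, context) for a single decoding step.

    prev_state:     previous decoder hidden state (list of floats)
    encoder_states: one hidden state per source position
    W, U, v:        alignment-network parameters (assumed shapes)
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_state = matvec(W, prev_state)

    # Score every source position against the previous decoder state.
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over source positions (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy values, chosen only to make the sketch runnable.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    prev_state=[0.5, -0.2],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    W=identity,
    U=identity,
    v=[1.0, 1.0],
)
print(weights, context)
```

The weights always sum to one, so the context vector is a convex combination of the encoder states and changes at every decoding step as `prev_state` changes.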

If this is right

Translation systems can be trained end-to-end as a single network rather than relying on separate alignment and phrase-table components.
Performance on longer sentences should improve because relevance can be selected dynamically instead of being compressed into one vector.
The learned soft alignments provide an interpretable view of which source words influence each target word.
The same joint alignment-and-generation approach can be applied to other sequence tasks where input relevance varies by output position.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The attention weights could serve as a starting point for extracting explicit phrase pairs or for debugging translation errors.
Models that build on this idea might combine soft attention with hard constraints or multiple attention layers to handle very long documents.
The removal of the fixed-vector bottleneck suggests similar gains are possible in any encoder-decoder setting where the input sequence is much longer than the output.
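
The first extension above, reading explicit word pairs off the attention weights, can be sketched as follows. This is an assumption-laden illustration rather than anything the paper proposes: the helper `hard_alignment` is hypothetical, and the weight matrix is made up for the example, not taken from a trained model.

```python
def hard_alignment(weights, source_tokens, target_tokens):
    """Pick, for each target token, the source token with the largest weight.

    weights[i][j] is the soft alignment weight of target position i
    on source position j; each row is assumed to sum to one.
    """
    pairs = []
    for target, row in zip(target_tokens, weights):
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((target, source_tokens[best], row[best]))
    return pairs


# Made-up weights for illustration only.
source = ["the", "cat", "sleeps"]
target = ["le", "chat", "dort"]
toy_weights = [
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]

alignment = hard_alignment(toy_weights, source, target)
for tgt, src, weight in alignment:
    print(f"{tgt} <- {src} ({weight:.2f})")
```

Taking the argmax discards the spread of the distribution, so a low winning weight is a hint that the hard pair is unreliable and worth inspecting when debugging a translation.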

Load-bearing premise

That forcing the entire source sentence into a single fixed-length vector prevents the decoder from accessing the right information when generating different target words.

What would settle it

An experiment in which a basic encoder-decoder model without the soft-search mechanism reaches the same BLEU score as the attention model on the identical English-to-French test set.

Original abstract

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper adds soft attention to encoder-decoder NMT so the decoder can dynamically focus on source words, and reports En-Fr BLEU scores that the authors describe as comparable to the best phrase-based systems.

The main takeaway is that this work shows how to replace the single fixed context vector in a basic RNN encoder-decoder with a soft attention mechanism. For each target word the decoder computes weights over the source hidden states and uses a weighted sum, all trained jointly with the translation loss. They get English-to-French BLEU numbers on WMT'14 that sit at the level of the strongest phrase-based systems then available, plus some qualitative plots where the learned alignments look sensible for word pairs and phrases.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes an attention-based extension to the encoder-decoder architecture for neural machine translation. Rather than compressing the source sentence into a single fixed-length vector, the decoder learns to compute soft alignment weights over source positions at each decoding step, allowing it to focus on relevant source words when predicting each target word. The model is trained end-to-end on parallel data. On the WMT 2014 English-to-French task the attention model reaches 28.45 BLEU, described as comparable to a strong phrase-based baseline (Moses at 33.30 BLEU), and qualitative inspection shows that the learned soft alignments are intuitive.

Significance. If the reported BLEU scores and alignment visualizations hold under scrutiny, the work is significant because it supplies the first large-scale empirical demonstration that a neural translation model can learn to perform soft alignment jointly with translation. This directly addresses the fixed-context-vector limitation that motivated the paper and introduces the attention mechanism that later became standard in sequence modeling. The combination of quantitative results on a competitive benchmark and qualitative evidence of sensible alignments gives the central claim a solid empirical footing.

major comments (2)

[§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.
[§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.
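
The arithmetic behind the first major comment can be checked directly. The sketch below uses only the two BLEU figures quoted in the report; `summarize_runs` is a hypothetical helper showing what a seed-variance summary could look like, and it is not called here because no per-seed scores are available.

```python
import statistics

# Single-run BLEU scores quoted in the referee report.
attention_bleu = 28.45
moses_bleu = 33.30

gap = moses_bleu - attention_bleu
relative_gap = gap / moses_bleu

print(f"absolute gap: {gap:.2f} BLEU")
print(f"relative gap: {relative_gap:.1%} of the Moses score")


def summarize_runs(scores):
    """Hypothetical helper: mean and sample standard deviation of BLEU
    across random seeds. Needs at least two scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A gap is only meaningful relative to run-to-run noise, which is why the comment asks for a spread across seeds rather than a single number.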

minor comments (3)

[Abstract] The abstract states the main result without quoting the actual BLEU numbers or naming the test set; adding these two facts would make the abstract self-contained.
[Figure 3] The alignment heatmaps lack explicit word labels on both axes and a color-bar scale, making it harder for readers to verify the claimed agreement with intuition.
[§2.2] The description of the basic RNN encoder-decoder could cite the exact prior work (Sutskever et al., 2014) more explicitly when stating the fixed-vector bottleneck.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.

Referee: [§4.1 and Table 1] The claim that performance is 'comparable' to the state-of-the-art phrase-based system rests on a single-run BLEU of 28.45 versus 33.30 for Moses. A 4.85-point gap is large enough that the comparability statement would be strengthened by reporting variance across random seeds, an ensemble result, or a direct comparison against the best contemporaneous neural baselines on the same data split.

Authors: We acknowledge the 4.85 BLEU gap and agree that 'comparable' may overstate the absolute performance relative to the strong Moses baseline. The intent of the claim was to emphasize that an end-to-end neural model without phrase tables or hand-crafted features could reach a level close enough to be practically relevant on a large-scale task. We did not run multiple random seeds or ensembles due to the high computational cost of training on the full WMT data at the time. We will revise the abstract and §4.1 to describe the result as 'competitive with' or 'approaching' the phrase-based system and will add a brief comparison to other neural encoder-decoder baselines available at submission time.
revision: partial

Referee: [§3.2, Eq. (5)–(7)] The alignment model is a simple feed-forward network whose output is normalized by softmax; the paper does not analyze or mitigate potential gradient vanishing when source sentences exceed the lengths seen in training. Because the motivating conjecture concerns long-sentence performance, this omission is load-bearing for the central architectural claim.

Authors: The alignment model indeed uses a simple feed-forward scorer followed by softmax normalization over source positions. While this can in principle dilute gradients for source sentences much longer than those seen during training, our experiments were conducted on the standard WMT splits where sentence lengths are bounded, and the attention model showed clear gains over the fixed-vector baseline. We did not include an explicit gradient analysis because the paper's focus was on the empirical demonstration of jointly learned soft alignments. We will add a short discussion in §3.2 noting the potential limitation for extremely long sequences and pointing to the empirical improvement on longer sentences in the test set.
revision: partial
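
The dilution effect discussed in the second exchange can be illustrated with a small sketch. It assumes one source position scores a fixed margin above all others, which is a deliberately simplified setup and not an analysis from the paper; `peak_weight` and the score values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def peak_weight(length, peak_score=2.0, other_score=0.0):
    """Weight given to the single best-scoring source position when the
    remaining (length - 1) positions all share a lower score."""
    scores = [peak_score] + [other_score] * (length - 1)
    return softmax(scores)[0]


# Same score margin, increasingly long source sentences.
peaks = {n: peak_weight(n) for n in (10, 50, 200)}
for n, weight in peaks.items():
    print(f"source length {n:3d}: peak weight {weight:.3f}")
```

With the margin held fixed, the peak weight equals e^2 / (e^2 + n - 1) and shrinks as the sentence grows, so the scorer has to produce larger margins on long inputs to keep the context vector focused.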

Circularity Check

0 steps flagged

No significant circularity

The paper defines an attention-augmented encoder-decoder from first principles (bidirectional RNN encoder, decoder with soft alignment probabilities computed via a feedforward network, trained end-to-end by maximizing log-likelihood on parallel sentence pairs). Performance is measured by standard BLEU on held-out WMT test data; no fitted parameter is defined in terms of BLEU, no self-citation supplies a uniqueness theorem or ansatz, and the fixed-length-vector conjecture is offered only as motivation. All load-bearing steps (model equations, training objective, alignment visualization) are self-contained and externally falsifiable.
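
For readers who want the steps named above in symbols, one consistent way to write them is sketched below. The notation does not appear on this page and should be read as an assumption: h_j is the bidirectional encoder state at source position j, s_{i-1} the previous decoder state, T_x the source length, and W_a, U_a, v_a the parameters of the feed-forward alignment network.

```latex
\begin{align}
  h_j &= \bigl[\overrightarrow{h}_j \,;\, \overleftarrow{h}_j\bigr]
      && \text{(bidirectional encoder state)} \\
  e_{ij} &= v_a^{\top} \tanh\bigl(W_a s_{i-1} + U_a h_j\bigr)
      && \text{(alignment score)} \\
  \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
      && \text{(soft alignment weight)} \\
  c_i &= \sum_{j=1}^{T_x} \alpha_{ij} \, h_j
      && \text{(context vector for target word } i\text{)} \\
  \theta^{\ast} &= \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N}
      \log p_{\theta}\bigl(y^{(n)} \mid x^{(n)}\bigr)
      && \text{(training objective)}
\end{align}
```

None of these quantities is defined in terms of BLEU, which is the point the circularity check relies on.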

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The proposal rests on standard neural network training assumptions plus the specific conjecture about fixed vectors; no new physical entities are postulated.

free parameters (1)

attention model parameters
Weights of the alignment model are learned from data during joint training.

axioms (1)

domain assumption: A single neural network can be jointly tuned to maximize translation performance.
Stated as the goal of neural machine translation in the abstract.

invented entities (1)

soft alignment weights (no independent evidence)
purpose: to allow the decoder to focus on the source positions relevant to each target word
New component introduced to extend the encoder-decoder.

pith-pipeline@v0.9.0 · 5489 in / 1063 out tokens · 101742 ms · 2026-05-11T09:16:53.286704+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
Emerging Properties in Self-Supervised Vision Transformers
cs.CV 2021-04 conditional novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
Generative Language Modeling for Automated Theorem Proving
cs.LG 2020-09 unverdicted novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
REALM: Retrieval-Augmented Language Model Pre-Training
cs.CL 2020-02 accept novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
cs.LG 2017-01 accept novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
Adaptive Computation Time for Recurrent Neural Networks
cs.NE 2016-03 accept novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Neural Turing Machines
cs.NE 2014-10 unverdicted novelty 8.0

Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
cs.LG 2026-05 unverdicted novelty 7.0

SNMPP builds a product-form neural influence kernel from a signed class-wise interaction network and a monotonic delay-aware temporal network to enable interpretable multi-class event stream modeling.
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
cs.LG 2026-05 unverdicted novelty 7.0

GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
cs.LG 2026-05 unverdicted novelty 7.0

ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 conditional novelty 7.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
math.PR 2026-04 unverdicted novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...
Selective Contrastive Learning For Gloss Free Sign Language Translation
cs.CL 2026-04 unverdicted novelty 7.0

A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
cs.LG 2025-10 unverdicted novelty 7.0

RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
cs.LG 2025-09 unverdicted novelty 7.0

Robust Filter Attention models self-attention as consistency-based state estimation under a linear SDE for token trajectories, matching standard attention complexity while showing lower perplexity and better zero-shot...
AlphaEvolve: A coding agent for scientific and algorithmic discovery
cs.AI 2025-06 unverdicted novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
cs.CL 2024-04 conditional novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
cs.LG 2024-02 unverdicted novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
How Attentive are Graph Attention Networks?
cs.LG 2021-05 conditional novelty 7.0

GAT uses static attention where neighbor rankings ignore the query node and thus cannot express some graph problems; GATv2 enables dynamic attention and outperforms GAT on 11 OGB and other benchmarks with equal parameters.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
cs.SE 2020-09 conditional novelty 7.0

CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
eess.AS 2019-06 unverdicted novelty 7.0

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
Deep Modular Co-Attention Networks for Visual Question Answering
cs.CV 2019-06 conditional novelty 7.0

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
Understanding, Categorizing and Predicting Semantic Image-Text Relations
cs.MM 2019-06 unverdicted novelty 7.0

Introduces an eight-class taxonomy for semantic image-text relations based on three metrics and a multimodal embedding model for predicting the classes from collected data.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
cs.CL 2018-08 accept novelty 7.0

SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
cs.CL 2016-11 accept novelty 7.0

MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
General-Purpose Photonic Computing Primitive for Contemporary Artificial Intelligence
physics.optics 2026-05 unverdicted novelty 6.0

DUET is a photonic tensor core paradigm that uses structural symmetry in VODICs to support arbitrary signed operands directly, experimentally tested on image classification, segmentation, and Transformer tasks.
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
cs.LG 2026-05 unverdicted novelty 6.0

AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS...
Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
cs.LG 2026-05 unverdicted novelty 6.0

SNMPP builds a product-form neural influence kernel from a signed interaction network over event classes and a delay-aware monotonic temporal network to enable explicit discovery of inter-event relationships alongside...
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
cs.CL 2026-05 unverdicted novelty 6.0

Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Cubit: Token Mixer with Kernel Ridge Regression
cs.LG 2026-05 unverdicted novelty 6.0

Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
cs.CL 2026-05 unverdicted novelty 6.0

mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
cs.CL 2026-05 unverdicted novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
cs.LG 2026-05 unverdicted novelty 6.0

GCL uses a two-stage protocol with Routing, Auditing, Public-Factor, and Aggregation Agents to mitigate modality dominance and spurious coupling in multimodal learning, achieving state-of-the-art results on CMU-MOSI, ...
Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions
hep-ph 2026-04 unverdicted novelty 6.0

Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
cs.NE 2026-04 unverdicted novelty 6.0

S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
cs.LG 2026-04 unverdicted novelty 6.0

PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
Neural architectures for resolving references in program code
cs.LG 2026-04 unverdicted novelty 6.0

New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
astro-ph.IM 2026-04 unverdicted novelty 6.0

Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
cs.LG 2026-03 unverdicted novelty 6.0

TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts
cs.LG 2025-11 unverdicted novelty 6.0

MREF-AD applies a mixture-of-experts architecture to regional brain imaging data from multiple modalities to achieve competitive Alzheimer's diagnosis accuracy on ADNI while supplying region- and modality-level interp...
DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers
cs.LG 2025-09 unverdicted novelty 6.0

DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and bi...
Category-based Galaxy Image Generation via Diffusion Models
astro-ph.IM 2025-06 unverdicted novelty 6.0

GalCatDiff applies category embeddings and a novel Astro-RAB block inside diffusion models to produce galaxy images whose color and size distributions match observations more closely than prior generative approaches.
A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG
cs.SE 2025-03 unverdicted novelty 6.0

PennyLang dataset of 3,347 PennyLane samples boosts LLM code generation success via RAG from 8.7% to 41.7% for Qwen 7B and 78.8% to 84.8% for LLaMa 4.
Titans: Learning to Memorize at Test Time
cs.LG 2024-12 unverdicted novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
An Empirical Study of Mamba-based Language Models
cs.LG 2024-06 accept novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
cs.SD 2024-06 unverdicted novelty 6.0

Introduces animal2vec, a self-supervised transformer for sparse bioacoustic audio, and the MeerKAT meerkat vocalization dataset, claiming outperformance over baselines including in few-shot settings.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
cs.CL 2024-02 conditional novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
cs.LG 2024-01 unverdicted novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
Image Captioning via Compact Bidirectional Architecture
cs.CV 2022-01 unverdicted novelty 6.0

Compact bidirectional transformer integrates L2R and R2L flows with sentence-level ensemble and two-flow self-critical training to achieve SOTA on MSCOCO without vision-language pretraining.
On the Opportunities and Risks of Foundation Models
cs.LG 2021-08 accept novelty 6.0

Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
cs.CL 2020-06 unverdicted novelty 6.0

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 145 Pith papers

[1]

Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.

work page 2011
[2]

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

work page 2012
[3]

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

work page 1994
[4]

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.

work page 2003
[5]

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

work page 2010
[6]

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In ISMIR.

work page 2013
[7]

Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). To appear.

work page 2014
[8]

Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.

work page
[9]

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Association for Computational Linguistics.

work page 2014
[10]

Forcada, M. L. and Ñeco, R. P. (1997). Recursive hetero-associative memories for translation. In J. Mira, R. Moreno-Díaz, and J. Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.

work page 1997
[11]

Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327.

work page 2013
[12]

Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).

work page 2012
[13]

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE].

work page Pith review arXiv 2013
[14]

Graves, A., Jaitly, N., and Mohamed, A.-R. (2013). Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278.

work page 2013
[15]

Hermann, K. and Blunsom, P. (2014). Multilingual distributed representations without word alignment. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).

work page 2014
[16]

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.

work page 1991
[17]

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

work page 1997
[18]

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.

work page 2013
[19]

Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA.

work page 2010
[20]

Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

work page 2003
[21]

Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML'2013.

work page 2013
[22]

Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).

work page 2013
[23]

Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).

work page 2014
[24]

Pouget-Abadie, J., Bahdanau, D., van Merriënboer, B., Cho, K., and Bengio, Y. (2014). Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. To appear.

work page 2014
[25]

Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11), 2673–2681.

work page 1997
[26]

Schwenk, H. (2012). Continuous space translation models for phrase-based statistical machine translation. In M. Kay and C. Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. Indian Institute of Technology Bombay.

work page 2012
[27]

Schwenk, H., Déchelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 723–730. Association for Computational Linguistics.

work page 2006
[28]

Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).

work page 2014
[29]

Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701 [cs.LG].

work page Pith review arXiv 2012
