Effective Approaches to Attention-based Neural Machine Translation

Minh-Thang Luong, Hieu Pham, Christopher D Manning · 2015 · cs.CL · arXiv 1508.04025

22 Pith papers cite this work. Polarity classification is still indexing.

abstract

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
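The contrast the abstract draws between global and local attention can be illustrated with a minimal sketch. The abstract does not specify the scoring function or how the local subset is chosen, so the dot-product score, the caller-supplied window `center`, and the `half_width` parameter below are all assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def global_attention(target, sources):
    """Attend to every source state; returns (weights, context)."""
    # Assumption: dot-product scoring between target and source states.
    weights = softmax([dot(target, s) for s in sources])
    context = [sum(w * s[i] for w, s in zip(weights, sources))
               for i in range(len(target))]
    return weights, context

def local_attention(target, sources, center, half_width):
    """Attend only to sources within half_width of center; zero elsewhere."""
    # Assumption: the window center is supplied by the caller.
    lo = max(0, center - half_width)
    hi = min(len(sources), center + half_width + 1)
    window_weights, context = global_attention(target, sources[lo:hi])
    weights = [0.0] * len(sources)
    weights[lo:hi] = window_weights
    return weights, context

# Toy source states (one vector per source word) and one target state.
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
target = [1.0, 0.0]

g_w, g_c = global_attention(target, sources)
l_w, l_c = local_attention(target, sources, center=1, half_width=1)
```

In this sketch the global weights are nonzero for all four source positions, while the local weights are zero outside the window around `center`, so the cost of the local variant depends on the window size rather than the source length.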

citation-role summary

background: 1 · method: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1 · use method: 1

representative citing papers

Continuum Neural Momentum Eigenstate for Variationally Solving Quasiparticles

cond-mat.quant-gas · 2026-06-11 · unverdicted · novelty 7.0

EVE is a neural quantum state that enforces exact momentum eigenstates by construction, allowing VMC to variationally solve quasiparticle states across multiple phases in 2D interacting bosons.

Selective Contrastive Learning For Gloss Free Sign Language Translation

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.

Revisiting Neural Processes via Fourier Transform and Volterra Series

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Introduces SFConvCNPs and SFVConvCNPs using set Fourier convolutions and Volterra expansions for translation-equivariant neural processes on irregular data with global receptive fields and linear scaling.

Graph-based Knowledge Distillation by Multi-head Attention Network

cs.LG · 2019-07-04 · unverdicted · novelty 6.0

Multi-head attention constructs a graph of dataset relations from the teacher embedding procedure and transfers it to the student via multi-task learning, yielding 7.05% higher CIFAR-100 accuracy than the student alone and 2.46% above prior SOTA.

Creating A Neural Pedagogical Agent by Jointly Learning to Review and Assess

cs.LG · 2019-06-26 · unverdicted · novelty 6.0

Bidirectional RNN with attention models real-time user knowledge from question-response sequences to predict correctness, outperforming baselines especially for new users on a large TOEIC mobile app dataset.

Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

cs.CL · 2019-06-26 · conditional · novelty 6.0

Error distributions estimated from text repetitions enable training of denoising autoencoders that improve OCR post-correction on historical Finnish newspapers without manual training data.

Attention U-Net: Learning Where to Look for the Pancreas

cs.CV · 2018-04-11 · unverdicted · novelty 6.0

Attention gates added to U-Net automatically focus on target organs in CT images and improve segmentation performance on abdominal datasets.

OGNet: Salient Object Detection with Output-guided Attention Module

cs.CV · 2019-07-17 · unverdicted · novelty 5.0

OGNet proposes an output-guided attention module from multi-scale outputs and an intractable area F-measure loss to enhance salient object detection in edges and confusing areas while remaining lightweight.

Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?

cs.CL · 2019-07-01 · unverdicted · novelty 5.0

Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.

Attention Is All You Need

cs.CL · 2017-06-12 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

Automatically Learning Construction Injury Precursors from Text

cs.CL · 2019-07-26 · unverdicted · novelty 4.0

Standard NLP classifiers can surface valid injury precursors from raw construction safety reports.

Investigating Self-Attention Network for Chinese Word Segmentation

cs.CL · 2019-07-26 · unverdicted · novelty 4.0

Self-attention networks achieve competitive results to BiLSTM-CRF on Chinese word segmentation, with BERT and word integration yielding the best reported performance on six heterogeneous domain benchmarks.

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

eess.AS · 2019-07-15 · unverdicted · novelty 4.0

Hierarchical seq2seq model for parallel voice conversion pretrained as autoencoder on single-speaker data then adapted to limited multispeaker data, using mel spectrograms converted via wavenet vocoder.

Automatic Repair and Type Binding of Undeclared Variables using Neural Networks

cs.SE · 2019-07-14 · unverdicted · novelty 4.0

Neural network trained on AST structural details repairs undeclared variable errors and infers types, reporting 81% success on location/identification and 80% on types for 1059 programs in the prutor dataset.

Improving Zero-shot Translation with Language-Independent Constraints

cs.CL · 2019-06-20 · unverdicted · novelty 4.0

Language-independent constraints and regularization in multilingual Transformer NMT yield a 2.23 BLEU average gain on zero-shot pairs from the IWSLT 2017 dataset.

Resource-Efficient CSI Prediction: A Gated Fusion and Factorized Projection Approach

eess.SP · 2026-05-07 · unverdicted · novelty 4.0

A gated-fusion CSI predictor using GRU, attention, and DSLH reaches -13.84 dB NMSE with 26% fewer parameters and 2.3x higher throughput than a LinFormer baseline on 3GPP channels.

Skeleton-based Coherence Modeling in Narratives

cs.CL · 2026-04-02 · unverdicted · novelty 4.0

Sentence-level models outperform skeleton-based approaches for narrative coherence despite a new SSN network improving on cosine and Euclidean baselines.

The General Theory of Localization Methods

cs.LG · 2026-05-20 · unverdicted · novelty 3.0 · 2 refs

The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.

Positional Encoding in Transformer-Based Time Series Models: A Survey

cs.LG · 2025-02-17 · unverdicted · novelty 3.0

A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.

Predicting Drug Responses by Propagating Interactions through Text-Enhanced Drug-Gene Networks

cs.SI · 2019-06-19 · unverdicted · novelty 3.0

A text-enhanced drug-gene network is constructed from articles and data, with edge embeddings estimated from cell line records to enable explainable drug sensitivity predictions at 94.74% accuracy.

Gemma 2: Improving Open Language Models at a Practical Size

cs.CL · 2024-07-31 · conditional · novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions

hep-ph · 2026-04-22

citing papers explorer

Showing 5 of 5 citing papers after filters.

Revisiting Neural Processes via Fourier Transform and Volterra Series

cs.LG · 2026-05-31 · unverdicted · none · ref 115 · internal anchor

Graph-based Knowledge Distillation by Multi-head Attention Network

cs.LG · 2019-07-04 · unverdicted · none · ref 18 · internal anchor

Creating A Neural Pedagogical Agent by Jointly Learning to Review and Assess

cs.LG · 2019-06-26 · unverdicted · none · ref 28 · internal anchor

The General Theory of Localization Methods

cs.LG · 2026-05-20 · unverdicted · none · ref 8 · 2 links · internal anchor

Positional Encoding in Transformer-Based Time Series Models: A Survey

cs.LG · 2025-02-17 · unverdicted · none · ref 33 · internal anchor
