Calibration of Encoder Decoder Models for Neural Machine Translation

Aviral Kumar; Sunita Sarawagi

arxiv: 1903.00802 · v1 · pith:VA6WDHK6new · submitted 2019-03-03 · 💻 cs.LG · cs.CL · stat.ML


classification 💻 cs.LG · cs.CL · stat.ML

keywords: calibration, models, beam-search, machine, neural, translation, accuracy, attention


We study the calibration of several state-of-the-art neural machine translation (NMT) systems built on attention-based encoder-decoder models. For structured outputs such as those in NMT, calibration matters not only for reliable confidence in predictions but also for the proper functioning of beam-search inference. We show that most modern NMT models are surprisingly miscalibrated even when conditioned on the true previous tokens. Our investigation identifies two main causes: severe miscalibration of the EOS (end-of-sequence) marker and suppression of attention uncertainty. We design recalibration methods based on these signals and demonstrate improved accuracy, better sequence-level calibration, and more intuitive results from beam search.
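The abstract does not say how miscalibration is measured. As a rough illustration only, one widely used measure is a binned expected calibration error (ECE): predictions are grouped by confidence, and the gap between mean confidence and observed accuracy is averaged across bins, weighted by bin size. The sketch below assumes token-level confidences and correctness flags are already available; the function name `expected_calibration_error`, its parameters, and the choice of ten equal-width bins are hypothetical and not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE over equal-width confidence bins (illustrative sketch).

    confidences: predicted probabilities in [0, 1], one per prediction.
    correct:     booleans, True where the prediction matched the reference.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for p, hit in zip(confidences, correct):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        conf_sum[idx] += p
        hit_sum[idx] += 1.0 if hit else 0.0
        count[idx] += 1

    ece = 0.0
    for c, h, k in zip(conf_sum, hit_sum, count):
        if k == 0:
            continue  # empty bins contribute nothing
        # Weight each bin's |accuracy - mean confidence| gap by its share of samples.
        ece += (k / n) * abs(h / k - c / k)
    return ece
```

A model whose 0.9-confidence predictions are right only a quarter of the time would score a large gap under this measure, while a model whose confidences match its accuracy in every bin would score zero.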



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the ECUAS_n family of proper scoring rules for evaluating uncertainty-augmented systems, where n tunes the trade-off between prediction accuracy costs and uncertainty quality.

$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
cs.AI 2026-05 unverdicted novelty 6.0

Proposes ECUAS_n metrics as proper scoring rules for evaluating uncertainty-augmented systems, with n controlling cost trade-offs between predictions and uncertainties.

Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
cs.CL 2026-04 unverdicted novelty 5.0

Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.