
arXiv: 1903.00802 · v1 · pith:VA6WDHK6new · submitted 2019-03-03 · 💻 cs.LG · cs.CL · stat.ML

Calibration of Encoder Decoder Models for Neural Machine Translation

classification 💻 cs.LG cs.CL stat.ML
keywords calibration, models, beam-search, machine, neural, translation, accuracy, attention

We study the calibration of several state-of-the-art neural machine translation (NMT) systems built on attention-based encoder-decoder models. For structured outputs such as those in NMT, calibration matters not only for reliable confidence in predictions but also for the proper functioning of beam-search inference. We show that most modern NMT models are surprisingly miscalibrated, even when conditioned on the true previous tokens. Our investigation points to two main causes: severe miscalibration of the EOS (end-of-sequence) marker and suppression of attention uncertainty. We design recalibration methods based on these signals and demonstrate improved accuracy, better sequence-level calibration, and more intuitive results from beam search.
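As an illustration of what "miscalibrated" means at the token level, the sketch below computes a standard binned expected calibration error (ECE): predictions are grouped by confidence, and the gap between average confidence and empirical accuracy is averaged over the bins, weighted by bin size. This is a generic textbook formulation and not necessarily the exact metric used in the paper; the function name `expected_calibration_error`, the `n_bins` parameter, and the toy inputs are assumptions made for this example.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins.

    confidences: predicted probability of the chosen token, each in [0, 1].
    correct:     1 if the chosen token matched the reference, else 0.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bin edges over [0, 1]; interior edges define the bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


if __name__ == "__main__":
    # Toy data: the model is 90% confident on every token but right only 75% of the time.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A well-calibrated model yields an ECE near zero; an overconfident one, as in the toy data, yields a positive gap roughly equal to the difference between stated confidence and actual accuracy.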


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. $ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the ECUAS_n family of proper scoring rules for evaluating uncertainty-augmented systems, where n tunes the trade-off between prediction accuracy costs and uncertainty quality.

  2. $ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Proposes ECUAS_n metrics as proper scoring rules for evaluating uncertainty-augmented systems, with n controlling cost trade-offs between predictions and uncertainties.

  3. Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.