
arxiv: 1906.09675 · v1 · pith:YD7C27QCnew · submitted 2019-06-24 · 💻 cs.CL

Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models

Pith reviewed 2026-05-25 18:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual NMT · decoder parameter sharing · zero-shot translation · supervised translation · WMT shared task

The pith

Models with task-specific decoder parameters outperform those with fully shared decoders across supervised and zero-shot multilingual translation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains multilingual neural machine translation models on WMT 2019 parallel data and compares different ways of sharing decoder parameters across translation directions. It measures performance in 110 unique directions, many of them zero-shot pairs that lack direct training data. Where no gold-standard parallel data exists for a zero-shot pair, it uses additional test sets and adapts evaluation techniques from unsupervised machine translation. The central result is that allowing some decoder parameters to remain unique to each task produces higher-quality output than forcing all decoder parameters to be identical across tasks. This finding addresses a practical design choice in building systems that must handle many languages at once without a separate model for each pair.

Core claim

Models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.

What carries the argument

Methods for full or partial sharing of decoder parameters in multilingual NMT, where task-specific parameters allow separate adaptation per translation direction while shared parameters capture cross-lingual patterns.
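The distinction between full and partial sharing can be made concrete with a small parameter registry. The sketch below is illustrative only: the task labels, the component names (`self_attn`, `enc_attn`, `ffn`), and the choice of which component is kept private are assumptions made for the example, not the configurations evaluated in the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DecoderParams:
    # Parameters used by every task (translation direction).
    shared: dict
    # Parameters private to one task: task -> {component name: vector}.
    per_task: dict = field(default_factory=dict)


def init_vector(size, rng):
    """Small random vector standing in for a real weight matrix."""
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]


def build_decoder(tasks, shared_names, private_names, size=4, seed=0):
    """Create one shared copy of `shared_names` and one copy per task of `private_names`."""
    rng = random.Random(seed)
    shared = {name: init_vector(size, rng) for name in shared_names}
    per_task = {
        task: {name: init_vector(size, rng) for name in private_names}
        for task in tasks
    }
    return DecoderParams(shared, per_task)


def params_for(decoder, task):
    """Parameters seen when decoding `task`: shared ones plus that task's private ones."""
    merged = dict(decoder.shared)
    merged.update(decoder.per_task.get(task, {}))
    return merged


def count_params(decoder):
    """Total number of scalar parameters held by the decoder."""
    total = sum(len(v) for v in decoder.shared.values())
    total += sum(len(v) for p in decoder.per_task.values() for v in p.values())
    return total


# Hypothetical task labels and component names, for illustration only.
tasks = ["xx->yy", "yy->xx", "xx->zz"]
names = ["self_attn", "enc_attn", "ffn"]

fully_shared = build_decoder(tasks, shared_names=names, private_names=[])
partly_shared = build_decoder(
    tasks, shared_names=["self_attn", "ffn"], private_names=["enc_attn"]
)

print(count_params(fully_shared), count_params(partly_shared))
```

In this toy setup the partially shared decoder holds more parameters in total, because each task keeps its own copy of one component. That capacity difference is the confound the referee report raises about the zero-shot comparison.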

If this is right

  • Partial decoder sharing yields better results than full sharing in both supervised and zero-shot settings.
  • Trade-offs exist between the amount of parameter sharing and translation quality across the 110 directions tested.
  • The approach scales to large training data volumes while maintaining gains from task-specific components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designs that keep a modest number of decoder parameters private per task could reduce the need for separate models in production multilingual systems.
  • The same partial-sharing pattern may apply to encoder parameters or other components if similar ablation studies were run.

Load-bearing premise

Repurposed evaluation methods from unsupervised machine translation accurately reflect true zero-shot translation quality for language pairs without gold-standard parallel data.
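One reference-free proxy of this kind is round-trip scoring, which the referee report names as an example; whether it matches the paper's exact procedure is not confirmed here. The sketch below assumes hypothetical `forward` and `backward` translation callables and uses unigram F1 as a simple stand-in for BLEU. The toy word tables exist only to make the example runnable.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Unigram overlap F1 between two whitespace-tokenized strings (stand-in for BLEU)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_score(sentences, forward, backward, metric=unigram_f1):
    """Translate source -> target -> source and score the result against the original."""
    scores = [metric(backward(forward(s)), s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0


# Toy word-for-word "translators" (assumptions, not real models).
TOY_FORWARD = {"the": "le", "cat": "chat", "sleeps": "dort"}
TOY_BACKWARD = {"le": "the", "chat": "cat"}  # "dort" is missing, so the trip is lossy


def toy_forward(sentence):
    return " ".join(TOY_FORWARD.get(w, w) for w in sentence.split())


def toy_backward(sentence):
    return " ".join(TOY_BACKWARD.get(w, w) for w in sentence.split())


lossy = round_trip_score(["the cat sleeps"], toy_forward, toy_backward)
copying = round_trip_score(["the cat sleeps"], lambda s: s, lambda s: s)
print(round(lossy, 3), copying)
```

The second call shows why the premise is load-bearing: a system that simply copies its input gets a perfect round-trip score without translating anything, so a high proxy score does not by itself establish translation quality.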

What would settle it

Human evaluation or new gold parallel test sets for several zero-shot pairs that directly compare BLEU or other automatic scores against human judgments of translation adequacy.

Figures

Figures reproduced from arXiv: 1906.09675 by Chris Hokamp, Demian Gholipour, John Glover.

Figure 1. The decoder component of the transformer.
Figure 2. Validation performance during training on one of the validation datasets. The language embeddings from the EMB system are visualized in Figure 3.
Figure 3. Language embeddings of the EMB system projected with UMAP (McInnes et al., 2018).
Excerpt from Section 3.1 (Results), captured alongside Figure 3: We conduct four different evaluations of the performance of our models. First, we check performance on the 22 supervised pairs using dev and test sets from the WMT shared task. We then try to evaluate zero-shot translation performance in several ways. We use the TED talks multi-parallel dataset (Ye et al., 2018…
Abstract

We study several methods for full or partial sharing of the decoder parameters of multilingual NMT models. We evaluate both fully supervised and zero-shot translation performance in 110 unique translation directions using only the WMT 2019 shared task parallel datasets for training. We use additional test sets and re-purpose evaluation methods recently used for unsupervised MT in order to evaluate zero-shot translation performance for language pairs where no gold-standard parallel data is available. To our knowledge, this is the largest evaluation of multi-lingual translation yet conducted in terms of the total size of the training data we use, and in terms of the diversity of zero-shot translation pairs we evaluate. We conduct an in-depth evaluation of the translation performance of different models, highlighting the trade-offs between methods of sharing decoder parameters. We find that models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.
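The count of 110 directions is consistent with 11 languages, since n languages give n(n-1) ordered pairs and 11 × 10 = 110. The language count is an inference from that arithmetic, not something stated in the abstract. The sketch below enumerates the directions and splits them into supervised and zero-shot sets. The language codes and the particular set of 22 supervised directions are placeholders, and treating the paper's 22 supervised pairs as 22 of the 110 directions is an assumption.

```python
from itertools import permutations


def split_directions(languages, supervised):
    """Return (all directions, zero-shot directions) for a set of supervised directions."""
    all_directions = set(permutations(languages, 2))
    supervised = set(supervised)
    unknown = supervised - all_directions
    if unknown:
        raise ValueError(f"supervised directions outside the language set: {sorted(unknown)}")
    return all_directions, all_directions - supervised


# Placeholder language codes; the real language list is not given in this text.
languages = [f"L{i}" for i in range(11)]

# Arbitrary placeholder choice of 22 supervised directions, NOT the paper's actual pairs.
supervised = sorted(permutations(languages, 2))[:22]

all_directions, zero_shot = split_directions(languages, supervised)
print(len(all_directions), len(supervised), len(zero_shot))
```

Under those assumptions, 88 of the 110 directions are zero-shot, which is why the validity of the zero-shot evaluation carries so much of the paper's weight.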

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates several methods for full or partial sharing of decoder parameters in multilingual NMT models trained on WMT 2019 parallel datasets. It assesses supervised and zero-shot performance across 110 translation directions, using standard test sets for the supervised pairs and, for pairs without gold-standard parallel data, additional test sets and repurposed unsupervised MT evaluation methods. The central finding is that models with task-specific decoder parameters outperform those with fully shared decoder parameters across these settings.

Significance. If the results hold, the work provides a large-scale empirical comparison of decoder parameter sharing strategies in multilingual translation, highlighting trade-offs and supporting the use of task-specific parameters. The diversity of evaluated zero-shot pairs is notable, though dependent on the validity of the proxy metrics.

major comments (2)
  1. [Zero-shot evaluation methods] The paper relies on repurposed unsupervised MT metrics (such as round-trip or back-translation consistency) for zero-shot pairs lacking gold data. However, there is no verification that these proxies correlate with actual translation quality or preserve model rankings between different decoder sharing configurations. Since models with task-specific parameters have more capacity, they may score better on the proxy tasks as an artifact of that capacity rather than of better translation, which would undermine the load-bearing outperformance claim for zero-shot translation.
  2. [Experimental setup] Details on data balancing, language pair selection criteria, and controls for per-direction training data volume are needed to confirm that performance differences are attributable to decoder sharing rather than imbalances in the 110-direction setup.
minor comments (2)
  1. [Abstract] The claim of conducting the 'largest evaluation' would be strengthened by quantitative comparison of training data volume and zero-shot pair count against prior multilingual NMT studies.
  2. [Methods] Notation for the different decoder sharing configurations (task-specific vs. fully shared) could be clarified with a table or diagram in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of decoder parameter sharing in multilingual NMT. We address each major comment below and will incorporate revisions where appropriate to strengthen the manuscript.

  1. Referee: [Zero-shot evaluation methods] The paper relies on repurposed unsupervised MT metrics (such as round-trip or back-translation consistency) for zero-shot pairs lacking gold data. However, there is no verification that these proxies correlate with actual translation quality or preserve model rankings between different decoder sharing configurations. Since models with task-specific parameters have more capacity, they may score better on the proxy tasks as an artifact of that capacity rather than of better translation, which would undermine the load-bearing outperformance claim for zero-shot translation.

    Authors: We agree that the manuscript does not contain an explicit verification (e.g., correlation analysis) of the proxy metrics against gold-standard BLEU or human judgments on directions where both are available. These proxies are drawn from established unsupervised MT evaluation practices, and the same trend of task-specific decoder superiority appears in our supervised results (where gold data exists). Nevertheless, to directly address the concern about capacity bias and ranking preservation, we will add a new subsection that computes proxy-to-BLEU correlations on the supervised test sets and checks whether the relative ordering of models is preserved under the proxies. revision: yes

  2. Referee: [Experimental setup] Details on data balancing, language pair selection criteria, and controls for per-direction training data volume are needed to confirm that performance differences are attributable to decoder sharing rather than imbalances in the 110-direction setup.

    Authors: The current manuscript summarizes the use of WMT 2019 parallel data but does not provide exhaustive per-direction statistics or explicit balancing procedures. We will expand the experimental setup section with the requested details: language-pair selection criteria from WMT 2019, any data balancing or upsampling applied during training, and tables or text reporting training data volume per direction to allow readers to assess whether differences are due to decoder sharing rather than data imbalance. revision: yes
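The check promised in the first response, whether a proxy preserves the model ordering given by gold-standard BLEU, can be sketched with a Spearman rank correlation. This is a minimal standard-library sketch, not the authors' analysis. The score lists are invented placeholders chosen to make the example runnable; they are not results from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1 for the smallest value; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def same_ordering(x, y):
    """True if both score lists rank the models in the same order."""
    return sorted(range(len(x)), key=x.__getitem__) == sorted(
        range(len(y)), key=y.__getitem__
    )


# Invented placeholder scores for four hypothetical models on one supervised direction.
gold_bleu = [20.0, 22.5, 21.0, 23.0]
proxy = [30.0, 35.0, 33.0, 34.0]

rho = spearman(gold_bleu, proxy)
print(round(rho, 3), same_ordering(gold_bleu, proxy))
```

With these placeholder numbers the correlation is high but the two best models swap places under the proxy. A correlation coefficient alone would not expose that swap, so the ordering check is worth reporting separately.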

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external benchmarks


The paper reports direct performance comparisons of multilingual NMT models on WMT 2019 test sets for supervised directions and repurposed unsupervised MT metrics for zero-shot pairs. No equations, derivations, fitted parameters, or self-citations are used to generate the central claim that task-specific decoder parameters outperform fully shared ones; the result is obtained by training and measuring on held-out external data. This matches the default case of a self-contained empirical study with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The paper rests on standard NMT training assumptions and the validity of repurposed zero-shot metrics; no new entities are introduced and the main contribution is comparative measurement rather than new parameters or axioms.

free parameters (1)
  • decoder parameter sharing configuration
    Choice of which decoder layers or parameters to share versus keep task-specific is a design decision selected to optimize observed performance.
axioms (2)
  • domain assumption WMT 2019 parallel datasets are sufficient and representative for training multilingual models across the evaluated languages.
    Training uses only these datasets as stated in the abstract.
  • domain assumption Evaluation methods repurposed from unsupervised MT provide valid proxies for zero-shot performance.
    Abstract explicitly states these methods are used for pairs without gold data.



Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 8 internal anchors

  1. [1]

    Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. https://www.aclweb.org/anthology/N19-1388 Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874--3884, Minneapol...

  2. [2]

    Maruan Al-Shedivat and Ankur Parikh. 2019. https://arxiv.org/abs/1904.02338 Consistency by agreement in zero-shot neural machine translation. In Proceedings of NAACL

  3. [3]

    Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091

  4. [4]

    Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of the Sixth International Conference on Learning Representations

  5. [5]

    Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. https://www.aclweb.org/anthology/C18-1263 Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3112--3122, Santa Fe, New Mexico, USA. Association for Computational Linguistics

  6. [6]

    Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. http://www.aclweb.org/anthology/W18-6401 Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272--307, Belgium, B...

  7. [7]

    Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. 2017. https://doi.org/10.18653/v1/P17-1176 A teacher-student framework for zero-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1925--1935, Vancouver, Canada. Association for Computational Linguistics

  8. [8]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  9. [9]

    Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. https://doi.org/10.3115/v1/P15-1166 Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 172...

  10. [10]

    Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. https://doi.org/10.18653/v1/N16-1101 Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866--875, San Diego, Cali...

  11. [11]

    Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2019. http://arxiv.org/abs/1906.01181 Improved zero-shot neural machine translation via ignoring spurious correlations

  12. [12]

    Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. http://arxiv.org/abs/1611.04798 Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798

  13. [13]

    Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

    Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. https://arxiv.org/abs/1611.04558 Google's multilingual neural machine translation system: Enabling zero-shot translation. Technical report, Google

  14. [14]

    Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. https://doi.org/10.18653/v1/P17-4012 OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL

  15. [15]

    Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291

  16. [16]

    Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. https://openreview.net/forum?id=rkYTTf-AZ Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings

  17. [17]

    Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  18. [18]

    Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. https://www.aclweb.org/anthology/W18-6309 A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 84--92, Belgium, Brussels. Association for Computational Linguistics

  19. [19]

    Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations

  20. [20]

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861

  21. [21]

    Devendra Sachan and Graham Neubig. 2018. https://www.aclweb.org/anthology/W18-6327 Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 261--271, Belgium, Brussels. Association for Computational Linguistics

  22. [22]

    Rico Sennrich and Barry Haddow. 2016. http://www.aclweb.org/anthology/W16-2209.pdf Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 83--91, Berlin, Germany. Association for Computational Linguistics

  23. [23]

    Lierni Sestorain, Massimiliano Ciaramita, Christian Buck, and Thomas Hofmann. 2019. https://openreview.net/forum?id=ByecAoAqK7 Zero-shot dual machine translation

  24. [24]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information P...

  25. [25]

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason R...

  26. [26]

    Qi Ye, Sachan Devendra, Felix Matthieu, Padmanabhan Sarguna, and Neubig Graham. 2018. When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL
