Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models
Pith reviewed 2026-05-25 18:01 UTC · model grok-4.3
The pith
Models with task-specific decoder parameters outperform those with fully shared decoders across supervised and zero-shot multilingual translation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.
What carries the argument
Methods for full or partial sharing of decoder parameters in multilingual NMT, where task-specific parameters allow separate adaptation per translation direction while shared parameters capture cross-lingual patterns.
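The difference between full and partial sharing can be pictured as a lookup from each translation direction to the parameter sets its decoder uses. The sketch below is illustrative only: the group names, the direction labels, and the choice of which group is private are assumptions made for the example, not the configurations studied in the paper.

```python
from typing import Dict, List

# Hypothetical decoder parameter groups; the names are illustrative only.
DECODER_GROUPS = ["embeddings", "self_attention", "encoder_attention", "feed_forward"]


def build_decoder_table(tasks: List[str], private_groups: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each task to the parameter set it uses for every decoder group.

    Groups listed in `private_groups` get one parameter set per task;
    every other group points at a single set shared by all tasks.
    """
    table: Dict[str, Dict[str, str]] = {}
    for task in tasks:
        table[task] = {
            group: f"{group}/{task}" if group in private_groups else f"{group}/shared"
            for group in DECODER_GROUPS
        }
    return table


def count_parameter_sets(table: Dict[str, Dict[str, str]]) -> int:
    """Number of distinct parameter sets that have to be stored."""
    return len({name for groups in table.values() for name in groups.values()})


# Assumed task labels: one task per translation direction.
tasks = ["en-de", "en-fr", "de-fr"]
fully_shared = build_decoder_table(tasks, private_groups=[])
partly_shared = build_decoder_table(tasks, private_groups=["encoder_attention"])

print(count_parameter_sets(fully_shared))   # 4
print(count_parameter_sets(partly_shared))  # 6
```

Making one group private adds one parameter set per task for that group, which is the capacity difference the referee report below asks the authors to account for.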
If this is right
- Partial decoder sharing yields better results than full sharing in both supervised and zero-shot settings.
- Trade-offs exist between the amount of parameter sharing and translation quality across the 110 directions tested.
- The approach scales to large training data volumes while maintaining gains from task-specific components.
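As a point of arithmetic, 110 is the number of ordered source-target pairs among 11 languages (11 × 10). The sketch below works through that count under two loudly stated assumptions that are not taken from the source: that there are 11 languages, and that parallel data exists only for pairs involving one hub language. The language names are placeholders.

```python
from itertools import permutations

# Placeholder language names; the source states only the total of 110 directions.
languages = [f"L{i}" for i in range(11)]

# Every ordered (source, target) pair with source != target.
directions = list(permutations(languages, 2))

# Assumption for illustration: only pairs that include the hub have parallel data.
hub = "L0"
supervised = [d for d in directions if hub in d]
zero_shot = [d for d in directions if hub not in d]

print(len(directions), len(supervised), len(zero_shot))  # 110 20 90
```

Under these assumptions most directions would be zero-shot, which is why the validity of the zero-shot evaluation carries so much of the argument.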
Where Pith is reading between the lines
- Designs that keep a modest number of decoder parameters private per task could reduce the need for separate models in production multilingual systems.
- The same partial-sharing pattern might carry over to encoder parameters or other components; confirming this would require similar ablation studies.
Load-bearing premise
Repurposed evaluation methods from unsupervised machine translation accurately reflect true zero-shot translation quality for language pairs without gold-standard parallel data.
What would settle it
Human evaluation or new gold parallel test sets for several zero-shot pairs that directly compare BLEU or other automatic scores against human judgments of translation adequacy.
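One way to run such a comparison is a rank correlation between automatic scores and human adequacy ratings across systems. The following is a minimal sketch, assuming Spearman's coefficient as the statistic (the source does not name one); the function names are hypothetical and the numbers are invented purely to exercise the code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented scores for four hypothetical systems, not results from the paper.
automatic_scores = [21.3, 18.7, 25.1, 15.2]
human_ratings = [3.4, 3.1, 4.0, 2.5]
rho = spearman(automatic_scores, human_ratings)
```

A coefficient near 1 would indicate that the automatic metric orders the systems the same way human judges do; a low or negative value would call the proxy into question.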
Original abstract
We study several methods for full or partial sharing of the decoder parameters of multilingual NMT models. We evaluate both fully supervised and zero-shot translation performance in 110 unique translation directions using only the WMT 2019 shared task parallel datasets for training. We use additional test sets and re-purpose evaluation methods recently used for unsupervised MT in order to evaluate zero-shot translation performance for language pairs where no gold-standard parallel data is available. To our knowledge, this is the largest evaluation of multi-lingual translation yet conducted in terms of the total size of the training data we use, and in terms of the diversity of zero-shot translation pairs we evaluate. We conduct an in-depth evaluation of the translation performance of different models, highlighting the trade-offs between methods of sharing decoder parameters. We find that models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates several methods for full or partial sharing of decoder parameters in multilingual NMT models trained on WMT 2019 parallel datasets. It assesses both supervised performance on standard test sets and zero-shot performance on 110 directions, using repurposed unsupervised MT evaluation methods for pairs without gold-standard parallels. The central finding is that models with task-specific decoder parameters outperform those with fully shared decoder parameters across these settings.
Significance. If the results hold, the work provides a large-scale empirical comparison of decoder parameter sharing strategies in multilingual translation, highlighting trade-offs and supporting the use of task-specific parameters. The diversity of evaluated zero-shot pairs is notable, though dependent on the validity of the proxy metrics.
major comments (2)
- [Zero-shot evaluation methods] The paper relies on repurposed unsupervised MT metrics (such as round-trip or back-translation consistency) for zero-shot pairs lacking gold data. However, there is no verification that these proxies correlate with actual translation quality or preserve model rankings between different decoder sharing configurations. Since models with task-specific parameters have more capacity, they may score better on the proxy tasks as an artifact of that capacity, undermining the load-bearing outperformance claim for zero-shot translation.
- [Experimental setup] Details on data balancing, language pair selection criteria, and controls for per-direction training data volume are needed to confirm that performance differences are attributable to decoder sharing rather than imbalances in the 110-direction setup.
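To make the first major comment concrete, here is a minimal sketch of a round-trip consistency score. It rests on several assumptions: the report names round-trip consistency only as an example of the proxies involved, unigram F1 stands in for whatever overlap metric the paper uses, and the translator functions are hypothetical stand-ins for a model's two directions.

```python
from collections import Counter
from typing import Callable


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Token-level F1 between two whitespace-tokenised strings."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_score(
    sentence: str,
    forward: Callable[[str], str],
    backward: Callable[[str], str],
) -> float:
    """Translate source -> target -> source and compare with the original."""
    return unigram_f1(sentence, backward(forward(sentence)))


# Toy lookup tables standing in for a real model (hypothetical).
toy_forward = {"the cat sleeps": "die Katze schläft"}
toy_backward = {"die Katze schläft": "the cat is sleeping"}

score = round_trip_score(
    "the cat sleeps",
    lambda s: toy_forward[s],
    lambda s: toy_backward[s],
)

# A system that merely copies its input gets a perfect round-trip score.
copy_score = round_trip_score("the cat sleeps", lambda s: s, lambda s: s)
```

The copy case shows why the referee asks for validation: a high round-trip score does not by itself show that the intermediate translation was any good.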
minor comments (2)
- [Abstract] The claim of conducting the 'largest evaluation' would be strengthened by quantitative comparison of training data volume and zero-shot pair count against prior multilingual NMT studies.
- [Methods] Notation for the different decoder sharing configurations (task-specific vs. fully shared) could be clarified with a table or diagram in the methods section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation of decoder parameter sharing in multilingual NMT. We address each major comment below and will incorporate revisions where appropriate to strengthen the manuscript.
- Referee: [Zero-shot evaluation methods] The paper relies on repurposed unsupervised MT metrics (such as round-trip or back-translation consistency) for zero-shot pairs lacking gold data. However, there is no verification that these proxies correlate with actual translation quality or preserve model rankings between different decoder sharing configurations. Since models with task-specific parameters have more capacity, they may score better on the proxy tasks as an artifact of that capacity, undermining the load-bearing outperformance claim for zero-shot translation.
  Authors: We agree that the manuscript does not contain an explicit verification (e.g., correlation analysis) of the proxy metrics against gold-standard BLEU or human judgments on directions where both are available. These proxies are drawn from established unsupervised MT evaluation practices, and the same trend of task-specific decoder superiority appears in our supervised results (where gold data exists). Nevertheless, to directly address the concern about capacity bias and ranking preservation, we will add a new subsection that computes proxy-to-BLEU correlations on the supervised test sets and checks whether the relative ordering of models is preserved under the proxies.
  Revision: yes
- Referee: [Experimental setup] Details on data balancing, language pair selection criteria, and controls for per-direction training data volume are needed to confirm that performance differences are attributable to decoder sharing rather than imbalances in the 110-direction setup.
  Authors: The current manuscript summarizes the use of WMT 2019 parallel data but does not provide exhaustive per-direction statistics or explicit balancing procedures. We will expand the experimental setup section with the requested details: language-pair selection criteria from WMT 2019, any data balancing or upsampling applied during training, and tables or text reporting training data volume per direction to allow readers to assess whether differences are due to decoder sharing rather than data imbalance.
  Revision: yes
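For readers unfamiliar with what "data balancing or upsampling" can look like, the sketch below shows temperature-based sampling, one common scheme in multilingual NMT. This is an assumption chosen for illustration: the source does not say which procedure, if any, the paper uses, and the corpus sizes and names here are invented.

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Probability of drawing a training example from each direction.

    temperature = 1.0 samples in proportion to corpus size; larger values
    flatten the distribution and so upsample low-resource directions.
    """
    total = sum(sizes.values())
    weights = {d: (n / total) ** (1.0 / temperature) for d, n in sizes.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


# Invented corpus sizes for two hypothetical directions.
sizes = {"high_resource": 9_000_000, "low_resource": 1_000_000}

proportional = sampling_probabilities(sizes, temperature=1.0)
flattened = sampling_probabilities(sizes, temperature=5.0)

print(round(proportional["low_resource"], 3))
print(round(flattened["low_resource"], 3))
```

Reporting the per-direction sizes together with the temperature (or the equivalent setting) would let readers judge how much each direction was actually seen during training.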
Circularity Check
No circularity: purely empirical evaluation on external benchmarks
The paper reports direct performance comparisons of multilingual NMT models on WMT 2019 test sets for supervised directions and repurposed unsupervised MT metrics for zero-shot pairs. No equations, derivations, fitted parameters, or self-citations are used to generate the central claim that task-specific decoder parameters outperform fully shared ones; the result is obtained by training and measuring on held-out external data. This matches the default case of a self-contained empirical study with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- decoder parameter sharing configuration
axioms (2)
- Domain assumption: WMT 2019 parallel datasets are sufficient and representative for training multilingual models across the evaluated languages.
- Domain assumption: Evaluation methods repurposed from unsupervised MT provide valid proxies for zero-shot performance.