
arxiv: 1906.12068 · v1 · pith:NK5QVKFKnew · submitted 2019-06-28 · 💻 cs.CL · cs.LG

Lost in Translation: Loss and Decay of Linguistic Richness in Machine Translation

Pith reviewed 2026-05-25 14:04 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords machine translation · lexical richness · lexical diversity · gender bias · human translation · empirical analysis · bias amplification

The pith

Machine translation systems produce less lexically diverse text than human translations by favoring frequent patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an empirical approach to measure how machine translation loses lexical richness relative to human translation. Experiments demonstrate that MT outputs fail to capture the diversity present in human-generated or human-translated text. MT systems tend to amplify already common words and phrases while suppressing less frequent ones. This pattern may contribute to problems such as gender bias in translations, separate from biases present in the training data. A reader would care because the work raises the possibility that the translation algorithm itself exacerbates certain imbalances.

Core claim

Current MT systems fail to render the lexical diversity of human generated or translated text. The inability of MT systems to generate diverse outputs and its tendency to exacerbate already frequent patterns while ignoring less frequent ones might be the underlying cause for, among others, the currently heavily debated issues related to gender biased output. Can we indeed, aside from biased data, talk about an algorithm that exacerbates seen biases?

What carries the argument

Empirical quantification of lexical richness loss between MT and HT outputs, isolating the effect of favoring frequent patterns over rarer ones.
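
As a rough illustration of what such a quantification involves, the sketch below computes the type-token ratio (TTR), one of the lexical richness measures named in the referee report further down. The function name and the two toy sentences are hypothetical and are not taken from the paper. TTR is sensitive to text length, so the two samples here have the same number of tokens.

```python
from typing import Iterable


def type_token_ratio(tokens: Iterable[str]) -> float:
    """Number of distinct tokens (types) divided by the total number of tokens."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy data (not from the paper): a human translation (HT)
# and a machine translation (MT) of equal length.
ht = "she sees the house and he beholds the garden".split()
mt = "she sees the house and he sees the house".split()

ttr_ht = type_token_ratio(ht)  # 8 types over 9 tokens
ttr_mt = type_token_ratio(mt)  # 6 types over 9 tokens
print(f"HT TTR = {ttr_ht:.3f}, MT TTR = {ttr_mt:.3f}")
```

A lower value for the MT sample is the kind of gap the paper reports, although a real comparison would run over full test sets rather than single sentences.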

If this is right

  • MT outputs exhibit lower lexical diversity than human translations.
  • MT amplifies frequent patterns while suppressing less frequent ones.
  • This amplification may drive gender bias in MT outputs beyond data biases.
  • The algorithm contributes to bias exacerbation independently of training data.
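
One way to make "amplifies frequent patterns" concrete is to compare the relative frequencies of competing translations of one source word in HT and in MT output, in the spirit of the Figure 5 caption. The sketch below does this for made-up counts; the helper names and all numbers are assumptions for illustration, not results from the paper.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw counts into shares that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def top_share_change(ht_counts, mt_counts):
    """Return the option that is most frequent in HT and how its share changes in MT."""
    ht_rel = relative_frequencies(ht_counts)
    mt_rel = relative_frequencies(mt_counts)
    top = max(ht_rel, key=ht_rel.get)
    return top, mt_rel.get(top, 0.0) - ht_rel[top]


# Hypothetical counts of Spanish translations of English 'picture'.
ht_counts = Counter({"imagen": 60, "foto": 25, "cuadro": 15})
mt_counts = Counter({"imagen": 85, "foto": 12, "cuadro": 3})

top, delta = top_share_change(ht_counts, mt_counts)
print(f"share of '{top}' changes by {delta:+.2f} from HT to MT")
```

A positive change for the already dominant option, paired with shrinking shares for the rarer ones, is the pattern the bullets describe.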

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectural changes to MT models may be needed to preserve diversity even after data debiasing.
  • Lexical richness metrics could serve as a routine check for other forms of output homogenization in generation tasks.
  • The same frequency-exacerbation mechanism might appear in non-translation language models trained on similar objectives.

Load-bearing premise

The measured drop in lexical richness stems from properties of the MT algorithm itself rather than solely from the training data.

What would settle it

A controlled experiment in which MT systems, trained on the same parallel data from which the human reference translations are drawn, match the lexical richness scores of those references would falsify the algorithmic cause.

Figures

Figures reproduced from arXiv: 1906.12068 by Andy Way, Dimitar Shterionov, Eva Vanmassenhove.

Figure 1. One-to-many relation between an English source word and some of its possible French translations.
Figure 2. One-to-many relation between English verb ‘see’ and its conjugations in French.
Figure 3. One-to-many relation between English adjective ‘smart’ and its male and female counterparts in French.
Figure 4. Back-translated data pipeline.
Figure 5. Relative frequencies of the Spanish translations of the English words ‘picture’ and ‘happen’.
Original abstract

This work presents an empirical approach to quantifying the loss of lexical richness in Machine Translation (MT) systems compared to Human Translation (HT). Our experiments show how current MT systems indeed fail to render the lexical diversity of human generated or translated text. The inability of MT systems to generate diverse outputs and its tendency to exacerbate already frequent patterns while ignoring less frequent ones, might be the underlying cause for, among others, the currently heavily debated issues related to gender biased output. Can we indeed, aside from biased data, talk about an algorithm that exacerbates seen biases?

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study quantifying the loss of lexical richness in machine translation (MT) outputs relative to human translations (HT), showing that MT systems reduce diversity by favoring frequent patterns and ignoring rarer ones; it suggests this algorithmic tendency, beyond training data, may underlie issues like gender bias in MT.

Significance. If the central attribution to algorithmic properties (rather than training data frequencies) can be isolated, the result would be significant for NLP, as it identifies a mechanism by which MT exacerbates biases and reduces output diversity, with implications for fairness and the design of generation models. The work provides an empirical quantification approach that could be extended if properly controlled.

major comments (2)
  1. [Experimental setup and results] The experimental design does not isolate algorithmic effects from training data statistics. No direct comparison is reported between MT output type-token ratios or Zipf exponents and the empirical n-gram distribution of the parallel training corpus, nor is there an ablation using frequency-matched synthetic references or a maximum-likelihood sampler from the training distribution (see the description of experiments and results). This is load-bearing for the claim that the observed loss and exacerbation of frequent patterns is due to the MT algorithm rather than reproduction of training skew.
  2. [Abstract and §1] The abstract and introduction frame the work as distinguishing algorithmic bias from data bias, but the reported comparisons are only between MT output and human translations without the controls needed to support that distinction (see abstract and §1).
minor comments (2)
  1. [Abstract] The abstract states experimental results but does not mention the specific metrics, datasets, or statistical tests used; this should be added for clarity even if details appear later in the paper.
  2. [Methods] Notation for lexical richness measures (e.g., type-token ratio, Zipf exponents) should be defined explicitly on first use with formulas.
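
For reference, the two measures the referee names have standard textbook definitions: the type-token ratio is V / N, where V is the number of distinct word types and N the number of tokens, and Zipf's law models the frequency of the word at rank r as f(r) ≈ C · r^(-s), with s the Zipf exponent. The sketch below estimates s by ordinary least squares on the log-log rank-frequency curve. This is one common estimator, chosen here for brevity; nothing in this section says which estimator, if any, the paper uses. The function name and the synthetic data are hypothetical.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate s in f(r) ~ C * r**(-s) via least squares on log f against log r."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # the fitted slope is -s


# Synthetic corpus with exact frequencies 840 / r for ranks 1..8,
# so the true exponent is 1 by construction.
tokens = [f"w{r}" for r in range(1, 9) for _ in range(840 // r)]
s = zipf_exponent(tokens)
print(f"estimated Zipf exponent: {s:.3f}")
```

Running the same estimator on MT output, on HT references, and on the training corpus would give the three-way comparison requested in major comment 1.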

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate where revisions will be made to the manuscript.

  1. Referee: The experimental design does not isolate algorithmic effects from training data statistics. No direct comparison is reported between MT output type-token ratios or Zipf exponents and the empirical n-gram distribution of the parallel training corpus, nor is there an ablation using frequency-matched synthetic references or a maximum-likelihood sampler from the training distribution (see the description of experiments and results). This is load-bearing for the claim that the observed loss and exacerbation of frequent patterns is due to the MT algorithm rather than reproduction of training skew.

    Authors: We agree that the current experiments do not include direct comparisons of MT outputs to the n-gram statistics of the parallel training corpus or ablations such as maximum-likelihood sampling from the training distribution. The manuscript demonstrates reduced lexical richness in MT relative to human translations but does not fully isolate algorithmic contributions from data frequencies. In revision we will add a comparison of type-token ratios and Zipf exponents between MT outputs and the training corpus itself, and we will discuss the implications for distinguishing algorithmic effects. Where feasible we will also reference or include a simple frequency-based baseline. revision: yes

  2. Referee: The abstract and introduction frame the work as distinguishing algorithmic bias from data bias, but the reported comparisons are only between MT output and human translations without the controls needed to support that distinction (see abstract and §1).

    Authors: The abstract and §1 present an empirical quantification of loss in MT versus HT and pose an open question about possible algorithmic exacerbation beyond data bias. We do not claim the existing results fully isolate algorithmic from data effects. We will revise the abstract and introduction to more precisely describe the scope of the current comparisons and to note explicitly that additional controls (such as those suggested) would be required to attribute effects specifically to the model rather than the training distribution. revision: yes
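
The kind of control discussed in both responses can be sketched with a toy model. Given one fixed distribution over translation options, standing in for the training data, a decoder that always picks the most probable option collapses the output to a single form, whereas a sampler that draws from the same distribution reproduces the data's diversity. The distribution, function names, and sample size below are hypothetical, and this is not the authors' experimental setup; it only shows why the two effects are separable in principle.

```python
import random
from collections import Counter


def argmax_decode(dist, n):
    """Always emit the single most probable option (mode-seeking decoding)."""
    best = max(dist, key=dist.get)
    return [best] * n


def sample_decode(dist, n, rng):
    """Draw n options in proportion to their probability in the distribution."""
    options = list(dist)
    weights = [dist[option] for option in options]
    return rng.choices(options, weights=weights, k=n)


# Hypothetical distribution over French renderings of one English word.
train_dist = {"voir": 0.5, "regarder": 0.3, "apercevoir": 0.2}

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
argmax_out = argmax_decode(train_dist, n)
sampled_out = sample_decode(train_dist, n, rng)

print("distinct forms, argmax decoding:", len(set(argmax_out)))
print("distinct forms, sampling:", len(set(sampled_out)))
```

If MT output were less diverse than such a sampler-based baseline built from the real training corpus, that would point to the decoding or modelling step rather than to data skew alone.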

Circularity Check

0 steps flagged

No circularity: the empirical measurements of lexical diversity are direct comparisons, with no derivations or fitted predictions.

The paper reports an empirical quantification of lexical richness loss by comparing type-token ratios, frequency distributions, and related metrics between MT outputs and human translations or references. No equations, parameter fitting, predictions derived from fitted inputs, or load-bearing self-citations are present in the provided text or abstract. The central claim rests on observable differences in the statistics of generated text rather than on any self-referential reduction or ansatz imported from prior work, so the analysis is self-contained and can be checked against external corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 2026-05-25T14:04:32.160366+00:00

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bias in Large Language Models: Origin, Evaluation, and Mitigation

    cs.CL · 2024-11 · unverdicted · novelty 2.0

    A literature review that categorizes bias in LLMs, surveys evaluation and mitigation techniques, and discusses ethical implications.
