
arxiv: 2605.07453 · v1 · submitted 2026-05-08 · 💻 cs.CL

Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study

Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords data contamination · reproducibility · neural machine translation · hieroglyphics · low-resource languages · ancient languages · test set leakage · BLEU evaluation

The pith

Reproducing a hieroglyphic translation model uncovers data contamination that dramatically inflates reported performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates claims of strong neural machine translation performance for translating ancient Egyptian hieroglyphics into German. A reproduction using the publicly released model achieves substantially lower scores than originally reported. The authors trace this discrepancy to overlap between training and test data, where some test targets appear verbatim in the training set. This leakage allows models to memorize rather than generalize, leading to misleadingly high evaluation metrics on contaminated examples. By removing the affected samples, they establish more realistic baseline scores for future work on this scarce dataset.

Core claim

The authors reproduce a prior study on hieroglyphic-to-German neural machine translation and obtain 37.0 BLEU instead of the claimed 61.5 BLEU. They find that 16 of the 50 test targets (32%) are identical to sentences in the training data, and that the share rises to 50% under an 8-gram overlap criterion with a 70% threshold. Contaminated test items score up to 83.8 BLEU while clean items score 30.9 to 39.2 BLEU. Document-level removal of contaminated sources lowers BLEU on the contaminated items by only 4.6 points, because 8 of the 16 targets persist via other source documents; target-level deduplication is required. The paper releases a decontaminated 34-sample test set and corrected baselines.
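
The headline figures in this claim can be cross-checked with simple arithmetic. The short Python sketch below only restates numbers quoted above; the variable names are illustrative and nothing in it is newly measured.

```python
# Figures quoted in the summary above; nothing here is newly measured.
n_test = 50             # test targets in the original test set
n_exact = 16            # test targets found verbatim in the training data
claimed_bleu = 61.5     # score reported by the original study
reproduced_bleu = 37.0  # score obtained with the released model

contaminated_share = n_exact / n_test                # fraction of leaked targets
clean_size = n_test - n_exact                        # size of the cleaned test set
bleu_gap = round(claimed_bleu - reproduced_bleu, 1)  # claimed minus reproduced

print(f"contaminated share: {contaminated_share:.0%}")  # 32%
print(f"decontaminated test set size: {clean_size}")    # 34
print(f"BLEU gap: {bleu_gap}")                          # 24.5
```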

What carries the argument

Exact string matches and n-gram overlaps between the training and test portions of the hieroglyphic translation dataset, which the authors use to identify and quantify data contamination.
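
A minimal sketch of such an audit is given below, in Python. It is not the authors' code. It assumes whitespace tokenization, and it reads the "70% threshold" as the fraction of a test target's 8-grams that also occur somewhere in the training targets; the paper may define the criterion differently. The function names and the toy German sentences are invented for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def normalize(text):
    """Collapse runs of whitespace (an assumed, minimal normalization)."""
    return " ".join(text.split())


def find_contamination(train_targets, test_targets, n=8, threshold=0.7):
    """Flag test targets that overlap with training targets.

    Returns (index, kind) pairs, where kind is "exact" for a verbatim match
    and "ngram" when at least `threshold` of the test target's n-grams
    also appear in the training targets.
    """
    train_norm = {normalize(t) for t in train_targets}
    train_ngrams = set()
    for target in train_norm:
        train_ngrams |= ngrams(target.split(), n)

    flagged = []
    for i, target in enumerate(test_targets):
        norm = normalize(target)
        if norm in train_norm:
            flagged.append((i, "exact"))
            continue
        grams = ngrams(norm.split(), n)
        # Targets shorter than n tokens have no n-grams and are never flagged here.
        if grams and len(grams & train_ngrams) / len(grams) >= threshold:
            flagged.append((i, "ngram"))
    return flagged


# Toy data, invented for illustration only.
train = [
    "der König gibt ein Opfer an Osiris den Herrn von Abydos",
    "er fährt nach Norden",
]
test = [
    "der König gibt ein Opfer an Osiris den Herrn von Abydos",   # verbatim copy
    "der König gibt ein Opfer an Osiris den Herrn von Busiris",  # 3 of 4 8-grams shared
    "sie sieht das Haus",                                        # unrelated, too short
]
flagged = find_contamination(train, test)
print(flagged)  # [(0, 'exact'), (1, 'ngram')]
```

Checking exact matches before n-gram overlap keeps the two categories separate, which mirrors how the summary reports them (16/50 exact, 50% under the looser criterion).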

If this is right

  • Neural models perform significantly worse on uncontaminated test data for hieroglyphic translation.
  • Standard document-level decontamination fails to fully eliminate leakage when identical targets appear in multiple source documents.
  • Target-level deduplication is necessary to obtain reliable evaluation sets for formulaic corpora.
  • A cleaned test set of 34 samples is now available for assessing true model capability.
  • Current neural approaches achieve roughly 31 to 39 BLEU (30.9 to 39.2 across the five configurations tested) on clean data for this endangered writing system.
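
The difference between document-level and target-level decontamination can be illustrated with a small Python sketch. It assumes a hypothetical representation in which each example is a `(doc_id, target)` pair and filtering is applied on the training side. The paper's released test set instead drops the 16 affected test items, so this shows the principle and is not the authors' procedure.

```python
def document_level_filter(train_pairs, test_pairs):
    """Drop training pairs whose source document also occurs in the test set."""
    test_docs = {doc for doc, _ in test_pairs}
    return [(doc, tgt) for doc, tgt in train_pairs if doc not in test_docs]


def target_level_filter(train_pairs, test_pairs):
    """Drop training pairs whose target string occurs in the test set."""
    test_targets = {tgt for _, tgt in test_pairs}
    return [(doc, tgt) for doc, tgt in train_pairs if tgt not in test_targets]


def leaked_targets(train_pairs, test_pairs):
    """Return the test targets that still appear verbatim in training."""
    train_targets = {tgt for _, tgt in train_pairs}
    return sorted({tgt for _, tgt in test_pairs if tgt in train_targets})


# Toy data, invented for illustration: the same formula occurs in two documents.
train = [
    ("doc_A", "ein Opfer, das der König gibt"),
    ("doc_B", "ein Opfer, das der König gibt"),
    ("doc_C", "er fährt nach Norden"),
]
test = [("doc_A", "ein Opfer, das der König gibt")]

after_doc = document_level_filter(train, test)
after_tgt = target_level_filter(train, test)

print(leaked_targets(after_doc, test))  # ['ein Opfer, das der König gibt']
print(leaked_targets(after_tgt, test))  # []
```

In the toy data the formula survives document-level filtering through `doc_B`, which is the mechanism the paper describes for the 8 of 16 targets that persist via other source documents.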

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar data leakage issues may undermine reported results in other low-resource or ancient language translation tasks where data is scarce and repetitive.
  • Researchers working with formulaic texts should routinely audit for train-test overlaps before claiming high performance.
  • Future models for such languages may need techniques that explicitly prevent memorization of repeated phrases.
  • Reproducibility studies like this one provide essential corrections to the literature on endangered language processing.

Load-bearing premise

That the released model and dataset files correspond precisely to those used in the original study, so that the identified overlaps account for the entire performance gap rather than other unreported differences in training or preprocessing.

What would settle it

Retraining the model from scratch following only the published description and data, then measuring performance on the decontaminated test set, would show whether scores remain low without the contamination.

Original abstract

Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora, making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9-39.2 BLEU / 0.622-0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents; target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9-39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a reproducibility study of hieroglyphic-to-German NMT. A prior work reported 61.5 BLEU with fine-tuned M2M-100; the authors obtain 37.0 BLEU using the released model. They identify 16/50 exact target matches between test and training data (rising to 50% of test targets under 8-gram overlap at a 70% threshold), show that contaminated samples score up to 83.8 BLEU / 0.924 COMET-22 while clean samples score 30.9-39.2 BLEU / 0.622-0.676 COMET-22 across five configurations on two architectures, argue that document-level decontamination is insufficient, and release a 34-sample decontaminated test set with corrected baselines.

Significance. If the central empirical findings hold, the work provides a concrete demonstration of how even small amounts of target leakage can inflate NMT scores in low-resource settings and supplies corrected baselines plus a cleaned test set that future studies of this endangered script can use. The multi-configuration comparison and explicit metric splits strengthen the case for routine contamination audits in scarce-data NMT.

major comments (1)
  1. [Abstract and results on the performance gap] The attribution of the full 24.5 BLEU gap (61.5 vs. 37.0) to the identified 32% target contamination (16/50) is load-bearing for the reproducibility claim, yet the manuscript provides no verification that the released model was trained under the exact conditions (hyperparameters, preprocessing pipeline, data splits, or augmentations) described in the original study. Without training logs or a side-by-side comparison, unreported differences could independently raise scores on both clean and contaminated subsets.
minor comments (2)
  1. [Abstract] The abstract reports 'up to 83.8 BLEU' on contaminated samples; reporting the mean and standard deviation on the contaminated subset (rather than the maximum) would make the inflation claim more precise and easier to compare with the clean-subset range.
  2. [Abstract] The five model configurations are referenced but not enumerated in the summary; a brief parenthetical or table reference early in the paper would improve readability.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive review and for highlighting the need for careful verification of experimental conditions in reproducibility studies. We address the major comment point by point below and have made targeted revisions to improve clarity without overstating our claims.

Point-by-point responses
  1. Referee: [Abstract and results on the performance gap] The attribution of the full 24.5 BLEU gap (61.5 vs. 37.0) to the identified 32% target contamination (16/50) is load-bearing for the reproducibility claim, yet the manuscript provides no verification that the released model was trained under the exact conditions (hyperparameters, preprocessing pipeline, data splits, or augmentations) described in the original study. Without training logs or a side-by-side comparison, unreported differences could independently raise scores on both clean and contaminated subsets.

    Authors: We agree that we cannot independently verify the precise training conditions of the original model. Our reproduction uses the publicly released model weights and inference code exactly as provided, without any retraining or parameter changes. We lack access to the original training logs, environment, or unreported augmentations, so we cannot rule out that differences in the original training procedure contributed to the gap between the claimed 61.5 BLEU and our 37.0 BLEU reproduction. We have revised the manuscript (abstract, Section 3, and a new limitations paragraph) to explicitly state that the released model is used as-is and that the overall gap may involve factors beyond contamination. At the same time, the core empirical result, that target contamination inflates scores on the test set, remains robust: across five configurations on two architectures, contaminated examples score up to 83.8 BLEU / 0.924 COMET-22 while clean examples score 30.9-39.2 BLEU / 0.622-0.676 COMET-22. This subset analysis is performed at evaluation time on the fixed released model and is therefore independent of how the model was originally trained. The decontaminated 34-sample test set and associated baselines we release therefore provide a reliable reference regardless of the source of the original gap.

    revision: partial

standing simulated objections not resolved
  • We cannot provide training logs, exact hyperparameter values, preprocessing details, or a side-by-side training comparison for the original model, as these are not available from the released materials.

Circularity Check

0 steps flagged

Empirical reproducibility study with no derivation chain or fitted predictions

Full rationale

This is a direct empirical reproduction and measurement study. The authors download the released model, identify exact target matches and n-gram overlaps in the test set, then compute BLEU and COMET scores on contaminated versus clean subsets across multiple configurations. No equations are present, no parameters are fitted to a subset and then presented as predictions, and no self-citations or uniqueness theorems are invoked to justify the central measurements. The reported score gaps (e.g., 83.8 vs 30.9-39.2 BLEU) are computed directly from the data and models, rendering the study self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical reproducibility study. No free parameters are fitted, no mathematical axioms are invoked, and no new entities are postulated; all claims rest on direct data inspection and metric computation.


