
arxiv: 2604.07015 · v1 · submitted 2026-04-08 · 💻 cs.CL

Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords Nahuatl · low-resource languages · corpus duplication · word embeddings · semantic similarity · data augmentation · indigenous languages

The pith

Incremental duplication of a small Nahuatl corpus moderately improves static embeddings for semantic similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether controlled duplication of scarce text can aid NLP for languages with almost no training data. Nawatl (Nahuatl) is spoken by over two million people, yet its available corpus is small and spread across many dialectal varieties. The authors expand their π-yalli collection by incremental duplication and train static embeddings on the result. They evaluate those embeddings on sentence-level semantic similarity and record a moderate gain over embeddings trained on the original unexpanded data. The finding suggests a low-cost route to stretching limited resources for languages that cannot gather large new collections.

Core claim

Applying incremental duplication to the limited Nawatl corpus produces static embeddings that reach moderately higher performance on a sentence semantic similarity task than embeddings trained on the corpus without expansion. The authors note that this controlled expansion technique has not been reported before in the literature for such settings.

What carries the argument

Incremental duplication, the process of systematically repeating the existing limited texts in controlled increments to enlarge the training set for embedding learning.
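As a rough illustration of the idea, the sketch below builds a series of corpora in which the whole sentence list is repeated 1, 2, ..., k times. The function name `incremental_duplicates`, the `max_factor` parameter, and the choice of whole-corpus integer multiples are assumptions made for illustration; the summary above does not state the schedule the authors actually used.

```python
from typing import Iterator, List, Tuple


def incremental_duplicates(
    sentences: List[str], max_factor: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (factor, corpus) pairs, where corpus is the input repeated `factor` times.

    Assumption: duplication is applied to the whole corpus in integer steps.
    """
    if max_factor < 1:
        raise ValueError("max_factor must be at least 1")
    for factor in range(1, max_factor + 1):
        # factor == 1 is the unexpanded baseline corpus.
        yield factor, sentences * factor


# Placeholder sentences, not real corpus data.
toy_corpus = ["sentence a", "sentence b", "sentence c"]
expanded = list(incremental_duplicates(toy_corpus, max_factor=3))
```

Each yielded corpus could then be passed to an embedding trainer, so that every duplication factor is compared against the factor-1 baseline under otherwise identical settings.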

If this is right

  • Static embeddings improve moderately on semantic similarity when the corpus is expanded by incremental duplication.
  • The technique supplies a practical way to enlarge training material for low-resource languages without new data collection.
  • Embeddings can be made more suitable for downstream NLP tasks by first applying controlled duplication to small existing collections.
  • The same expansion step can be inserted before training whenever only limited text is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the moderate gain generalizes, similar duplication might help other agglutinative or polysynthetic languages with sparse digital resources.
  • Applying the same duplicated corpus to tasks such as part-of-speech tagging or simple translation would test whether the benefit extends beyond similarity judgments.
  • The duplication step could be combined with cross-lingual transfer from Spanish or other neighboring languages to further stretch the data.

Load-bearing premise

Duplicating the existing sentences does not create artificial repetitions or biases that improve scores on the chosen similarity test without reflecting genuine language patterns.

What would settle it

Running the same embedding training and similarity evaluation on a fresh held-out set of Nahuatl sentences and finding either no gain or a performance drop when the duplicated corpus is used instead of the original.
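A held-out check of this kind could be scored along the following lines. This is a minimal sketch under stated assumptions: sentence vectors are taken as the mean of word vectors, similarity is cosine, and agreement with reference scores is measured with Kendall's tau-a. The paper's figure reports a Kendall coefficient, but the pooling method and the exact tau variant are not given in this summary, and all function names here are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def sentence_vector(
    tokens: Sequence[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs. Ties add zero."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)
```

Computing the coefficient once for embeddings trained on the original corpus and once for those trained on the duplicated corpus, on the same held-out sentence pairs, gives the comparison described above.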

Figures

Figures reproduced from arXiv: 2604.07015 by Elvys Linhares-Pontes, Graham Ranger, Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Luis-Gil Moreno-Jiménez, Martha-Lorena Avendaño-Garrido, Miguel Figueroa-Saavedra.

Figure 1. Sentence semantic similarity task: Kendall’s coefficient.
Original abstract

In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $\pi$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $\pi$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $\pi$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines whether controlled incremental duplication of a small Nahuatl (Nawatl) corpus can improve static word embeddings for sentence-level semantic similarity in extremely low-resource agglutinative languages. It trains embeddings on the original π-yalli corpus versus an incrementally duplicated version and reports a moderate performance gain on the similarity task, claiming the technique has not been previously applied in the literature.

Significance. If substantiated with quantitative controls, the result would offer a low-cost data-augmentation heuristic for π-languages where corpora are tiny; however, the absence of any reported metrics, baselines, or statistical tests currently prevents assessment of whether duplication adds linguistic signal or merely re-weights frequencies.

major comments (3)
  1. [Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.
  2. [Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.
  3. [Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.
minor comments (2)
  1. [Abstract] The abstract states the language is spoken by 'over 2 million people' yet later refers to it as a π-language with 'virtually non-existent' corpora; a brief clarification of the actual size of the π-yalli corpus (token count, number of documents) would help readers gauge the scale of the duplication experiment.
  2. [Introduction] Notation: the term 'π-languages' is introduced without a formal definition or citation; a short footnote or reference to prior usage would improve clarity.
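The frequency-bias concern in major comment 2 can be made concrete with a small count-based sketch. Repeating a whole corpus k times multiplies every token count by k, so relative frequencies stay the same; what changes is which tokens clear a minimum-count cutoff. The `min_count` parameter below is a hypothetical stand-in for the kind of vocabulary threshold that embedding toolkits commonly expose, and nothing here is taken from the paper's own setup.

```python
from collections import Counter
from typing import Dict, List, Set


def token_counts(sentences: List[str]) -> Counter:
    """Count whitespace-separated tokens across all sentences."""
    return Counter(tok for s in sentences for tok in s.split())


def relative_freqs(counts: Counter) -> Dict[str, float]:
    """Convert raw counts to proportions of the total token count."""
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def vocab_above(counts: Counter, min_count: int) -> Set[str]:
    """Tokens whose count reaches the (hypothetical) minimum-count cutoff."""
    return {tok for tok, c in counts.items() if c >= min_count}


# Placeholder tokens, not real corpus data.
base = ["a b", "a c"]
tripled = base * 3

base_counts = token_counts(base)
tripled_counts = token_counts(tripled)

same_proportions = relative_freqs(base_counts) == relative_freqs(tripled_counts)
kept_before = vocab_above(base_counts, min_count=3)
kept_after = vocab_above(tripled_counts, min_count=3)
```

In this toy case the proportions are identical while the retained vocabulary grows, which is one way a gain could arise without any new linguistic signal. Whether that mechanism explains the reported improvement is exactly what the requested controls would need to show.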

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our work. We address each of the major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.

    Authors: We agree that the abstract would benefit from including key quantitative results to make the central claim self-contained. The manuscript body reports the sentence-level semantic similarity performance (including cosine similarity and correlation metrics) for the original versus incrementally duplicated corpora. In the revision we will update the abstract to state the specific numerical gains observed, along with any available baseline comparisons.

    Revision: yes

  2. Referee: [Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.

    Authors: We accept that additional detail is required. The incremental duplication was performed by creating successive versions of the corpus at increasing integer multiplicities (starting from the original 1x and adding copies up to a chosen maximum), with the schedule determined by monitoring embedding quality on a small held-out validation subset. The train/test partition was executed first on the original corpus using a random split that reserves a fixed percentage of sentences exclusively for evaluation; duplication was then applied only to the training portion. We will expand the Methods section to state the exact multiplicities, the validation-based selection criterion, and the pre-duplication split procedure.

    Revision: yes

  3. Referee: [Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.

    Authors: We agree this control must be made explicit. Because the split was performed on the original corpus before any duplication occurred, the test sentences remain untouched and are never present in the duplicated training data. This design ensures the observed gains cannot be attributed to test-set leakage or simple distributional shift. We will add an explicit statement, together with a short description of the split ratio and randomization seed, to the Evaluation section.

    Revision: yes
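The split-before-duplication control described in the responses could look roughly like the sketch below. It is an illustration only: the function name, the `test_fraction`, `factor`, and `seed` parameters, and the step that removes identical sentences before splitting are all assumptions, not details confirmed by the paper.

```python
import random
from typing import List, Tuple


def split_then_duplicate(
    sentences: List[str], test_fraction: float, factor: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Hold out test sentences first, then duplicate only the training portion."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # Assumption: collapse identical sentences first, so the same sentence
    # cannot end up on both sides of the split.
    unique = sorted(set(sentences))
    if len(unique) < 2:
        raise ValueError("need at least two distinct sentences")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    test, train = unique[:n_test], unique[n_test:]
    # Duplication touches the training portion only; test stays at 1x.
    return train * factor, test
```

A quick disjointness check such as `set(train).isdisjoint(test)` after the call is a cheap way to confirm that no held-out sentence leaked into the duplicated training data.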

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of embeddings on duplicated vs. original corpus

full rationale

The paper reports an experimental study: it applies incremental duplication to the π-yalli Nahuatl corpus, trains static embeddings, and measures performance on a sentence-level semantic similarity task. The central claim is a moderate improvement in that task. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The result is a direct head-to-head metric comparison against an external benchmark (the similarity task), with no reduction of any 'prediction' to the input by construction. Self-citation is absent from the load-bearing steps. This is a standard empirical ablation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of word embedding training and evaluation that are not detailed here.



