Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
Incremental duplication of a small Nahuatl corpus moderately improves static embeddings for semantic similarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying incremental duplication to the limited Nawatl corpus produces static embeddings that reach moderately higher performance on a sentence semantic similarity task than embeddings trained on the corpus without expansion. The authors note that this controlled expansion technique has not been reported before in the literature for such settings.
What carries the argument
Incremental duplication, the process of systematically repeating the existing limited texts in controlled increments to enlarge the training set for embedding learning.
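A minimal Python sketch of what such controlled repetition could look like, assuming the corpus is held in memory as a list of sentences. The function names `duplicate_corpus` and `incremental_versions`, the placeholder sentences, and the toy factors are illustrative assumptions, not the authors' code.

```python
from typing import Iterable, Iterator, List, Tuple


def duplicate_corpus(sentences: List[str], rho: int) -> List[str]:
    """Return the corpus repeated rho times (rho = 1 is the original corpus)."""
    if rho < 1:
        raise ValueError("rho must be a positive integer")
    return sentences * rho


def incremental_versions(
    sentences: List[str], factors: Iterable[int]
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (rho, expanded corpus) for each duplication factor, smallest first."""
    for rho in sorted(factors):
        yield rho, duplicate_corpus(sentences, rho)


# Placeholder text only; a real run would load the actual corpus instead.
corpus = ["placeholder sentence one", "placeholder sentence two"]

# Hypothetical factors chosen for illustration.
sizes = {rho: len(expanded) for rho, expanded in incremental_versions(corpus, [1, 2, 4])}
print(sizes)
```

Each expanded version would then be passed to the same embedding trainer with the same hyperparameters, so that the duplication factor is the only thing that varies between runs.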
If this is right
- Static embeddings improve moderately on semantic similarity when the corpus is expanded by incremental duplication.
- The technique supplies a practical way to enlarge training material for low-resource languages without new data collection.
- Embeddings can be made more suitable for downstream NLP tasks by first applying controlled duplication to small existing collections.
- The same expansion step can be inserted before training whenever only limited text is available.
Where Pith is reading between the lines
- If the moderate gain generalizes, similar duplication might help other agglutinative or polysynthetic languages with sparse digital resources.
- Applying the same duplicated corpus to tasks such as part-of-speech tagging or simple translation would test whether the benefit extends beyond similarity judgments.
- The duplication step could be combined with cross-lingual transfer from Spanish or other neighboring languages to further stretch the data.
Load-bearing premise
Duplicating the existing sentences does not introduce repetition artifacts or biases that raise scores on the chosen similarity test without reflecting genuine language patterns.
What would settle it
Running the same embedding training and similarity evaluation on a fresh held-out set of Nahuatl sentences and finding either no gain or a performance drop when the duplicated corpus is used instead of the original.
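One hedged way to frame that check in code is sketched below. It assumes sentence vectors are built by averaging static word vectors, that similarity is measured by cosine, and that agreement with human judgments is summarized by a rank correlation (Kendall's tau-a here). These are assumptions made for illustration; the text above does not say which composition method or metric the paper uses. All helper names and the toy numbers are hypothetical and are not results from the paper.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens; zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Arbitrary toy numbers, NOT results from the paper.
gold = [0.9, 0.1, 0.5]      # hypothetical human similarity judgments
system_a = [0.2, 0.4, 0.3]  # hypothetical model scores, ranked in reverse
system_b = [0.8, 0.2, 0.6]  # hypothetical model scores, ranked like gold

tau_a = kendall_tau_a(gold, system_a)
tau_b = kendall_tau_a(gold, system_b)
```

In a real comparison, `system_a` and `system_b` would be the cosine scores produced on the held-out pairs by embeddings trained on the original and on the duplicated corpus respectively.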
Original abstract
In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $\pi$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $\pi$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $\pi$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines whether controlled incremental duplication of a small Nahuatl (Nawatl) corpus can improve static word embeddings for sentence-level semantic similarity in extremely low-resource agglutinative languages. It trains embeddings on the original π-yalli corpus versus an incrementally duplicated version and reports a moderate performance gain on the similarity task, claiming the technique has not been previously applied in the literature.
Significance. If substantiated with quantitative controls, the result would offer a low-cost data-augmentation heuristic for π-languages where corpora are tiny; however, the absence of any reported metrics, baselines, or statistical tests currently prevents assessment of whether duplication adds linguistic signal or merely re-weights frequencies.
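The re-weighting concern can be made concrete with a small Python sketch. Repeating the whole corpus ρ times multiplies every token count by ρ, so relative frequencies stay exactly the same; what changes is the absolute count of each word. The sketch assumes a trainer that discards words below a minimum count, which is a common option in embedding toolkits but is not confirmed here as part of the authors' setup. The names `relative_frequencies` and `min_count` and the toy tokens are illustrative.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(tokens):
    """Exact relative frequency of each token, as a Fraction."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: Fraction(count, total) for word, count in counts.items()}


tokens = "a b a c a b".split()  # toy tokens, not real corpus data
rho = 4                          # hypothetical duplication factor
min_count = 4                    # hypothetical vocabulary threshold

base_counts = Counter(tokens)
dup_counts = Counter(tokens * rho)

# Uniform duplication leaves the unigram distribution untouched.
same_distribution = relative_frequencies(tokens) == relative_frequencies(tokens * rho)

# Absolute counts grow, so more words clear a fixed threshold.
kept_base = sorted(w for w, c in base_counts.items() if c >= min_count)
kept_dup = sorted(w for w, c in dup_counts.items() if c >= min_count)

print(same_distribution, kept_base, kept_dup)
```

Under these assumptions, any gain from duplication would come from effects tied to absolute counts or to the number of training updates rather than from new distributional information, which is why the report asks for explicit controls.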
major comments (3)
- [Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.
- [Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.
- [Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.
minor comments (2)
- [Abstract] The abstract states the language is spoken by 'over 2 million people' yet later refers to it as a π-language with 'virtually non-existent' corpora; a brief clarification of the actual size of the π-yalli corpus (token count, number of documents) would help readers gauge the scale of the duplication experiment.
- [Introduction] Notation: the term 'π-languages' is introduced without a formal definition or citation; a short footnote or reference to prior usage would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our work. We address each of the major comments point by point below, indicating where revisions will be made to improve clarity and rigor.
- Referee: [Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.
  Authors: We agree that the abstract would benefit from including key quantitative results to make the central claim self-contained. The manuscript body reports the sentence-level semantic similarity performance (including cosine similarity and correlation metrics) for the original versus incrementally duplicated corpora. In the revision we will update the abstract to state the specific numerical gains observed, along with any available baseline comparisons.
  Revision: yes
- Referee: [Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.
  Authors: We accept that additional detail is required. The incremental duplication was performed by creating successive versions of the corpus at increasing integer multiplicities (starting from the original 1x and adding copies up to a chosen maximum), with the schedule determined by monitoring embedding quality on a small held-out validation subset. The train/test partition was executed first on the original corpus using a random split that reserves a fixed percentage of sentences exclusively for evaluation; duplication was then applied only to the training portion. We will expand the Methods section to state the exact multiplicities, the validation-based selection criterion, and the pre-duplication split procedure.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.
  Authors: We agree this control must be made explicit. Because the split was performed on the original corpus before any duplication occurred, the test sentences remain untouched and are never present in the duplicated training data. This design ensures the observed gains cannot be attributed to test-set leakage or simple distributional shift. We will add an explicit statement, together with a short description of the split ratio and randomization seed, to the Evaluation section.
  Revision: yes
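The split-before-duplication order described in the responses can be sketched in Python as follows. This is a hedged illustration only: the function name `split_then_duplicate`, the 20% test fraction, the seed, and the placeholder sentences are assumptions, since the actual ratio and seed are not given in the text.

```python
import random


def split_then_duplicate(sentences, test_fraction, rho, seed):
    """Split first, then duplicate only the training part rho times."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = sentences[:]        # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train * rho, test


# Placeholder sentences standing in for the real corpus.
corpus = [f"sentence {i}" for i in range(10)]

# Hypothetical ratio, factor and seed, chosen for illustration.
train, test = split_then_duplicate(corpus, test_fraction=0.2, rho=3, seed=0)

# Leakage check: no test sentence may appear in the duplicated training data.
no_leakage = set(train).isdisjoint(test)
print(len(train), len(test), no_leakage)
```

The disjointness check only holds if the original corpus contains no repeated sentences to begin with; a corpus with internal repeats would need deduplication, or a split by unique sentence, before this guarantee applies.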
Circularity Check
No circularity: purely empirical comparison of embeddings on duplicated vs. original corpus
full rationale
The paper reports an experimental study: it applies incremental duplication to the π-yalli Nahuatl corpus, trains static embeddings, and measures performance on a sentence-level semantic similarity task. The central claim is a moderate improvement in that task. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The result is a direct head-to-head metric comparison against an external benchmark (the similarity task), with no reduction of any 'prediction' to the input by construction. Self-citation is absent from the load-bearing steps. This is a standard empirical ablation and receives the default non-circularity finding.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We decided to test our hypothesis regarding the impact of corpora expansion on learning algorithms through empirical means... duplicated the corpus π-YALLI ρ times... ρ = [1, 2, 4,..., 26, 28, 30]
- IndisputableMonolith/Foundation/DimensionForcing.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.