Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Elvys Linhares-Pontes; Graham Ranger; Juan-Jos\'e Guzman-Landa; Juan-Manuel Torres-Moreno; Luis-Gil Moreno-Jim\'enez; Martha-Lorena Avenda\~no-Garrido; Miguel Figueroa-Saavedra

arxiv: 2604.07015 · v1 · submitted 2026-04-08 · 💻 cs.CL

Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Juan-Jos\'e Guzman-Landa , Juan-Manuel Torres-Moreno , Graham Ranger , Miguel Figueroa-Saavedra , Martha-Lorena Avenda\~no-Garrido , Elvys Linhares-Pontes , Luis-Gil Moreno-Jim\'enez This is my paper

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords Nahuatllow-resource languagescorpus duplicationword embeddingssemantic similaritydata augmentationindigenous languages

0 comments

The pith

Incremental duplication of a small Nahuatl corpus moderately improves static embeddings for semantic similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether controlled duplication of scarce text can aid NLP for languages with almost no training data. For Nawatl, spoken by over two million people but possessing only a tiny corpus of varied dialectal texts, the authors expand their π-yalli collection by incremental duplication and train static embeddings on the result. They evaluate those embeddings on sentence-level semantic similarity and record a moderate gain over embeddings trained on the original unexpanded data. The finding suggests a low-cost route to stretching limited resources for languages that cannot gather large new collections.

Core claim

Applying incremental duplication to the limited Nawatl corpus produces static embeddings that reach moderately higher performance on a sentence semantic similarity task than embeddings trained on the corpus without expansion. The authors note that this controlled expansion technique has not been reported before in the literature for such settings.

What carries the argument

Incremental duplication, the process of systematically repeating the existing limited texts in controlled increments to enlarge the training set for embedding learning.

If this is right

Static embeddings improve moderately on semantic similarity when the corpus is expanded by incremental duplication.
The technique supplies a practical way to enlarge training material for low-resource languages without new data collection.
Embeddings can be made more suitable for downstream NLP tasks by first applying controlled duplication to small existing collections.
The same expansion step can be inserted before training whenever only limited text is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the moderate gain generalizes, similar duplication might help other agglutinative or polysynthetic languages with sparse digital resources.
Applying the same duplicated corpus to tasks such as part-of-speech tagging or simple translation would test whether the benefit extends beyond similarity judgments.
The duplication step could be combined with cross-lingual transfer from Spanish or other neighboring languages to further stretch the data.

Load-bearing premise

Duplicating the existing sentences does not create artificial repetitions or biases that improve scores on the chosen similarity test without reflecting genuine language patterns.

What would settle it

Running the same embedding training and similarity evaluation on a fresh held-out set of Nahuatl sentences and finding either no gain or a performance drop when the duplicated corpus is used instead of the original.

Figures

Figures reproduced from arXiv: 2604.07015 by Elvys Linhares-Pontes, Graham Ranger, Juan-Jos\'e Guzman-Landa, Juan-Manuel Torres-Moreno, Luis-Gil Moreno-Jim\'enez, Martha-Lorena Avenda\~no-Garrido, Miguel Figueroa-Saavedra.

read the original abstract

In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $\pi$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $\pi$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $\pi$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

This paper finds a moderate improvement from incrementally duplicating their Nahuatl corpus when training embeddings for semantic similarity, but the results are hard to trust because of potential frequency bias in the evaluation. What is new is the application to Nahuatl, a polysynthetic language with very little data. They created the π-yalli corpus and show how controlled duplication can be used to expand it for embedding training. That part is practical and targets a genuine need in low-resource NLP. The paper does a decent job of framing the problem for languages like Nahuatl spoken by over two million people. The experiments are simple: train static embeddings on original vs duplicated data and test on sentence similarity. The soft spot is the lack of detail on whether the duplication actually adds linguistic variety or just repeats patterns. The stress-test concern looks right on target here. With a small corpus, duplicating sentences will boost the frequency of certain constructions, and since the test set is likely drawn from similar material, any similarity score improvement could be an artifact rather than a real gain in representation quality. No numbers are given in the abstract, and even if the full paper has them, without proper controls for overlap or statistical significance, it's difficult to accept the claim at face value. This kind of work is for people building tools for indigenous languages or experimenting with data augmentation in embedding models. A reader interested in Nahuatl specifically might get some ideas, but the methodological gaps mean it won't change much for the broader field. I would not cite this in its current form. It should go to peer review so the authors can add the missing controls and numbers, because the underlying question about handling data scarcity is worth pursuing.

Referee Report

3 major / 2 minor

Summary. The paper examines whether controlled incremental duplication of a small Nahuatl (Nawatl) corpus can improve static word embeddings for sentence-level semantic similarity in extremely low-resource agglutinative languages. It trains embeddings on the original π-yalli corpus versus an incrementally duplicated version and reports a moderate performance gain on the similarity task, claiming the technique has not been previously applied in the literature.

Significance. If substantiated with quantitative controls, the result would offer a low-cost data-augmentation heuristic for π-languages where corpora are tiny; however, the absence of any reported metrics, baselines, or statistical tests currently prevents assessment of whether duplication adds linguistic signal or merely re-weights frequencies.

major comments (3)

[Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.
[Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.
[Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.

minor comments (2)

[Abstract] The abstract states the language is spoken by 'over 2 million people' yet later refers to it as a π-language with 'virtually non-existent' corpora; a brief clarification of the actual size of the π-yalli corpus (token count, number of documents) would help readers gauge the scale of the duplication experiment.
[Introduction] Notation: the term 'π-languages' is introduced without a formal definition or citation; a short footnote or reference to prior usage would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our work. We address each of the major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of 'moderate improvement' is unsupported by any numerical results, baselines, error bars, or statistical tests. No cosine-similarity scores, correlation coefficients, or significance values are supplied, so the central empirical comparison cannot be evaluated.

Authors: We agree that the abstract would benefit from including key quantitative results to make the central claim self-contained. The manuscript body reports the sentence-level semantic similarity performance (including cosine similarity and correlation metrics) for the original versus incrementally duplicated corpora. In the revision we will update the abstract to state the specific numerical gains observed, along with any available baseline comparisons. revision: yes
Referee: [Methods] Methods / experimental setup (inferred from description of incremental duplication): the paper does not specify the duplication schedule, the exact multiplicity applied at each step, or the mechanism used to prevent overlap between duplicated training material and the sentences used in the semantic-similarity evaluation. Without these controls, frequency bias remains a plausible alternative explanation for any observed gain.

Authors: We accept that additional detail is required. The incremental duplication was performed by creating successive versions of the corpus at increasing integer multiplicities (starting from the original 1x and adding copies up to a chosen maximum), with the schedule determined by monitoring embedding quality on a small held-out validation subset. The train/test partition was executed first on the original corpus using a random split that reserves a fixed percentage of sentences exclusively for evaluation; duplication was then applied only to the training portion. We will expand the Methods section to state the exact multiplicities, the validation-based selection criterion, and the pre-duplication split procedure. revision: yes
Referee: [Evaluation] Evaluation section: the semantic-similarity task draws test sentences from the same limited pool as the training corpus. The manuscript must demonstrate that the train/test split (or cross-validation) excludes any duplicated material from the test set; otherwise the comparison to the unexpanded baseline is confounded by distributional shift.

Authors: We agree this control must be made explicit. Because the split was performed on the original corpus before any duplication occurred, the test sentences remain untouched and are never present in the duplicated training data. This design ensures the observed gains cannot be attributed to test-set leakage or simple distributional shift. We will add an explicit statement, together with a short description of the split ratio and randomization seed, to the Evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of embeddings on duplicated vs. original corpus

full rationale

The paper reports an experimental study: it applies incremental duplication to the π-yalli Nahuatl corpus, trains static embeddings, and measures performance on a sentence-level semantic similarity task. The central claim is a moderate improvement in that task. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The result is a direct head-to-head metric comparison against an external benchmark (the similarity task), with no reduction of any 'prediction' to the input by construction. Self-citation is absent from the load-bearing steps. This is a standard empirical ablation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of word embedding training and evaluation that are not detailed here.

pith-pipeline@v0.9.0 · 5548 in / 994 out tokens · 27531 ms · 2026-05-10T17:54:52.018569+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We decided to test our hypothesis regarding the impact of corpora expansion on learning algorithms through empirical means... duplicated the corpus π-YALLI ρ times... ρ = [1, 2, 4,..., 26, 28, 30]
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Abdillahi, N., Nocera, P., and Torres, J. M. (2006). Boites a outils TAL pour les langues peu informatis \'e es : Le cas du Somali . In Journ \'e es d'Analyses des Donn \'e es Textuelles , Besan c on, France

work page 2006
[2]

Berment, V. (2004). Méthodes pour informatiser les langues et les groupes de langues ``peu dotées'' . PhD thesis, Université Joseph-Fourier - Grenoble I

work page 2004
[3]

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the ACL , 5:135--146

work page 2017
[4]

Bun, K. K. and Ishizuka, M. (2002). Topic extraction from news archive using tf* pdf algorithm. In 3rd International Conference on Web Information Systems Engineering (WISE'02) , pages 73--82. IEEE

work page 2002
[5]

Canger, U. (1988). Nahuatl dialectology: A survey and some suggestions. International Journal of American Linguistics , 54:28 -- 72

work page 1988
[6]

Charles, W.-C. D. (2016). Lectura del náhuatl . Instituto Nacional de Lenguas Indígenas

work page 2016
[7]

Chen, J., Tam, D., Raffel, C., Bansal, M., and Yang, D. (2023). An empirical survey of data augmentation for limited data learning in NLP . Transactions of the ACL , 11:191--211

work page 2023
[8]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT : Pre-training of deep bidirectional transformers for language understanding. In Conference of the North A merican Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171--4186, Minneapolis, Minnesota. ACL

work page 2019
[9]

Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E

Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. (2021). A survey of data augmentation approaches for NLP . In Zong, C., Xia, F., Li, W., and Navigli, R., editors, Findings of the ACL: ACL-IJCNLP 2021 , pages 968--988, Online. Association for Computational Linguistics

work page 2021
[10]

Flores Nájera, L. (2019). La gram\'atica de la clausula simple en el náhuatl de Tlaxcala . PhD thesis, CIESAS

work page 2019
[11]

Francis-Landau, M., Durrett, G., and Klein, D. (2016). Capturing semantic similarity for entity linking with convolutional neural networks. In Knight, K., Nenkova, A., and Rambow, O., editors, NAACL: Human Language Technologies , pages 1256--1261, San Diego, California. ACL

work page 2016
[12]

Goyal, P., Pandey, S., and Jain, K. (2018). Deep Learning for Natural Language Processing . Springer

work page 2018
[13]

Guzm \'a n-Landa, J.-J., Torres-Moreno, J.-M., Avenda \ n o-Garrido, M.-L., Figueroa-Saavedra, M., Quintana-Torres, L., Ranger, G., Gonz \'a lez-Gallardo, C.-E., Linhares-Pontes, E., Vel \'a zquez-Morales, P., and Moreno-Jim \'e nez, L.-G. (2025a). - YALLI : un nouveau corpus pour des mod \`e les de langue nahuatl / Y ankuik nawatlahtolkorpus pampa tlahto...

work page
[14]

J., Torres-Moreno, J.-M., Ranger, G., Figueroa-Saavedra, M., Quintana-Torres, L., and Avenda \ n o-Garrido, M

Guzm \'a n-Landa, J. J., Torres-Moreno, J.-M., Ranger, G., Figueroa-Saavedra, M., Quintana-Torres, L., and Avenda \ n o-Garrido, M. L. (2025b). Two cfg nahuatl for automatic corpora expansion. ArXiv , abs/2512.14239

work page arXiv
[15]

L., Figueroa-Saavedra, M., Quintana-Torres, L., Vel \'a zquez Morales, P., and Sierra-Martínez, G

Guzm \'a n-Landa, J.-J., Vázquez-Osorio, J., Torres-Moreno, J.-M., Ranger, G., Garrido-Avenda \ n o, M. L., Figueroa-Saavedra, M., Quintana-Torres, L., Vel \'a zquez Morales, P., and Sierra-Martínez, G. (2025c). A symbolic algorithm for the unification of nawatl word spellings. In MICAI'25 , page 12p. SMIA

work page
[16]

Guzmán-Landa, J.-J., Torres-Moreno, J.-M., Figueroa-Saavedra, M., González-Gallardo, C.-E., Ranger, G., and Lorena-Avendaño-Garrido, M. (2026). Classifying several dialectal nawatl varieties

work page 2026
[17]

Hansen, M. P. (2024). Nahuatl Nations: Language Revitalization and Semiotic Sovereignty in Indigenous Mexico . Oxford University Press

work page 2024
[18]

Censo de poblaci\'on y vivienda 2020

INEGI (2020). Censo de poblaci\'on y vivienda 2020. In CENSO 2020 . https://www.inegi.org.mx/rnm/index.php/catalog/632/study-description

work page 2020
[19]

B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models

work page 2020
[20]

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika , 30(1/2):81--93

work page 1938
[21]

Lastra de Su \'a rez, Y. (1986). Las \'a reas dialectales del n \'a huatl moderno . UNAM, Instituto de Investigaciones Antropológicas, Mexico

work page 1986
[22]

Launey, M. (1978). Introduction \`a la langue et \`a la litt \'e rature azt \`e ques , volume 1. L'Harmattan, Paris

work page 1978
[23]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2022). Deduplicating training data makes language models better. In Muresan, S., Nakov, P., and Villavicencio, A., editors, 60th Annual Meeting of the ACL (V1) , pages 8424--8445, Dublin, Ireland. ACL

work page 2022
[24]

Mahamud, M., Lee, Z., and Samsten, I. (2023). Distributional data augmentation methods for low resource language

work page 2023
[25]

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing . MIT Press, Cambridge, MA

work page 1999
[26]

Micheli, V., d ' Hoffschmidt, M., and Fleuret, F. (2020). On the importance of pre-training data volume for compact language models. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 7853--7858, Online. ACL

work page 2020
[27]

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space

work page
[28]

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In NIPS - Vol 2 , NIPS, page 3111–3119, Red Hook, NY, USA. Curran Associates Inc

work page
[29]

and Sullivan, J

Olko, J. and Sullivan, J. (2016). Bridging gaps and empowering speakers: An inclusive, partnership-based approach to nahuatl research and revitalization. Integral strategies for language revitalization , pages 347--386

work page 2016
[30]

Penedo, G., Kydl \' c ek, H., Ben Allal , L., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra , L., and Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems . NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

work page 2024
[31]

Pennington, J., Socher, R., and Manning, C. (2014a). G lo V e: Global vectors for word representation. In Empirical Methods in Natural Language Processing ( EMNLP ) , pages 1532--1543, Doha, Qatar. ACL

work page
[32]

Pennington, J., Socher, R., and Manning, C. D. (2014b). Glove: Global vectors for word representation. In 2014 EMNLP , pages 1532--1543. ACL

work page 2014
[33]

X., M \'a rquez Hernandez, \'A ., and Tyers, F

Pugh, R., Wing, C., Ju \'a rez Huerta, M. X., M \'a rquez Hernandez, \'A ., and Tyers, F. (2025). Ihquin tlahtouah in tetelahtzincocah: An annotated, multi-purpose audio and text corpus of western sierra P uebla N ahuatl. In Chiruzzo, L., Ritter, A., and Wang, L., editors, Conference of the Nations of the Americas Chapter of the ACL: Human Language Techno...

work page 2025
[34]

Robertson, S., Zaragoza, H., and Taylor, M. (2004). Simple bm25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management , pages 42--49

work page 2004
[35]

Sasaki, M. (2022). Divide y entender \'a s: El papel de la polarizaci \'o n sint \'a ctica en el n \'a huatl moderno y colonial. In Coloquio de Investigación Lingüística, Universidad de Sonora (Mexico)

work page 2022
[36]

Torres-Moreno, J.-M. (2014). Automatic Text Summarization . Wiley, London

work page 2014
[37]

Tunstall, L., von Werra, L., and Wolf, T. (2022). Natural Language Processing with Transformers: Building Language Applications with Hugging Face . O'Reilly Media

work page 2022
[38]

and Zou, K

Wei, J. and Zou, K. (2019). EDA : Easy data augmentation techniques for boosting performance on text classification tasks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, EMNLP-IJCNLP , pages 6382--6388, Hong Kong, China. ACL

work page 2019
[39]

Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzm \'a n, F., Joulin, A., and Grave, E. (2020). CCN et: Extracting high quality monolingual datasets from web crawl data. In 20th Language Resources and Evaluation Conf , pages 4003--4012, Marseille, France. European Language Resources Association

work page 2020
[40]

Zimmermann, K. (2019). Estandarización y revitalización de lenguas amerindias: funciones comunicativas e ideológicas, expectativas ilusorias y condiciones de la aceptación. Revista de Llengua i Dret, Journal of Language and Law , 71:111--122

work page 2019

[1] [1]

Abdillahi, N., Nocera, P., and Torres, J. M. (2006). Boites a outils TAL pour les langues peu informatis \'e es : Le cas du Somali . In Journ \'e es d'Analyses des Donn \'e es Textuelles , Besan c on, France

work page 2006

[2] [2]

Berment, V. (2004). Méthodes pour informatiser les langues et les groupes de langues ``peu dotées'' . PhD thesis, Université Joseph-Fourier - Grenoble I

work page 2004

[3] [3]

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the ACL , 5:135--146

work page 2017

[4] [4]

Bun, K. K. and Ishizuka, M. (2002). Topic extraction from news archive using tf* pdf algorithm. In 3rd International Conference on Web Information Systems Engineering (WISE'02) , pages 73--82. IEEE

work page 2002

[5] [5]

Canger, U. (1988). Nahuatl dialectology: A survey and some suggestions. International Journal of American Linguistics , 54:28 -- 72

work page 1988

[6] [6]

Charles, W.-C. D. (2016). Lectura del náhuatl . Instituto Nacional de Lenguas Indígenas

work page 2016

[7] [7]

Chen, J., Tam, D., Raffel, C., Bansal, M., and Yang, D. (2023). An empirical survey of data augmentation for limited data learning in NLP . Transactions of the ACL , 11:191--211

work page 2023

[8] [8]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT : Pre-training of deep bidirectional transformers for language understanding. In Conference of the North A merican Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171--4186, Minneapolis, Minnesota. ACL

work page 2019

[9] [9]

Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E

Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. (2021). A survey of data augmentation approaches for NLP . In Zong, C., Xia, F., Li, W., and Navigli, R., editors, Findings of the ACL: ACL-IJCNLP 2021 , pages 968--988, Online. Association for Computational Linguistics

work page 2021

[10] [10]

Flores Nájera, L. (2019). La gram\'atica de la clausula simple en el náhuatl de Tlaxcala . PhD thesis, CIESAS

work page 2019

[11] [11]

Francis-Landau, M., Durrett, G., and Klein, D. (2016). Capturing semantic similarity for entity linking with convolutional neural networks. In Knight, K., Nenkova, A., and Rambow, O., editors, NAACL: Human Language Technologies , pages 1256--1261, San Diego, California. ACL

work page 2016

[12] [12]

Goyal, P., Pandey, S., and Jain, K. (2018). Deep Learning for Natural Language Processing . Springer

work page 2018

[13] [13]

Guzm \'a n-Landa, J.-J., Torres-Moreno, J.-M., Avenda \ n o-Garrido, M.-L., Figueroa-Saavedra, M., Quintana-Torres, L., Ranger, G., Gonz \'a lez-Gallardo, C.-E., Linhares-Pontes, E., Vel \'a zquez-Morales, P., and Moreno-Jim \'e nez, L.-G. (2025a). - YALLI : un nouveau corpus pour des mod \`e les de langue nahuatl / Y ankuik nawatlahtolkorpus pampa tlahto...

work page

[14] [14]

J., Torres-Moreno, J.-M., Ranger, G., Figueroa-Saavedra, M., Quintana-Torres, L., and Avenda \ n o-Garrido, M

Guzm \'a n-Landa, J. J., Torres-Moreno, J.-M., Ranger, G., Figueroa-Saavedra, M., Quintana-Torres, L., and Avenda \ n o-Garrido, M. L. (2025b). Two cfg nahuatl for automatic corpora expansion. ArXiv , abs/2512.14239

work page arXiv

[15] [15]

L., Figueroa-Saavedra, M., Quintana-Torres, L., Vel \'a zquez Morales, P., and Sierra-Martínez, G

Guzm \'a n-Landa, J.-J., Vázquez-Osorio, J., Torres-Moreno, J.-M., Ranger, G., Garrido-Avenda \ n o, M. L., Figueroa-Saavedra, M., Quintana-Torres, L., Vel \'a zquez Morales, P., and Sierra-Martínez, G. (2025c). A symbolic algorithm for the unification of nawatl word spellings. In MICAI'25 , page 12p. SMIA

work page

[16] [16]

Guzmán-Landa, J.-J., Torres-Moreno, J.-M., Figueroa-Saavedra, M., González-Gallardo, C.-E., Ranger, G., and Lorena-Avendaño-Garrido, M. (2026). Classifying several dialectal nawatl varieties

work page 2026

[17] [17]

Hansen, M. P. (2024). Nahuatl Nations: Language Revitalization and Semiotic Sovereignty in Indigenous Mexico . Oxford University Press

work page 2024

[18] [18]

Censo de poblaci\'on y vivienda 2020

INEGI (2020). Censo de poblaci\'on y vivienda 2020. In CENSO 2020 . https://www.inegi.org.mx/rnm/index.php/catalog/632/study-description

work page 2020

[19] [19]

B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models

work page 2020

[20] [20]

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika , 30(1/2):81--93

work page 1938

[21] [21]

Lastra de Su \'a rez, Y. (1986). Las \'a reas dialectales del n \'a huatl moderno . UNAM, Instituto de Investigaciones Antropológicas, Mexico

work page 1986

[22] [22]

Launey, M. (1978). Introduction \`a la langue et \`a la litt \'e rature azt \`e ques , volume 1. L'Harmattan, Paris

work page 1978

[23] [23]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2022). Deduplicating training data makes language models better. In Muresan, S., Nakov, P., and Villavicencio, A., editors, 60th Annual Meeting of the ACL (V1) , pages 8424--8445, Dublin, Ireland. ACL

work page 2022

[24] [24]

Mahamud, M., Lee, Z., and Samsten, I. (2023). Distributional data augmentation methods for low resource language

work page 2023

[25] [25]

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing . MIT Press, Cambridge, MA

work page 1999

[26] [26]

Micheli, V., d ' Hoffschmidt, M., and Fleuret, F. (2020). On the importance of pre-training data volume for compact language models. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 7853--7858, Online. ACL

work page 2020

[27] [27]

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space

work page

[28] [28]

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In NIPS - Vol 2 , NIPS, page 3111–3119, Red Hook, NY, USA. Curran Associates Inc

work page

[29] [29]

and Sullivan, J

Olko, J. and Sullivan, J. (2016). Bridging gaps and empowering speakers: An inclusive, partnership-based approach to nahuatl research and revitalization. Integral strategies for language revitalization , pages 347--386

work page 2016

[30] [30]

Penedo, G., Kydl \' c ek, H., Ben Allal , L., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra , L., and Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems . NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

work page 2024

[31] [31]

Pennington, J., Socher, R., and Manning, C. (2014a). G lo V e: Global vectors for word representation. In Empirical Methods in Natural Language Processing ( EMNLP ) , pages 1532--1543, Doha, Qatar. ACL

work page

[32] [32]

Pennington, J., Socher, R., and Manning, C. D. (2014b). Glove: Global vectors for word representation. In 2014 EMNLP , pages 1532--1543. ACL

work page 2014

[33] [33]

X., M \'a rquez Hernandez, \'A ., and Tyers, F

Pugh, R., Wing, C., Ju \'a rez Huerta, M. X., M \'a rquez Hernandez, \'A ., and Tyers, F. (2025). Ihquin tlahtouah in tetelahtzincocah: An annotated, multi-purpose audio and text corpus of western sierra P uebla N ahuatl. In Chiruzzo, L., Ritter, A., and Wang, L., editors, Conference of the Nations of the Americas Chapter of the ACL: Human Language Techno...

work page 2025

[34] [34]

Robertson, S., Zaragoza, H., and Taylor, M. (2004). Simple bm25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management , pages 42--49

work page 2004

[35] [35]

Sasaki, M. (2022). Divide y entender \'a s: El papel de la polarizaci \'o n sint \'a ctica en el n \'a huatl moderno y colonial. In Coloquio de Investigación Lingüística, Universidad de Sonora (Mexico)

work page 2022

[36] [36]

Torres-Moreno, J.-M. (2014). Automatic Text Summarization . Wiley, London

work page 2014

[37] [37]

Tunstall, L., von Werra, L., and Wolf, T. (2022). Natural Language Processing with Transformers: Building Language Applications with Hugging Face . O'Reilly Media

work page 2022

[38] [38]

and Zou, K

Wei, J. and Zou, K. (2019). EDA : Easy data augmentation techniques for boosting performance on text classification tasks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, EMNLP-IJCNLP , pages 6382--6388, Hong Kong, China. ACL

work page 2019

[39] [39]

Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzm \'a n, F., Joulin, A., and Grave, E. (2020). CCN et: Extracting high quality monolingual datasets from web crawl data. In 20th Language Resources and Evaluation Conf , pages 4003--4012, Marseille, France. European Language Resources Association

work page 2020

[40] [40]

Zimmermann, K. (2019). Estandarización y revitalización de lenguas amerindias: funciones comunicativas e ideológicas, expectativas ilusorias y condiciones de la aceptación. Revista de Llengua i Dret, Journal of Language and Law , 71:111--122

work page 2019