Curation of a Palaeohispanic Dataset for Machine Learning

Agustín Riscos-Núñez; Francisco José Salguero-Lamillar; Gonzalo Martínez-Fernández; Jose F Quesada

arxiv: 2604.13070 · v1 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 08:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Palaeohispanic languages · dataset curation · machine learning · ancient Iberian scripts · computational linguistics · language decipherment · structured data

The pith

A curated dataset transforms Palaeohispanic language resources into a format ready for machine learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a structured dataset from existing Palaeohispanic inscriptions and materials that were previously unsuitable for computational work. This addresses the scarcity of machine-readable data on these ancient Iberian languages, which remain only partially deciphered. A sympathetic reader would care because computational techniques could now be applied to test linguistic hypotheses or support decipherment efforts that traditional methods have left incomplete. The work positions the dataset as a foundation for future data-driven studies in the field.

Core claim

The authors curate Palaeohispanic language resources into a single structured dataset formatted for machine learning, thereby converting limited and incompatible materials into a usable resource that can support computational analysis of these partially understood ancient scripts.

What carries the argument

The structured dataset, which reformats existing Palaeohispanic inscriptions and linguistic data into a machine-readable form without altering core content.

If this is right

Machine learning models can be trained on the dataset for tasks such as script recognition and pattern detection in ancient texts.
Computational experiments can now test specific claims about the structure and relationships among Palaeohispanic languages.
The resource can serve as a shared benchmark for developing tools tailored to semi-syllabic writing systems.
Further curation or expansion of the dataset can build directly on this initial release to cover additional inscriptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar dataset curation could be applied to other ancient or under-resourced scripts to enable parallel computational work.
Machine learning outputs from the dataset might surface statistical regularities that prompt re-examination of traditional linguistic classifications.
Linking the dataset to existing digital epigraphy projects could increase its utility for collaborative research.

Load-bearing premise

Existing Palaeohispanic resources can be converted into a machine learning format while preserving all critical linguistic details.

What would settle it

Demonstration that key phonetic, grammatical, or contextual information from the original sources is lost or misrepresented in the new dataset structure.
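
One way such a demonstration could be set up, sketched under assumptions: if each dataset row kept the source transcription next to its normalized form, a reviewer could flag rows where normalization dropped characters that carry editorial meaning. The column names (`source_text`, `normalized_text`), the marker set, and the sample strings are hypothetical and are not taken from the paper or its dataset.

```python
# Hypothetical audit: which editorial markers in the source transcription
# do not survive into the normalized field? Field names are assumptions.

# Characters that often carry editorial meaning in epigraphic transcriptions
# (for example, square brackets around restored text). Illustrative only.
MARKERS = set("[]()?")

def lost_markers(source_text: str, normalized_text: str) -> set[str]:
    """Return marker characters present in the source but absent after normalization."""
    return {ch for ch in source_text if ch in MARKERS and ch not in normalized_text}

def audit(rows: list[dict[str, str]]) -> list[tuple[int, set[str]]]:
    """List (row index, lost markers) for every row that dropped a marker."""
    report = []
    for i, row in enumerate(rows):
        lost = lost_markers(row["source_text"], row["normalized_text"])
        if lost:
            report.append((i, lost))
    return report

# Invented sample rows, not dataset entries.
rows = [
    {"source_text": "ilti[r]ta", "normalized_text": "iltirta"},  # brackets dropped
    {"source_text": "baites", "normalized_text": "baites"},      # nothing lost
]
print(audit(rows))
```

A non-empty report would not by itself prove that information was lost, since the markers might be stored in a separate column, but it would show where to look.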

Original abstract

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was truly set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-syllabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in a format unsuitable for techniques such as machine learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper builds a new structured dataset for Palaeohispanic languages aimed at ML use, but gives almost no details on how it was made or checked.

The main thing to know is that this paper takes existing resources on pre-Roman Iberian languages and scripts and turns them into a structured dataset formatted for machine learning. That is the actual contribution, and it is new in the sense that no prior ML-ready version is referenced in the abstract or description. The authors point out that these languages remain only partly deciphered and that most prior work has stayed strictly linguistic, so the dataset is meant to open the door to computational experiments. That motivation is clear and reasonable for a narrow subfield.

The paper does a straightforward job laying out the historical background, from Gómez Moreno's decipherment work onward, and explaining why current resources are unsuitable for ML techniques. It keeps the focus on enabling future progress rather than claiming any new linguistic insight itself.

The soft spot is the near-total absence of information on the curation steps. There is no account of which specific sources were used, how the data was transformed into the new format, what validation or quality checks were applied, or how potential loss of linguistic nuance was handled. Without those details it is difficult to assess whether the dataset actually preserves what matters for ML work or whether it introduces artifacts. If the full paper includes dataset statistics, examples, or release information, that would strengthen it considerably, but the current description leaves the claim resting on description alone.

This is the kind of resource paper that matters most to researchers in historical linguistics or digital epigraphy who already work on Iberian scripts and want to test ML approaches. A reader outside that niche will get little from it. It deserves a serious referee because resource papers can be useful when the construction process is documented well enough for others to evaluate and reuse the data.

I would send it to peer review rather than desk reject, mainly to get concrete feedback on the curation choices and reproducibility.

Referee Report

1 major / 1 minor

Summary. The paper claims to construct a structured dataset for Palaeohispanic languages by transforming existing resources on pre-Roman Iberian languages into a format suitable for machine learning, addressing the limitations of current resources that are unsuitable for computational techniques.

Significance. If the curation is rigorously documented and the dataset preserves linguistic details while enabling ML tasks, it could meaningfully advance computational approaches in a field reliant on limited, non-standardized resources. The work is primarily descriptive resource creation rather than a methodological or empirical advance, so its impact hinges on public release, documentation quality, and demonstrated usability.

major comments (1)

[Abstract] The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.

minor comments (1)

Consider adding explicit details on dataset structure, size, format, and access instructions in a dedicated section to improve reproducibility and utility for the community.
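
As an illustration of the kind of summary this comment asks for, the sketch below reports row count, column count, and empty cells per column for a CSV file, using only the Python standard library. Elsewhere on this page the dataset is quoted as a CSV file with 1751 instances and 36 feature columns; the column names and sample rows used here are invented placeholders, not the dataset's.

```python
import csv
import io

def summarize(csv_file) -> dict:
    """Count rows, columns, and empty cells per column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            # Treat None (short row) and whitespace-only cells as missing.
            if not (row.get(name) or "").strip():
                missing[name] += 1
    return {"rows": n_rows, "columns": len(columns), "missing": missing}

# Placeholder data standing in for the real file (column names are invented).
sample = io.StringIO(
    "inscription_id,script,text\n"
    "X.1,northeastern,example\n"
    "X.2,,example\n"
)
summary = summarize(sample)
print(summary)
# {'rows': 2, 'columns': 3, 'missing': {'inscription_id': 0, 'script': 1, 'text': 0}}
```

For a file on disk, the same function could be called inside `with open(path, newline="", encoding="utf-8") as f:`, where `path` is whatever location the released dataset ends up at.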

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the single major comment below and will revise the manuscript to improve clarity and completeness.

Referee: [Abstract] The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.

Authors: We agree that the abstract as currently written is too concise and does not sufficiently outline the curation methodology. The full manuscript contains dedicated sections describing the transformation of existing Palaeohispanic resources (including script normalization, tokenization, and annotation alignment steps), along with validation against original epigraphic sources and checks for information preservation. To make the abstract self-contained and directly support the central claim, we will expand it in the revised version to include a brief summary of the curation pipeline, quality assurance procedures, and steps taken to retain linguistic details such as script variants and contextual metadata. This revision will be limited to the abstract and will not alter the technical content of the paper.

revision: yes
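
The rebuttal names script normalization and tokenization without describing them. As a rough sketch of what such steps might look like for a transliterated semi-syllabic text (this is not the authors' pipeline), the example below applies Unicode normalization and then greedily splits a string into signs, preferring two-letter syllabograms over single letters. The sign inventory is a simplified assumption for illustration, and the input is a short sample string chosen to exercise the code.

```python
import unicodedata

# Simplified, assumed inventory: stop + vowel pairs are treated as one sign;
# every other character is treated as a single-character sign.
VOWELS = "aeiou"
SYLLABOGRAMS = {c + v for c in "btk" for v in VOWELS}

def normalize(text: str) -> str:
    """Compose accents (NFC) so 'r' + combining acute becomes one code point, then lowercase."""
    return unicodedata.normalize("NFC", text).lower().strip()

def tokenize(text: str) -> list[str]:
    """Greedy left-to-right split into signs, trying a two-letter syllabogram first."""
    signs, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in SYLLABOGRAMS:
            signs.append(pair)
            i += 2
        else:
            signs.append(text[i])  # single sign, or an unrecognized character kept verbatim
            i += 1
    return signs

# Input written with a combining acute accent to show why normalization comes first.
print(tokenize(normalize("Iltir\u0301ta")))
# ['i', 'l', 'ti', 'ŕ', 'ta']
```

Running normalization before tokenization matters here: without NFC composition the combining accent would be split off as its own token.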

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely descriptive account of curating an existing set of Palaeohispanic inscriptions into a structured dataset suitable for machine-learning use. No equations, fitted parameters, quantitative predictions, or derivation chains appear anywhere in the manuscript. The central claim reduces to the factual statement that the dataset was assembled from prior resources; this statement is not shown to be equivalent to its own inputs by construction, nor does it rely on load-bearing self-citations, uniqueness theorems, or ansatzes. Consequently the circularity score is zero.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the work is data curation with no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5443 in / 889 out tokens · 31885 ms · 2026-05-15T08:45:31.275925+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "a structured dataset is constructed... CSV file with 1751 instances and 36 feature columns"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
