Curation of a Palaeohispanic Dataset for Machine Learning
Pith reviewed 2026-05-15 08:45 UTC · model grok-4.3
The pith
A curated dataset transforms Palaeohispanic language resources into a format ready for machine learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors curate Palaeohispanic language resources into a single structured dataset formatted for machine learning, thereby converting limited and incompatible materials into a usable resource that can support computational analysis of these partially understood ancient scripts.
What carries the argument
The structured dataset, which reformats existing Palaeohispanic inscriptions and linguistic data into a machine-readable form without altering core content.
If this is right
- Machine learning models can be trained on the dataset for tasks such as script recognition and pattern detection in ancient texts.
- Computational experiments can now test specific claims about the structure and relationships among Palaeohispanic languages.
- The resource can serve as a shared benchmark for developing tools tailored to semi-syllabic writing systems.
- Further curation or expansion of the dataset can build directly on this initial release to cover additional inscriptions.
Where Pith is reading between the lines
- Similar dataset curation could be applied to other ancient or under-resourced scripts to enable parallel computational work.
- Machine learning outputs from the dataset might surface statistical regularities that prompt re-examination of traditional linguistic classifications.
- Linking the dataset to existing digital epigraphy projects could increase its utility for collaborative research.
Load-bearing premise
Existing Palaeohispanic resources can be converted into a machine learning format while preserving all critical linguistic details.
What would settle it
Demonstration that key phonetic, grammatical, or contextual information from the original sources is lost or misrepresented in the new dataset structure.
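One way to make this test concrete is a field-by-field comparison between the original records and the rows of the curated dataset. The sketch below is illustrative only: the field names in `PRESERVED_FIELDS`, the `Mismatch` type, and the function `find_information_loss` are hypothetical and do not come from the paper, whose actual schema is not described on this page.

```python
from dataclasses import dataclass

# Hypothetical field names (assumption); the paper's real schema is not given here.
PRESERVED_FIELDS = ("transliteration", "script", "findspot")


@dataclass(frozen=True)
class Mismatch:
    record_id: str
    field: str
    source_value: str
    dataset_value: str


def find_information_loss(source_records, dataset_rows, fields=PRESERVED_FIELDS):
    """Return every checked field whose value differs between source and dataset.

    Both arguments map a record id to a dict of field values. A record that is
    missing from the dataset is compared against empty strings, so each of its
    non-empty source fields is reported.
    """
    mismatches = []
    for record_id, source in source_records.items():
        row = dataset_rows.get(record_id, {})
        for field in fields:
            src = source.get(field, "")
            out = row.get(field, "")
            if src != out:
                mismatches.append(Mismatch(record_id, field, src, out))
    return mismatches


if __name__ == "__main__":
    # Toy placeholder values, not real inscriptions.
    source = {"r1": {"transliteration": "a-b-c", "script": "s1", "findspot": "f1"}}
    dataset = {"r1": {"transliteration": "abc", "script": "s1", "findspot": "f1"}}
    for mismatch in find_information_loss(source, dataset):
        print(mismatch)
```

An empty result on a representative sample would support the load-bearing premise for the checked fields; any reported mismatch would be a candidate case of loss or misrepresentation that needs manual review.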
Original abstract
Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-syllabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most of the studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in a format unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to construct a structured dataset for Palaeohispanic languages by transforming existing resources on pre-Roman Iberian languages into a format suitable for machine learning, addressing the limitations of current resources that are unsuitable for computational techniques.
Significance. If the curation is rigorously documented and the dataset preserves linguistic details while enabling ML tasks, it could meaningfully advance computational approaches in a field reliant on limited, non-standardized resources. The work is primarily descriptive resource creation rather than a methodological or empirical advance, so its impact hinges on public release, documentation quality, and demonstrated usability.
major comments (1)
- [Abstract] The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.
minor comments (1)
- Consider adding explicit details on dataset structure, size, format, and access instructions in a dedicated section to improve reproducibility and utility for the community.
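A minimal sketch of the kind of structure report this comment asks for is given below. It only counts data rows and columns of a CSV file. The function name `summarize_csv` and the toy column names are assumptions for illustration, not the authors' tooling or schema. The passage quoted elsewhere on this page mentions a CSV file with 1751 instances and 36 feature columns, which is the sort of figure such a check could confirm against the released file.

```python
import csv
import io


def summarize_csv(text):
    """Return (number of data rows, number of columns, column names).

    The first line is treated as the header; blank lines are ignored.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_rows = sum(1 for row in reader if row)  # csv.reader yields [] for blank lines
    return n_rows, len(header), header


if __name__ == "__main__":
    # Toy stand-in for the released file; the column names are hypothetical.
    sample = "id,script,text\n1,a,xyz\n2,b,uvw\n"
    rows, cols, names = summarize_csv(sample)
    print(f"{rows} rows, {cols} columns: {names}")
```

For a file on disk, the same function can be applied to the text read with `open(path, newline="", encoding="utf-8").read()`, where `path` is wherever the dataset is stored.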
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback. We address the single major comment below and will revise the manuscript to improve clarity and completeness.
- Referee: [Abstract] The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.
  Authors: We agree that the abstract as currently written is too concise and does not sufficiently outline the curation methodology. The full manuscript contains dedicated sections describing the transformation of existing Palaeohispanic resources (including script normalization, tokenization, and annotation alignment steps), along with validation against original epigraphic sources and checks for information preservation. To make the abstract self-contained and directly support the central claim, we will expand it in the revised version to include a brief summary of the curation pipeline, quality assurance procedures, and steps taken to retain linguistic details such as script variants and contextual metadata. This revision will be limited to the abstract and will not alter the technical content of the paper.
  Revision: yes
Circularity Check
No significant circularity
The paper is a purely descriptive account of curating an existing set of Palaeohispanic inscriptions into a structured dataset suitable for machine-learning use. No equations, fitted parameters, quantitative predictions, or derivation chains appear anywhere in the manuscript. The central claim reduces to the factual statement that the dataset was assembled from prior resources; this statement is not shown to be equivalent to its own inputs by construction, nor does it rely on load-bearing self-citations, uniqueness theorems, or ansatzes. Consequently the circularity score is zero.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "a structured dataset is constructed... CSV file with 1751 instances and 36 feature columns"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.