
arxiv: 2604.13070 · v1 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Curation of a Palaeohispanic Dataset for Machine Learning

Pith reviewed 2026-05-15 08:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Palaeohispanic languages · dataset curation · machine learning · ancient Iberian scripts · computational linguistics · language decipherment · structured data

The pith

A curated dataset transforms Palaeohispanic language resources into a format ready for machine learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a structured dataset from existing Palaeohispanic inscriptions and materials that were previously unsuitable for computational work. This addresses the scarcity of machine-readable data on these ancient Iberian languages, which remain only partially deciphered. A sympathetic reader would care because computational techniques could now be applied to test linguistic hypotheses or support decipherment efforts that traditional methods have left incomplete. The work positions the dataset as a foundation for future data-driven studies in the field.

Core claim

The authors curate Palaeohispanic language resources into a single structured dataset formatted for machine learning, thereby converting limited and incompatible materials into a usable resource that can support computational analysis of these partially understood ancient scripts.

What carries the argument

The structured dataset, which reformats existing Palaeohispanic inscriptions and linguistic data into a machine-readable form without altering core content.

If this is right

  • Machine learning models can be trained on the dataset for tasks such as script recognition and pattern detection in ancient texts.
  • Computational experiments can now test specific claims about the structure and relationships among Palaeohispanic languages.
  • The resource can serve as a shared benchmark for developing tools tailored to semi-syllabic writing systems.
  • Further curation or expansion of the dataset can build directly on this initial release to cover additional inscriptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dataset curation could be applied to other ancient or under-resourced scripts to enable parallel computational work.
  • Machine learning outputs from the dataset might surface statistical regularities that prompt re-examination of traditional linguistic classifications.
  • Linking the dataset to existing digital epigraphy projects could increase its utility for collaborative research.

Load-bearing premise

Existing Palaeohispanic resources can be converted into a machine learning format while preserving all critical linguistic details.

What would settle it

Demonstration that key phonetic, grammatical, or contextual information from the original sources is lost or misrepresented in the new dataset structure.
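
One way such a demonstration could be framed is as a field-by-field comparison between a source entry and its curated counterpart. The sketch below is illustrative only: the field names (`transliteration`, `script`, `findspot`), the sample values, and the `lost_or_altered` helper are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Any


def lost_or_altered(source: dict[str, Any], curated: dict[str, Any]) -> dict[str, str]:
    """Report source fields that are missing or changed in the curated record."""
    problems: dict[str, str] = {}
    for field, value in source.items():
        if field not in curated:
            problems[field] = "missing"
        elif curated[field] != value:
            problems[field] = "altered"
    return problems


# Hypothetical records; the values are placeholders, not real inscription data.
source_entry = {
    "transliteration": "xxx : yyy",
    "script": "placeholder-script",
    "findspot": "placeholder-site",
}
curated_entry = {
    "transliteration": "xxx yyy",  # word divider dropped during conversion
    "script": "placeholder-script",
}

report = lost_or_altered(source_entry, curated_entry)
print(report)
```

A non-empty report on a sample of entries would be evidence against the load-bearing premise; an empty one only shows that the fields being compared survived, not that the chosen fields capture everything that matters.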

Original abstract

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was truly set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-syllabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most of the studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in a format unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to construct a structured dataset for Palaeohispanic languages by transforming existing resources on pre-Roman Iberian languages into a format suitable for machine learning, addressing the limitations of current resources that are unsuitable for computational techniques.

Significance. If the curation is rigorously documented and the dataset preserves linguistic details while enabling ML tasks, it could meaningfully advance computational approaches in a field reliant on limited, non-standardized resources. The work is primarily descriptive resource creation rather than a methodological or empirical advance, so its impact hinges on public release, documentation quality, and demonstrated usability.

major comments (1)
  1. [Abstract] Abstract: The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.
minor comments (1)
  1. Consider adding explicit details on dataset structure, size, format, and access instructions in a dedicated section to improve reproducibility and utility for the community.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the single major comment below and will revise the manuscript to improve clarity and completeness.

  1. Referee: [Abstract] Abstract: The claim that existing resources are transformed into an ML-suitable format lacks any description of the curation process, validation steps, quality checks, or potential information loss. This directly affects the central claim that the dataset will allow progress in the field, as no evidence is provided that critical linguistic information is retained.

    Authors: We agree that the abstract as currently written is too concise and does not sufficiently outline the curation methodology. The full manuscript contains dedicated sections describing the transformation of existing Palaeohispanic resources (including script normalization, tokenization, and annotation alignment steps), along with validation against original epigraphic sources and checks for information preservation. To make the abstract self-contained and directly support the central claim, we will expand it in the revised version to include a brief summary of the curation pipeline, quality assurance procedures, and steps taken to retain linguistic details such as script variants and contextual metadata. This revision will be limited to the abstract and will not alter the technical content of the paper. revision: yes
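
The steps named in this response (script normalization, tokenization, and keeping the original reading available for validation) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `curate` function, the `Record` fields, the choice of Unicode NFC normalization, and the use of `:` and `·` as word dividers are not confirmed by the paper, and the sample string is an invented placeholder rather than a real inscription.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    raw: str                 # reading exactly as given by the source, kept for validation
    normalized: str          # NFC-normalized reading with whitespace collapsed
    tokens: tuple[str, ...]  # segments between the assumed word dividers


# Assumed word dividers; a real corpus may encode them differently.
DIVIDERS = re.compile(r"[:·]")


def curate(raw: str) -> Record:
    """Normalize and tokenize a transliterated reading without discarding the original."""
    normalized = unicodedata.normalize("NFC", raw)
    normalized = " ".join(normalized.split())
    tokens = tuple(t.strip() for t in DIVIDERS.split(normalized) if t.strip())
    return Record(raw=raw, normalized=normalized, tokens=tokens)


# Invented placeholder: "e" followed by a combining acute accent, uneven spacing.
record = curate("abe\u0301  : kol · ti")
print(record.tokens)
```

Storing `raw` next to the derived fields is the design choice that matters here: it lets any later check compare the curated form against the source reading instead of trusting the normalization step.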

Circularity Check

0 steps flagged

No significant circularity


The paper is a purely descriptive account of curating an existing set of Palaeohispanic inscriptions into a structured dataset suitable for machine-learning use. No equations, fitted parameters, quantitative predictions, or derivation chains appear anywhere in the manuscript. The central claim reduces to the factual statement that the dataset was assembled from prior resources; this statement is not shown to be equivalent to its own inputs by construction, nor does it rely on load-bearing self-citations, uniqueness theorems, or ansatzes. Consequently the circularity score is zero.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the work is data curation with no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5443 in / 889 out tokens · 31885 ms · 2026-05-15T08:45:31.275925+00:00 · methodology


