A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
Pith reviewed 2026-05-25 05:45 UTC · model grok-4.3
The pith
A pipeline turns OCR'd Katharevousa Greek parliamentary questions into a 1,697-sentence UD treebank where XLM-R reaches 0.5162 LAS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the described pipeline produces a reproducible 1,697-sentence UD-style reference set for Katharevousa parliamentary text. On this set an XLM-R model attains 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 unlabeled attachment score (UAS), and 0.5162 labeled attachment score (LAS), an absolute LAS improvement of 0.0980 over the strongest external baseline.
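To make the two attachment metrics concrete, here is a minimal sketch of how UAS and LAS are conventionally computed. It assumes gold and predicted parses share the same tokenization and that each token is reduced to a (head, relation) pair. The function name `attachment_scores` and the toy sentence are illustrative; this is not the paper's scoring script, whose exact protocol is not described here.

```python
from typing import List, Tuple

# One token is represented as (head index, dependency relation); head 0 is the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[List[Token]], pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (UAS, LAS), micro-averaged over all tokens.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency relation to match.
    """
    total = uas_hits = las_hits = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences must share tokenization")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1
                if g_rel == p_rel:
                    las_hits += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# Toy example: all heads are right, one relation label is wrong.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "obl")]]
uas, las = attachment_scores(gold, pred)
```

Because LAS can only credit a token that already counts toward UAS, LAS never exceeds UAS, which is consistent with the reported 0.6098 UAS and 0.5162 LAS.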
What carries the argument
The schema-constrained LLM-assisted annotation pipeline with automatic validation that produces the frozen 1,697-sentence reference set and fixed train/test split.
If this is right
- Off-the-shelf Greek and Ancient Greek parsers exhibit substantial register mismatch on Katharevousa text, with the best baseline reaching only 0.4183 LAS.
- At this data scale a feature-based parser stays competitive on UPOS tagging and relation labeling.
- The full pipeline, code, schema, annotations, and benchmark reports are released as open-access infrastructure for historical Greek text.
- Custom training on the new resource outperforms external baselines under identical scoring: the XLM-R model gains 0.0980 absolute LAS over spaCy Greek.
Where Pith is reading between the lines
- The same OCR-to-UD workflow could be tested on other low-resource historical registers that suffer from similar OCR noise and register shift.
- The released treebank enables downstream applications such as syntactic search over Greek parliamentary archives from the post-junta period.
- Future experiments could measure whether adding more sentences or larger language models further reduces the remaining gap to modern Greek parsing performance.
Load-bearing premise
The schema-constrained LLM-assisted annotation combined with automatic validation produces annotations accurate enough to train and fairly evaluate parsers on Katharevousa parliamentary text.
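As an illustration of what automatic validation can and cannot establish, the sketch below runs a few structural checks on a single CoNLL-U sentence: ten tab-separated columns, consecutive token IDs, a known UPOS tag, in-range heads, exactly one root, and no cycles. The specific rule set and the name `validate_sentence` are assumptions made for this example; the paper's actual validation rules are not listed here.

```python
from typing import Dict, List

# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def validate_sentence(conllu: str) -> List[str]:
    """Return a list of structural errors for one CoNLL-U sentence (empty if none)."""
    errors: List[str] = []
    rows = []
    for line in conllu.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"expected 10 columns, got {len(cols)}")
            continue
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        rows.append(cols)

    ids = [int(c[0]) for c in rows]
    if ids != list(range(1, len(rows) + 1)):
        errors.append("token IDs are not consecutive from 1")

    heads: Dict[int, int] = {}
    for cols in rows:
        tid = int(cols[0])
        if cols[3] not in UPOS:
            errors.append(f"token {tid}: unknown UPOS {cols[3]!r}")
        if not cols[6].isdigit() or int(cols[6]) > len(rows) or int(cols[6]) == tid:
            errors.append(f"token {tid}: invalid head {cols[6]!r}")
            continue
        heads[tid] = int(cols[6])

    roots = [t for t, h in heads.items() if h == 0]
    if len(roots) != 1:
        errors.append(f"expected exactly one root, found {len(roots)}")

    # Follow head pointers from every token; revisiting a node means a cycle.
    for start in heads:
        seen = set()
        node = start
        while node != 0 and node in heads:
            if node in seen:
                errors.append("cycle detected in head pointers")
                return errors
            seen.add(node)
            node = heads[node]
    return errors


good = (
    "1\tΗ\tο\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tΒουλή\tΒουλή\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tαποφασίζει\tαποφασίζω\tVERB\t_\t_\t0\troot\t_\t_"
)
bad = good.replace("\t0\troot", "\t1\troot")  # token 3 now points at token 1
good_errors = validate_sentence(good)
bad_errors = validate_sentence(bad)
```

Checks of this kind confirm that a tree is well-formed. They cannot tell whether a well-formed tree is the linguistically correct one, which is why the premise above carries so much weight.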
What would settle it
A manual inter-annotator agreement study or error analysis on the 1,697 sentences that reports low agreement or high systematic annotation errors would show the reference set is not reliable for training and evaluation.
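One way such an agreement study could be quantified is Cohen's kappa over per-token dependency labels, comparing a human annotator with the released annotations. The sketch below is a plain implementation of the standard kappa formula. The label sequences are invented for illustration and are not drawn from the treebank.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two equal-length label sequences over the same tokens."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of tokens given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical relation labels for four tokens from two annotation sources.
released = ["nsubj", "obj", "obl", "nsubj"]
human = ["nsubj", "obj", "obj", "nsubj"]
kappa = cohen_kappa(released, human)
```

In this toy case three of four labels agree (0.75 observed) against 0.375 expected by chance, giving a kappa of 0.6. A real study would also need to compare head attachments, not only labels.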
Original abstract
Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline (code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports) is released as an open-access companion to this paper.
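The abstract does not say how the fixed 1,357/340 split was drawn. As a hedged sketch of one common way to make a split reproducible, the example below ranks sentence IDs by a salted SHA-256 hash and takes the first 340 as the test set. The ID scheme, the salt, and the function `fixed_split` are hypothetical and are not taken from the released code.

```python
import hashlib
from typing import List, Sequence, Tuple


def fixed_split(sentence_ids: Sequence[str], test_size: int,
                salt: str = "split-v1") -> Tuple[List[str], List[str]]:
    """Deterministically split IDs into (train, test), independent of input order."""

    def rank(sid: str) -> str:
        # Hashing the salted ID yields a stable pseudo-random ordering.
        return hashlib.sha256(f"{salt}:{sid}".encode("utf-8")).hexdigest()

    test = set(sorted(sentence_ids, key=rank)[:test_size])
    train_ids = [s for s in sentence_ids if s not in test]
    test_ids = [s for s in sentence_ids if s in test]
    return train_ids, test_ids


# Hypothetical IDs for a 1,697-sentence reference set.
ids = [f"sent-{i:04d}" for i in range(1, 1698)]
train_ids, test_ids = fixed_split(ids, test_size=340)
```

Since membership depends only on the ID and the salt, re-running the split on a reordered file reproduces the same test set, which is the property a frozen benchmark needs.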
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a reproducible pipeline for building a Universal Dependencies-style parsing resource for Katharevousa Greek parliamentary questions, covering OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, and a fixed 1,357/340 train/test split on 1,697 sentences. It evaluates off-the-shelf Greek/Ancient Greek parsers, a feature-based model, mBERT, XLM-R, and custom Stanza training, reporting that XLM-R achieves 0.8893 UPOS, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS (0.0980 absolute gain over the best external baseline spaCy Greek at 0.4183 LAS). The full pipeline, code, schema, frozen annotations, and split are released openly.
Significance. If the annotations hold as reliable gold data, the work supplies needed syntactic infrastructure for an underserved historical register and demonstrates clear register mismatch in existing tools. The open release of code, schema, annotations, fixed split, and per-model reports is a concrete strength for reproducibility and future work.
major comments (1)
- [Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.
minor comments (1)
- [Abstract] The sentence count is written as '1{,}697'; standardize to conventional 1,697 throughout.
Simulated Author's Rebuttal
We thank the referee for the careful review and the focus on annotation reliability. We respond to the single major comment below.
- Referee: [Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.
Authors: The annotation process described in the manuscript is explicitly LLM-assisted under schema constraints, followed by automatic validation against UD rules; it was not designed as a multi-human annotation project. Consequently, inter-annotator agreement figures and human error analysis are not available and were never collected. No separate quantitative metrics for the LLM step (beyond the deterministic validation rules) are reported. We accept that this constitutes a limitation when claiming gold-standard status and will revise the annotation and evaluation sections to (a) state the LLM-assisted nature and absence of IAA explicitly, (b) detail the exact validation rules applied, and (c) note that the released frozen annotations permit independent human verification. We do not claim that automatic validation fully substitutes for human agreement metrics.
Revision: partial
Circularity Check
No circularity: empirical resource creation with external baselines
The paper describes an empirical pipeline for creating a UD-style annotation resource for Katharevousa Greek parliamentary text and benchmarks several parsers (including off-the-shelf external systems like spaCy Greek) against a held-out test set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described full text. The central claims rest on released code, frozen annotations, and comparisons to independent external baselines rather than any self-referential reduction. The absence of quantitative IAA is a validity concern but does not constitute circularity under the defined patterns.