
arxiv: 2605.22978 · v2 · pith:T7BGBYQEnew · submitted 2026-05-21 · 💻 cs.CL

A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

Pith reviewed 2026-05-25 05:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords Katharevousa Greek · Universal Dependencies · parliamentary text · dependency parsing · historical NLP · OCR processing · treebank creation · multilingual models

The pith

A pipeline turns OCR'd Katharevousa Greek parliamentary questions into a 1,697-sentence UD treebank where XLM-R reaches 0.5162 LAS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build and evaluate the first Universal Dependencies resource for Katharevousa Greek, a historical register used in Greek legal and parliamentary records. It chains OCR reconstruction, LLM-assisted annotation under schema constraints, automatic validation, and fixed-split model training. A sympathetic reader would care because the work converts otherwise inaccessible historical archives into data that modern parsers can use, while showing that register mismatch defeats off-the-shelf tools. The authors release the full pipeline, frozen annotations, and splits so others can replicate or extend the resource.

Core claim

The central claim is that the described pipeline produces a reproducible 1,697-sentence UD-style reference set for Katharevousa parliamentary text; on this set an XLM-R model attains 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS and 0.5162 LAS, an absolute LAS improvement of 0.0980 over the strongest external baseline.
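
For readers less familiar with the attachment metrics quoted here, the following is a minimal sketch of how UAS and LAS are conventionally computed: UAS counts tokens whose predicted head matches the reference, and LAS additionally requires the relation label to match. The function name `attachment_scores` and the toy sentence are illustrative assumptions; this is not the paper's released scoring code, and it assumes gold and predicted tokenization are identical.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (head index, dependency relation); head 0 is the root


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (UAS, LAS) micro-averaged over all tokens."""
    total = uas_hits = las_hits = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1          # correct head: counts toward UAS
                if g_rel == p_rel:
                    las_hits += 1      # correct head and label: counts toward LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# Invented three-token example: all heads correct, one label wrong.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "obl")]]
uas, las = attachment_scores(gold, pred)
```

In this toy case every head is right but one label is wrong, so LAS falls below UAS, the same pattern as the reported 0.6098 UAS versus 0.5162 LAS. Note also that the rounded scores quoted above differ by 0.5162 − 0.4183 = 0.0979; the reported 0.0980 gain is presumably computed from unrounded values.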

What carries the argument

The schema-constrained LLM-assisted annotation pipeline with automatic validation that produces the frozen 1,697-sentence reference set and fixed train/test split.

If this is right

  • Off-the-shelf Greek and Ancient Greek parsers exhibit substantial register mismatch on Katharevousa text, with the best baseline reaching only 0.4183 LAS.
  • At this data scale a feature-based parser stays competitive on UPOS tagging and relation labeling.
  • The full pipeline, code, schema, annotations, and benchmark reports are released as open-access infrastructure for historical Greek text.
  • Custom training on the new resource outperforms external baselines by a measurable margin under identical scoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same OCR-to-UD workflow could be tested on other low-resource historical registers that suffer from similar OCR noise and register shift.
  • The released treebank enables downstream applications such as syntactic search over Greek parliamentary archives from the post-junta period.
  • Future experiments could measure whether adding more sentences or larger language models further reduces the remaining gap to modern Greek parsing performance.

Load-bearing premise

The schema-constrained LLM-assisted annotation combined with automatic validation produces annotations accurate enough to train and fairly evaluate parsers on Katharevousa parliamentary text.
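
To make concrete what automatic validation can and cannot establish, here is a hedged sketch of structural checks of the kind a UD-style validator typically applies: contiguous token ids, a known UPOS tag, exactly one root, in-range heads, and no cycles. The rule set, the `validate_sentence` name, and the toy sentence are assumptions for illustration; the summary does not list the paper's actual validation rules.

```python
from typing import List, Tuple

# The 17 universal POS tags defined by Universal Dependencies.
UPOS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}

Row = Tuple[int, str, str, int, str]  # (id, form, upos, head, deprel)


def validate_sentence(rows: List[Row]) -> List[str]:
    """Return a list of structural errors; an empty list means well-formed."""
    errors: List[str] = []
    n = len(rows)
    ids = [r[0] for r in rows]
    if ids != list(range(1, n + 1)):
        return ["token ids are not 1..n in order"]
    roots = [r[0] for r in rows if r[3] == 0]
    if len(roots) != 1:
        errors.append(f"expected exactly one root, found {len(roots)}")
    for tid, _form, upos, head, deprel in rows:
        if upos not in UPOS:
            errors.append(f"token {tid}: unknown UPOS {upos!r}")
        if not 0 <= head <= n or head == tid:
            errors.append(f"token {tid}: invalid head {head}")
        if (head == 0) != (deprel == "root"):
            errors.append(f"token {tid}: head 0 and deprel 'root' must co-occur")
    # Cycle check: following heads from any token must eventually reach 0.
    heads = {r[0]: r[3] for r in rows}
    for tid in ids:
        seen, cur = set(), tid
        while cur != 0 and cur in heads and cur not in seen:
            seen.add(cur)
            cur = heads[cur]
        if cur != 0 and cur in seen:
            errors.append(f"token {tid}: head chain contains a cycle")
    return errors


# Invented toy sentence (not taken from the treebank).
good = [(1, "Ο", "DET", 2, "det"),
        (2, "Υπουργός", "NOUN", 3, "nsubj"),
        (3, "απαντά", "VERB", 0, "root")]
bad = good[:2] + [(3, "απαντά", "VERB", 1, "obj")]  # no root, and 1→2→3→1
good_errors = validate_sentence(good)
bad_errors = validate_sentence(bad)
```

Checks like these guarantee that a tree is well-formed, but a well-formed tree can still attach a word to the wrong head or carry the wrong label, which is why the premise above is load-bearing.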

What would settle it

A manual inter-annotator agreement study or error analysis on the 1,697 sentences that reports low agreement or high systematic annotation errors would show the reference set is not reliable for training and evaluation.
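
One common way to run such a study is to have a human re-annotate a sample and report chance-corrected agreement with the frozen labels. The sketch below computes Cohen's kappa over per-token labels; the function name `cohen_kappa` and the four-token label lists are invented for illustration, and the choice of kappa (rather than, say, raw LAS between annotators) is an assumption, not something the paper or the referee prescribes.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two label sequences over the same tokens."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label counts.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: frozen labels versus a hypothetical human re-annotation.
frozen = ["nsubj", "obj", "obl", "obj"]
human = ["nsubj", "obj", "obj", "obj"]
kappa = cohen_kappa(frozen, human)
```

Here raw agreement is 3 of 4 tokens, but kappa is lower because "obj" dominates both label sets and some agreement is expected by chance.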

Figures

Figures reproduced from arXiv: 2605.22978 by Fotios Fitsilis, George Mikros.

Figure 1: Metric comparison across external baselines and custom models on the fixed held-out …
Original abstract

Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline (code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports) is released as an open-access companion to this paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a reproducible pipeline for building a Universal Dependencies-style parsing resource for Katharevousa Greek parliamentary questions, covering OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, and a fixed 1,357/340 train/test split on 1,697 sentences. It evaluates off-the-shelf Greek/Ancient Greek parsers, a feature-based model, mBERT, XLM-R, and custom Stanza training, reporting that XLM-R achieves 0.8893 UPOS, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS (0.0980 absolute gain over the best external baseline spaCy Greek at 0.4183 LAS). The full pipeline, code, schema, frozen annotations, and split are released openly.

Significance. If the annotations hold as reliable gold data, the work supplies needed syntactic infrastructure for an underserved historical register and demonstrates clear register mismatch in existing tools. The open release of code, schema, annotations, fixed split, and per-model reports is a concrete strength for reproducibility and future work.

major comments (1)
  1. [Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.
minor comments (1)
  1. [Abstract] The sentence count is written as '1{,}697'; standardize to conventional 1,697 throughout.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the focus on annotation reliability. We respond to the single major comment below.

  1. Referee: [Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.

    Authors: The annotation process described in the manuscript is explicitly LLM-assisted under schema constraints, followed by automatic validation against UD rules; it was not designed as a multi-human annotation project. Consequently, inter-annotator agreement figures and human error analysis are not available and were never collected. No separate quantitative metrics for the LLM step (beyond the deterministic validation rules) are reported. We accept that this constitutes a limitation when claiming gold-standard status and will revise the annotation and evaluation sections to (a) state the LLM-assisted nature and absence of IAA explicitly, (b) detail the exact validation rules applied, and (c) note that the released frozen annotations permit independent human verification. We do not claim that automatic validation fully substitutes for human agreement metrics.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical resource creation with external baselines

The paper describes an empirical pipeline for creating a UD-style annotation resource for Katharevousa Greek parliamentary text and benchmarks several parsers (including off-the-shelf external systems like spaCy Greek) against a held-out test set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described full text. The central claims rest on released code, frozen annotations, and comparisons to independent external baselines rather than any self-referential reduction. The absence of quantitative IAA is a validity concern but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied resource-creation paper with no theoretical derivation. No free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1232 out tokens · 21503 ms · 2026-05-25T05:45:50.442411+00:00 · methodology
