A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

Fotios Fitsilis; George Mikros

arxiv: 2605.22978 · v2 · pith:T7BGBYQEnew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-25 05:45 UTC · model grok-4.3

keywords Katharevousa Greek · Universal Dependencies · parliamentary text · dependency parsing · historical NLP · OCR processing · treebank creation · multilingual models

The pith

A pipeline turns OCR'd Katharevousa Greek parliamentary questions into a 1,697-sentence UD treebank where XLM-R reaches 0.5162 LAS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build and evaluate the first Universal Dependencies resource for Katharevousa Greek, a historical register used in Greek legal and parliamentary records. It links several steps including OCR reconstruction, LLM-assisted annotation under schema constraints, automatic validation, and fixed-split model training. A sympathetic reader would care because the work converts otherwise inaccessible historical archives into data that modern parsers can use, while showing that register mismatch defeats off-the-shelf tools. The authors release the full pipeline, frozen annotations, and splits so others can replicate or extend the resource.

Core claim

The central claim is that the described pipeline produces a reproducible 1,697-sentence UD-style reference set for Katharevousa parliamentary text; on this set an XLM-R model attains 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS and 0.5162 LAS, an absolute LAS improvement of 0.0980 over the strongest external baseline.
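As a quick arithmetic check on the headline comparison, the sketch below subtracts the two rounded LAS figures quoted in the abstract. The variable names are illustrative. With the four-decimal values as printed, the difference comes out to 0.0979; the 0.0001 gap from the reported 0.0980 is what one would expect if the authors computed the gain from unrounded scores, though that is an assumption here and not something the abstract states.

```python
from decimal import Decimal

# LAS figures as quoted in the abstract (already rounded to four decimals).
xlmr_las = Decimal("0.5162")         # best structural parser, XLM-R
spacy_greek_las = Decimal("0.4183")  # strongest external baseline, spaCy Greek

# Absolute gain from the rounded figures. Decimal avoids binary
# floating-point noise in the subtraction.
absolute_gain = xlmr_las - spacy_greek_las

# Relative gain over the baseline, for context only; the paper reports
# the absolute figure.
relative_gain = absolute_gain / spacy_greek_las

print(f"absolute LAS gain from rounded scores: {absolute_gain}")
print(f"relative LAS gain over baseline: {relative_gain:.1%}")
```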

What carries the argument

The schema-constrained LLM-assisted annotation pipeline with automatic validation that produces the frozen 1,697-sentence reference set and fixed train/test split.

If this is right

Off-the-shelf Greek and Ancient Greek parsers exhibit substantial register mismatch on Katharevousa text, with the best baseline reaching only 0.4183 LAS.
At this data scale a feature-based parser stays competitive on UPOS tagging and relation labeling.
The full pipeline, code, schema, annotations, and benchmark reports are released as open-access infrastructure for historical Greek text.
Custom training on the new resource outperforms external baselines by a measurable margin under identical scoring.
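For readers unfamiliar with the two attachment metrics, here is a minimal sketch of how UAS and LAS are conventionally computed: UAS counts tokens whose predicted head matches the gold head, and LAS additionally requires the dependency relation to match. The function name and the toy sentence are made up for illustration; the paper's actual scoring script may differ in details such as punctuation handling or relation subtypes.

```python
from typing import List, Tuple

# One token's analysis: (head index, dependency relation). Head 0 is the root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted analyses."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: the head alone must match.
    head_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label must match.
    both_correct = sum(g == p for g, p in zip(gold, pred))
    return head_correct / len(gold), both_correct / len(gold)


# Toy four-token sentence (hypothetical, not from the treebank).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (1, "obj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In the toy example three of four heads are right but only two tokens also carry the right label, which is the same pattern as the gap between the reported 0.6098 UAS and 0.5162 LAS.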

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same OCR-to-UD workflow could be tested on other low-resource historical registers that suffer from similar OCR noise and register shift.
The released treebank enables downstream applications such as syntactic search over Greek parliamentary archives from the post-junta period.
Future experiments could measure whether adding more sentences or larger language models further reduces the remaining gap to modern Greek parsing performance.

Load-bearing premise

The schema-constrained LLM-assisted annotation combined with automatic validation produces annotations accurate enough to train and fairly evaluate parsers on Katharevousa parliamentary text.
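To make concrete what automatic validation can and cannot catch, the sketch below runs a few structural checks on one sentence in CoNLL-U form: ten columns per token, consecutive IDs, exactly one root, heads that point to existing tokens, and no cycles. These rules are assumptions chosen for illustration, not the paper's validation rules, and `row` and `validate_sentence` are hypothetical helpers. Note that every check is about well-formedness; none can tell whether a well-formed tree is the linguistically correct one.

```python
def row(tok_id, form, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (unused columns set to '_')."""
    return "\t".join(
        [str(tok_id), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
    )


def validate_sentence(lines):
    """Return a list of structural problems found in one sentence."""
    problems = []
    heads = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"expected 10 columns, got {len(cols)}")
            continue
        tok_id, head = cols[0], cols[6]
        if not tok_id.isdigit():
            continue  # multiword-token ranges (1-2) and empty nodes (1.1)
        if not head.isdigit():
            problems.append(f"token {tok_id}: HEAD is not an integer")
            continue
        heads[int(tok_id)] = int(head)

    if sorted(heads) != list(range(1, len(heads) + 1)):
        problems.append("token IDs are not consecutive from 1")
    roots = [i for i, h in heads.items() if h == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for i, h in heads.items():
        if h != 0 and h not in heads:
            problems.append(f"token {i}: HEAD {h} is out of range")

    # Follow HEAD links upward from every token; revisiting a node means a cycle.
    for start in heads:
        seen, node = set(), start
        while node in heads:
            if node in seen:
                problems.append(f"cycle reachable from token {start}")
                break
            seen.add(node)
            node = heads[node]
    return problems


# Invented example sentence, not taken from the corpus.
good = [
    row(1, "Ὁ", "DET", 2, "det"),
    row(2, "ὑπουργὸς", "NOUN", 3, "nsubj"),
    row(3, "ἀπαντᾷ", "VERB", 0, "root"),
]
# Same tokens, but tokens 1 and 2 point at each other.
cyclic = [
    row(1, "Ὁ", "DET", 2, "det"),
    row(2, "ὑπουργὸς", "NOUN", 1, "nsubj"),
    row(3, "ἀπαντᾷ", "VERB", 0, "root"),
]

print(validate_sentence(good))
print(validate_sentence(cyclic))
```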

What would settle it

A manual inter-annotator agreement study or error analysis on the 1,697 sentences that reports low agreement or high systematic annotation errors would show the reference set is not reliable for training and evaluation.
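One common way to quantify such an agreement study is Cohen's kappa over per-token labels from two annotators, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal implementation under that assumption; the paper reports no such study, and the label sequences here are invented purely to show the calculation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented UPOS labels for four tokens from two hypothetical annotators.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The same function could be applied to dependency relations or to head indices treated as labels, although agreement on tree structure is more often reported as UAS and LAS between annotators.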

Figures

Figures reproduced from arXiv: 2605.22978 by Fotios Fitsilis, George Mikros.

Original abstract

Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1{,}697 sentences, split into 1{,}357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.
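The fixed 1,357/340 split mentioned in the abstract is roughly an 80/20 division of the 1,697 sentences. A minimal sketch of one way to make such a split reproducible follows: sort the sentence IDs so input order does not matter, shuffle with a seeded generator, and slice. The seed, the ID format, and the function name are all assumptions for illustration; the abstract does not say how the authors derived their split, and the released split files are the authoritative version.

```python
import random


def fixed_split(sentence_ids, test_size, seed=13):
    """Deterministically split IDs into (train, test) lists."""
    # Sorting first makes the result independent of input order.
    ordered = sorted(sentence_ids)
    # A private seeded generator leaves the global random state untouched.
    rng = random.Random(seed)
    rng.shuffle(ordered)
    test = sorted(ordered[:test_size])
    train = sorted(ordered[test_size:])
    return train, test


# Hypothetical sentence IDs for a 1,697-sentence reference set.
all_ids = [f"sent-{i:04d}" for i in range(1, 1698)]
train_ids, test_ids = fixed_split(all_ids, test_size=340)

print(len(train_ids), len(test_ids))
print(f"test share: {len(test_ids) / len(all_ids):.4f}")
```

Writing the resulting ID lists to disk and shipping them with the data, as the authors do with their frozen split, is more robust than relying on every user to rerun the same shuffle.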

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New reproducible treebank for Katharevousa Greek parliamentary text, but LLM annotations lack any human validation metrics so the parser scores remain ungrounded.

The paper delivers the first reported UD-style treebank for Katharevousa Greek parliamentary questions, with 1,697 sentences, a fixed 1,357/340 split, and an open pipeline that starts from OCR and ends in CoNLL-U. They release the code, schema, frozen annotations, and benchmarks, which is the main concrete contribution. XLM-R reaches 0.5162 LAS and beats the best off-the-shelf baseline by 0.098, while a feature-based model stays competitive on UPOS and relations. That is useful work for anyone who needs syntactic tools on this register. The reproducibility steps, including deterministic snapshotting, are done cleanly.

The soft spot is the annotation step. The pipeline relies on schema-constrained LLM labeling plus automatic validation, yet the abstract and description give no inter-annotator agreement, no error analysis on the final set, and no human review numbers. Automatic checks alone do not establish that the 1,697 sentences are reliable gold data, so the reported gains could partly reflect annotation patterns rather than model strength. This is a standard requirement for new treebanks and its absence is noticeable.

The work is aimed at researchers building resources for historical legal and administrative Greek. It is narrow in scope but fills a documented gap, and the open release makes it worth referee time even if the evaluation section needs strengthening. I would send it to peer review with a request for human validation metrics on the annotations.

Referee Report

1 major / 1 minor

Summary. The paper presents a reproducible pipeline for building a Universal Dependencies-style parsing resource for Katharevousa Greek parliamentary questions, covering OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, and a fixed 1,357/340 train/test split on 1,697 sentences. It evaluates off-the-shelf Greek/Ancient Greek parsers, a feature-based model, mBERT, XLM-R, and custom Stanza training, reporting that XLM-R achieves 0.8893 UPOS, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS (0.0980 absolute gain over the best external baseline spaCy Greek at 0.4183 LAS). The full pipeline, code, schema, frozen annotations, and split are released openly.

Significance. If the annotations hold as reliable gold data, the work supplies needed syntactic infrastructure for an underserved historical register and demonstrates clear register mismatch in existing tools. The open release of code, schema, annotations, fixed split, and per-model reports is a concrete strength for reproducibility and future work.

major comments (1)

[Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.

minor comments (1)

[Abstract] The sentence count is written as '1{,}697'; standardize to conventional 1,697 throughout.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the focus on annotation reliability. We respond to the single major comment below.

Referee: [Abstract / pipeline description] Abstract and pipeline description (annotation and evaluation sections): the headline parser scores and the claimed 0.0980 LAS gain presuppose that the automatically validated 1,697-sentence set constitutes reliable gold data, yet the manuscript supplies no inter-annotator agreement figures, no human error analysis on the final annotations, and no quantitative validation metrics for the LLM-assisted step on the 1,357/340 split. Automatic validation alone cannot substitute for these measures when assessing whether reported performance reflects model capability or annotation artifacts.

Authors: The annotation process described in the manuscript is explicitly LLM-assisted under schema constraints, followed by automatic validation against UD rules; it was not designed as a multi-human annotation project. Consequently, inter-annotator agreement figures and human error analysis are not available and were never collected. No separate quantitative metrics for the LLM step (beyond the deterministic validation rules) are reported. We accept that this constitutes a limitation when claiming gold-standard status and will revise the annotation and evaluation sections to (a) state the LLM-assisted nature and absence of IAA explicitly, (b) detail the exact validation rules applied, and (c) note that the released frozen annotations permit independent human verification. We do not claim that automatic validation fully substitutes for human agreement metrics.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical resource creation with external baselines

The paper describes an empirical pipeline for creating a UD-style annotation resource for Katharevousa Greek parliamentary text and benchmarks several parsers (including off-the-shelf external systems like spaCy Greek) against a held-out test set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described full text. The central claims rest on released code, frozen annotations, and comparisons to independent external baselines rather than any self-referential reduction. The absence of quantitative IAA is a validity concern but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied resource-creation paper with no theoretical derivation. No free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1232 out tokens · 21503 ms · 2026-05-25T05:45:50.442411+00:00 · methodology