
arxiv: 2605.18504 · v1 · pith:5UG6KD2Enew · submitted 2026-05-18 · 💻 cs.CL

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

Pith reviewed 2026-05-20 11:12 UTC · model grok-4.3

classification 💻 cs.CL
keywords Ancient Greek · Modern Greek · Machine Translation · Parallel Corpus · Fine-tuning · LLM adaptation · Low-resource MT · Sentence alignment

The pith

A new 132k-pair corpus and fine-tuning experiments lift Ancient Greek to Modern Greek translation by up to 10.3 BLEU points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the first sizable parallel dataset for Ancient Greek to Modern Greek machine translation by scraping literary and historical texts and aligning them through a multi-stage process. It then benchmarks several current models and shows that fine-tuning them on the new data produces clear gains over their untuned versions. The strongest result comes from full-parameter updates to a Greek-adapted LLM, which reaches 13.16 BLEU. Because the task has almost no prior resources, the work supplies both the training material and the first evidence that standard adaptation methods can move performance forward in this domain.

Core claim

We introduce the AG-MG Parallel Corpus of 132,481 sentence pairs created from web-scraped literary, historical, and biblical texts via a pipeline of fine-tuned LaBSE embeddings, VecAlign, and Gemini-based correction; fine-tuning NMT models and Llama-Krikri-8B on this corpus raises BLEU scores by as much as 10.3 points, with full-parameter tuning of the 8B Greek LLM attaining the highest score of 13.16.

What carries the argument

The AG-MG Parallel Corpus together with its multi-stage alignment pipeline that first fine-tunes LaBSE embeddings on a small manual seed set, runs VecAlign, and applies LLM-based misalignment correction to produce usable training pairs.
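The embedding-based alignment step can be illustrated with a toy sketch. This is not VecAlign and does not use LaBSE: it is a simplified dynamic program that assumes sentence embeddings are already available as plain lists of floats, and it only produces one-to-one links or skips, whereas a real aligner also handles merged and split sentences. The names `monotonic_align` and `skip_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def monotonic_align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Return order-preserving (src_index, tgt_index) links.

    Illustrative assumption: each sentence is either linked to exactly
    one sentence on the other side or skipped at a fixed cost.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # score[i][j]: best total score using the first i source and j target sentences
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - skip_penalty
        back[i][0] = "skip_src"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - skip_penalty
        back[0][j] = "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + cosine(src_vecs[i - 1], tgt_vecs[j - 1]), "match"),
                (score[i - 1][j] - skip_penalty, "skip_src"),
                (score[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Walk the backpointers from the end to recover the links.
    links = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return links[::-1]
```

In a pipeline like the one described, the vectors would come from the fine-tuned embedding model, and low-similarity links would be the natural candidates to hand to the LLM-based correction stage.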

If this is right

  • Full-parameter fine-tuning of Llama-Krikri-8B delivers the top BLEU score of 13.16 on the new benchmark.
  • QLoRA adaptation of M2M100-1.2B produces the largest relative improvement while remaining computationally light.
  • NLLB and M2M100 models both improve substantially after fine-tuning but remain below the best LLM result.
  • The released corpus and models supply the first public baseline for future Ancient-to-Modern Greek work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment recipe could be reused for other low-resource ancient-to-modern language pairs that have digitized but unaligned texts.
  • Higher-quality AG-MG translation would let modern readers and scholars work directly with original Greek sources without constant manual lookup.
  • Parameter-efficient methods shown here suggest that small research teams can adapt large models for historical languages without massive compute.

Load-bearing premise

The automatically aligned sentence pairs are clean and faithful enough that models trained on them learn genuine translation patterns rather than artifacts of the alignment process.

What would settle it

Human inspection of a random sample of 500 pairs from the corpus or automatic scoring of model output on a fresh, manually verified test set would reveal whether the reported BLEU gains disappear once alignment noise is removed.
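A minimal sketch of how such an audit could be set up: draw a reproducible random sample of pair indices, then turn the number of misaligned pairs found by annotators into an error-rate estimate with a Wilson score interval. The function names, the seed, and the default sample size of 500 (taken from the sentence above) are illustrative assumptions, not part of the paper.

```python
import math
import random


def draw_audit_sample(num_pairs, sample_size=500, seed=0):
    """Pick `sample_size` distinct pair indices, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_pairs), sample_size))


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error proportion."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Indices into the 132,481-pair corpus to hand to human annotators.
audit_indices = draw_audit_sample(132_481)
```

Calling `wilson_interval(errors, 500)` with the annotators' misalignment count then gives a range for the corpus-wide error rate, which is the quantity the BLEU gains depend on.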

Figures

Figures reproduced from arXiv: 2605.18504 by Maria Giagkou, Prokopis Prokopidis, Sokratis Sofianopoulos, Spyridon Mavromatis.

Figure 1: The hybrid corpus creation pipeline, combining neural embedding-based alignment with LLM ...
Figure 2: Distribution of Ancient Greek dialects in the sentence-level AG ...
Figure 3: Sentence-level token count distribution (132.4k pairs). Most sentences are short (10-30 to ...
Original abstract

Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment, and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually-aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the AG-MG Parallel Corpus of 132,481 sentence-aligned pairs for Ancient Greek to Modern Greek translation, constructed via a multi-stage pipeline that fine-tunes LaBSE embeddings on a manual seed set, applies VecAlign, and uses Gemini 2.5 Flash for misalignment correction. It then benchmarks three fine-tuning regimes on NLLB, M2M100, and Llama-Krikri-8B, reporting gains of up to +10.3 BLEU with full-parameter fine-tuning of Llama-Krikri-8B reaching 13.16 BLEU.

Significance. If the alignment quality holds, the corpus and benchmarks would constitute a useful first resource for this low-resource classical-to-modern language pair and would establish concrete baselines for future AG-MG MT work. The reported fine-tuning gains illustrate the practical value of adapting both NMT and Greek-specific LLMs on newly aligned data.

major comments (2)
  1. [Section 3] Section 3 (Corpus Creation Pipeline): The alignment procedure relies on fine-tuning LaBSE on a small manually aligned subset followed by VecAlign and Gemini correction to produce the final 132k pairs, yet no held-out alignment precision/recall, error rate, or post-correction human audit on the full corpus is reported. Residual misalignments would directly undermine the validity of the downstream MT training and the claimed +10.3 BLEU gains.
  2. [Section 4] Section 4 (Evaluation): The headline BLEU scores (13.16 for Llama-Krikri-8B and relative gains for QLoRA M2M100) are presented without specification of test-set construction, confirmation that the test split is disjoint from the training data, statistical significance testing, or any human evaluation to corroborate automatic metrics. These omissions are load-bearing for interpreting the benchmark results.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'increasing performance by up to +10.3 BLEU points' should explicitly identify the baseline model and fine-tuning variant to which the delta is computed.
  2. [Throughout] Ensure that all model names (Llama-Krikri-8B, NLLB, M2M100-1.2B) and fine-tuning strategies are used consistently in tables and text.
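The held-out alignment check requested in major comment 1 could be scored along the following lines. This is a sketch under the assumption that both the pipeline output and the manual gold standard are represented as sets of (source sentence index, target sentence index) links; the function name and that representation are hypothetical, and the paper may define alignment units differently.

```python
def alignment_prf(predicted, gold):
    """Precision, recall, and F1 of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Reporting these figures both before and after the LLM correction stage would show how much of the final quality comes from each step.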

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of corpus validation and evaluation rigor. We address each major comment point by point below, indicating the revisions we will make to the manuscript.

  1. Referee: [Section 3] Section 3 (Corpus Creation Pipeline): The alignment procedure relies on fine-tuning LaBSE on a small manually aligned subset followed by VecAlign and Gemini correction to produce the final 132k pairs, yet no held-out alignment precision/recall, error rate, or post-correction human audit on the full corpus is reported. Residual misalignments would directly undermine the validity of the downstream MT training and the claimed +10.3 BLEU gains.

    Authors: We agree that explicit quantitative validation of alignment quality is necessary to support the corpus and downstream results. The current manuscript describes the pipeline but omits these metrics. In the revised version, we will add a dedicated subsection reporting precision and recall on a held-out manually aligned set, an estimated error rate derived from manual inspection of a 1,000-pair random sample, and details of a post-correction human audit performed on 500 pairs. These additions will directly address concerns about residual misalignments and strengthen confidence in the reported BLEU improvements.
    Revision: yes

  2. Referee: [Section 4] Section 4 (Evaluation): The headline BLEU scores (13.16 for Llama-Krikri-8B and relative gains for QLoRA M2M100) are presented without specification of test-set construction, confirmation that the test split is disjoint from the training data, statistical significance testing, or any human evaluation to corroborate automatic metrics. These omissions are load-bearing for interpreting the benchmark results.

    Authors: We concur that these methodological details are essential for interpreting the benchmark. The manuscript currently lacks explicit descriptions of the split procedure and statistical tests. We will revise Section 4 to detail the test-set construction (a random 10% split with explicit confirmation of no overlap with training or validation data), include paired bootstrap significance testing for all reported BLEU differences, and add a discussion of automatic metric limitations together with a small-scale human evaluation on a 100-sentence subset or a clear statement of why broader human evaluation was not feasible for this low-resource pair. These changes will make the results more interpretable and robust.
    Revision: yes
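The two checks promised in this response, split disjointness and paired bootstrap testing, can be sketched as follows. This is an illustration only: `paired_bootstrap`, `has_overlap`, and `exact_match_rate` are hypothetical names, and the exact-match metric is a stand-in so the example runs without dependencies. In practice the `metric` argument would be a corpus-level BLEU function.

```python
import random


def has_overlap(train_pairs, test_pairs):
    """True if any (source, target) pair appears in both splits."""
    return bool(set(train_pairs) & set(test_pairs))


def exact_match_rate(hypotheses, references):
    """Stand-in corpus metric: share of hypotheses identical to the reference."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)


def paired_bootstrap(refs, hyps_a, hyps_b, metric, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    Both systems are scored on the same resampled sentence indices,
    which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_resamples
```

A win fraction close to 1 for the fine-tuned system over its base model would indicate that the reported difference is unlikely to be an artifact of which sentences landed in the test set.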

Circularity Check

0 steps flagged

No circularity: results from new corpus creation and standard fine-tuning


The paper introduces a new 132k-pair AG-MG corpus via a multi-stage alignment pipeline (VecAlign on fine-tuned LaBSE plus Gemini correction) and reports empirical BLEU improvements from fine-tuning NMT models and an LLM. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The +10.3 BLEU gains and 13.16 score are measured outcomes on the new data, not forced equivalences. Alignment quality is an unvalidated assumption but does not create self-definitional or load-bearing circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the quality of the automatically generated alignments and the assumption that BLEU is a suitable primary metric for this language pair; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: VecAlign with LaBSE embeddings fine-tuned on a small manually aligned subset, plus LLM correction, produces high-quality sentence alignments.
    Invoked in the corpus creation pipeline described in the abstract.
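
Because the ledger treats BLEU as the primary metric, a minimal sketch of how a corpus-level BLEU score is computed may help in reading the 13.16 figure. It assumes whitespace tokenization, one reference per sentence, and no smoothing. The scoring tool and settings used in the paper are not stated in this summary, so numbers from this sketch are not comparable to the reported ones.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale, single reference."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages outputs shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

The geometric mean over 1- to 4-gram precisions is why scores in the low teens can still reflect substantial word-level overlap: a single weak higher-order precision pulls the whole score down.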

pith-pipeline@v0.9.0 · 5820 in / 1270 out tokens · 47367 ms · 2026-05-20T11:12:07.668205+00:00 · methodology


