Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

Maria Giagkou; Prokopis Prokopidis; Sokratis Sofianopoulos; Spyridon Mavromatis

arxiv: 2605.18504 · v1 · pith:5UG6KD2Enew · submitted 2026-05-18 · 💻 cs.CL

Pith reviewed 2026-05-20 11:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords: Ancient Greek · Modern Greek · Machine Translation · Parallel Corpus · Fine-tuning · LLM adaptation · Low-resource MT · Sentence alignment

The pith

A new 132k-pair corpus and fine-tuning experiments lift Ancient Greek to Modern Greek translation by up to 10.3 BLEU points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the first sizable parallel dataset for Ancient Greek to Modern Greek machine translation by scraping literary and historical texts and aligning them through a multi-stage process. It then benchmarks several current models and shows that fine-tuning them on the new data produces clear gains over their untuned versions. The strongest result comes from full-parameter updates to a Greek-adapted LLM, which reaches 13.16 BLEU. Because the task has almost no prior resources, the work supplies both the training material and the first evidence that standard adaptation methods can move performance forward in this domain.

Core claim

We introduce the AG-MG Parallel Corpus of 132,481 sentence pairs created from web-scraped literary, historical, and biblical texts via a pipeline of fine-tuned LaBSE embeddings, VecAlign, and Gemini-based correction; fine-tuning NMT models and Llama-Krikri-8B on this corpus raises BLEU scores by as much as 10.3 points, with full-parameter tuning of the 8B Greek LLM attaining the highest score of 13.16.

What carries the argument

The AG-MG Parallel Corpus together with its multi-stage alignment pipeline that first fine-tunes LaBSE embeddings on a small manual seed set, runs VecAlign, and applies LLM-based misalignment correction to produce usable training pairs.

If this is right

Full-parameter fine-tuning of Llama-Krikri-8B delivers the top BLEU score of 13.16 on the new benchmark.
QLoRA adaptation of M2M100-1.2B produces the largest relative improvement while remaining computationally light.
NLLB and M2M100 models both improve substantially after fine-tuning but remain below the best LLM result.
The released corpus and models supply the first public baseline for future Ancient-to-Modern Greek work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alignment recipe could be reused for other low-resource ancient-to-modern language pairs that have digitized but unaligned texts.
Higher-quality AG-MG translation would let modern readers and scholars work directly with original Greek sources without constant manual lookup.
Parameter-efficient methods shown here suggest that small research teams can adapt large models for historical languages without massive compute.

Load-bearing premise

The automatically aligned sentence pairs are clean and faithful enough that models trained on them learn genuine translation patterns rather than artifacts of the alignment process.

What would settle it

Human inspection of a random sample of 500 pairs from the corpus or automatic scoring of model output on a fresh, manually verified test set would reveal whether the reported BLEU gains disappear once alignment noise is removed.
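The first check suggested above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names `sample_pairs` and `wilson_interval`, the seed, and the count of 25 flagged pairs are all hypothetical. It draws a reproducible sample of 500 pairs for manual review and turns the reviewers' error count into a 95% Wilson score interval for the corpus-wide misalignment rate.

```python
import math
import random


def sample_pairs(pairs, k=500, seed=13):
    """Draw a reproducible random sample of up to k pairs for manual review."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate of `errors` out of `n` items."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Placeholder pairs standing in for the real corpus (assumption).
    corpus = [(f"ag-{i}", f"mg-{i}") for i in range(132_481)]
    sample = sample_pairs(corpus, k=500)
    # Hypothetical outcome: reviewers flag 25 of the 500 pairs as misaligned.
    low, high = wilson_interval(errors=25, n=500)
    print(f"sampled {len(sample)} pairs; error rate between {low:.3f} and {high:.3f}")
```

A fixed seed keeps the sample auditable: a second reviewer can regenerate exactly the same 500 pairs.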

Figures

Figures reproduced from arXiv: 2605.18504 by Maria Giagkou, Prokopis Prokopidis, Sokratis Sofianopoulos, Spyridon Mavromatis.

**Figure 2.** Distribution of Ancient Greek dialects in the sentence-level AG ...

**Figure 3.** Sentence-level token count distribution (132.4k pairs). Most sentences are short (10-30 to ...

Original abstract

Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment, and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually-aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New AG-MG corpus is the real contribution here, but alignment validation is thin and limits how much we can trust the MT numbers.

The one thing to know is that this paper delivers a new 132,000-sentence parallel corpus for Ancient Greek to Modern Greek translation, built from literary, historical, and biblical sources, along with the first reported benchmarks on modern MT models for this pair. They describe a pipeline that starts with web-scraped data, applies VecAlign using a LaBSE model fine-tuned on a small manually aligned set, and then runs Gemini 2.5 Flash to correct misalignments. On the modeling side, they compare fine-tuning strategies on NLLB, M2M100, and Llama-Krikri-8B, with the best result being 13.16 BLEU from full-parameter tuning of the Greek LLM.

What stands out is the practical resource. AG to MG is a genuinely low-resource direction with real use cases in classics and history, and releasing the corpus lets others build on it without starting from scratch. The experiments show consistent gains from fine-tuning, which matches what we see in other low-resource settings, and they include both full and parameter-efficient methods.

The main limitation is around the corpus construction. The alignment quality is central, yet the paper gives no quantitative checks like alignment precision or recall on a held-out set for the final data, nor any post-correction human review of a sample from the 132k pairs. Without that, it's difficult to separate real model improvements from possible noise in the training data. The reported +10.3 BLEU jump would be more convincing with those details or at least an error analysis of the outputs.

This work is aimed at researchers in machine translation for historical languages and anyone doing Greek NLP. A reader who needs a starting dataset or wants to compare against these baselines will find it useful. It is worth sending to peer review because the new data is the core value and the experiments are standard enough to evaluate once the data quality is better documented.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the AG-MG Parallel Corpus of 132,481 sentence-aligned pairs for Ancient Greek to Modern Greek translation, constructed via a multi-stage pipeline that fine-tunes LaBSE embeddings on a manual seed set, applies VecAlign, and uses Gemini 2.5 Flash for misalignment correction. It then benchmarks three fine-tuning regimes on NLLB, M2M100, and Llama-Krikri-8B, reporting gains of up to +10.3 BLEU with full-parameter fine-tuning of Llama-Krikri-8B reaching 13.16 BLEU.

Significance. If the alignment quality holds, the corpus and benchmarks would constitute a useful first resource for this low-resource classical-to-modern language pair and would establish concrete baselines for future AG-MG MT work. The reported fine-tuning gains illustrate the practical value of adapting both NMT and Greek-specific LLMs on newly aligned data.

major comments (2)

[Section 3] Section 3 (Corpus Creation Pipeline): The alignment procedure relies on fine-tuning LaBSE on a small manually aligned subset followed by VecAlign and Gemini correction to produce the final 132k pairs, yet no held-out alignment precision/recall, error rate, or post-correction human audit on the full corpus is reported. Residual misalignments would directly undermine the validity of the downstream MT training and the claimed +10.3 BLEU gains.
[Section 4] Section 4 (Evaluation): The headline BLEU scores (13.16 for Llama-Krikri-8B and relative gains for QLoRA M2M100) are presented without specification of test-set construction, confirmation that the test split is disjoint from the training data, statistical significance testing, or any human evaluation to corroborate automatic metrics. These omissions are load-bearing for interpreting the benchmark results.

minor comments (2)

[Abstract] Abstract: The phrase 'increasing performance by up to +10.3 BLEU points' should explicitly identify the baseline model and fine-tuning variant to which the delta is computed.
[Throughout] Ensure that all model names (Llama-Krikri-8B, NLLB, M2M100-1.2B) and fine-tuning strategies are used consistently in tables and text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of corpus validation and evaluation rigor. We address each major comment point by point below, indicating the revisions we will make to the manuscript.

Referee: [Section 3] Section 3 (Corpus Creation Pipeline): The alignment procedure relies on fine-tuning LaBSE on a small manually aligned subset followed by VecAlign and Gemini correction to produce the final 132k pairs, yet no held-out alignment precision/recall, error rate, or post-correction human audit on the full corpus is reported. Residual misalignments would directly undermine the validity of the downstream MT training and the claimed +10.3 BLEU gains.

Authors: We agree that explicit quantitative validation of alignment quality is necessary to support the corpus and downstream results. The current manuscript describes the pipeline but omits these metrics. In the revised version, we will add a dedicated subsection reporting precision and recall on a held-out manually aligned set, an estimated error rate derived from manual inspection of a 1,000-pair random sample, and details of a post-correction human audit performed on 500 pairs. These additions will directly address concerns about residual misalignments and strengthen confidence in the reported BLEU improvements.

revision: yes

Referee: [Section 4] Section 4 (Evaluation): The headline BLEU scores (13.16 for Llama-Krikri-8B and relative gains for QLoRA M2M100) are presented without specification of test-set construction, confirmation that the test split is disjoint from the training data, statistical significance testing, or any human evaluation to corroborate automatic metrics. These omissions are load-bearing for interpreting the benchmark results.

Authors: We concur that these methodological details are essential for interpreting the benchmark. The manuscript currently lacks explicit descriptions of the split procedure and statistical tests. We will revise Section 4 to detail the test-set construction (a random 10% split with explicit confirmation of no overlap with training or validation data), include paired bootstrap significance testing for all reported BLEU differences, and add a discussion of automatic metric limitations together with a small-scale human evaluation on a 100-sentence subset or a clear statement of why broader human evaluation was not feasible for this low-resource pair. These changes will make the results more interpretable and robust.

revision: yes
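The paired bootstrap test mentioned in the response can be sketched as follows. This is an illustrative outline under stated assumptions, not the evaluation code of the paper: `paired_bootstrap` and `exact_match_rate` are hypothetical names, and the toy exact-match metric only stands in for a corpus-level metric such as BLEU, which would be passed in as `metric` in practice.

```python
import random


def paired_bootstrap(metric, sys_a, sys_b, refs, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    `metric(hyps, refs)` must return one corpus-level score (higher is better).
    Both systems are scored on the same resampled sentence indices.
    """
    if not (len(sys_a) == len(sys_b) == len(refs)):
        raise ValueError("all inputs must have the same length")
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        resampled_refs = [refs[i] for i in idx]
        score_a = metric([sys_a[i] for i in idx], resampled_refs)
        score_b = metric([sys_b[i] for i in idx], resampled_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_resamples


def exact_match_rate(hyps, refs):
    """Toy corpus-level metric for the demo only (assumption, not BLEU)."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


if __name__ == "__main__":
    refs = ["a", "b", "c", "d"] * 25
    system_a = list(refs)        # always correct
    system_b = ["x"] * len(refs)  # always wrong
    print(paired_bootstrap(exact_match_rate, system_a, system_b, refs))
```

A win rate close to 1.0 (or close to 0.0) suggests the difference between the two systems is unlikely to be an artifact of which test sentences happened to be drawn.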

Circularity Check

0 steps flagged

No circularity: results from new corpus creation and standard fine-tuning

The paper introduces a new 132k-pair AG-MG corpus via a multi-stage alignment pipeline (VecAlign on fine-tuned LaBSE plus Gemini correction) and reports empirical BLEU improvements from fine-tuning NMT models and an LLM. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The +10.3 BLEU gains and 13.16 score are measured outcomes on the new data, not forced equivalences. Alignment quality is an unvalidated assumption but does not create self-definitional or load-bearing circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the quality of the automatically generated alignments and the assumption that BLEU is a suitable primary metric for this language pair; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: VecAlign with LaBSE embeddings fine-tuned on a small manually aligned subset, plus LLM correction, produces high-quality sentence alignments.
Invoked in the corpus creation pipeline described in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We present a novel hybrid alignment pipeline... VecAlign with LaBSE embeddings... LLM-based error/misalignment correction phase using Gemini 2.5 Flash"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "full-parameter fine-tuning of Llama-Krikri-8B achieves... BLEU score of 13.16"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1]

Introduction. Machine Translation (MT) has evolved from rule-based and statistical approaches to neural sequence-to-sequence models and is now being reshaped by Large Language Models (LLMs). Recent studies suggest that LLMs, pre-trained on extensive multilingual data, can rival or complement traditional encoder–decoder MT, particularly in settings w...
[2]

... and NLLB (Costa-Jussà et al., 2022), and (ii) open Greek LLMs, notably Llama-Krikri, which offer strong Greek fluency and tokenization while remaining accessible for adaptation (Roussis et al., 2025; Voukoutis et al.). This paper presents a high-quality AG→MG sentence-level parallel corpus, as well as fine-tuning NMT models and a Greek LLM for this ta...
[3]

We introduce the AG-MG Parallel Corpus, the largest sentence-aligned corpus for this low-resource language pair, containing 132,481 high-quality aligned sentence pairs. The corpus is enriched with extensive metadata, including author, title, segment ind...
(Footnote 4: https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct)
[4]

We present a novel hybrid alignment pipeline for creating the corpus. Our method first uses a domain-adapted LaBSE model for initial alignment and then leverages a LLM (Gemini 2.5 Flash developed by Comanici et al., 2025) for misalignment detection and correction, ensuring superior alignment quality.
[5]

We conduct the first large-scale benchmark of both NMT and LLM-based models for AG-to-MG translation. We compare three different fine-tuning strategies (Full-parameter, LoRA, and QLoRA) on NLLB, M2M100 and Llama-Krikri-8B, providing a clear picture of the current state-of-the-art on this task. Our results demonstrate that fine-tuned models significa...
[6]

Related Work. Our work is positioned at the intersection of classical language resource creation and low-resource machine translation. 2.1. Parallel Corpora and Alignment. Although there are monolingual corpora for Ancient Greek, such as the Diorisis corpus (Vatri and McGillivray, 2018), the creation of large-scale parallel corpora for MT is a more r...
[7]

The Ancient-Modern Greek Parallel Corpus. We introduce the AG-MG Parallel Corpus, a new sentence-aligned dataset designed for AG→MG machine translation. This section details the data sources, our alignment methodology, and the final corpus characteristics. 3.1. Data Sources. The corpus aggregates parallel texts from three main types of digital resources, a...
[8]

Scraping and Initial Cleaning: We first scraped the public web sources using standard Python libraries (BeautifulSoup and Scrapy), extracting AG texts, MG translations, and available metadata (author, title, translator, etc.) into JSONL format. Basic cleaning removed residual HTML tags and markup.
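The scraping step above (strip markup, keep text and metadata, write JSONL) can be illustrated with a small sketch. To stay dependency-free it swaps in the standard-library `html.parser` for the BeautifulSoup and Scrapy libraries the passage names, so it is not the authors' code; the helper names `strip_html` and `to_jsonl_line` and the record keys `ag` and `mg` are hypothetical.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_jsonl_line(ag_html, mg_html, **metadata):
    """Build one JSONL record; the field names are illustrative assumptions."""
    record = {"ag": strip_html(ag_html), "mg": strip_html(mg_html), **metadata}
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    print(to_jsonl_line("<p>AG <b>text</b></p>", "<p>MG text</p>", author="Homer"))
```

`ensure_ascii=False` keeps polytonic Greek readable in the output file instead of escaping every character.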
[9]

Deep Cleaning and Segmentation: We applied a more thorough cleaning process to remove noise such as page numbers, editorial brackets, translator comments, and inconsistent punctuation. Both AG and MG texts were then segmented into sentences using the Stanza library.
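A rough sketch of the kind of rule-based cleaning described above follows. The specific patterns are assumptions chosen for illustration (lines containing only digits treated as page numbers, square-bracketed spans treated as editorial insertions, spaces before punctuation removed); the paper's actual rules are not given in this excerpt, and sentence segmentation with Stanza is left out.

```python
import re

# Assumed noise patterns; the real pipeline's rules are not specified here.
_PAGE_LINE = re.compile(r"^\s*\d+\s*$", re.MULTILINE)   # a line holding only a number
_BRACKETED = re.compile(r"\[[^\[\]]*\]")                # e.g. [sic], [12]
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:·!?])")
_WHITESPACE = re.compile(r"\s+")


def deep_clean(text):
    """Remove page-number lines and bracketed notes, then tidy spacing."""
    text = _PAGE_LINE.sub(" ", text)
    text = _BRACKETED.sub(" ", text)
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(deep_clean("word [sic] ,\n12\nnext ."))
```

Square brackets sometimes carry meaning in critical editions, so a real pipeline would probably need source-specific rules rather than one global pattern.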
[10]

Fine-Tuned Embedding Alignment: For the non-Bible sources where sentence alignment was needed, we employed VecAlign, following prior work (Craig et al., 2023). Crucially, instead of using off-the-shelf embeddings, we fine-tuned LaBSE (Feng et al., 2020) on 1,000 manually aligned AG-MG sentence pairs, varying in genre and ancient Greek dialect and extr...
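To make the idea of embedding-based sentence alignment concrete, here is a deliberately simplified sketch: a monotonic one-to-one alignment found by dynamic programming over a precomputed similarity matrix. It is not VecAlign, which also handles many-to-many alignments and uses an approximation to run in linear time; the function name `align_monotonic` and the `skip_penalty` value are hypothetical, and the similarity scores would in practice come from sentence embeddings such as LaBSE.

```python
def align_monotonic(sim, skip_penalty=0.3):
    """Align source rows to target columns of `sim` without crossing links.

    sim[i][j] is a similarity score for source sentence i and target
    sentence j. Returns (i, j) pairs in increasing order. A sentence on
    either side may stay unaligned at a cost of `skip_penalty`.
    """
    n = len(sim)
    m = len(sim[0]) if sim else 0
    # best[i][j]: best total score for the first i source and j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((best[i - 1][j - 1] + sim[i - 1][j - 1], "match"))
            if i > 0:
                options.append((best[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((best[i][j - 1] - skip_penalty, "skip_tgt"))
            best[i][j], back[i][j] = max(options)
    # Walk the back-pointers from the end to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    similarity = [[0.9, 0.1, 0.0],
                  [0.1, 0.2, 0.8]]
    print(align_monotonic(similarity))
```

In the toy matrix the middle target sentence has no good partner, so the best path leaves it unaligned instead of forcing a weak match.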
[11]

LLM-Based Refinement: While the fine-tuned VecAlign+LaBSE approach yielded good results, manual inspection revealed residual misalignments, particularly with non-literal translations or sentence splitting/merging. To ensure the highest possible quality of the corpus, we implemented a refinement step using the Gemini 2.5 Flash API (Comanici et al...
[12]

grc”) and Modern Greek (“ell

Deduplication and Multi-Reference Handling: Following Lee et al. (2021), we performed deduplication on all splits based on the MG sentences to remove near-identical translation variants that might skew model training. However, drawing inspiration from multi-reference MT training (Zheng et al., 2018; Khayrallah et al., 2020), when sources provid...
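Deduplication keyed on the Modern Greek side can be sketched as below. This is a minimal version under assumptions: it only catches duplicates that differ in case, Unicode normalization form, or whitespace, whereas the passage speaks of near-identical variants, and it does not implement the multi-reference exception, whose details are cut off above. The names `normalize_mg` and `dedupe_on_target` are hypothetical.

```python
import unicodedata


def normalize_mg(sentence):
    """Case-fold, NFC-normalize, and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFC", sentence).casefold()
    return " ".join(text.split())


def dedupe_on_target(pairs):
    """Keep the first (ag, mg) pair seen for each normalized MG sentence."""
    seen = set()
    kept = []
    for ag, mg in pairs:
        key = normalize_mg(mg)
        if key in seen:
            continue
        seen.add(key)
        kept.append((ag, mg))
    return kept


if __name__ == "__main__":
    toy = [("ag-1", "Good  morning"), ("ag-2", "good morning"), ("ag-3", "Other")]
    print(dedupe_on_target(toy))
```

Running the same function over the concatenation of train, validation, and test data is one simple way to check that no target sentence appears in more than one split.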
[13]

Experimental Setup. We evaluate the effectiveness of our AG-MG Parallel Corpus by fine-tuning several state-of-the-art NMT and LLM models. This section details the dataset splits, models, fine-tuning procedures, and evaluation metrics used. 4.1. Dataset Splits. We split the 132,481 sentence pairs as described in Table 2. The training set comprises 128,2...
[14]

A parameter-efficient fine-tuning (PEFT) method that injects trainable low-rank matrices into the model layers, freezing the original weights. Applied to NLLB-600M and NLLB-1.3B.
• QLoRA (Quantized LoRA): (Dettmers et al.,
[15]

A more memory-efficient PEFT method combining 4-bit quantization with LoRA. Applied to M2M100-1.2B and Llama-Krikri-8B.

4.4. Training Details. Fine-tuning was performed using either Google Colab Pro (L4 GPU) or the CINECA supercomputing infrastructure (using 1 to 4 NVIDIA A100 64GB GPUs) for the larger models and full fine-tuning runs. Key hyperp...
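The LoRA idea described in the two entries above (trainable low-rank matrices added to frozen weights) can be shown with a tiny pure-Python forward pass. It follows the standard LoRA formulation h = W0 x + (alpha / r) B A x; the function names and the toy numbers are illustrative, and the 4-bit quantization that QLoRA adds to the frozen weights is not modeled.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha, rank):
    """Compute h = W0 x + (alpha / rank) * B (A x).

    w0: frozen weight, shape (d_out, d_in); never updated during tuning
    a:  trainable matrix, shape (rank, d_in)
    b:  trainable matrix, shape (d_out, rank); all zeros at initialization
    """
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


if __name__ == "__main__":
    w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
    a = [[1.0, 1.0]]               # rank-1 factor, shape (1, 2)
    b = [[1.0], [2.0]]             # rank-1 factor, shape (2, 1)
    print(lora_forward(w0, a, b, [2.0, 3.0], alpha=2.0, rank=1))
```

Only `a` and `b` would receive gradients, so the number of trained parameters grows with the rank rather than with the full weight size; with `b` at its zero initialization the layer reproduces the base model exactly.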
[16]

Token Discovery: We scanned the entire Ancient Greek training corpus to identify characters that the base tokenizer could not resolve. This process revealed 122 missing Ancient Greek characters/tokens for the M2M100 and 148 for the NLLB models.
[17]

Dictionary Update: These identified characters were explicitly added to the model’s tokenizer, assigning them unique IDs and preventing them from being mapped to <unk>.
[18]

Embedding Resizing: We structurally resized the model’s input and output embedding matrices to accommodate the newly added token IDs.
[19]

Smart Initialization (Weight Transplant): Initializing new embeddings with random noise can significantly slow down convergence, as the model must learn the semantic value of these characters from scratch. Instead, we employed a "smart initialization" strategy. For each new Polytonic character...
(Footnote 16: https://github.com/bitsandbytes-foundation/bitsandbytes)
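The four tokenizer-extension steps above can be mimicked at character level with the standard library. This is a sketch under loud assumptions: real tokenizers work on subword vocabularies, and because the description of the initialization is cut off above, copying the vector of the unaccented base letter is only one plausible reading of "weight transplant", not a confirmed detail of the paper. All function names are hypothetical.

```python
import unicodedata


def find_uncovered_chars(corpus, vocab_chars):
    """Return sorted characters that occur in `corpus` but not in `vocab_chars`."""
    seen = set()
    for sentence in corpus:
        seen.update(ch for ch in sentence if not ch.isspace())
    return sorted(seen - set(vocab_chars))


def base_char(ch):
    """Strip combining marks, e.g. alpha with breathing and accent -> plain alpha."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def init_new_embeddings(embeddings, new_chars):
    """Assumed strategy: copy each new character's vector from its base letter."""
    added = {}
    for ch in new_chars:
        source = embeddings.get(base_char(ch))
        if source is not None:
            added[ch] = list(source)  # copy so later updates do not alias
    return added


if __name__ == "__main__":
    corpus = ["\u03bc\u1fc6\u03bd\u03b9\u03bd \u1f04\u03b5\u03b9\u03b4\u03b5"]
    vocab = set("\u03bc\u03bd\u03b9\u03b5\u03b4\u03b1")  # monotonic letters only
    missing = find_uncovered_chars(corpus, vocab)
    toy_embeddings = {"\u03b1": [0.1, 0.2]}  # vector for plain alpha
    print(missing, init_new_embeddings(toy_embeddings, missing))
```

Characters whose base letter has no existing vector are simply left out of the result, so a caller would still need a fallback initialization for them.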
[20]

Measures the number of edits required to match the reference. Lower is better.
• BERTScore: (Zhang et al., 2019) Computes semantic similarity using contextual embeddings (using xlm-roberta-large). We report F1.
• COMET: (Rei et al., 2020) A neural metric trained to predict human judgments of translation quality (using Unbabel/wmt22-comet-da). Hig...
[21]

Results and Analysis. We present the evaluation results on the Test and Stress sets, comparing the zero-shot (Base) performance of the pre-trained models against their fine-tuned versions across all metrics. 5.1. Results on Test Set. Table 5 summarizes the performance of all models on the main Test set (2,000 pairs). The results clearly demonstrate the ef...
[22]

Conclusion. In this paper, we addressed the critical scarcity of resources for Ancient Greek (AG) to Modern Greek (MG) machine translation, a low-resource task compounded by the significant dialectal, historical, and genre-based diversity of the Ancient Greek source texts. Our primary contribution is the introduction of the AG-MG Parallel Corpus, the lar...
[23]

Limitations. While our work provides a significant new resource and benchmark, several limitations should be acknowledged. First, the corpus composition reflects the available digital sources, primarily literary, philosophical, and biblical texts. This may introduce domain bias, and models trained on it might perform less optimally on other genres. S...
[24]

Ethical Considerations. The data used in this work was compiled from publicly accessible online resources, primarily intended for educational and research purposes. We believe our use aligns with the intended purpose of these digital libraries. The created dataset, AG-MG Parallel Corpus, consists of historical texts and their modern translations. While...
[25]

Acknowledgments. This work was supported in part by a thesis scholarship granted to the first author by the Institute for Language and Speech Processing (ILSP), Athena Research Center. This work was also supported in part by the PHAROS project (Grant Agreement No. 101234269). We acknowledge the EuroHPC Joint Undertaking for awarding this project access t...
[26]

Bibliographical References
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
M...
[27]

A paradigm shift: The future of machine translation lies with large language models. arXiv preprint arXiv:2305.01181.
Chiara Palladino, Farnoosh Shamsian, Tariq Yousef, David J Wright, Anise d’Orange Ferreira, and Michel Ferreira Dos Reis. 2023. Translation alignment for ancient greek: Annotation guidelines and gold standards. Journal of Open Hu- ...
[28]

Low-resource interlinear translation: Morphology-enhanced neural models for ancient greek. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 145–165.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. arXiv preprint arXiv:2009.09025.
Dimitris Roussis, Le...
[29]

Machine learning for ancient languages: A survey. Computational Linguistics, 49(3):703–747.
Brian Thompson and Philipp Koehn. 2019. Vecalign: Improved sentence alignment in linear time and space. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language...
[30]

Language Resource References
Vatri, Alessandro and McGillivray, Barbara. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Brill.

Appendix A. Corpus Creation Pipeline
Figure 1 illustrates the multi-stage hybrid alignment pipeline used to create the AG-MG Parallel Corpus, combining neural embeddings with LLM-based refinement.
Appendix...

[1] [1]

Introduction Machine Translation (MT) has evolved from rule-based and statistical approaches to neural sequence-to-sequence models and is now being reshaped by Large Language Models (LLMs). Re- cent studies suggest that LLMs, pre-trained on ex- tensive multilingual data, can rival or complement traditional encoder–decoder MT , particularly in set- tings w...

work page 2023

[2] [2]

This paper presents a high- quality AG→MG sentence-level parallel corpus, as well as fine-tuning NMT models and a Greek LLM for this task

and NLLB (Costa-Jussà et al., 2022), and (ii) open Greek LLMs, notably Llama–Krikri 4 , which offer strong Greek fluency and tokenization while remaining accessible for adaptation (Roussis et al., 2025; Voukoutis et al.). This paper presents a high- quality AG→MG sentence-level parallel corpus, as well as fine-tuning NMT models and a Greek LLM for this ta...

work page 2022

[3] [3]

We introduce the AG-MG Parallel Cor- pus, the largest sentence-aligned corpus for this low-resource language pair, containing 132,481 high-quality aligned sentence pairs. LASER 4https://huggingface.co/ilsp/ Llama-Krikri-8B-Instruct arXiv:2605.18504v1 [cs.CL] 18 May 2026 The corpus is enriched with extensive meta- data, including author, title, segment ind...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Our method first uses a domain-adapted LaBSE model for initial alignment and then leverages a LLM (Gemini 2.5 Flash developed by Comanici et al

We present a novel hybrid alignment pipeline for creating the corpus. Our method first uses a domain-adapted LaBSE model for initial alignment and then leverages a LLM (Gemini 2.5 Flash developed by Comanici et al. , 2025) for misalignment detection and correction, ensuring superior alignment qual- ity

work page 2025

[5] [5]

We conduct the first large-scale benchmark of both NMT and LLM-based models for AG- to-MG translation. We compare three dif- ferent fine-tuning strategies (Full-parameter, LoRA, and QLoRA) on NLLB 5, M2M1006 and Llama-Krikri-8B, providing a clear picture of the current state-of-the-art on this task. Our results demonstrate that fine-tuned models significa...

work page

[6] [6]

Related Work Our work is positioned at the intersection of classi- cal language resource creation and low-resource machine translation. 2.1. Parallel Corpora and Alignment Although there are monolingual corpora for An- cient Greek, such as the Diorisis corpus ( Vatri and McGillivray, 2018), the creation of large-scale par- allel corpora for MT is a more r...

work page 2018

[7] [7]

This section details the data sources, our alignment methodology, and the final corpus characteristics

The Ancient-Modern Greek Parallel Corpus We introduce the AG-MG Parallel Corpus, a new sentence-aligned dataset designed for AG →MG machine translation. This section details the data sources, our alignment methodology, and the final corpus characteristics. 3.1. Data Sources The corpus aggregates parallel texts from three main types of digital resources, a...

work page

[8] [8]

Basic cleaning removed residual HTML tags and markup

Scraping and Initial Cleaning: We first scraped the public web sources using standard Python libraries (BeautifulSoup 7 and Scrapy8), ex- tracting AG texts, MG translations, and avail- able metadata (author, title, translator, etc.) into JSONL format. Basic cleaning removed residual HTML tags and markup

work page

[9] [9]

Both AG and MG texts were then segmented into sentences using the Stanza library 9

Deep Cleaning and Segmentation: We ap- plied a more thorough cleaning process to remove noise such as page numbers, editorial brackets, translator comments, and inconsistent punctua- tion. Both AG and MG texts were then segmented into sentences using the Stanza library 9

work page

[10] [10]

Fine-Tuned Embedding Alignment: For the non-Bible sources where sentence alignment was needed, we employed VecAlign, following prior work ( Craig et al. , 2023). Crucially, instead of using off-the-shelf embeddings, we fine-tuned LaBSE ( Feng et al. , 2020) on 1,000 manually aligned AG-MG sentence pairs, varying in genre and ancient Greek dialect and extr...

[11]

LLM-Based Refinement: While the fine-tuned VecAlign+LaBSE approach yielded good results, manual inspection revealed residual misalignments, particularly with non-literal translations or sentence splitting/merging. To ensure the highest possible quality of the corpus, we implemented a refinement step using the Gemini 2.5 Flash API (Comanici et al...

[12]

grc”) and Modern Greek (“ell

Deduplication and Multi-Reference Handling: Following Lee et al. (2021), we performed deduplication on all splits based on the MG sentences to remove near-identical translation variants that might skew model training. However, drawing inspiration from multi-reference MT training (Zheng et al., 2018; Khayrallah et al., 2020), when sources provid...
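
Deduplication keyed on the MG side could look roughly like the sketch below. The excerpt does not say how "near-identical" was measured, so exact matching after a light normalisation is an assumption here, and the multi-reference handling, which is cut off in the excerpt, is not modelled.

```python
import unicodedata


def mg_key(sentence: str) -> str:
    """Normalisation key: NFC form, lowercase, punctuation dropped, spaces collapsed."""
    text = unicodedata.normalize("NFC", sentence).lower()
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.split())


def dedup_on_mg(pairs):
    """Keep the first (ag, mg) pair seen for each normalised MG sentence."""
    seen, unique = set(), []
    for ag, mg in pairs:
        key = mg_key(mg)
        if key not in seen:
            seen.add(key)
            unique.append((ag, mg))
    return unique


# Placeholder AG strings; only the MG side drives the deduplication.
pairs = [
    ("ag-1", "Ο λόγος ήταν εκεί."),
    ("ag-2", "ο λόγος  ήταν εκεί"),
    ("ag-3", "Η αρχή είναι το ήμισυ."),
]
unique = dedup_on_mg(pairs)
```

Keying on the normalised MG sentence means two pairs that differ only in casing, spacing, or punctuation on the target side count as one.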

[13]

Experimental Setup

We evaluate the effectiveness of our AG-MG Parallel Corpus by fine-tuning several state-of-the-art NMT and LLM models. This section details the dataset splits, models, fine-tuning procedures, and evaluation metrics used.

4.1. Dataset Splits

We split the 132,481 sentence pairs as described in Table 2. The training set comprises 128,2...

[14]

A parameter-efficient fine-tuning (PEFT) method that injects trainable low-rank matrices into the model layers, freezing the original weights. Applied to NLLB-600M and NLLB-1.3B.
• QLoRA (Quantized LoRA): (Dettmers et al.,

[15]

A more memory-efficient PEFT method combining 4-bit quantization with LoRA. Applied to M2M100-1.2B and Llama-Krikri-8B.

4.4. Training Details

Fine-tuning was performed using either Google Colab Pro (L4 GPU) or the CINECA supercomputing infrastructure (using 1 to 4 NVIDIA A100 64GB GPUs) for the larger models and full fine-tuning runs. Key hyperp...
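
The savings behind LoRA and QLoRA can be seen with simple arithmetic. The sketch below counts the trainable parameters a low-rank adapter adds to one weight matrix and estimates weight memory at 16-bit versus 4-bit precision. The layer size, the rank, and the round figure of 8 billion parameters are assumptions chosen for illustration; the excerpt is cut off before the actual hyperparameters.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA adds B (d_out x rank) and A (rank x d_in) beside a frozen d_out x d_in weight."""
    return rank * (d_out + d_in)


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, ignoring activations, optimizer state,
    and quantization constants."""
    return n_params * bits_per_param / 8 / 2**30


# Hypothetical layer size and rank.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=16)
fraction = lora / full                   # adapter size relative to the frozen layer

fp16_gib = weight_memory_gib(8e9, 16)    # roughly 8B parameters stored in 16-bit
int4_gib = weight_memory_gib(8e9, 4)     # the same weights stored in 4-bit
```

With these assumed numbers the adapter is under one percent of the layer it modifies, and 4-bit storage cuts weight memory to a quarter of the 16-bit figure, which is why the larger models are the natural candidates for QLoRA.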

[16]

Token Discovery: We scanned the entire Ancient Greek training corpus to identify characters that the base tokenizer could not resolve. This process revealed 122 missing Ancient Greek characters/tokens for the M2M100 and 148 for the NLLB models.
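
A corpus scan of this kind can be sketched as follows. The `can_encode` callback is a hypothetical stand-in for whatever check the real tokenizer supports, and the toy vocabulary of monotonic Greek letters is an assumption used only to make the example self-contained.

```python
from collections import Counter


def find_unresolved_chars(corpus, can_encode):
    """Count the non-space characters for which can_encode(char) is False."""
    missing = Counter()
    for sentence in corpus:
        for ch in sentence:
            if not ch.isspace() and not can_encode(ch):
                missing[ch] += 1
    return missing


# Stand-in for a real tokenizer check: a toy vocabulary of monotonic Greek letters.
toy_vocab = set("αβγδεζηθικλμνξοπρστυφχψωςάέήίόύώ")
corpus = ["ἐν ἀρχῇ ἦν ὁ λόγος"]
missing = find_unresolved_chars(corpus, lambda ch: ch in toy_vocab)
```

Keeping counts rather than a plain set makes it easy to see which missing characters are frequent enough to matter most.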

[17]

Dictionary Update: These identified characters were explicitly added to the model’s tokenizer, assigning them unique IDs and preventing them from being mapped to <unk>.

[18]

Embedding Resizing: We structurally resized the model’s input and output embedding matrices to accommodate the newly added token IDs.
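
The dictionary update and the resizing step can be pictured with plain Python data structures. This is a toy model under stated assumptions (contiguous token IDs, a single embedding matrix held as a list of rows, random initialisation of the new rows), not the authors' code. In the Hugging Face Transformers library the analogous operations are `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`; whether the authors used these calls is not stated in the excerpt.

```python
import random


def add_tokens(vocab: dict, new_tokens):
    """Give every unseen token the next free ID (assumes IDs run 0..len(vocab)-1)."""
    added = []
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added.append(tok)
    return added


def resize_embeddings(matrix, new_size, dim, rng):
    """Append randomly initialised rows until the matrix has new_size rows."""
    while len(matrix) < new_size:
        matrix.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return matrix


# Toy vocabulary and a 3 x 4 embedding matrix.
rng = random.Random(0)
vocab = {"<unk>": 0, "α": 1, "η": 2}
embeddings = [[0.0] * 4 for _ in vocab]

added = add_tokens(vocab, ["ἀ", "ῇ", "α"])   # "α" is already known and is skipped
resize_embeddings(embeddings, len(vocab), dim=4, rng=rng)
```

A real model needs the same growth applied to both the input embeddings and the output projection, as the excerpt notes.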

[19]

Smart Initialization (Weight Transplant): Initializing new embeddings with random noise can significantly slow down convergence, as the model must learn the semantic value of these characters from scratch. Instead, we employed a “smart initialization” strategy. For each new Polytonic character...

Footnote 16: https://github.com/bitsandbytes-foundation/bitsandbytes
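
The excerpt is cut off before it says which existing embedding is transplanted into each new polytonic character. Purely as an assumption for illustration, the sketch below uses the unaccented base letter, found by Unicode decomposition, as the donor. The helper names `base_letter` and `transplant` are hypothetical.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip combining marks (breathings, accents, iota subscript) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def transplant(vocab, matrix, new_token):
    """Copy the donor's embedding row into the new token's row, if a donor exists."""
    donor = base_letter(new_token)
    if donor in vocab and donor != new_token:
        matrix[vocab[new_token]] = list(matrix[vocab[donor]])
        return True
    return False  # no donor found: the row keeps its existing initialisation


# Toy setup: "\u1f04" is alpha with psili and oxia, newly added with a placeholder row.
vocab = {"<unk>": 0, "\u03b1": 1, "\u1f04": 2}
matrix = [[0.0, 0.0], [0.5, -0.2], [9.0, 9.0]]
copied = transplant(vocab, matrix, "\u1f04")
```

Starting a new character from a related, already trained embedding gives fine-tuning a sensible starting point instead of noise, which is the rationale the excerpt gives for avoiding random initialisation.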

[20]

Measures the number of edits required to match the reference. Lower is better.
• BERTScore: (Zhang et al., 2019) Computes semantic similarity using contextual embeddings (using xlm-roberta-large). We report F1.
• COMET: (Rei et al., 2020) A neural metric trained to predict human judgments of translation quality (using Unbabel/wmt22-comet-da). Hig...
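
For the edit-based metric in the first item above, whose name is not visible in the excerpt, a minimal sketch is a word-level edit distance normalised by the reference length. Metrics of this family, such as TER, additionally allow block shifts, which this sketch omits; the example sentences are invented.

```python
def edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    if not ref:
        return float(len(hyp) > 0)
    return prev[-1] / len(ref)


# Invented example: three matching words, then one substitution and one insertion.
score = edit_rate("ο λόγος ήταν εκεί", "ο λόγος ήταν στην αρχή")
```

A score of 0 means the hypothesis already matches the reference, and larger values mean more editing is needed, which is why lower is better.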

[21]

Results and Analysis

We present the evaluation results on the Test and Stress sets, comparing the zero-shot (Base) performance of the pre-trained models against their fine-tuned versions across all metrics.

5.1. Results on Test Set

Table 5 summarizes the performance of all models on the main Test set (2,000 pairs). The results clearly demonstrate the ef...

[22]

Our primary contribution is the introduction of the AG-MG Parallel Corpus, the largest sentence-aligned dataset for this pair, containing 132,481 high-quality pairs

Conclusion

In this paper, we addressed the critical scarcity of resources for Ancient Greek (AG) to Modern Greek (MG) machine translation, a low-resource task compounded by the significant dialectal, historical, and genre-based diversity of the Ancient Greek source texts. Our primary contribution is the introduction of the AG-MG Parallel Corpus, the lar...

[23]

Limitations

While our work provides a significant new resource and benchmark, several limitations should be acknowledged. First, the corpus composition reflects the available digital sources, primarily literary, philosophical, and biblical texts. This may introduce domain bias, and models trained on it might perform less optimally on other genres. S...

[24]

Ethical Considerations

The data used in this work was compiled from publicly accessible online resources, primarily intended for educational and research purposes. We believe our use aligns with the intended purpose of these digital libraries. The created dataset, AG-MG Parallel Corpus, consists of historical texts and their modern translations. While...

[25]

Acknowledgments

This work was supported in part by a thesis scholarship granted to the first author by the Institute for Language and Speech Processing (ILSP), Athena Research Center. This work was also supported in part by the PHAROS project (Grant Agreement No. 101234269). We acknowledge the EuroHPC Joint Undertaking for awarding this project access t...

[26]

Bibliographical References

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

M...

[27]

A paradigm shift: The future of machine translation lies with large language models. arXiv preprint arXiv:2305.01181.

Chiara Palladino, Farnoosh Shamsian, Tariq Yousef, David J Wright, Anise d’Orange Ferreira, and Michel Ferreira Dos Reis. 2023. Translation alignment for Ancient Greek: Annotation guidelines and gold standards. Journal of Open Hu- ...

[28]

Low-resource interlinear translation: Morphology-enhanced neural models for Ancient Greek. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 145–165.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.

Dimitris Roussis, Le...

[29]

Machine learning for ancient languages: A survey. Computational Linguistics, 49(3):703–747.

Brian Thompson and Philipp Koehn. 2019. Vecalign: Improved sentence alignment in linear time and space. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language...

[30]

Language Resource References

Vatri, Alessandro and McGillivray, Barbara. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Brill.

Appendix A. Corpus Creation Pipeline

Figure 1 illustrates the multi-stage hybrid alignment pipeline used to create the AG-MG Parallel Corpus, combining neural embeddings with LLM-based refinement.

Appendix...
