Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen

Cameron Pattison; Jacob Wu; James L. Zainaldin; Manuela Marai; Mark J. Schiefsky

arxiv: 2602.24119 · v2 · submitted 2026-02-27 · 💻 cs.CL · cs.AI

Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen

James L. Zainaldin , Cameron Pattison , Manuela Marai , Jacob Wu , Mark J. Schiefsky This is my paper

Pith reviewed 2026-05-15 18:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM translationancient GreekGalenmachine translation evaluationterminology rarityMQM frameworklow-resource languagestechnical prose

0 comments

The pith

Commercial LLMs translate Galen's ancient Greek medical texts well except where rare terminology appears.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three major LLMs on 20 paragraph-length passages from two works by the Greek physician Galen, one previously translated into English and one never translated before. Domain experts scored the outputs using a modified reference-free quality framework and found high average performance on the expository text that fell on the pharmacological text, driven by two passages dense with uncommon terms. Corpus frequency of those terms predicted failure with a correlation of -0.97. Standard automatic metrics aligned only moderately with the human scores and could not separate high-quality translations from one another. The results identify a concrete textual property that limits current machine translation for this low-resource technical language.

Core claim

On the expository Galen text the three LLMs reached a mean MQM score of 95.2 out of 100; on the untranslated pharmacological text the mean fell to 79.9, with two passages producing catastrophic failures while the rest stayed within four points of the expository scores. Terminology rarity measured by corpus frequency was the dominant predictor of those failures at r = -.97. Automated metrics showed only moderate correlation with expert judgment and none of them distinguished among the high-quality translations.

What carries the argument

The modified Multidimensional Quality Metrics framework applied by domain specialists for reference-free scoring, together with corpus frequency as the operational measure of terminology rarity.

If this is right

LLMs can already produce usable English versions of Galen's less specialized expository prose.
Pharmacological passages with high densities of rare terms remain a clear failure mode.
Automatic metrics are not yet trustworthy for ranking or filtering translations that are already close to expert quality.
Frequency-based diagnostics can be used in advance to flag passages likely to need human review or specialized post-editing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frequency-based failure pattern is likely to appear in LLM translations of other ancient technical corpora such as Latin medical or philosophical texts.
Targeted fine-tuning or retrieval-augmented generation on frequency-weighted ancient Greek vocabularies could reduce the observed error rate.
The evaluation approach supplies a practical template for testing LLM performance on additional low-resource historical languages where reference translations do not exist.

Load-bearing premise

That the twenty selected passages and the specialists' application of the modified MQM framework give a reliable picture of how LLMs handle the full range of Galenic technical prose.

What would settle it

A larger set of Galenic passages in which terminology frequency shows no strong correlation with expert translation scores, or in which at least one automatic metric separates high-quality translations as clearly as the human raters do.

Figures

Figures reproduced from arXiv: 2602.24119 by Cameron Pattison, Jacob Wu, James L. Zainaldin, Manuela Marai, Mark J. Schiefsky.

read the original abstract

Purpose: This study evaluates the quality of commercial large language model (LLM) machine translation (MT) for Ancient Greek technical prose and benchmarks standard automated MT evaluation metrics against expert human judgment. Design: We evaluated 60 translations by three LLMs (ChatGPT, Claude, Gemini) of 20 paragraph-length passages from 2 works by the Greek physician Galen (c. 129-216 CE): an expository text with two published English translations and a pharmacological text never before translated. Quality was assessed using seven automated metrics and systematic reference-free human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied by domain specialists. Findings: On the translated expository text, LLMs achieved high quality (mean MQM score 95.2/100). On the untranslated pharmacological text, quality was lower (79.9/100) but bimodally distributed: two passages with extreme terminological density produced catastrophic failures, while remaining passages scored within 4 points of the expository text. Terminology rarity, operationalized via corpus frequency, emerged as the dominant predictor of failure (r = -.97). Automated metrics showed moderate correlation with human judgment only on texts with wide quality variance; no metric discriminated among high-quality translations. Originality: This is the first systematic, reference-free expert human evaluation of LLM translation for any ancient language and the first study identifying textual properties predictive of translation failure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first expert human benchmark on LLM translation of ancient technical Greek and flags rare terminology as the main failure driver, though the r=-.97 correlation rests on a small bimodal sample that may be outlier-driven.

read the letter

This paper's main contribution is a reference-free expert evaluation of LLM translations of Galen's technical Greek, showing solid performance on standard expository passages but clear drops with dense pharmacological terminology. They also tie failure rates strongly to how rare the terms are in a corpus. They did a few things right. Three commercial LLMs translated 20 passages, scored by domain experts using a modified MQM setup. The expository text averaged 95 out of 100, which is encouraging. The pharma text averaged lower at 80 but was mostly close to that except for two bad cases. They compared this to automated metrics and found those only work when quality varies a lot. The identification of terminology rarity as a predictor is new for this domain and could help people pick better passages or preprocess texts. The main weakness is the small scale and lack of robustness on that correlation. Twenty passages total, with the key result driven by a bimodal pattern and just two outliers, means the r of -0.97 might not generalize. Without sensitivity checks or more data, it's hard to say rarity dominates across Galenic writing rather than just in those extremes. The methods section probably has more details, but the abstract leaves some gaps on how the human judgments were standardized. For readers in classics, digital humanities, or anyone doing MT on low-resource historical languages, this is worth a look as a starting point. It shows LLMs can help but need care with technical density. It deserves peer review because the empirical setup is straightforward and the practical angle is clear, even if revisions would strengthen the statistical claims. I'd cite the evaluation framework and the basic quality numbers in related work.

Referee Report

2 major / 1 minor

Summary. The paper evaluates LLM (ChatGPT, Claude, Gemini) translations of 20 paragraph-length passages from two Galenic works using seven automated metrics and reference-free human evaluation via a modified MQM framework by domain specialists. It reports high quality on the expository text (mean MQM 95.2/100), lower but bimodal quality on the untranslated pharmacological text (79.9/100) with two catastrophic failures tied to extreme terminological density, a dominant negative correlation (r = -0.97) between corpus frequency of terms and translation failure, and moderate correlation of automated metrics with human judgment only on high-variance texts. It positions itself as the first systematic expert human evaluation of LLM translation for any ancient language and the first to identify textual predictors of failure.

Significance. If the empirical findings hold, the work is significant for establishing baseline performance of current LLMs on low-resource ancient technical Greek, demonstrating that specialist human evaluation via adapted MQM can serve as a reliable reference-free benchmark, and providing the first evidence that terminological rarity (via corpus frequency) strongly predicts translation failure in this domain. The use of domain experts and the identification of a concrete textual property as a failure predictor offer actionable insights for MT of historical languages.

major comments (2)

[Findings] Findings section (correlation analysis): The headline claim that terminology rarity is the dominant predictor of failure (r = -.97) rests on a sample of 20 passages exhibiting a bimodal distribution on the pharmacological text, where only two extreme-density passages produced catastrophic failures while the remaining 18 scored within 4 points of the expository mean (95.2/100). No sensitivity analyses, outlier diagnostics, rank correlations (e.g., Spearman), or leave-two-out checks are reported, so it is unclear whether the Pearson coefficient reflects a general relationship across Galenic technical prose or is an artifact of the two outliers; this directly undermines the predictive claim.
[Abstract and Methods] Abstract and Methods (sample description): The study relies on only 20 selected passages (with 18 performing similarly), yet provides limited detail on selection criteria or how they represent the full range of Galenic technical prose; combined with the absence of robustness checks on the correlation, this limits the strength of the generalizability conclusion.

minor comments (1)

[Abstract] The abstract states that automated metrics showed moderate correlation with human judgment only on texts with wide quality variance, but the specific metric values and exact correlation coefficients are not summarized in the abstract or early sections, reducing immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment below and have revised the paper to incorporate additional analyses and clarifications.

read point-by-point responses

Referee: [Findings] Findings section (correlation analysis): The headline claim that terminology rarity is the dominant predictor of failure (r = -.97) rests on a sample of 20 passages exhibiting a bimodal distribution on the pharmacological text, where only two extreme-density passages produced catastrophic failures while the remaining 18 scored within 4 points of the expository mean (95.2/100). No sensitivity analyses, outlier diagnostics, rank correlations (e.g., Spearman), or leave-two-out checks are reported, so it is unclear whether the Pearson coefficient reflects a general relationship across Galenic technical prose or is an artifact of the two outliers; this directly undermines the predictive claim.

Authors: We acknowledge that the Pearson r = -0.97 is strongly influenced by the two passages with extreme terminological density. These were deliberately selected as the upper end of rarity in the pharmacological text to test the limits of LLM performance. In the revised manuscript we now include Spearman rank correlation (r_s = -0.89), Cook's distance outlier diagnostics, and a leave-two-out sensitivity check. Removing the two extreme passages yields r = -0.62, confirming a still-negative but weaker relationship; we have added a scatterplot highlighting these points and revised the text to present the correlation as evidence that rarity is a strong but not sole predictor within this sample. revision: yes
Referee: [Abstract and Methods] Abstract and Methods (sample description): The study relies on only 20 selected passages (with 18 performing similarly), yet provides limited detail on selection criteria or how they represent the full range of Galenic technical prose; combined with the absence of robustness checks on the correlation, this limits the strength of the generalizability conclusion.

Authors: We have expanded the Methods section with explicit selection criteria: passages were drawn from standard critical editions to span low-to-high terminological rarity (via TLG corpus frequency) and syntactic complexity in both works, with the pharmacological set intentionally including the two highest-density examples. The revised Abstract and Discussion now state that results are based on this targeted sample of 20 passages and note the need for larger-scale follow-up studies. Robustness checks for the correlation are added as described above. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation with direct statistical reporting

full rationale

The paper reports an observational study: 20 passages were selected, translated by three LLMs, scored by domain experts using a modified MQM framework, and correlated with an independently measured property (terminology rarity via corpus frequency). The reported r = -.97 is a direct Pearson computation on the observed scores and frequencies; it is not obtained by fitting a model to a subset and then predicting a related quantity by construction, nor does any equation or claim reduce to a self-definition. No self-citations are invoked to justify uniqueness theorems or ansatzes, and no derivations exist that could be circular. The study is self-contained against its own human judgments and corpus counts, satisfying the criteria for a non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the validity of expert human judgment via the modified MQM framework and the representativeness of the chosen passages and corpus frequency measure. No free parameters or invented entities are introduced.

axioms (1)

domain assumption Expert application of the modified Multidimensional Quality Metrics (MQM) framework yields valid quality scores for LLM translations of ancient Greek without reference texts.
This underpins all human evaluation results and the comparison to automated metrics.

pith-pipeline@v0.9.0 · 5575 in / 1256 out tokens · 78997 ms · 2026-05-15T18:46:04.288379+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

translate

for Mix. and C. G. Kühn for Zainaldin et al. 2026 6 Comp. (Kühn 1827). Kühn’s text is of inferior quality and contains errors, but is generally readable and has not been replaced (Nutton 2002). A further motivation in selecting these works is that Mix. has two recent English translations (Singer and van der Eijk 2018; Johnston 2020), while Comp. has never...

work page 2026
[2]

The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy

Methodology To evaluate LLM-based MT performance, we implemented both automated (reference-based) and expert human (reference-free) frameworks. The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy. We therefore undertook domain-expert examination of all 60 MT translations via modified MQM evaluation, which allowed Zainal...

work page 2026
[3]

2021a; Freitag et al

that is the official evaluation strategy for the Workshop on Machine Translation (Freitag et al. 2021a; Freitag et al. Zainaldin et al. 2026 8 2021b). Previous studies evaluating MT for LRAL have relied on custom metrics (Volk et al

work page 2026
[4]

2023); none has applied the systematic analytical framework of MQM, in part because of its labor intensity

or qualitative spot-checks (Gutherz et al. 2023); none has applied the systematic analytical framework of MQM, in part because of its labor intensity. Nonetheless, MQM allows for fine-grained evaluation of translation quality utilizing domain-specific expertise. It does not presuppose the existence of authoritative reference translations, mitigating a cen...

work page 2023
[5]

hallucination

but did not operationalize a separate category because it is impracticably vague: translating AG often requires interpretative rendering even when accurate, leaving no pragmatic test distinguishing “hallucination” from other sorts of misconstrual, and fluent hallucinations always resolve as specific terminological or propositional errors. Each MQM error w...

work page 2024
[6]

ChatGPT 31.4 (± 6.1) 53.4 (± 3.7) 46.4 (± 5.4) 50.9 (± 5.3) 91.0 (± 1.2) 79.9 (± 1.9) 49.8 (± 3.4) Zainaldin et al

Aggregate Automated MT Evaluation Scores Text Model BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT Mix. ChatGPT 31.4 (± 6.1) 53.4 (± 3.7) 46.4 (± 5.4) 50.9 (± 5.3) 91.0 (± 1.2) 79.9 (± 1.9) 49.8 (± 3.4) Zainaldin et al. 2026 10 Claude 34.2 (± 6.2) 55.4 (± 3.4) 48.5 (± 3.4) 55.3 (± 5.6) 91.6 (± 0.9) 79.8 (± 2.1) 50.4 (± 3.9) Gemini 34.2 (± 5.0) 57.0 (...

work page 2026
[7]

In general, performance on Mix

with ChatGPT in a distant third (differences in aggregate from Gemini of 4.8 and 8.5). In general, performance on Mix. was high, with mean aggregate TQS of 95.2 and 5 total Zainaldin et al. 2026 11 Critical errors across all 30 passages. Performance on Comp. was lower, with mean aggregate TQS of 79.9 and 57 total Critical errors across all 30 passages. Wh...

work page 2026
[8]

Zainaldin et al

yielded scores more comparable to Mix., ranging from 82.8 to 100.0 with no translations below 70, although passages 2 and 5 also posed difficulties (mean TQS 86.7 and 89.5, respectively). Zainaldin et al. 2026 12 Table 3 summarizes TQS by text and passage type for Comp. under several stratification schemes, taking into account passage-specific variability. Table

work page 2026
[9]

Gap on Automated Metrics: Stratified Analysis Metric All passages Excl

Mix.–Comp. Gap on Automated Metrics: Stratified Analysis Metric All passages Excl. 8, 10 Excl. 3, 8, 10 BLEU-4 −48.6% −47.4% −47.4% chrF++ −10.8% −9.5% −9.4% METEOR −12.0% −9.7% −8.7% ROUGE-L −12.8% −12.9% −13.8% BERTScore −2.0% −1.3% −1.4% COMET −4.8% −3.9% −3.4% BLEURT −11.1% −11.1% −11.5% Note: Values represent relative performance drop from Mix. to Co...

work page 2026
[10]

N = 60 translations

Correlations between Automated Metrics and MQM TQS Metric Pearson r [95% CI] p-value Spearman ρ [95% CI] p-value BERTScore +0.75*** [.62, .85] < .001 +0.43*** [.20, .62] < .001 COMET +0.60*** [.41, .74] < .001 +0.51*** [.30, .68] < .001 METEOR +0.53*** [.32, .69] < .001 +0.26* [.01, .48] .044 chrF++ +0.53*** [.32, .69] < .001 +0.38** [.13, .57] .003 BLEU-...

work page 2026
[11]

2026 15 Note: Pass Rate = High Pass + Low Pass

85.7% 61.9% 11.0% 24.8% Zainaldin et al. 2026 15 Note: Pass Rate = High Pass + Low Pass. Gap = Mix. pass rate minus Comp. pass rate under each scheme. Stratification excludes specified Comp. passages; Mix. includes all passages in all rows. The quality rating analysis reveals a dimension of performance not fully captured by mean TQS. While section 4.2 dem...

work page 2026
[12]

MQM Error Typology Distribution by Text Mix. Comp. Comp./Mix. Ratio Total errors 170 265 1.6× Errors per passage 5.7 8.8 1.6× Severity Neutral 29 (17.1%) 45 (17.0%) 1.0× Minor 103 (60.6%) 105 (39.6%) 0.7× Major 33 (19.4%) 58 (21.9%) 1.1× Critical 5 (2.9%) 57 (21.5%) 7.4× Error Type Terminology 74 (43.5%) 198 (74.7%) 1.7× Accuracy 96 (56.5%) 67 (25.3%) 0.4...

work page 2026
[13]

out of the box

Discussion 5.1 Translation Quality: Previously Translated vs. Untranslated Texts The central question that we posed at the outset of this study—how well do general-purpose LLMs “out of the box” translate ancient technical prose?—requires a nuanced answer based on our findings. On the previously translated text (Mix.), LLM performance was strong: mean aggr...

work page 2026
[14]

memorized

produced catastrophic failures across all models, while the remaining expository passages achieved quality within 2–4 points of Mix. It remains to consider what explains the gap between performance on the translated and untranslated texts. 5.1.1 The Memorization Hypothesis One plausible explanation for the Mix.–Comp. gap is that LLMs have “memorized” huma...

work page 2023
[15]

2026 19 Note: Pearson r values

Correlations between Terminology Rarity Metrics and Translation Quality Predictor TQS (Comp.) TQS (All) Critical (Comp.) Critical (All) Rare term ratio −.97*** −.93*** +.92*** +.90*** Not-found ratio −.96*** −.94*** +.91*** +.91*** Average Zipf frequency +.96*** +.95*** −.91*** −.92*** Zainaldin et al. 2026 19 Note: Pearson r values. Comp. = Comp. passage...

work page 2026
[16]

in-distribution

as well as with critical error count (r = +.92, p < .001). A simple linear regression showed that rare term ratio explained 93.9% of variance in passage-level TQS (R² = .94; bootstrap 95% CI: .42–.99). However, Cook’s distance analysis identified passages 8 and 10 as highly influential observations (D = 1.50 and 2.03, respectively, against a 4/n threshold...

work page 2026
[17]

(n=30) Comp

Correlations between Automated Metrics and MQM TQS by Text Metric Mix. (n=30) Comp. (n=30) Combined (n=60) BERTScore −0.10 +0.85*** +0.75*** COMET −0.07 +0.62*** +0.60*** METEOR +0.02 +0.55** +0.53*** chrF++ −0.00 +0.55** +0.53*** BLEU-4 −0.08 +0.42* +0.45*** ROUGE-L +0.13 +0.24 +0.34** BLEURT +0.00 +0.18 +0.32* Note: Pearson r values. * p < .05, ** p < ....

work page 2026
[18]

and other passages, but COMET compressed this into a 4.8% dip (77.3% vs. 72.5%). A practitioner relying on COMET scores would be seriously misled on the most terminologically difficult passages, including: • Comp. 8 ChatGPT: TQS = 0.0, COMET = 73.3% • Comp. 10 Gemini: TQS = 25.2, COMET = 75.6% • Comp. 8 Claude: TQS = 21.8, COMET = 74.0% Conversely, BLEU s...

work page 2026
[19]

best guess

for non-catastrophic passages, with quality approaching expert level for common vocabulary (in our dataset, rare term ratio <0.20). • LLMs are reliably useful for syntactic parsing and initial orientation to unfamiliar texts; general sense and argument structure are typically preserved even when individual terms are mistranslated (as with the scattered cr...

work page 2026
[20]

We evaluated intra-passage accuracy but not cross-passage terminological consistency

Cross-passage consistency. We evaluated intra-passage accuracy but not cross-passage terminological consistency. Utility of LLM-based MT could degrade without controlling for consistency across multiple inputs. Zainaldin et al. 2026 24

work page 2026
[21]

References Akavarapu, V. S. D. S. Mahesh, et al. (2025) ‘A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs’, Findings of the Association for Computational Linguistics: ACL 2025 2745–61, https://doi.org/10.18653/v1/2025.findings-acl.141. Amrhein, Chantal, and Rico Sennrich (2022) ‘Identifying Weaknesses in Machine Trans...

work page doi:10.18653/v1/2025.findings-acl.141 2025
[22]

1 Claude 96.7 45.7% 92.9% 82.6% HP 1 Gemini 99.3 41.1% 93.4% 83.1% HP 1 ChatGPT 97.0 39.5% 92.7% 82.4% HP 2 Claude 92.0 36.1% 91.9% 83.0% LP Zainaldin et al

Mix. 1 Claude 96.7 45.7% 92.9% 82.6% HP 1 Gemini 99.3 41.1% 93.4% 83.1% HP 1 ChatGPT 97.0 39.5% 92.7% 82.4% HP 2 Claude 92.0 36.1% 91.9% 83.0% LP Zainaldin et al. 2026 2 2 Gemini 99.7 32.5% 91.0% 83.2% HP 2 ChatGPT 77.8 34.4% 91.0% 82.8% F 3 Claude 100.0 24.4% 90.8% 79.2% HP 3 Gemini 100.0 30.1% 90.3% 81.2% HP 3 ChatGPT 97.5 24.2% 90.3% 80.7% HP 4 Claude ...

work page 2026

[1] [1]

translate

for Mix. and C. G. Kühn for Zainaldin et al. 2026 6 Comp. (Kühn 1827). Kühn’s text is of inferior quality and contains errors, but is generally readable and has not been replaced (Nutton 2002). A further motivation in selecting these works is that Mix. has two recent English translations (Singer and van der Eijk 2018; Johnston 2020), while Comp. has never...

work page 2026

[2] [2]

The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy

Methodology To evaluate LLM-based MT performance, we implemented both automated (reference-based) and expert human (reference-free) frameworks. The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy. We therefore undertook domain-expert examination of all 60 MT translations via modified MQM evaluation, which allowed Zainal...

work page 2026

[3] [3]

2021a; Freitag et al

that is the official evaluation strategy for the Workshop on Machine Translation (Freitag et al. 2021a; Freitag et al. Zainaldin et al. 2026 8 2021b). Previous studies evaluating MT for LRAL have relied on custom metrics (Volk et al

work page 2026

[4] [4]

2023); none has applied the systematic analytical framework of MQM, in part because of its labor intensity

or qualitative spot-checks (Gutherz et al. 2023); none has applied the systematic analytical framework of MQM, in part because of its labor intensity. Nonetheless, MQM allows for fine-grained evaluation of translation quality utilizing domain-specific expertise. It does not presuppose the existence of authoritative reference translations, mitigating a cen...

work page 2023

[5] [5]

hallucination

but did not operationalize a separate category because it is impracticably vague: translating AG often requires interpretative rendering even when accurate, leaving no pragmatic test distinguishing “hallucination” from other sorts of misconstrual, and fluent hallucinations always resolve as specific terminological or propositional errors. Each MQM error w...

work page 2024

[6] [6]

ChatGPT 31.4 (± 6.1) 53.4 (± 3.7) 46.4 (± 5.4) 50.9 (± 5.3) 91.0 (± 1.2) 79.9 (± 1.9) 49.8 (± 3.4) Zainaldin et al

Aggregate Automated MT Evaluation Scores Text Model BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT Mix. ChatGPT 31.4 (± 6.1) 53.4 (± 3.7) 46.4 (± 5.4) 50.9 (± 5.3) 91.0 (± 1.2) 79.9 (± 1.9) 49.8 (± 3.4) Zainaldin et al. 2026 10 Claude 34.2 (± 6.2) 55.4 (± 3.4) 48.5 (± 3.4) 55.3 (± 5.6) 91.6 (± 0.9) 79.8 (± 2.1) 50.4 (± 3.9) Gemini 34.2 (± 5.0) 57.0 (...

work page 2026

[7] [7]

In general, performance on Mix

with ChatGPT in a distant third (differences in aggregate from Gemini of 4.8 and 8.5). In general, performance on Mix. was high, with mean aggregate TQS of 95.2 and 5 total Zainaldin et al. 2026 11 Critical errors across all 30 passages. Performance on Comp. was lower, with mean aggregate TQS of 79.9 and 57 total Critical errors across all 30 passages. Wh...

work page 2026

[8] [8]

Zainaldin et al

yielded scores more comparable to Mix., ranging from 82.8 to 100.0 with no translations below 70, although passages 2 and 5 also posed difficulties (mean TQS 86.7 and 89.5, respectively). Zainaldin et al. 2026 12 Table 3 summarizes TQS by text and passage type for Comp. under several stratification schemes, taking into account passage-specific variability. Table

work page 2026

[9] [9]

Gap on Automated Metrics: Stratified Analysis Metric All passages Excl

Mix.–Comp. Gap on Automated Metrics: Stratified Analysis Metric All passages Excl. 8, 10 Excl. 3, 8, 10 BLEU-4 −48.6% −47.4% −47.4% chrF++ −10.8% −9.5% −9.4% METEOR −12.0% −9.7% −8.7% ROUGE-L −12.8% −12.9% −13.8% BERTScore −2.0% −1.3% −1.4% COMET −4.8% −3.9% −3.4% BLEURT −11.1% −11.1% −11.5% Note: Values represent relative performance drop from Mix. to Co...

work page 2026

[10] [10]

N = 60 translations

Correlations between Automated Metrics and MQM TQS Metric Pearson r [95% CI] p-value Spearman ρ [95% CI] p-value BERTScore +0.75*** [.62, .85] < .001 +0.43*** [.20, .62] < .001 COMET +0.60*** [.41, .74] < .001 +0.51*** [.30, .68] < .001 METEOR +0.53*** [.32, .69] < .001 +0.26* [.01, .48] .044 chrF++ +0.53*** [.32, .69] < .001 +0.38** [.13, .57] .003 BLEU-...

work page 2026

[11] [11]

2026 15 Note: Pass Rate = High Pass + Low Pass

85.7% 61.9% 11.0% 24.8% Zainaldin et al. 2026 15 Note: Pass Rate = High Pass + Low Pass. Gap = Mix. pass rate minus Comp. pass rate under each scheme. Stratification excludes specified Comp. passages; Mix. includes all passages in all rows. The quality rating analysis reveals a dimension of performance not fully captured by mean TQS. While section 4.2 dem...

work page 2026

[12] [12]

MQM Error Typology Distribution by Text Mix. Comp. Comp./Mix. Ratio Total errors 170 265 1.6× Errors per passage 5.7 8.8 1.6× Severity Neutral 29 (17.1%) 45 (17.0%) 1.0× Minor 103 (60.6%) 105 (39.6%) 0.7× Major 33 (19.4%) 58 (21.9%) 1.1× Critical 5 (2.9%) 57 (21.5%) 7.4× Error Type Terminology 74 (43.5%) 198 (74.7%) 1.7× Accuracy 96 (56.5%) 67 (25.3%) 0.4...

work page 2026

[13] [13]

out of the box

Discussion 5.1 Translation Quality: Previously Translated vs. Untranslated Texts The central question that we posed at the outset of this study—how well do general-purpose LLMs “out of the box” translate ancient technical prose?—requires a nuanced answer based on our findings. On the previously translated text (Mix.), LLM performance was strong: mean aggr...

work page 2026

[14] [14]

memorized

produced catastrophic failures across all models, while the remaining expository passages achieved quality within 2–4 points of Mix. It remains to consider what explains the gap between performance on the translated and untranslated texts. 5.1.1 The Memorization Hypothesis One plausible explanation for the Mix.–Comp. gap is that LLMs have “memorized” huma...

work page 2023

[15] [15]

2026 19 Note: Pearson r values

Correlations between Terminology Rarity Metrics and Translation Quality Predictor TQS (Comp.) TQS (All) Critical (Comp.) Critical (All) Rare term ratio −.97*** −.93*** +.92*** +.90*** Not-found ratio −.96*** −.94*** +.91*** +.91*** Average Zipf frequency +.96*** +.95*** −.91*** −.92*** Zainaldin et al. 2026 19 Note: Pearson r values. Comp. = Comp. passage...

work page 2026

[16] [16]

in-distribution

as well as with critical error count (r = +.92, p < .001). A simple linear regression showed that rare term ratio explained 93.9% of variance in passage-level TQS (R² = .94; bootstrap 95% CI: .42–.99). However, Cook’s distance analysis identified passages 8 and 10 as highly influential observations (D = 1.50 and 2.03, respectively, against a 4/n threshold...

work page 2026

[17] [17]

(n=30) Comp

Correlations between Automated Metrics and MQM TQS by Text Metric Mix. (n=30) Comp. (n=30) Combined (n=60) BERTScore −0.10 +0.85*** +0.75*** COMET −0.07 +0.62*** +0.60*** METEOR +0.02 +0.55** +0.53*** chrF++ −0.00 +0.55** +0.53*** BLEU-4 −0.08 +0.42* +0.45*** ROUGE-L +0.13 +0.24 +0.34** BLEURT +0.00 +0.18 +0.32* Note: Pearson r values. * p < .05, ** p < ....

work page 2026

[18] [18]

and other passages, but COMET compressed this into a 4.8% dip (77.3% vs. 72.5%). A practitioner relying on COMET scores would be seriously misled on the most terminologically difficult passages, including: • Comp. 8 ChatGPT: TQS = 0.0, COMET = 73.3% • Comp. 10 Gemini: TQS = 25.2, COMET = 75.6% • Comp. 8 Claude: TQS = 21.8, COMET = 74.0% Conversely, BLEU s...

work page 2026

[19] [19]

best guess

for non-catastrophic passages, with quality approaching expert level for common vocabulary (in our dataset, rare term ratio <0.20). • LLMs are reliably useful for syntactic parsing and initial orientation to unfamiliar texts; general sense and argument structure are typically preserved even when individual terms are mistranslated (as with the scattered cr...

work page 2026

[20] [20]

We evaluated intra-passage accuracy but not cross-passage terminological consistency

Cross-passage consistency. We evaluated intra-passage accuracy but not cross-passage terminological consistency. Utility of LLM-based MT could degrade without controlling for consistency across multiple inputs. Zainaldin et al. 2026 24

work page 2026

[21] [21]

References Akavarapu, V. S. D. S. Mahesh, et al. (2025) ‘A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs’, Findings of the Association for Computational Linguistics: ACL 2025 2745–61, https://doi.org/10.18653/v1/2025.findings-acl.141. Amrhein, Chantal, and Rico Sennrich (2022) ‘Identifying Weaknesses in Machine Trans...

work page doi:10.18653/v1/2025.findings-acl.141 2025

[22] [22]

1 Claude 96.7 45.7% 92.9% 82.6% HP 1 Gemini 99.3 41.1% 93.4% 83.1% HP 1 ChatGPT 97.0 39.5% 92.7% 82.4% HP 2 Claude 92.0 36.1% 91.9% 83.0% LP Zainaldin et al

Mix. 1 Claude 96.7 45.7% 92.9% 82.6% HP 1 Gemini 99.3 41.1% 93.4% 83.1% HP 1 ChatGPT 97.0 39.5% 92.7% 82.4% HP 2 Claude 92.0 36.1% 91.9% 83.0% LP Zainaldin et al. 2026 2 2 Gemini 99.7 32.5% 91.0% 83.2% HP 2 ChatGPT 77.8 34.4% 91.0% 82.8% F 3 Claude 100.0 24.4% 90.8% 79.2% HP 3 Gemini 100.0 30.1% 90.3% 81.2% HP 3 ChatGPT 97.5 24.2% 90.3% 80.7% HP 4 Claude ...

work page 2026