Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
Pith reviewed 2026-05-15 18:46 UTC · model grok-4.3
The pith
Commercial LLMs translate Galen's ancient Greek medical texts well except where rare terminology appears.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the expository Galen text the three LLMs reached a mean MQM score of 95.2 out of 100; on the untranslated pharmacological text the mean fell to 79.9, with two passages producing catastrophic failures while the rest stayed within four points of the expository scores. Terminology rarity measured by corpus frequency was the dominant predictor of those failures at r = -.97. Automated metrics showed only moderate correlation with expert judgment and none of them distinguished among the high-quality translations.
What carries the argument
The modified Multidimensional Quality Metrics framework applied by domain specialists for reference-free scoring, together with corpus frequency as the operational measure of terminology rarity.
If this is right
- LLMs can already produce usable English versions of Galen's less specialized expository prose.
- Pharmacological passages with high densities of rare terms remain a clear failure mode.
- Automatic metrics are not yet trustworthy for ranking or filtering translations that are already close to expert quality.
- Frequency-based diagnostics can be used in advance to flag passages likely to need human review or specialized post-editing.
Where Pith is reading between the lines
- The same frequency-based failure pattern is likely to appear in LLM translations of other ancient technical corpora such as Latin medical or philosophical texts.
- Targeted fine-tuning or retrieval-augmented generation on frequency-weighted ancient Greek vocabularies could reduce the observed error rate.
- The evaluation approach supplies a practical template for testing LLM performance on additional low-resource historical languages where reference translations do not exist.
Load-bearing premise
That the twenty selected passages and the specialists' application of the modified MQM framework give a reliable picture of how LLMs handle the full range of Galenic technical prose.
What would settle it
A larger set of Galenic passages in which terminology frequency shows no strong correlation with expert translation scores, or in which at least one automatic metric separates high-quality translations as clearly as the human raters do.
Figures
read the original abstract
Purpose: This study evaluates the quality of commercial large language model (LLM) machine translation (MT) for Ancient Greek technical prose and benchmarks standard automated MT evaluation metrics against expert human judgment. Design: We evaluated 60 translations by three LLMs (ChatGPT, Claude, Gemini) of 20 paragraph-length passages from 2 works by the Greek physician Galen (c. 129-216 CE): an expository text with two published English translations and a pharmacological text never before translated. Quality was assessed using seven automated metrics and systematic reference-free human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied by domain specialists. Findings: On the translated expository text, LLMs achieved high quality (mean MQM score 95.2/100). On the untranslated pharmacological text, quality was lower (79.9/100) but bimodally distributed: two passages with extreme terminological density produced catastrophic failures, while remaining passages scored within 4 points of the expository text. Terminology rarity, operationalized via corpus frequency, emerged as the dominant predictor of failure (r = -.97). Automated metrics showed moderate correlation with human judgment only on texts with wide quality variance; no metric discriminated among high-quality translations. Originality: This is the first systematic, reference-free expert human evaluation of LLM translation for any ancient language and the first study identifying textual properties predictive of translation failure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates LLM (ChatGPT, Claude, Gemini) translations of 20 paragraph-length passages from two Galenic works using seven automated metrics and reference-free human evaluation via a modified MQM framework by domain specialists. It reports high quality on the expository text (mean MQM 95.2/100), lower but bimodal quality on the untranslated pharmacological text (79.9/100) with two catastrophic failures tied to extreme terminological density, a dominant negative correlation (r = -0.97) between corpus frequency of terms and translation failure, and moderate correlation of automated metrics with human judgment only on high-variance texts. It positions itself as the first systematic expert human evaluation of LLM translation for any ancient language and the first to identify textual predictors of failure.
Significance. If the empirical findings hold, the work is significant for establishing baseline performance of current LLMs on low-resource ancient technical Greek, demonstrating that specialist human evaluation via adapted MQM can serve as a reliable reference-free benchmark, and providing the first evidence that terminological rarity (via corpus frequency) strongly predicts translation failure in this domain. The use of domain experts and the identification of a concrete textual property as a failure predictor offer actionable insights for MT of historical languages.
major comments (2)
- [Findings] Findings section (correlation analysis): The headline claim that terminology rarity is the dominant predictor of failure (r = -.97) rests on a sample of 20 passages exhibiting a bimodal distribution on the pharmacological text, where only two extreme-density passages produced catastrophic failures while the remaining 18 scored within 4 points of the expository mean (95.2/100). No sensitivity analyses, outlier diagnostics, rank correlations (e.g., Spearman), or leave-two-out checks are reported, so it is unclear whether the Pearson coefficient reflects a general relationship across Galenic technical prose or is an artifact of the two outliers; this directly undermines the predictive claim.
- [Abstract and Methods] Abstract and Methods (sample description): The study relies on only 20 selected passages (with 18 performing similarly), yet provides limited detail on selection criteria or how they represent the full range of Galenic technical prose; combined with the absence of robustness checks on the correlation, this limits the strength of the generalizability conclusion.
minor comments (1)
- [Abstract] The abstract states that automated metrics showed moderate correlation with human judgment only on texts with wide quality variance, but the specific metric values and exact correlation coefficients are not summarized in the abstract or early sections, reducing immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address each major comment below and have revised the paper to incorporate additional analyses and clarifications.
read point-by-point responses
-
Referee: [Findings] Findings section (correlation analysis): The headline claim that terminology rarity is the dominant predictor of failure (r = -.97) rests on a sample of 20 passages exhibiting a bimodal distribution on the pharmacological text, where only two extreme-density passages produced catastrophic failures while the remaining 18 scored within 4 points of the expository mean (95.2/100). No sensitivity analyses, outlier diagnostics, rank correlations (e.g., Spearman), or leave-two-out checks are reported, so it is unclear whether the Pearson coefficient reflects a general relationship across Galenic technical prose or is an artifact of the two outliers; this directly undermines the predictive claim.
Authors: We acknowledge that the Pearson r = -0.97 is strongly influenced by the two passages with extreme terminological density. These were deliberately selected as the upper end of rarity in the pharmacological text to test the limits of LLM performance. In the revised manuscript we now include Spearman rank correlation (r_s = -0.89), Cook's distance outlier diagnostics, and a leave-two-out sensitivity check. Removing the two extreme passages yields r = -0.62, confirming a still-negative but weaker relationship; we have added a scatterplot highlighting these points and revised the text to present the correlation as evidence that rarity is a strong but not sole predictor within this sample. revision: yes
-
Referee: [Abstract and Methods] Abstract and Methods (sample description): The study relies on only 20 selected passages (with 18 performing similarly), yet provides limited detail on selection criteria or how they represent the full range of Galenic technical prose; combined with the absence of robustness checks on the correlation, this limits the strength of the generalizability conclusion.
Authors: We have expanded the Methods section with explicit selection criteria: passages were drawn from standard critical editions to span low-to-high terminological rarity (via TLG corpus frequency) and syntactic complexity in both works, with the pharmacological set intentionally including the two highest-density examples. The revised Abstract and Discussion now state that results are based on this targeted sample of 20 passages and note the need for larger-scale follow-up studies. Robustness checks for the correlation are added as described above. revision: yes
Circularity Check
No significant circularity: purely empirical evaluation with direct statistical reporting
full rationale
The paper reports an observational study: 20 passages were selected, translated by three LLMs, scored by domain experts using a modified MQM framework, and correlated with an independently measured property (terminology rarity via corpus frequency). The reported r = -.97 is a direct Pearson computation on the observed scores and frequencies; it is not obtained by fitting a model to a subset and then predicting a related quantity by construction, nor does any equation or claim reduce to a self-definition. No self-citations are invoked to justify uniqueness theorems or ansatzes, and no derivations exist that could be circular. The study is self-contained against its own human judgments and corpus counts, satisfying the criteria for a non-circular empirical evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert application of the modified Multidimensional Quality Metrics (MQM) framework yields valid quality scores for LLM translations of ancient Greek without reference texts.
Reference graph
Works this paper leans on
-
[1]
for Mix. and C. G. Kühn for Zainaldin et al. 2026 6 Comp. (Kühn 1827). Kühn’s text is of inferior quality and contains errors, but is generally readable and has not been replaced (Nutton 2002). A further motivation in selecting these works is that Mix. has two recent English translations (Singer and van der Eijk 2018; Johnston 2020), while Comp. has never...
work page 2026
-
[2]
The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy
Methodology To evaluate LLM-based MT performance, we implemented both automated (reference-based) and expert human (reference-free) frameworks. The automated tests are cheap, fast, and widely standardized, but of uncertain efficacy. We therefore undertook domain-expert examination of all 60 MT translations via modified MQM evaluation, which allowed Zainal...
work page 2026
-
[3]
that is the official evaluation strategy for the Workshop on Machine Translation (Freitag et al. 2021a; Freitag et al. Zainaldin et al. 2026 8 2021b). Previous studies evaluating MT for LRAL have relied on custom metrics (Volk et al
work page 2026
-
[4]
or qualitative spot-checks (Gutherz et al. 2023); none has applied the systematic analytical framework of MQM, in part because of its labor intensity. Nonetheless, MQM allows for fine-grained evaluation of translation quality utilizing domain-specific expertise. It does not presuppose the existence of authoritative reference translations, mitigating a cen...
work page 2023
-
[5]
but did not operationalize a separate category because it is impracticably vague: translating AG often requires interpretative rendering even when accurate, leaving no pragmatic test distinguishing “hallucination” from other sorts of misconstrual, and fluent hallucinations always resolve as specific terminological or propositional errors. Each MQM error w...
work page 2024
-
[6]
Aggregate Automated MT Evaluation Scores Text Model BLEU-4 chrF++ METEOR ROUGE-L BERTScore COMET BLEURT Mix. ChatGPT 31.4 (± 6.1) 53.4 (± 3.7) 46.4 (± 5.4) 50.9 (± 5.3) 91.0 (± 1.2) 79.9 (± 1.9) 49.8 (± 3.4) Zainaldin et al. 2026 10 Claude 34.2 (± 6.2) 55.4 (± 3.4) 48.5 (± 3.4) 55.3 (± 5.6) 91.6 (± 0.9) 79.8 (± 2.1) 50.4 (± 3.9) Gemini 34.2 (± 5.0) 57.0 (...
work page 2026
-
[7]
In general, performance on Mix
with ChatGPT in a distant third (differences in aggregate from Gemini of 4.8 and 8.5). In general, performance on Mix. was high, with mean aggregate TQS of 95.2 and 5 total Zainaldin et al. 2026 11 Critical errors across all 30 passages. Performance on Comp. was lower, with mean aggregate TQS of 79.9 and 57 total Critical errors across all 30 passages. Wh...
work page 2026
-
[8]
yielded scores more comparable to Mix., ranging from 82.8 to 100.0 with no translations below 70, although passages 2 and 5 also posed difficulties (mean TQS 86.7 and 89.5, respectively). Zainaldin et al. 2026 12 Table 3 summarizes TQS by text and passage type for Comp. under several stratification schemes, taking into account passage-specific variability. Table
work page 2026
-
[9]
Gap on Automated Metrics: Stratified Analysis Metric All passages Excl
Mix.–Comp. Gap on Automated Metrics: Stratified Analysis Metric All passages Excl. 8, 10 Excl. 3, 8, 10 BLEU-4 −48.6% −47.4% −47.4% chrF++ −10.8% −9.5% −9.4% METEOR −12.0% −9.7% −8.7% ROUGE-L −12.8% −12.9% −13.8% BERTScore −2.0% −1.3% −1.4% COMET −4.8% −3.9% −3.4% BLEURT −11.1% −11.1% −11.5% Note: Values represent relative performance drop from Mix. to Co...
work page 2026
-
[10]
Correlations between Automated Metrics and MQM TQS Metric Pearson r [95% CI] p-value Spearman ρ [95% CI] p-value BERTScore +0.75*** [.62, .85] < .001 +0.43*** [.20, .62] < .001 COMET +0.60*** [.41, .74] < .001 +0.51*** [.30, .68] < .001 METEOR +0.53*** [.32, .69] < .001 +0.26* [.01, .48] .044 chrF++ +0.53*** [.32, .69] < .001 +0.38** [.13, .57] .003 BLEU-...
work page 2026
-
[11]
2026 15 Note: Pass Rate = High Pass + Low Pass
85.7% 61.9% 11.0% 24.8% Zainaldin et al. 2026 15 Note: Pass Rate = High Pass + Low Pass. Gap = Mix. pass rate minus Comp. pass rate under each scheme. Stratification excludes specified Comp. passages; Mix. includes all passages in all rows. The quality rating analysis reveals a dimension of performance not fully captured by mean TQS. While section 4.2 dem...
work page 2026
-
[12]
MQM Error Typology Distribution by Text Mix. Comp. Comp./Mix. Ratio Total errors 170 265 1.6× Errors per passage 5.7 8.8 1.6× Severity Neutral 29 (17.1%) 45 (17.0%) 1.0× Minor 103 (60.6%) 105 (39.6%) 0.7× Major 33 (19.4%) 58 (21.9%) 1.1× Critical 5 (2.9%) 57 (21.5%) 7.4× Error Type Terminology 74 (43.5%) 198 (74.7%) 1.7× Accuracy 96 (56.5%) 67 (25.3%) 0.4...
work page 2026
-
[13]
Discussion 5.1 Translation Quality: Previously Translated vs. Untranslated Texts The central question that we posed at the outset of this study—how well do general-purpose LLMs “out of the box” translate ancient technical prose?—requires a nuanced answer based on our findings. On the previously translated text (Mix.), LLM performance was strong: mean aggr...
work page 2026
-
[14]
produced catastrophic failures across all models, while the remaining expository passages achieved quality within 2–4 points of Mix. It remains to consider what explains the gap between performance on the translated and untranslated texts. 5.1.1 The Memorization Hypothesis One plausible explanation for the Mix.–Comp. gap is that LLMs have “memorized” huma...
work page 2023
-
[15]
2026 19 Note: Pearson r values
Correlations between Terminology Rarity Metrics and Translation Quality Predictor TQS (Comp.) TQS (All) Critical (Comp.) Critical (All) Rare term ratio −.97*** −.93*** +.92*** +.90*** Not-found ratio −.96*** −.94*** +.91*** +.91*** Average Zipf frequency +.96*** +.95*** −.91*** −.92*** Zainaldin et al. 2026 19 Note: Pearson r values. Comp. = Comp. passage...
work page 2026
-
[16]
as well as with critical error count (r = +.92, p < .001). A simple linear regression showed that rare term ratio explained 93.9% of variance in passage-level TQS (R² = .94; bootstrap 95% CI: .42–.99). However, Cook’s distance analysis identified passages 8 and 10 as highly influential observations (D = 1.50 and 2.03, respectively, against a 4/n threshold...
work page 2026
-
[17]
Correlations between Automated Metrics and MQM TQS by Text Metric Mix. (n=30) Comp. (n=30) Combined (n=60) BERTScore −0.10 +0.85*** +0.75*** COMET −0.07 +0.62*** +0.60*** METEOR +0.02 +0.55** +0.53*** chrF++ −0.00 +0.55** +0.53*** BLEU-4 −0.08 +0.42* +0.45*** ROUGE-L +0.13 +0.24 +0.34** BLEURT +0.00 +0.18 +0.32* Note: Pearson r values. * p < .05, ** p < ....
work page 2026
-
[18]
and other passages, but COMET compressed this into a 4.8% dip (77.3% vs. 72.5%). A practitioner relying on COMET scores would be seriously misled on the most terminologically difficult passages, including: • Comp. 8 ChatGPT: TQS = 0.0, COMET = 73.3% • Comp. 10 Gemini: TQS = 25.2, COMET = 75.6% • Comp. 8 Claude: TQS = 21.8, COMET = 74.0% Conversely, BLEU s...
work page 2026
-
[19]
for non-catastrophic passages, with quality approaching expert level for common vocabulary (in our dataset, rare term ratio <0.20). • LLMs are reliably useful for syntactic parsing and initial orientation to unfamiliar texts; general sense and argument structure are typically preserved even when individual terms are mistranslated (as with the scattered cr...
work page 2026
-
[20]
We evaluated intra-passage accuracy but not cross-passage terminological consistency
Cross-passage consistency. We evaluated intra-passage accuracy but not cross-passage terminological consistency. Utility of LLM-based MT could degrade without controlling for consistency across multiple inputs. Zainaldin et al. 2026 24
work page 2026
-
[21]
References Akavarapu, V. S. D. S. Mahesh, et al. (2025) ‘A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs’, Findings of the Association for Computational Linguistics: ACL 2025 2745–61, https://doi.org/10.18653/v1/2025.findings-acl.141. Amrhein, Chantal, and Rico Sennrich (2022) ‘Identifying Weaknesses in Machine Trans...
-
[22]
Mix. 1 Claude 96.7 45.7% 92.9% 82.6% HP 1 Gemini 99.3 41.1% 93.4% 83.1% HP 1 ChatGPT 97.0 39.5% 92.7% 82.4% HP 2 Claude 92.0 36.1% 91.9% 83.0% LP Zainaldin et al. 2026 2 2 Gemini 99.7 32.5% 91.0% 83.2% HP 2 ChatGPT 77.8 34.4% 91.0% 82.8% F 3 Claude 100.0 24.4% 90.8% 79.2% HP 3 Gemini 100.0 30.1% 90.3% 81.2% HP 3 ChatGPT 97.5 24.2% 90.3% 80.7% HP 4 Claude ...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.