From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

Hansa Dhammahaso; Máté Metzger; Nadnapang Phophichit

arxiv: 2606.01136 · v1 · pith:5AUELDVKnew · submitted 2026-05-31 · 💻 cs.CL

Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

Pith reviewed 2026-06-28 17:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM translation auditing · Pali-to-English · embedding drift · multi-reference evaluation · translation error detection · classical language translation · adjudication panel

The pith

Drift from a multi-translator reference centroid predicts major-error rates in LLM Pali-to-English translations, so outliers can be triaged for review rather than all treated as mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a triage method for auditing LLM translations of the Pali Canon. Instead of scoring against a single gold standard, it measures each candidate's embedding drift from the centroid of three established human translations and uses that drift to prioritize review. Drift correlates with error severity: the rate of major errors such as omission or doctrinal term mistakes rises from 7.9 percent in the 1.5-2.0 drift band to 51.6 percent above 3.0, while roughly 80 percent of outliers in the 1.5-2.0 band count as valid variations. Model differences appear most clearly in the high-drift tail, where Grok 4.3 has both the largest outlier volume and the highest major-error rate. This matters because single-score metrics conflate legitimate variation with error in classical languages that admit multiple defensible renderings. The design supplies a reusable audit workflow that flags the tail for adjudication rather than labeling every outlier as a failure.

Core claim

By defining a local reference envelope from three human translations and measuring normalized embedding drift, the authors show that drift serves as a severity predictor rather than an error label. The major-error rate among adjudicated high-drift candidates rises monotonically across drift bands, approximately 80 percent of outliers in the 1.5-2.0 band are valid variations, and model differences concentrate in the tail, where Grok 4.3 records the highest major-error rates (27.6 percent overall, 74.4 percent above drift 3.0).

What carries the argument

Normalized embedding drift from the multi-translator reference centroid, used as a triage signal before blinded three-model LLM judge panel adjudication calibrated on a 300-instance validation set.
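The drift signal can be sketched in a few lines. The summary above does not say how the drift is normalized or which embedding model is used, so the sketch below makes an assumption: drift is the Euclidean distance from the candidate to the reference centroid, divided by the mean distance of the references themselves to that centroid. The function names and the toy 2-D vectors are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_drift(candidate: Vector, references: List[Vector]) -> float:
    """Distance from the reference centroid, in units of the references' mean spread.

    ASSUMPTION: the paper's normalization is not specified in this summary;
    dividing by the mean reference-to-centroid distance is one plausible choice.
    """
    c = centroid(references)
    spread = sum(euclidean(r, c) for r in references) / len(references)
    if spread == 0.0:
        raise ValueError("references are identical; drift is undefined")
    return euclidean(candidate, c) / spread


# Toy 2-D vectors standing in for sentence embeddings (not data from the paper).
refs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
near = [0.5, 0.5]
far = [4.0, 3.0]
print(normalized_drift(near, refs), normalized_drift(far, refs))
```

Under this assumed definition a drift of 1.0 means the candidate sits as far from the centroid as a typical human reference does, which makes a threshold such as 1.5 readable as "50 percent further out than the translators themselves".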

If this is right

Major-error rates increase steadily with higher drift thresholds.
Most candidates in the 1.5-2.0 drift band represent acceptable translation variations.
Grok 4.3 produces both more high-drift outliers and a higher proportion of major errors in the tail than the other three models.
The dominant error categories are omission, truncation, and doctrinal term mistakes.
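The band-wise claim in the list above can be checked with a small tally. The sketch below bins adjudicated candidates by drift and reports the major-error rate per band. Only the 1.5-2.0 band and the "above 3.0" band are named in the summary; the two middle bands, the half-open boundary convention, and the toy data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# The 1.5-2.0 and "above 3.0" bands come from the summary; the two middle
# bands and the half-open [lo, hi) convention are ASSUMED for illustration.
BANDS: List[Tuple[str, float, float]] = [
    ("1.5-2.0", 1.5, 2.0),
    ("2.0-2.5", 2.0, 2.5),
    ("2.5-3.0", 2.5, 3.0),
    ("above 3.0", 3.0, float("inf")),
]


def band_of(drift: float) -> str:
    """Return the label of the band that contains this drift value."""
    for label, lo, hi in BANDS:
        if lo <= drift < hi:
            return label
    raise ValueError(f"drift {drift} is below the 1.5 triage threshold")


def major_error_rate_by_band(items: List[Tuple[float, bool]]) -> Dict[str, float]:
    """items: (drift, is_major_error) pairs for adjudicated candidates."""
    counts = {label: [0, 0] for label, _, _ in BANDS}  # [major, total]
    for drift, is_major in items:
        tally = counts[band_of(drift)]
        tally[0] += int(is_major)
        tally[1] += 1
    return {label: m / t for label, (m, t) in counts.items() if t > 0}


def is_monotonic_non_decreasing(rates: Dict[str, float]) -> bool:
    """True if the rate never falls when moving to a higher drift band."""
    ordered = [rates[label] for label, _, _ in BANDS if label in rates]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Toy adjudication results (made up, not the paper's data).
toy = [(1.6, False), (1.8, False), (1.9, True), (1.7, False),
       (2.2, True), (2.3, False), (2.7, True), (2.8, False),
       (3.4, True), (3.1, True), (3.9, False)]
rates = major_error_rate_by_band(toy)
print(rates, is_monotonic_non_decreasing(rates))
```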

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The triage workflow could extend to other classical languages that have multiple authoritative human translations to prioritize review effort.
Embedding-based drift may offer a more stable signal than single-reference metrics when variation is expected.
Threshold tuning on new corpora could balance review volume against capture of severe errors.
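The threshold trade-off in the last point can be made concrete with a sweep. The sketch below is a hypothetical helper, not something described in the paper: for each candidate threshold it reports how many items would be sent to review and what share of all major errors those items contain. It assumes error labels exist for every item, including those below the threshold, which in practice would require adjudicating a sample of low-drift candidates as well.

```python
from typing import List, Tuple


def sweep_thresholds(items: List[Tuple[float, bool]],
                     thresholds: List[float]) -> List[Tuple[float, int, float]]:
    """For each threshold return (threshold, review_volume, recall_of_major_errors).

    items: (drift, is_major_error) pairs. ASSUMPTION: labels are available
    for all items, not only for those above the triage threshold.
    """
    total_major = sum(1 for _, major in items if major)
    rows = []
    for t in thresholds:
        flagged = [(d, m) for d, m in items if d >= t]
        caught = sum(1 for _, m in flagged if m)
        recall = caught / total_major if total_major else 0.0
        rows.append((t, len(flagged), recall))
    return rows


# Toy labeled sample (made up, not the paper's data).
sample = [(0.8, False), (1.2, True), (1.6, False), (1.9, True),
          (2.4, True), (3.2, True), (0.5, False), (2.1, False)]
for threshold, volume, recall in sweep_thresholds(sample, [1.0, 1.5, 2.0]):
    print(f"threshold={threshold:.1f} review_volume={volume} recall={recall:.2f}")
```

Raising the threshold shrinks the review queue but lets more severe errors slip through; the sweep simply exposes both numbers side by side so the choice is explicit.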

Load-bearing premise

The blinded three-model LLM judge panel, after calibration against author-adjudicated examples, reliably distinguishes legitimate translation variations from major errors.
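How three judge verdicts become one label is not described in this summary. A minimal sketch, assuming a strict-majority rule with ties left unresolved for human review, might look like the following. The label names are hypothetical; only "valid variation" and "major error" appear in the source, and the intermediate category is added for illustration.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical label set; "minor_error" is an assumed intermediate category.
LABELS = ("valid_variation", "minor_error", "major_error")


def panel_verdict(votes: List[str]) -> Optional[str]:
    """Return the strict-majority label, or None when no label has a majority.

    ASSUMPTION: majority voting is one plausible aggregation rule; the paper's
    actual rule is not given in this summary.
    """
    if not votes:
        raise ValueError("need at least one vote")
    unknown = [v for v in votes if v not in LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


print(panel_verdict(["major_error", "major_error", "valid_variation"]))
print(panel_verdict(["major_error", "minor_error", "valid_variation"]))
```

Returning `None` on a three-way split keeps disagreements visible instead of forcing a label, which matters when the panel's reliability is itself the open question.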

What would settle it

Large-scale human re-adjudication of the high-drift candidates that shows no monotonic rise in major-error rate with drift bands or no difference in tail error rates across the four models.

Original abstract

Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a usable triage method for flagging bad Pali translations via multi-reference drift, but the results hinge on an LLM judge panel whose reliability is not fully checked.

The core new piece here is the audit pipeline: pull three solid human translations as a reference envelope, measure embedding drift from their centroid to triage outliers, then run a calibrated LLM panel on the flagged cases instead of scoring everything against one gold standard. That setup fits classical texts where multiple renderings are defensible, and the abstract shows it working on 1,700 Pali Canon passages translated by four LLMs.

What stands out is the monotonic signal. Major-error rates among the adjudicated high-drift items climb steadily from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while roughly 80% of the outliers in the 1.5-2.0 band turn out to be acceptable variations. Model differences also appear mainly in the tail: GPT-5.5 has the lowest major-error rate, though its confidence intervals overlap those of Claude Sonnet 4.6 and Gemini 3.1 Pro, and Grok 4.3 is worst on both outlier volume and error rate. Those trends are concrete and worth having.

The soft spot is the judge panel itself. The design calibrates on a 300-instance author-adjudicated set, but the abstract gives no numbers on agreement with independent humans, no kappa or similar metric, and no detail on how the prompts handle doctrinal Pali terms. If the panel systematically misses omissions or misreads interpretive language, both the severity prediction and the model ranking lose force. That assumption is load-bearing.
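For reference, the agreement statistic the letter asks for is cheap to compute once panel labels and independent human labels exist for the same items. The sketch below implements Cohen's kappa for two raters over nominal labels; the label strings and the toy data are made up and do not come from the paper.

```python
from collections import Counter
from typing import List


def cohens_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa for two raters assigning nominal labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k]
                   for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (made up): LLM panel verdicts vs. an independent human expert.
panel = ["major", "valid", "valid", "major", "valid", "valid", "major", "valid"]
human = ["major", "valid", "valid", "valid", "valid", "valid", "major", "major"]
print(round(cohens_kappa(panel, human), 3))
```

In this toy case raw agreement is 75 percent, but kappa is noticeably lower because both raters label most items "valid", so a good share of the agreement is expected by chance. That gap is why the letter asks for kappa rather than raw agreement.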

This is aimed at people doing digital humanities work on classical-language corpora or anyone building translation quality checks for low-resource or interpretive domains. The method is reusable and the empirical patterns are reported clearly enough to test.

It deserves a serious referee. The contribution is narrow but the triage idea is practical and the data trends are there; a review can check the missing agreement stats and prompt details without needing to throw the whole thing out.

Referee Report

3 major / 2 minor

Summary. The paper audits Pali-to-English translations from four LLMs (GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.3) on 1,700 Pali Canon passages. It defines a local reference envelope from three human translations, uses normalized embedding drift above a 1.5 threshold to triage 1,203 outliers, and adjudicates them via a blinded three-model LLM judge panel calibrated on a 300-instance author-adjudicated validation set. Central claims are that drift predicts error severity (major-error rate rises monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, with ~80% of low-band outliers judged valid variations) and that model differences appear in the high-drift tail (GPT-5.5 lowest major-error rate; Grok 4.3 highest at 27.6% overall and 74.4% above drift 3.0).

Significance. If the adjudication holds, the work supplies a practical, reusable audit design for classical-language translation that avoids single-gold-standard assumptions and focuses review effort on the error-prone tail. The monotonic severity prediction and tail-specific model comparisons would be directly useful for doctrinal text evaluation.

major comments (3)

[adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.
[results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.
[drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.
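As an illustration of the kind of test the second comment asks for, the sketch below runs a Pearson chi-square test on a 2x2 table (one model against another, major error against not). It uses only the standard library: with one degree of freedom the p-value has the closed form erfc(sqrt(x/2)). The counts are invented for the example and are not the paper's. A comparison across all four models would need a 4x2 table with three degrees of freedom, for which a statistics library such as SciPy would be the usual route.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the survival
    function of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a margin is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Invented counts: rows are two models, columns are (major error, no major error).
stat, p_value = chi_square_2x2(30, 10, 15, 25)
print(f"chi2={stat:.3f} p={p_value:.5f}")
```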

minor comments (2)

[abstract] The abstract refers to 'normalized embedding drift' without an equation or definition of the normalization; add a short formal definition or reference in the methods.
[results] No table or figure caption clarifies the exact band boundaries or sample sizes per band; a small summary table would improve readability of the 7.9% / 51.6% figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The three major comments identify important gaps in reproducibility and statistical rigor. We address each below and will revise the manuscript accordingly.

Referee: [adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.

Authors: We agree that the manuscript currently lacks these details. In the revised version we will add a dedicated methods subsection specifying the exact three models used in the judge panel, the full adjudication prompts, and agreement statistics (including Cohen's kappa) between the panel and the author-adjudicated validation set on held-out instances. revision: yes
Referee: [results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.

Authors: We acknowledge the omission of formal tests. The revised manuscript will include chi-square tests for differences in major-error proportions across models together with bootstrap confidence-interval comparisons, with explicit reporting for the Grok 4.3 high-drift tail result. revision: yes
Referee: [drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.

Authors: The comment is correct; the embedding model is not named and no sensitivity analysis appears. We will identify the model in the methods and add a sensitivity analysis that varies both the embedding model and the drift threshold to verify robustness of the monotonic severity trend. revision: yes
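The bootstrap confidence intervals promised in the second response could be produced along these lines. This is a sketch of a plain percentile bootstrap for a single proportion, with a fixed seed for repeatability; the resampling count, the seed, and the toy outcomes are assumptions, and the authors may well choose a different interval method.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a proportion.

    outcomes: 0/1 flags, e.g. 1 if a candidate was judged a major error.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the proportion each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy tail sample (made up): 30 major errors out of 40 adjudicated candidates.
tail = [1] * 30 + [0] * 10
low, high = bootstrap_ci(tail)
print(f"point estimate={sum(tail) / len(tail):.3f} 95% CI=({low:.3f}, {high:.3f})")
```

With small tail samples the interval is wide, which is the practical reason a headline figure for a single high-drift band needs an interval next to it before models are ranked on it.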

Circularity Check

0 steps flagged

No circularity: results derive from independent human references and separate adjudication

The paper uses multi-human reference translations to compute embedding drift as a triage signal only, then adjudicates flagged candidates via an LLM panel calibrated on a distinct 300-instance author-adjudicated validation set. Major-error rates by drift band and model comparisons are outputs of this external adjudication process, not reductions of fitted parameters, self-definitions, or self-citation chains. No equations or steps equate the reported rates to the drift metric by construction; the methodology is self-contained against the provided human references and validation set.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central method depends on a hand-chosen drift threshold and two domain assumptions about embeddings and LLM judges; no new entities are postulated.

free parameters (2)

drift threshold = 1.5
Selected to flag the 1,203 candidates for adjudication; appears chosen by inspection rather than derived from first principles.
reporting bands (1.5-2.0, above 3.0)
Used to demonstrate monotonic rise in error rate; post-hoc selection for presentation.

axioms (2)

domain assumption: Normalized embedding drift from the multi-reference centroid is a valid proxy for translation deviation severity
Invoked to justify the triage signal before adjudication.
domain assumption: The three-model LLM judge panel, after calibration on the author-adjudicated set, produces reliable labels for valid variation versus major error
Underpins all reported major-error rates and model comparisons.

Reference graph

Works this paper leans on

68 extracted references · 44 canonical work pages · 6 internal anchors

[1]

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Artetxe, Mikel and Schwenk, Holger. (2019) ’Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond’, Transactions of the Association for Computational Linguistics , 7, pp. 597–610. Available at: https://doi.org/10.1162/tacl_a_00288

work page doi:10.1162/tacl_a_00288 2019
[2]

(2022) ’Restoring and Attributing Ancient Texts Using Deep Neural Networks’, Nature, 603(7900), pp

Assael, Y annis et al. (2022) ’Restoring and Attributing Ancient Texts Using Deep Neural Networks’, Nature, 603(7900), pp. 280–283. Available at: https://doi.org/10.1038/s41586-022-04448-z

work page doi:10.1038/s41586-022-04448-z 2022
[3]

(2020) ’Latin BERT: A Contextual Language Model for Classical Philology’

Bamman, David and Burns, Patrick J. (2020) ’Latin BERT: A Contextual Language Model for Classical Philology’. Available at: https://arxiv.org/abs/2009.10053 (Accessed: 12 May 2026)

work page arXiv 2020
[4]

(2024) ’Machine Translation Hallucination Detection for Low and High Resource Languages Using Large Language Models’

Benkirane, Kenza et al. (2024) ’Machine Translation Hallucination Detection for Low and High Resource Languages Using Large Language Models’. Available at: https://doi.org/10.48550/arXiv.2407. 16470 (Accessed: 12 May 2026)

work page doi:10.48550/arxiv.2407 2024
[5]

(2025) ’Experiments in Distant Reading: Using Topic Modeling on Chinese Buddhist Texts from 500-800 CE’, Digital Humanities Quarterly , 19(1)

Bingenheimer, Marcus, Brody, Justin, and Nichols, Ryan. (2025) ’Experiments in Distant Reading: Using Topic Modeling on Chinese Buddhist Texts from 500-800 CE’, Digital Humanities Quarterly , 19(1). Avail- able at: https://www.digitalhumanities.org/dhq/vol/19/1/000771/000771.html (Accessed: 12 May 2026)

2025
[6]

(2000) The Connected Discourses of the Buddha: A Translation of the Saṃyutta Nikāya

Bodhi, Bhikkhu. (2000) The Connected Discourses of the Buddha: A Translation of the Saṃyutta Nikāya . Boston: Wisdom Publications

2000
[7]

(2012) The Numerical Discourses of the Buddha: A Translation of the Aṅguttara Nikāya

Bodhi, Bhikkhu. (2012) The Numerical Discourses of the Buddha: A Translation of the Aṅguttara Nikāya . Boston: Wisdom Publications

2012
[8]

and Lopez, Donald S., Jr

Buswell, Robert E., Jr. and Lopez, Donald S., Jr. (eds) (2014) The Princeton Dictionary of Buddhism. Prince- ton: Princeton University Press

2014
[9]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Chiang, Wei-Lin et al. (2024) ’Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference’ in Proceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research 235, pp. 8359–8388. Available at: https://arxiv.org/abs/2403.04132 (Accessed: 12 May 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

(1960) ’ A Coeﬀicient of Agreement for Nominal Scales’, Educational and Psychological Measurement, 20(1), pp

Cohen, Jacob. (1960) ’ A Coeﬀicient of Agreement for Nominal Scales’, Educational and Psychological Measurement, 20(1), pp. 37–46. Available at: https://doi.org/10.1177/001316446002000104 19

work page doi:10.1177/001316446002000104 1960
[11]

Costa-jussà, Marta R. et al. (2022) ’No Language Left Behind: Scaling Human-Centered Machine Transla- tion’. Available at: https://arxiv.org/abs/2207.04672 (Accessed: 12 May 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Costa-jussà, Marta R. et al. (2024) ’Scaling Neural Machine Translation to 200 Languages’, Nature, 630(8018), pp. 841–846. Available at: https://doi.org/10.1038/s41586-024-07335-x

work page doi:10.1038/s41586-024-07335-x 2024
[13]

Dale, David et al. (2023) ’Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better’ in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics . Toronto: Association for Computational Linguistics, pp. 36–50. Available at: https://doi.org/10.18653...

work page doi:10.18653/v1/2023.acl-long.3 2023
[14]

Language-agnostic BERT Sentence Embedding

Feng, Fangxiaoyu et al. (2022) ’Language-Agnostic BERT Sentence Embedding’ in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics . Dublin: Association for Computational Linguistics, pp. 878–891. Available at: https://doi.org/10.18653/v1/2022.acl-long.62

work page doi:10.18653/v1/2022.acl-long.62 2022
[15]

(2020) ’Multi-Hypothesis Machine Translation Evaluation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Fomicheva, Marina, Specia, Lucia, and Guzmán, Francisco. (2020) ’Multi-Hypothesis Machine Translation Evaluation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, pp. 1218–1232. Available at: https://doi.org/10 .18653/v1/2020.acl-main.113

2020
[16]

(2021) ’Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation’, Transactions of the Association for Computational Linguistics , 9, pp

Freitag, Markus et al. (2021) ’Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation’, Transactions of the Association for Computational Linguistics , 9, pp. 1460–1474. Available at: https://doi.org/10.1162/tacl_a_00437

work page doi:10.1162/tacl_a_00437 2021
[17]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Freitag, Markus et al. (2022) ’Results of WMT22 Metrics Shared Task: Stop Using BLEU–Neural Metrics Are Better and More Robust’ in Proceedings of the Seventh Conference on Machine Translation. Abu Dhabi: Association for Computational Linguistics, pp. 46–68. Available at: https://doi.org/10.18653/v1/ 2022.wmt-1.2

work page doi:10.18653/v1/ 2022
[18]

(2022) ’The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation’,Transactions of the Association for Computational Linguistics, 10, pp

Goyal, Naman et al. (2022) ’The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation’,Transactions of the Association for Computational Linguistics, 10, pp. 522–538. Avail- able at: https://doi.org/10.1162/tacl_a_00474

work page doi:10.1162/tacl_a_00474 2022
[19]

Guerreiro, Nuno M. et al. (2023a) ’Hallucinations in Large Multilingual Translation Models’, Transactions of the Association for Computational Linguistics , 11, pp. 1500–1517. Available at: https://doi.org/10 .1162/tacl_a_00615
[20]

Guerreiro, Nuno M., Voita, Elena, and Martins, André F. T. (2023b) ’Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation’ in Proceedings of the 17th Confer- ence of the European Chapter of the Association for Computational Linguistics . Dubrovnik: Association for Computational Linguistics, pp. 1059–1075...

work page doi:10.18653/v1/2023.eac 2023
[21]

Guerreiro, Nuno M. et al. (2024) ’xCOMET: Transparent Machine Translation Evaluation through Fine- Grained Error Detection’, Transactions of the Association for Computational Linguistics , 12, pp. 979–995. Available at: https://doi.org/10.1162/tacl_a_00683

work page doi:10.1162/tacl_a_00683 2024
[22]

(2015) ’Morphological Disambiguation of Classical Sanskrit’ in Mahlow, Cerstin and Pi- otrowski, Michael (eds) Systems and Frameworks for Computational Morphology

Hellwig, Oliver. (2015) ’Morphological Disambiguation of Classical Sanskrit’ in Mahlow, Cerstin and Pi- otrowski, Michael (eds) Systems and Frameworks for Computational Morphology. Cham: Springer, pp. 41–

2015
[23]

Available at: https://doi.org/10.1007/978-3-319-23980-4_3

work page doi:10.1007/978-3-319-23980-4_3
[24]

(2023) ’How Good Are GPT Models at Machine Translation? A Comprehensive Evalua- tion’

Hendy, Amr et al. (2023) ’How Good Are GPT Models at Machine Translation? A Comprehensive Evalua- tion’. Available at: https://arxiv.org/abs/2302.09210 (Accessed: 12 May 2026)

work page arXiv 2023
[25]

(2023) ’Is ChatGPT a Good Translator? Y es with GPT-4 as the Engine’

Jiao, Wenxiang et al. (2023) ’Is ChatGPT a Good Translator? Y es with GPT-4 as the Engine’. Available at: https://doi.org/10.48550/arXiv.2301.08745 (Accessed: 12 May 2026)

work page doi:10.48550/arxiv.2301.08745 2023
[26]

(2023a) ’GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4’ in Proceedings of the Eighth Conference on Machine Translation

Kocmi, Tom and Federmann, Christian. (2023a) ’GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4’ in Proceedings of the Eighth Conference on Machine Translation . Singapore: Association for Computational Linguistics, pp. 768–775. Available at: https://doi.org/10.18653/v1/2023.wmt-1 .64 20

work page doi:10.18653/v1/2023.wmt-1 2023
[27]

Kocmi, Tom and Federmann, Christian. (2023b) ’Large Language Models Are State-of-the-Art Evaluators of Translation Quality’ in Proceedings of the 24th Annual Conference of the European Association for Machine Translation. Tampere: European Association for Machine Translation, pp. 193–203. Available at: https: //arxiv.org/abs/2302.14520 (Accessed: 12 May 2026)

work page arXiv 2026
[28]

(2009) ’The METEOR Metric for Automatic Evaluation of Machine Translation’, Machine Translation, 23(2–3), pp

Lavie, Alon and Denkowski, Michael J. (2009) ’The METEOR Metric for Automatic Evaluation of Machine Translation’, Machine Translation, 23(2–3), pp. 105–115. Available at: https://doi.org/10.1007/s1 0590-009-9059-4

work page doi:10.1007/s1 2009
[29]

(2023) ’G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment’ in Proceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing

Liu, Y ang et al. (2023) ’G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment’ in Proceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, pp. 2511–2522. Available at: https://doi.org/10.18653/v1/2023.e mnlp-main.153

work page doi:10.18653/v1/2023.e 2023
[30]

(2018) ’Metrics for Translation Quality Assessment: A Case for Standardizing Error T ypolo- gies’ in Moorkens, Joss et al

Lommel, Arle. (2018) ’Metrics for Translation Quality Assessment: A Case for Standardizing Error T ypolo- gies’ in Moorkens, Joss et al. (eds) Translation Quality Assessment: From Principles to Practice . Cham: Springer, pp. 109–127. Available at: https://doi.org/10.1007/978-3-319-91241-7_6

work page doi:10.1007/978-3-319-91241-7_6 2018
[31]

Mathur, Nitika, Baldwin, Timothy, and Cohn, Trevor. (2020) ’Tangled up in BLEU: Reevaluating the Eval- uation of Automatic Machine Translation Evaluation Metrics’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, pp. 4984–4997. Available at: https://doi.org/10.18...

work page doi:10.18653/v1/2020.acl-main.448 2020
[32]

(n.d.) MQM Error Typology

MQM Council. (n.d.) MQM Error Typology. Available at: https://themqm.org/error-types-2/typ ology/ (Accessed: 12 May 2026)

2026
[33]

(2023) ’MTEB: Massive Text Embedding Benchmark’ in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Muennighoff, Niklas et al. (2023) ’MTEB: Massive Text Embedding Benchmark’ in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics . Dubrovnik: Associ- ation for Computational Linguistics, pp. 2014–2037. Available at: https://doi.org/10.18653/v1/20 23.eacl-main.148

work page doi:10.18653/v1/20 2023
[34]

Nehrdich, Sebastian and Keutzer, Kurt. (2026) ’MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan’. Available at: https://arxiv.org/abs/2601.06400 (Accessed: 12 May 2026)

work page arXiv 2026
[35]

(2024) ’One Model Is All Y ou Need: ByT5- Sanskrit, a Unified Model for Sanskrit NLP Tasks’ in Findings of the Association for Computational Lin- guistics: EMNLP 2024

Nehrdich, Sebastian, Hellwig, Oliver, and Keutzer, Kurt. (2024) ’One Model Is All Y ou Need: ByT5- Sanskrit, a Unified Model for Sanskrit NLP Tasks’ in Findings of the Association for Computational Lin- guistics: EMNLP 2024 . Miami: Association for Computational Linguistics, pp. 13742–13751. Available at: https://doi.org/10.18653/v1/2024.findings-emnlp.805

work page doi:10.18653/v1/2024.findings-emnlp.805 2024
[36]

(1995) The Middle Length Discourses of the Buddha: A Translation of the Majjhima Nikāya

Ñāṇamoli, Bhikkhu and Bodhi, Bhikkhu. (1995) The Middle Length Discourses of the Buddha: A Translation of the Majjhima Nikāya . Boston: Wisdom Publications

1995
[37]

(2002) ’BLEU: A Method for Automatic Evaluation of Machine Translation’ in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

Papineni, Kishore et al. (2002) ’BLEU: A Method for Automatic Evaluation of Machine Translation’ in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics . Philadelphia: Association for Computational Linguistics, pp. 311–318. Available at: https://doi.org/10.3115/10 73083.1073135

work page doi:10.3115/10 2002
[38]

chr F : character n-gram F -score for automatic MT evaluation

Popović, Maja. (2015) ’chrF: Character N-Gram F-Score for Automatic MT Evaluation’ in Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon: Association for Computational Linguistics, pp. 392–395. Available at: https://doi.org/10.18653/v1/W15-3049

work page doi:10.18653/v1/w15-3049 2015
[39]

(2017) ’chrF++: Words Helping Character N-Grams’ in Proceedings of the Second Confer- ence on Machine Translation

Popović, Maja. (2017) ’chrF++: Words Helping Character N-Grams’ in Proceedings of the Second Confer- ence on Machine Translation. Copenhagen: Association for Computational Linguistics, pp. 612–618. Avail- able at: https://doi.org/10.18653/v1/W17-4770

work page doi:10.18653/v1/w17-4770 2017
[40]

(2018) ’ A Call for Clarity in Reporting BLEU Scores’ in Proceedings of the Third Conference on Machine Translation

Post, Matt. (2018) ’ A Call for Clarity in Reporting BLEU Scores’ in Proceedings of the Third Conference on Machine Translation . Brussels: Association for Computational Linguistics, pp. 186–191. Available at: https://doi.org/10.18653/v1/W18-6319 21

work page doi:10.18653/v1/w18-6319 2018
[41]

Raunak, Vikas, Menezes, Arul, and Junczys-Dowmunt, Marcin. (2021) ’The Curious Case of Hallucinations in Neural Machine Translation’ in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Online: Association for Computational Linguistics, pp. 1172–1183. Available ...

work page doi:10.18653/v1/2021.naa 2021
[42]

COMET : A Neural Framework for MT Evaluation

Rei, Ricardo et al. (2020) ’COMET: A Neural Framework for MT Evaluation’ in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics, pp. 2685–2702. Available at: https://doi.org/10.18653/v1/2020.emnlp-main.213

work page doi:10.18653/v1/2020.emnlp-main.213 2020
[43]

(2022) ’COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task’ in Proceedings of the Seventh Conference on Machine Translation

Rei, Ricardo et al. (2022) ’COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task’ in Proceedings of the Seventh Conference on Machine Translation. Abu Dhabi: Association for Computational Linguistics, pp. 578–585. Available at: https://doi.org/10.18653/v1/2022.wmt-1.52

work page doi:10.18653/v1/2022.wmt-1.52 2022
[44] Reimers, Nils and Gurevych, Iryna. (2019) ’Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks’ in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Hong Kong: Association for Computational Linguistics, pp. 3982–3992. Available at: https://doi.org/10.18653/v1/D19-1410
[45] Riemenschneider, Frederick and Frank, Anette. (2023) ’Exploring Large Language Models for Classical Philology’ in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto: Association for Computational Linguistics, pp. 15181–15199. Available at: https://doi.org/10.18653/v1/2023.acl-long.846
[46] Sellam, Thibault, Das, Dipanjan, and Parikh, Ankur. (2020) ’BLEURT: Learning Robust Metrics for Text Generation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 7881–7892. Available at: https://doi.org/10.18653/v1/2020.acl-main.704
[47] Sommerschield, Thea et al. (2023) ’Machine Learning for Ancient Languages: A Survey’, Computational Linguistics, 49(3), pp. 703–747. Available at: https://doi.org/10.1162/coli_a_00481
[48] SuttaCentral. (n.d.) SuttaCentral. Available at: https://suttacentral.net (Accessed: 12 May 2026)
[49] Thanissaro Bhikkhu. (n.d.) Suttas. Available at: https://www.dhammatalks.org/suttas/ (Accessed: 12 May 2026)
[50] Tymoczko, Maria. (2014) Enlarging Translation, Empowering Translators. London: Routledge
[51] Venuti, Lawrence. (2018) The Translator’s Invisibility: A History of Translation. London: Routledge
[52] Vilar, David et al. (2006) ’Error Analysis of Statistical Machine Translation Output’ in Proceedings of the Fifth International Conference on Language Resources and Evaluation. Genoa: European Language Resources Association. Available at: https://aclanthology.org/L06-1244/ (Accessed: 12 May 2026)
[53] Wilson, Edwin B. (1927) ’Probable Inference, the Law of Succession, and Statistical Inference’, Journal of the American Statistical Association, 22(158), pp. 209–212. Available at: https://doi.org/10.1080/01621459.1927.10502953
[54] Wu, Si, Wieting, John, and Smith, David A. (2024) ’Multiple References with Meaningful Variations Improve Literary Machine Translation’. Available at: https://arxiv.org/abs/2412.18707 (Accessed: 12 May 2026)
[55] Xu, Haoran et al. (2024) ’A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models’ in The Twelfth International Conference on Learning Representations. Available at: https://arxiv.org/abs/2309.11674 (Accessed: 12 May 2026)
[56] Zainaldin, James L. et al. (2026) ’Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen’. Available at: https://doi.org/10.48550/arXiv.2602.24119 (Accessed: 12 May 2026)
[57] Zhang, Tianyi et al. (2020) ’BERTScore: Evaluating Text Generation with BERT’ in International Conference on Learning Representations. Available at: https://arxiv.org/abs/1904.09675 (Accessed: 12 May 2026)
[58] Zhang, Yanzhao et al. (2025) ’Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models’. Available at: https://arxiv.org/abs/2506.05176 (Accessed: 12 May 2026)
[59] Zhang, Yixuan and Li, Haonan. (2023) ’Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE’ in Proceedings of the Ancient Language Processing Workshop. Varna: INCOMA Ltd., pp. 80–87. Available at: https://aclanthology.org/2023.alp-1.9/ (Accessed: 12 May 2026)
[60] Zheng, Lianmin et al. (2023) ’Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena’ in Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track. Available at: https://arxiv.org/abs/2306.05685 (Accessed: 12 May 2026)
[61] Zhu, Wenhao et al. (2024) ’Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis’ in Findings of the Association for Computational Linguistics: NAACL 2024. Mexico City: Association for Computational Linguistics, pp. 2765–2781. Available at: https://doi.org/10.18653/v1/2024.findings-naacl.176

A Translation Prompt A....
Pāli is the authority. Use the Pāli text to decide what content belongs to each segment

Sujato is only a guide. His segmentation helps you understand boundaries, but the target translator may split/merge differently.

MATCH THE PALI STRUCTURE. The OUTPUT must mirror the structure of the PALI segment:
a) If the Pali contains "…" or is abbreviated:
→ Extract ONLY the corresponding term(s) from the translator
→ Do NOT expand to the full sentence
→ Example: Pali "viññātaṁ …" + Sujato "the known …" → Output just "the cognized" (not the full paragraph)
b) If the Pali is ful...

Minimal faithful extraction. Extract the smallest text that expresses the Pāli meaning. Prefer contiguous substrings from the original.

Respect text order. Process the translation in order. Don't reuse non-repetitive text

Filter noise. Discard footnotes, section headers (unless part of translation), editor notes, bracketed references.

Null policy. Output null ONLY if you genuinely cannot find matching content. Remember: abbreviated Pali → short output (just the term).

OUTPUT FORMAT (STRICT JSON)
Return valid JSON with exactly the same keys as the input, in the same order. Each key maps to either a string (extracted text) or null.
Example: { "mn1:3.1": "Here, monks, an untaught ordinary...
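The output contract stated in the prompt (valid JSON, exactly the input keys in the input order, each value a string or null) can be checked mechanically before any extracted text is used. The following is a minimal sketch in Python under those stated rules only; the function name `validate_extraction`, the error messages, and the second segment ID in the usage note are illustrative assumptions, not part of the paper's pipeline.

```python
import json


def validate_extraction(input_keys, raw_output):
    """Return a list of contract violations (an empty list means the output passes).

    Hypothetical helper: checks only the rules quoted in the prompt.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    # Python dicts preserve insertion order, so this compares both key set and order.
    if list(data.keys()) != list(input_keys):
        problems.append("keys differ from the input keys or are out of order")

    # Each value must be extracted text (a string) or null (None after parsing).
    for key, value in data.items():
        if value is not None and not isinstance(value, str):
            problems.append(f"{key}: value must be a string or null")
    return problems
```

For example, calling `validate_extraction(["mn1:3.1", "mn1:3.2"], '{"mn1:3.1": "some text", "mn1:3.2": null}')` returns an empty list, while an output that reorders the keys or maps a key to a number returns one message per violation. Note that `json.loads` silently keeps only the last value for a repeated key, so duplicate keys in the model output are not detected by this sketch.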
