From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
Pith reviewed 2026-06-28 17:09 UTC · model grok-4.3
The pith
Drift from a multi-translator reference centroid predicts major-error rates in LLM Pali-to-English translations, but outlier status alone is not evidence of a mistake.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining a local reference envelope from three human translations and measuring normalized embedding drift, the authors show that drift works as a severity predictor rather than an error label. The major-error rate among adjudicated high-drift candidates rises monotonically across drift bands, from 7.9 percent in the 1.5-2.0 band to 51.6 percent above 3.0, and approximately 80 percent of 1.5-2.0 outliers are valid variations. Model differences concentrate in the high-drift tail, where Grok 4.3 records the highest major-error rates (27.6 percent overall, 74.4 percent above drift 3.0).
What carries the argument
Normalized embedding drift from the multi-translator reference centroid, used as a triage signal before blinded three-model LLM judge panel adjudication calibrated on a 300-instance validation set.
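The paper summary does not define the normalization, and the referee report flags the same gap. The following is therefore only a minimal sketch of one plausible reading, not the authors' method: drift is taken as the cosine distance from a candidate embedding to the centroid of the three reference embeddings, divided by the mean distance of the references to their own centroid. The function names and the choice of cosine distance are assumptions made for illustration.

```python
import math

def _centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def _cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalized_drift(candidate, references):
    """ASSUMPTION: normalize by the mean reference-to-centroid distance.

    A value near 1 means the candidate sits about as far from the centroid
    as a typical human translation does; larger values mean it sits outside
    the envelope spanned by the references.
    """
    center = _centroid(references)
    spread = sum(_cosine_distance(r, center) for r in references) / len(references)
    if spread < 1e-12:
        raise ValueError("references are (nearly) identical; drift is undefined")
    return _cosine_distance(candidate, center) / spread
```

Under this reading, a threshold such as 1.5 would mean "one and a half times the typical human-to-centroid distance". Whether that matches the paper's scale cannot be confirmed from the text reviewed here.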
If this is right
- Major-error rates increase steadily with higher drift thresholds.
- Most candidates in the 1.5-2.0 drift band represent acceptable translation variations.
- Grok 4.3 produces both more high-drift outliers and a higher proportion of major errors in the tail than the other three models.
- The dominant major-error categories include omission or truncation and doctrinal term errors.
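The band-level claims above can be checked with a simple tally over adjudicated candidates. The sketch below assumes each record is a (drift, label) pair. The band edges 1.5, 2.0, and 3.0 come from the reported bands, while the middle 2.0-3.0 band, the inclusive lower boundaries, and the label string "major" are assumptions for illustration.

```python
from bisect import bisect_right

# ASSUMPTION: the paper reports 1.5-2.0 and "above 3.0"; the middle band and
# the inclusive lower edges are inferred, not taken from the source.
BAND_EDGES = [1.5, 2.0, 3.0]
BAND_NAMES = ["1.5-2.0", "2.0-3.0", ">=3.0"]

def major_error_rate_by_band(records):
    """Return {band: major-error rate}, or None for a band with no candidates."""
    counts = {name: [0, 0] for name in BAND_NAMES}  # [major, total]
    for drift, label in records:
        idx = bisect_right(BAND_EDGES, drift) - 1
        if idx < 0:
            continue  # below the triage threshold, so never adjudicated
        tally = counts[BAND_NAMES[idx]]
        tally[1] += 1
        if label == "major":
            tally[0] += 1
    return {name: (m / t if t else None) for name, (m, t) in counts.items()}

def is_monotonic_non_decreasing(rates):
    """True if the non-empty band rates never fall as drift increases."""
    vals = [r for r in rates if r is not None]
    return all(a <= b for a, b in zip(vals, vals[1:]))
```

Passing `major_error_rate_by_band(records).values()` to `is_monotonic_non_decreasing` gives a quick check of the monotonic trend on any newly adjudicated sample.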
Where Pith is reading between the lines
- The triage workflow could extend to other classical languages that have multiple authoritative human translations to prioritize review effort.
- Embedding-based drift may offer a more stable signal than single-reference metrics when variation is expected.
- Threshold tuning on new corpora could balance review volume against capture of severe errors.
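The threshold-tuning idea in the last point can be sketched as a sweep over candidate thresholds on a small pilot sample that has been labeled in full. This is an illustration only: it assumes labels exist below the threshold as well, which the paper's design does not provide, since only candidates above 1.5 were adjudicated. The function and field names are hypothetical.

```python
def sweep_thresholds(records, thresholds):
    """records: (drift, is_major) pairs from a fully labeled pilot sample.

    For each threshold, report how many candidates would be sent to review
    and what share of all known major errors the review would capture.
    """
    total_major = sum(1 for _, is_major in records if is_major)
    rows = []
    for t in thresholds:
        flagged = [(d, m) for d, m in records if d >= t]
        caught = sum(1 for _, m in flagged if m)
        rows.append({
            "threshold": t,
            "review_volume": len(flagged),
            "major_capture": caught / total_major if total_major else None,
        })
    return rows
```

Reading the rows side by side shows the trade-off directly: raising the threshold shrinks the review queue but can let major errors slip through.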
Load-bearing premise
The blinded three-model LLM judge panel, after calibration against author-adjudicated examples, reliably distinguishes legitimate translation variations from major errors.
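How the three judges' votes are combined is not stated in the material reviewed here. As a hedged illustration of what the premise depends on, the sketch below assumes a simple majority rule with escalation on three-way disagreement, and measures panel accuracy against a human-labeled calibration set. The rule, the label strings, and the function names are all assumptions.

```python
from collections import Counter

def panel_verdict(votes):
    """ASSUMPTION: majority of three judges wins; a three-way split escalates."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "escalate"

def calibration_accuracy(panel_votes, human_labels):
    """Share of calibration items where the panel verdict matches the human label."""
    if len(panel_votes) != len(human_labels) or not human_labels:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(panel_verdict(v) == h for v, h in zip(panel_votes, human_labels))
    return hits / len(human_labels)
```

Raw accuracy of this kind is only a first check; it does not correct for chance agreement, which is why the referee asks for kappa.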
What would settle it
Large-scale human re-adjudication of the high-drift candidates that shows no monotonic rise in major-error rate with drift bands or no difference in tail error rates across the four models.
Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper audits Pali-to-English translations from four LLMs (GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.3) on 1,700 Pali Canon passages. It defines a local reference envelope from three human translations, uses normalized embedding drift above a 1.5 threshold to triage 1,203 outliers, and adjudicates them via a blinded three-model LLM judge panel calibrated on a 300-instance author-adjudicated validation set. Central claims are that drift predicts error severity (major-error rate rises monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, with ~80% of low-band outliers judged valid variations) and that model differences appear in the high-drift tail (GPT-5.5 lowest major-error rate; Grok 4.3 highest at 27.6% overall and 74.4% above drift 3.0).
Significance. If the adjudication holds, the work supplies a practical, reusable audit design for classical-language translation that avoids single-gold-standard assumptions and focuses review effort on the error-prone tail. The monotonic severity prediction and tail-specific model comparisons would be directly useful for doctrinal text evaluation.
major comments (3)
- [adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.
- [results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.
- [drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.
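The agreement metric requested in the first major comment can be computed directly once panel labels and human labels exist for the same held-out items. The following is a minimal sketch of Cohen's kappa for two raters; the label values are hypothetical, and nothing here reflects the paper's actual agreement figures, which are not reported.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Here `labels_a` would hold the panel verdicts and `labels_b` the independent human labels for the same items, in the same order.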
minor comments (2)
- [abstract] The abstract refers to 'normalized embedding drift' without an equation or definition of the normalization; add a short formal definition or reference in the methods.
- [results] No table or figure caption clarifies the exact band boundaries or sample sizes per band; a small summary table would improve readability of the 7.9% / 51.6% figures.
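A per-band summary of the kind requested in the second minor comment could pair each rate with its sample size and an interval. The sketch below uses the Wilson score interval as one common choice for a binomial proportion; the review text does not confirm which interval the paper uses, and any counts passed in are placeholders because per-band sample sizes are not reported here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def band_summary_row(band, majors, n):
    """One line of a summary table: band, sample size, rate, interval."""
    lo, hi = wilson_interval(majors, n)
    return f"{band:<8} n={n:<5} major={majors / n:6.1%}  95% CI [{lo:.1%}, {hi:.1%}]"
```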
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The three major comments identify important gaps in reproducibility and statistical rigor. We address each below and will revise the manuscript accordingly.
- Referee: [adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.
Authors: We agree that the manuscript currently lacks these details. In the revised version we will add a dedicated methods subsection specifying the exact three models used in the judge panel, the full adjudication prompts, and agreement statistics (including Cohen's kappa) between the panel and the author-adjudicated validation set on held-out instances. revision: yes
- Referee: [results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.
Authors: We acknowledge the omission of formal tests. The revised manuscript will include chi-square tests for differences in major-error proportions across models together with bootstrap confidence-interval comparisons, with explicit reporting for the Grok 4.3 high-drift tail result. revision: yes
- Referee: [drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.
Authors: The comment is correct; the embedding model is not named and no sensitivity analysis appears. We will identify the model in the methods and add a sensitivity analysis that varies both the embedding model and the drift threshold to verify robustness of the monotonic severity trend. revision: yes
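The tests promised in the second response can be sketched for a single pair of models. The code below computes a Pearson chi-square statistic for a 2x2 table of major versus non-major counts and a percentile bootstrap interval for the difference in major-error rates. It is a sketch under stated assumptions: no continuity correction, one degree of freedom, and hypothetical counts, since per-model tail counts are not given in the material reviewed here.

```python
import math
import random

def chi_square_2x2(major_a, n_a, major_b, n_b):
    """Pearson chi-square for two proportions (df = 1, no continuity correction)."""
    table = [[major_a, n_a - major_a], [major_b, n_b - major_b]]
    rows = [n_a, n_b]
    cols = [major_a + major_b, (n_a - major_a) + (n_b - major_b)]
    total = n_a + n_b
    if min(rows) <= 0 or min(cols) <= 0:
        raise ValueError("every row and column total must be positive")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

def bootstrap_diff_ci(major_a, n_a, major_b, n_b, reps=2000, seed=0):
    """Percentile bootstrap 95% interval for rate_a - rate_b."""
    rng = random.Random(seed)
    a = [1] * major_a + [0] * (n_a - major_a)
    b = [1] * major_b + [0] * (n_b - major_b)
    diffs = []
    for _ in range(reps):
        rate_a = sum(rng.choices(a, k=n_a)) / n_a
        rate_b = sum(rng.choices(b, k=n_b)) / n_b
        diffs.append(rate_a - rate_b)
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps) - 1]
```

With four models there are six pairwise comparisons, so a revised manuscript would also need to say how it handles multiple testing; that choice is outside this sketch.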
Circularity Check
No circularity: results derive from independent human references and separate adjudication
The paper uses multi-human reference translations to compute embedding drift as a triage signal only, then adjudicates flagged candidates via an LLM panel calibrated on a distinct 300-instance author-adjudicated validation set. Major-error rates by drift band and model comparisons are outputs of this external adjudication process, not reductions of fitted parameters, self-definitions, or self-citation chains. No equations or steps equate the reported rates to the drift metric by construction; the methodology is self-contained against the provided human references and validation set.
Axiom & Free-Parameter Ledger
free parameters (2)
- drift threshold = 1.5
- reporting bands (1.5-2.0, above 3.0)
axioms (2)
- domain assumption: Normalized embedding drift from the multi-reference centroid is a valid proxy for translation deviation severity
- domain assumption: The three-model LLM judge panel, after calibration on the author-adjudicated set, produces reliable labels for valid variation versus major error