arxiv: 2606.26040 · v1 · pith:BGGD3XDMnew · submitted 2026-06-24 · 💻 cs.CL

AI translation of literary texts is "fine", but readers still prefer human translations

Pith reviewed 2026-06-25 19:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords literary translation · machine translation evaluation · human preferences · reader study · comparative reading · LLM output · immersiveness

The pith

Readers prefer human literary translations over AI versions for ease, clarity and immersion, even when they cannot tell them apart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to test whether machine translations of recent novels meet the standards readers expect in literary prose. It reports that 15 avid readers found AI output acceptable overall but chose the human versions more often in direct comparisons, citing better flow and engagement. The advantage for human translation was modest in full-excerpt readings and clearer when readers examined short paired passages. Automatic scoring systems, including LLM judges, did not match these reader judgments. The work also supplies a public dataset of reader comments and annotations for future studies.

Core claim

Across 30 excerpt-level comparisons and 772 chunk-level comparisons, readers favored the human translations in 19 of 30 (about 63%) and 522 of 772 (about 68%) cases respectively, describing them as easier, clearer, and more immersive. They identified the human version correctly only 17 times out of 30 and showed a tendency to prefer whichever text they believed to be human. Machine translations displayed greater internal quality variation than human ones. Standard automatic metrics, including LLM-as-a-judge methods, did not recover the reader preferences and instead favored the machine output.
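
As a rough check on how much weight these counts can bear, an exact two-sided binomial test against a 50/50 null can be run with the Python standard library. This is a sketch under a strong assumption that the paper does not make: that every comparison is an independent coin flip. In the study, judgments are clustered within readers and books, so the chunk-level p-value in particular is likely optimistic. The function name `two_sided_binomial_p` is hypothetical.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value against a 50/50 null.

    The null distribution is symmetric, so the two-sided p-value is
    twice the upper tail starting at the more extreme count, capped at 1.
    """
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1))
    return min(1.0, 2 * tail / 2 ** trials)


# Counts as reported in the summary above.
reported = {
    "excerpt-level HT preference": (19, 30),
    "chunk-level HT preference": (522, 772),
    "correct source guesses": (17, 30),
}

for name, (k, n) in reported.items():
    # ASSUMPTION: independent comparisons; the real data are clustered.
    print(f"{name}: {k}/{n}, p = {two_sided_binomial_p(k, n):.3g}")
```

Under the independence assumption, 19/30 (p about 0.20) and 17/30 (p about 0.58) are both compatible with chance, while 522/772 is far from it. A mixed-effects model with reader and book as grouping factors would be the more defensible analysis for the chunk-level data.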

What carries the argument

A controlled reader study protocol that collects preferences, source guesses, and span annotations from immersive full-excerpt reading and from close examination of aligned human-machine text chunks.

Load-bearing premise

The agentic LLM pipeline used to produce the machine translations stands in for current best AI literary translation, and the judgments of these 15 readers apply beyond the tested books and languages.

What would settle it

A replication using a different leading LLM pipeline or a larger and more diverse set of readers that finds equal or higher preference for the machine translations would undercut the generality of the reported preference pattern.
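
One way to make "equal or higher preference for the machine translations" concrete is to ask which human-preference rates the reported counts are compatible with. The sketch below computes 95% Wilson score intervals. Treating comparisons as independent is an assumption made here for illustration, and `wilson_interval` is a hypothetical helper, not part of the paper's analysis.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Share of comparisons in which the human translation was preferred.
for label, k, n in [("excerpt-level", 19, 30), ("chunk-level", 522, 772)]:
    lo, hi = wilson_interval(k, n)
    # ASSUMPTION: independent comparisons; clustering would widen these.
    print(f"{label}: {k}/{n} -> [{lo:.3f}, {hi:.3f}]")
```

With these inputs the excerpt-level interval runs from roughly 0.46 to 0.78 and so includes 0.5, while the chunk-level interval runs from roughly 0.64 to 0.71 and excludes it. On this reading, a replication showing no human advantage would conflict with the chunk-level result much more sharply than with the excerpt-level one.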

Figures

Figures reproduced from arXiv: 2606.26040 by Adam Podoxin, Maite Taboada, Marzena Karpinska, Roman Grundkiewicz, Ty Brassington, Yves Ferstler.

Figure 1. Evaluation pipeline: Avid readers of published fiction evaluate two versions of an 8,000-word book excerpt: a human translation (HT) and an AI-generated machine translation (MT). Participants (1) read the first translation, (2) complete a perception questionnaire, (3) read the competing translation, (4) complete a second questionnaire, (5) compare both versions, (6) take a one-day break, and (7) perfor…
Figure 2. Agentic literary MT pipeline used in this study. Source excerpts are chunked and paired with style…
Figure 3. Study design and evaluation counts. Two readers evaluated each of the 15 book excerpts (30 book-reader evaluations). same excerpt at the chunk level, with 300-word MT and HT chunks presented side by side for a total of 772 comparisons (close reading; 386 aligned HT–MT chunk pairs, each judged by both readers)
Figure 4. The distribution of readers’ preferences be…
Figure 5. The distribution of readers’ ratings after im…
Figure 6. The distribution of readers’ excerpt-level pref…
Figure 7. Distribution of good and poor span highlights by source language and translation type.
[Plot text extracted on the same page: x-axis "Poor-highlighted words per 1K threshold" (50, 100, 150, 200); y-axis "% of chunks at or above threshold"; title "MT has more chunks with dense poor spans"; series HT and MT; values shown: 31.3%, 11.9%, 5.7%, 2.8% and 70.5%, 41.7%, 23.8%, 10.6%]
Figure 8. Share of close-reading chunks with dense…
Figure 9. Distribution of span-level highlights in the…
Figure 10. Examples of span-level preference evidence from the side-by-side chunks evaluation. Participants…
Figure 11. The most frequent labels in readers’ comments about the…
Figure 13. Machine-translation (MT) identification ac…
Figure 14. Guidelines provided to the participants for the evaluation and annotation tasks (pages 1–4). See…
Figure 15. Guidelines provided to the participants for the evaluation and annotation tasks (pages 5–8). See…
Figure 16. Guidelines provided to the participants for the evaluation and annotation tasks (pages 9–12).
Figure 17. Single-reading questionnaire shown after an immersive reading. Participants rated fluency, literary…
Figure 18. Human-evaluation questionnaire interfaces used after paired readings. The comparison form records…
Figure 19. Median span highlight length in the mul…
Figure 20. Span-level annotations by target language in…
Figure 22. MT identification by book. Orange bars show…
Figure 23. Positive and negative aspects mentioned by participants when comparing machine translation (MT) and…
Figure 24. Positive reasons readers reported for preferring one translation over the other, separated by preferred…
Figure 25. Preference mechanisms explaining why the chosen translation was preferred. Diverging bars contrast…
Figure 26. Close-reading chunk-level preferred translation. Each cell shows one chunk comparison for an excerpt…
Figure 27. Preferred translation by excerpt and reader. Bars show the share of chunk-level choices favoring MT on…
Figure 28. Per-book immersive-reading ratings for HT and MT. The figure breaks down participant ratings by book,…
Figure 29. Immersive-reading ratings by presentation order. The figure compares ratings assigned after the first and…
Figure 30. Translation-origin guess flows after the comparison task. The first panel summarizes whether guesses…
Figure 32. Close-reading preferences in the multilin…
Figure 33. More likely AI-translated choices in the…
Figure 34. Close-reading chunk-level preferred transla…
Figure 35. Chunk-level preferred translation in the multilingual target-language case study. Each row shows one…

Original abstract

AI translation of literary works is increasingly common. While the content may be rendered adequately, we do not know enough about how readers experience it in terms of immersiveness and literary effect, aspects poorly captured by automatic machine translation metrics or human evaluation targeting fluency and adequacy. We ask 15 avid readers to compare recently published human translations (HT) to machine translations (MT) generated with an agentic large language model (LLM)-based pipeline, for 15 recent novels in French, Polish, and Japanese and translated into English. Readers evaluated approximately 8K-word excerpts in two conditions: immersive reading of the whole excerpt (30 comparisons) and close reading of 386 aligned HT-MT chunk pairs (772 comparisons), with two readers per book and in alternating order of presentation. Overall, readers find MT "fine", but prefer HT (slightly at excerpt-level 19/30, more clearly at chunk-level 522/772) for its ease, clarity, and immersive nature. Readers' highlights show that MT's quality varies more within one book than HT's does. Crucially, readers cannot reliably tell the two apart (17/30 guess correctly) and tend to prefer the version they believe to be human. Automatic metrics, including LLM-as-a-judge approaches, fail to recover reader preferences and favor MT. We release LAIT (Literary AI Translation), a reader-centered evaluation dataset with 1K reader comments, 2K judgments and preference ratings, and 7.2K span-level annotations, along with our evaluation protocol and supporting interface.
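
The claim that MT quality varies more within a book than HT quality can be illustrated with a density measure over reader highlights. The sketch below is an assumption-laden illustration, not the paper's computation: it takes "poor-highlighted words per 1K words" per chunk, reports the share of chunks at or above a threshold, and compares the spread across chunks. The `Chunk` class, both helper functions, and all numbers in the toy data are invented for this example; only the 300-word chunk size comes from the study description.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Chunk:
    system: str      # "HT" or "MT"
    n_words: int     # words in the chunk
    poor_words: int  # words inside spans a reader marked as poor


def poor_per_1k(chunk: Chunk) -> float:
    """Poor-highlighted words per 1,000 words of the chunk."""
    return 1000 * chunk.poor_words / chunk.n_words


def share_at_or_above(chunks, system: str, threshold: float) -> float:
    """Percentage of one system's chunks whose poor density meets the threshold."""
    rates = [poor_per_1k(c) for c in chunks if c.system == system]
    return 100 * sum(r >= threshold for r in rates) / len(rates)


def spread(chunks, system: str) -> float:
    """Population standard deviation of poor density across a system's chunks."""
    return pstdev(poor_per_1k(c) for c in chunks if c.system == system)


# TOY DATA, invented for illustration only (not from the LAIT dataset).
chunks = [
    Chunk("HT", 300, 6), Chunk("HT", 300, 18), Chunk("HT", 300, 0), Chunk("HT", 300, 9),
    Chunk("MT", 300, 30), Chunk("MT", 300, 3), Chunk("MT", 300, 45), Chunk("MT", 300, 12),
]

for system in ("HT", "MT"):
    print(system,
          f"share >= 50 per 1K: {share_at_or_above(chunks, system, 50):.1f}%",
          f"spread: {spread(chunks, system):.1f}")
```

In the toy data, MT has both a larger share of dense-poor chunks and a larger spread, which is the shape of result the abstract describes. Whether the released annotations support the same computation depends on how spans and chunk boundaries are stored.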

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript reports results from a human-subjects study in which 15 avid readers compared recently published human translations (HT) against machine translations (MT) generated by a single agentic LLM-based pipeline. The evaluation covers ~8K-word excerpts from 15 novels in French, Polish, and Japanese translated to English, using both immersive whole-excerpt reading (30 comparisons) and close reading of 386 aligned chunk pairs (772 comparisons). Readers rate MT as 'fine' but prefer HT (19/30 excerpt-level, 522/772 chunk-level) for ease, clarity, and immersiveness; they cannot reliably distinguish the two (17/30 correct guesses) and favor the version they believe is human. Automatic metrics, including LLM-as-judge, fail to recover these preferences. The authors release the LAIT dataset containing 1K reader comments, 2K judgments, and 7.2K span annotations together with the evaluation protocol.

Significance. If the reported preference patterns and indistinguishability result generalize, the work supplies concrete reader-centered evidence on literary aspects of translation quality that automatic metrics miss, together with a publicly released dataset that supports reproducibility and follow-on studies. The explicit contrast between excerpt-level and chunk-level judgments and the observation that MT quality varies more within books than HT does are useful empirical contributions.

major comments (1)
  1. [Abstract and Evaluation section] The central claims that readers prefer HT (19/30 and 522/772) yet cannot reliably distinguish MT from HT (17/30) rest on data from only 15 readers and a single agentic LLM pipeline. Because the paper itself notes that MT quality varies more within books than HT does, the absence of any comparison to alternative MT pipelines or a larger reader cohort makes the broader statements about AI versus human literary translation vulnerable to the specific implementation choices; this is load-bearing for the generalizability of the preference and indistinguishability results.
minor comments (3)
  1. [Abstract] The description of chunk alignment, reader recruitment criteria, exact MT generation parameters, and any statistical testing of the reported counts is absent, hindering replication.
  2. [Abstract] The alternating order of presentation is mentioned, but no detail is given on how order effects or fatigue were controlled across the two readers per book.
  3. [Dataset release statement] While the LAIT dataset is a strength, the paper should specify the exact license, file formats, and whether the 7.2K span-level annotations include the original text spans or only offsets.
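
The distinction raised in the last minor comment, spans stored as text versus spans stored only as offsets, can be sketched as follows. The record layout is hypothetical: the field names, the `SpanAnnotation` class, and `resolve_text` are assumptions for illustration and do not describe the actual LAIT file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanAnnotation:
    chunk_id: str                # hypothetical identifier for the annotated chunk
    start: int                   # character offset, inclusive
    end: int                     # character offset, exclusive
    label: str                   # e.g. "good" or "poor"
    text: Optional[str] = None   # highlighted text, if the release includes it


def resolve_text(span: SpanAnnotation, chunk_text: str) -> str:
    """Recover the highlighted text from offsets and cross-check stored text."""
    recovered = chunk_text[span.start:span.end]
    if span.text is not None and span.text != recovered:
        raise ValueError(f"offsets and stored text disagree in {span.chunk_id}")
    return recovered


# Invented example sentence, not taken from any of the studied novels.
chunk = "The rain fell softly on the tin roof."
offsets_only = SpanAnnotation("book01-chunk003", 9, 20, "good")
with_text = SpanAnnotation("book01-chunk003", 9, 20, "good", text="fell softly")

print(resolve_text(offsets_only, chunk))
print(resolve_text(with_text, chunk))
```

An offsets-only release is usable only if the underlying translation text ships alongside it, which is why the license and file-format questions in the comment matter for reuse.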

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to qualify claims about generalizability. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central claims that readers prefer HT (19/30 and 522/772) yet cannot reliably distinguish MT from HT (17/30) rest on data from only 15 readers and a single agentic LLM pipeline. Because the paper itself notes that MT quality varies more within books than HT does, the absence of any comparison to alternative MT pipelines or a larger reader cohort makes the broader statements about AI versus human literary translation vulnerable to the specific implementation choices; this is load-bearing for the generalizability of the preference and indistinguishability results.

    Authors: We agree that the modest reader sample and single pipeline constrain broad generalizations, and that the noted intra-book MT variation makes pipeline choice relevant. The study is framed as an initial reader-centered exploration using a current state-of-the-art agentic approach rather than a comprehensive survey of all MT systems. We will revise the abstract and Evaluation section to explicitly qualify the central claims as applying to the tested pipeline and cohort, while retaining the reported counts. We will also expand the Limitations section to discuss the implications of these design choices and the value of follow-up work with additional pipelines and larger reader groups. These textual changes will better align the stated scope with the data without requiring new experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical human-subjects study with direct observations

Full rationale

This paper is a reader study reporting preference counts (19/30 excerpt-level, 522/772 chunk-level), indistinguishability rates (17/30), and qualitative comments from 15 participants evaluating fixed HT/MT pairs. These quantities are direct tallies of participant responses rather than quantities derived from equations, fitted parameters, or self-citations. No derivation chain exists; the central claims rest on the collected data itself, with no reduction of results to inputs by construction. The paper is self-contained against external benchmarks as a straightforward empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is present; the work rests on standard assumptions of human-subjects research such as reader honesty and excerpt representativeness.

pith-pipeline@v0.9.1-grok · 5834 in / 1140 out tokens · 43666 ms · 2026-06-25T19:16:59.263471+00:00 · methodology

Reference graph

Works this paper leans on

110 extracted references · 33 canonical work pages

  1. [2]

    Mike Allen. 2017. https://us.sagepub.com/en-us/nam/the-sage-encyclopedia-of-communication-research-methods/book244974 The SAGE encyclopedia of communication research methods . SAGE Publications, Inc, 2455 Teller Road, Thousand Oaks California 91320

  2. [3]

    Amazon Staff . 2025. Amazon introduces Kindle Translate , an AI -powered translation service for authors to reach global readers. https://www.aboutamazon.com/news/books-and-authors/amazon-kindle-translate-books-authors. Accessed: 2026-05-22

  3. [6]

    Antonio Castaldo, Sheila Castilho, Joss Moorkens, and Johanna Monti. 2025. https://aclanthology.org/2025.mtsummit-1.40/ Extending CREAMT : Leveraging large language models for literary translation post-editing . In Proceedings of Machine Translation Summit XX: Volume 1, pages 506--515, Geneva, Switzerland. European Association for Machine Translation

  4. [7]

    Ella Creamer. 2024. https://www.theguardian.com/books/2024/nov/04/dutch-publisher-to-use-ai-to-translate-books-into-english-veen-bosch-keuning-artificial-intelligence Dutch publisher to use AI to translate limited number of books into English . The Guardian

  5. [8]

    Bradley Emi and Max Spero. 2024. https://arxiv.org/abs/2402.14873 Technical report on the pangram ai-generated text classifier . Preprint, arXiv:2402.14873

  6. [10]

    Kyo Gerrits and Ana Guerberof-Arenas. 2025. To mt or not to mt: An eye-tracking study on the reception by dutch readers of different translation and creativity levels. In Proceedings of Machine Translation Summit XX: Volume 1, pages 516--537

  7. [12]

    GlobeScribe.AI

    GlobeScribe.AI Ltd . GlobeScribe.AI . https://globescribe.ai/. Accessed: 2026-05-22

  8. [13]

    Google DeepMind . 2026. Gemini 3.1 Pro Model Card . https://deepmind.google/models/model-cards/gemini-3-1-pro/. Accessed 2026-05-26

  9. [14]

    Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. https://aclanthology.org/W13-2305/ Continuous measurement scales in human evaluation of machine translation . In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33--41, Sofia, Bulgaria. Association for Computational Linguistics

  10. [15]

    Melanie C Green and Timothy C Brock. 2000. The role of transportation in the persuasiveness of public narratives. Journal of personality and social psychology, 79(5):701

  11. [16]

    Ana Guerberof-Arenas and Antonio Toral. 2020. The impact of post-editing and machine translation on creativity and reading experience. Translation Spaces, 9(2):255--282

  12. [17]

    Ana Guerberof-Arenas and Antonio Toral. 2022. Creativity in translation: Machine translation as a constraint for literary texts. Translation spaces, 11(2):184--212

  13. [18]

    Ana Guerberof-Arenas and Antonio Toral. 2024. To be or not to be: A translation reception study of a literary text translated into dutch and catalan using machine translation. Target, 36(2):215--244

  14. [20]

    Kilem Li Gwet. 2021. Handbook of inter-rater reliability. Advanced Analytics

  15. [21]

    Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, and Mirella Lapata. 2025. https://openreview.net/forum?id=HfWcFs7XLR Agents' room: Narrative generation through multi-step collaboration . In The Thirteenth International Conference on Learning Representations

  16. [26]

    M. G. Kendall. 1938. https://doi.org/10.1093/biomet/30.1-2.81 A new measure of rank correlation . Biometrika, 30(1--2):81--93

  17. [27]

    Dorothy Kenny and Marion Winters. 2020. Machine translation, ethics and the literary translator’s voice. Translation Spaces, 9(1):123--149

  18. [30]

    Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, and Markus Freitag. 2025. https://openreview.net/forum?id=MpjtvkvXDo Overestimation in LLM evaluation: A controlled large-scale study on data contamination s impact on machine translation . In Forty-second International Conference on Machine Learning

  19. [31]

    Samuel L \"a ubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human--machine parity in language translation. Journal of artificial intelligence research, 67:653--672

  20. [32]

    Samuel L \"a ubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a case for document-level evaluation. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 4791--4796

  21. [33]

    Arle Richard Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2013. https://aclanthology.org/2013.tc-1.6/ Multidimensional quality metrics: a flexible system for assessing translation quality . In Proceedings of Translating and the Computer 35, London, UK. Aslib

  22. [35]

    Evgeny Matusov. 2019. The challenges of using neural machine translation for literature. In Proceedings of the qualities of literary machine translation, pages 10--19

  23. [36]

    Joss Moorkens, Antonio Toral, Sheila Castilho, and Andy Way. 2018. Translators’ perceptions of literary post-editing using statistical and neural machine translation. Translation Spaces, 7(2):240--262

  24. [37]

    Annu Nishioka. 2024. https://asia.nikkei.com/Business/Media-Entertainment/Japanese-publisher-to-launch-light-novel-app-with-AI-assisted-translations Japanese publisher to launch `light novel' app with AI -assisted translations . Nikkei Asia. Accessed: 2026-05-24

  25. [39]

    Chau Minh Pham, Yapei Chang, and Mohit Iyyer. 2026. https://github.com/AutoFiction-AI/autofiction Autofiction pipeline . Research pipeline for long-form AI novel generation

  26. [40]

    Barbara Plank. 2022. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 10671--10682

  27. [42]

    Brian Porter and Edouard Machery. 2024. Ai-generated poetry is indistinguishable from human-written poetry and is rated more favorably. Scientific Reports, 14(1):26133

  28. [43]

    Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp), pages 2685--2702

  29. [47]

    Kristiina Taivalkoski-Shilov. 2019 a . Ethical issues regarding machine (-assisted) translation of literary texts. Perspectives, 27(5):689--703

  30. [48]

    Kristiina Taivalkoski-Shilov. 2019 b . Free indirect discourse: an insurmountable challenge for literary mt systems? In Proceedings of the qualities of literary machine translation, pages 35--39

  31. [55]

    Rebecca Webster, Margot Fonteyne, Arda Tezcan, Lieve Macken, and Joke Daems. 2020. Gutenberg goes neural: Comparing features of D utch human translations with raw neural machine translation outputs in a corpus of E nglish literary classics. In Informatics, volume 7, page 32. MDPI

  32. [59]

    Tiffany Zhu, Iain Weissburg, Kexun Zhang, and William Yang Wang. 2025. Human bias in the face of ai: Examining human judgment against text labeled as ai generated. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25907--25914

  33. [61]

    The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing

    Marco, Guillermo and Gonzalo, Julio and Fresno, V \'i ctor. The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1304

  34. [62]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  35. [63]

    Publications Manual , year = "1983", publisher =

  36. [64]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  37. [65]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  38. [66]

    Dan Gusfield , title =. 1997

  39. [67]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  40. [68]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  41. [69]

    A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples , volume =

    Wongpakaran, Nahathai and Wongpakaran, Tinakon and Wedding, Danny and Gwet, Kilem L , year =. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples , volume =. BMC Medical Research Methodology , publisher =. doi:10.1186/1471-2288-13-61 , number =

  42. [70]

    Richard and Koch, Gary G

    Landis, J. Richard and Koch, Gary G. , year =. The Measurement of Observer Agreement for Categorical Data , volume =. Biometrics , publisher =. doi:10.2307/2529310 , number =

  43. [71]

    Handbook of inter-rater reliability

    Gwet, Kilem Li. Handbook of inter-rater reliability

  44. [72]

    The Guardian , year =

    Creamer, Ella , title =. The Guardian , year =

  45. [73]

    2025 , month = nov, note =

    Amazon Introduces. 2025 , month = nov, note =

  46. [74]

    Marzena Karpinska and Katherine Thai and Kalpesh Krishna and John Wieting and Moira Inghilleri and Mohit Iyyer , month =

  47. [75]

    Khoong, William D

    Carpuat, Marine and Asscher, Omri and Bali, Kalika and Bentivogli, Luisa and Blain, Fr \'e d \'e ric and Bowker, Lynne and Choudhury, Monojit and Daum \'e III, Hal and Duh, Kevin and Gao, Ge and Grissom II, Alvin and Karpinska, Marzena and Khoong, Elaine C. and Lewis, William D. and Martins, Andr \'e F. T. and Nurminen, Mary and Oard, Douglas W. and Popov...

  48. [76]

    Nikkei Asia , year =

    Nishioka, Annu , title =. Nikkei Asia , year =

  49. [77]

    Bleu: a method for automatic evaluation of machine translation

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

  50. [78]

    BLEURT : Learning Robust Metrics for Text Generation

    Sellam, Thibault and Das, Dipanjan and Parikh, Ankur. BLEURT : Learning Robust Metrics for Text Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.704

  51. [79]

    BlonDe : An Automatic Evaluation Metric for Document-level Machine Translation

    Jiang, Yuchen and Liu, Tianyu and Ma, Shuming and Zhang, Dongdong and Yang, Jian and Huang, Haoyang and Sennrich, Rico and Cotterell, Ryan and Sachan, Mrinmaya and Zhou, Ming. BlonDe : An Automatic Evaluation Metric for Document-level Machine Translation. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational...

  52. [80]

    and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, Andr \'e F

    Guerreiro, Nuno M. and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, Andr \'e F. T. x COMET : Transparent Machine Translation Evaluation through Fine-grained Error Detection. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00683

  53. [81]

    and Zerva, Chrysoula and Farinha, Ana C and Maroti, Christine and C

    Rei, Ricardo and Treviso, Marcos and Guerreiro, Nuno M. and Zerva, Chrysoula and Farinha, Ana C and Maroti, Christine and C. de Souza, Jos \'e G. and Glushkova, Taisiya and Alves, Duarte and Coheur, Luisa and Lavie, Alon and Martins, Andr \'e F. T. C omet K iwi: IST -Unbabel 2022 Submission for the Quality Estimation Shared Task. Proceedings of the Sevent...

  54. [82]

    GEMBA - MQM : Detecting Translation Quality Error Spans with GPT -4

    Kocmi, Tom and Federmann, Christian. GEMBA - MQM : Detecting Translation Quality Error Spans with GPT -4. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.64

  55. [83]

    M etric X -23: The G oogle Submission to the WMT 2023 Metrics Shared Task

    Juraska, Juraj and Finkelstein, Mara and Deutsch, Daniel and Siddhant, Aditya and Mirzazadeh, Mehdi and Freitag, Markus. M etric X -23: The G oogle Submission to the WMT 2023 Metrics Shared Task. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.63

  56. [84]

    L i T rans P ro QA : An LLM -based Literary Translation Evaluation Metric with Professional Question Answering

    Zhang, Ran and Zhao, Wei and Macken, Lieve and Eger, Steffen. L i T rans P ro QA : An LLM -based Literary Translation Evaluation Metric with Professional Question Answering. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1482

  57. [85]

    arXiv preprint arXiv:2412.01340 , year=

    A 2-step framework for automated literary translation evaluation: Its promises and pitfalls , author=. arXiv preprint arXiv:2412.01340 , year=

  58. [86]

    Exploring

    Thai, Katherine and Karpinska, Marzena and Krishna, Kalpesh and Ray, Bill and Inghilleri, Moira and Wieting, John and Iyyer, Mohit. Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.672

  59. [87]

    How Good Are LLM s for Literary Translation, Really? Literary Translation Evaluation with Humans and LLM s

    Zhang, Ran and Zhao, Wei and Eger, Steffen. How Good Are LLM s for Literary Translation, Really? Literary Translation Evaluation with Humans and LLM s. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.n...

  60. [88]

    Multidimensional quality metrics: a flexible system for assessing translation quality

    Lommel, Arle Richard and Burchardt, Aljoscha and Uszkoreit, Hans. Multidimensional quality metrics: a flexible system for assessing translation quality. Proceedings of Translating and the Computer 35. 2013

  61. [89]

    (2021) ’Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation’, Transactions of the Association for Computational Linguistics , 9, pp

    Freitag, Markus and Foster, George and Grangier, David and Ratnakar, Viresh and Tan, Qijun and Macherey, Wolfgang. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics. 2021. doi:10.1162/tacl_a_00437

  62. [90]

    (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

    Wu, Minghao and Xu, Jiahao and Yuan, Yulin and Haffari, Gholamreza and Wan, Longyue and Luo, Weihua and Zhang, Kaifu. (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.25

  63. [91]

    Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing

    Castaldo, Antonio and Castilho, Sheila and Moorkens, Joss and Monti, Johanna. Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing. Proceedings of Machine Translation Summit XX: Volume 1. 2025

  64. [92]

    Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist

    Karpinska, Marzena and Iyyer, Mohit. Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.41

  65. [93]

    Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

    Wang, Longyue and Tu, Zhaopeng and Gu, Yan and Liu, Siyou and Yu, Dian and Ma, Qingsong and Lyu, Chenyang and Zhou, Liting and Liu, Chao-Hong and Ma, Yufeng and Chen, Weiyu and Graham, Yvette and Webber, Bonnie and Koehn, Philipp and Way, Andy and Yuan, Yulin and Shi, Shuming. Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A...

  66. [94]

    Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level

    Fonteyne, Margot and Tezcan, Arda and Macken, Lieve. Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

  67. [95]

    Project PiPeNovel: Pilot on Post-editing Novels

    Toral, Antonio and Wieling, Martijn and Castilho, Sheila and Moorkens, Joss and Way, Andy. Project PiPeNovel: Pilot on Post-editing Novels. Proceedings of the 21st Annual Conference of the European Association for Machine Translation. 2018

  68. [96]

    Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

    Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations. arXiv preprint arXiv:2605.13596

  69. [97]

    2026

    Pham, Chau Minh and Chang, Yapei and Iyyer, Mohit. 2026

  70. [98]

    Technical Report on the Pangram AI-Generated Text Classifier

    Technical Report on the Pangram AI-Generated Text Classifier. 2024

  71. [99]

    ordinal - Regression Models for Ordinal Data

    ordinal - Regression Models for Ordinal Data. 2023

  72. [100]

    Fitting Linear Mixed-Effects Models Using

    Douglas Bates and Martin M. Fitting Linear Mixed-Effects Models Using. Journal of Statistical Software

  73. [101]

    David R. Thomas (2006)

    David R. Thomas. American Journal of Evaluation. 2006. doi:10.1177/1098214005283748

  74. [102]

    People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

    Russell, Jenna and Karpinska, Marzena and Iyyer, Mohit. People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.267

  75. [103]

    Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks

    Jacovi, Alon and Caciularu, Avi and Goldman, Omer and Goldberg, Yoav. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.308

  76. [104]

    Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

    Kew, Tannon and Schottmann, Florian and Sennrich, Rico. Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed? Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.766

  77. [105]

    Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets

    Kocmi, Tom and Artemova, Ekaterina and Avramidis, Eleftherios and Bawden, Rachel and Bojar, Ondřej and Dranch, Konstantin and Dvorkovich, Anton and Dukanov, Sergey and Fishel, Mark and Freitag, Markus and Gowda, Thamme and Grundkiewicz, Roman and Haddow, Barry and Karpinska, Marzena and Koehn, Philipp and Lakougna, Howard and Lundin, Jessica and Monz, C...

  78. [106]

    MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

    Juraska, Juraj and Deutsch, Daniel and Finkelstein, Mara and Freitag, Markus. MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.35

  79. [107]

    February 2026

  80. [108]

    An Eye-Tracking Study of Equivalent Effect in Translation: The Reader Experience of Literary Style

    Walker, Callum. An Eye-Tracking Study of Equivalent Effect in Translation: The Reader Experience of Literary Style. doi:10.1007/978-3-030-55769-0
