AI translation of literary texts is "fine", but readers still prefer human translations

Adam Podoxin; Maite Taboada; Marzena Karpinska; Roman Grundkiewicz; Ty Brassington; Yves Ferstler

arxiv: 2606.26040 · v1 · pith:BGGD3XDMnew · submitted 2026-06-24 · 💻 cs.CL

Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska

Pith reviewed 2026-06-25 19:16 UTC · model grok-4.3

keywords: literary translation · machine translation evaluation · human preferences · reader study · comparative reading · LLM output · immersiveness

The pith

Readers prefer human literary translations over AI versions for ease, clarity and immersion, even when they cannot tell them apart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to test whether machine translations of recent novels meet the standards readers expect in literary prose. It reports that 15 avid readers found AI output acceptable overall but consistently chose the human versions in direct comparisons, citing better flow and engagement. The advantage for human translation appeared modestly in full-excerpt readings and more strongly when readers examined short paired passages. Automatic scoring systems and even LLM judges did not match these reader judgments. The work also supplies a public dataset of reader comments and annotations for future studies.

Core claim

Across 30 excerpt-level comparisons and 772 chunk-level comparisons, readers favored the human translations 19/30 and 522/772 times respectively, describing them as easier, clearer, and more immersive. They identified the human version correctly only 17 times out of 30 and showed a tendency to prefer whichever text they believed to be human. Machine translations displayed greater internal quality variation than human ones. Standard automatic metrics, including LLM-as-a-judge methods, did not recover the reader preferences and instead favored the machine output.
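
The counts above lend themselves to a quick sanity check. The sketch below is not taken from the paper: it assumes each comparison is an independent trial with probability 0.5 under the null hypothesis of no preference. That assumption is optimistic for the chunk-level data, since the 772 judgments come from 386 chunk pairs rated by two readers each and are clustered within books. The function name `binom_two_sided_p` is illustrative.

```python
from math import comb


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5.

    The null distribution is symmetric, so the two-sided p-value is
    twice the smaller tail probability, capped at 1.
    """
    upper = sum(comb(n, i) for i in range(k, n + 1))  # P(X >= k) * 2**n
    lower = sum(comb(n, i) for i in range(0, k + 1))  # P(X <= k) * 2**n
    return min(1.0, 2 * min(upper, lower) / 2 ** n)


# Counts as reported in the paper; the independence assumption is ours.
reported = {
    "excerpt-level HT preference": (19, 30),
    "chunk-level HT preference": (522, 772),
    "correct source guesses": (17, 30),
}

for label, (k, n) in reported.items():
    print(f"{label}: {k}/{n} = {k / n:.3f}, p = {binom_two_sided_p(k, n):.3g}")
```

Under these assumptions, the two 30-trial counts are compatible with chance at the conventional 0.05 level, while the chunk-level count is not. An analysis that models reader and book effects would be the more careful choice for the clustered chunk-level judgments.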

What carries the argument

A controlled reader study protocol that collects preferences, source guesses, and span annotations from immersive full-excerpt reading and from close examination of aligned human-machine text chunks.

Load-bearing premise

The agentic LLM pipeline used to produce the machine translations stands in for current best AI literary translation, and the judgments of these 15 readers apply beyond the tested books and languages.

What would settle it

A replication using a different leading LLM pipeline or a larger and more diverse set of readers that finds equal or higher preference for the machine translations would falsify the reported preference pattern.

Figures

Figures reproduced from arXiv: 2606.26040 by Adam Podoxin, Maite Taboada, Marzena Karpinska, Roman Grundkiewicz, Ty Brassington, Yves Ferstler.

**Figure 1.** Evaluation pipeline: Avid readers of published fiction evaluate two versions of an 8,000-word book excerpt: a human translation (HT) and an AI-generated machine translation (MT). Participants (1) read the first translation, (2) complete a perception questionnaire, (3) read the competing translation, (4) complete a second questionnaire, (5) compare both versions, (6) take a one-day break, and (7) perfor…

**Figure 2.** Agentic literary MT pipeline used in this study. Source excerpts are chunked and paired with style…

**Figure 3.** Study design and evaluation counts. Two readers evaluated each of the 15 book excerpts (30 book-reader evaluations). … same excerpt at the chunk level, with 300-word MT and HT chunks presented side by side for a total of 772 comparisons (close reading; 386 aligned HT–MT chunk pairs, each judged by both readers).

**Figure 4.** The distribution of readers’ preferences be…

**Figure 5.** The distribution of readers’ ratings after im…

**Figure 6.** The distribution of readers’ excerpt-level pref…

**Figure 7.** Distribution of good and poor span highlights by source language and translation type. [Plot residue from an adjacent chart: title "MT has more chunks with dense poor spans"; x-axis "Poor-highlighted words per 1K threshold"; y-axis "% of chunks at or above threshold"; series HT and MT.]

**Figure 8.** Share of close-reading chunks with dense…

**Figure 9.** Distribution of span-level highlights in the…

**Figure 10.** Examples of span-level preference evidence from the side-by-side chunks evaluation. Participants…

**Figure 11.** The most frequent labels in readers’ comments about the…

**Figure 13.** Machine-translation (MT) identification ac…

**Figure 14.** Guidelines provided to the participants for the evaluation and annotation tasks (pages 1–4). See…

**Figure 15.** Guidelines provided to the participants for the evaluation and annotation tasks (pages 5–8). See…

**Figure 16.** Guidelines provided to the participants for the evaluation and annotation tasks (pages 9–12).

**Figure 17.** Single-reading questionnaire shown after an immersive reading. Participants rated fluency, literary…

**Figure 18.** Human-evaluation questionnaire interfaces used after paired readings. The comparison form records…

**Figure 19.** Median span highlight length in the mul…

**Figure 20.** Span-level annotations by target language in…

**Figure 22.** MT identification by book. Orange bars show…

**Figure 23.** Positive and negative aspects mentioned by participants when comparing machine translation (MT) and…

**Figure 24.** Positive reasons readers reported for preferring one translation over the other, separated by preferred…

**Figure 25.** Preference mechanisms explaining why the chosen translation was preferred. Diverging bars contrast…

**Figure 26.** Close-reading chunk-level preferred translation. Each cell shows one chunk comparison for an excerpt…

**Figure 27.** Preferred translation by excerpt and reader. Bars show the share of chunk-level choices favoring MT on…

**Figure 28.** Per-book immersive-reading ratings for HT and MT. The figure breaks down participant ratings by book,…

**Figure 29.** Immersive-reading ratings by presentation order. The figure compares ratings assigned after the first and…

**Figure 30.** Translation-origin guess flows after the comparison task. The first panel summarizes whether guesses…

**Figure 32.** Close-reading preferences in the multilin…

**Figure 33.** More likely AI-translated choices in the…

**Figure 34.** Close-reading chunk-level preferred transla…

**Figure 35.** Chunk-level preferred translation in the multilingual target-language case study. Each row shows one…

Original abstract

AI translation of literary works is increasingly common. While the content may be rendered adequately, we do not know enough about how readers experience it in terms of immersiveness and literary effect, aspects poorly captured by automatic machine translation metrics or human evaluation targeting fluency and adequacy. We ask 15 avid readers to compare recently published human translations (HT) to machine translations (MT) generated with an agentic large language model (LLM)-based pipeline, for 15 recent novels in French, Polish, and Japanese and translated into English. Readers evaluated approximately 8K-word excerpts in two conditions: immersive reading of the whole excerpt (30 comparisons) and close reading of 386 aligned HT-MT chunk pairs (772 comparisons), with two readers per book and in alternating order of presentation. Overall, readers find MT "fine", but prefer HT (slightly at excerpt-level 19/30, more clearly at chunk-level 522/772) for its ease, clarity, and immersive nature. Readers' highlights show that MT's quality varies more within one book than HT's does. Crucially, readers cannot reliably tell the two apart (17/30 guess correctly) and tend to prefer the version they believe to be human. Automatic metrics, including LLM-as-a-judge approaches, fail to recover reader preferences and favor MT. We release LAIT (Literary AI Translation), a reader-centered evaluation dataset with 1K reader comments, 2K judgments and preference ratings, and 7.2K span-level annotations, along with our evaluation protocol and supporting interface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Readers slightly prefer human literary translations over this one AI pipeline but can't tell them apart, and the paper's main asset is the released dataset of judgments.

Readers find the machine translations fine but lean toward the human versions for ease and immersion, though they guess correctly only about half the time and favor whatever they think is human. The paper supplies concrete counts from 15 readers on 15 novels and releases the LAIT dataset with comments and annotations.

The study compares full excerpts and aligned chunks, giving both 19/30 excerpt preferences and 522/772 chunk preferences for human translation. Collecting reader highlights that show MT quality swings more within a book is a useful observation, and the fact that standard metrics plus LLM judges miss the reader pattern adds a clear data point.

The soft spot is the narrow base. Fifteen readers and a single agentic LLM pipeline for the machine side mean the preference ratios and indistinguishability result are tied to this exact setup. If other generation methods or larger reader groups shift the numbers, the broader claim about AI literary translation weakens. The paper notes the within-book variation but does not test alternatives.
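
One way to make the "narrow base" concrete is to put an interval around the reported preference shares. The sketch below is an illustration, not the paper's analysis: it computes a Wilson score interval and assumes the judgments are independent, which overstates the precision of the chunk-level figure because those judgments are clustered by reader and book. The helper name `wilson_interval` is hypothetical.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p_hat = k / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# HT preference counts as reported; independence is our simplifying assumption.
for label, (k, n) in {"excerpt-level": (19, 30), "chunk-level": (522, 772)}.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k / n:.3f} (approx. 95% interval {lo:.3f} to {hi:.3f})")
```

With only 30 excerpt-level comparisons the interval is wide and spans 0.5, whereas the chunk-level interval sits above 0.5 under the same assumption. This matches the paper's wording of a slight excerpt-level and clearer chunk-level preference.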

This is for people working on literary machine translation or on evaluation that tries to track actual reader experience rather than fluency scores. The dataset itself could serve as a starting point for new metric work.

Send it to peer review. The empirical numbers and the released material give referees something concrete to examine even if the scope stays limited.

Referee Report

1 major / 3 minor

Summary. The manuscript reports results from a human-subjects study in which 15 avid readers compared recently published human translations (HT) against machine translations (MT) generated by a single agentic LLM-based pipeline. The evaluation covers ~8K-word excerpts from 15 novels in French, Polish, and Japanese translated to English, using both immersive whole-excerpt reading (30 comparisons) and close reading of 386 aligned chunk pairs (772 comparisons). Readers rate MT as 'fine' but prefer HT (19/30 excerpt-level, 522/772 chunk-level) for ease, clarity, and immersiveness; they cannot reliably distinguish the two (17/30 correct guesses) and favor the version they believe is human. Automatic metrics, including LLM-as-judge, fail to recover these preferences. The authors release the LAIT dataset containing 1K reader comments, 2K judgments, and 7.2K span annotations together with the evaluation protocol.

Significance. If the reported preference patterns and indistinguishability result generalize, the work supplies concrete reader-centered evidence on literary aspects of translation quality that automatic metrics miss, together with a publicly released dataset that supports reproducibility and follow-on studies. The explicit contrast between excerpt-level and chunk-level judgments and the observation that MT quality varies more within books than HT does are useful empirical contributions.

major comments (1)

[Abstract and Evaluation section] The central claims that readers prefer HT (19/30 and 522/772) yet cannot reliably distinguish MT from HT (17/30) rest on data from only 15 readers and a single agentic LLM pipeline. Because the paper itself notes that MT quality varies more within books than HT does, the absence of any comparison to alternative MT pipelines or a larger reader cohort makes the broader statements about AI versus human literary translation vulnerable to the specific implementation choices; this is load-bearing for the generalizability of the preference and indistinguishability results.

minor comments (3)

[Abstract] The description of chunk alignment, reader recruitment criteria, exact MT generation parameters, and any statistical testing of the reported counts is absent, hindering replication.
[Abstract] The alternating order of presentation is mentioned, but no detail is given on how order effects or fatigue were controlled across the two readers per book.
[Dataset release statement] While the LAIT dataset is a strength, the paper should specify the exact license, file formats, and whether the 7.2K span-level annotations include the original text spans or only offsets.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to qualify claims about generalizability. We respond to the major comment below.

Referee: [Abstract and Evaluation section] The central claims that readers prefer HT (19/30 and 522/772) yet cannot reliably distinguish MT from HT (17/30) rest on data from only 15 readers and a single agentic LLM pipeline. Because the paper itself notes that MT quality varies more within books than HT does, the absence of any comparison to alternative MT pipelines or a larger reader cohort makes the broader statements about AI versus human literary translation vulnerable to the specific implementation choices; this is load-bearing for the generalizability of the preference and indistinguishability results.

Authors: We agree that the modest reader sample and single pipeline constrain broad generalizations, and that the noted intra-book MT variation makes pipeline choice relevant. The study is framed as an initial reader-centered exploration using a current state-of-the-art agentic approach rather than a comprehensive survey of all MT systems. We will revise the abstract and Evaluation section to explicitly qualify the central claims as applying to the tested pipeline and cohort, while retaining the reported counts. We will also expand the Limitations section to discuss the implications of these design choices and the value of follow-up work with additional pipelines and larger reader groups. These textual changes will better align the stated scope with the data without requiring new experiments.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical human-subjects study with direct observations

This paper is a reader study reporting preference counts (19/30 excerpt-level, 522/772 chunk-level), indistinguishability rates (17/30), and qualitative comments from 15 participants evaluating fixed HT/MT pairs. These quantities are direct tallies of participant responses rather than quantities derived from equations, fitted parameters, or self-citations. No derivation chain exists; the central claims rest on the collected data itself, with no reduction of results to inputs by construction. The paper is self-contained against external benchmarks as a straightforward empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is present; the work rests on standard assumptions of human-subjects research such as reader honesty and excerpt representativeness.


work page doi:10.1162/tacl.a.25 2025
[91]

Extending CREAMT : Leveraging Large Language Models for Literary Translation Post-Editing

Castaldo, Antonio and Castilho, Sheila and Moorkens, Joss and Monti, Johanna. Extending CREAMT : Leveraging Large Language Models for Literary Translation Post-Editing. Proceedings of Machine Translation Summit XX: Volume 1. 2025

2025
[92]

Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist

Karpinska, Marzena and Iyyer, Mohit. Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.41

work page doi:10.18653/v1/2023.wmt-1.41 2023
[93]

Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLM s

Wang, Longyue and Tu, Zhaopeng and Gu, Yan and Liu, Siyou and Yu, Dian and Ma, Qingsong and Lyu, Chenyang and Zhou, Liting and Liu, Chao-Hong and Ma, Yufeng and Chen, Weiyu and Graham, Yvette and Webber, Bonnie and Koehn, Philipp and Way, Andy and Yuan, Yulin and Shi, Shuming. Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A...

work page doi:10.18653/v1/2023.wmt-1.3 2023
[94]

Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT -Translated Detective Novel on Document Level

Fonteyne, Margot and Tezcan, Arda and Macken, Lieve. Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT -Translated Detective Novel on Document Level. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

2020
[95]

Project P i P e N ovel: Pilot on Post-editing Novels

Toral, Antonio and Wieling, Martijn and Castilho, Sheila and Moorkens, Joss and Way, Andy. Project P i P e N ovel: Pilot on Post-editing Novels. Proceedings of the 21st Annual Conference of the European Association for Machine Translation. 2018

2018
[96]

arXiv preprint arXiv:2605.13596 , year=

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations , author=. arXiv preprint arXiv:2605.13596 , year=

Pith/arXiv arXiv
[97]

2026 , version =

Pham, Chau Minh and Chang, Yapei and Iyyer, Mohit , title =. 2026 , version =

2026
[98]

2024 , eprint=

Technical Report on the Pangram AI-Generated Text Classifier , author=. 2024 , eprint=

2024
[99]

2023 , note =

ordinal---Regression Models for Ordinal Data , author =. 2023 , note =

2023
[100]

Fitting Linear Mixed-Effects Models Using

Douglas Bates and Martin M. Fitting Linear Mixed-Effects Models Using. Journal of Statistical Software , year =
[101]

Thomas , title =

David R. Thomas , title =. American Journal of Evaluation , volume =. 2006 , doi =. https://doi.org/10.1177/1098214005283748 , abstract =

work page doi:10.1177/1098214005283748 2006
[102]

People who frequently use C hat GPT for writing tasks are accurate and robust detectors of AI -generated text

Russell, Jenna and Karpinska, Marzena and Iyyer, Mohit. People who frequently use C hat GPT for writing tasks are accurate and robust detectors of AI -generated text. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.267

work page doi:10.18653/v1/2025.acl-long.267 2025
[103]

Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks

Jacovi, Alon and Caciularu, Avi and Goldman, Omer and Goldberg, Yoav. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.308

work page doi:10.18653/v1/2023.emnlp-main.308 2023
[104]

Turning E nglish-centric LLM s Into Polyglots: How Much Multilinguality Is Needed?

Kew, Tannon and Schottmann, Florian and Sennrich, Rico. Turning E nglish-centric LLM s Into Polyglots: How Much Multilinguality Is Needed?. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.766

work page doi:10.18653/v1/2024.findings-emnlp.766 2024
[105]

Findings of the WMT 25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets

Kocmi, Tom and Artemova, Ekaterina and Avramidis, Eleftherios and Bawden, Rachel and Bojar, Ond r ej and Dranch, Konstantin and Dvorkovich, Anton and Dukanov, Sergey and Fishel, Mark and Freitag, Markus and Gowda, Thamme and Grundkiewicz, Roman and Haddow, Barry and Karpinska, Marzena and Koehn, Philipp and Lakougna, Howard and Lundin, Jessica and Monz, C...

work page doi:10.18653/v1/2025.wmt-1.22 2025
[106]

M etric X -24: The G oogle Submission to the WMT 2024 Metrics Shared Task

Juraska, Juraj and Deutsch, Daniel and Finkelstein, Mara and Freitag, Markus. M etric X -24: The G oogle Submission to the WMT 2024 Metrics Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.35

work page doi:10.18653/v1/2024.wmt-1.35 2024
[107]

2026 , month = feb, howpublished =

2026
[108]

An Eye-Tracking Study of Equivalent Effect in Translation: The Reader Experience of Literary Style , ISBN =

Walker, Callum , year =. An Eye-Tracking Study of Equivalent Effect in Translation: The Reader Experience of Literary Style , ISBN =. doi:10.1007/978-3-030-55769-0 , publisher =

work page doi:10.1007/978-3-030-55769-0

Showing first 80 references.
