pith. machine review for the scientific record.

arxiv: 2605.13596 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords literary translation · creativity evaluation · automatic evaluation metrics · LLM-as-a-judge · machine translation · human evaluation · evaluation bias

The pith

Automatic evaluation metrics and LLM judges correlate poorly with professional translators on creativity in literary texts and are biased toward machine outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of literary translations across three modalities, three genres, and three language pairs, then has experienced professional translators annotate them in detail for creative shifts and errors. It compares these annotations against scores from automatic evaluation metrics and from LLM-as-a-judge setups. Both automatic approaches show weak alignment with the professionals, and LLM judges systematically rate machine-translated versions higher while marking creative, culturally fitting solutions as mistakes. The gap widens for poetry and other highly literary genres. The work concludes that existing tools cannot yet replace manual expert judgment and that new methods are needed to treat creative deviations as valid choices rather than errors.
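The reported judge bias reduces to a paired comparison: the same source segments, scored by the LLM judge in each modality. A minimal sketch of such a check, with invented scores and column names (the paper's actual data format is not shown here):

```python
# Hypothetical check for judge bias: does an LLM judge score machine
# translations above human translations of the same source segments?
# Scores and column names are invented for illustration.
import pandas as pd
from scipy.stats import wilcoxon

scores = pd.DataFrame({
    "segment_id":   [1, 2, 3, 4, 5, 6],
    "llm_score_ht": [78, 82, 75, 90, 70, 85],  # human translation
    "llm_score_mt": [85, 88, 80, 86, 77, 83],  # machine translation
})

# Paired one-sided test: is MT scored higher than HT on the same segments?
stat, p = wilcoxon(scores["llm_score_mt"], scores["llm_score_ht"],
                   alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, one-sided p = {p:.3f}")
```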

Core claim

Automatic evaluation metrics and LLM-as-a-judge evaluations correlate poorly with professional literary translators' assessments of creativity, and LLM judges display a systematic bias that favors machine-translated texts while penalizing creative and culturally appropriate solutions, with performance dropping further on poetry and similar literary genres.

What carries the argument

A dataset of literary translations across human, machine, and post-edited modalities, annotated by professional translators for creative shifts and errors, used to measure alignment with automatic metrics and LLM judges.
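One way to read that design is as a long-format table with one row per annotated segment, grouped along each axis of the 3 × 3 × 3 design to compute per-cell rank correlations. A sketch under assumed column names, not the paper's actual schema:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical long-format scores: one row per segment; every column
# name and value here is an assumption, not the paper's actual data.
df = pd.DataFrame({
    "genre":    ["poem", "poem", "short_story", "short_story",
                 "thriller", "thriller"] * 2,
    "modality": ["HT", "MT"] * 6,
    "prof_creativity": [0.9, 0.3, 0.7, 0.4, 0.6, 0.2,
                        0.8, 0.1, 0.5, 0.4, 0.7, 0.3],
    "metric_score":    [0.5, 0.6, 0.6, 0.5, 0.4, 0.5,
                        0.4, 0.6, 0.5, 0.5, 0.6, 0.4],
})

# Alignment per genre; the same groupby works for modality or language pair.
for genre, g in df.groupby("genre"):
    rho, _ = spearmanr(g["prof_creativity"], g["metric_score"])
    print(f"{genre:>12}: Spearman rho = {rho:.2f}")
```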

If this is right

  • Automatic metrics cannot serve as reliable substitutes for professional judgment when creativity is the focus of evaluation.
  • LLM-as-a-judge methods introduce a consistent preference for literal machine outputs over creative human solutions.
  • Evaluation accuracy declines markedly for poetry and other highly literary genres.
  • New automatic tools are required that treat creative, out-of-routine solutions as valid rather than as errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If automatic scores are used for quality control, translation workflows may systematically undervalue creative post-editing by humans.
  • The same bias pattern could appear in AI evaluation of other creative writing tasks such as story generation or script adaptation.
  • Explicit cultural and creative criteria would need to be built into future evaluation frameworks to reduce the observed mismatch.

Load-bearing premise

Detailed annotations by experienced professional literary translators constitute an objective and reliable ground truth for measuring creativity and translation quality across genres and modalities.

What would settle it

A new set of annotations on the same dataset by a different group of professional literary translators that produces substantially different creativity scores from the original annotations.
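Operationally, that settling test is a rank-agreement check between the original annotations and an independent replication. A hedged sketch with invented scores:

```python
# Hypothetical replication check: rank agreement between the original
# professional annotations and a second, independent annotator group.
from scipy.stats import kendalltau

original  = [0.62, 0.15, 0.88, 0.40, 0.73, 0.05]  # original creativity scores
replicate = [0.58, 0.22, 0.90, 0.35, 0.70, 0.12]  # second group, same segments

tau, p = kendalltau(original, replicate)
# High tau would support the annotations as stable ground truth;
# low tau would mean the benchmark itself does not replicate.
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")
```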

Figures

Figures reproduced from arXiv: 2605.13596 by Ana Guerberof Arenas, Kyo Gerrits, Rik van Noord.

Figure 1: Example of a UCP in the ST with a Reproduction in the PE and a CS in the HT, with English glosses underneath.
Figure 2: Heat map of the Spearman correlations between AEMs and professional annotations.
Figure 3: Scatterplot of CI from professional and LLM annotations. Colouring indicates modality; shapes indicate genre.
Figure 4: Heat map of Spearman correlations between AEMs and professional evaluations across genres.
Figure 5: Box plots for errors, creative shift (CS) and creativity index (CI) for each modality level.
Figure 6: Box plots for errors, creative shift (CS) and creativity index (CI) for each genre level.
read the original abstract

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and see if they can substitute laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out of routine translations as errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that automatic evaluation metrics (AEMs) and LLM-as-a-judge methods correlate poorly with professional literary translators' assessments of translation quality and creativity (creative shifts and errors) across three language pairs, three genres, and three modalities (human translation, machine translation, post-editing). It further reports that LLM judges exhibit systematic bias favoring machine-translated outputs while penalizing creative and culturally appropriate solutions, with performance degrading for more literary genres such as poetry.

Significance. If the empirical findings hold after addressing reporting gaps, the work provides useful evidence of limitations in current automatic tools for evaluating creativity in literary translation and motivates development of new metrics. The construction of a multi-genre, multi-modality dataset annotated by experienced professionals is a concrete contribution that can support future research.

major comments (2)
  1. [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.
  2. [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.
minor comments (2)
  1. [Abstract] Abstract: add concrete numbers (e.g., total segments annotated, number of annotators, range of correlation values) to make the headline claims more informative.
  2. [Dataset construction] Dataset description: clarify selection criteria for the literary texts and any filtering applied after annotation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us strengthen the reporting in our manuscript. We address each major comment below and have revised the paper to incorporate the requested details on annotation reliability and statistical reporting.

read point-by-point responses
  1. Referee: [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.

    Authors: We agree that inter-annotator agreement metrics are essential to demonstrate label reliability. Although the original submission omitted these statistics, the annotations were performed by multiple professional translators with overlapping segments. In the revised manuscript we now report Cohen's kappa values for the creative shift and error labels (computed on the double-annotated subset), which show substantial agreement. This addition confirms that the observed low correlations with automatic metrics are unlikely to stem from label noise (a sketch of this computation follows these responses). revision: yes

  2. Referee: [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.

    Authors: We acknowledge the reporting gaps. The revised version now includes explicit sample sizes (number of segments per genre, language pair, and modality) in both the abstract and Results section. We have added the precise AEM implementations, full LLM prompts, Pearson/Spearman correlations with 95% confidence intervals, and p-values from appropriate statistical tests comparing modalities. These details are also summarized in a new table for clarity and confirm the robustness of the reported poor correlations and LLM bias. revision: yes
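For concreteness, a minimal sketch of the two statistics the rebuttal promises: Cohen's kappa on a double-annotated subset, and a Spearman correlation with a paired-bootstrap 95% confidence interval. All data, label sets, and sample sizes below are invented placeholders, and the .statistic attribute assumes scipy ≥ 1.9:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# 1) Inter-annotator agreement on a double-annotated subset
#    (categorical labels per segment; labels here are invented).
annotator_a = ["creative_shift", "error", "none", "creative_shift", "none"]
annotator_b = ["creative_shift", "none", "none", "creative_shift", "error"]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # >0.6 is conventionally 'substantial'

# 2) Spearman correlation between professional and automatic scores,
#    with a paired-bootstrap 95% confidence interval.
rng = np.random.default_rng(0)
professional = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3])
metric_score = np.array([0.5, 0.4, 0.6, 0.5, 0.7, 0.3, 0.4, 0.5])

rho = spearmanr(professional, metric_score).statistic
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(metric_score), len(metric_score))
    boot.append(spearmanr(professional[idx], metric_score[idx]).statistic)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"rho = {rho:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```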

Circularity Check

0 steps flagged

No circularity: purely empirical comparison to external annotations

full rationale

The paper constructs a dataset of literary translations across modalities and genres, obtains detailed annotations for creativity from experienced professional translators, and reports correlations between these annotations and both automatic evaluation metrics and LLM-as-a-judge outputs. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on observed empirical mismatches rather than any reduction of predictions to inputs by construction. Self-citations, if present, are not load-bearing for the core result. This is a standard empirical evaluation study whose validity depends on annotation quality and inter-annotator agreement (a separate reliability concern), not on circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating professional literary translator annotations as the authoritative benchmark for creativity; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Professional literary translators' judgments provide a reliable and objective measure of creativity and translation quality
    Used as the ground-truth reference against which AEMs and LLM judges are evaluated.

pith-pipeline@v0.9.0 · 5477 in / 1099 out tokens · 173783 ms · 2026-05-14T20:02:14.689918+00:00 · methodology

discussion (0)

