pith. machine review for the scientific record.

arxiv: 2605.13596 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords literary translation · creativity evaluation · automatic evaluation metrics · LLM-as-a-judge · machine translation · human evaluation · evaluation bias

The pith

Automatic evaluation metrics and LLM judges correlate poorly with professional translators on creativity in literary texts and are biased toward machine outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of literary translations across three modalities, three genres, and three language pairs, then has experienced professional translators annotate them in detail for creative shifts and errors. It compares these annotations against scores from automatic evaluation metrics and from LLM-as-a-judge setups. Both automatic approaches show weak alignment with the professionals, and LLM judges systematically rate machine-translated versions higher while marking creative, culturally fitting solutions as mistakes. The gap widens for poetry and other highly literary genres. The work concludes that existing tools cannot yet replace manual expert judgment and that new methods are needed to treat creative deviations as valid choices rather than errors.
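The reported judge bias reduces to a paired comparison: the same source segments, scored by the LLM judge in each modality. A minimal sketch of such a check, with invented scores and column names (the paper's actual data format is not shown here):

```python
# Hypothetical check for judge bias: does an LLM judge score machine
# translations above human translations of the same source segments?
# Scores and column names are invented for illustration.
import pandas as pd
from scipy.stats import wilcoxon

scores = pd.DataFrame({
    "segment_id":   [1, 2, 3, 4, 5, 6],
    "llm_score_ht": [78, 82, 75, 90, 70, 85],  # human translation
    "llm_score_mt": [85, 88, 80, 86, 77, 83],  # machine translation
})

# Paired one-sided test: is MT scored higher than HT on the same segments?
stat, p = wilcoxon(scores["llm_score_mt"], scores["llm_score_ht"],
                   alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, one-sided p = {p:.3f}")
```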

Core claim

Automatic evaluation metrics and LLM-as-a-judge evaluations correlate poorly with professional literary translators' assessments of creativity, and LLM judges display a systematic bias that favors machine-translated texts while penalizing creative and culturally appropriate solutions, with performance dropping further on poetry and similar literary genres.

What carries the argument

A dataset of literary translations across human, machine, and post-edited modalities, annotated by professional translators for creative shifts and errors, used to measure alignment with automatic metrics and LLM judges.
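One way to read that design is as a long-format table with one row per annotated segment, grouped along each axis of the 3 × 3 × 3 design to compute per-cell rank correlations. A sketch under assumed column names, not the paper's actual schema:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical long-format scores: one row per segment; every column
# name and value here is an assumption, not the paper's actual data.
df = pd.DataFrame({
    "genre":    ["poem", "poem", "short_story", "short_story",
                 "thriller", "thriller"] * 2,
    "modality": ["HT", "MT"] * 6,
    "prof_creativity": [0.9, 0.3, 0.7, 0.4, 0.6, 0.2,
                        0.8, 0.1, 0.5, 0.4, 0.7, 0.3],
    "metric_score":    [0.5, 0.6, 0.6, 0.5, 0.4, 0.5,
                        0.4, 0.6, 0.5, 0.5, 0.6, 0.4],
})

# Alignment per genre; the same groupby works for modality or language pair.
for genre, g in df.groupby("genre"):
    rho, _ = spearmanr(g["prof_creativity"], g["metric_score"])
    print(f"{genre:>12}: Spearman rho = {rho:.2f}")
```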

If this is right

  • Automatic metrics cannot serve as reliable substitutes for professional judgment when creativity is the focus of evaluation.
  • LLM-as-a-judge methods introduce a consistent preference for literal machine outputs over creative human solutions.
  • Evaluation accuracy declines markedly for poetry and other highly literary genres.
  • New automatic tools are required that treat creative, out-of-routine solutions as valid rather than as errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If automatic scores are used for quality control, translation workflows may systematically undervalue creative post-editing by humans.
  • The same bias pattern could appear in AI evaluation of other creative writing tasks such as story generation or script adaptation.
  • Explicit cultural and creative criteria would need to be built into future evaluation frameworks to reduce the observed mismatch.

Load-bearing premise

Detailed annotations by experienced professional literary translators constitute an objective and reliable ground truth for measuring creativity and translation quality across genres and modalities.

What would settle it

A new set of annotations on the same dataset by a different group of professional literary translators that produces substantially different creativity scores from the original annotations.
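Operationally, that settling test is a rank-agreement check between the original annotations and an independent replication. A hedged sketch with invented scores:

```python
# Hypothetical replication check: rank agreement between the original
# professional annotations and a second, independent annotator group.
from scipy.stats import kendalltau

original  = [0.62, 0.15, 0.88, 0.40, 0.73, 0.05]  # original creativity scores
replicate = [0.58, 0.22, 0.90, 0.35, 0.70, 0.12]  # second group, same segments

tau, p = kendalltau(original, replicate)
# High tau would support the annotations as stable ground truth;
# low tau would mean the benchmark itself does not replicate.
print(f"Kendall tau = {tau:.2f} (p = {p:.3f})")
```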

Figures

Figures reproduced from arXiv: 2605.13596 by Ana Guerberof Arenas, Kyo Gerrits, Rik van Noord.

Figure 1: Example of a UCP in the ST with a Reproduction in the PE and a CS in the HT, with English glosses underneath.
Figure 2: Heat map of the Spearman correlations between AEMs and professional annotations.
Figure 3: Scatterplot of CI from professional and LLM annotations. Colouring indicates modality; shapes indicate genre.
Figure 4: Heat map of Spearman correlations between AEMs and professional evaluations across genres.
Figure 5: Box plots for errors, creative shift (CS) and creativity index (CI) for each modality level.
Figure 6: Box plots for errors, creative shift (CS) and creativity index (CI) for each genre level.
read the original abstract

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and see if they can substitute laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out of routine translations as errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that automatic evaluation metrics (AEMs) and LLM-as-a-judge methods correlate poorly with professional literary translators' assessments of translation quality and creativity (creative shifts and errors) across three language pairs, three genres, and three modalities (human translation, machine translation, post-editing). It further reports that LLM judges exhibit systematic bias favoring machine-translated outputs while penalizing creative and culturally appropriate solutions, with performance degrading for more literary genres such as poetry.

Significance. If the empirical findings hold after addressing reporting gaps, the work provides useful evidence of limitations in current automatic tools for evaluating creativity in literary translation and motivates development of new metrics. The construction of a multi-genre, multi-modality dataset annotated by experienced professionals is a concrete contribution that can support future research.

major comments (2)
  1. [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.
  2. [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.
minor comments (2)
  1. [Abstract] Abstract: add concrete numbers (e.g., total segments annotated, number of annotators, range of correlation values) to make the headline claims more informative.
  2. [Dataset construction] Dataset description: clarify selection criteria for the literary texts and any filtering applied after annotation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us strengthen the reporting in our manuscript. We address each major comment below and have revised the paper to incorporate the requested details on annotation reliability and statistical reporting.

read point-by-point responses
  1. Referee: [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.

    Authors: We agree that inter-annotator agreement metrics are essential to demonstrate label reliability. Although the original submission omitted these statistics, the annotations were performed by multiple professional translators with overlapping segments. In the revised manuscript we now report Cohen's kappa values for the creative shift and error labels (computed on the double-annotated subset), which show substantial agreement. This addition confirms that the observed low correlations with automatic metrics are unlikely to stem from label noise (a sketch of this computation follows these responses). revision: yes

  2. Referee: [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.

    Authors: We acknowledge the reporting gaps. The revised version now includes explicit sample sizes (number of segments per genre, language pair, and modality) in both the abstract and Results section. We have added the precise AEM implementations, full LLM prompts, Pearson/Spearman correlations with 95% confidence intervals, and p-values from appropriate statistical tests comparing modalities. These details are also summarized in a new table for clarity and confirm the robustness of the reported poor correlations and LLM bias. revision: yes
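For concreteness, a minimal sketch of the two statistics the rebuttal promises: Cohen's kappa on a double-annotated subset, and a Spearman correlation with a paired-bootstrap 95% confidence interval. All data, label sets, and sample sizes below are invented placeholders, and the .statistic attribute assumes scipy ≥ 1.9:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# 1) Inter-annotator agreement on a double-annotated subset
#    (categorical labels per segment; labels here are invented).
annotator_a = ["creative_shift", "error", "none", "creative_shift", "none"]
annotator_b = ["creative_shift", "none", "none", "creative_shift", "error"]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # >0.6 is conventionally 'substantial'

# 2) Spearman correlation between professional and automatic scores,
#    with a paired-bootstrap 95% confidence interval.
rng = np.random.default_rng(0)
professional = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3])
metric_score = np.array([0.5, 0.4, 0.6, 0.5, 0.7, 0.3, 0.4, 0.5])

rho = spearmanr(professional, metric_score).statistic
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(metric_score), len(metric_score))
    boot.append(spearmanr(professional[idx], metric_score[idx]).statistic)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"rho = {rho:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```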

Circularity Check

0 steps flagged

No circularity: purely empirical comparison to external annotations

full rationale

The paper constructs a dataset of literary translations across modalities and genres, obtains detailed annotations for creativity from experienced professional translators, and reports correlations between these annotations and both automatic evaluation metrics and LLM-as-a-judge outputs. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on observed empirical mismatches rather than any reduction of predictions to inputs by construction. Self-citations, if present, are not load-bearing for the core result. This is a standard empirical evaluation study whose validity depends on annotation quality and inter-annotator agreement (a separate reliability concern), not on circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating professional literary translator annotations as the authoritative benchmark for creativity; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Professional literary translators' judgments provide a reliable and objective measure of creativity and translation quality
    Used as the ground-truth reference against which AEMs and LLM judges are evaluated.

pith-pipeline@v0.9.0 · 5477 in / 1099 out tokens · 173783 ms · 2026-05-14T20:02:14.689918+00:00 · methodology

discussion (0)

