Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

2026 · cs.CL · arXiv 2605.13596

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation creativity (creative shifts and errors), and to determine whether they can substitute for laborious manual annotation. A dataset of literary translations spanning three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations of creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. These findings highlight fundamental limitations of current automatic evaluation tools for literary translation and the need for new tools that do not frequently treat out-of-routine translations as errors.
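As a rough illustration of what "correlating" an automatic score with professional judgements can look like, the sketch below computes a Spearman rank correlation in plain Python. It is not the paper's method: the abstract does not say which correlation statistic or scoring scale was used, and the function names and the toy numbers are hypothetical, invented only to make the example runnable.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data, NOT taken from the paper: one entry per translated
# segment, pairing a professional creativity rating with an automatic score.
human_creativity = [4, 1, 3, 5, 2]
metric_score = [0.61, 0.80, 0.55, 0.70, 0.75]

rho = spearman(human_creativity, metric_score)
print(f"Spearman rho = {rho:.2f}")
```

A value near +1 would mean the automatic score orders segments the same way the professionals do; values near 0 or below indicate the kind of poor alignment the abstract reports. A real analysis would use many more segments and an established implementation such as one from a statistics library.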

representative citing papers

AI translation of literary texts is "fine", but readers still prefer human translations

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

LitSeg segments literary texts using narrative analysis via multi-stage prompting and offers a distilled lightweight version for efficient use in RAG systems.

citing papers explorer

AI translation of literary texts is "fine", but readers still prefer human translations

cs.CL · 2026-06-24 · unverdicted · none · ref 96

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

cs.CL · 2026-05-26 · unverdicted · none · ref 3
