Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation

Arda Tezcan; Lieve Macken; Ran Zhang; Simone Paolo Ponzetto; Steffen Eger; Wei Zhao

arxiv: 2604.18169 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation

Ran Zhang , Steffen Eger , Arda Tezcan , Wei Zhao , Simone Paolo Ponzetto , Lieve Macken This is my paper

Pith reviewed 2026-05-10 04:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords literary translationLLM evaluationtranslational creativitysource comprehensionUnits of Creative Potentialpaired-task frameworkmachine translationcreativity assessment

0 comments

The pith

Strong comprehension in LLMs does not produce human-level creativity in literary translation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a paired-task framework to test both comprehension of source literary texts and the creativity of their translations by large language models. It identifies Units of Creative Potential in the source and checks if models render them creatively rather than literally. Testing many models on excerpts from eleven books shows a clear separation: good understanding of the text rarely leads to creative translations. The shortfall grows larger when translating between more distant languages such as English and Chinese. Only a few model and prompt combinations reach modest creativity scores, while humans score higher.

Core claim

The paper claims that LLMs can comprehend literary source texts effectively yet consistently fail to translate them with human-like creativity, defaulting instead to literal or inappropriate renderings. This is shown using a framework that pairs a comprehension task with a creativity evaluation based on annotated Units of Creative Potential such as metaphors and wordplay, applied across multiple models and prompts.

What carries the argument

A paired-task framework that first assesses source-text comprehension and then measures translational creativity through Units of Creative Potential identified in the source material.

Load-bearing premise

That the selected Units of Creative Potential and expert annotations provide a reliable and complete way to measure translational creativity in line with professional judgment.

What would settle it

Observing a model that consistently produces translations rated as more creative than human ones by experts on the same literary excerpts, or one that correctly and creatively handles all identified Units of Creative Potential.

Figures

Figures reproduced from arXiv: 2604.18169 by Arda Tezcan, Lieve Macken, Ran Zhang, Simone Paolo Ponzetto, Steffen Eger, Wei Zhao.

**Figure 2.** Figure 2: Task 1 performance (F1 score) as a function [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Kernel density estimates of creativity scores for 23 En–Zh models under four prompts (base, P1–P3) [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Screenshot of the task 2 annotation tool. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: From source comprehension (Task 1, F1) to [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a paired-task framework for evaluating LLMs on literary translation tasks, separating source-text comprehension (Task 1) from translational creativity assessed via Units of Creative Potential (UCPs) such as metaphors and wordplay (Task 2). Applied to excerpts from 11 books, it benchmarks 23 models plus four creativity-oriented prompts against human baselines, concluding that strong comprehension does not yield human-level creativity, with models frequently producing literal or inappropriate renderings (especially for English-Chinese), creativity prompts offering only modest gains, and only Mistral-Large approaching human scores (0.167 vs. 0.246), while most model-prompt pairs score at or near zero.

Significance. If the UCP-based measure proves robust, the work offers a useful disentangled evaluation paradigm for creative generation tasks where comprehension and creativity are intertwined in professional practice. The scale of the benchmark (23 models) and the explicit human comparison provide concrete evidence of current LLM limitations in literary translation beyond reproduction, which could guide future prompt engineering and model development.

major comments (2)

[§3 and §4] §3 (Paired-Task Framework) and §4 (UCP Annotation Protocol): The central claim that models exhibit a comprehension-creativity disconnect rests on UCPs serving as a valid, comprehensive proxy for translational creativity. However, the manuscript provides no inter-annotator agreement statistics, no correlation analysis with professional translator judgments, and no external validation of the UCP inventory against established translation quality frameworks; without these, the reported gaps (particularly the English-Chinese disparity) risk being artifacts of annotation subjectivity or incomplete coverage of culturally specific creative elements.
[§5] §5 (Results and Model Rankings): The quantitative finding that only three model-prompt combinations exceed a creativity score of 0.1, with Mistral-Large at 0.167 versus human 0.246, is load-bearing for the headline conclusion. Yet the automatic UCP scoring procedure and its agreement with the expert annotations are not detailed with sufficient precision (e.g., exact matching rules, weighting of UCP types, or error analysis), making it impossible to determine whether the low scores reflect genuine creative deficits or measurement limitations.

minor comments (2)

[Results tables] Table 2 or equivalent results table: report standard errors or confidence intervals alongside the mean creativity scores to allow readers to assess the stability of the model-human gaps.
[§2] §2 (Related Work): the discussion of prior LLM translation benchmarks could more explicitly contrast the paired-task design with existing holistic quality metrics such as BLEU or human MQM to clarify the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on the validity of our paired-task framework and the UCP-based evaluation. We address the major comments point by point below, indicating the changes we will implement in the revised manuscript.

read point-by-point responses

Referee: [§3 and §4] §3 (Paired-Task Framework) and §4 (UCP Annotation Protocol): The central claim that models exhibit a comprehension-creativity disconnect rests on UCPs serving as a valid, comprehensive proxy for translational creativity. However, the manuscript provides no inter-annotator agreement statistics, no correlation analysis with professional translator judgments, and no external validation of the UCP inventory against established translation quality frameworks; without these, the reported gaps (particularly the English-Chinese disparity) risk being artifacts of annotation subjectivity or incomplete coverage of culturally specific creative elements.

Authors: We recognize that additional evidence for the robustness of the UCP measure would strengthen our conclusions. The UCP inventory draws from well-established categories in literary translation theory, including those identified in prior work on creative translation. To address the lack of reported inter-annotator agreement, we will include these statistics in the revised manuscript, calculated across the annotations of the expert annotators for both source comprehension and creativity tasks. We will also provide a more detailed discussion of how UCP scores correlate with the professional human translations used as baselines and align with established frameworks such as functionalist approaches to translation quality. A full-scale external validation against independent professional judgments is not feasible within the current study but represents an important direction for future research. We maintain that the consistent patterns observed across 23 models and the clear gap with human performance support the comprehension-creativity disconnect, particularly as the annotation protocol was applied uniformly. revision: partial
Referee: [§5] §5 (Results and Model Rankings): The quantitative finding that only three model-prompt combinations exceed a creativity score of 0.1, with Mistral-Large at 0.167 versus human 0.246, is load-bearing for the headline conclusion. Yet the automatic UCP scoring procedure and its agreement with the expert annotations are not detailed with sufficient precision (e.g., exact matching rules, weighting of UCP types, or error analysis), making it impossible to determine whether the low scores reflect genuine creative deficits or measurement limitations.

Authors: We agree that the automatic scoring procedure requires more detailed exposition to ensure transparency and allow verification of the results. In the revised manuscript, we will elaborate on the UCP scoring algorithm in §5, specifying the exact rules for matching UCPs (including handling of synonyms and paraphrases via semantic similarity thresholds), the weighting applied to different UCP categories, and a comprehensive error analysis that reports the agreement between automatic scores and expert manual annotations on a held-out sample. This will include precision, recall, and F1 metrics for UCP detection. We believe this will demonstrate that the low creativity scores accurately reflect the models' output characteristics rather than flaws in the measurement process. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with external human baselines

full rationale

The paper presents an empirical paired-task evaluation framework for LLMs on literary translation. Task 1 measures source-text comprehension and Task 2 scores translational creativity via Units of Creative Potential (UCPs) identified in source texts, followed by expert human annotations and automatic UCP-based scoring. Reported creativity scores (e.g., 0.167 for Mistral-Large vs. human 0.246) are direct outputs of this annotation pipeline applied to model generations, not quantities derived from fitted parameters or self-referential equations within the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation of the central claims. The framework is self-contained against external human baselines and does not reduce its findings to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that UCPs capture translational creativity and that the selected excerpts and human annotations are representative. No free parameters or invented physical entities are described.

axioms (1)

domain assumption Units of Creative Potential such as metaphors and wordplay constitute measurable and representative aspects of translational creativity
Invoked to define Task 2 evaluation

invented entities (1)

Units of Creative Potential (UCPs) no independent evidence
purpose: Quantify creative elements in model translations for automatic scoring
New construct introduced to operationalize creativity

pith-pipeline@v0.9.0 · 5529 in / 1273 out tokens · 24763 ms · 2026-05-10T04:43:58.811433+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

William Orwig, Emma R Edenbaum, Joshua D Greene, and Daniel L Schacter

How does narrow ai impact human creativity? Creativity Research Journal, pages 1–11. William Orwig, Emma R Edenbaum, Joshua D Greene, and Daniel L Schacter. 2024. The language of cre- ativity: Evidence from humans and large language models.The Journal of creative behavior, 58(1):128– 136. Max Peeperkorn, Tom Kouwenhoven, Dan Brown, and Anna Jordanous. 202...

work page 2024
[2]

Ran Zhang and Steffen Eger

(perhaps) beyond human translation: Harness- ing multi-agent collaboration for translating ultra- long literary texts.Transactions of the Association for Computational Linguistics, 13:901–922. Ran Zhang and Steffen Eger. 2026. Llm-based multi-agent poetry generation in non-cooperative environments.Journal of Language Modelling, 13(2):261–318. Ran Zhang, W...

work page 2026
[3]

Assessing and understanding creativity in large language models.Machine Intelligence Research, 22(3):417–436. A Dataset construction A.1 Task 1 details A.1.1 Translation difficulty Tables 3 and 4 contain the prompt with the defini- tion of translation difficulties, i.e., Units of Creative Shift (UCPs). A.1.2 Task 1 annotation guideline OverviewYou will ev...

work page
[4]

E.g., Time is a thief

Metaphors and original images: A metaphor is a figure of speech in which a word or phrase is applied to an object or action to which it is not literally applicable, but helps explain an idea or make a comparison. E.g., Time is a thief

work page
[5]

like”, “-like

Comparisons/simile: A figure of speech in- volving an explicit comparison of one thing with another thing of a different kind used to make a description more empathic or vivid (often using words like “like”, “-like”, “as”, and “-looking”. E.g. Her smile was as bright as the sun

work page
[6]

E.g., He had a photographic memory but never developed it

Wordplay: Wordplay refers to the clever and humorous use of words in a way that exploits their multiple meanings, sounds, or structures. E.g., He had a photographic memory but never developed it. E.g., I scream for ice cream

work page
[7]

E.g., Achilles heel, French Revolution

Cultural and historical references: References to culture-specific items or historical events, fig- ures, or practices. E.g., Achilles heel, French Revolution

work page
[8]

E.g., piece of cake, bite the bullet, break the ice

Idiomatic phrases: An idiom is a phrase or expression that largely or exclusively carries a figurative or non-literal meaning. E.g., piece of cake, bite the bullet, break the ice

work page
[9]

Rhyme and metrics: Use of rhyme and rhyth- mic patterns E.g. It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of de- spair. (Charle...

work page
[10]

E.g., a well-to-do girl of about thirteen (Sarah Waters - Night Watch)

Lexical variety & lexical deviation: The use of a wide range of vocabulary, including infrequent words, or departure from conventional meanings and uses of words. E.g., a well-to-do girl of about thirteen (Sarah Waters - Night Watch). E.g., A great sport ... old Evie! (Agatha Christie - The Mysterious Affair at Styles)

work page
[11]

problem_type

Proper names: Names of people, places, and organizations. E.g., Albus Dumbledore (Harry Potter - J.K. Rowling), Oompa-Loompas (Roald Dahl - Charlie and the Chocolate Factory). Table 3: Translation difficulty (1) 13 Now read the following text and determine whether it contains any translation problems or not. text Please return the answer in the following ...

work page 1925
[12]

raining cats and dogs

Label: Translation Technique (choose exactly ONE) - Creative Shift (CS): translations that deviate from a literal rendering of the source text to cre- ate a more impactful, natural, or culturally ap- propriate translation in the target language in a meaningful or stylistically motivated way. Please make sure the deviation of meaning or image is rooted in ...

work page
[13]

- low→routine, literal, unsurprising - medium → some interpretive or stylistic choice - high→clearly inventive, or unexpected

Level_of_creativity ∈ [low, medium, high]: How surprising, inventive, or non-routine the translation choice feels within the given literary context. - low→routine, literal, unsurprising - medium → some interpretive or stylistic choice - high→clearly inventive, or unexpected

work page
[14]

Level_of_acceptability ∈ [low, medium, high]: Overall translation quality and appropriateness in context: - low→clearly unacceptable as a translation - medium → understandable but flawed, awk- ward, or unstable - high → logical, fluent, appropriate, and accept- able for literary use based on the source text. Make sure the word choice is conventional for t...

work page
[15]

too creative

Level of deviation∈[low, medium, high]: - low→meaning & imagery preserved - medium → meaning & imagery partially pre- served - high → strongly reimagined meaning or image, with added information !! Please double-check to ensure the deviation adheres to the source context. Be cautious in cases of medium or high deviation, as this usu- ally indicates change...

work page

[1] [1]

William Orwig, Emma R Edenbaum, Joshua D Greene, and Daniel L Schacter

How does narrow ai impact human creativity? Creativity Research Journal, pages 1–11. William Orwig, Emma R Edenbaum, Joshua D Greene, and Daniel L Schacter. 2024. The language of cre- ativity: Evidence from humans and large language models.The Journal of creative behavior, 58(1):128– 136. Max Peeperkorn, Tom Kouwenhoven, Dan Brown, and Anna Jordanous. 202...

work page 2024

[2] [2]

Ran Zhang and Steffen Eger

(perhaps) beyond human translation: Harness- ing multi-agent collaboration for translating ultra- long literary texts.Transactions of the Association for Computational Linguistics, 13:901–922. Ran Zhang and Steffen Eger. 2026. Llm-based multi-agent poetry generation in non-cooperative environments.Journal of Language Modelling, 13(2):261–318. Ran Zhang, W...

work page 2026

[3] [3]

Assessing and understanding creativity in large language models.Machine Intelligence Research, 22(3):417–436. A Dataset construction A.1 Task 1 details A.1.1 Translation difficulty Tables 3 and 4 contain the prompt with the defini- tion of translation difficulties, i.e., Units of Creative Shift (UCPs). A.1.2 Task 1 annotation guideline OverviewYou will ev...

work page

[4] [4]

E.g., Time is a thief

Metaphors and original images: A metaphor is a figure of speech in which a word or phrase is applied to an object or action to which it is not literally applicable, but helps explain an idea or make a comparison. E.g., Time is a thief

work page

[5] [5]

like”, “-like

Comparisons/simile: A figure of speech in- volving an explicit comparison of one thing with another thing of a different kind used to make a description more empathic or vivid (often using words like “like”, “-like”, “as”, and “-looking”. E.g. Her smile was as bright as the sun

work page

[6] [6]

E.g., He had a photographic memory but never developed it

Wordplay: Wordplay refers to the clever and humorous use of words in a way that exploits their multiple meanings, sounds, or structures. E.g., He had a photographic memory but never developed it. E.g., I scream for ice cream

work page

[7] [7]

E.g., Achilles heel, French Revolution

Cultural and historical references: References to culture-specific items or historical events, fig- ures, or practices. E.g., Achilles heel, French Revolution

work page

[8] [8]

E.g., piece of cake, bite the bullet, break the ice

Idiomatic phrases: An idiom is a phrase or expression that largely or exclusively carries a figurative or non-literal meaning. E.g., piece of cake, bite the bullet, break the ice

work page

[9] [9]

Rhyme and metrics: Use of rhyme and rhyth- mic patterns E.g. It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of de- spair. (Charle...

work page

[10] [10]

E.g., a well-to-do girl of about thirteen (Sarah Waters - Night Watch)

Lexical variety & lexical deviation: The use of a wide range of vocabulary, including infrequent words, or departure from conventional meanings and uses of words. E.g., a well-to-do girl of about thirteen (Sarah Waters - Night Watch). E.g., A great sport ... old Evie! (Agatha Christie - The Mysterious Affair at Styles)

work page

[11] [11]

problem_type

Proper names: Names of people, places, and organizations. E.g., Albus Dumbledore (Harry Potter - J.K. Rowling), Oompa-Loompas (Roald Dahl - Charlie and the Chocolate Factory). Table 3: Translation difficulty (1) 13 Now read the following text and determine whether it contains any translation problems or not. text Please return the answer in the following ...

work page 1925

[12] [12]

raining cats and dogs

Label: Translation Technique (choose exactly ONE) - Creative Shift (CS): translations that deviate from a literal rendering of the source text to cre- ate a more impactful, natural, or culturally ap- propriate translation in the target language in a meaningful or stylistically motivated way. Please make sure the deviation of meaning or image is rooted in ...

work page

[13] [13]

- low→routine, literal, unsurprising - medium → some interpretive or stylistic choice - high→clearly inventive, or unexpected

Level_of_creativity ∈ [low, medium, high]: How surprising, inventive, or non-routine the translation choice feels within the given literary context. - low→routine, literal, unsurprising - medium → some interpretive or stylistic choice - high→clearly inventive, or unexpected

work page

[14] [14]

Level_of_acceptability ∈ [low, medium, high]: Overall translation quality and appropriateness in context: - low→clearly unacceptable as a translation - medium → understandable but flawed, awk- ward, or unstable - high → logical, fluent, appropriate, and accept- able for literary use based on the source text. Make sure the word choice is conventional for t...

work page

[15] [15]

too creative

Level of deviation∈[low, medium, high]: - low→meaning & imagery preserved - medium → meaning & imagery partially pre- served - high → strongly reimagined meaning or image, with added information !! Please double-check to ensure the deviation adheres to the source context. Be cautious in cases of medium or high deviation, as this usu- ally indicates change...

work page