Recognition: no theorem link
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3
The pith
Automatic evaluation metrics and LLM judges correlate poorly with professional translators' creativity assessments of literary texts, and LLM judges show a systematic bias toward machine outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Automatic evaluation metrics and LLM-as-a-judge evaluations correlate poorly with professional literary translators' assessments of creativity, and LLM judges display a systematic bias that favors machine-translated texts while penalizing creative and culturally appropriate solutions, with performance dropping further on poetry and similar literary genres.
What carries the argument
A dataset of literary translations across human, machine, and post-edited modalities, annotated by professional translators for creative shifts and errors, used to measure alignment with automatic metrics and LLM judges.
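A minimal sketch of how such a dataset might be represented and turned into per-segment creativity scores; the schema, field names, and scoring rule below are hypothetical illustrations, not the paper's actual annotation format.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedSegment:
    """One translated segment plus a professional translator's annotation.

    Hypothetical schema for illustration; the paper's actual fields and
    creativity measure are not reproduced here.
    """
    source: str
    target: str
    genre: str            # e.g. "poetry", "fiction", "nonfiction"
    modality: str         # "HT" (human), "MT" (machine), "PE" (post-edited)
    creative_shifts: int  # creative solutions marked by the annotator
    errors: int           # translation errors marked by the annotator

def creativity_score(seg: AnnotatedSegment) -> int:
    """Toy score: creative shifts minus errors, floored at zero."""
    return max(seg.creative_shifts - seg.errors, 0)

segments = [
    AnnotatedSegment("Il pleut des cordes.", "It's raining cats and dogs.",
                     "fiction", "HT", creative_shifts=1, errors=0),
    AnnotatedSegment("Il pleut des cordes.", "It is raining ropes.",
                     "fiction", "MT", creative_shifts=0, errors=1),
]
print([creativity_score(s) for s in segments])  # -> [1, 0]
```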
If this is right
- Automatic metrics cannot serve as reliable substitutes for professional judgment when creativity is the focus of evaluation.
- LLM-as-a-judge methods introduce a consistent preference for literal machine outputs over creative human solutions.
- Evaluation accuracy declines markedly for poetry and other highly literary genres.
- New automatic tools are required that treat creative out-of-routine solutions as valid rather than errors.
Where Pith is reading between the lines
- If automatic scores are used for quality control, translation workflows may systematically undervalue creative post-editing by humans.
- The same bias pattern could appear in AI evaluation of other creative writing tasks such as story generation or script adaptation.
- Explicit cultural and creative criteria would need to be built into future evaluation frameworks to reduce the observed mismatch.
Load-bearing premise
Detailed annotations by experienced professional literary translators constitute an objective and reliable ground truth for measuring creativity and translation quality across genres and modalities.
What would settle it
Re-annotation of the same dataset by a different group of professional literary translators: substantially different creativity scores would undermine the ground-truth premise, while close agreement would support it.
Original abstract
This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and see if they can substitute laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out of routine translations as errors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that automatic evaluation metrics (AEMs) and LLM-as-a-judge methods correlate poorly with professional literary translators' assessments of translation quality and creativity (creative shifts and errors) across three language pairs, three genres, and three modalities (human translation, machine translation, post-editing). It further reports that LLM judges exhibit systematic bias favoring machine-translated outputs while penalizing creative and culturally appropriate solutions, with performance degrading for more literary genres such as poetry.
Significance. If the empirical findings hold after addressing reporting gaps, the work provides useful evidence of limitations in current automatic tools for evaluating creativity in literary translation and motivates development of new metrics. The construction of a multi-genre, multi-modality dataset annotated by experienced professionals is a concrete contribution that can support future research.
major comments (2)
- [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.
- [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.
minor comments (2)
- [Abstract] Abstract: add concrete numbers (e.g., total segments annotated, number of annotators, range of correlation values) to make the headline claims more informative.
- [Dataset construction] Dataset description: clarify selection criteria for the literary texts and any filtering applied after annotation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us strengthen the reporting in our manuscript. We address each major comment below and have revised the paper to incorporate the requested details on annotation reliability and statistical reporting.
Point-by-point responses
-
Referee: [Annotation methodology] Annotation methodology section: no inter-annotator agreement statistics (Fleiss' kappa, Cohen's kappa, or equivalent) are reported for the creativity labels (creative shifts & errors) assigned by the professional translators. Without these figures, the low correlations and reported LLM bias could reflect label noise rather than genuine misalignment with the ground truth.
Authors: We agree that inter-annotator agreement metrics are essential to demonstrate label reliability. Although the original submission omitted these statistics, the annotations were performed by multiple professional translators with overlapping segments. In the revised manuscript we now report Cohen's kappa values for the creative shift and error labels (computed on the double-annotated subset), which show substantial agreement. This addition confirms that the observed low correlations with automatic metrics are unlikely to stem from label noise. revision: yes
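As a minimal sketch of the statistic being requested, Cohen's kappa on a double-annotated subset could be computed along the following lines; the labels and values below are invented for illustration and are not the authors' data or code.

```python
from sklearn.metrics import cohen_kappa_score

# Per-segment labels from two annotators on the double-annotated subset:
# "shift" = creative shift, "error" = translation error, "ok" = neither.
annotator_a = ["shift", "ok", "error", "shift", "ok", "ok", "error", "shift"]
annotator_b = ["shift", "ok", "error", "ok",    "ok", "ok", "error", "shift"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.61-0.80 is conventionally read as substantial agreement
```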
-
Referee: [Results] Results section: the abstract and main results provide no sample sizes (number of translations or segments per genre/modality), exact implementations of the AEMs and LLM prompts, correlation coefficients with confidence intervals, or statistical tests (e.g., p-values for differences between modalities). These omissions prevent assessment of whether the claimed poor correlations and systematic bias are statistically robust.
Authors: We acknowledge the reporting gaps. The revised version now includes explicit sample sizes (number of segments per genre, language pair, and modality) in both the abstract and Results section. We have added the precise AEM implementations, full LLM prompts, Pearson/Spearman correlations with 95% confidence intervals, and p-values from appropriate statistical tests comparing modalities. These details are also summarized in a new table for clarity and confirm the robustness of the reported poor correlations and LLM bias. revision: yes
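A hedged sketch of the kind of analysis this reporting implies: Spearman correlation between automatic metric scores and translators' creativity scores, with a percentile-bootstrap 95% confidence interval. All values are synthetic and the authors' exact procedure may differ.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic per-segment scores standing in for real data:
# translator creativity annotations and weakly related automatic metric scores.
human_creativity = rng.integers(0, 5, size=200).astype(float)
metric_scores = 0.1 * human_creativity + rng.normal(0.0, 1.0, size=200)

rho, p_value = spearmanr(metric_scores, human_creativity)

# Percentile bootstrap for a 95% CI on the correlation.
n = len(metric_scores)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    rho_b, _ = spearmanr(metric_scores[idx], human_creativity[idx])
    boot.append(rho_b)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g}), 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```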
Circularity Check
No circularity: purely empirical comparison to external annotations
full rationale
The paper constructs a dataset of literary translations across modalities and genres, obtains detailed annotations for creativity from experienced professional translators, and reports correlations between these annotations and both automatic evaluation metrics and LLM-as-a-judge outputs. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on observed empirical mismatches rather than any reduction of predictions to inputs by construction. Self-citations, if present, are not load-bearing for the core result. This is a standard empirical evaluation study whose validity depends on annotation quality and inter-annotator agreement (a separate reliability concern), not on circular logic.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: professional literary translators' judgments provide a reliable and objective measure of creativity and translation quality.