
arxiv: 2605.21135 · v1 · pith:QH5Q2UDHnew · submitted 2026-05-20 · 💻 cs.CL

Smarter edits? Post-editing with error highlights and translation suggestions

Pith reviewed 2026-05-21 04:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords post-editing · machine translation · error highlights · automatic post-editing · quality estimation · user experience · professional translators · LLM

The pith

Professional translators saw no productivity or quality gains from LLM error highlights or correction suggestions in post-editing, though they preferred automatic post-editing highlights and liked the suggestions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding error highlights generated by large language models and correction suggestions from automatic post-editing can make machine translation post-editing faster, better, or more pleasant for professional translators. A controlled study had English-to-Dutch translators work under four conditions: plain post-editing, post-editing with quality-estimation highlights, post-editing with automatic post-editing highlights, and post-editing with both highlights and correction suggestions. Productivity and final translation quality stayed the same across all conditions. Translators rated the automatic post-editing highlights higher than quality-estimation ones and reported better overall experience when correction suggestions were available. These results matter because they show which interface features actually register with users even when objective metrics do not move.

Core claim

In a study with professional En-Nl translators, post-editing with APE error highlights and correction suggestions showed no productivity or quality gains compared to regular post-editing or QE-derived highlights, yet APE highlights were better received than QE highlights and correction suggestions improved user experience.

What carries the argument

A four-condition user study that measures productivity (time and edits), final quality, and subjective user-experience ratings while varying the source of error highlights and the presence of correction suggestions.

If this is right

  • Automatic post-editing highlights can be more acceptable to translators than quality-estimation highlights even when neither improves speed or quality.
  • Correction suggestions can raise subjective satisfaction with the post-editing interface without raising objective productivity.
  • Standard post-editing without extra highlights remains competitive on both speed and output quality.
  • User-experience measures should be tracked separately from productivity when evaluating new post-editing features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tool designers might try combining highlight sources or making suggestions more interactive to turn the observed experience gains into actual speed improvements.
  • The preference for APE highlights could stem from how closely they match the kinds of errors translators naturally notice.
  • Results might shift if the study moved to language pairs with very different error profiles or to translators with less experience.
  • Future experiments could test whether the same features affect revision behavior when translators work on longer documents or under time pressure.

Load-bearing premise

That the particular LLM-derived highlights and APE suggestions tested here would behave the same way in other real professional workflows, and that results from these En-Nl translators would hold for different language pairs or translator groups.

What would settle it

A replication with different language pairs or different underlying models that finds measurable gains in words per minute or quality scores when the same highlights and suggestions are provided would show that the no-gain result does not generalize.

Figures

Figures reproduced from arXiv: 2605.21135 by Alina Karakanta, Andrea Camasta, Dora Žugčić, Fleur V.J. van Tellingen, Gautam Ranka, Joyce van der Wal, Livio Guerra.

Figure 1. SmartPE: Post-editing with error highlights (H-QE and H-APE). Major errors in orange, minor in yellow.

Figure 2. SmartPE: Post-editing with error highlights and correction suggestions (S-APE).
To test the ability of translators to identify critical errors, two critical errors were manually inserted in each text (negation, serious mistranslation, serious omission) before annotating the errors. Out of the 16 total inserted critical errors, only 11 were annotated by xCOMET and 10 by xTower. However, since we wanted …

Figure 3. Productivity per individual translation (PET) and as a group mean.
The group mean is nearly flat across conditions, showing no productivity gains compared to regular PE. The results shown in the figures were confirmed statistically using one-way repeated measures ANOVA on log-transformed PET-level means (see Appendix 8), which revealed no significant effect of condition on productivity. We obs…

Figure 4. Final translation quality in terms of Direct Assessment scores per post-editor (PET) and group mean.
Perceived effect on quality: When asked if the error highlights helped improve the quality of the translation, half of the translators (4) thought that the quality did improve, while the rest stated that the highlights made no difference. For suggestions, almost all translators (7) found that the correcti…

Figure 5. Differences in metrics between news (left) and biomedical (right) domains.
… and 4.2, as productivity and quality did not show any statistically significant differences across domains (productivity 1.65 chr/s vs 1.71 chr/s and DA scores 83.8 vs 87.2 for news and biomedical respectively). Despite this, news and biomedical texts showed differences across several dimensions of the post-editing process, with…

Figure 7. Post-editing interface showing error annotations with suggestions.
  • Make sure you have a space where you can work without distractions.
  • Make sure to familiarise yourself with the interface before you start.
  • Join the Teams meeting. We will ask you to share your screen (only the interface window) and the meeting will be recorded.
Workflow
  • Open the interface by double-clicking on the ‘main’ file in th…

Figure 6. Example of the post-editing interface showing error annotations with minor errors highlighted in yellow and major errors in orange.
3. Post-editing with error annotations and suggestions: Hovering the mouse over highlighted text will show a translation suggestion in a black box above the highlight. To adopt the suggestion, click on the black box. This will substitute the highlighted text with the translati…
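
The Figure 3 excerpt mentions a one-way repeated-measures ANOVA on log-transformed per-post-editor means. As a rough illustration of that kind of test (not the authors' analysis code), the sketch below computes the F statistic in plain Python. The function name `rm_anova_f`, the condition order, and every number in `speeds` are assumptions made up for the example; they are not data from the paper.

```python
import math


def rm_anova_f(data):
    """One-way repeated-measures ANOVA.

    data: list of rows, one per participant, each holding one value
    per condition. Returns (F, df_condition, df_error).
    """
    n = len(data)        # participants
    k = len(data[0])     # conditions
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    # Partition total variability into condition, participant, and error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Hypothetical characters-per-second means: one row per post-editor,
# one column per condition (PE, H-QE, H-APE, S-APE). Illustrative only.
speeds = [
    [1.6, 1.7, 1.5, 1.6],
    [2.1, 2.0, 2.2, 2.1],
    [1.2, 1.3, 1.2, 1.4],
]
logged = [[math.log(x) for x in row] for row in speeds]
f_stat, df1, df2 = rm_anova_f(logged)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

The p-value would come from comparing F against the F distribution with the returned degrees of freedom, which is left to a statistics library here to keep the sketch dependency-free.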
Original abstract

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. This paper reports results from a controlled user study with professional translators performing English-to-Dutch post-editing. It compares four conditions: standard post-editing, post-editing with quality-estimation-derived error highlights, post-editing with automatic-post-editing-derived error highlights, and post-editing with automatic-post-editing-derived error highlights plus correction suggestions. Productivity, final translation quality, and user-experience measures are reported. The central findings are that none of the enhanced conditions produced productivity or quality gains relative to standard post-editing, yet APE-derived highlights were rated more favorably than QE-derived highlights and the addition of correction suggestions improved overall user experience.

Significance. If the empirical results hold under broader conditions, the work supplies useful negative evidence on productivity and quality gains from current LLM-based post-editing aids while documenting positive effects on translator satisfaction. Such findings are relevant for MT tool design and for HCI research on translation workflows, indicating that user-experience considerations may matter more for adoption than raw efficiency metrics. The head-to-head comparison of APE versus QE signals is timely given the rapid integration of LLMs into translation pipelines.

major comments (3)
  1. [Methods] Methods section: The generation procedures for APE error highlights and correction suggestions are described at a high level but without any quantitative assessment of their intrinsic quality (e.g., highlight precision/recall against human error annotations or suggestion acceptance rates during the study). This omission makes it difficult to attribute the reported UX preference for APE over QE to the underlying signal type rather than to incidental differences in the quality of the particular LLM outputs used.
  2. [Results] Results section: The null findings on productivity and quality are presented without accompanying effect sizes, confidence intervals, or power analysis. Given that user studies with professional translators often involve modest sample sizes, the absence of these statistics leaves open the possibility that meaningful differences were simply undetected.
  3. [Discussion] Discussion section: The claim that APE highlights are better received than QE-derived highlights is framed as a general advantage, yet the study is restricted to a single language pair (En-Nl) and a specific set of LLM prompts and models. The paper should explicitly discuss the risk that the observed preference is implementation- or domain-specific and outline concrete steps (additional language pairs, alternative models, or ablation of prompt components) that would be needed to test broader applicability.
minor comments (1)
  1. [Results] Table 2 or the corresponding results table: Ensure that all condition labels are fully spelled out in the caption so that readers can map them directly to the four experimental arms without cross-referencing the text.
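
Major comment 1 asks for highlight precision and recall against human error annotations. One simple way to compute such scores is at the character level, as sketched below. This is an illustration under stated assumptions, not a procedure taken from the paper: the span representation (half-open `(start, end)` character offsets), the function names, and the example spans are all hypothetical.

```python
def span_chars(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def highlight_precision_recall(predicted, reference):
    """Character-level precision and recall of predicted error highlights.

    predicted: spans produced by the system (e.g. QE or APE highlights).
    reference: spans marked by human annotators.
    """
    pred = span_chars(predicted)
    ref = span_chars(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical spans for one segment.
system_spans = [(0, 4), (10, 14)]
human_spans = [(2, 6)]
p, r = highlight_precision_recall(system_spans, human_spans)
print(f"precision={p:.2f} recall={r:.2f}")
```

Character-level scoring gives partial credit for overlapping spans; a stricter span-level match would be an equally defensible choice and would produce lower numbers.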

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of methodological transparency, statistical reporting, and generalizability that we will address in the revision. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Methods] Methods section: The generation procedures for APE error highlights and correction suggestions are described at a high level but without any quantitative assessment of their intrinsic quality (e.g., highlight precision/recall against human error annotations or suggestion acceptance rates during the study). This omission makes it difficult to attribute the reported UX preference for APE over QE to the underlying signal type rather than to incidental differences in the quality of the particular LLM outputs used.

    Authors: We agree that quantitative assessment of the generated highlights and suggestions would strengthen attribution of the UX differences. Our primary focus was the user study outcomes rather than intrinsic system evaluation, and we did not obtain separate human error annotations for precision/recall. However, we did log suggestion acceptance rates during the sessions. In the revised manuscript we will report these acceptance rates and add a brief discussion of how they relate to the observed UX preference. We will also clarify the generation procedures with additional implementation details. revision: partial

  2. Referee: [Results] Results section: The null findings on productivity and quality are presented without accompanying effect sizes, confidence intervals, or power analysis. Given that user studies with professional translators often involve modest sample sizes, the absence of these statistics leaves open the possibility that meaningful differences were simply undetected.

    Authors: We accept this point. In the revised results section we will report effect sizes (Cohen’s d) and 95% confidence intervals for all key comparisons. A prospective power analysis was not performed because the study was exploratory and constrained by the limited availability of professional translators; we will add a post-hoc discussion of achieved power and the implications for detecting small-to-medium effects given our sample size. revision: yes

  3. Referee: [Discussion] Discussion section: The claim that APE highlights are better received than QE-derived highlights is framed as a general advantage, yet the study is restricted to a single language pair (En-Nl) and a specific set of LLM prompts and models. The paper should explicitly discuss the risk that the observed preference is implementation- or domain-specific and outline concrete steps (additional language pairs, alternative models, or ablation of prompt components) that would be needed to test broader applicability.

    Authors: We agree that the current framing risks over-generalization. In the revised discussion we will explicitly state the limitations of the single En-Nl pair, the chosen models, and prompt design. We will also add a dedicated paragraph outlining concrete next steps: replication with at least two additional language pairs, comparison with alternative LLMs, and systematic prompt ablations to isolate which components drive the preference. revision: yes
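
Response 2 promises effect sizes (Cohen's d) and 95% confidence intervals. For a within-subject design, one common variant is the paired effect size d_z, the mean of the paired differences divided by their standard deviation. The sketch below assumes that variant; the rebuttal does not say which formula the authors will use. The function name, the sample values, and the choice of eight participants are hypothetical, and `t_crit` has to be looked up for n - 1 degrees of freedom (2.365 is the two-sided 95% value for 7 degrees of freedom).

```python
import math
import statistics


def paired_effect(a, b, t_crit):
    """Paired effect size d_z and a confidence interval for the mean difference.

    a, b: per-participant scores under two conditions, in the same order.
    t_crit: critical t value for the desired confidence level and n - 1 df.
    """
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation
    d_z = mean_diff / sd_diff
    half_width = t_crit * sd_diff / math.sqrt(len(diffs))
    return d_z, (mean_diff - half_width, mean_diff + half_width)


# Hypothetical characters-per-second values for eight post-editors.
with_suggestions = [1.7, 2.0, 1.3, 1.9, 1.5, 2.2, 1.6, 1.8]
regular_pe = [1.6, 2.1, 1.2, 1.8, 1.6, 2.1, 1.7, 1.7]
d_z, (lo, hi) = paired_effect(with_suggestions, regular_pe, t_crit=2.365)
print(f"d_z={d_z:.2f}, 95% CI for mean difference=({lo:.3f}, {hi:.3f})")
```

An interval that comfortably spans zero, together with its width, says more about what a small study can rule out than a non-significant p-value alone.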

Circularity Check

0 steps flagged

No significant circularity: empirical user study with direct measurements

Full rationale

The paper reports an empirical user study comparing four post-editing conditions (regular PE, QE highlights, APE highlights, and APE highlights plus suggestions) on productivity, quality, and UX metrics collected from professional En-Nl translators. It contains no derivation chain, equations, fitted parameters renamed as predictions, or first-principles results that could reduce to inputs by construction. Claims rest on observed experimental outcomes rather than self-definitional loops or load-bearing self-citations. The work is self-contained against its own study data and does not invoke uniqueness theorems or ansatzes from prior author work to force conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of user-study methodology rather than new mathematical derivations. No free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Professional translators' self-reported experience and measured productivity accurately reflect real-world post-editing performance.
    The study design assumes that the recruited participants and task conditions generalize to professional practice.



Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages

  1. [1]

    Keystroke Logging in Writing Research: Using Inputlog to Analyze Writing Processes , journal =

    Leijten, Mariëlle and Van Waes, Luuk , year =. Keystroke Logging in Writing Research: Using Inputlog to Analyze Writing Processes , journal =

  2. [2]

    2023 , eprint=

    xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection , author=. 2023 , eprint=

  3. [3]

    In: Webber, B., Cohn, T., He, Y., Liu, Y

    Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon. COMET : A Neural Framework for MT Evaluation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.213

  4. [5]

    Large Language Models Are State-of-the-Art Evaluators of Translation Quality

    Kocmi, Tom and Federmann, Christian. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation. 2023

  5. [6]

    In: Koehn, P., Haddow, B., Kocmi, T., Monz, C

    Kocmi, Tom and Federmann, Christian. GEMBA - MQM : Detecting Translation Quality Error Spans with GPT -4. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.64

  6. [7]

    Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

    Kocmi, Tom and Zouhar, Vil \'e m and Avramidis, Eleftherios and Grundkiewicz, Roman and Karpinska, Marzena and Popovi \'c , Maja and Sachan, Mrinmaya and Shmatova, Mariya. Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.131

  7. [8]

    2025 , eprint=

    QE4PE: Word-level Quality Estimation for Human Post-Editing , author=. 2025 , eprint=

  8. [9]

    Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models

    Lu, Qingyu and Qiu, Baopu and Ding, Liang and Zhang, Kanjian and Kocmi, Tom and Tao, Dacheng. Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.520

  9. [10]

    The Devil Is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Fernandes, Patrick and Deutsch, Daniel and Finkelstein, Mara and Riley, Parker and Martins, Andr \'e and Neubig, Graham and Garg, Ankush and Clark, Jonathan and Freitag, Markus and Firat, Orhan. The Devil Is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation. Proceedings of the Eighth Conference on Machine Tran...

  10. [11]

    and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, André F

    Guerreiro, Nuno M. and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, André F. T. , title = ". Transactions of the Association for Computational Linguistics , volume =. 2024 , month =. doi:10.1162/tacl_a_00683 , url =

  11. [12]

    Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics , journal =

    Arle Lommel and Hans Uszkoreit and Aljoscha Burchardt , year =. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics , journal =

  12. [13]

    Kepler, Fabio and Tr \'e nous, Jonay and Treviso, Marcos and Vera, Miguel and Martins, Andr \'e F. T. O pen K iwi: An Open Source Framework for Quality Estimation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2019. doi:10.18653/v1/P19-3020

  13. [14]

    Advances in Neural Information Processing Systems , year =

    Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E Gonzalez and Ion Stoica , title =. Advances in Neural Information Processing Systems , year =

  14. [15]

    Findings of the WMT 2023 Shared Task on Automatic Post-Editing

    Bhattacharyya, Pushpak and Chatterjee, Rajen and Freitag, Markus and Kanojia, Diptesh and Negri, Matteo and Turchi, Marco. Findings of the WMT 2023 Shared Task on Automatic Post-Editing. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.55

  15. [16]

    Machine Translation Meets Large Language Models: Evaluating C hat GPT ' s Ability to Automatically Post-Edit Literary Texts

    Macken, Lieve. Machine Translation Meets Large Language Models: Evaluating C hat GPT ' s Ability to Automatically Post-Edit Literary Texts. Proceedings of the 1st Workshop on Creative-text Translation and Technology. 2024

  16. [17]

    Quality Estimation-Assisted Automatic Post-Editing

    Deoghare, Sourabh and Kanojia, Diptesh and Blain, Fred and Ranasinghe, Tharindu and Bhattacharyya, Pushpak. Quality Estimation-Assisted Automatic Post-Editing. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.115

  17. [18]

    Combining Quality Estimation and Automatic Post-editing to Enhance Machine Translation output

    Chatterjee, Rajen and Negri, Matteo and Turchi, Marco and Blain, Fr \'e d \'e ric and Specia, Lucia. Combining Quality Estimation and Automatic Post-editing to Enhance Machine Translation output. Proceedings of the 13th Conference of the Association for Machine Translation in the A mericas (Volume 1: Research Track). 2018

  18. [19]

    Leveraging GPT -4 for Automatic Translation Post-Editing

    Raunak, Vikas and Sharaf, Amr and Wang, Yiren and Awadalla, Hany and Menezes, Arul. Leveraging GPT -4 for Automatic Translation Post-Editing. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.804

  19. [20]

    doi:10.3115/1073083.1073135 , editor =

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

  20. [21]

    In: Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Hokamp, C., Huck, M., Logacheva, V., Pecina, P

    Popovi \'c , Maja. chr F : character n-gram F -score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. doi:10.18653/v1/W15-3049

  21. [22]

    Deploying MT Quality Estimation on a large scale: Lessons learned and open questions

    Tamchyna, Ale s. Deploying MT Quality Estimation on a large scale: Lessons learned and open questions. Proceedings of Machine Translation Summit XVIII: Users and Providers Track. 2021

  22. [23]

    Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems , pages =

    Coppers, Sven and Van den Bergh, Jan and Luyten, Kris and Coninx, Karin and van der Lek-Ciudin, Iulianna and Vanallemeersch, Tom and Vandeghinste, Vincent , title =. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems , pages =. 2018 , isbn =. doi:10.1145/3173574.3174098 , abstract =

  23. [24]

    MMPE : A M ulti- M odal I nterface for P ost- E diting M achine T ranslation

    Herbig, Nico and D. MMPE : A M ulti- M odal I nterface for P ost- E diting M achine T ranslation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.155

  24. [25]

    MT Quality Estimation for Computer-assisted Translation: Does it Really Help?

    Turchi, Marco and Negri, Matteo and Federico, Marcello. MT Quality Estimation for Computer-assisted Translation: Does it Really Help?. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015. doi:10.3115/v1/P15-2087

  25. [26]

    Informatics , VOLUME =

    Béchara, Hannah and Orăsan, Constantin and Parra Escartín, Carla and Zampieri, Marcos and Lowe, William , TITLE =. Informatics , VOLUME =. 2021 , NUMBER =

  26. [27]

    The Prague Bulletin of Mathematical Linguistics , year=

    Questing for quality estimation a user study , author=. The Prague Bulletin of Mathematical Linguistics , year=

  27. [28]

    Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness

    Liu, Siqi and Dai, Guangrong and Li, Dechao. Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness. Proceedings of Machine Translation Summit XX: Volume 1. 2025

  28. [29]

    The Impact of MT Quality Estimation on Post-Editing Effort

    Teixeira, Carlos and O ' Brien, Sharon. The Impact of MT Quality Estimation on Post-Editing Effort. Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track. 2017

  29. [30]

    Investigating the Helpfulness of Word-Level Quality Estimation for Post-Editing Machine Translation Output

    Shenoy, Raksha and Herbig, Nico and Kr. Investigating the Helpfulness of Word-Level Quality Estimation for Post-Editing Machine Translation Output. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.799

  30. [31]

    Word-Level Quality Estimation for Korean-English Neural Machine Translation , year=

    Eo, Sugyeong and Park, Chanjun and Moon, Hyeonseok and Seo, Jaehyung and Lim, Heuiseok , journal=. Word-Level Quality Estimation for Korean-English Neural Machine Translation , year=

  31. [32]

    Natural Language Engineering , volume=

    Can machine translation systems be evaluated by the crowd alone , author=. Natural Language Engineering , volume=. 2017 , publisher=

  32. [33]

    Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts

    Briakou, Eleftheria and Luo, Jiaming and Cherry, Colin and Freitag, Markus. Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.123

  33. [34]

    Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

    Briva-Iglesias, Vicent. Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication. Proceedings of Machine Translation Summit XX: Volume 1. 2025

  34. [35]

    (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

    Wu, Minghao and Xu, Jiahao and Yuan, Yulin and Haffari, Gholamreza and Wan, Longyue and Luo, Weihua and Zhang, Kaifu. (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.25

  35. [36]

    Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing

    Deoghare, Sourabh and Kanojia, Diptesh and Bhattacharyya, Pushpak. Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). 2025. ...

  36. [37]

    Machine Translation , volume=

    A user study of neural interactive translation prediction , author=. Machine Translation , volume=. 2019 , publisher=

  37. [38]

    New directions in empirical translation process research: exploring the CRITT TPR-DB , pages=

    Learning advanced post-editing , author=. New directions in empirical translation process research: exploring the CRITT TPR-DB , pages=. 2016 , publisher=

  38. [39]

    Human-centered, augmented machine translation: analysing user experience, quality and productivity in interactive post-editing vs traditional post-editing , author=. Tradum

  39. [40]

    Translation, Cognition & Behavior , volume=

    The impact of traditional and interactive post-editing on machine translation user experience, quality, and productivity , author=. Translation, Cognition & Behavior , volume=. 2023 , publisher=

  40. [41]

    2017 , school=

    Productivity in post-editing and in neural interactive translation prediction: A study of English-to-Spanish professional translators , author=. 2017 , school=

  41. [42]

    Translation studies , volume=

    Translators and translation technology: The dance of agency , author=. Translation studies , volume=. 2011 , publisher=

  42. [43]

    Perspectives , volume=

    Human-centered augmented translation: Against antagonistic dualisms , author=. Perspectives , volume=. 2024 , publisher=

  43. [44]

    2024 , school=

    Productivity in the post-editing of neural machine translation: A mixed-methods analysis of speed and edits at Toppan Digital Language , author=. 2024 , school=

  44. [45]

    Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level

    Neves, Mariana and Grozea, Cristian and Thomas, Philippe and Roller, Roland and Bawden, Rachel and N \'e v \'e ol, Aur \'e lie and Castle, Steffen and Bonato, Vanessa and Di Nunzio, Giorgio Maria and Vezzani, Federica and Vicente Navarro, Maika and Yeganova, Lana and Jimeno Yepes, Antonio. Findings of the WMT 2024 Biomedical Translation Shared Task: Test ...

  45. [46]

    Alabau, Vicent, Michael Carl, Francisco Casacuberta, Mercedes Garc \' a Mart \' nez, Jes \'u s Gonz \'a lez-Rubio, Bartolom \'e Mesa-Lao, Daniel Ortiz-Mart \' nez, Moritz Schaeffer, and Germ \'a n Sanchis-Trilles. 2016. Learning advanced post-editing. In New directions in empirical translation process research: exploring the CRITT TPR-DB , pages 95--110. Springer

  46. [47]

    Briakou, Eleftheria, Jiaming Luo, Colin Cherry, and Markus Freitag. 2024. Translating step-by-step: Decomposing the translation process for improved translation quality of long-form texts. In Haddow, Barry, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Ninth Conference on Machine Translation , pages 1301--1317, Miami, Florida, U...

  47. [48]

    Briva-Iglesias, Vicent, Sharon O’Brien, and Benjamin R Cowan. 2023. The impact of traditional and interactive post-editing on machine translation user experience, quality, and productivity. Translation, Cognition & Behavior , 6(1):60--86

  48. [49]

    Briva-Iglesias, Vicent. 2025a. Are AI agents the new machine translation frontier? challenges and opportunities of single- and multi-agent systems for multilingual digital communication. In Bouillon, Pierrette, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Mon...

  49. [50]

    Briva-Iglesias, Vicent. 2025b. Human-centered, augmented machine translation: analysing user experience, quality and productivity in interactive post-editing vs traditional post-editing. Tradumàtica tecnologies de la traducció, (23):350--382

  50. [51]

    Béchara, Hannah, Constantin Orăsan, Carla Parra Escartín, Marcos Zampieri, and William Lowe. 2021. The role of machine translation quality estimation in the post-editing workflow. Informatics, 8(3)

  51. [52]

    Chatterjee, Rajen, Matteo Negri, Marco Turchi, Frédéric Blain, and Lucia Specia. 2018. Combining quality estimation and automatic post-editing to enhance machine translation output. In Cherry, Colin and Graham Neubig, editors, Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track) ...

  52. [53]

    Coppers, Sven, Jan Van den Bergh, Kris Luyten, Karin Coninx, Iulianna van der Lek-Ciudin, Tom Vanallemeersch, and Vincent Vandeghinste. 2018. Intellingo: An intelligible translation environment. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pages 1–13, New York, NY, USA. Association for Computing Machinery

  53. [54]

    Deoghare, Sourabh, Diptesh Kanojia, Fred Blain, Tharindu Ranasinghe, and Pushpak Bhattacharyya. 2023. Quality estimation-assisted automatic post-editing. In Bouamor, Houda, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1686--1698, Singapore, December. Association for Computational Linguistics

  54. [55]

    Deoghare, Sourabh, Diptesh Kanojia, and Pushpak Bhattacharyya. 2025. Giving the old a fresh spin: Quality estimation-assisted constrained decoding for automatic post-editing. In Chiruzzo, Luis, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Huma...

  55. [56]

    Escartín, Carla Parra, Hanna Béchara, and Constantin Orăsan. 2017. Questing for quality estimation a user study. The Prague Bulletin of Mathematical Linguistics

  56. [57]

    Fernandes, Patrick, Daniel Deutsch, Mara Finkelstein, Parker Riley, André Martins, Graham Neubig, Ankush Garg, Jonathan Clark, Markus Freitag, and Orhan Firat. 2023. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Proceedings of the Eighth Conference on Machine Translation, pages 1066--1...

  57. [58]

    Graham, Yvette, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, 23(1):3--30

  58. [59]

    Guerreiro, Nuno M., Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André F. T. Martins. 2023. xcomet: Transparent machine translation evaluation through fine-grained error detection

  59. [60]

    Guerreiro, Nuno M., Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André F. T. Martins. 2024. xcomet: Transparent Machine Translation Evaluation through Fine-grained Error Detection. Transactions of the Association for Computational Linguistics, 12:979--995, 09

  60. [61]

    Kepler, Fabio, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019. OpenKiwi: An open source framework for quality estimation. In Costa-jussà, Marta R. and Enrique Alfonseca, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117--122, Florence...

  61. [62]

    Knowles, Rebecca, Marina Sanchez-Torron, and Philipp Koehn. 2019. A user study of neural interactive translation prediction. Machine Translation, 33(1):135--154

  62. [63]

    Kocmi, Tom and Christian Federmann. 2023a. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Koehn, Philipp, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 768--775, Singapore, December. Association for Computational Linguistics

  63. [64]

    Kocmi, Tom and Christian Federmann. 2023b. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193--203, Tampere, Finland, June. European Association for Machine Translation

  64. [65]

    Kocmi, Tom, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, and Mariya Shmatova. 2024. Error span annotation: A balanced approach for human evaluation of machine translation. In Haddow, Barry, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Ninth Conference on Mach...

  65. [66]

    Liu, Siqi, Guangrong Dai, and Dechao Li. 2025. Introducing quality estimation to machine translation post-editing workflow: An empirical study on its usefulness. In Bouillon, Pierrette, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, and Sara Szoc, editor...

  66. [67]

    Lommel, Arle, Hans Uszkoreit, and Aljoscha Burchardt. 2014. Multidimensional quality metrics (mqm): A framework for declaring and describing translation quality metrics. Revista Tradumàtica: tecnologies de la traducció

  67. [68]

    Lu, Qingyu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, and Dacheng Tao. 2024. Error analysis prompting enables human-like translation evaluation in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8801--8816, Bangkok, Thailand, August. Association for Computational Linguistics

  68. [69]

    Macken, Lieve. 2024. Machine translation meets large language models: Evaluating ChatGPT's ability to automatically post-edit literary texts. In Vanroy, Bram, Marie-Aude Lefer, Lieve Macken, and Paola Ruffo, editors, Proceedings of the 1st Workshop on Creative-text Translation and Technology, pages 65--81, Sheffield, United Kingdom, June. European As...

  69. [70]

    Neves, Mariana, Cristian Grozea, Philippe Thomas, Roland Roller, Rachel Bawden, Aurélie Névéol, Steffen Castle, Vanessa Bonato, Giorgio Maria Di Nunzio, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova, and Antonio Jimeno Yepes. 2024. Findings of the WMT 2024 biomedical translation shared task: Test sets on abstract level. In Haddow, Bar...

  70. [71]

    Olohan, Maeve. 2011. Translators and translation technology: The dance of agency. Translation studies, 4(3):342--357

  71. [72]

    O’Brien, Sharon. 2024. Human-centered augmented translation: Against antagonistic dualisms. Perspectives, 32(3):391--406

  72. [73]

    Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Isabelle, Pierre, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA, July. Association for ...

  73. [74]

    Popović, Maja. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Bojar, Ondřej, Rajan Chatterjee, Christian Federmann, Barry Haddow, Chris Hokamp, Matthias Huck, Varvara Logacheva, and Pavel Pecina, editors, Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392--395, Lisbon, Portugal, September. Assoc...

  74. [75]

    Raunak, Vikas, Amr Sharaf, Yiren Wang, Hany Awadalla, and Arul Menezes. 2023. Leveraging GPT-4 for automatic translation post-editing. In Bouamor, Houda, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12009--12024, Singapore, December. Association for Computational Linguistics

  75. [76]

    Sarti, Gabriele, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, and Arianna Bisazza. 2025. Qe4pe: Word-level quality estimation for human post-editing

  76. [77]

    Shenoy, Raksha, Nico Herbig, Antonio Krüger, and Josef van Genabith. 2021. Investigating the helpfulness of word-level quality estimation for post-editing machine translation output. In Moens, Marie-Francine, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Proces...

  77. [78]

    Teixeira, Carlos and Sharon O’Brien. 2017. The impact of MT quality estimation on post-editing effort. In Yamada, Masaru and Mark Seligman, editors, Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track, pages 142--153, Nagoya Japan, September 18 – September 22

  78. [79]

    Terribile, Silvia. 2024. Productivity in the post-editing of neural machine translation: A mixed-methods analysis of speed and edits at Toppan Digital Language. Ph.D. thesis, The University of Manchester (United Kingdom)

  79. [80]

    Treviso, Marcos, Nuno M Guerreiro, Sweta Agrawal, Ricardo Rei, José Pombal, Tania Vaz, Helena Wu, Beatriz Silva, Daan van Stigt, and André FT Martins. 2024. xtower: A multilingual llm for explaining and correcting translation errors. arXiv preprint arXiv:2406.19482

  80. [81]

    Turchi, Marco, Matteo Negri, and Marcello Federico. 2015. MT quality estimation for computer-assisted translation: Does it really help? In Zong, Chengqing and Michael Strube, editors, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: ...
