pith. machine review for the scientific record.

arxiv: 2602.19016 · v2 · submitted 2026-02-22 · 💻 cs.HC

Recognition: no theorem link

CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:57 UTC · model grok-4.3

classification 💻 cs.HC
keywords professional translation · multi-agent AI · human-AI collaboration · cognitive effort · translation quality · MQM theory · mixed-initiative systems · effort-aware design

The pith

A multi-agent AI system for professional translators reduces completion time by 33.8 percent, lowers cognitive effort, and improves final quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CHORUS, a mixed-initiative system that uses multiple AI agents to guide professional translators through their full workflow while matching individual styles. A formative study identified the value of MQM theory for spotting translation issues and the need for the system to adapt to each user's unique traits. In a within-subject experiment with 30 licensed English-Chinese translators, the system produced the reported gains in speed, reduced mental load, and higher BLEU and COMET scores. The work targets the gap where generic AI tools fail to support accountability and detailed process needs in high-stakes translation. These outcomes show how multi-agent designs can fit into expert human routines rather than replace them.

Core claim

CHORUS is a mixed-initiative translation system that incorporates MQM theory to support the translation process and personal style as translators work. Formative findings established the benefit of MQM and the requirement to adapt to each translator's idiosyncratic traits. The within-subject study with 30 licensed English-Chinese translators found that the system reduced completion time by 33.8%, lowered translators' cognitive effort, and improved final translation quality according to BLEU and COMET metrics. Participants reported that issues became easier to inspect, repeated prompting dropped relative to single-agent systems, and the interface offered reflections on their habits.
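The quality claim rests on automatic metrics. As a rough illustration of what BLEU measures (clipped n-gram precision combined with a brevity penalty), here is a minimal single-sentence sketch; it is illustrative only, not the paper's evaluation code. Real evaluations use sacrebleu for BLEU and a neural metric such as COMET.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. Pedagogical sketch only."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0.0:
        return 0.0  # real BLEU applies smoothing instead of zeroing out
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean
```

An exact match scores 1.0 and a hypothesis sharing no words with the reference scores 0.0; everything in between depends on how many n-grams survive editing, which is part of why the referee questions BLEU's sensitivity to stylistic quality.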

What carries the argument

Multi-agent AI architecture that applies MQM theory for issue detection, adapts interfaces to individual translator traits, and minimizes repeated prompting while providing habit reflections.
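The paper does not expose CHORUS's internals here, but the MQM framework it builds on scores a translation by annotating issues and weighting them by severity. A hypothetical sketch of that scoring convention (the `MQMIssue` record, the weight table, and `mqm_score` are illustrative names, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass
class MQMIssue:
    category: str  # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str  # "minor", "major", or "critical"
    span: tuple    # (start, end) character offsets in the draft

# Commonly used MQM-style severity weights; real deployments tune these.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(issues, word_count, per=100):
    """Penalty points per `per` words; lower is better."""
    penalty = sum(SEVERITY_WEIGHTS[i.severity] for i in issues)
    return penalty * per / word_count

issues = [
    MQMIssue("accuracy/mistranslation", "major", (0, 4)),
    MQMIssue("fluency/grammar", "minor", (10, 15)),
]
score = mqm_score(issues, word_count=200)  # 6 points per 200 words -> 3.0
```

Agents that surface records like these at specific spans would explain the qualitative finding that issues became easier to inspect.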

If this is right

  • Translators complete tasks in roughly two-thirds the time while maintaining or raising quality.
  • Cognitive effort drops, which may reduce fatigue during extended professional sessions.
  • Final outputs achieve higher automatic metric scores on BLEU and COMET.
  • Translation issues become easier to locate and address during the workflow.
  • Repeated prompting to AI agents decreases, freeing attention for higher-level decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-agent structures could transfer to other expert domains that require process accountability, such as legal editing or technical documentation.
  • Trait adaptation suggests that future tools could track and reflect user patterns across sessions to build long-term personalization.
  • Lowered effort might enable translators to accept higher volumes or more complex projects without quality loss.
  • Reduced prompting repetition points to efficiency gains in any human-AI loop where iteration currently consumes time.

Load-bearing premise

Incorporating MQM theory improves professional translation outcomes, and the system can effectively adapt to each translator's idiosyncratic traits as identified in the formative study.

What would settle it

A replication study with licensed translators that finds no significant reduction in completion time, no drop in reported cognitive effort, and no improvement in BLEU or COMET scores when CHORUS is compared to standard single-agent translation tools.

Figures

Figures reproduced from arXiv: 2602.19016 by George X. Wang, Guande Wu, Jing Qian, and Jiaqian Hu.

Figure 1. CHORUS is a translation system designed for professional translators using multi-agent collaboration.

Figure 2. System flow chart of the CHORUS workflow integrating context, draft, goals, agents, revisions, effort, and memory.

Figure 3. Overview of CHORUS. Source Context Panel (A) shows the source and initial translation; the Active Editing panel (B) …

Figure 5. Task-order trajectories for completion time and live …

Figure 6. NASA-TLX scores for CHORUS and Baseline.

Figure 7. Task-order trends in translation-quality metrics.

Figure 8. Human expert ratings of whether translation quality …

Figure 9. Participant responses to perception questions.
Original abstract

Despite the widespread use of automatic AI translation systems in daily language tasks, professional translation remains crucial in domain-specific and high-stakes scenarios. Yet professional translators rarely rely on these systems in their everyday practice due to a lack of detailed support for the translation process, matching professional styles, and accountability for the final outcome. To bridge the gap, we present CHORUS, a mixed-initiative translation system that supports the translation process and personal style as translators work. A formative study found that incorporating MQM theory may be beneficial for achieving professional translation, and that the system should adapt to each individual translator's idiosyncratic traits. The final within-subject study with 30 licensed English-Chinese translators found that our system reduced completion time by 33.8%, lowered translators' cognitive effort, and improved final translation quality using BLEU and COMET as automatic evaluation metrics. Qualitative analysis of participant feedback also revealed that the system made translation issues easier to inspect, reduced repeated prompting compared to single-agent AI systems, and offered reflections on their habits and traits. Our findings illustrate how multi-agent AI systems can be designed to support expert workflows and their potential for professional use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CHORUS, a mixed-initiative multi-agent system for professional English-to-Chinese translation. Drawing on a formative study that identified the benefits of MQM theory and the need for personalization, the authors conduct a within-subject study with 30 licensed translators showing that CHORUS reduces completion time by 33.8%, lowers cognitive effort, and improves quality per BLEU and COMET metrics, along with qualitative benefits in issue inspection and reduced prompting.

Significance. If the empirical findings are robustly supported, this research would contribute meaningfully to HCI by illustrating how multi-agent AI can be integrated into expert professional workflows, offering time savings and reduced effort in high-stakes translation. The focus on licensed professionals and adaptation to individual traits adds practical relevance.

major comments (2)
  1. [Within-subject evaluation] The improvement in final translation quality is asserted based solely on automatic BLEU and COMET scores. However, the formative study highlighted MQM theory as beneficial, yet no MQM or human expert scoring is reported in the evaluation to validate the quality gains. Given that BLEU and COMET may not fully capture stylistic and domain-specific accuracy in professional English-Chinese translation, this undermines the central claim of quality improvement.
  2. [Results section] The abstract and reported results lack statistical details such as p-values, effect sizes, or descriptions of baselines and controls for the within-subject study with 30 participants. This makes it challenging to evaluate the significance of the 33.8% time reduction and other outcomes.
minor comments (2)
  1. [Abstract] The abstract mentions 'lowered translators' cognitive effort' without specifying the measurement method (e.g., NASA-TLX or similar).
  2. [Formative study] More details on how the system adapts to idiosyncratic traits could be provided to clarify the implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps us clarify the evaluation approach and strengthen the presentation of results. We address each major comment below with our response and planned revisions.

Point-by-point responses
  1. Referee: [Within-subject evaluation] The improvement in final translation quality is asserted based solely on automatic BLEU and COMET scores. However, the formative study highlighted MQM theory as beneficial, yet no MQM or human expert scoring is reported in the evaluation to validate the quality gains. Given that BLEU and COMET may not fully capture stylistic and domain-specific accuracy in professional English-Chinese translation, this undermines the central claim of quality improvement.

    Authors: We acknowledge that BLEU and COMET have limitations in capturing stylistic nuances and domain-specific accuracy for professional English-to-Chinese translation. The formative study used MQM insights to guide system design (e.g., issue detection support), but the main evaluation with 30 participants prioritized scalable automatic metrics alongside qualitative feedback on easier issue inspection and reduced prompting. We will revise the manuscript to add an explicit discussion of metric limitations, cite relevant literature on their correlation with human judgments in translation, and qualify the quality claims as supported by both automatic scores and participant reports. This will appear in the Results and Limitations sections. revision: partial

  2. Referee: [Results section] The abstract and reported results lack statistical details such as p-values, effect sizes, or descriptions of baselines and controls for the within-subject study with 30 participants. This makes it challenging to evaluate the significance of the 33.8% time reduction and other outcomes.

    Authors: We agree that the abstract and high-level results summary should include these details for clarity. The full Results section already reports paired t-tests, p-values, and effect sizes (e.g., for completion time and cognitive effort) with the single-agent AI condition as the within-subject baseline. We will revise the abstract to incorporate key statistical information and ensure the results summary explicitly describes the tests, effect sizes, and controls. These updates will be made in the next version. revision: yes
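The statistics the rebuttal points to, paired t-tests with effect sizes for a within-subject design, can be sketched as follows. The per-translator numbers and the helper name are hypothetical; a real analysis would obtain the p-value from scipy.stats.ttest_rel or a t distribution table.

```python
import math
from statistics import mean, stdev

def paired_t_and_effect(before, after):
    """Paired t statistic and Cohen's d_z for a within-subject comparison.
    d_z = mean(difference) / SD(difference); t = d_z * sqrt(n)."""
    diffs = [b - a for b, a in zip(before, after)]
    d_z = mean(diffs) / stdev(diffs)
    t = d_z * math.sqrt(len(diffs))
    return t, d_z

# Hypothetical per-translator completion times (minutes), baseline vs. CHORUS
baseline = [30.0, 42.0, 35.0, 28.0, 39.0, 33.0]
chorus   = [21.0, 27.0, 24.0, 19.0, 25.0, 22.0]
t_stat, effect = paired_t_and_effect(baseline, chorus)
```

With n = 30 participants, reporting t, the p-value, and d_z alongside the 33.8% figure would address the referee's request directly.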

Circularity Check

0 steps flagged

No circularity in empirical user study

full rationale

The paper reports direct measurements from a within-subject study with 30 licensed translators (completion time reduced 33.8%, cognitive effort lowered, quality via BLEU/COMET) and a prior formative study. No equations, fitted parameters, derivations, or self-citation chains exist that reduce any claim to its inputs by construction. Outcomes are independent participant data, not statistical artifacts or renamed fits. This is a standard empirical HCI paper whose central results stand on external human-subject evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions drawn from the formative study: that MQM theory improves professional translation outcomes and that per-translator adaptation is feasible and useful. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption MQM theory is beneficial for achieving professional translation quality
    Stated as a finding from the formative study that guided system design
  • domain assumption The system should adapt to each individual translator's idiosyncratic traits
    Stated as a requirement identified in the formative study

pith-pipeline@v0.9.0 · 5507 in / 1319 out tokens · 36175 ms · 2026-05-15T20:57:01.702038+00:00 · methodology

discussion (0)

