The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation

Fernandes, Patrick; Deutsch, Daniel; Finkelstein, Mara; Riley, Parker; Neubig, Graham · 2023 · DOI 10.18653/v1/2023.wmt-1.100

4 Pith papers cite this work. Polarity classification is still indexing.



representative citing papers

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

cs.CL · 2026-05-15 · unverdicted · novelty 5.0

Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.

Smarter edits? Post-editing with error highlights and translation suggestions

cs.CL · 2026-05-20 · unverdicted · novelty 4.0

User study with professional En-Nl translators found LLM-based error highlights and APE correction suggestions did not improve productivity or quality over standard post-editing but were better received and enhanced user experience.

citing papers explorer

Showing 4 of 4 citing papers.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · none · ref 34

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · none · ref 96

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

cs.CL · 2026-05-15 · unverdicted · none · ref 12

Smarter edits? Post-editing with error highlights and translation suggestions

cs.CL · 2026-05-20 · unverdicted · none · ref 10
