The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation
4 Pith papers cite this work.
fields
cs.CL 4
citing papers explorer
- Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
  Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
  Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
- Smarter edits? Post-editing with error highlights and translation suggestions
  User study with professional En-Nl translators found LLM-based error highlights and APE correction suggestions did not improve productivity or quality over standard post-editing but were better received and enhanced user experience.