How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation

Amr Hendy; Amr Sharaf; Hany Hassan Awadalla; Hitokazu Matsushita; Mohamed Abdelrehim; Mohamed Afify; Mohamed Gabr; Vikas Raunak; Young Jin Kim

arxiv: 2302.09210 · v1 · pith:2BYHZEPWnew · submitted 2023-02-18 · 💻 cs.CL


classification 💻 cs.CL

keywords translation, models, comprehensive, evaluation, languages, machine, quality, resource


Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for natural language generation, but their performance for machine translation has not been thoroughly investigated. In this paper, we present a comprehensive evaluation of GPT models for machine translation, covering the quality of different GPT models in comparison with state-of-the-art research and commercial systems, the effect of prompting strategies, robustness to domain shifts, and document-level translation. We experiment with eighteen translation directions involving high- and low-resource languages, as well as non-English-centric translations, and evaluate the performance of three GPT models: ChatGPT, GPT3.5 (text-davinci-003), and text-davinci-002. Our results show that GPT models achieve very competitive translation quality for high-resource languages, while having limited capabilities for low-resource languages. We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance translation quality. We perform comprehensive analysis and human evaluation to further understand the characteristics of GPT translations. We hope that our paper provides valuable insights for researchers and practitioners in the field and helps to better understand the potential and limitations of GPT models for translation.



Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
cs.CL 2026-05 accept novelty 7.0

Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
cs.CL 2024-06 accept novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
cs.CL 2026-06 unverdicted novelty 6.0

A masked-token hit-rate comparison method detects pretraining data membership in black-box LLMs with performance comparable to white-box approaches.

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
cs.CL 2026-05 unverdicted novelty 6.0

A multi-reference audit framework for LLM translations of the Pali Canon uses embedding drift from a human reference centroid to triage candidates for LLM-judge adjudication, showing drift correlates with major error ...

Evaluating Chinese Ambiguity Understanding in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Introduces the CHA-Gen dataset for Chinese ambiguity based on Potential Ambiguity Theory and shows LLMs struggle to detect ambiguity, exhibiting specific failure modes and overconfidence after instruction tuning.

RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
cs.CL 2026-04 unverdicted novelty 6.0

RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.

Low-Resource Languages Jailbreak GPT-4
cs.CL 2023-10 conditional novelty 6.0

Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability
cs.CL 2026-06 unverdicted novelty 5.0

LLM translation quality reaches acceptable human scores for Hausa but remains poor for Fongbe, automatic metrics show weak human correlation especially for Hausa due to neural embedding collapse, and at least 2,500 se...

When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP
cs.CL 2026-04 unverdicted novelty 5.0

Data augmentation via LLMs and back-translation produces task-specific effects on NER and POS tagging for Hausa and Fongbe, with no consistent gains over baseline and opposite outcomes across tasks for the same synthe...

Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe
cs.CL 2026-04 unverdicted novelty 5.0

GPT-4o Mini extracts 6-41 times more usable Hausa and Fongbe text per API call than Gemini 2.5 Flash, with optimal elicitation strategies differing by language.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
cs.CL 2023-05 conditional novelty 5.0

Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

Benchmark Data Contamination of Large Language Models: A Survey
cs.CL 2024-06 unverdicted novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.