A Call for Clarity in Reporting BLEU Scores
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to facilitate this.
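To make the abstract's claim concrete, here is a minimal, self-contained sketch of unsmoothed single-segment BLEU (this is not sacreBLEU, and the tokenizers and example sentences are invented for illustration). Two reasonable tokenization choices applied to the same hypothesis/reference pair yield noticeably different scores:

```python
import math
import re
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Unsmoothed BLEU (0-100) for one segment pair; illustration only."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ng = ngrams(hyp_tokens, n)
        ref_ng = ngrams(ref_tokens, n)
        clipped = sum(min(count, ref_ng[g]) for g, count in hyp_ng.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / sum(hyp_ng.values()))
    # brevity penalty for hypotheses shorter than the reference
    bp = min(1.0, math.exp(1.0 - len(ref_tokens) / len(hyp_tokens)))
    return 100.0 * bp * math.exp(log_precision_sum / max_n)

# Two plausible "reference processing" schemes (both hypothetical):
tokenize_ws = str.split                                   # split on whitespace only
tokenize_punct = lambda s: re.findall(r"\w+|[^\w\s]", s)  # also split off punctuation

hyp = "A quick brown fox jumps over the lazy dog."
ref = "The quick brown fox jumped over the lazy dog."

score_ws = bleu(tokenize_ws(hyp), tokenize_ws(ref))
score_punct = bleu(tokenize_punct(hyp), tokenize_punct(ref))
assert score_ws != score_punct  # same texts, same metric, different scores
```

On this pair the two schemes differ by several BLEU points, which is the paper's point: unless the tokenization and normalization applied to the reference are fixed and reported, "the" BLEU score is underspecified. sacreBLEU removes this degree of freedom by applying its own fixed reference processing and emitting a version string that records the parameters used.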
Forward citations
Cited by 5 Pith papers
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting, without fine-tuning.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification, and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
- AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
  AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocation.
- Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
  CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.