A Call for Clarity in Reporting BLEU Scores

Matt Post · 2018 · cs.CL · arXiv 1804.08771

10 Pith papers cite this work. Polarity classification is still indexing.

abstract

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to facilitate this.
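The abstract's central claim, that the same system output can receive different BLEU scores depending on how the text is tokenized, can be illustrated with a small sketch. The code below is not SacreBLEU and not the paper's implementation: it is a simplified single-sentence, single-reference BLEU without smoothing, and the two tokenizers (`split_on_whitespace`, `split_punctuation`) and the example sentences are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Simplified BLEU (0-100): one hypothesis, one reference, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref_tokens) / len(hyp_tokens)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


def split_on_whitespace(text):
    """Hypothetical tokenizer 1: punctuation stays attached to words."""
    return text.split()


def split_punctuation(text):
    """Hypothetical tokenizer 2: punctuation becomes separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "The cat sat on the mat, quietly."
reference = "The cat sat on the mat, very quietly."

score_ws = bleu(split_on_whitespace(hypothesis), split_on_whitespace(reference))
score_punct = bleu(split_punctuation(hypothesis), split_punctuation(reference))

# Same hypothesis, same reference, two different scores.
print(f"whitespace tokenization:  {score_ws:.1f}")
print(f"punctuation tokenization: {score_punct:.1f}")
```

The two printed scores differ by roughly one BLEU point even though the hypothesis and reference strings are identical in both runs. This is only a toy case, but it shows why a score is hard to interpret unless the tokenization applied to the reference is reported alongside it.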

representative citing papers

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

cs.CL · 2018-08-19 · accept · novelty 7.0

SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

cs.CL · 2019-06-28 · conditional · novelty 6.0

Gated lexical shortcut connections added to the transformer yield an average gain of 0.9 BLEU across five WMT directions while reducing the lexical content stored in hidden states.

AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

cs.CL · 2026-04-11 · unverdicted · novelty 6.0

AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocation and related tasks, supported by the DecaTARA benchmark of 67,941 videos.

Root Mean Square Layer Normalization

cs.LG · 2019-10-16 · conditional · novelty 5.0

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

cs.CL · 2019-06-20 · unverdicted · novelty 5.0

LASER sentence embeddings are applied directly to filter parallel corpora, achieving the best BLEU scores in the WMT19 low-resource tasks for Nepali-English and Sinhala-English by margins of 1.3 and 1.4.

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

cs.CL · 2026-05-20 · conditional · novelty 4.0

Development of domain-specific scientific corpora for English-Spanish, English-French, and English-Portuguese and their application to fine-tuning NMT models.

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

cs.SE · 2026-04-16 · unverdicted · novelty 4.0

CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.

Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report

cs.CL · 2019-06-19 · unverdicted · novelty 3.0

Baidu-OSU WMT19 system achieves >10 BLEU gain on En-Fr and Fr-En social media translation via domain sensitive training and pseudo noisy sources.

citing papers explorer

Language Models are Few-Shot Learners
cs.CL · 2020-05-28 · accept · none · ref 61

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG · 2019-10-23 · unverdicted · none · ref 56

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
cs.CL · 2018-08-19 · accept · none · ref 10

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
cs.CL · 2019-06-28 · conditional · none · ref 19 · internal anchor

AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
cs.CL · 2026-04-11 · unverdicted · none · ref 26

Root Mean Square Layer Normalization
cs.LG · 2019-10-16 · conditional · none · ref 21 · internal anchor

Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
cs.CL · 2019-06-20 · unverdicted · none · ref 17 · internal anchor

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
cs.CL · 2026-05-20 · conditional · none · ref 23 · internal anchor

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
cs.SE · 2026-04-16 · unverdicted · none · ref 48

Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report
cs.CL · 2019-06-19 · unverdicted · none · ref 18 · internal anchor
