Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

Bach Phan-Tat; Dirk Geeraerts; Dirk Speelman; Kris Heylen; Stefano De Pascale

arxiv: 2604.13232 · v2 · pith:IZ5WUD72new · submitted 2026-04-14 · 💻 cs.CL

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords lexical semantic change detection · SemEval-2020 Task 1 · benchmark evaluation · data quality · corpus preprocessing · semantic change · evaluation framework · shared task limitations

The pith

SemEval-2020 Task 1 for lexical semantic change detection has narrow definitions of change, corpus preprocessing errors, and limited target sets that make it a partial rather than definitive benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the SemEval-2020 shared task through three lenses: how change is operationalised, the quality of the underlying data, and the overall benchmark structure. It shows that the task frames semantic change mainly as discrete sense gain, loss, or redistribution, while the corpora contain OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS errors, and missed targets. The small curated target lists and narrow language coverage further limit statistical reliability and realism. Because of these combined issues, the benchmark cannot serve as a complete measure of progress in the field.

Core claim

The authors argue that SemEval-2020 Task 1 models semantic change primarily through gain, loss, or redistribution of discrete senses, yet its corpora suffer from OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, while its small curated target sets and limited language coverage reduce realism and increase statistical uncertainty; taken together these limitations indicate that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress.

What carries the argument

The three-part evaluative framework that assesses operationalisation of semantic change, data quality of the corpora, and benchmark design choices such as target selection and language coverage.

If this is right

Future shared tasks must adopt broader theories that include gradual, constructional, collocational, and discourse-level change.
All preprocessing steps must be documented transparently so that downstream analysis remains reproducible.
Target sets must be expanded and cross-linguistic coverage increased to improve statistical power and realism.
Evaluation settings should move toward more naturalistic corpora and annotation schemes that reflect actual usage patterns.
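
To make the statistical-power point concrete, the sketch below approximates a 95% confidence interval for a Spearman correlation at two target-set sizes. It assumes the Fisher z-transform with the 1.06/(n-3) variance approximation often used for rank correlations; the function name `spearman_ci`, the correlation of 0.5, and the sizes 40 and 400 are illustrative choices, not figures from the paper.

```python
import math


def spearman_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a Spearman correlation via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 targets")
    z = math.atanh(rho)
    # Assumed variance approximation for rank correlations: 1.06 / (n - 3).
    se = math.sqrt(1.06 / (n - 3))
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical target-set sizes and a hypothetical observed correlation.
for n_targets in (40, 400):
    low, high = spearman_ci(0.5, n_targets)
    print(f"n={n_targets}: rho=0.50, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions, the same observed correlation yields an interval roughly three times as wide for the smaller target set, which is the kind of uncertainty the paper attributes to small curated target lists.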

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing published results that rely on this benchmark as ground truth may need re-examination once cleaned data versions become available.
Similar data-quality audits could be applied to other lexical change or diachronic NLP benchmarks that use historical corpora.
Model developers might prioritise robustness tests against common OCR and tokenisation artifacts when training on historical text.
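
One simple form such a robustness test could take is to inject synthetic OCR-style noise into clean text and compare model outputs at increasing noise rates. The sketch below is a minimal, assumption-laden version: the confusion table `OCR_CONFUSIONS` and the function `inject_ocr_noise` are hypothetical, and real OCR errors in historical corpora are more varied than single-character swaps.

```python
import random

# Hypothetical confusion pairs loosely modelled on common OCR mix-ups.
OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "s": "f", "m": "rn"}


def inject_ocr_noise(text: str, rate: float, seed: int = 0) -> str:
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


clean = "the plane landed safely on the smooth plain"
for rate in (0.0, 0.1, 0.3):
    print(rate, inject_ocr_noise(clean, rate))
```

Feeding the noised sentences to a change-detection system and tracking how its scores drift with `rate` would give a rough robustness curve.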

Load-bearing premise

That the listed corpus and preprocessing problems substantially distort model behaviour, complicate linguistic analysis, and reduce reproducibility.

What would settle it

Re-running the original participating systems on versions of the same corpora after documented cleaning of OCR errors, consistent lemmatisation, and full sentence restoration, then checking whether model rankings and gold-label correlations stay materially unchanged.
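
The comparison described here can be sketched as a rank-stability check: score each system on the original and the cleaned corpus, then correlate the two sets of scores. The code below is a minimal sketch using only the standard library; the system names and scores are invented placeholders, not results from the shared task.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-system scores before and after corpus cleaning (not real results).
original = {"sys_a": 0.42, "sys_b": 0.39, "sys_c": 0.31, "sys_d": 0.18}
cleaned = {"sys_a": 0.40, "sys_b": 0.44, "sys_c": 0.33, "sys_d": 0.21}

systems = sorted(original)
stability = spearman([original[s] for s in systems], [cleaned[s] for s in systems])
print(f"rank stability across cleaning: {stability:.2f}")
```

A stability value near 1 would suggest the data problems leave the ranking largely intact; a clearly lower value would support the paper's concern. The function does not handle constant inputs, where the correlation is undefined.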

Original abstract

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gives a clear, structured rundown of why SemEval-2020 Task 1 is narrower and noisier than it looks, but it lists data problems without measuring how much they actually shift results or rankings.

The main thing here is that SemEval-2020 Task 1 treats semantic change mostly as sense gain or loss, and the paper shows how that plus messy corpora and small target sets make it a limited benchmark. That three-part breakdown (operationalisation, data quality, design) pulls together points that have floated around separately and puts them in one place with a call for better future tasks. That's the useful part: it gives people a ready checklist when they build the next shared task or dataset.

The authors are right that the gold labels depend on annotation choices and thresholds, and they flag real preprocessing headaches like OCR errors, truncated sentences, and inconsistent lemmatisation across the corpora. Those are the kinds of details that make reproducibility harder in historical linguistics work.

What is missing is any count of how common the problems are or any test showing that fixing them changes model scores or correlations. The abstract says the issues “can distort” behaviour, but without prevalence numbers or a before-after comparison on the same systems, the claim stays at the level of “these are bad” rather than “these move the rankings by X.” That gap keeps the central recommendation, that the benchmark should be treated as partial, from landing as hard as it could.

The paper is aimed at anyone running or designing lexical semantic change experiments who wants to avoid over-interpreting the 2020 results. It is not a methods paper with new algorithms, so it will not change what people implement tomorrow, but it can shape how they evaluate. I would send it to peer review because the critique is grounded in the actual task setup and points to concrete next steps; referees can ask for the missing quantification without rejecting the core argument.

Referee Report

2 major / 2 minor

Summary. This paper critiques SemEval-2020 Task 1 for lexical semantic change detection by examining its operationalisation of semantic change, data quality issues in the corpora, and limitations in benchmark design. It claims that the discrete sense-based approach is too narrow, that preprocessing problems like OCR noise and tagging errors distort results, and that small target sets reduce realism, recommending the benchmark be viewed as partial rather than definitive.

Significance. If the problems identified are shown to affect model performance, the paper's significance lies in providing a detailed evaluation framework that could lead to improved benchmarks in lexical semantic change detection. It emphasizes the need for broader theories of change and better data practices, which are crucial for the field's progress toward more interpretable and generalizable results.

major comments (2)

[Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.
[Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.
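
The dependence of gold labels on threshold settings, raised in the second major comment, can be illustrated with a toy example. The sketch below assumes a binary rule of the general form "some sense is nearly absent in one period and clearly present in the other", alongside a graded score based on Jensen-Shannon distance between sense distributions. The rule, the thresholds `k` and `n`, and the sense counts are all illustrative assumptions; the text above does not give the task's exact procedure.

```python
import math


def binary_change(counts_t1, counts_t2, k, n):
    """Hypothetical rule: a word 'changed' if some sense has at most k uses
    in one period and at least n uses in the other."""
    for a, b in zip(counts_t1, counts_t2):
        if (a <= k and b >= n) or (b <= k and a >= n):
            return True
    return False


def jensen_shannon_distance(counts_t1, counts_t2):
    """Graded change: JS distance (base 2) between normalised sense distributions."""
    p = [c / sum(counts_t1) for c in counts_t1]
    q = [c / sum(counts_t2) for c in counts_t2]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


# Made-up sense counts for one target word in two time periods.
t1, t2 = [48, 2], [40, 10]

print(binary_change(t1, t2, k=2, n=5))  # one threshold setting
print(binary_change(t1, t2, k=1, n=5))  # a slightly stricter setting
print(round(jensen_shannon_distance(t1, t2), 3))
```

With identical usage data, moving `k` from 2 to 1 flips the binary label, while the graded score is unaffected. This is the sense in which a gold label can be an outcome of threshold choice rather than of the data alone.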

minor comments (2)

[Abstract] The term 'bench-mark' appears with a hyphen in the abstract; standardize to 'benchmark' throughout the manuscript.
[Data Quality] The paper would benefit from a summary table listing the specific preprocessing problems found in each language/corpus and their potential effects.
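
The prevalence statistics requested in the first major comment, and the rows of the summary table suggested here, could start from simple per-corpus counts. The sketch below is one hedged way to do that with heuristic checks; the function names, the truncation heuristic, and the four-sentence sample are assumptions for illustration, and the heuristics would need validation against manual inspection.

```python
import re
import unicodedata

# Heuristic: a complete sentence ends in ., ! or ? (optionally followed by spaces).
SENTENCE_END = re.compile(r"[.!?]\s*$")


def has_malformed_chars(sentence: str) -> bool:
    """Flag the Unicode replacement character and control characters (tabs included)."""
    return any(ch == "\ufffd" or unicodedata.category(ch) == "Cc" for ch in sentence)


def looks_truncated(sentence: str) -> bool:
    """Flag sentences that lack sentence-final punctuation."""
    return SENTENCE_END.search(sentence) is None


def prevalence(sentences):
    """Share of sentences flagged by each heuristic check."""
    total = len(sentences)
    return {
        "malformed_chars": sum(map(has_malformed_chars, sentences)) / total,
        "truncated": sum(map(looks_truncated, sentences)) / total,
    }


# Toy sample (invented); a real audit would read each corpus file.
sample = [
    "The ship left the harbour at dawn.",
    "He was then appointed to the",
    "A gr\ufffdat many people attended.",
    "Nothing more was heard of it.",
]
for name, share in prevalence(sample).items():
    print(f"{name}: {share:.0%}")
```

Running the same function over each language's corpus would give one table row per corpus. If the released text has had punctuation stripped during preprocessing, the truncation check would have to run on the raw source instead.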

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our discussion paper. We address each major comment below, indicating planned revisions to improve the manuscript.

Referee: [Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.

Authors: We agree that quantitative prevalence statistics would strengthen our data quality critique. In the revised manuscript, we will add estimates of error rates based on our manual inspections (e.g., proportions of truncated sentences and OCR artifacts per corpus). However, performing comprehensive empirical tests of the impact on all participating models' scores and rankings would require re-running the full SemEval-2020 evaluation pipeline, which is outside the scope of this discussion paper. We will instead include illustrative case studies showing how specific preprocessing errors alter model predictions for selected target words.

revision: partial

Referee: [Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.

Authors: We welcome this request for greater specificity. The revised Operationalisation section will describe the SemEval-2020 Task 1 annotation guidelines, the clustering algorithm applied to contextual embeddings (including parameter settings and sense inventory construction), and the exact thresholds used for binary change classification. We will also add concrete examples, such as gradual connotational shifts in words like 'gay' that are not captured by discrete sense addition or loss, and constructional changes (e.g., shifts in verb argument structure) that affect discourse patterns without triggering sense-level change detection.

revision: yes

Circularity Check

0 steps flagged

No circularity: qualitative critique with no derivations or self-referential reductions.

The paper is a discussion critique of SemEval-2020 Task 1, structured around operationalisation, data quality, and benchmark design arguments. It lists concrete corpus issues (OCR noise, truncated sentences, inconsistent lemmatisation) and argues for treating the benchmark as partial, but contains no equations, fitted parameters, predictions, or derivation chains. No self-citations are load-bearing for the central claims, and the reasoning relies on external observations of the benchmark rather than any reduction to the paper's own inputs by construction. This is the expected non-finding for a non-mathematical critique paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the nature of semantic change and standards for data integrity rather than new parameters or entities.

axioms (2)

domain assumption Semantic change includes gradual, constructional, collocational, and discourse-level phenomena beyond discrete sense gain, loss, or redistribution.
Invoked when arguing that the benchmark's operationalisation is too narrow.
domain assumption Corpus and preprocessing defects such as OCR noise and inconsistent lemmatisation materially distort model outputs and reduce reproducibility.
Central premise of the data-quality critique.

pith-pipeline@v0.9.0 · 5593 in / 1526 out tokens · 76662 ms · 2026-05-10T15:31:06.988162+00:00 · methodology
