Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection
Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3
The pith
SemEval-2020 Task 1 for lexical semantic change detection has narrow definitions of change, corpus preprocessing errors, and limited target sets that make it a partial rather than definitive benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors argue that SemEval-2020 Task 1 models semantic change primarily as the gain, loss, or redistribution of discrete senses. Its corpora suffer from OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, and its small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations indicate that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress.
What carries the argument
The three-part evaluative framework that assesses operationalisation of semantic change, data quality of the corpora, and benchmark design choices such as target selection and language coverage.
If this is right
- Future shared tasks must adopt broader theories that include gradual, constructional, collocational, and discourse-level change.
- All preprocessing steps must be documented transparently so that downstream analysis remains reproducible.
- Target sets must be expanded and cross-linguistic coverage increased to improve statistical power and realism.
- Evaluation settings should move toward more naturalistic corpora and annotation schemes that reflect actual usage patterns.
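The statistical-power point can be made concrete with a rough sketch. It assumes systems are scored by a rank correlation against gold labels, and it uses the Fisher z-transform with a Pearson-style standard error, which is only an approximation for Spearman's rho. The target counts (40 and 400) and the correlation of 0.5 are illustrative values, not figures from the paper.

```python
import math


def spearman_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a rank correlation.

    Uses the Fisher z-transform with standard error 1 / sqrt(n - 3).
    This is a rough guide for Spearman's rho, not an exact interval.
    """
    if n <= 3:
        raise ValueError("need more than 3 target words")
    z = math.atanh(rho)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Illustrative only: the same observed correlation with a small and a large target set.
for n_targets in (40, 400):
    low, high = spearman_ci(0.5, n_targets)
    print(f"n={n_targets}: approx. 95% CI = [{low:.2f}, {high:.2f}]")
```

With a few dozen targets the interval is wide enough that two systems with visibly different scores may not be distinguishable, which is the uncertainty the bullet above refers to.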
Where Pith is reading between the lines
- Existing published results that rely on this benchmark as ground truth may need re-examination once cleaned data versions become available.
- Similar data-quality audits could be applied to other lexical change or diachronic NLP benchmarks that use historical corpora.
- Model developers might prioritise robustness tests against common OCR and tokenisation artifacts when training on historical text.
Load-bearing premise
That the listed corpus and preprocessing problems substantially distort model behaviour, complicate linguistic analysis, and reduce reproducibility.
What would settle it
Re-running the original participating systems on versions of the same corpora after documented cleaning of OCR errors, consistent lemmatisation, and full sentence restoration, then checking whether model rankings and gold-label correlations stay materially unchanged.
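The ranking comparison described above could be sketched as follows. This is a minimal illustration, assuming each system has one score on the original corpora and one on the cleaned corpora. The system names and scores are invented, and `ranking_stability` is a hypothetical helper rather than part of any released evaluation code.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) to n, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranking_stability(scores_original, scores_cleaned):
    """Spearman correlation between system scores before and after cleaning."""
    if set(scores_original) != set(scores_cleaned):
        raise ValueError("both runs must cover the same systems")
    systems = sorted(scores_original)
    a = [scores_original[s] for s in systems]
    b = [scores_cleaned[s] for s in systems]
    return pearson(average_ranks(a), average_ranks(b))


# Invented scores for illustration only.
original = {"sys_a": 0.42, "sys_b": 0.37, "sys_c": 0.31, "sys_d": 0.28}
cleaned = {"sys_a": 0.44, "sys_b": 0.33, "sys_c": 0.35, "sys_d": 0.27}
print(ranking_stability(original, cleaned))
```

A value close to 1 would suggest that cleaning leaves the ranking materially unchanged; a clearly lower value would support the load-bearing premise.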
Original abstract
This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper critiques SemEval-2020 Task 1 for lexical semantic change detection by examining its operationalisation of semantic change, data quality issues in the corpora, and limitations in benchmark design. It claims that the discrete sense-based approach is too narrow, that preprocessing problems like OCR noise and tagging errors distort results, and that small target sets reduce realism, recommending the benchmark be viewed as partial rather than definitive.
Significance. If the problems identified are shown to affect model performance, the paper's significance lies in providing a detailed evaluation framework that could lead to improved benchmarks in lexical semantic change detection. It emphasizes the need for broader theories of change and better data practices, which are crucial for the field's progress toward more interpretable and generalizable results.
major comments (2)
- [Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.
- [Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.
minor comments (2)
- [Abstract] The term 'bench-mark' appears with a hyphen in the abstract; standardize to 'benchmark' throughout the manuscript.
- [Data Quality] The paper would benefit from a summary table listing the specific preprocessing problems found in each language/corpus and their potential effects.
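The prevalence statistics requested in the Data Quality comment could be gathered with a small audit script along these lines. This is a sketch under stated assumptions: the two heuristics (replacement or control characters as a sign of malformed text, and a hypothetical `min_tokens` cut-off as a crude proxy for truncation) are illustrative choices, not the checks the authors used, and the sample lines are invented.

```python
import unicodedata


def has_malformed_chars(line: str) -> bool:
    """Flag U+FFFD replacement characters and non-whitespace control characters."""
    for ch in line:
        if ch == "\ufffd":
            return True
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            return True
    return False


def audit(lines, min_tokens=3):
    """Return the share of lines flagged by each heuristic."""
    total = malformed = short = 0
    for line in lines:
        total += 1
        malformed += has_malformed_chars(line)
        # Very short lines are treated as possibly truncated (crude proxy).
        short += len(line.split()) < min_tokens
    if total == 0:
        return {"malformed": 0.0, "short": 0.0}
    return {"malformed": malformed / total, "short": short / total}


# Invented sample lines for illustration only.
sample = [
    "the plane landed on the runway",
    "a flat pla\ufffdne surface",
    "of the",
    "she saw the edge of the plane",
]
print(audit(sample))
```

Running the same function over each language's corpus would give the per-corpus rows of the summary table suggested above, although any real audit would need heuristics validated against manual inspection.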
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our discussion paper. We address each major comment below, indicating planned revisions to improve the manuscript.
Point-by-point responses
- Referee: [Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.
  Authors: We agree that quantitative prevalence statistics would strengthen our data quality critique. In the revised manuscript, we will add estimates of error rates based on our manual inspections (e.g., proportions of truncated sentences and OCR artifacts per corpus). However, comprehensively testing the impact on all participating models' scores and rankings would require re-running the full SemEval-2020 evaluation pipeline, which is outside the scope of this discussion paper. We will instead include illustrative case studies showing how specific preprocessing errors alter model predictions for selected target words.
  Revision: partial
- Referee: [Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.
  Authors: We welcome this request for greater specificity. The revised Operationalisation section will describe the SemEval-2020 Task 1 annotation guidelines, the clustering algorithm applied to the annotated usage graphs (including parameter settings and sense inventory construction), and the exact thresholds used for binary change classification. We will also add concrete examples, such as gradual connotational shifts in words like 'gay' that are not captured by discrete sense addition or loss, and constructional changes (e.g., shifts in verb argument structure) that affect discourse patterns without triggering sense-level change detection.
  Revision: yes
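The dependence of gold labels on threshold settings can be illustrated with a small sketch. It assumes that labels are derived from per-period sense counts: a binary label from a gain-or-loss rule with two thresholds, and a graded score from the Jensen-Shannon distance between the normalised sense distributions. The thresholds `k` and `n` and the sense counts are hypothetical values chosen for illustration, not the benchmark's actual settings.

```python
import math


def normalise(counts):
    """Turn raw sense counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one sense must be attested")
    return [c / total for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def binary_change(counts_t1, counts_t2, k, n):
    """True if some sense has at most k uses in one period and at least n in the other."""
    return any(
        (a <= k and b >= n) or (b <= k and a >= n)
        for a, b in zip(counts_t1, counts_t2)
    )


# Invented sense counts for one target word in two time periods.
t1 = [40, 2, 0]
t2 = [30, 6, 4]
graded = js_distance(normalise(t1), normalise(t2))
print(binary_change(t1, t2, k=0, n=5), binary_change(t1, t2, k=2, n=5), round(graded, 3))
```

In this example the same counts give a "no change" label under one threshold pair and a "change" label under another, while the graded score stays the same. A gradual connotational shift that never produces a new cluster would leave both measures untouched.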
Circularity Check
No circularity: qualitative critique with no derivations or self-referential reductions.
full rationale
The paper is a discussion critique of SemEval-2020 Task 1, structured around operationalisation, data quality, and benchmark design arguments. It lists concrete corpus issues (OCR noise, truncated sentences, inconsistent lemmatisation) and argues for treating the benchmark as partial, but contains no equations, fitted parameters, predictions, or derivation chains. No self-citations are load-bearing for the central claims, and the reasoning relies on external observations of the benchmark rather than any reduction to the paper's own inputs by construction. This is the expected non-finding for a non-mathematical critique paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Semantic change includes gradual, constructional, collocational, and discourse-level phenomena beyond discrete sense gain, loss, or redistribution.
- domain assumption: Corpus and preprocessing defects such as OCR noise and inconsistent lemmatisation materially distort model outputs and reduce reproducibility.