arxiv: 2604.13232 · v2 · pith:IZ5WUD72new · submitted 2026-04-14 · 💻 cs.CL

Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords lexical semantic change detection · SemEval-2020 Task 1 · benchmark evaluation · data quality · corpus preprocessing · semantic change · evaluation framework · shared task limitations

The pith

SemEval-2020 Task 1 for lexical semantic change detection has narrow definitions of change, corpus preprocessing errors, and limited target sets that make it a partial rather than definitive benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the SemEval-2020 shared task through three lenses: how change is operationalised, the quality of the underlying data, and the overall benchmark structure. It shows that the task frames semantic change mainly as discrete sense gain, loss, or redistribution, while the corpora contain OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS errors, and missed targets. The small curated target lists and narrow language coverage further limit statistical reliability and realism. Because of these combined issues, the benchmark cannot serve as a complete measure of progress in the field.

Core claim

The authors argue that SemEval-2020 Task 1 models semantic change primarily through gain, loss, or redistribution of discrete senses, yet its corpora suffer from OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, while its small curated target sets and limited language coverage reduce realism and increase statistical uncertainty; taken together these limitations indicate that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress.

What carries the argument

The three-part evaluative framework that assesses operationalisation of semantic change, data quality of the corpora, and benchmark design choices such as target selection and language coverage.

If this is right

  • Future shared tasks must adopt broader theories that include gradual, constructional, collocational, and discourse-level change.
  • All preprocessing steps must be documented transparently so that downstream analysis remains reproducible.
  • Target sets must be expanded and cross-linguistic coverage increased to improve statistical power and realism.
  • Evaluation settings should move toward more naturalistic corpora and annotation schemes that reflect actual usage patterns.
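The link between small target sets and statistical uncertainty can be made concrete with a rough calculation. The sketch below is an illustration and is not taken from the paper: it applies the Fisher z-transform with a commonly used large-sample adjustment for Spearman correlations (the 1.06 factor) to estimate a 95% interval for a given number of target words. The function name `spearman_interval` and the example values are hypothetical.

```python
import math


def spearman_interval(rho, n_targets, z_crit=1.96):
    """Approximate confidence interval for a Spearman correlation.

    Assumption: Fisher z-transform with variance 1.06 / (n - 3), a
    large-sample approximation; it is only a rough guide for small n.
    """
    if n_targets <= 3:
        raise ValueError("need more than 3 targets for this approximation")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    z = math.atanh(rho)
    se = math.sqrt(1.06 / (n_targets - 3))
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical comparison: the same observed correlation measured on a
# target list of 40 words and on one ten times larger.
small = spearman_interval(0.5, 40)
large = spearman_interval(0.5, 400)
print("40 targets:  %.2f to %.2f" % small)
print("400 targets: %.2f to %.2f" % large)
```

Under these assumptions the interval for the shorter list is much wider, which is one way to see why differences between systems on a few dozen targets may not be distinguishable.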

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing published results that rely on this benchmark as ground truth may need re-examination once cleaned data versions become available.
  • Similar data-quality audits could be applied to other lexical change or diachronic NLP benchmarks that use historical corpora.
  • Model developers might prioritise robustness tests against common OCR and tokenisation artifacts when training on historical text.

Load-bearing premise

That the listed corpus and preprocessing problems substantially distort model behaviour, complicate linguistic analysis, and reduce reproducibility.

What would settle it

Re-running the original participating systems on versions of the same corpora after documented cleaning of OCR errors, consistent lemmatisation, and full sentence restoration, then checking whether model rankings and gold-label correlations stay materially unchanged.
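A minimal sketch of the ranking comparison described above, assuming each system yields one score on the original corpora and one on the cleaned corpora. The system names and scores are invented for illustration, and the `spearman` helper is a plain-Python stand-in, not part of any SemEval tooling.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-system scores before and after corpus cleaning.
original = {"sys_a": 0.52, "sys_b": 0.47, "sys_c": 0.31, "sys_d": 0.28}
cleaned = {"sys_a": 0.50, "sys_b": 0.55, "sys_c": 0.33, "sys_d": 0.25}

systems = sorted(original)
rho = spearman([original[s] for s in systems], [cleaned[s] for s in systems])
print(f"rank stability (Spearman): {rho:.2f}")
```

A value near 1 would suggest the rankings survive cleaning; a clearly lower value would support the premise that preprocessing defects distort the comparison. What counts as "materially unchanged" is a threshold the experimenter would have to fix in advance.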

Original abstract

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper critiques SemEval-2020 Task 1 for lexical semantic change detection by examining its operationalisation of semantic change, data quality issues in the corpora, and limitations in benchmark design. It claims that the discrete sense-based approach is too narrow, that preprocessing problems like OCR noise and tagging errors distort results, and that small target sets reduce realism, recommending the benchmark be viewed as partial rather than definitive.

Significance. If the identified problems are shown to affect model performance, the paper offers a detailed evaluation framework that could lead to improved benchmarks in lexical semantic change detection. It emphasises the need for broader theories of change and better data practices, both of which matter for progress toward more interpretable and generalisable results.

major comments (2)
  1. [Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.
  2. [Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.
minor comments (2)
  1. [Abstract] The term 'bench-mark' appears with a hyphen in the abstract; standardize to 'benchmark' throughout the manuscript.
  2. [Data Quality] The paper would benefit from a summary table listing the specific preprocessing problems found in each language/corpus and their potential effects.
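The prevalence statistics requested in the first major comment could be estimated along the following lines. This is a rough sketch under stated assumptions, not the authors' procedure: it assumes raw text with one sentence per line, treats the Unicode replacement character and stray control or unassigned code points as "malformed", and flags a sentence as possibly truncated when it lacks closing punctuation. Both heuristics and all function names are hypothetical, and the punctuation test would not be meaningful on lemmatised or punctuation-stripped corpus versions.

```python
import unicodedata

# Characters accepted as a plausible sentence ending (heuristic assumption).
TERMINAL = (".", "!", "?", '"', "'", ")", "\u201d", "\u2019")


def has_malformed_chars(sentence):
    """True if the sentence contains U+FFFD or control/private/unassigned chars."""
    for ch in sentence:
        if ch == "\ufffd":
            return True
        if ch not in "\t\n\r" and unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            return True
    return False


def looks_truncated(sentence):
    """True if a non-empty sentence does not end in closing punctuation."""
    stripped = sentence.rstrip()
    return bool(stripped) and not stripped.endswith(TERMINAL)


def prevalence(sentences):
    """Return the share of sentences flagged by each heuristic."""
    total = len(sentences)
    if total == 0:
        return {"malformed": 0.0, "truncated": 0.0}
    malformed = sum(has_malformed_chars(s) for s in sentences)
    truncated = sum(looks_truncated(s) for s in sentences)
    return {"malformed": malformed / total, "truncated": truncated / total}


# Invented sample sentences, not drawn from the benchmark corpora.
sample = [
    "The ship left the harbour at dawn.",
    "He walked to the edge of the",
    "A pla\ufffdne flew over the town.",
    "Where is the record?",
]
stats = prevalence(sample)
print(stats)
```

Counts like these, reported per corpus and per language, would also supply the rows for the summary table suggested in the second minor comment. Flagged sentences would still need manual inspection, since the heuristics will produce both false positives and false negatives.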

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our discussion paper. We address each major comment below, indicating planned revisions to improve the manuscript.

  1. Referee: [Data Quality] Data Quality section: The manuscript identifies issues such as OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets, stating they 'can distort model behaviour, complicate linguistic analysis, and reduce reproducibility.' However, it provides no prevalence statistics (e.g., percentage of affected tokens or sentences per corpus) or empirical tests of impact on participating models' scores or rankings, which is required to support the central claim that these issues undermine the benchmark's validity.

    Authors: We agree that quantitative prevalence statistics would strengthen our data quality critique. In the revised manuscript, we will add estimates of error rates based on our manual inspections (e.g., proportions of truncated sentences and OCR artifacts per corpus). However, comprehensive empirical tests of the impact on all participating models' scores and rankings would require re-running the full SemEval-2020 evaluation pipeline, which is outside the scope of this discussion paper. We will instead include illustrative case studies showing how specific preprocessing errors alter model predictions for selected target words. revision: partial

  2. Referee: [Operationalisation] Operationalisation section: The discussion notes that gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, potentially limiting validity. The paper should detail the specific procedures used in SemEval-2020 Task 1 and provide concrete examples of how they fail to capture gradual, constructional, or discourse-level changes.

    Authors: We welcome this request for greater specificity. The revised Operationalisation section will describe the SemEval-2020 Task 1 annotation guidelines, the clustering algorithm applied to the annotated usage graphs (including parameter settings and sense inventory construction), and the exact thresholds used for binary change classification. We will also add concrete examples, such as gradual connotational shifts in words like 'gay' that are not captured by discrete sense addition or loss, and constructional changes (e.g., shifts in verb argument structure) that affect discourse patterns without triggering sense-level change detection. revision: yes

Circularity Check

0 steps flagged

No circularity: qualitative critique with no derivations or self-referential reductions.


The paper is a discussion critique of SemEval-2020 Task 1, structured around operationalisation, data quality, and benchmark design arguments. It lists concrete corpus issues (OCR noise, truncated sentences, inconsistent lemmatisation) and argues for treating the benchmark as partial, but contains no equations, fitted parameters, predictions, or derivation chains. No self-citations are load-bearing for the central claims, and the reasoning relies on external observations of the benchmark rather than any reduction to the paper's own inputs by construction. This is the expected non-finding for a non-mathematical critique paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the nature of semantic change and standards for data integrity rather than new parameters or entities.

axioms (2)
  • domain assumption Semantic change includes gradual, constructional, collocational, and discourse-level phenomena beyond discrete sense gain, loss, or redistribution.
    Invoked when arguing that the benchmark's operationalisation is too narrow.
  • domain assumption Corpus and preprocessing defects such as OCR noise and inconsistent lemmatisation materially distort model outputs and reduce reproducibility.
    Central premise of the data-quality critique.

pith-pipeline@v0.9.0 · 5593 in / 1526 out tokens · 76662 ms · 2026-05-10T15:31:06.988162+00:00 · methodology
