Interpretable Coreference Resolution Evaluation Using Explicit Semantics

Bruno Gatti; Giuliano Martinelli; Roberto Navigli

arXiv: 2605.10627 · v1 · submitted 2026-05-11 · cs.CL · cs.AI

Pith reviewed 2026-05-12 04:56 UTC · model grok-4.3

keywords: coreference resolution, evaluation metrics, semantic typing, named entity recognition, concept recognition, interpretability, data augmentation, cluster evaluation

The pith

Overlaying concept and named entity labels onto coreference clusters yields typed scores that expose category-specific failures hidden by aggregate metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Coreference systems are scored with aggregate measures such as CoNLL-F1 that count cluster overlap without indicating whether errors concentrate on people, places, events, or other semantic kinds. The paper adds an overlay step that runs concept and named entity recognition on nominal mentions, then assigns those labels to whole predicted and gold clusters. Typed precision, recall, and F1 can then be computed separately for each semantic class, separating mention detection from linking performance. Experiments on OntoNotes, LitBank, and PreCo show these breakdowns surface systematic weaknesses that standard scores conceal. The same breakdowns can be turned into low-cost data-augmentation rules that produce measurable gains when models are tested outside their original domains.
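For reference, the aggregate score mentioned above is conventionally defined, following the CoNLL-2012 shared task, as the unweighted mean of three cluster-overlap F1 scores. The symbol names below are notation chosen for this note, not taken from the paper:

```latex
\mathrm{CoNLL\text{-}F1} \;=\; \frac{F_1^{\mathrm{MUC}} + F_1^{B^3} + F_1^{\mathrm{CEAF}_{\phi_4}}}{3}
```

None of the three terms is conditioned on the semantic type of a mention or cluster, which is why a single averaged value cannot show where errors concentrate.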

Core claim

By running Concept and Named Entity Recognition (CNER) tools on the nominal mentions inside coreference clusters and propagating the resulting semantic labels to entire clusters, the framework produces per-class scores that separately measure how well a system extracts and links mentions of each semantic type. These scores reveal error patterns that remain invisible under aggregate CoNLL-F1 and can be used to select targeted augmentation data that improves out-of-domain performance.

What carries the argument

Propagation of CNER semantic labels from individual nominal mentions to entire coreference clusters, enabling class-stratified scoring of mention extraction and linking.
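The propagation step can be pictured with a minimal sketch. The abstract does not state the exact propagation rule, so the majority vote used here is an assumption, as are the span encoding, the function name `propagate_labels`, and the label strings.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Mention = Tuple[int, int]  # hypothetical (start_token, end_token) span encoding


def propagate_labels(
    clusters: List[List[Mention]],
    mention_labels: Dict[Mention, str],
) -> List[Optional[str]]:
    """Give each cluster the most frequent label among its labelled mentions.

    Majority voting is an assumption of this sketch, not the paper's stated rule.
    Clusters without any labelled mention (e.g. pronoun-only) receive None.
    """
    cluster_labels: List[Optional[str]] = []
    for cluster in clusters:
        votes = Counter(mention_labels[m] for m in cluster if m in mention_labels)
        cluster_labels.append(votes.most_common(1)[0][0] if votes else None)
    return cluster_labels


# Toy document: only nominal mentions carry a label; pronouns are unlabelled.
clusters = [
    [(0, 1), (5, 5), (9, 9)],  # "Marie Curie", "she", "the physicist"
    [(3, 3), (12, 13)],        # "Paris", "the city"
    [(7, 7), (15, 15)],        # "it", "it"
]
mention_labels = {(0, 1): "PER", (9, 9): "PER", (3, 3): "LOC", (12, 13): "LOC"}

cluster_labels = propagate_labels(clusters, mention_labels)
print(cluster_labels)  # ['PER', 'LOC', None]
```

Under this rule, unlabelled mentions such as pronouns inherit the type of their cluster, which is what allows every mention in a typed cluster to contribute to a per-class score.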

If this is right

Aggregate metrics such as CoNLL-F1 can mask large differences in performance across semantic categories.
Typed scores separate mention-detection errors from linking errors within each semantic class.
Diagnostics from the typed scores can identify the specific classes that require additional training data.
Low-cost augmentation strategies derived from those diagnostics produce measurable out-of-domain gains on OntoNotes, LitBank, and PreCo.
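A toy example of how an aggregate mention score can hide a weak class is sketched below. The scoring definition (exact span match within a class), the function names, and the data are illustrative assumptions; the paper's typed Mention and Link scores may be defined differently.

```python
from collections import defaultdict
from typing import Dict, Tuple

Mention = Tuple[int, int]  # hypothetical (start_token, end_token) span encoding


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts, returning 0.0 on empty denominators."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def typed_mention_scores(
    gold: Dict[Mention, str], pred: Dict[Mention, str]
) -> Dict[str, Tuple[float, float, float]]:
    """Per-class mention P/R/F1; each dict maps a span to its cluster's label."""
    gold_by_class = defaultdict(set)
    pred_by_class = defaultdict(set)
    for mention, label in gold.items():
        gold_by_class[label].add(mention)
    for mention, label in pred.items():
        pred_by_class[label].add(mention)
    scores = {}
    for label in sorted(set(gold_by_class) | set(pred_by_class)):
        g, p = gold_by_class[label], pred_by_class[label]
        scores[label] = prf(len(g & p), len(p), len(g))
    return scores


# Invented data: the system finds every PER mention but misses one of two EVENT mentions.
gold = {(0, 1): "PER", (4, 4): "PER", (8, 9): "PER", (12, 12): "PER",
        (2, 3): "EVENT", (10, 11): "EVENT"}
pred = {(0, 1): "PER", (4, 4): "PER", (8, 9): "PER", (12, 12): "PER",
        (2, 3): "EVENT"}

aggregate = prf(len(set(gold) & set(pred)), len(pred), len(gold))
scores = typed_mention_scores(gold, pred)
print(f"aggregate F1 = {aggregate[2]:.3f}")
for label, (p, r, f) in scores.items():
    print(f"{label}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this invented case the aggregate F1 is about 0.91 while EVENT sits at about 0.67, so the weak class is visible only in the stratified view.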

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the CNER layer proves stable across domains, the typed scores could be added as a routine reporting requirement for future coreference benchmarks.
The same label-propagation idea could be applied to other structured prediction tasks such as event extraction or relation classification to obtain class-specific diagnostics.
System developers could use the per-class breakdowns to decide whether to invest in better mention detectors or better cluster linkers for particular semantic types.

Load-bearing premise

CNER tools must label mentions accurately enough that the derived cluster-level scores faithfully reflect true semantic weaknesses rather than tool errors.

What would settle it

Re-run the typed evaluation on the same coreference outputs with a CNER system known to be substantially less accurate on the same domains. If the per-class weakness patterns change or contradict the original ones, the typed scores are tracking tool error rather than coreference behavior.
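One simple way to quantify whether two labelers yield the same pattern of per-class weaknesses is the fraction of class pairs that both rank in the same order. This is a sketch of such a check, not a procedure from the paper; the function name and all F1 values are invented for illustration.

```python
from itertools import combinations
from typing import Dict


def ranking_agreement(f1_a: Dict[str, float], f1_b: Dict[str, float]) -> float:
    """Fraction of class pairs ordered identically by both sets of per-class F1.

    Pairs tied under either labeler are counted as not concordant.
    """
    classes = sorted(set(f1_a) & set(f1_b))
    pairs = list(combinations(classes, 2))
    if not pairs:
        return 1.0
    concordant = sum(
        1 for x, y in pairs
        if (f1_a[x] - f1_a[y]) * (f1_b[x] - f1_b[y]) > 0
    )
    return concordant / len(pairs)


# Invented per-class F1 values obtained with a stronger and a weaker labeler.
strong = {"PER": 0.90, "LOC": 0.80, "EVENT": 0.55, "ORG": 0.70}
weak = {"PER": 0.85, "LOC": 0.60, "EVENT": 0.65, "ORG": 0.75}

agreement = ranking_agreement(strong, weak)
print(f"{agreement:.3f}")  # 0.667: four of six class pairs keep their order
```

A value near 1.0 would suggest the diagnosed weaknesses are robust to the choice of labeler, while a low value would point to the labeler driving the pattern.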

Figures

Figures reproduced from arXiv: 2605.10627 by Bruno Gatti, Giuliano Martinelli, Roberto Navigli.

**Figure 2.** Percentage of CNER-annotated coreference …

**Figure 4.** Distribution of propagated CNER semantic …

**Figure 5.** Per-class Mention F1 scores for each model. In-domain results are shown in grey, while out-of-domain results are shown in different colors and computed as the average performance on the two datasets not used for training. Dashed vertical lines indicate the mean in-domain and out-of-domain scores. Classes are ordered by decreasing category frequency in LitBank so as to highlight out-of-domain performance. …

**Figure 6.** Performance difference between maverick …

**Figure 7.** Per-class Link F1 scores for maverick-mes-ontonotes, maverick-mes-litbank, and maverick-mes-preco. Grey bars indicate in-domain performance, while colored bars indicate out-of-domain performance. Classes are ordered by decreasing support in the LitBank dataset. …

**Figure 8.** Delta of out-of-domain Mention F1 for maverick-mes-litbank-NR, compared to maverick-mes-litbank. Positive values indicate improvements over the baseline.

Original abstract

Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper layers CNER labels onto coreference clusters for typed extraction and linking scores, which is a simple but useful diagnostic step beyond aggregate CoNLL-F1, though it needs checks on label noise.

The core contribution is overlaying Concept and Named Entity Recognition labels on coreference mentions and propagating them to clusters. This produces per-class scores for mention extraction and linking that standard metrics hide. They test on OntoNotes, LitBank, and PreCo and report that the typed view reveals patterns the usual F1 misses, then use those patterns for low-cost augmentation that lifts out-of-domain results. That combination is new enough to matter for people who actually debug coreference systems rather than just report one number. The approach is additive and does not require new models, which keeps it practical. The main weakness is the unexamined reliance on CNER accuracy for nominal mentions. If the labeler misfires on ambiguous or out-of-domain text, the per-class differences and the augmentation choices could track CNER noise instead of coreference behavior. The abstract gives no CNER validation numbers, no sensitivity runs, and no error analysis, so the central claims rest on moderate evidence. No implementation details or statistical tests appear either. This is worth a serious referee for groups working on coreference evaluation or data-efficient improvement. The idea is clear and the datasets are standard, so a full version with the missing checks could be usable. I would not cite it yet without seeing the actual numbers and the CNER validation step.

Referee Report

1 major / 2 minor

Summary. The paper introduces a semantically-enhanced evaluation framework for coreference resolution by overlaying Concept and Named Entity Recognition (CNER) labels onto coreference outputs. Semantic labels are assigned to nominal mentions and propagated to entire clusters, enabling computation of typed precision/recall scores stratified by semantic class (e.g., people, locations). Experiments on OntoNotes, LitBank, and PreCo are reported to show that these typed metrics reveal systematic weaknesses obscured by aggregate CoNLL-F1 scores, and that the diagnostics support targeted, low-cost data augmentation yielding out-of-domain improvements.

Significance. If the CNER labeling step is shown to be reliable, the framework would offer a practical, additive diagnostic tool that improves interpretability of coreference systems without replacing existing metrics. The non-circular, parameter-free overlay and the augmentation use-case are strengths that could guide more targeted model development. However, the current evidential support is limited by the absence of validation for the core assumption.

major comments (1)

Abstract (and framework description): The claims that the framework 'uncovers systematic weaknesses' and enables 'targeted' augmentation rest on the assumption that CNER labels on nominal mentions are sufficiently accurate for propagation to clusters. No CNER accuracy measurement, inter-annotator agreement on labeled mentions, or sensitivity analysis under label noise is reported. If CNER errors are non-negligible (common for ambiguous nominals or out-of-domain text), the per-class differences and augmentation decisions could primarily reflect CNER noise rather than coreference behavior, directly undermining the central diagnostic and improvement claims.

minor comments (2)

The abstract refers to 'measurable out-of-domain improvements' without quantifying the gains, describing the augmentation procedure, or reporting statistical tests; these details are needed for assessing the practical impact.
Implementation details for the CNER overlay, exact propagation rule to clusters, and the specific CNER tool(s) employed are missing and should be provided for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and have revised the manuscript to incorporate additional validation for the CNER component.

Referee: Abstract (and framework description): The claims that the framework 'uncovers systematic weaknesses' and enables 'targeted' augmentation rest on the assumption that CNER labels on nominal mentions are sufficiently accurate for propagation to clusters. No CNER accuracy measurement, inter-annotator agreement on labeled mentions, or sensitivity analysis under label noise is reported. If CNER errors are non-negligible (common for ambiguous nominals or out-of-domain text), the per-class differences and augmentation decisions could primarily reflect CNER noise rather than coreference behavior, directly undermining the central diagnostic and improvement claims.

Authors: We agree that the reliability of the CNER labeling step is central to the framework's interpretability claims and that its absence from the original submission represents a genuine limitation. In the revised manuscript we have added a new subsection (3.3) that directly addresses this. The subsection reports CNER accuracy measured against a manually annotated sample of nominal mentions drawn from each of the three evaluation corpora, provides inter-annotator agreement statistics for the semantic labels, and includes a sensitivity analysis that injects controlled label noise at varying rates and tracks the resulting changes in typed precision/recall and in the augmentation gains. The analysis shows that the relative ordering of semantic-class weaknesses and the direction of the augmentation improvements remain stable under moderate noise levels. We have also updated the abstract and the framework description to reflect these new results and to qualify the strength of the diagnostic claims accordingly.

Revision: yes
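The kind of noise injection described in this response could look like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: the uniform replacement scheme, the function name `inject_label_noise`, the label inventory, and the toy data are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Mention = Tuple[int, int]  # hypothetical (start_token, end_token) span encoding


def inject_label_noise(
    labels: Dict[Mention, str],
    label_set: List[str],
    rate: float,
    seed: int = 0,
) -> Dict[Mention, str]:
    """With probability `rate`, swap a label for a different one drawn uniformly.

    Uniform replacement is an assumption; real labeler errors are likely
    skewed toward confusable classes.
    """
    rng = random.Random(seed)
    noisy = {}
    for mention, label in labels.items():
        if rng.random() < rate:
            noisy[mention] = rng.choice([l for l in label_set if l != label])
        else:
            noisy[mention] = label
    return noisy


# Toy labels: 1000 single-token mentions alternating between two classes.
labels = {(i, i): ("PER" if i % 2 == 0 else "LOC") for i in range(1000)}
label_set = ["PER", "LOC", "ORG", "EVENT"]

for rate in (0.0, 0.1, 0.3):
    noisy = inject_label_noise(labels, label_set, rate)
    flipped = sum(noisy[m] != labels[m] for m in labels)
    print(f"rate={rate:.1f} observed flip fraction={flipped / len(labels):.3f}")
```

Recomputing the typed scores on each noisy copy and comparing the resulting class ordering with the clean one would give the stability curve the response refers to.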

Circularity Check

0 steps flagged

No circularity: additive evaluation overlay with no derivations or self-referential reductions

The paper proposes an evaluation framework that overlays CNER labels on coreference outputs to produce typed scores, then reports empirical results on OntoNotes, LitBank, and PreCo plus augmentation experiments. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The methodology is described as an additive post-processing step on existing cluster outputs rather than a closed derivation chain. Claims rest on experimental outcomes rather than on self-definition, load-bearing self-citation, or renaming of known results. This is the common case of an honest methodological contribution with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that CNER outputs are accurate enough for evaluation; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Concept and Named Entity Recognition (CNER) provides sufficiently accurate semantic labels for nominal mentions to support reliable cluster-level evaluation.
The framework assigns and propagates these labels; any systematic CNER error would directly affect the typed scores and augmentation strategy.

pith-pipeline@v0.9.0 · 5468 in / 1213 out tokens · 69210 ms · 2026-05-12T04:56:43.601163+00:00 · methodology
