Knowledge-aware Pronoun Coreference Resolution

Dong Yu; Hongming Zhang; Yangqiu Song; Yan Song

arxiv: 1907.03663 · v1 · pith:KD6Y44EXnew · submitted 2019-07-08 · 💻 cs.CL

Knowledge-aware Pronoun Coreference Resolution

Hongming Zhang , Yan Song , Yangqiu Song , Dong Yu This is my paper

Pith reviewed 2026-05-25 01:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords pronoun coreference resolutionknowledge graphsneural networksattention mechanismcross-domain generalizationnatural language processing

0 comments

The pith

A neural model uses knowledge graph triplets and attention to resolve pronoun coreference more accurately than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to improve pronoun coreference resolution by feeding external knowledge directly into a neural network in the form of simple triplets rather than through hand-crafted rules. An attention module learns which pieces of knowledge matter for a given sentence and ignores the rest. This leads to stronger results on standard test sets and, crucially, better transfer when the model is tested on new domains because it draws on general knowledge instead of memorizing the training examples alone.

Core claim

The model resolves pronouns by directly incorporating knowledge in triplet format from knowledge graphs and employs a knowledge attention module to selectively use informative knowledge based on the surrounding context, leading to improved performance on in-domain and cross-domain datasets.

What carries the argument

The knowledge attention module, which learns to select and use informative knowledge based on contexts.

If this is right

The model outperforms state-of-the-art baselines by a large margin on two datasets from different domains.
It shows superior performance compared with baselines in the cross-domain setting.
Relying on external knowledge rather than only fitting the training data improves generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same triplet-plus-attention pattern could be applied to other language tasks that need facts beyond the sentence, such as entity linking or question answering.
If the attention module reliably filters noise, the approach might lower the amount of labeled data needed for coreference systems in new domains.
One could test whether the triplet format works equally well with knowledge graphs that have different structures or coverage levels.

Load-bearing premise

External knowledge in triplet form can be fed into the neural model and the attention module will pick only the helpful pieces without adding noise or hurting generalization.

What would settle it

A controlled test in which the knowledge attention module is removed or replaced with random selection and performance fails to improve or drops in the cross-domain setting would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.03663 by Dong Yu, Hongming Zhang, Yangqiu Song, Yan Song.

**Figure 3.** Figure 3: The structure of the knowledge attention [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of different softmax selection thresh [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Resolving pronoun coreference requires knowledge support, especially for particular domains (e.g., medicine). In this paper, we explore how to leverage different types of knowledge to better resolve pronoun coreference with a neural model. To ensure the generalization ability of our model, we directly incorporate knowledge in the format of triplets, which is the most common format of modern knowledge graphs, instead of encoding it with features or rules as that in conventional approaches. Moreover, since not all knowledge is helpful in certain contexts, to selectively use them, we propose a knowledge attention module, which learns to select and use informative knowledge based on contexts, to enhance our model. Experimental results on two datasets from different domains prove the validity and effectiveness of our model, where it outperforms state-of-the-art baselines by a large margin. Moreover, since our model learns to use external knowledge rather than only fitting the training data, it also demonstrates superior performance to baselines in the cross-domain setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a knowledge attention module to pull KG triplets into a neural coreference model and reports gains both in-domain and cross-domain.

read the letter

The paper adds a knowledge attention module to pull KG triplets into a neural coreference model and reports gains both in-domain and cross-domain. They feed external knowledge directly as triplets rather than turning it into hand-crafted features or rules, then let the attention layer decide which triplets matter for each pronoun context. That setup is tested on two datasets from separate domains, with the cross-domain results presented as evidence that the model learns to use the knowledge instead of just fitting the training data. The direct triplet format and the selective attention are the concrete moves that differ from earlier work on knowledge injection for coreference. The cross-domain test is the part that actually tests whether the knowledge helps generalization rather than just adding capacity. If the full experiments include ablations on the attention component and confirm the knowledge source stays external to the test sets, the gains would be worth noting for anyone working on domain-specific coreference. The main soft spot is that the abstract states large-margin wins without listing scores, exact baselines, or error bars, so the size of the improvement and whether the attention truly filters noise remain hard to judge from the summary alone. Retrieval details for the triplets and any overlap checks would also need to be verified in the full text. Readers who build coreference systems or experiment with structured knowledge in neural models would get the most out of this. It is a straightforward engineering extension rather than a conceptual shift, but the cross-domain angle gives it enough substance to go through peer review so the experimental claims can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes a neural model for pronoun coreference resolution that directly incorporates external knowledge from knowledge graphs as triplets (rather than hand-crafted features) and uses a knowledge attention module to selectively attend to informative triplets based on context. It reports large-margin gains over SOTA baselines on two datasets from different domains and superior cross-domain performance, attributing the latter to the model's use of external knowledge rather than overfitting to training data.

Significance. If the experimental claims hold, the work would be significant for demonstrating a practical method to inject structured KG knowledge into neural coreference models while preserving generalization; the triplet format and attention-based selection avoid the brittleness of rule-based or feature-engineered knowledge integration and could extend to other knowledge-intensive NLP tasks.

major comments (2)

[§3.2] §3.2 (Knowledge Attention Module): the central claim that the attention mechanism learns to select only informative triplets (avoiding noise) is load-bearing for both the in-domain gains and the cross-domain superiority, yet the manuscript provides no ablation that isolates the attention module (e.g., full model vs. model that concatenates all retrieved triplets or uses uniform attention).
[§4] §4 (Experiments), cross-domain setting: the reported superiority is presented without statistical significance tests, run-to-run variance, or explicit description of how the two domains are partitioned for training/testing, which is required to substantiate that the gains arise from external knowledge rather than domain-specific fitting.

minor comments (2)

The abstract and introduction refer to 'two datasets from different domains' without naming them or their domains until later sections; moving this information earlier would improve readability.
[§3] Notation for the knowledge triplet embedding and the attention score computation (Eqs. in §3) could be clarified with an explicit diagram showing how context, pronoun, and candidate entity interact with the KG triplets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the experimental validation.

read point-by-point responses

Referee: [§3.2] §3.2 (Knowledge Attention Module): the central claim that the attention mechanism learns to select only informative triplets (avoiding noise) is load-bearing for both the in-domain gains and the cross-domain superiority, yet the manuscript provides no ablation that isolates the attention module (e.g., full model vs. model that concatenates all retrieved triplets or uses uniform attention).

Authors: We agree that an ablation isolating the attention module is necessary to support the central claim. In the revision we will add experiments comparing the full model to (i) a variant with uniform attention over all triplets and (ii) a variant that concatenates all retrieved triplets without selection. These results will quantify the contribution of learned selection versus noise. revision: yes
Referee: [§4] §4 (Experiments), cross-domain setting: the reported superiority is presented without statistical significance tests, run-to-run variance, or explicit description of how the two domains are partitioned for training/testing, which is required to substantiate that the gains arise from external knowledge rather than domain-specific fitting.

Authors: We accept the need for these details. The revision will report means and standard deviations over multiple random seeds, include paired significance tests on the cross-domain results, and add an explicit description of the domain partitioning and train/test splits used in Section 4. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a neural model that directly incorporates external knowledge graph triplets via a knowledge attention module to resolve pronoun coreference. The central claims rest on empirical results from two datasets (including cross-domain evaluation) showing outperformance over baselines. No equations, derivations, or self-citations are presented that reduce any prediction or uniqueness claim to a fitted parameter or prior author result by construction. The model architecture draws on standard neural components plus external KG data rather than redefining inputs as outputs or smuggling ansatzes via self-citation. This is a standard empirical ML paper whose validity is assessed via held-out performance rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the availability of relevant external knowledge graphs and the effectiveness of the attention mechanism for selection. No free parameters or invented entities are explicitly described. The approach assumes triplet format preserves utility for generalization.

axioms (2)

domain assumption External knowledge graphs contain useful triplets for resolving pronouns in various domains including medicine
Invoked to justify direct incorporation of knowledge for better resolution and generalization.
domain assumption Not all knowledge is helpful in certain contexts, but an attention module can learn to select informative parts
Stated as motivation for the knowledge attention module.

pith-pipeline@v0.9.0 · 5688 in / 1349 out tokens · 31499 ms · 2026-05-25T01:02:48.434006+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose a knowledge attention module, which learns to select and use informative knowledge based on contexts
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

directly incorporate knowledge in the format of triplets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1]

Neural Machine Translation by Jointly Learning to Align and Translate

Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Eugene Charniak and Micha Elsner

work page internal anchor Pith review Pith/arXiv arXiv
[2]

In EACL, 2009, pages 148–156

Em works for pronoun anaphora resolution. In EACL, 2009, pages 148–156. Kevin Clark and Christopher D Manning

work page 2009
[3]

In ACL-IJCNLP , 2015, volume 1, pages 1405–1415

Entity- centric coreference resolution with model stacking. In ACL-IJCNLP , 2015, volume 1, pages 1405–1415. Kevin Clark and Christopher D. Manning

work page 2015
[4]

In EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2256–2262

Deep reinforcement learning for mention-ranking corefer- ence models. In EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2256–2262. K Bretonnel Cohen, Arrick Lanfranchi, Miji Joo-young Choi, Michael Bada, William A Baumgartner, Na- talya Panteleyeva, Karin Verspoor, Martha Palmer, and Lawrence E Hunter

work page 2016
[5]

In ACL, 1981, pages 89–93

Search and inference strategies in pronoun resolution: An experimental study. In ACL, 1981, pages 89–93. Ali Emami, Paul Trichelair, Adam Trischler, Ka- heer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung

work page 1981
[6]

The Knowref Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution

The hard-core coreference cor- pus: Removing gender and number cues for difﬁ- cult pronominal anaphora resolution. arXiv preprint arXiv:1811.01747. Jerry R Hobbs

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Kenton Lee, Luheng He, Mike Lewis, and Luke Zettle- moyer

work page internal anchor Pith review Pith/arXiv arXiv
[8]

In EMNLP , 9-11, 2017, pages 188–197

End-to-end neural coreference resolu- tion. In EMNLP , 9-11, 2017, pages 188–197. Kenton Lee, Luheng He, and Luke Zettlemoyer

work page 2017
[9]

In Proceedings of ACL 2011, pages 1169–1178

A pronoun anaphora resolution system based on fac- torial hidden markov models. In Proceedings of ACL 2011, pages 1169–1178. Miaofeng Liu, Jialong Han, Haisong Zhang, and Yan Song

work page 2011
[10]

In Proceed- ings of the BioNLP 2018 workshop 2018 , pages 137–141

Domain Adaptation for Disease Phrase Matching with Adversarial Networks. In Proceed- ings of the BioNLP 2018 workshop 2018 , pages 137–141. Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang

work page 2018
[11]

In ACL, 1998, pages 869–875

Robust pronoun resolution with limited knowledge. In ACL, 1998, pages 869–875. Ruslan Mitkov et al

work page 1998
[12]

In CCL, 1994, pages 1157–1163

Robust method of pro- noun resolution using full-text information. In CCL, 1994, pages 1157–1163. Vincent Ng

work page 1994
[13]

In EMNLP , 2005, volume 20, page

Supervised ranking for pronoun res- olution: Some recent improvements. In EMNLP , 2005, volume 20, page

work page 2005
[14]

In EMNLP , 2014, pages 1532–1543

Glove: Global vectors for word representation. In EMNLP , 2014, pages 1532–1543. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer

work page 2014
[15]

Deep contextualized word representations

Deep contextualized word rep- resentations. arXiv preprint arXiv:1802.05365. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang

work page internal anchor Pith review Pith/arXiv arXiv
[16]

In EMNLP , 2012, pages 1–40

Conll- 2012 shared task: Modeling multilingual unre- stricted coreference in ontonotes. In EMNLP , 2012, pages 1–40. Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran- garajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning

work page 2012
[17]

In EMNLP , 2010, pages 492–501

A multi- pass sieve for coreference resolution. In EMNLP , 2010, pages 492–501. Altaf Rahman and Vincent Ng

work page 2010
[18]

In ACL, 2011, pages 814–824

Coreference res- olution with world knowledge. In ACL, 2011, pages 814–824. Philip Resnik

work page 2011
[19]

In Proceedings of IJCAI 2018, pages 4368–4374

Complementary Learning of Word Embeddings. In Proceedings of IJCAI 2018, pages 4368–4374. Yan Song, Shuming Shi, Jing Li, and Haisong Zhang

work page 2018
[20]

In Proceedings of NAACL-HLT 2018, pages 175–180

Directional Skip-Gram: Explicitly Distin- guishing Left and Right Context for Word Embed- dings. In Proceedings of NAACL-HLT 2018, pages 175–180. Josef Steinberger, Massimo Poesio, Mijail A Kabadjov, and Karel Jevzek

work page 2018
[21]

In ACL, 2003, pages 168–175

A ma- chine learning approach to pronoun resolution in spoken dialogue. In ACL, 2003, pages 168–175. Long Trieu, Nhung Nguyen, Makoto Miwa, and Sophia Ananiadou

work page 2003
[22]

In Proceedings of the BioNLP 2018 workshop, pages 183–188

Investigating domain-speciﬁc information for neural coreference resolution on biomedical texts. In Proceedings of the BioNLP 2018 workshop, pages 183–188. Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R South

work page 2018
[23]

Artiﬁcial intelligence, 6(1):53–74

A preferential, pattern-seeking, semantics for natural language inference. Artiﬁcial intelligence, 6(1):53–74. Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. 2019a. Aser: A large- scale eventuality knowledge graph. arXiv preprint arXiv:1905.00270. Hongming Zhang, Yan Song, and Yangqiu Song. 2019b. Incorporating context and exte...

work page arXiv 1905

[1] [1]

Neural Machine Translation by Jointly Learning to Align and Translate

Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Eugene Charniak and Micha Elsner

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

In EACL, 2009, pages 148–156

Em works for pronoun anaphora resolution. In EACL, 2009, pages 148–156. Kevin Clark and Christopher D Manning

work page 2009

[3] [3]

In ACL-IJCNLP , 2015, volume 1, pages 1405–1415

Entity- centric coreference resolution with model stacking. In ACL-IJCNLP , 2015, volume 1, pages 1405–1415. Kevin Clark and Christopher D. Manning

work page 2015

[4] [4]

In EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2256–2262

Deep reinforcement learning for mention-ranking corefer- ence models. In EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2256–2262. K Bretonnel Cohen, Arrick Lanfranchi, Miji Joo-young Choi, Michael Bada, William A Baumgartner, Na- talya Panteleyeva, Karin Verspoor, Martha Palmer, and Lawrence E Hunter

work page 2016

[5] [5]

In ACL, 1981, pages 89–93

Search and inference strategies in pronoun resolution: An experimental study. In ACL, 1981, pages 89–93. Ali Emami, Paul Trichelair, Adam Trischler, Ka- heer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung

work page 1981

[6] [6]

The Knowref Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution

The hard-core coreference cor- pus: Removing gender and number cues for difﬁ- cult pronominal anaphora resolution. arXiv preprint arXiv:1811.01747. Jerry R Hobbs

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Kenton Lee, Luheng He, Mike Lewis, and Luke Zettle- moyer

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

In EMNLP , 9-11, 2017, pages 188–197

End-to-end neural coreference resolu- tion. In EMNLP , 9-11, 2017, pages 188–197. Kenton Lee, Luheng He, and Luke Zettlemoyer

work page 2017

[9] [9]

In Proceedings of ACL 2011, pages 1169–1178

A pronoun anaphora resolution system based on fac- torial hidden markov models. In Proceedings of ACL 2011, pages 1169–1178. Miaofeng Liu, Jialong Han, Haisong Zhang, and Yan Song

work page 2011

[10] [10]

In Proceed- ings of the BioNLP 2018 workshop 2018 , pages 137–141

Domain Adaptation for Disease Phrase Matching with Adversarial Networks. In Proceed- ings of the BioNLP 2018 workshop 2018 , pages 137–141. Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang

work page 2018

[11] [11]

In ACL, 1998, pages 869–875

Robust pronoun resolution with limited knowledge. In ACL, 1998, pages 869–875. Ruslan Mitkov et al

work page 1998

[12] [12]

In CCL, 1994, pages 1157–1163

Robust method of pro- noun resolution using full-text information. In CCL, 1994, pages 1157–1163. Vincent Ng

work page 1994

[13] [13]

In EMNLP , 2005, volume 20, page

Supervised ranking for pronoun res- olution: Some recent improvements. In EMNLP , 2005, volume 20, page

work page 2005

[14] [14]

In EMNLP , 2014, pages 1532–1543

Glove: Global vectors for word representation. In EMNLP , 2014, pages 1532–1543. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer

work page 2014

[15] [15]

Deep contextualized word representations

Deep contextualized word rep- resentations. arXiv preprint arXiv:1802.05365. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

In EMNLP , 2012, pages 1–40

Conll- 2012 shared task: Modeling multilingual unre- stricted coreference in ontonotes. In EMNLP , 2012, pages 1–40. Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran- garajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning

work page 2012

[17] [17]

In EMNLP , 2010, pages 492–501

A multi- pass sieve for coreference resolution. In EMNLP , 2010, pages 492–501. Altaf Rahman and Vincent Ng

work page 2010

[18] [18]

In ACL, 2011, pages 814–824

Coreference res- olution with world knowledge. In ACL, 2011, pages 814–824. Philip Resnik

work page 2011

[19] [19]

In Proceedings of IJCAI 2018, pages 4368–4374

Complementary Learning of Word Embeddings. In Proceedings of IJCAI 2018, pages 4368–4374. Yan Song, Shuming Shi, Jing Li, and Haisong Zhang

work page 2018

[20] [20]

In Proceedings of NAACL-HLT 2018, pages 175–180

Directional Skip-Gram: Explicitly Distin- guishing Left and Right Context for Word Embed- dings. In Proceedings of NAACL-HLT 2018, pages 175–180. Josef Steinberger, Massimo Poesio, Mijail A Kabadjov, and Karel Jevzek

work page 2018

[21] [21]

In ACL, 2003, pages 168–175

A ma- chine learning approach to pronoun resolution in spoken dialogue. In ACL, 2003, pages 168–175. Long Trieu, Nhung Nguyen, Makoto Miwa, and Sophia Ananiadou

work page 2003

[22] [22]

In Proceedings of the BioNLP 2018 workshop, pages 183–188

Investigating domain-speciﬁc information for neural coreference resolution on biomedical texts. In Proceedings of the BioNLP 2018 workshop, pages 183–188. Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R South

work page 2018

[23] [23]

Artiﬁcial intelligence, 6(1):53–74

A preferential, pattern-seeking, semantics for natural language inference. Artiﬁcial intelligence, 6(1):53–74. Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. 2019a. Aser: A large- scale eventuality knowledge graph. arXiv preprint arXiv:1905.00270. Hongming Zhang, Yan Song, and Yangqiu Song. 2019b. Incorporating context and exte...

work page arXiv 1905