arxiv: 1906.08931 · v1 · pith:O33KTMLRnew · submitted 2019-06-21 · 💻 cs.CL

Exploiting Entity BIO Tag Embeddings and Multi-task Learning for Relation Extraction with Imbalanced Data

Pith reviewed 2026-05-25 19:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords: relation extraction, imbalanced data, multi-task learning, BIO tag embeddings, named entity recognition, ACE 2005

The pith

A multi-task model using BIO tag embeddings from named entity recognition improves relation extraction F1 by more than 10 points on imbalanced ACE 2005 data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that negative non-relation entity pairs vastly outnumber positive ones in relation extraction. It introduces a multi-task setup that trains relation identification with cross-entropy loss alongside relation classification with ranking loss, while enriching input representations with character-wise or word-wise BIO tag embeddings taken from a named entity recognition task. The authors report that this combination raises baseline F1 by more than 10 absolute points and surpasses prior state-of-the-art results on both the Chinese and English portions of the ACE 2005 corpus. They further note that the BIO tag embeddings alone can be added to other models to gain similar benefits.
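
To make the imbalance concrete, the sketch below enumerates candidate entity pairs for a single sentence. The entity mentions, the one gold relation, and its label are made up for illustration, and treating every ordered pair of distinct entities as a candidate is an assumption rather than the paper's documented candidate-generation rule.

```python
from itertools import permutations

# Hypothetical example: entity mentions in one sentence and its gold relations.
entities = ["Smith", "Acme Corp", "Boston", "Monday"]
gold_relations = {("Smith", "Acme Corp"): "ORG-AFF"}

# Assumption: every ordered pair of distinct entities is a candidate instance.
candidates = list(permutations(entities, 2))
positives = [pair for pair in candidates if pair in gold_relations]
negatives = [pair for pair in candidates if pair not in gold_relations]

print(len(candidates), len(positives), len(negatives))  # prints: 12 1 11
```

Even with four entities and one annotated relation, negatives outnumber positives eleven to one under this assumption, which is the kind of skew the multi-task setup is meant to handle.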

Core claim

The central claim is that jointly optimizing relation identification via cross-entropy and relation classification via ranking loss, while injecting BIO tag embeddings from a separate named entity recognition task into the input embeddings, supplies the semantic patterns needed to separate positive from negative relation instances and thereby overcomes the performance drop caused by severe class imbalance.

What carries the argument

A multi-task architecture that pairs cross-entropy loss for identifying whether a relation exists with ranking loss for assigning the correct relation class, augmented by BIO tag embeddings derived from named entity recognition.
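
As a rough illustration of how the two objectives might be combined, the sketch below adds a binary cross-entropy term for relation identification to a pairwise ranking term for relation classification. The function names, the margins `margin_pos` and `margin_neg`, the scale `gamma`, the mixing `weight`, and the convention that a negative instance has no gold class are all assumptions for illustration; this page does not give the paper's exact formulation.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def identification_loss(logit: float, has_relation: bool) -> float:
    """Binary cross-entropy on one logit: does any relation hold?"""
    return softplus(-logit) if has_relation else softplus(logit)

def ranking_loss(scores, gold, margin_pos=2.5, margin_neg=0.5, gamma=2.0):
    """Pairwise ranking loss over per-class scores.

    gold is the index of the correct class, or None for a negative
    instance (assumed convention). The gold score is pushed above
    margin_pos and the best wrong score below -margin_neg.
    """
    loss = 0.0
    if gold is not None:
        loss += softplus(gamma * (margin_pos - scores[gold]))
    wrong = [s for i, s in enumerate(scores) if i != gold]
    if wrong:
        loss += softplus(gamma * (margin_neg + max(wrong)))
    return loss

def joint_loss(id_logit, class_scores, gold, weight=1.0):
    """Sum of the two task losses; the weighting is a hypothetical choice."""
    id_term = identification_loss(id_logit, gold is not None)
    return id_term + weight * ranking_loss(class_scores, gold)

# Toy scores for three relation classes, not taken from the paper.
good = joint_loss(3.0, [4.0, -1.0, -2.0], gold=0)
bad = joint_loss(-3.0, [-1.0, 4.0, -2.0], gold=0)
print(good < bad)
```

For a negative instance the ranking term only pushes down the highest class score, so no explicit "no relation" class score is needed in this variant.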

If this is right

  • The model achieves an absolute F1 increase of more than 10 points over a baseline on imbalanced relation extraction.
  • It outperforms prior state-of-the-art systems on the ACE 2005 Chinese and English corpora.
  • BIO tag embeddings can be added to other relation extraction models to produce performance gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same BIO tag injection technique could be tested on relation extraction datasets from domains other than news text to check whether the gain persists.
  • If entity boundary signals prove helpful here, similar tag embeddings might aid other tasks that require distinguishing sparse positive events from abundant negatives.
  • An ablation that replaces BIO tags with random embeddings of the same dimensionality would isolate whether the actual tag values or merely the added capacity drives the improvement.
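
One way to set up the control described in the last bullet is sketched below: tags are drawn at random, independent of the entity spans, and then fed through an embedding table of the same size as the real one, so capacity is preserved while the tag information is destroyed. The function name `control_tags` and the uniform sampling scheme are assumptions; shuffling the real tags within each sentence would be an alternative that also preserves tag frequencies.

```python
import random

def control_tags(num_tokens, seed=0, vocab=("B", "I", "O")):
    """Draw one tag per token uniformly at random, ignoring entity spans."""
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(num_tokens)]

# Hypothetical real tags for a six-token sentence.
real_tags = ["B", "O", "B", "I", "O", "B"]
fake_tags = control_tags(len(real_tags))
print(real_tags, fake_tags)
```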

Load-bearing premise

The patterns captured by character-wise or word-wise BIO tag embeddings from a separate named entity recognition task contain useful semantic information that helps distinguish positive from negative relation instances.
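
The input enrichment can be pictured with the minimal sketch below, which derives word-wise BIO tags from entity spans and attaches a tag embedding to each word embedding. The sentence, the spans, the embedding sizes, the random initialisation, and the use of concatenation are illustrative assumptions; plain B/I/O tags without entity types are also assumed.

```python
import random

def bio_tags(num_tokens, entity_spans):
    """Tag each token B (begins an entity), I (inside one), or O (outside)."""
    tags = ["O"] * num_tokens
    for start, end in entity_spans:  # end index is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def make_table(keys, dim, rng):
    """Toy embedding table: one small random vector per key."""
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}

# Hypothetical sentence with three entity mentions.
tokens = ["Smith", "joined", "Acme", "Corp", "in", "Boston"]
spans = [(0, 1), (2, 4), (5, 6)]
tags = bio_tags(len(tokens), spans)

rng = random.Random(0)
word_emb = make_table(sorted(set(tokens)), 8, rng)
tag_emb = make_table(["B", "I", "O"], 4, rng)

# List "+" concatenates here, giving one 12-dimensional vector per token.
inputs = [word_emb[w] + tag_emb[t] for w, t in zip(tokens, tags)]
print(tags)  # prints: ['B', 'O', 'B', 'I', 'O', 'B']
```

Note that the tags mark every entity in the sentence, not only the candidate pair, which is how the pattern of surrounding entities becomes visible to the model.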

What would settle it

Running the proposed model on the ACE 2005 Chinese or English corpus after removing the BIO tag embeddings from the input representation and observing that the F1 improvement over the baseline falls below 5 absolute points would falsify the contribution of those embeddings.
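
A minimal sketch of the check described above follows. It scores F1 over positive relation classes only and converts an F1 difference to absolute points. The label names, the `NONE` negative label, the toy predictions, and the choice of micro-averaging are assumptions for illustration; they are not the paper's data or its documented scoring convention.

```python
def micro_f1(gold, pred, negative_label="NONE"):
    """Micro-averaged F1 over positive classes; the negative label earns no credit."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative_label)
    pred_pos = sum(1 for p in pred if p != negative_label)
    gold_pos = sum(1 for g in gold if g != negative_label)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def gain_in_points(f1_model, f1_baseline):
    """Absolute F1 difference expressed in points (0-100 scale)."""
    return 100.0 * (f1_model - f1_baseline)

# Toy labels, not results from the paper.
gold = ["NONE", "PHYS", "NONE", "ORG-AFF", "NONE", "NONE"]
pred = ["NONE", "PHYS", "PHYS", "NONE", "NONE", "NONE"]
print(micro_f1(gold, pred))  # prints: 0.5
```

With real runs, `gain_in_points` would be applied to the no-BIO model and the baseline, and the result compared against the 5-point threshold stated above.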

Figures

Figures reproduced from arXiv: 1906.08931 by Bo Li, Long Chen, Rui Xie, Shikun Zhang, Wei Ye, Zhonghao Sheng.

Figure 1: The overall multi-task architecture. To demonstrate, there are three window sizes for filters in the convolutional layer, as denoted by the three-layer stack; for each window size there are four filters, as denoted by the number of rows in each layer. Max-pooling is applied to each row in each layer of the stack, and the dimension of the output is equal to the total number of filters. There are three main …
Figure 2: Illustration of BIO tag information and positional information for a given instance. In this example, there …
Figure 3: Effect of positive/negative instance ratio on …

Original abstract

In practical scenario, relation extraction needs to first identify entity pairs that have relation and then assign a correct relation class. However, the number of non-relation entity pairs in context (negative instances) usually far exceeds the others (positive instances), which negatively affects a model's performance. To mitigate this problem, we propose a multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification with ranking loss. Meanwhile, we observe that a sentence may have multiple entities and relation mentions, and the patterns in which the entities appear in a sentence may contain useful semantic information that can be utilized to distinguish between positive and negative instances. Thus we further incorporate the embeddings of character-wise/word-wise BIO tag from the named entity recognition task into character/word embeddings to enrich the input representation. Experiment results show that our proposed approach can significantly improve the performance of a baseline model with more than 10% absolute increase in F1-score, and outperform the state-of-the-art models on ACE 2005 Chinese and English corpus. Moreover, BIO tag embeddings are particularly effective and can be used to improve other models as well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a multi-task architecture for relation extraction on imbalanced data that jointly optimizes relation identification via cross-entropy loss and relation classification via ranking loss, while enriching input representations with character-wise and word-wise BIO tag embeddings obtained from a separate NER task. It reports that this yields more than 10% absolute F1 improvement over a baseline and outperforms prior state-of-the-art models on the ACE 2005 Chinese and English corpora, with the BIO embeddings described as particularly effective.

Significance. If the performance gains prove robust under controlled evaluation, the combination of multi-task losses with auxiliary NER-derived embeddings could supply a practical, reusable technique for mitigating imbalance in relation extraction and potentially other entity-centric NLP tasks.

major comments (2)
  1. [Abstract] The headline claim of '>10% absolute increase in F1-score' and outperformance of SOTA is presented without any description of the baseline model, dataset partitioning, hyper-parameter settings, or statistical significance testing, rendering it impossible to attribute the lift to the BIO embeddings versus the multi-task objective alone.
  2. [Abstract] The assertion that 'BIO tag embeddings are particularly effective' is load-bearing for the central contribution yet is unsupported by any ablation that removes the embeddings while retaining the joint cross-entropy + ranking losses, or that compares against a simple parameter-matched baseline.
minor comments (1)
  1. [Abstract] The abstract refers to both 'character-wise' and 'word-wise' BIO embeddings without clarifying whether both are used simultaneously or chosen per language/corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our submission. Below we provide point-by-point responses to the major comments.

  1. Referee: [Abstract] The headline claim of '>10% absolute increase in F1-score' and outperformance of SOTA is presented without any description of the baseline model, dataset partitioning, hyper-parameter settings, or statistical significance testing, rendering it impossible to attribute the lift to the BIO embeddings versus the multi-task objective alone.

    Authors: We agree that the abstract, being a concise summary, omits these details. The baseline is a standard relation extraction model without the proposed components, the dataset partitioning follows the ACE 2005 conventions as detailed in the experimental setup section, hyper-parameters are specified there as well, and results are averaged over five runs. No formal statistical significance testing beyond reporting averages was conducted. We will revise the abstract to note that the gains result from the joint optimization and BIO embeddings, with full attribution supported by the experiments in the paper body.
    Revision: partial

  2. Referee: [Abstract] The assertion that 'BIO tag embeddings are particularly effective' is load-bearing for the central contribution yet is unsupported by any ablation that removes the embeddings while retaining the joint cross-entropy + ranking losses, or that compares against a simple parameter-matched baseline.

    Authors: The manuscript shows that BIO tag embeddings improve performance when added to other models. However, a dedicated ablation study that removes the BIO embeddings from the multi-task model (retaining only the joint losses) or a parameter-matched baseline is not included. We will add these ablations to the revised manuscript to provide stronger support for the claim.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture on external benchmarks

The paper presents a multi-task neural architecture for relation extraction that combines cross-entropy loss on identification with ranking loss on classification, plus concatenation of pre-trained BIO tag embeddings into the input representation. All performance claims rest on held-out evaluation on the standard ACE 2005 Chinese and English corpora; no equations, uniqueness theorems, or self-citations are invoked to derive the reported F1 gains. The method is therefore self-contained against external data and does not reduce any claimed result to a fitted input or self-referential quantity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No new free parameters, axioms, or invented entities are introduced beyond standard supervised neural-network assumptions and the pre-existing NER task used to generate BIO tags.

axioms (2)
  • [domain assumption] Cross-entropy loss is appropriate for binary relation identification and ranking loss is appropriate for multi-class relation classification.
    The multi-task objective rests on these loss functions being well-matched to the two subtasks.
  • [domain assumption] BIO tag sequences produced by an off-the-shelf NER model carry transferable semantic patterns useful for relation extraction.
    The input-enrichment step assumes the auxiliary NER signal is informative rather than noise.
