Exploiting Entity BIO Tag Embeddings and Multi-task Learning for Relation Extraction with Imbalanced Data
Pith reviewed 2026-05-25 19:23 UTC · model grok-4.3
The pith
A multi-task model using BIO tag embeddings from named entity recognition improves relation extraction F1 by more than 10 points on imbalanced ACE 2005 data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that jointly optimizing relation identification via cross-entropy and relation classification via ranking loss, while injecting BIO tag embeddings from a separate named entity recognition task into the input embeddings, supplies the semantic patterns needed to separate positive from negative relation instances and thereby overcomes the performance drop caused by severe class imbalance.
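To make the imbalance concrete, here is a minimal sketch of how candidate instances might be enumerated from one sentence. It assumes that every ordered pair of entity mentions is a candidate and that unannotated pairs are negatives; the entity ids, the `REL_A` label, and the `candidate_pairs` helper are all hypothetical and not taken from the paper.

```python
from itertools import permutations

def candidate_pairs(entities, gold_relations, negative_label="NONE"):
    """Label every ordered entity pair; pairs absent from gold_relations are negatives."""
    return [((head, tail), gold_relations.get((head, tail), negative_label))
            for head, tail in permutations(entities, 2)]

# Hypothetical sentence with four entity mentions and one annotated relation.
entities = ["e1", "e2", "e3", "e4"]
gold_relations = {("e1", "e2"): "REL_A"}  # "REL_A" is a placeholder label

pairs = candidate_pairs(entities, gold_relations)
num_positive = sum(1 for _, label in pairs if label != "NONE")
num_negative = len(pairs) - num_positive
print(num_positive, num_negative)  # 1 positive, 11 negatives
```

Under these assumptions a single sentence already yields eleven negatives for one positive, which is the kind of skew the claim is about.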
What carries the argument
A multi-task architecture that pairs cross-entropy loss for identifying whether a relation exists with ranking loss for assigning the correct relation class, augmented by BIO tag embeddings derived from named entity recognition.
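The pairing of the two losses can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the ranking term follows the general pairwise form used by dos Santos et al. (2015), the margins, scale factor, and task weight are placeholder values, and a negative instance is represented by `gold=None`.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def identification_loss(logit, has_relation):
    """Binary cross-entropy on one logit: is there any relation at all?"""
    return softplus(-logit) if has_relation else softplus(logit)

def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss over relation-class scores (placeholder margins).

    Pushes the gold class score above m_pos and the best wrong class
    score below -m_neg. For a negative instance (gold is None) only the
    second term applies.
    """
    loss = 0.0
    if gold is not None:
        loss += softplus(gamma * (m_pos - scores[gold]))
    wrong = [s for i, s in enumerate(scores) if i != gold]
    loss += softplus(gamma * (m_neg + max(wrong)))
    return loss

def joint_loss(id_logit, class_scores, gold, weight=1.0):
    """Sum of the two task losses; `weight` is an assumed balancing factor."""
    return (identification_loss(id_logit, gold is not None)
            + weight * ranking_loss(class_scores, gold))

# A positive instance whose gold class (index 0) is scored well ...
print(joint_loss(3.0, [4.0, -2.0, -1.5], gold=0))
# ... costs less than the same instance scored badly.
print(joint_loss(-1.0, [-0.5, 2.0, 0.0], gold=0))
```

The design point is that the identification term sees every instance, while the gold-class term of the ranking loss is only active for positives.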
If this is right
- The model achieves an absolute F1 increase of more than 10 percentage points over a baseline on imbalanced relation extraction.
- It outperforms prior state-of-the-art systems on the ACE 2005 Chinese and English corpora.
- BIO tag embeddings can be added to other relation extraction models to produce performance gains.
Where Pith is reading between the lines
- The same BIO tag injection technique could be tested on relation extraction datasets from domains other than news text to check whether the gain persists.
- If entity boundary signals prove helpful here, similar tag embeddings might aid other tasks that require distinguishing sparse positive events from abundant negatives.
- An ablation that replaces BIO tags with random embeddings of the same dimensionality would isolate whether the actual tag values or merely the added capacity drives the improvement.
Load-bearing premise
The patterns captured by character-wise or word-wise BIO tag embeddings from a separate named entity recognition task contain useful semantic information that helps distinguish positive from negative relation instances.
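A minimal sketch of what enriching the input with BIO tag embeddings could look like. It assumes untyped tags (`B`, `I`, `O`), concatenation as the way the two embeddings are combined, and small randomly initialised lookup tables; the example sentence, the dimensions, and the helper names are hypothetical.

```python
import random

def bio_tags(num_tokens, entity_spans):
    """Return one BIO tag per token; spans are (start, end) with end exclusive."""
    tags = ["O"] * num_tokens
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def make_table(keys, dim, rng):
    """Randomly initialised embedding table (stand-in for learned weights)."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}

def enrich(tokens, tags, token_table, tag_table):
    """Concatenate each token embedding with the embedding of its BIO tag."""
    return [token_table[tok] + tag_table[tag] for tok, tag in zip(tokens, tags)]

rng = random.Random(0)
tokens = ["Steve", "Jobs", "founded", "Apple"]   # made-up example sentence
spans = [(0, 2), (3, 4)]                         # two entity mentions
tags = bio_tags(len(tokens), spans)
print(tags)  # ['B', 'I', 'O', 'B']

token_table = make_table(set(tokens), dim=8, rng=rng)
tag_table = make_table(["B", "I", "O"], dim=3, rng=rng)
inputs = enrich(tokens, tags, token_table, tag_table)
print(len(inputs), len(inputs[0]))  # 4 tokens, 11 dimensions each
```

The tag sequence marks every entity mention in the sentence, not just the candidate pair, which is the pattern information the premise relies on.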
What would settle it
Running the proposed model on the ACE 2005 Chinese or English corpus after removing the BIO tag embeddings from the input representation and observing that the F1 improvement over the baseline falls below 5 absolute points would falsify the contribution of those embeddings.
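The comparison above rests on two quantities: F1 over positive relation instances and the absolute gain over the baseline, in percentage points. The sketch below shows how they might be computed; the counts are placeholders chosen for illustration, not results from the paper, and the function names are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def absolute_gain(f1_baseline, f1_model):
    """Difference in F1, in percentage points when both inputs are percentages."""
    return f1_model - f1_baseline

# Placeholder counts for three systems evaluated on the same test set.
baseline = 100.0 * f1_score(tp=50, fp=50, fn=50)
without_bio = 100.0 * f1_score(tp=55, fp=45, fn=45)
full_model = 100.0 * f1_score(tp=62, fp=38, fn=38)

print(round(absolute_gain(baseline, full_model), 1))   # gain of the full model
print(round(absolute_gain(baseline, without_bio), 1))  # gain with embeddings removed
print(round(full_model - without_bio, 1))              # share attributable to the embeddings
```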
Original abstract
In practical scenario, relation extraction needs to first identify entity pairs that have relation and then assign a correct relation class. However, the number of non-relation entity pairs in context (negative instances) usually far exceeds the others (positive instances), which negatively affects a model's performance. To mitigate this problem, we propose a multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification with ranking loss. Meanwhile, we observe that a sentence may have multiple entities and relation mentions, and the patterns in which the entities appear in a sentence may contain useful semantic information that can be utilized to distinguish between positive and negative instances. Thus we further incorporate the embeddings of character-wise/word-wise BIO tag from the named entity recognition task into character/word embeddings to enrich the input representation. Experiment results show that our proposed approach can significantly improve the performance of a baseline model with more than 10% absolute increase in F1-score, and outperform the state-of-the-art models on ACE 2005 Chinese and English corpus. Moreover, BIO tag embeddings are particularly effective and can be used to improve other models as well.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-task architecture for relation extraction on imbalanced data that jointly optimizes relation identification via cross-entropy loss and relation classification via ranking loss, while enriching input representations with character-wise and word-wise BIO tag embeddings obtained from a separate NER task. It reports that this yields more than 10% absolute F1 improvement over a baseline and outperforms prior state-of-the-art models on the ACE 2005 Chinese and English corpora, with the BIO embeddings described as particularly effective.
Significance. If the performance gains prove robust under controlled evaluation, the combination of multi-task losses with auxiliary NER-derived embeddings could supply a practical, reusable technique for mitigating imbalance in relation extraction and potentially other entity-centric NLP tasks.
major comments (2)
- [Abstract] The headline claim of '>10% absolute increase in F1-score' and outperformance of SOTA is presented without any description of the baseline model, dataset partitioning, hyper-parameter settings, or statistical significance testing, rendering it impossible to attribute the lift to the BIO embeddings versus the multi-task objective alone.
- [Abstract] The assertion that 'BIO tag embeddings are particularly effective' is load-bearing for the central contribution yet is unsupported by any ablation that removes the embeddings while retaining the joint cross-entropy + ranking losses, or that compares against a simple parameter-matched baseline.
minor comments (1)
- [Abstract] The abstract refers to both 'character-wise' and 'word-wise' BIO embeddings without clarifying whether both are used simultaneously or chosen per language/corpus.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our submission. Below we provide point-by-point responses to the major comments.
- Referee: [Abstract] The headline claim of '>10% absolute increase in F1-score' and outperformance of SOTA is presented without any description of the baseline model, dataset partitioning, hyper-parameter settings, or statistical significance testing, rendering it impossible to attribute the lift to the BIO embeddings versus the multi-task objective alone.
  Authors: We agree that the abstract, being a concise summary, omits these details. The baseline is a standard relation extraction model without the proposed components, the dataset partitioning follows the ACE 2005 conventions as detailed in the experimental setup section, hyper-parameters are specified there as well, and results are averaged over five runs. No formal statistical significance testing beyond reporting averages was conducted. We will revise the abstract to note that the gains result from the joint optimization and BIO embeddings, with full attribution supported by the experiments in the paper body.
  Revision: partial
- Referee: [Abstract] The assertion that 'BIO tag embeddings are particularly effective' is load-bearing for the central contribution yet is unsupported by any ablation that removes the embeddings while retaining the joint cross-entropy + ranking losses, or that compares against a simple parameter-matched baseline.
  Authors: The manuscript shows that BIO tag embeddings improve performance when added to other models. However, a dedicated ablation study that removes the BIO embeddings from the multi-task model (retaining only the joint losses) or a parameter-matched baseline is not included. We will add these ablations to the revised manuscript to provide stronger support for the claim.
  Revision: yes
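The rebuttal reports averages over five runs without a significance test. One simple option, sketched here as an illustration rather than as what the authors did or plan to do, is an exact paired sign-flip (permutation) test on per-run F1 differences. The differences below are made-up placeholder numbers, and pairing run i of one system with run i of the other is itself an assumption.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip test on paired differences.

    Under the null hypothesis each difference is equally likely to be
    positive or negative, so all 2**n sign assignments are enumerated and
    the share with |sum| at least as large as the observed one is returned.
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Placeholder per-run F1 differences (ablated model minus baseline), five runs.
diffs = [1.0, 2.0, 1.5, 0.5, 1.2]
print(sign_flip_p_value(diffs))  # 0.0625
```

With only five paired runs the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so more runs would be needed to reach a conventional 0.05 threshold.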
Circularity Check
No circularity: empirical architecture on external benchmarks
The paper presents a multi-task neural architecture for relation extraction that combines cross-entropy loss on identification with ranking loss on classification, plus concatenation of pre-trained BIO tag embeddings into the input representation. All performance claims rest on held-out evaluation on the standard ACE 2005 Chinese and English corpora; no equations, uniqueness theorems, or self-citations are invoked to derive the reported F1 gains. The method is therefore self-contained against external data and does not reduce any claimed result to a fitted input or self-referential quantity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Cross-entropy loss is appropriate for binary relation identification and ranking loss is appropriate for multi-class relation classification.
- domain assumption: BIO tag sequences produced by an off-the-shelf NER model carry transferable semantic patterns useful for relation extraction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification with ranking loss... incorporate the embeddings of character-wise/word-wise BIO tag"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "BIO tag embeddings are particularly effective"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.