What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Rezarta Islamaj; Robert Leaman; Zhiyong Lu

arxiv: 2605.20537 · v1 · pith:IZOJEFBNnew · submitted 2026-05-19 · 💻 cs.CL

Pith reviewed 2026-05-21 06:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords biomedical NER · entity linking · corpus diagnostics · benchmark analysis · train-test overlap · terminology coverage · named entity recognition · generalization demands

The pith

Biomedical NER and entity linking corpora differ substantially in properties even for similar tasks, rendering common statistics insufficient to show what benchmarks actually evaluate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a diagnostic framework that pulls standardized statistics directly from corpus annotations, concept links, train-test splits, metadata, and terminology mappings. These statistics fall into five families that together reveal how much each corpus tests specific evaluation signals, how much generalization it demands, how much train-test reuse it allows, and which parts of the biomedical literature and concept space it covers. Applying the framework to nine existing corpora for diseases, chemicals, and cell types shows that corpora addressing the same apparent task can still differ markedly in all these respects. A sympathetic reader would care because model developers often choose corpora or interpret benchmark scores without knowing these hidden differences, which can mask transfer risks or limit the scope of conclusions drawn from results.

Core claim

We present a corpus-centric framework that organizes standardized statistics into five families—scale, density and label distribution; lexical and conceptual structure; train-test overlap; metadata composition; and terminology coverage—to diagnose benchmark-relevant properties directly from annotations, concept links, splits, document metadata, and terminology mappings. When applied to nine corpora spanning diseases, chemicals, and cell types, the framework shows that corpus properties differ substantially even for the same apparent task, producing distinct evaluation signals, generalization demands, degrees of train-test reuse, and coverage of literature and concept space. These differences

What carries the argument

The corpus-centric diagnostic framework, which computes and compares statistics across five families drawn from annotations, links, splits, metadata, and mappings to expose differences in evaluation signal and generalization demands.
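
As a rough illustration of the first family (scale, density, and label distribution), the sketch below computes a few such statistics from a toy set of annotations. The `Mention` record, the function name, the field names, and the toy data are all hypothetical; they are not taken from the paper's released code, which may define these statistics differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical annotation record: one entity mention in one document."""
    doc_id: str
    text: str
    label: str


def scale_and_density(mentions, doc_token_counts):
    """Return simple scale, density, and label-distribution statistics.

    doc_token_counts maps doc_id -> number of tokens in that document.
    """
    total_tokens = sum(doc_token_counts.values())
    n_mentions = len(mentions)
    if total_tokens == 0 or n_mentions == 0:
        raise ValueError("need at least one token and one mention")
    label_counts = Counter(m.label for m in mentions)
    return {
        "documents": len(doc_token_counts),
        "mentions": n_mentions,
        "mentions_per_100_tokens": 100.0 * n_mentions / total_tokens,
        "label_share": {lab: c / n_mentions for lab, c in label_counts.items()},
    }


# Toy data, invented for illustration only.
toy_mentions = [
    Mention("d1", "breast cancer", "Disease"),
    Mention("d1", "tamoxifen", "Chemical"),
    Mention("d2", "asthma", "Disease"),
    Mention("d2", "asthma", "Disease"),
]
toy_tokens = {"d1": 120, "d2": 80}
stats = scale_and_density(toy_mentions, toy_tokens)
```

Two corpora with the same mention count can still differ sharply in density and label share, which is the kind of difference the framework is meant to surface.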

If this is right

Benchmark results can be interpreted with explicit awareness of the distinct regions of literature and concept space each corpus represents.
Transfer risks between corpora become identifiable before model training or evaluation.
Corpus selection can move beyond surface descriptors such as size and entity type toward matching specific generalization requirements.
The open-source implementation and dashboard enable direct reproduction and extension to new corpora.
Reporting practices can incorporate these diagnostics to clarify what a given benchmark actually measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same diagnostic approach could be applied to related biomedical tasks such as relation extraction to reveal comparable hidden differences.
Models that appear strong on one corpus may fail on another primarily because of unmeasured differences in lexical structure or metadata composition.
Standard practice in the field may need to evolve toward routine publication of these five-family statistics alongside benchmark scores.
Interactive dashboards built on the framework could become a standard way to document corpus suitability for specific research questions.

Load-bearing premise

That the five families of statistics together capture enough information to diagnose the evaluation signal, generalization demands, and transfer risks of each corpus.

What would settle it

Re-running the framework on the same nine corpora and finding that all differences in the five statistic families are negligible or already predicted by simple size and entity-type counts.

Figures

Figures reproduced from arXiv: 2605.20537 by Rezarta Islamaj, Robert Leaman, Zhiyong Lu.

**Figure 2.** Train-test overlap across nine biomedical corpora. All values are Jaccard similarity (%) between training …

**Figure 3.** Terminology-based coverage analysis. Left panels: distribution normalized within each corpus’s own …
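
The Figure 2 caption reports train-test overlap as Jaccard similarity in percent. A minimal sketch of that measure follows, assuming overlap is computed over sets of mention strings or concept identifiers. The function name, the case-folding step, and the placeholder identifiers are assumptions made for illustration; the paper's exact normalization and unit of comparison are not stated in this summary.

```python
def jaccard_percent(train_items, test_items):
    """Jaccard similarity between two collections, as a percentage."""
    a, b = set(train_items), set(test_items)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 100.0 * len(a & b) / len(union)


# Toy splits, invented for illustration only.
train_mentions = ["asthma", "breast cancer", "type 2 diabetes"]
test_mentions = ["Asthma", "lung cancer", "type 2 diabetes"]

# Case-folding is one possible normalization; other choices give other values.
surface_overlap = jaccard_percent(
    (m.lower() for m in train_mentions),
    (m.lower() for m in test_mentions),
)

# Placeholder concept identifiers, not real terminology codes.
train_concepts = ["CONCEPT_A", "CONCEPT_B", "CONCEPT_C"]
test_concepts = ["CONCEPT_A", "CONCEPT_C", "CONCEPT_D", "CONCEPT_E"]
concept_overlap = jaccard_percent(train_concepts, test_concepts)
```

Computing the measure separately for surface forms and for concepts distinguishes lexical reuse from conceptual reuse between splits.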

Original abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
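
To make the fifth family (terminology coverage) more concrete, the sketch below measures what fraction of a terminology a corpus touches, overall and per top-level branch. The function name, the flat `concept_id -> branch` mapping, and the toy identifiers are hypothetical simplifications; a real terminology is typically a hierarchy, and the paper's coverage statistics may be defined differently.

```python
from collections import Counter


def terminology_coverage(corpus_concepts, terminology):
    """Fraction of terminology concepts used in a corpus.

    terminology maps concept_id -> top-level branch name (hypothetical layout).
    """
    if not terminology:
        raise ValueError("terminology must not be empty")
    linked = set(corpus_concepts)
    used = linked & set(terminology)
    branch_total = Counter(terminology.values())
    branch_used = Counter(terminology[c] for c in used)
    return {
        "overall": len(used) / len(terminology),
        "by_branch": {b: branch_used[b] / n for b, n in branch_total.items()},
        # Corpus identifiers that the terminology does not contain.
        "unmapped": sorted(linked - set(terminology)),
    }


# Toy terminology and corpus links, invented for illustration only.
toy_terminology = {
    "T1": "neoplasms",
    "T2": "neoplasms",
    "T3": "respiratory",
    "T4": "respiratory",
    "T5": "metabolic",
}
toy_corpus_concepts = ["T1", "T3", "T3", "T9"]
coverage = terminology_coverage(toy_corpus_concepts, toy_terminology)
```

A branch with zero coverage, such as `metabolic` in the toy data, marks a region of concept space on which the corpus provides no evaluation signal.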

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents a corpus-centric diagnostic framework for biomedical NER and EL benchmarks. The framework computes standardized statistics across five families—scale, density and label distribution; lexical and conceptual structure; train-test overlap; metadata composition; and terminology coverage—from annotations, concept links, splits, metadata, and terminology mappings. When applied to nine corpora covering diseases, chemicals, and cell types, the analysis shows that corpora addressing similar tasks can differ substantially in evaluation signal, generalization demands, train-test reuse, and coverage of biomedical literature and concept space. The authors conclude that commonly reported statistics are insufficient to characterize these benchmarks and release open-source code and an interactive dashboard for reproducibility and extension.

Significance. This work is significant for the field of biomedical NLP as it provides a practical, descriptive tool to better interpret what existing benchmarks actually measure and to identify potential transfer risks when using them. The direct, parameter-free computation of statistics from the corpora themselves, the broad application to nine diverse resources, and the public release of code and dashboard are notable strengths that enhance reproducibility and utility. If the differences identified are as pronounced as reported, the framework could help researchers select more appropriate corpora and interpret benchmarking results with greater nuance.

minor comments (2)

[§3.2] The precise formulas for the lexical and conceptual structure statistics (e.g., type-token ratios and concept overlap measures) are described in prose; adding explicit equations would improve precision and ease of re-implementation.
[Table 2] The train-test overlap column reports percentages but does not indicate whether the splits are document-level or mention-level; clarifying this distinction would strengthen the interpretation of reuse risks.
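
For orientation, the conventional textbook forms of two measures named in these comments are given below. These are standard definitions and not necessarily the exact formulas used in §3.2 or Table 2 of the paper; the symbols are introduced here for illustration only.

```latex
\[
  \mathrm{TTR} = \frac{|V|}{N},
  \qquad
  J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
% V: set of distinct mention strings (types); N: total number of mentions (tokens).
% A, B: sets of items (mention strings or concept identifiers) drawn from the
% training and test splits, respectively.
```

Under these definitions the value of J depends on what the sets contain, which is why the unit of comparison raised in the Table 2 comment affects how the reported percentages should be read.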

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, for recognizing its significance to biomedical NLP, and for recommending minor revision. The referee's description accurately reflects the corpus-centric diagnostic framework, its application to nine corpora, and the release of code and dashboard. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines a descriptive diagnostic framework consisting of five families of standardized statistics computed directly from corpus annotations, concept links, train-test splits, metadata, and terminology mappings. These are applied empirically to nine existing corpora to identify differences in scale, structure, overlap, composition, and coverage. No derivations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the argument; the central claims rest on explicit computation and released code rather than any reduction to inputs by construction. The framework functions as a practical analysis tool without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that the chosen statistics capture benchmark-relevant properties; no free parameters are introduced and no new entities are postulated.

axioms (1)

Domain assumption: Standard statistical measures of scale, overlap, and terminology coverage meaningfully diagnose what a benchmark evaluates.
Invoked when the five families are presented as the diagnostic lens.


Reference graph

Works this paper leans on

143 extracted references · 143 canonical work pages · 3 internal anchors

[1]

Bio-ID track overview

Arighi, Cecilia and Lynette Hirschman and Thomas Lemberger and Samuel Bayer and Robin Liechti and Donald Comeau and Cathy Wu. Bio-ID track overview. BioCreative VI Challenge Evaluation Workshop. 2017

work page 2017
[2]

AI Magazine , month = mar, pages =

Aroyo, Lora and Welty, Chris , title =. AI Magazine , month = mar, pages =. 2015 , issue_date =. doi:10.1609/aimag.v36i1.2564 , abstract =

work page doi:10.1609/aimag.v36i1.2564 2015
[3]

Computational Linguistics 34(4), 555–596 (2008)

Survey Article: Inter-Coder Agreement for Computational Linguistics , author=. Computational Linguistics , volume=. doi:10.1162/coli.07-034-R2

work page doi:10.1162/coli.07-034-r2
[4]

Bada, Michael and Miriam Eckert and Donald Evans and Kristin Garcia and Krista Shipley and Dmitry Sitnikov and William A Baumgartner Jr and K Bretonnel Cohen and Karin Verspoor and Judith A Blake and Lawrence E. Hunter. Concept annotation in the CRAFT corpus. BMC Bioinformatics. 2012. doi:10.1186/1471-2105-13-161

work page doi:10.1186/1471-2105-13-161 2012
[5]

Introduction to the Bio-entity Recognition Task at JNLPBA

Collier, Nigel and Tomoko Ohta and Yoshimasa Tsuruoka and Yuka Tateisi and Jin-Dong Kim. Introduction to the Bio-entity Recognition Task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP). 2004

work page 2004
[6]

Donald C. Comeau and Rezarta Islamaj Doğan and Paolo Ciccarese and Kevin Bretonnel Cohen and Martin Krallinger and Florian Leitner and Zhiyong Lu and Yifan Peng and Fabio Rinaldi and Manabu Torii and Alfonso Valencia and Karin Verspoor and Thomas C. Wiegers and Cathy H. Wu and W. John Wilbur. BioC: a minimalist approach to interoperability for biomedical ...

work page doi:10.1093/database/bat064 2013
[7]

Bioinformatics , volume =

Comeau, Donald C and Wei, Chih-Hsuan and Islamaj Doğan, Rezarta and Lu, Zhiyong , title =. Bioinformatics , volume =. 2019 , month =. doi:10.1093/bioinformatics/btz070 , url =

work page doi:10.1093/bioinformatics/btz070 2019
[8]

NCBI disease corpus: A resource for disease name recognition and concept normalization

Doğan, Rezarta Islamaj and Robert Leaman and Zhiyong Lu. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics. 2014. doi:10.1016/j.jbi.2013.12.006

work page doi:10.1016/j.jbi.2013.12.006 2014
[9]

2022 , url=

Fries, Jason and Weber, Leon and Seelam, Natasha and Altay, Gabriel and Datta, Debajyoti and Garda, Samuele and Kang, Sunny and Su, Rosaline and Kusa, Wojciech and Cahyawijaya, Samuel and others , booktitle=. 2022 , url=

work page 2022
[10]

ACM Transactions on Computing for Healthcare ( HEALTH )

Gu, Yu and Tinn, Robert and Cheng, Hao and Lucas, Michael and Usuyama, Naoto and Liu, Xiaodong and Naumann, Tristan and Gao, Jianfeng and Poon, Hoifung , title = ". ACM Transactions on Computing for Healthcare ( HEALTH ). 2021

work page 2021
[11]

and Smith, Noah A

Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel R. and Smith, Noah A. , title = ". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pages =. 2018 , doi =

work page 2018
[12]

Journal of Biomedical Informatics , volume =

Herrero-Zazo, María and Segura-Bedmar, Isabel and Martínez, Paloma and Declerck, Thierry , title = ". Journal of Biomedical Informatics , volume =. 2013 , doi =

work page 2013
[13]

International Journal of Translation , volume=

Towards a `science' of corpus annotation: a new methodological challenge for corpus linguistics , author=. International Journal of Translation , volume=

work page
[14]

Journal of the American Medical Informatics Association , volume=

Agreement, the f-measure, and reliability in information retrieval , author=. Journal of the American Medical Informatics Association , volume=. 2005 , doi =

work page 2005
[15]

Briefings in Bioinformatics , volume=

Biomedical named entity recognition and linking datasets: survey and our recent development , author=. Briefings in Bioinformatics , volume=. 2020 , publisher=

work page 2020
[16]

Islamaj, Rezarta and Leaman, Robert and Kim, Sun and Kwon, Dongseop and Wei, Chih-Hsuan and Comeau, Donald C. and Peng, Yifan and Cissel, David and Coss, Cathleen and Fisher, Carol and Guzman, Rob and Kochar, Preeti Gokal and Koppel, Stella and Trinh, Dorothy and Sekiya, Keiko and Ward, Janice and Whitman, Deborah and Schmidt, Susan and Lu, Zhiyong , titl...

work page 2021
[17]

Islamaj and R. and C. H. Wei and D. Cissel and N. Miliaras and O. Printseva and O. Rodionov and K. Sekiya and J. Ward and Z. Lu. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. Journal of Biomedical Informatics. 2021

work page 2021
[18]

Islamaj and R. and C. H. Wei and P. T. Lai and L. Luo and C. Coss and P. Gokal Kochar and N. Miliaras and O. Rodionov and K. Sekiya and D. Trinh and D. Whitman and Z. Lu. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford). 2024

work page 2024
[19]

GENIA corpus--a semantically annotated corpus for bio-textmining

Kim, Jin-Dong and Ohta, Tomoko and Tateisi, Yuka and Tsujii, Jun'ichi. GENIA corpus--a semantically annotated corpus for bio-textmining. Bioinformatics. 2003

work page 2003
[20]

Journal of Cheminformatics

Krallinger, Martin and Rabal, Obdulia and Leitner, Florian and Vazquez, Miguel and Salgado, David and Lu, Zhiyong and Leaman, Robert and Lu, Yanan and Ji, Donghong and Lowe, Daniel M and Sayle, Roger A and Batista-Navarro, Riza Theresa and Rak, Rafal and Huber, Torsten and Rocktäschel, Tim and Matos, Sérgio and Campos, David and Tang, Buzhou and Hua Xu an...

work page 2015
[21]

and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J

Li, Jiao and Sun, Yueping and Johnson, Robin J. and Sciaky, Daniela and Wei, Chih-Hsuan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J. and Wiegers, Thomas C. and Lu, Zhiyong , title =. Database , volume =. 2016 , doi =

work page 2016
[22]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =

work page
[23]

Medical Subject Headings (

Lipscomb, Carolyn E , journal=. Medical Subject Headings (

work page
[24]

BMC Bioinformatics

Lu, Zhiyong and Kao, Hung-Yu and Wei, Chih-Hsuan and Huang, Minlie and Liu, Jingchen and Kuo, Cheng-Ju and Hsu, Chun-Nan and Tsai, Richard Tzong-Han and Dai, Hong-Jie and Okazaki, Naoaki and Cho, Hye-Cheol and Gerner, Martin and Solt, Illes and Agarwal, Shashank and Liu, Feifan and Vishnyakova, Dina and Ruch, Patrick and Romacker, Martin and Rinaldi, Fabi...

work page doi:10.1186/1471-2105-12-s8-s2 2011
[25]

Nucleic Acids Research , volume=

Malik, Adnan and Arsalan, Muhammad and Moreno, Carlos and Mosquera, Juan and F. Nucleic Acids Research , volume=. 2026 , publisher=

work page 2026
[26]

Miranda-Escalada and A. and F. Mehryary and J. Luoma and D. Estrada-Zavala and L. Gasco and S. Pyysalo and A. Valencia and M. Krallinger. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations. Database (Oxford). 2023

work page 2023
[27]

and Lu, Zhiyong and Wang, Xinglong and Cohen, Aaron M

Morgan, Alexander A. and Lu, Zhiyong and Wang, Xinglong and Cohen, Aaron M. and Fluck, Juliane and Ruch, Patrick and Divoli, Anna and Fundel, Katrin and Leaman, Robert and Hakenberg, Jörg and Sun, Chenjie and Liu, Heng-hui and Torres, Rafael and Krauthammer, Michael and Lau, William W. and Liu, Hongfang and Hsu, Chun-Nan and Schuemie, Martijn and Cohen, K...

work page 2008
[28]

K nowtator: A Prot \'e g \'e plug-in for annotated corpus construction

Ogren, Philip V. K nowtator: A Prot \'e g \'e plug-in for annotated corpus construction. Proceedings of the Human Language Technology Conference of the NAACL , Companion Volume: Demonstrations. 2006

work page 2006
[29]

Proceedings of the 18th BioNLP Workshop and Shared Task , pages =

Peng, Yifan and Yan, Shankai and Lu, Zhiyong , title =. Proceedings of the 18th BioNLP Workshop and Shared Task , pages =. 2019 , doi =

work page 2019
[30]

Bioinformatics , volume =

Pyysalo, Sampo and Ananiadou, Sophia , title =. Bioinformatics , volume =. 2014 , doi =

work page 2014
[31]

Rotenberg and N. H. and R. Leaman and R. Islamaj and H. Kuivaniemi and G. Tromp and B. Fluharty and S. Richardson and C. Eastwood and M. Diller and B. Xu and A. V. Pankajam and D. Osumi-Sutherland and Z. Lu and R. H. Scheuermann. Cell phenotypes in the biomedical literature: a systematic analysis and text mining corpus. bioRxiv. 2026

work page 2026
[32]

Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics , pages=

brat: a Web-based Tool for NLP-Assisted Text Annotation , author=. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics , pages=

work page
[33]

Scientific Data , year =

The Cell Ontology in the age of single-cell omics , author =. Scientific Data , year =

work page
[34]

Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT -based Models

Tutubalina, Elena and Kadurin, Artur and Miftahutdinov, Zulfat. Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT -based Models. Proceedings of the 28th International Conference on Computational Linguistics. 2020. doi:10.18653/v1/2020.coling-main.588

work page doi:10.18653/v1/2020.coling-main.588 2020
[35]

Learning from Disagreement: A Survey , volume =

Uma, Alexandra and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo , year =. Learning from Disagreement: A Survey , volume =. Journal of Artificial Intelligence Research , doi =

work page
[36]

Genetics , volume=

Mondo: integrating disease terminology across communities , author=. Genetics , volume=. 2026 , publisher=

work page 2026
[37]

Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel , booktitle =. 2019

work page 2019
[38]

In: Linzen, T., Chrupała, G., Alishahi, A

Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. GLUE : A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop B lackbox NLP : Analyzing and Interpreting Neural Networks for NLP. 2018. doi:10.18653/v1/W18-5446

work page doi:10.18653/v1/w18-5446 2018
[39]

and Kao, Hung-Yu and Lu, Zhiyong

Wei, Chih-Hsuan and Harris, Bethany R. and Kao, Hung-Yu and Lu, Zhiyong. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013

work page 2013
[40]

and Li, Jiao and Wiegers, Thomas C

Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J. and Li, Jiao and Wiegers, Thomas C. and Lu, Zhiyong , title = ". Database (Oxford) , volume =. 2016 , doi =

work page 2016
[41]

Nucleic Acids Research , volume =

Wei, Chih-Hsuan and Allot, Alexis and Lai, Po-Ting and Leaman, Robert and Tian, Shubo and Luo, Ling and Jin, Qiao and Wang, Zhizheng and Chen, Qingyu and Lu, Zhiyong , title =. Nucleic Acids Research , volume =. 2024 , month =. doi:10.1093/nar/gkae235 , url =

work page doi:10.1093/nar/gkae235 2024
[42]

Vasilevsky and N. A. and N. A. Matentzoglu and S. Toro and J. E. Flack and H. Hegde and D. R. Unni and G. F. Alyea and J. S. Amberger and L. Babb and J. P. Balhoff and T. I. Bingaman and G. A. Burns and O. J. Buske and T. J. Callahan and L. C. Carmody and P. C. Cordo and L. E. Chan and G. S. Chang and S. L. Christiaens and L. C. Daugherty and M. Dumontier...

work page arXiv 2022
[43]

Vladika and J. and P. Schneider and F. Matthes. MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering, Bangkok, Thailand, Association for Computational Linguistics. 2024

work page 2024
[44]

Wang and L. and N. Yang and F. Wei. Query2doc: Query Expansion with Large Language Models, Singapore, Association for Computational Linguistics. 2023

work page 2023
[45]

Wang and Q. and S. A. S and L. Almeida and S. Ananiadou and Y. I. Balderas-Martinez and R. Batista-Navarro and D. Campos and L. Chilton and H. J. Chou and G. Contreras and L. Cooper and H. J. Dai and B. Ferrell and J. Fluck and S. Gama-Castro and N. George and G. Gkoutos and A. K. Irin and L. J. Jensen and S. Jimenez and T. R. Jue and I. Keseler and S. Ma...

work page 2016
[46]

Welbl and J. and P. Stenetorp and S. Riedel. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics. 2018

work page 2018
[47]

Wiegers and T. C. and A. P. Davis and C. J. Mattingly. Collaborative biocuration--text-mining development task for document prioritization for curation. Database (Oxford) 2012: bas037. 2012

work page 2012
[48]

Yang and Z. and P. Qi and S. Zhang and Y. Bengio and W. Cohen and R. Salakhutdinov and C. D. Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, Brussels, Belgium, Association for Computational Linguistics. 2018

work page 2018
[49]

Yao and S. and J. Zhao and D. Yu and N. Du and I. Shafran and K. R. Narasimhan and Y. Cao. React: Synergizing reasoning and acting in language models. The eleventh international conference on learning representations. 2022

work page 2022
[50]

Yeh and A. and A. Morgan and M. Colosimo and L. Hirschman. BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics 6 Suppl 1(Suppl 1): S2. 2005

work page 2005
[51]

Yuelyu Ji and H. Z. and Shiven Verma and Hui Ji and Chun Li and Yushui Han and Yanshan Wang. DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artifici...

work page 2025
[52]

Zhu and M. and A. Ahuja and D.-C. Juan and W. Wei and C. K. Reddy. Question Answering with Long Multiple-Span Answers, Online, Association for Computational Linguistics. 2020

work page 2020
[53]

Nucleic acids research , volume=

PubTator: a web-based text mining tool for assisting biocuration , author=. Nucleic acids research , volume=. 2013 , publisher=

work page 2013
[54]

Krallinger and M. and O. Rabal and S. A. Akhondi and M. P. Pérez and J. Santamaría and G. P. Rodríguez and G. Tsatsaronis and A. Intxaurrondo and J. A. B. López and U. K. Nandal and E. M. v. Buel and A. Chandrasekhar and M. Rodenburg and A. Lægreid and M. A. Doornenbal and J. Oyarzábal and A. Lourenço and A. Valencia. Overview of the BioCreative VI chemic...

work page 2017
[55]

Abacha and A. B. and E. Agichtein and Y. Pinter and D. Demner-Fushman. Overview of the Medical Question Answering Task at TREC 2017 LiveQA. Text Retrieval Conference. 2017

work page 2017
[56]

Adams and L. and F. Busch and T. Han and J.-B. Excoffier and M. Ortala and A. Löser and H. J. W. L. Aerts and J. N. Kather and D. Truhn and K. Bressem. LongHealth: A Question Answering Benchmark with Long Clinical Documents. Journal of Healthcare Informatics Research. 2025

work page 2025
[57]

Alliheedi, A. B. a. M. Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI). 2025

work page 2025
[58]

Arighi and C. N. and Z. Lu and M. Krallinger and K. B. Cohen and W. J. Wilbur and A. Valencia and L. Hirschman and C. H. Wu. Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8(Suppl 8): S1. 2011

work page 2011
[59]

Arighi and C. N. and P. M. Roberts and S. Agarwal and S. Bhattacharya and G. Cesareni and A. Chatr-Aryamontri and S. Clematide and P. Gaudet and M. G. Giglio and I. Harrow and E. Huala and M. Krallinger and U. Leser and D. Li and F. Liu and Z. Lu and L. J. Maltais and N. Okazaki and L. Perfetto and F. Rinaldi and R. Saetre and D. Salgado and P. Srinivasan...

work page 2011
[60]

Arighi and C. N. and C. H. Wu and K. B. Cohen and L. Hirschman and M. Krallinger and A. Valencia and Z. Lu and J. W. Wilbur and T. C. Wiegers. BioCreative-IV virtual issue. Database (Oxford) 2014. 2014

work page 2014
[61]

Asai and A. and J. He and R. Shao and W. Shi and A. Singh and J. C. Chang and K. Lo and L. Soldaini and S. Feldman and M. D'Arcy and D. Wadden and M. Latzke and J. Sparks and J. D. Hwang and V. Kishore and M. Tian and P. Ji and S. Liu and H. Tong and B. Wu and Y. Xiong and L. Zettlemoyer and G. Neubig and D. S. Weld and D. Downey and W. T. Yih and P. W. K...

work page 2026
[62]

Ben Abacha and A. and Y. Mrabet and Y. Zhang and C. Shivade and C. Langlotz and D. Demner-Fushman. Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain, Online, Association for Computational Linguistics. 2021

work page 2021
[63]

Ben Abacha and A. and C. Shivade and D. Demner-Fushman. Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering, Florence, Italy, Association for Computational Linguistics. 2019

work page 2019
[64]

Biomedical ontologies in action: role in knowledge management, data integration and decision support

Bodenreider, O. Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform: 67–79. 2008

work page 2008
[65]

Bommasani and R. and D. A. Hudson and E. Adeli and R. Altman and S. Arora and S. von Arx and M. S. Bernstein and J. Bohg and A. Bosselut and E. Brunskill. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021. arXiv:2108.07258

work page internal anchor Pith review Pith/arXiv arXiv 2021
[66]

Chatr-Aryamontri and A. and L. Hirschman and K. E. Ross and R. Oughtred and M. Krallinger and K. Dolinski and M. Tyers and T. Korves and C. N. Arighi. Overview of the COVID-19 text mining tool interactive demonstration track in BioCreative VII. Database (Oxford) 2022. 2022

work page 2022
[67]

Chen and Q. and A. Allot and R. Leaman and R. Islamaj and J. Du and L. Fang and K. Wang and S. Xu and Y. Zhang and P. Bagherzadeh and S. Bergler and A. Bhatnagar and N. Bhavsar and Y. C. Chang and S. J. Lin and W. Tang and H. Zhang and I. Tavchioski and S. Pollak and S. Tian and J. Zhang and Y. Otmakhova and A. J. Yepes and H. Dong and H. Wu and R. Dufour...

work page 2022
[68]
