
arxiv: 2605.20537 · v1 · pith:IZOJEFBNnew · submitted 2026-05-19 · 💻 cs.CL

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Pith reviewed 2026-05-21 06:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords biomedical NER · entity linking · corpus diagnostics · benchmark analysis · train-test overlap · terminology coverage · named entity recognition · generalization demands

The pith

Biomedical NER and entity linking corpora differ substantially in properties even for similar tasks, rendering common statistics insufficient to show what benchmarks actually evaluate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a diagnostic framework that pulls standardized statistics directly from corpus annotations, concept links, train-test splits, metadata, and terminology mappings. These statistics fall into five families that together reveal how much each corpus tests specific evaluation signals, how much generalization it demands, how much train-test reuse it allows, and which parts of the biomedical literature and concept space it covers. Applying the framework to nine existing corpora for diseases, chemicals, and cell types shows that corpora addressing the same apparent task can still differ markedly in all these respects. A sympathetic reader would care because model developers often choose corpora or interpret benchmark scores without knowing these hidden differences, which can mask transfer risks or limit the scope of conclusions drawn from results.

Core claim

We present a corpus-centric framework that organizes standardized statistics into five families (scale, density and label distribution; lexical and conceptual structure; train-test overlap; metadata composition; and terminology coverage) to diagnose benchmark-relevant properties directly from annotations, concept links, splits, document metadata, and terminology mappings. When applied to nine corpora spanning diseases, chemicals, and cell types, the framework shows that corpus properties differ substantially even for the same apparent task, producing distinct evaluation signals, generalization demands, degrees of train-test reuse, and coverage of literature and concept space. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate.

What carries the argument

The corpus-centric diagnostic framework, which computes and compares statistics across five families drawn from annotations, links, splits, metadata, and mappings to expose differences in evaluation signal and generalization demands.
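
As a concrete illustration of the first family (scale, density and label distribution), the following is a minimal sketch of how such statistics might be computed from a list of annotated mentions. The `Mention` schema, the function name `scale_and_density`, and the toy data are all hypothetical and are not taken from the paper's released code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated entity mention (hypothetical schema, not the paper's)."""
    doc_id: str
    text: str
    label: str


def scale_and_density(mentions, doc_token_counts):
    """Return basic scale, density, and label-distribution statistics.

    mentions: iterable of Mention
    doc_token_counts: dict mapping doc_id -> number of tokens in that document
    """
    mentions = list(mentions)
    n_docs = len(doc_token_counts)
    n_tokens = sum(doc_token_counts.values())
    label_counts = Counter(m.label for m in mentions)
    return {
        "documents": n_docs,
        "tokens": n_tokens,
        "mentions": len(mentions),
        "mentions_per_document": len(mentions) / n_docs if n_docs else 0.0,
        "mentions_per_1k_tokens": 1000 * len(mentions) / n_tokens if n_tokens else 0.0,
        # Share of all mentions carrying each label.
        "label_distribution": {
            label: count / len(mentions) for label, count in label_counts.items()
        },
    }


# Toy corpus: two documents, four mentions (illustrative values only).
toy_mentions = [
    Mention("d1", "breast cancer", "Disease"),
    Mention("d1", "tamoxifen", "Chemical"),
    Mention("d2", "breast cancer", "Disease"),
    Mention("d2", "asthma", "Disease"),
]
toy_stats = scale_and_density(toy_mentions, {"d1": 200, "d2": 300})
```

Reporting density per 1,000 tokens alongside raw counts is one way to compare corpora of very different sizes; the paper may normalize differently.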

If this is right

  • Benchmark results can be interpreted with explicit awareness of the distinct regions of literature and concept space each corpus represents.
  • Transfer risks between corpora become identifiable before model training or evaluation.
  • Corpus selection can move beyond surface descriptors such as size and entity type toward matching specific generalization requirements.
  • The open-source implementation and dashboard enable direct reproduction and extension to new corpora.
  • Reporting practices can incorporate these diagnostics to clarify what a given benchmark actually measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic approach could be applied to related biomedical tasks such as relation extraction to reveal comparable hidden differences.
  • Models that appear strong on one corpus may fail on another primarily because of unmeasured differences in lexical structure or metadata composition.
  • Standard practice in the field may need to evolve toward routine publication of these five-family statistics alongside benchmark scores.
  • Interactive dashboards built on the framework could become a standard way to document corpus suitability for specific research questions.

Load-bearing premise

That the five families of statistics together capture enough information to diagnose the evaluation signal, generalization demands, and transfer risks of each corpus.

What would settle it

Re-running the framework on the same nine corpora and finding that all differences in the five statistic families are negligible or already predicted by simple size and entity-type counts.

Figures

Figures reproduced from arXiv: 2605.20537 by Rezarta Islamaj, Robert Leaman, Zhiyong Lu.

Figure 1: Corpus diagnostic framework. Entity-annotated corpora are converted into a common representation, …
Figure 2: Train-test overlap across nine biomedical corpora. All values are Jaccard similarity (%) between training …
Figure 3: Terminology-based coverage analysis. Left panels: distribution normalized within each corpus’s own …
Original abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
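
The Figure 2 caption reports train-test overlap as Jaccard similarity (%). A minimal sketch of that measure follows, assuming it is computed over sets of items drawn from each split. Which items are compared (surface strings, concept identifiers, or something else), the `normalize` step, and the placeholder concept IDs are assumptions for illustration, not details confirmed by the source.

```python
def jaccard_percent(train_items, test_items):
    """Jaccard similarity between two collections, as a percentage."""
    train_set, test_set = set(train_items), set(test_items)
    union = train_set | test_set
    if not union:
        return 0.0
    return 100.0 * len(train_set & test_set) / len(union)


def normalize(mention):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(mention.lower().split())


# Surface-level overlap on toy mention strings.
train_mentions = ["Breast cancer", "asthma", "Type 2 diabetes"]
test_mentions = ["breast  cancer", "asthma", "melanoma", "glioma"]
surface_overlap = jaccard_percent(
    map(normalize, train_mentions), map(normalize, test_mentions)
)

# Concept-level overlap on placeholder identifiers (not real terminology IDs).
train_concepts = ["C:1", "C:2", "C:3"]
test_concepts = ["C:1", "C:2", "C:3", "C:4"]
concept_overlap = jaccard_percent(train_concepts, test_concepts)
```

In this toy data the concept-level overlap (75%) is higher than the surface-level overlap (40%), which is the pattern one would expect when test mentions are unseen synonyms of concepts that do appear in training.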

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents a corpus-centric diagnostic framework for biomedical NER and EL benchmarks. The framework computes standardized statistics across five families—scale, density and label distribution; lexical and conceptual structure; train-test overlap; metadata composition; and terminology coverage—from annotations, concept links, splits, metadata, and terminology mappings. When applied to nine corpora covering diseases, chemicals, and cell types, the analysis shows that corpora addressing similar tasks can differ substantially in evaluation signal, generalization demands, train-test reuse, and coverage of biomedical literature and concept space. The authors conclude that commonly reported statistics are insufficient to characterize these benchmarks and release open-source code and an interactive dashboard for reproducibility and extension.

Significance. This work is significant for the field of biomedical NLP as it provides a practical, descriptive tool to better interpret what existing benchmarks actually measure and to identify potential transfer risks when using them. The direct, parameter-free computation of statistics from the corpora themselves, the broad application to nine diverse resources, and the public release of code and dashboard are notable strengths that enhance reproducibility and utility. If the differences identified are as pronounced as reported, the framework could help researchers select more appropriate corpora and interpret benchmarking results with greater nuance.

minor comments (2)
  1. [§3.2] §3.2: The precise formulas for the lexical and conceptual structure statistics (e.g., type-token ratios and concept overlap measures) are described in prose; adding explicit equations would improve precision and ease of re-implementation.
  2. [Table 2] Table 2: The train-test overlap column reports percentages but does not indicate whether the splits are document-level or mention-level; clarifying this distinction would strengthen the interpretation of reuse risks.
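
To make the first minor comment concrete, here is a hedged sketch of two lexical and conceptual structure statistics of the kind the referee mentions: a type-token ratio over mention strings and the mean number of distinct surface forms per concept. These are common textbook definitions, not the paper's own formulas, and the function names and placeholder concept IDs are hypothetical.

```python
from collections import defaultdict


def type_token_ratio(items):
    """Distinct items divided by total items (0.0 for an empty input)."""
    items = list(items)
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def mean_forms_per_concept(pairs):
    """Average number of distinct lowercase surface forms linked to each concept.

    pairs: iterable of (mention_text, concept_id)
    """
    forms = defaultdict(set)
    for text, concept_id in pairs:
        forms[concept_id].add(text.lower())
    if not forms:
        return 0.0
    return sum(len(s) for s in forms.values()) / len(forms)


# Toy data with placeholder concept IDs (illustrative only).
mentions = ["aspirin", "Aspirin", "ibuprofen", "aspirin", "warfarin"]
ttr = type_token_ratio(m.lower() for m in mentions)

linked = [
    ("aspirin", "C:1"),
    ("acetylsalicylic acid", "C:1"),
    ("ibuprofen", "C:2"),
]
forms_per_concept = mean_forms_per_concept(linked)
```

A low type-token ratio suggests that a few mention strings dominate the corpus, while a high forms-per-concept value suggests that linking requires resolving many synonyms.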

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, for recognizing its significance to biomedical NLP, and for recommending minor revision. The referee's description accurately reflects the corpus-centric diagnostic framework, its application to nine corpora, and the release of code and dashboard. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper defines a descriptive diagnostic framework consisting of five families of standardized statistics computed directly from corpus annotations, concept links, train-test splits, metadata, and terminology mappings. These are applied empirically to nine existing corpora to identify differences in scale, structure, overlap, composition, and coverage. No derivations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the argument; the central claims rest on explicit computation and released code rather than any reduction to inputs by construction. The framework functions as a practical analysis tool without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that the chosen statistics capture benchmark-relevant properties; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • Domain assumption: standard statistical measures of scale, overlap, and terminology coverage meaningfully diagnose what a benchmark evaluates.
    Invoked when the five families are presented as the diagnostic lens.


