Recognition: no theorem link
Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods
Pith reviewed 2026-05-15 10:59 UTC · model grok-4.3
The pith
An LLM-based framework extracts breast cancer phenotypes from clinical notes with accuracy comparable to classical ontology methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an LLM-based information extraction framework can be easily adapted to extract phenotypes with accuracy comparable to classical ontology-based methods, demonstrated on breast cancer phenotypes from provider notes; the authors add that the trained models can then be fine-tuned to handle other cancer types and diseases.
What carries the argument
The LLM-based information extraction framework that processes unstructured provider notes to identify and extract medical phenotypes, directly benchmarked against the NCIt Ontology Annotator.
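The paper does not publish its prompt or model, so the sketch below is purely illustrative: a minimal prompt template and response parser of the kind such a framework would need, with hypothetical field names (`biomarkers`, `tumor_location`, `tumor_size`, `treatment_outcome`) standing in for the authors' actual phenotype schema, and a mocked reply standing in for the LLM call.

```python
import json

# Hypothetical sketch -- the template, field names, and parsing below are
# assumptions for illustration, not the authors' implementation.
PROMPT_TEMPLATE = (
    "Extract breast cancer phenotypes from the provider note below.\n"
    "Return JSON with keys: biomarkers, tumor_location, tumor_size,\n"
    "treatment_outcome. Use null for anything not mentioned.\n\n"
    "Note:\n{note}"
)

def build_prompt(note: str) -> str:
    """Fill the extraction prompt with one provider note."""
    return PROMPT_TEMPLATE.format(note=note)

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, keeping only the expected fields."""
    expected = {"biomarkers", "tumor_location", "tumor_size", "treatment_outcome"}
    data = json.loads(raw)
    return {k: data.get(k) for k in expected}

# A mocked model reply, standing in for the actual LLM call:
reply = ('{"biomarkers": ["ER+", "HER2-"], "tumor_location": "left breast", '
         '"tumor_size": "2.1 cm", "treatment_outcome": null}')
phenotypes = parse_response(reply)
```

The appeal of this shape is that adapting to a new disease means editing the prompt and field list rather than engineering a new ontology pipeline, which is exactly the adaptability the paper claims.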
If this is right
- Phenotype data trapped in free-text notes becomes extractable at scale without building separate ontologies for each disease.
- Fine-tuning a trained LLM allows quick extension to additional cancer types with limited new engineering.
- Larger clinical studies can incorporate more treatment outcome and biomarker details from natural language records.
- EMR systems gain a practical alternative to rule-based annotation for ongoing phenotype capture.
Where Pith is reading between the lines
- LLMs could reduce dependence on manually curated medical ontologies for information extraction tasks.
- Integration into live EMR workflows might enable real-time phenotype tracking to support treatment decisions.
- Performance on varied note styles across institutions would need separate validation to confirm broad applicability.
Load-bearing premise
The evaluation dataset of clinical notes fairly represents real-world variability so that accuracy comparisons between the LLM framework and ontology methods are reliable.
What would settle it
A test on a new collection of diverse clinical notes from multiple hospitals where the LLM accuracy falls substantially below the ontology-based method would falsify the comparability result.
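One way to operationalize that test is a paired bootstrap confidence interval on the per-note accuracy difference between the two systems on a fresh multi-hospital note set. The sketch below uses synthetic 0/1 correctness vectors as placeholders, not data from the paper.

```python
import random

# Synthetic placeholders: 1 = system extracted the note's phenotypes
# correctly, 0 = it did not. Both vectors are aligned on the same notes.
random.seed(0)
llm_correct = [1] * 78 + [0] * 22    # hypothetical: LLM right on 78/100 notes
onto_correct = [1] * 80 + [0] * 20   # hypothetical: ontology right on 80/100

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05):
    """CI for mean(a) - mean(b), resampling the same note indices
    for both systems (paired bootstrap)."""
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] for i in idx) / n - sum(b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_diff_ci(llm_correct, onto_correct)
# If the entire interval sits well below zero on new data,
# the comparability claim is falsified.
```

Pairing matters here: both systems must be scored on the identical notes, otherwise between-note variability swamps the between-system difference.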
Figures
- Fig. 2: Distribution of token lengths (in tokens) of the breast cancer notes (Appendix A: Token Length of Breast Cancer Notes).
read the original abstract
A significant amount of data held in Oncology Electronic Medical Records (EMRs) is contained in unstructured provider notes -- including but not limited to the chemotherapy (or cancer treatment) outcome, different biomarkers, the tumor's location, sizes, and growth patterns of a patient. The clinical studies show that the majority of oncologists are comfortable providing these valuable insights in their notes in a natural language rather than the relevant structured fields of an EMR. The major contribution of this research is to report an LLM-based framework to process provider notes and extract valuable medical knowledge and phenotype mentioned above, with a focus on the domain of oncology. In this paper, we focus on extracting phenotypes related to breast cancer using our LLM framework, and then compare its performance with earlier works that used knowledge-driven annotation system, paired with the NCIt Ontology Annotator. The results of the study show that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods. However, once trained, they could be easily fine-tuned to cater for other cancer types and diseases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an LLM-based information extraction framework for extracting breast cancer phenotypes (e.g., chemotherapy outcomes, biomarkers, tumor location and size) from unstructured oncology EMR provider notes. It compares this approach to a classical knowledge-driven method using the NCIt Ontology Annotator and claims that the LLM framework achieves comparable accuracy while being easily adaptable to other cancer types after initial training.
Significance. If the empirical comparison is supported by rigorous quantitative evaluation on identical data, the result would be significant for clinical NLP: it would demonstrate that LLMs can serve as a flexible, low-engineering alternative to ontology-based phenotype extraction, with potential for rapid domain adaptation across diseases.
major comments (2)
- [Abstract] The central claim that the LLM framework extracts phenotypes 'with an accuracy that is comparable' to the NCIt Ontology Annotator supplies no precision, recall, F1, or other metrics; no sample size (number of notes or phenotypes); no description of gold-standard creation; and no statement that both systems were evaluated on the same held-out notes using identical phenotype definitions. This renders the comparability assertion unverifiable.
- [Methods/Results] The manuscript does not specify the LLM model, prompting strategy, fine-tuning details, or evaluation protocol (e.g., inter-annotator agreement, exact phenotype list, train/test split), all of which are load-bearing for reproducing the adaptation claim and for confirming that the comparison is fair.
minor comments (2)
- [Abstract] The phrasing 'extract valuable medical knowledge and phenotype mentioned above' is grammatically awkward, and the referent of 'mentioned above' is unclear.
- [Results] The manuscript would benefit from an explicit table listing the exact phenotypes extracted and the corresponding performance numbers for both methods.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important gaps in reporting that we have addressed through revisions to improve verifiability and reproducibility.
read point-by-point responses
- Referee: [Abstract] The central claim that the LLM framework extracts phenotypes 'with an accuracy that is comparable' to the NCIt Ontology Annotator supplies no precision, recall, F1, or other metrics, no sample size (number of notes or phenotypes), no description of gold-standard creation, and no statement that both systems were evaluated on the same held-out notes using identical phenotype definitions. This renders the comparability assertion unverifiable.
  Authors: We agree that the original abstract did not supply sufficient quantitative detail to support the comparability claim. The revised abstract now reports precision, recall, and F1 scores for both approaches, the total sample size (notes and phenotypes), the gold-standard creation process (including annotator qualifications and agreement), and an explicit statement that both the LLM framework and the NCIt annotator were evaluated on the identical held-out test set using the same phenotype definitions. revision: yes
- Referee: [Methods/Results] The manuscript does not specify the LLM model, prompting strategy, fine-tuning details, or evaluation protocol (e.g., inter-annotator agreement, exact phenotype list, train/test split), all of which are load-bearing for reproducing the adaptation claim and for confirming that the comparison is fair.
  Authors: We agree that these implementation and evaluation details were missing from the original submission and are necessary for assessing reproducibility and fairness of the comparison. The revised manuscript now specifies the LLM model, prompting strategy, whether fine-tuning was used, the inter-annotator agreement for the gold standard, the complete list of phenotypes, and the train/test split. These additions allow readers to reproduce the experiments and confirm that the two systems were compared under identical conditions. revision: yes
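The metrics the revision promises are straightforward to compute over extracted phenotype mentions scored against a shared gold standard. The mention sets below are invented examples for illustration, not results from the paper.

```python
# Precision, recall, and F1 over (note_id, phenotype) mention pairs,
# scored against a single shared gold standard. All data here is invented.
def prf1(predicted: set, gold: set):
    """Return (precision, recall, F1) for a set of predicted mentions."""
    tp = len(predicted & gold)  # true positives: mentions in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("note1", "ER+"), ("note1", "tumor 2 cm"), ("note2", "HER2-")}
llm_pred = {("note1", "ER+"), ("note2", "HER2-"), ("note2", "PR+")}
onto_pred = {("note1", "ER+"), ("note1", "tumor 2 cm")}

llm_scores = prf1(llm_pred, gold)    # precision 2/3, recall 2/3
onto_scores = prf1(onto_pred, gold)  # precision 1.0, recall 2/3
```

Scoring both systems with the same `gold` set and the same mention representation is what makes the referee's "identical conditions" requirement checkable.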
Circularity Check
No significant circularity in empirical comparison study
full rationale
The paper is an empirical comparison of an LLM-based information extraction framework against classical ontology-based methods (NCIt annotator) for breast cancer phenotypes from clinical notes. It contains no equations, parameter fittings, derivations, or self-citations that reduce the central comparability claim to inputs by construction. The reported accuracy comparability is presented as an experimental outcome rather than a self-definitional or fitted-input result. No load-bearing premises rely on prior author work for uniqueness or ansatz smuggling. The study is self-contained as a direct experimental evaluation without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Clinical notes contain extractable phenotype information in natural language.
Reference graph
Works this paper leans on
- [1] Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
- [2] Dunn, A., Dagdelen, J., Walker, N., Lee, S., Rosen, A.S., Ceder, G., Persson, K., Jain, A.: Structured information extraction from complex scientific text with fine-tuned large language models. arXiv preprint arXiv:2212.05238 (2022)
- [3] D'Souza, J., Ng, V.: Sieve-based entity linking for the biomedical domain. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 297–302 (2015)
- [4] Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Oberthaler, J., Parsia, B.: The National Cancer Institute's thesaurus and ontology. Journal of Web Semantics 1(1), 75–80 (2003)
- [5] Huang, J., Yang, D.M., Rong, R., Nezafati, K., Treager, C., Chi, Z., Wang, S., Cheng, X., Guo, Y., Klesse, L.J., et al.: A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digital Medicine 7(1), 106 (2024)
- [6] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
- [7] Lin, F.P., Groza, T., Kocbek, S., Antezana, E., Epstein, R.J.: Cancer care treatment outcome ontology: a novel computable ontology for profiling treatment outcomes in patients with solid tumors. JCO Clinical Cancer Informatics 2, 1–14 (2018)
- [8] Simmons, A., Takkavatakarn, K., McDougal, M., Dilcher, B., Pincavitch, J., Meadows, L., Kauffman, J., Klang, E., Wig, R., Smith, G., et al.: Benchmarking large language models for extraction of International Classification of Diseases codes from clinical documentation. medRxiv pp. 2024–04 (2024)
- [9] Wang, H., Zheng, J.G., Ma, X., Fox, P., Ji, H.: Language and domain independent entity linking with quantified collective validation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704 (2015)
- [10] Wornow, M., Lozano, A., Dash, D., Jindal, J., Mahaffey, K.W., Shah, N.H.: Zero-shot clinical trial patient matching with LLMs. arXiv preprint arXiv:2402.05125 (2024)
- [11] Zheng, J.G., Howsmon, D., Zhang, B., Hahn, J., McGuinness, D., Hendler, J., Ji, H.: Entity linking for biomedical literature. BMC Medical Informatics and Decision Making 15, 1–9 (2015)