Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

Ahmed Abdeen Hamed; Luis M. Rocha

arxiv: 2605.30400 · v1 · pith:EGRY2VJLnew · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 07:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords ChatGPT · biomedical associations · RAG · hallucination detection · ontology validation · majority voting · self-consistency · large language models

The pith

A protocol uses RAG and cross-model majority voting to let LLMs verify associations generated by ChatGPT and detect hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper outlines a step-by-step protocol for testing whether ChatGPT can produce accurate disease-related biomedical associations. It starts by generating associations, then checks the biological entities against standard ontologies, and confirms the links against published literature. When exact ontology matches fail, the protocol switches to a retrieval-augmented generation setup where open-source LLMs read relevant papers and vote on whether the association holds. A self-consistency check across different ChatGPT versions measures how reliably the model produces the same associations. This setup turns LLMs into verifiers of other LLMs' output, providing a way to surface hallucinations without relying solely on human review.

Core claim

The protocol enables LLMs to establish truth over content generated by other LLMs and expose hallucination through a RAG-enabled, cross-model majority voting workflow.

What carries the argument

RAG-enabled cross-model majority voting workflow that performs semantic verification when ontology exact matching fails.

If this is right

Associations generated by ChatGPT can be validated for entity correctness using biomedical ontologies.
Self-consistency across ChatGPT models provides a measure of generative reliability.
Semantic verification via open-source LLMs can supplement ontology matching to check literature support.
The workflow exposes cases where ChatGPT hallucinates associations not supported by evidence.
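
The paper does not say how self-consistency is scored, so the following is only a minimal sketch of one possible measure: the mean pairwise Jaccard overlap between the association sets produced by different model versions. The function names, the model labels, and the example associations are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections; treated as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def self_consistency(runs):
    """Mean pairwise Jaccard overlap across runs keyed by model name."""
    pairs = list(combinations(sorted(runs), 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(runs[m], runs[n]) for m, n in pairs) / len(pairs)


# Hypothetical (disease, symptom) associations from three model versions.
runs = {
    "model-a": {("influenza", "fever"), ("influenza", "cough")},
    "model-b": {("influenza", "fever"), ("influenza", "cough")},
    "model-c": {("influenza", "fever"), ("influenza", "rash")},
}
score = self_consistency(runs)
```

A score near 1 would indicate that the versions largely agree; any threshold for "reliable" would have to be chosen and justified separately.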

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the open-source LLMs in the RAG step introduce their own biases, the verification could systematically miss hallucinations or flag supported associations as hallucinated.
This approach could be extended to verify outputs from other generative models beyond ChatGPT in different scientific domains.
Controlled experiments with known true and false associations would be needed to calibrate the voting threshold for reliable truth establishment.

Load-bearing premise

That open-source LLMs used in the RAG component can reliably perform semantic verification of associations when ontology exact matching fails, without introducing their own systematic errors or biases.

What would settle it

A test set of known true and known false biomedical associations where the majority vote from the RAG LLMs consistently disagrees with the ground truth labels.
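
As a rough illustration of such a test, the sketch below compares a strict-majority vote against ground-truth labels and tallies the confusion counts. The strict-majority rule, the boolean verdict encoding, and the four labelled examples are assumptions made for illustration; they are not taken from the paper.

```python
def vote(verdicts):
    """True when a strict majority of model verdicts support the association."""
    return sum(verdicts) * 2 > len(verdicts)


def confusion(labelled):
    """Tally majority-vote outcomes against ground-truth labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for verdicts, truth in labelled:
        predicted = vote(verdicts)
        if predicted and truth:
            counts["tp"] += 1
        elif predicted and not truth:
            counts["fp"] += 1
        elif not predicted and not truth:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical test set: (per-model verdicts, ground-truth label).
labelled = [
    ([True, True, False], True),    # vote agrees with a true association
    ([False, False, True], False),  # vote agrees with a false association
    ([True, True, True], False),    # vote accepts a false association
    ([False, True, False], True),   # vote rejects a true association
]
counts = confusion(labelled)
accuracy = (counts["tp"] + counts["tn"]) / len(labelled)
```

Consistently low accuracy, or a false-positive count that dominates, on a labelled set of this kind would count against the load-bearing premise.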

Original abstract

We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models. To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A methods paper describing a RAG-plus-voting protocol for checking ChatGPT biomedical outputs, with no results or tests included.

The paper's core contribution is a step-by-step protocol for generating disease associations with ChatGPT, validating entities against ontologies, and then using RAG with open-source LLMs plus cross-model majority voting to verify associations when exact matches fail. It also includes self-consistency checks across ChatGPT variants.

This workflow is a reasonable extension of existing RAG and ensemble ideas to the specific problem of biomedical hallucination detection. The authors correctly flag the ontology exact-match bottleneck and propose semantic verification as a workaround, which is a practical move. The description stays focused on the sequence of steps without overclaiming prior results.

The obvious gap is the complete absence of any implementation details, pilot runs, error analysis, or comparison against simpler baselines. We have no data on whether the open-source models used for verification add their own systematic mistakes or whether the voting step actually improves reliability over single-model checks. The claim that the setup "enables LLMs to establish truth" remains a design intention rather than something shown.

The paper is aimed at groups already running LLM pipelines for biomedical knowledge extraction who need a documented evaluation template. It could serve as a starting point for someone building their own verification layer, but only if the full methods section supplies the missing code-level specifics.

I would send it to peer review. The protocol is concrete enough that referees can usefully comment on the workflow choices and request validation experiments, even though the current version is thin on evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a protocol for evaluating ChatGPT's ability to generate disease-centric biomedical associations. It describes association generation, entity validation via biomedical ontologies, literature-based verification, a self-consistency strategy across ChatGPT models, and a RAG-enabled semantic verification workflow powered by open-source LLMs with cross-model majority voting to address ontology exact-match failures and expose hallucinations.

Significance. If the protocol can be implemented reproducibly and shown to perform as described, it would offer a structured, multi-layered approach to assessing LLM reliability in biomedical knowledge generation. The combination of ontology validation, literature checks, self-consistency, and RAG-based semantic verification addresses hallucination concerns in a domain where factual accuracy is critical.

major comments (2)

[Abstract] The claim that the RAG-enabled, cross-model majority voting workflow 'enables LLMs to establish truth over content generated by other LLMs and expose hallucination' is presented as an outcome of the protocol design, yet the manuscript provides no empirical results, error analysis, or discussion of failure modes to support this assertion.
[Use case (RAG-enabled workflow)] Use-case description of the RAG workflow: The protocol assumes open-source LLMs can reliably perform semantic verification of associations when ontology exact matching fails without introducing their own systematic errors or biases, but no safeguards, prompt details, retrieval corpus specification, or bias-mitigation steps are described, which is load-bearing for the workflow's validity.

minor comments (2)

The self-consistency strategy across ChatGPT models is mentioned but lacks concrete details on how consistency is quantified or how disagreements are resolved.
The manuscript would benefit from explicit pseudocode or a numbered step-by-step outline of the full workflow to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our protocol manuscript. We address each major comment below.

Referee: [Abstract] The claim that the RAG-enabled, cross-model majority voting workflow 'enables LLMs to establish truth over content generated by other LLMs and expose hallucination' is presented as an outcome of the protocol design, yet the manuscript provides no empirical results, error analysis, or discussion of failure modes to support this assertion.

Authors: We agree that the manuscript describes a protocol and does not present empirical results or error analysis. The abstract phrasing was meant to outline the intended purpose of the workflow design. We will revise the abstract to state that the protocol is designed to enable such verification and hallucination exposure, rather than claiming it as a demonstrated outcome. Revision: yes.

Referee: [Use case (RAG-enabled workflow)] Use-case description of the RAG workflow: The protocol assumes open-source LLMs can reliably perform semantic verification of associations when ontology exact matching fails without introducing their own systematic errors or biases, but no safeguards, prompt details, retrieval corpus specification, or bias-mitigation steps are described, which is load-bearing for the workflow's validity.

Authors: The use case is presented as an illustrative implementation. We acknowledge that additional details are needed for reproducibility. In revision, we will expand this section with example prompts, specification of the retrieval corpus (e.g., PubMed), and bias-mitigation approaches such as the cross-model majority voting already central to the protocol. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a protocol description for generating, validating, and verifying biomedical associations using LLMs, RAG, and majority voting. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction are present. No self-citations form load-bearing premises, no uniqueness theorems are invoked, and no ansatz or renaming patterns apply. The workflow is externally described without internal self-reference that would make claims equivalent to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is a workflow description rather than a parameterized model or theoretical derivation.

Reference graph

Works this paper leans on

51 extracted references · 3 canonical work pages

[1]

Zhang, K., Huang, T., Malin, B.A., Osterman, T., Long, Q., and Jiang, X. (2025). Introducing mcodegpt as a zero-shot information extraction from clinical free text data tool for cancer research. Commun. Med. 5, 422

2025
[2]

Jiang, Y., Qiang, S., Li, W., and Liang, Y. (2025). Llm-diffaug: Enhancing few-shot object detection via llm-guided diffusion augmentation. Knowl. Base Syst. 326, 114066

2025
[3]

Ciampaglia, G.L., Shiralkar, P., Rocha, L.M., Bollen, J., Menczer, F., and Flammini, A. (2015). Computational fact checking from knowledge networks. PLoS One 10, e0128193

2015
[4]

Social media, political polarization, and political disinformation: A review of the scientific literature

Tucker J.A., Guess A., Barberá P., Vaccari C., Siegel A., Sanovich S., Stukal D., Nyhan B. Social media, political polarization, and political disinformation: A review of the scientific literature (March 19, 2018). SSRN electronic journal; 2018. p. 95. http://dx....

work page doi:10.2139/ssrn.3144139 2018
[5]

Flamino, J., Galeazzi, A., Feldman, S., Macy, M.W., Cross, B., Zhou, Z., Serafino, M., Bovet, A., Makse, H.A., and Szymanski, B.K. (2023). Political polarization of news media and influencers on twitter in the 2016 and 2020 us presidential elections. Nat. Hum. Behav. 7, 904–916

2023
[6]

Abdeen, M.A.R., Hamed, A.A., and Wu, X. (2021). Fighting the covid-19 infodemic in news articles and false publications: The neonet text classifier, a supervised machine learning algorithm. Appl. Sci. 11, 7265

2021
[7]

Challenging the Machinery of Generative Ai with Fact-Checking: Ontology-Driven Biological Graphs for Verifying Human Disease-Gene Links

Hamed A.A., Crimi A., Lee B.S., and Misiak M.M. Challenging the Machinery of Generative Ai with Fact-Checking: Ontology-Driven Biological Graphs for Verifying Human Disease-Gene Links. Available at SSRN 4888506. 2023, https://ssrn.com/abstract=4888506

2023
[8]

Hamed, A.A., Crimi, A., Lee, B.S., and Misiak, M.M. (2024). Fact-checking generative ai: Ontology-driven biological graphs for disease-gene link verification. In International Conference on Computational Science (Springer), pp. 130–137

2024
[9]

Hamed, A.A., Zachara-Szymanska, M., and Wu, X. (2024). Safeguarding authenticity for mitigating the harms of generative ai: Issues, research agenda, and policies for detection, fact-checking, and ethical ai. iScience 27, 108782

2024
[10]

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., et al. (2008). Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1–37

2008
[11]

Wang, Z., Zhou, X., Yang, Y., Ma, B., Wang, L., and Dong, R. (2025). Sgeu: enhancing llm reasoning via backward exemplar generation and verification. Appl. Intell. 55, 748

2025
[12]

Verspoor, K. (2024). Fighting Fire with Fire— Using Llms to Combat Llm Hallucinations. Nature 630, 569–570. https://doi.org/10.1038/d41586-024-01641-0

2024
[13]

Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., Li, Q., and Li, J. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digit. Med. 7, 41

2024
[14]

Azimi, I., Qi, M., Wang, L., Rahmani, A.M., and Li, Y. (2025). Evaluation of llms accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval. Sci. Rep. 15, 1506

2025
[15]

Hamed, A.A., Crimi, A., Misiak, M.M., and Lee, B.S. (2025). From knowledge generation to knowledge verification: examining the biomedical generative capabilities of chatgpt. iScience 28, 112492

2025
[16]

National Library of Medicine (2026)

National Center for Biotechnology Information, and U.S. National Library of Medicine (2026). Pubmed. Accessed April 3, 2026

2026
[17]

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H. (2026). The faiss library. IEEE Transactions on Big Data 12, 346–361. https://doi.org/10.1109/TBDATA.2025.3618474

work page doi:10.1109/tbdata 2026
[18]

Mucherino, A., Papajorgji, P.J., and Pardalos, P.M. (2009). K-Nearest Neighbor Classification (Springer New York), pp. 83–106. https://doi.org/10.1007/978-0-387-88615-2_4

work page doi:10.1007/978-0- 2009
[19]

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., and Stoica, I. (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, New York, NY, USA (Association for Computing ...

2023
[20]

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2020). Huggingface’s transformers: State-of-the-art natural language processing. Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45

2020
[21]

Symptom Ontology (Symp) – Ontology Lookup Service

European Bioinformatics Institute (2024). Symptom Ontology (Symp) – Ontology Lookup Service. Accessed: 2026-03-08

2024
[22]

Qwen/Qwen2.5-0.5B-Instruct

Hamed, A.A., Fandy, T.E., and Wu, X. (2024). Accelerating complex disease treatment through network medicine and genai: A case study on drug repurposing for breast cancer. In 2024 IEEE International Conference on Medical Artificial Intelligence (MedAI), pp. 354–359.

STAR Protocols 7, 104533, June 19, 2026

Methods S1: A Use-Case for Performing Semantic ...

2024
[23]

Securing an independent instance with at least 2 GPUs. While we used GCP, any cloud environment would offer an instance with such configurations and support for Jupyter Notebooks.
[24]

Install Python PyTorch for a GPU-ready environment:
[25]

Install the vLLM framework on the VM: pip install vllm
[26]

Sign up for a HuggingFace account [20], and accept the terms of use of the open-source models used.
[27]

Test if the vLLM is installed properly:
CUDA_VISIBLE_DEVICES=0,1 \
vllm serve \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--port 8000
pip install sentence-transformers
intfloat/e5-large
[28]

It is recommended to launch a terminal from within the Jupyter Notebook and launch the model of choice with the configuration as part of the executing command (i.e., the number of GPUs, the model name, the model version, and the port number running the process from a notebook):

The Semantic Matching Using Open-Source LLMs

Since the objectiv...
[29]

A new list of 10,000 generated disease-symptom associations
[30]

The list of SYMPTOM ontology Term-IDs and Terms.

Executing the Retrieval-Augmented Generation (RAG) to Produce Top-30 Candidates

The RAG process takes a single symptom term and the full list of SYMPTOM ontology terms as input. It returns the Top-K most similar ontology terms, where K is set to 30. For each symptom, the model identifi...
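
The retrieval step described above can be sketched as a similarity ranking over the ontology labels. This is a minimal illustration only: the character-trigram `embed` function is a toy stand-in for a real sentence-embedding model, and the ontology entries and their IDs are placeholders rather than actual SYMPTOM ontology terms. Only the default of K = 30 comes from the text.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query, ontology, k=30):
    """Return up to k (term_id, label, score) tuples, most similar first."""
    q = embed(query)
    scored = [(tid, label, cosine(q, embed(label))) for tid, label in ontology.items()]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]


# Placeholder ontology: Term-ID -> Term label (IDs are not real SYMP IDs).
ontology = {
    "T:001": "headache",
    "T:002": "fever",
    "T:003": "severe headache",
    "T:004": "nausea",
}
candidates = top_k("head ache", ontology, k=2)
```

In a real run the toy `embed` would be replaced by the chosen embedding model, and the ranked candidates with their scores would be passed on to the prompt.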
[31]

Installing the necessary transformers:
[32]

Selecting an embedding model that is compatible with the open-source LLMs used:
[33]

Process a symptom term as the main query parameter.
[34]

Perform the RAG search, which produces a list of top-30 similar candidate terms and their corresponding scores
[35]

Both the candidate terms and the symptom are returned as input for a zero-shot prompt to perform the semantic matching.

print(llm.generate("Hello, vLLM!")[0].outputs[0].text)
EOF

The Best Semantic Match Prompt Engineering

To be able to match each of the generated symptoms, we designed a highly ...
[36]

The ontology term generated.
[37]

The Top-K candidate terms retrieved by the Retrieval-Augmented Generation process.
[38]

A given open-source LLM. Here we experimented with the following models (Qwen2.5, MistralAI, Microsoft/Phi3, Google/Gemma, Meta-llama/Llama-3.1).
[39]

The instructions for producing valid output in JSON format to represent the term and its semantic best match.

The code below shows how the prompt is encoded and the necessary configuration parameters. It also illustrates how the model was instructed to produce a JSON-formatted output, which is ...
[40]

Load the ontology semantic labels as key/value pairs (Term-ID and Term-Label)
[41]

List all 10,000 generated terms to be semantically matched, one term at a time
[42]

Execute the RAG process for each of the generated terms and produce the 30 candidates per term
[43]

Configure the zero-shot prompt with the SYMP term, candidates, and the LLM
[44]

Extract the semantically identified term and identify its Term-ID from the ontology
[45]

Return a valid JSON record that compiles all the results: JSON_record = {generated-symp, onto-best-match, onto-term-item, status-of-identification}
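
A record of this shape might be assembled as in the sketch below. The four keys are taken from the text; everything else is an assumption, including the reading of `onto-term-item` as the Term-ID, the two status strings, the helper name `make_record`, and the placeholder ontology.

```python
import json


def make_record(generated, best_match, ontology):
    """Compile one result record; `ontology` maps Term-ID to Term-Label."""
    # Reverse lookup (label -> Term-ID), case-insensitive.
    label_to_id = {label.lower(): term_id for term_id, label in ontology.items()}
    term_id = label_to_id.get(best_match.lower()) if best_match else None
    return {
        "generated-symp": generated,
        "onto-best-match": best_match,
        "onto-term-item": term_id,  # assumed to hold the ontology Term-ID
        "status-of-identification": "identified" if term_id else "unidentified",
    }


# Placeholder ontology; the IDs are not real SYMP identifiers.
ontology = {"T:001": "headache", "T:002": "fever"}
records = [
    make_record("head ache", "Headache", ontology),
    make_record("brain fog", None, ontology),
]
# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(record) for record in records)
```

Recording unmatched terms explicitly, rather than dropping them, keeps every generated symptom available for the later cross-model comparison.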
[46]

~/../outs/semsymp-analysis-gemma-3-4b-it.jsonl

Store the JSON records as a JSONL file for further verification and for computing the majority vote among all the models.

Cross-Model Verification: a Majority-vote Heuristic

Inspired by data cross-validation and the strategy of fighting fire with fire [12], we performed a verification process known as cross-models. The process is compute the...
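
Because the description of the vote is cut off here, the following is only a sketch of one plausible rule: a term wins when more than half of all participating models chose it, and abstentions count against every candidate. The strict-majority threshold, the model keys, and the example matches are assumptions.

```python
from collections import Counter


def majority_vote(matches):
    """Return (term, votes) on a strict majority of all models, else (None, top votes)."""
    votes = Counter(term for term in matches.values() if term is not None)
    if not votes:
        return None, 0
    term, count = votes.most_common(1)[0]
    if count * 2 > len(matches):
        return term, count
    return None, count


# Hypothetical best matches for one generated symptom, keyed by model.
matches = {
    "qwen": "headache",
    "mistral": "headache",
    "phi3": "migraine",
    "gemma": "headache",
    "llama": None,  # this model returned no usable match
}
winner, support = majority_vote(matches)
```

Cases that return `None` could then be set aside for the kind of minority-vote inspection that the supplementary table appears to report.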
[47]

Cross-Model Verification

Google Cloud Platform notebook timeouts. Long-running semantic matching tasks may not complete before the 12-hour timeout threshold imposed by Google Cloud Platform. This timeout affects only the Jupyter Notebook interface; the underlying processes continue

Table S1: Minority-vote analysis of 20 generated symptoms, literal ontology matching...
[48]

Malformed or partial JSON outputs from LLMs. Some LLMs occasionally produce incomplete or improperly formatted JSON. To mitigate this, we implemented a robust JSON-extraction and validation step that isolates the <json>...</json> block before parsing. This prevents pipeline failures and ensures that only valid JSON is processed. The follow...
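
An extraction step of this kind could look like the sketch below, which is not the authors' code. It pulls out the first `<json>...</json>` block with a regular expression and returns `None` whenever the block is missing, unparsable, or not a JSON object; the function name and the choice to return `None` rather than raise are assumptions.

```python
import json
import re

# Non-greedy match so that only the first tagged block is captured.
JSON_BLOCK = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def extract_json(raw):
    """Return the object inside the first <json>...</json> block, or None."""
    match = JSON_BLOCK.search(raw)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None
    # Only a JSON object (dict) counts as a valid record here.
    return parsed if isinstance(parsed, dict) else None


good = extract_json('Sure! <json>{"onto-best-match": "headache"}</json> Done.')
broken = extract_json('<json>{"onto-best-match": "headache"</json>')
missing = extract_json("No structured output was produced.")
```

Returning `None` lets the calling loop log the failure and move to the next term instead of stopping the whole run.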
[49]

Gated model access in vLLM. Some open-source models require authentication through Hugging Face. Users must log in, generate an access token, and supply it when launching vLLM. Without this token, gated models will fail to load.
[50]

Out-of-memory errors when loading large models. Users working with limited GPU resources may encounter memory errors. Selecting appropriately sized models (e.g., 3B–7B) or reducing batch sizes can prevent these failures.
[51]

vLLM server not responding. The vLLM server may occasionally fail to respond due to port conflicts or stale processes occupying GPU memory. Restarting the server, verifying the port assignment, and ensuring no previous instance is running (e.g., via nvidia-smi) typically resolves the issue.

Results

The majority vote process resulted in...
