Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow
Pith reviewed 2026-06-29 07:45 UTC · model grok-4.3
The pith
A protocol uses RAG and cross-model majority voting to let LLMs verify associations generated by ChatGPT and detect hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The protocol enables LLMs to establish truth over content generated by other LLMs and expose hallucination through a RAG-enabled, cross-model majority voting workflow.
What carries the argument
RAG-enabled cross-model majority voting workflow that performs semantic verification when ontology exact matching fails.
If this is right
- Associations generated by ChatGPT can be validated for entity correctness using biomedical ontologies.
- Self-consistency across ChatGPT models provides a measure of generative reliability.
- Semantic verification via open-source LLMs can supplement ontology matching to check literature support.
- The workflow exposes cases where ChatGPT hallucinates associations not supported by evidence.
Where Pith is reading between the lines
- If the open-source LLMs in the RAG step introduce their own biases, the verification could systematically miss hallucinations or flag valid associations as hallucinated.
- This approach could be extended to verify outputs from other generative models beyond ChatGPT in different scientific domains.
- Controlled experiments with known true and false associations would be needed to calibrate the voting threshold for reliable truth establishment.
Load-bearing premise
That open-source LLMs used in the RAG component can reliably perform semantic verification of associations when ontology exact matching fails, without introducing their own systematic errors or biases.
What would settle it
A test set of known true and known false biomedical associations where the majority vote from the RAG LLMs consistently disagrees with the ground truth labels.
Original abstract
We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models. To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a protocol for evaluating ChatGPT's ability to generate disease-centric biomedical associations. It describes association generation, entity validation via biomedical ontologies, literature-based verification, a self-consistency strategy across ChatGPT models, and a RAG-enabled semantic verification workflow powered by open-source LLMs with cross-model majority voting to address ontology exact-match failures and expose hallucinations.
Significance. If the protocol can be implemented reproducibly and shown to perform as described, it would offer a structured, multi-layered approach to assessing LLM reliability in biomedical knowledge generation. The combination of ontology validation, literature checks, self-consistency, and RAG-based semantic verification addresses hallucination concerns in a domain where factual accuracy is critical.
major comments (2)
- [Abstract] Abstract: The claim that the RAG-enabled, cross-model majority voting workflow 'enables LLMs to establish truth over content generated by other LLMs and expose hallucination' is presented as an outcome of the protocol design, yet the manuscript provides no empirical results, error analysis, or discussion of failure modes to support this assertion.
- [Use case (RAG-enabled workflow)] Use-case description of the RAG workflow: The protocol assumes open-source LLMs can reliably perform semantic verification of associations when ontology exact matching fails without introducing their own systematic errors or biases, but no safeguards, prompt details, retrieval corpus specification, or bias-mitigation steps are described, which is load-bearing for the workflow's validity.
minor comments (2)
- The self-consistency strategy across ChatGPT models is mentioned but lacks concrete details on how consistency is quantified or how disagreements are resolved.
- The manuscript would benefit from explicit pseudocode or a numbered step-by-step outline of the full workflow to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our protocol manuscript. We address each major comment below.
- Referee: [Abstract] Abstract: The claim that the RAG-enabled, cross-model majority voting workflow 'enables LLMs to establish truth over content generated by other LLMs and expose hallucination' is presented as an outcome of the protocol design, yet the manuscript provides no empirical results, error analysis, or discussion of failure modes to support this assertion.
  Authors: We agree that the manuscript describes a protocol and does not present empirical results or error analysis. The abstract phrasing was meant to outline the intended purpose of the workflow design. We will revise the abstract to state that the protocol is designed to enable such verification and hallucination exposure, rather than claiming it as a demonstrated outcome.
  Revision: yes
- Referee: [Use case (RAG-enabled workflow)] Use-case description of the RAG workflow: The protocol assumes open-source LLMs can reliably perform semantic verification of associations when ontology exact matching fails without introducing their own systematic errors or biases, but no safeguards, prompt details, retrieval corpus specification, or bias-mitigation steps are described, which is load-bearing for the workflow's validity.
  Authors: The use case is presented as an illustrative implementation. We acknowledge that additional details are needed for reproducibility. In revision, we will expand this section with example prompts, specification of the retrieval corpus (e.g., PubMed), and bias-mitigation approaches such as the cross-model majority voting already central to the protocol.
  Revision: yes
Circularity Check
No significant circularity detected
The paper presents a protocol description for generating, validating, and verifying biomedical associations using LLMs, RAG, and majority voting. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction are present. No self-citations form load-bearing premises, no uniqueness theorems are invoked, and no ansatz or renaming patterns apply. The workflow is externally described without internal self-reference that would make claims equivalent to their inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Zhang, K., Huang, T., Malin, B.A., Osterman, T., Long, Q., and Jiang, X. (2025). Introducing mcodegpt as a zero-shot information extraction from clinical free text data tool for cancer research. Commun. Med. 5, 422
2025
-
[2]
Jiang, Y., Qiang, S., Li, W., and Liang, Y. (2025). Llm-diffaug: Enhancing few-shot object detection via llm-guided diffusion augmentation. Knowl. Base Syst. 326, 114066
2025
-
[3]
Ciampaglia, G.L., Shiralkar, P., Rocha, L.M., Bollen, J., Menczer, F., and Flammini, A. (2015). Computational fact checking from knowledge networks. PLoS One 10, e0128193
2015
-
[4]
Tucker J.A., Guess A., Barberá P., Vaccari C., Siegel A., Sanovich S., Stukal D., Nyhan B. Social media, political polarization, and political disinformation: A review of the scientific literature. Political Polarization, and Political Disinformation: A Review of the Scientific Literature (March 19, 2018). SSRN electronic journal; 2018. p. 95. http://dx....
-
[5]
Flamino, J., Galeazzi, A., Feldman, S., Macy, M.W., Cross, B., Zhou, Z., Serafino, M., Bovet, A., Makse, H.A., and Szymanski, B.K. (2023). Political polarization of news media and influencers on twitter in the 2016 and 2020 us presidential elections. Nat. Hum. Behav. 7, 904–916
2023
-
[6]
Abdeen, M.A.R., Hamed, A.A., and Wu, X. (2021). Fighting the covid-19 infodemic in news articles and false publications: The neonet text classifier, a supervised machine learning algorithm. Appl. Sci. 11, 7265
2021
-
[7]
Hamed A.A., Crimi A., Lee B.S., and Misiak M.M. Challenging the Machinery of Generative Ai with Fact-Checking: Ontology-Driven Biological Graphs for Verifying Human Disease-Gene Links. Available at SSRN 4888506. 2023, https://ssrn.com/abstract=4888506
2023
-
[8]
Hamed, A.A., Crimi, A., Lee, B.S., and Misiak, M.M. (2024). Fact-checking generative ai: Ontology-driven biological graphs for disease-gene link verification. In International Conference on Computational Science (Springer), pp. 130–137
2024
-
[9]
Hamed, A.A., Zachara-Szymanska, M., and Wu, X. (2024). Safeguarding authenticity for mitigating the harms of generative ai: Issues, research agenda, and policies for detection, fact-checking, and ethical ai. iScience 27, 108782
2024
-
[10]
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., et al. (2008). Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1–37
2008
-
[11]
Wang, Z., Zhou, X., Yang, Y., Ma, B., Wang, L., and Dong, R. (2025). Sgeu: enhancing llm reasoning via backward exemplar generation and verification. Appl. Intell. 55, 748
2025
-
[12]
Verspoor, K. (2024). Fighting Fire with Fire— Using Llms to Combat Llm Hallucinations. Nature 630, 569–570. https://doi.org/10.1038/d41586-024-01641-0
2024
-
[13]
Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., Li, Q., and Li, J. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digit. Med. 7, 41
2024
-
[14]
Azimi, I., Qi, M., Wang, L., Rahmani, A.M., and Li, Y. (2025). Evaluation of llms accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval. Sci. Rep. 15, 1506
2025
-
[15]
Hamed, A.A., Crimi, A., Misiak, M.M., and Lee, B.S. (2025). From knowledge generation to knowledge verification: examining the biomedical generative capabilities of chatgpt. iScience 28, 112492
2025
-
[16]
National Library of Medicine (2026)
National Center for Biotechnology Information, and U.S. National Library of Medicine (2026). Pubmed. Accessed April 3, 2026
2026
-
[17]
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H. (2026). The faiss library. IEEE Transactions on Big Data 12, 346–361. https://doi.org/10.1109/TBDATA.2025.3618474
-
[18]
Mucherino, A., Papajorgji, P.J., and Pardalos, P.M. (2009). K-Nearest Neighbor Classification (Springer New York), pp. 83–106. https://doi.org/10.1007/978-0-387-88615-2_4
-
[19]
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., and Stoica, I. (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, New York, NY, USA (Association for Computing ...
2023
-
[20]
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2020). Huggingface’s transformers: State-of-the-art natural language processing. Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45
2020
-
[21]
Symptom Ontology (Symp) – Ontology Lookup Service
European Bioinformatics Institute (2024). Symptom Ontology (Symp) – Ontology Lookup Service. Accessed: 2026-03-08
2024
-
[22]
Qwen/Qwen2.5-0.5B-Instruct
Hamed, A.A., Fandy, T.E., and Wu, X. (2024). Accelerating complex disease treatment through network medicine and genai: A case study on drug repurposing for breast cancer. In 2024 IEEE International Conference on Medical Artificial Intelligence (MedAI), pp. 354–359.
STAR Protocols 7, 104533, June 19, 2026
Methods S1: A Use-Case for Performing Semantic ...
2024
-
[23]
Securing an independent instance with at least 2 GPUs. While we used GCP, any cloud environment would offer an instance with such configurations and support for Jupyter Notebooks.
-
[24]
Install Python PyTorch for a GPU-ready environment:
-
[25]
Install the vLLM framework on the VM: pip install vllm
-
[26]
Sign up for a HuggingFace account [20] and accept the terms of use of the open-source models used.
-
[27]
Test if vLLM is installed properly:
CUDA_VISIBLE_DEVICES=0,1 vllm serve --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000
pip install sentence-transformers
intfloat/e5-large
-
[28]
It is recommended to launch a terminal from within the Jupyter Notebook and launch the model of choice with the configuration as part of the executing command (i.e., the number of GPUs, the model name, the model version, and the port number when running the process from a notebook):
The Semantic Matching Using Open-Source LLMs
Since the objectiv...
-
[29]
A new list of 10,000 generated disease-symptom associations
-
[30]
The list of SYMPTOM ontology Term-IDs and Terms.
Executing the Retrieval-Augmented Generation (RAG) to Produce Top-30 Candidates
The RAG process takes a single symptom term and the full list of SYMPTOM ontology terms as input. It returns the Top-K most similar ontology terms, where K is set to 30. For each symptom, the model identifi...
-
[31]
Installing the necessary transformers:
-
[32]
Selecting an embedding model that is compatible with the open-source LLMs used:
-
[33]
Process a symptom term as the main query parameter.
-
[34]
Perform the RAG search, which produces a list of the top-30 similar candidate terms and their corresponding scores.
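The retrieval steps above can be sketched as follows. This is a minimal illustration only, not the authors' code: the `embed` function is a toy character-trigram vectorizer standing in for a real embedding model (the source mentions intfloat/e5-large and cites FAISS), and the ontology IDs, labels, and function names are hypothetical placeholders.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy character n-gram count vector; a stand-in for a real embedding model."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_candidates(symptom, ontology, k=30):
    """Return up to k (term_id, label, score) tuples, most similar first."""
    query = embed(symptom)
    scored = [(tid, label, cosine(query, embed(label)))
              for tid, label in ontology.items()]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]


# Hypothetical ontology excerpt; the IDs are placeholders, not real SYMP identifiers.
ontology = {
    "TERM:1": "fever",
    "TERM:2": "headache",
    "TERM:3": "abdominal pain",
    "TERM:4": "chest pain",
}
candidates = top_k_candidates("pain in the chest", ontology, k=2)
```

With a real embedding model, only `embed` and the similarity search would change; the shape of the result (candidate terms paired with scores) stays the same.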
-
[35]
Both the candidate terms and the symptom are returned as input for a zero-shot prompt to perform the semantic matching.
print(llm.generate("Hello, vLLM!")[0].outputs[0].text)
EOF
The Best Semantic Match Prompt Engineering
To be able to match each of the generated symptoms, we designed a highly ...
-
[36]
The ontology term generated.
-
[37]
The Top-K candidate terms retrieved by the Retrieval-Augmented Generation process.
-
[38]
A given open-source LLM. Here we experimented with the following models (Qwen2.5, MistralAI, Microsoft/Phi3, Google/Gemma, Meta-llama/Llama-3.1).
-
[39]
The instructions for producing valid output in JSON format that represents the term and its best semantic match.
The code below shows how the prompt is encoded and the necessary configuration parameters.
It also illustrates how the model was instructed to produce a JSON-formatted output, which is ...
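The original prompt listing did not survive extraction. As a rough sketch of what a zero-shot prompt with these four ingredients (generated term, Top-K candidates, target model, JSON output instructions) might look like, consider the following. The wording of the prompt, the `build_prompt` name, the JSON keys, and the `generation_config` values are all assumptions for illustration; only the `<json>...</json>` wrapper is taken from the troubleshooting notes in the source.

```python
import json


def build_prompt(generated_term, candidates):
    """Assemble a zero-shot matching prompt (hypothetical wording)."""
    numbered = "\n".join(
        f"{i}. {label}" for i, label in enumerate(candidates, start=1)
    )
    # Example of the expected answer shape; the key names are assumptions.
    schema = json.dumps(
        {"term": generated_term, "best_match": "<one candidate or NONE>"}
    )
    return (
        "You are matching a symptom term to an ontology.\n"
        f"Generated term: {generated_term}\n"
        f"Candidate ontology terms:\n{numbered}\n"
        "Choose the single candidate that is the best semantic match, "
        "or NONE if no candidate matches.\n"
        "Answer only with JSON inside <json>...</json> tags, "
        f"e.g. <json>{schema}</json>"
    )


# Hypothetical decoding settings; the source does not list the values it used.
generation_config = {"temperature": 0.0, "max_tokens": 128}

prompt = build_prompt("swollen lymph nodes", ["lymphadenopathy", "edema", "fever"])
```

Keeping the candidate list numbered and restricting the answer to one candidate or NONE is one way to make the output easy to validate against the ontology afterwards.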
-
[40]
Load the ontology semantic labels as key/value pairs (Term-ID and Term-Label)
-
[41]
List all the 10,000 generated terms to be semantically matched, one term at a time
-
[42]
Execute the RAG process for each of the generated terms and produce the 30 candidates per term
-
[43]
Configure the zero-shot prompt with the SYMP term, candidates, and the LLM
-
[44]
Extract the semantically identified term and identify its Term-ID from the ontology
-
[45]
Return a valid JSON record that compiles all the results: JSON_record = {generated-symp, onto-best-match, onto-term-item, status-of-identification}
-
[46]
~/../outs/semsymp-analysis-gemma-3-4b-it.jsonl
Store the JSON records as a JSONL file for further verification and for computing the majority vote among all the models.
Cross-Model Verification: a Majority-vote Heuristic
Inspired by data cross-validation and the strategy of fighting fire with fire [12], we performed a verification process known as cross-models. The process is compute the...
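Because the description of the voting rule is cut off in this copy, the following is only a sketch of one plausible majority-vote heuristic, assuming that each model contributes one best-match label per generated term and that a label wins only with a strict majority (more than half of the models). The model names, the threshold, and the `majority_vote` function are assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote(matches_by_model):
    """Return (winner, votes) if one label has a strict majority, else (None, votes).

    matches_by_model maps a model name to the ontology label it chose.
    """
    if not matches_by_model:
        return None, 0
    counts = Counter(matches_by_model.values())
    winner, votes = counts.most_common(1)[0]
    if votes * 2 > len(matches_by_model):
        return winner, votes
    return None, votes


# Hypothetical per-model answers for one generated symptom.
answers = {
    "qwen": "lymphadenopathy",
    "mistral": "lymphadenopathy",
    "phi3": "edema",
    "gemma": "lymphadenopathy",
    "llama": "NONE",
}
winner, votes = majority_vote(answers)

# A split with no strict majority yields no winner.
split_winner, split_votes = majority_vote({"a": "edema", "b": "fever"})
```

Terms without a winner are the natural candidates for manual review, since disagreement between models is exactly the signal the cross-model step is meant to surface.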
-
[47]
Cross-Model Verification
Google Cloud Platform notebook timeouts. Long-running semantic matching tasks may not complete before the 12-hour timeout threshold imposed by Google Cloud Platform. This timeout affects only the Jupyter Notebook interface; the underlying processes continue
Table S1: Minority-vote analysis of 20 generated symptoms, literal ontology matching...
-
[48]
Malformed or partial JSON outputs from LLMs. Some LLMs occasionally produce incomplete or improperly formatted JSON. To mitigate this, we implemented a robust JSON-extraction and validation step that isolates the <json>...</json> block before parsing. This prevents pipeline failures and ensures that only valid JSON is processed. The follow...
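The authors' extraction code is not preserved here. A minimal sketch of such a step, assuming the model wraps its answer in `<json>...</json>` as described, could look like this; the function name and the choice to return `None` on any failure are assumptions.

```python
import json
import re

# Non-greedy match so only the first <json>...</json> block is captured.
JSON_BLOCK = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def extract_json(raw_output):
    """Return the parsed JSON object from the first <json> block, or None."""
    match = JSON_BLOCK.search(raw_output)
    if match is None:
        return None  # no tagged block at all
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # block present but not valid JSON
    # Accept only JSON objects; lists or bare strings are treated as malformed.
    return parsed if isinstance(parsed, dict) else None


good = extract_json('Sure! <json>{"term": "swelling", "best_match": "edema"}</json>')
truncated = extract_json('<json>{"term": "swelling", "best_ma')
broken = extract_json('<json>{"term": "swelling",}</json>')
```

Returning `None` instead of raising lets the caller mark the record's identification status as failed and continue with the next term.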
-
[49]
Gated model access in vLLM. Some open-source models require authentication through Hugging Face. Users must log in, generate an access token, and supply it when launching vLLM. Without this token, gated models will fail to load.
-
[50]
Out-of-memory errors when loading large models. Users working with limited GPU resources may encounter memory errors. Selecting appropriately sized models (e.g., 3B–7B) or reducing batch sizes can prevent these failures.
-
[51]
lymphadenopathy
vLLM server not responding. The vLLM server may occasionally fail to respond due to port conflicts or stale processes occupying GPU memory. Restarting the server, verifying the port assignment, and ensuring no previous instance is running (e.g., via nvidia-smi) typically resolves the issue.
Results
The majority vote process resulted in...