Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare

Bahadorreza Ofoghi; Manan Gangar; Saeedeh Javadi; Sara Mirabi

arxiv: 2511.06668 · v2 · submitted 2025-11-10 · 💻 cs.IR · cs.LG

Saeedeh Javadi, Sara Mirabi, Manan Gangar, Bahadorreza Ofoghi

Pith reviewed 2026-05-18 00:20 UTC · model grok-4.3

keywords: retrieval-augmented generation, RAG, healthcare, contradictions, large language models, PubMed, medical information retrieval, benchmark dataset

The pith

Contradictions between highly similar medical abstracts cause large language models to give inconsistent and less factually accurate answers during retrieval-augmented generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests retrieval-augmented generation on medicine queries by turning official Australian health regulator headings into questions and pairing them with PubMed abstracts drawn from different publication years. It shows that even abstracts scoring high on similarity can still contain direct factual conflicts, and that these conflicts lead models to produce answers with more inconsistencies and lower accuracy. The work demonstrates that simple similarity-based retrieval is not enough to keep medical RAG trustworthy. A controlled benchmark is built to measure the effects of outdated and contradictory evidence separately from other variables.

Core claim

The central claim is that contradictions between highly similar abstracts degrade performance in RAG-based medical responses, producing inconsistencies and reduced factual accuracy in model answers.

What carries the argument

A benchmark dataset built from Therapeutic Goods Administration consumer medicine information headings repurposed as queries, with PubMed abstracts retrieved and stratified by publication year to isolate the impact of temporal contradictions.
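One simple way to spread a fixed retrieval budget across publication years is a round-robin over per-year queues. The sketch below is illustrative only: the function name `round_robin_by_year`, the `(doc_id, year)` tuple format, and the sample data are assumptions made here, and the paper's exact selection procedure may differ.

```python
from collections import defaultdict
from itertools import zip_longest


def round_robin_by_year(abstracts, k):
    """Pick up to k abstracts, cycling through publication years.

    `abstracts` is a list of (doc_id, year) tuples, assumed to be
    ordered by retrieval rank (best first). Hypothetical helper.
    """
    by_year = defaultdict(list)
    for doc_id, year in abstracts:
        by_year[year].append(doc_id)

    # One queue per year, visited from the oldest year to the newest.
    queues = [by_year[year] for year in sorted(by_year)]

    picked = []
    # zip_longest yields the first doc of every year, then the second, etc.
    for layer in zip_longest(*queues):
        for doc_id in layer:
            if doc_id is not None and len(picked) < k:
                picked.append(doc_id)
    return picked


# Made-up retrieval results for a single query.
ranked = [("a", 2021), ("b", 2021), ("c", 2005), ("d", 2013), ("e", 2005)]
selection = round_robin_by_year(ranked, 4)
print(selection)
```

With this layout every year contributes its best-ranked abstract before any year contributes a second one, so no single period dominates the context.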

If this is right

RAG systems require mechanisms beyond retrieval similarity to detect and reconcile conflicting evidence in medical sources.
Model answers become less consistent and factually accurate when contradictory abstracts are included in the context.
Temporal stratification of evidence reveals that outdated information contributes to errors even when documents appear similar.
Comparative testing across five LLMs shows that the degradation occurs across different model architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contradiction problem could appear in other high-stakes retrieval settings such as legal or financial document generation.
Adding a pre-generation step that flags or filters contradictory passages might improve reliability without changing the core retrieval method.
Expanding the benchmark to include full-text articles rather than abstracts could test whether longer contexts amplify or reduce the observed effects.
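The pre-generation filtering idea above could be sketched as a greedy pass over the ranked passages. This is a minimal illustration, not a method from the paper: `filter_contradictions`, the `threshold` value, and the keyword-based `toy_contradiction` scorer are all hypothetical, and in practice the scorer would be replaced by a real contradiction model such as an NLI classifier.

```python
from typing import Callable, List


def filter_contradictions(
    passages: List[str],
    contradiction: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep passages in rank order, dropping any passage that
    contradicts an already kept one at or above `threshold`."""
    kept: List[str] = []
    for passage in passages:
        if all(contradiction(passage, other) < threshold for other in kept):
            kept.append(passage)
    return kept


def toy_contradiction(a: str, b: str) -> float:
    """Toy stand-in for a contradiction model (assumption):
    flags a pair when exactly one passage contains 'not safe'."""
    return 1.0 if ("not safe" in a) != ("not safe" in b) else 0.0


passages = [
    "Drug X is safe in pregnancy.",
    "Drug X is not safe in pregnancy.",
    "Drug X is safe with food.",
]
kept = filter_contradictions(passages, toy_contradiction)
print(kept)
```

Note the design choice: a greedy filter always sides with the higher-ranked passage, which is only appropriate if rank is a reasonable proxy for reliability. A flag-only variant that reports conflicting pairs without dropping them would avoid that bias.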

Load-bearing premise

The chosen TGA headings and temporally stratified PubMed abstracts contain representative contradictions whose effects can be isolated from other retrieval or generation factors.

What would settle it

If models supplied with explicit contradiction detection produce answers whose consistency and factual accuracy show no measurable drop on the same queries and retrieved abstracts, the claim would be undermined.

Figures

Figures reproduced from arXiv: 2511.06668 by Bahadorreza Ofoghi, Manan Gangar, Saeedeh Javadi, Sara Mirabi.

**Figure 1.** Contradiction-aware medical RAG pipeline, showing data progression from TGA queries through search, embedding, and three retrieval strategies to final LLM-based generation and evaluation.

$\mathrm{CNT}(d) = \frac{1}{|R_{i,j}| - 1} \sum_{d' \in R_{i,j},\, d' \neq d} \mathrm{CNT}(d, d')$ (12). Given $K$, the most-contradictory and least-contradictory context sets used in the retrieval variants were constructed as: $C^{\mathrm{most}}_{i,j} = \operatorname{arg\,top}K_{d \in R_{i,j}}$ …
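The document-level contradiction score in the equation fragment above (the mean of a document's pairwise contradiction scores against every other retrieved document) and the top-K selection can be sketched as follows. The pairwise matrix is assumed to be precomputed and symmetric, the numbers are made up, and treating the least-contradictory set as the K lowest-scoring documents is an assumption, since the source formula is truncated.

```python
def doc_contradiction_scores(pairwise):
    """pairwise[a][b] holds CNT(d_a, d_b) for one retrieved set R_{i,j}.

    Returns CNT(d) for each document: the mean pairwise score against
    all other documents. Requires at least two documents.
    """
    n = len(pairwise)
    if n < 2:
        raise ValueError("need at least two documents")
    return [
        sum(pairwise[a][b] for b in range(n) if b != a) / (n - 1)
        for a in range(n)
    ]


def top_k(scores, k, most=True):
    """Indices of the k highest (most=True) or lowest scoring documents."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=most)
    return order[:k]


# Illustrative pairwise scores for three documents (made-up values).
pairwise = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = doc_contradiction_scores(pairwise)
most_contradictory = top_k(scores, 2, most=True)
least_contradictory = top_k(scores, 2, most=False)
print(scores, most_contradictory, least_contradictory)
```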

**Figure 2.** Normalized distribution of documents across contradiction score bins and 5-year publication intervals.

Original abstract

In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper constructs a benchmark dataset by repurposing headings from Australian Therapeutic Goods Administration (TGA) consumer medicine documents as natural-language queries. It retrieves temporally stratified PubMed abstracts for these queries and evaluates five LLMs on RAG-generated responses, reporting that contradictions between highly similar abstracts degrade factual accuracy and introduce inconsistencies in model outputs.

Significance. If the central empirical claim holds after addressing design gaps, the work demonstrates a concrete limitation of similarity-based retrieval in high-stakes medical RAG and motivates contradiction-aware filtering. The use of real TGA documents and temporal stratification provides ecological validity that is stronger than purely synthetic contradiction tests.

major comments (1)

[§3.2] (Retrieval and Stratification): The setup retrieves multiple temporally stratified PubMed abstracts per TGA heading but does not include a matched control condition of high-similarity yet internally consistent abstracts. Without this contrast, the observed degradations in factual accuracy and consistency cannot be attributed specifically to contradictions rather than to general multi-document reconciliation difficulties. This attribution is load-bearing for the central claim.

minor comments (2)

[Results] The evaluation protocol (results section) does not report exact metrics for factual accuracy or inconsistency detection, nor any inter-annotator agreement for human judgments of model outputs.
[Tables/Figures] Table or figure captions should explicitly state the number of queries, abstracts per query, and models evaluated to allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our experimental design. The point regarding the need for a matched control condition is well-taken and directly relevant to strengthening the attribution of effects to contradictions specifically. We address this below and outline the planned revision.

Point-by-point responses

Referee: [§3.2] (Retrieval and Stratification): The setup retrieves multiple temporally stratified PubMed abstracts per TGA heading but does not include a matched control condition of high-similarity yet internally consistent abstracts. Without this contrast, the observed degradations in factual accuracy and consistency cannot be attributed specifically to contradictions rather than to general multi-document reconciliation difficulties. This attribution is load-bearing for the central claim.

Authors: We agree that isolating the specific contribution of contradictions requires a contrast against high-similarity but internally consistent document sets. Our current approach uses temporal stratification on the same TGA-derived queries to surface contradictions that arise naturally from evolving medical evidence, and we quantify their downstream impact on LLM consistency and accuracy. However, this does not fully rule out general multi-document reconciliation challenges. To address the concern, we will introduce a control condition consisting of high-similarity PubMed abstracts that are temporally proximate and verified as non-contradictory (e.g., via manual inspection or semantic consistency checks). We will then compare RAG performance across the contradictory and consistent conditions while holding similarity and query fixed. This addition will be reported in a revised §3.2 and associated results.

Revision: yes
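As an editorial sketch (not part of the authors' response), the comparison described above could be computed as a paired, per-query difference so that the query is held fixed across conditions. The function name `paired_gap`, the score dictionaries, and their values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def paired_gap(consistent, contradictory):
    """Mean per-query accuracy gap (consistent minus contradictory).

    Both arguments map query id -> accuracy score. Only queries present
    in both conditions are compared, which keeps the query fixed.
    """
    shared = sorted(set(consistent) & set(contradictory))
    if not shared:
        raise ValueError("no queries shared between conditions")
    diffs = [consistent[q] - contradictory[q] for q in shared]
    return mean(diffs), len(shared)


# Placeholder scores for illustration only.
consistent = {"q1": 1.0, "q2": 0.5, "q3": 1.0}
contradictory = {"q1": 0.5, "q2": 0.5, "q4": 0.0}
gap, n_queries = paired_gap(consistent, contradictory)
print(gap, n_queries)
```

A positive gap on matched queries would point to contradictions specifically, whereas a gap near zero would suggest that general multi-document reconciliation explains the degradation.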

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

Full rationale

The paper constructs a benchmark dataset from TGA consumer medicine headings repurposed as queries, retrieves temporally stratified PubMed abstracts, and reports direct observations of LLM RAG outputs on factual accuracy and inconsistency. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains underpin the central claims. Findings are stated as empirical results from model evaluations on the constructed dataset without reduction to inputs by construction or load-bearing self-references. The work is self-contained as an observational study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions about document representativeness and the ability to measure factual accuracy in generated text; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Headings from consumer medicine documents can be treated as natural-language queries that reflect real user information needs. Used to create the benchmark dataset from TGA documents.
domain assumption: Publication year stratification in PubMed abstracts reliably captures outdated versus current evidence. Central to the controlled temporal evaluation described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

$\mathrm{CNT}(d_a, d_b) = \max_{(s,t) \in P(d_a, d_b)} P_{\mathrm{con}}(s, t)$ ... most-contradictory context sets $C^{\mathrm{most}}_{i,j} = \operatorname{arg\,top}K_{d \in R_{i,j}}\, \mathrm{CNT}(d)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Temporal-citation balanced selection ... stratified sampling ... round-robin across years

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neuro-Symbolic Resolution of Recommendation Conflicts in Multimorbidity Clinical Guidelines
cs.CL · 2026-04 · unverdicted · novelty 7.0

Neuro-symbolic pipeline using multi-agent translation and SAT solving detects conflicts in multimorbidity guidelines with 0.861 F1, finding 90.6% are local conflicts on 12 SGLT2 guidelines.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Cochrane library. search | cochrane library. https://www.cochranelibrary.com/cdsr/reviews (2025)
[2] American Heart Association: Statements search. https://professional.heart.org/en/guidelines-statements-search (2025)
[3] Carbonell, J., Goldstein, J.: The use of mmr, diversity-based reranking for reordering documents and producing summaries. ACM SIGIR Forum 51(2), 335–336 (1998)
[4] Carpenter, D., Geryk, L., AT, A.C., Nagler, R., Dieckmann, N., Han, P.: Conflicting health information: A critical research need. Health Expect 19, 1173–1182 (2016). https://doi.org/10.1111/hex.12438
[5] Cohan, A., Feldman, S., Beltagy, I., Downey, D., Weld, D.: Specter: document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180 (2020)
[6] Das, S., Ge, Y., Guo, Y., Rajwal, S., Hairston, J., Powell, J., Walker, D., Peddireddy, S., Lakamana, S., Bozkurt, S., et al.: Two-layer retrieval-augmented generation framework for low-resource medical question answering using reddit data: proof-of-concept study. Journal of Medical Internet Research 27, e66220 (2025)
[7] DeepMind, G.: Gemma 3 270m instruction-tuned (mlx 8-bit). https://huggingface.co/mlx-community/gemma-3-270m-it-8bit (2025)
[8] Deka, P.: Pubmedbert-mnli-mednli. https://huggingface.co/pritamdeka/PubMedBERT-MNLI-MedNLI (2021)
[9] Dong, X., Zhu, W., Wang, H., Chen, X., Qiu, P., Yin, R., Su, Y., Wang, Y.: Talk before you retrieve: Agent-led discussions for better rag in medical qa. arXiv preprint arXiv:2504.21252 (2025)
[10] mradermacher (GGUF), M.A.: Mixtral-8x7b-instruct-v0.1 (gguf). https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF (2023), apache-2.0; 32k context
[11] Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spacy: Industrial-strength natural language processing in python (2020). https://doi.org/10.5281/zenodo.1212303
[12] Jin, Q., Kim, W., Chen, Q., Comeau, D.C., Yeganova, L., Wilbur, W.J., Lu, Z.: Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval. Bioinformatics 39(11), btad651 (2023)
[13] Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7(3), 535–547 (2019)
[14] Lab, Y.B.X.: Med-llama3-8b. https://huggingface.co/YBXL/Med-LLaMA3-8B (2024)
[15] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459...
[16] Low, Y.S., Jackson, M.L., Hyde, R.J., Brown, R.E., Sanghavi, N.M., Baldwin, J.D., Pike, C.W., Muralidharan, J., Hui, G., Alexander, N., et al.: Answering real-world clinical questions using large language model based systems. arXiv preprint arXiv:2407.00541 (2024)
[17] OpenAI: gpt-oss-20b model card. https://huggingface.co/openai/gpt-oss-20b (2025), apache-2.0; 21B total, 3.6B active; 128k context
[18] Ou, J., Huang, T., Zhao, Y., Yu, Z., Lu, P., Ying, R.: Experience retrieval-augmentation with electronic health records enables accurate discharge qa. arXiv preprint arXiv:2503.17933 (2025)
[19] Pal, A., Umapathi, L.K., Sankarasubbu, M.: Med-HALT: Medical domain hallucination test for large language models. In: Jiang, J., Reitter, D., Deng, S. (eds.) Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). pp. 314–334. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653/v1/2023. c...
[20] Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval 3(4), 333–389 (2009)
[21] Shi, Y., Xu, S., Yang, T., Liu, Z., Liu, T., Li, X., Liu, N.: MKRAG: Medical knowledge retrieval augmented generation for medical question answering. In: Proceedings of AMIA Annual Symposium (2025)
[22] Shi, Y., Xu, S., Yang, T., Liu, Z., Liu, T., Li, X., Liu, N.: MKRAG: Medical knowledge retrieval augmented generation for medical question answering. In: AMIA Annual Symposium Proceedings. vol. 2024, p. 1011 (2025)
[23] Stuhlmann, L., Saxer, M.A., Fürst, J.: Efficient and reproducible biomedical question answering using retrieval augmented generation. arXiv preprint arXiv:2505.07917 (2025)
[24] Taylor, T.: How to make sense of contradictory health news, https://www.abc.net.au/news/health/2018-04-24/making-sense-of-seemingly-contradictory-health-news/9343684, accessed: 2025-09-16
[25] Technology Innovation Institute: Falcon3-7b-instruct. https://huggingface.co/tiiuae/Falcon3-7B-Instruct (2024), license: TII Falcon-LLM 2.0
[26] Upadhyay, R., Viviani, M.: Enhancing health information retrieval with rag by prioritizing topical relevance and factual accuracy. Discover Computing 28(1), 27 (2025)
[27] Wang, C., Chen, Y.: Evaluating large language models for evidence-based clinical question answering. arXiv preprint arXiv:2509.10843 (2025)
[28] Wang, Z., Khatibi, E., Rahmani, A.M.: Medcot-rag: Causal chain-of-thought rag for medical question answering. arXiv preprint arXiv:2508.15849 (2025)
[29] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N.: C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597 (2023)
[30] Xiong, G., Jin, Q., Lu, Z., Zhang, A.: Benchmarking retrieval-augmented generation for medicine. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) ACL (Findings). pp. 6233–6251. Association for Computational Linguistics (2024)
[31] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., Xu, W.: Knowledge conflicts for LLMs: A survey. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) EMNLP. pp. 8541–8565. Association for Computational Linguistics (2024)
[32] Yan, S.Q., Gu, J.C., Zhu, Y., Ling, Z.H.: Corrective retrieval augmented generation (2024), https://arxiv.org/abs/2401.15884
[33] Zhang, G., Xu, Z., Jin, Q., Chen, F., Fang, Y., Liu, Y., Rousseau, J.F., Xu, Z., Lu, Z., Weng, C., et al.: Leveraging long context in retrieval augmented language models for medical question answering. npj Digital Medicine 8(1), 239 (2025)
[34] Zhao, X., Liu, S., Yang, S.Y., Miao, C.: Medrag: Enhancing retrieval-augmented generation with knowledge graph-elicited reasoning for healthcare copilot. In: Proceedings of the ACM on Web Conference 2025. pp. 4442–4457 (2025)
