BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

Chang Zong; Hao Ning; Jian Wan; Jie Huang; Siliang Tang

arxiv: 2605.17435 · v1 · pith:NYWZCKOZnew · submitted 2026-05-17 · 💻 cs.CL

Pith reviewed 2026-05-20 13:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords biomedical question answering · structured evidence · uncertainty modeling · Dempster-Shafer theory · large language models · evidence fusion · retrieval-augmented generation

The pith

BELIEF converts retrieved documents into structured evidence to improve biomedical question answering by fusing symbolic and neural reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that biomedical question answering can be improved by explicitly modeling the reliability and uncertainty in retrieved literature rather than feeding it as flat text to large language models. It does this by turning documents into evidence objects that capture clinical attributes, source quality, relevance, and support strength. These objects then support two paths: one using Dempster-Shafer theory to compute belief and uncertainty symbolically, and another using the LLM for semantic inference. An arbitration module then combines them based on consistency and reliability. This approach leads to better performance on standard biomedical QA benchmarks with various LLM backbones.

Core claim

BELIEF structures retrieved literature into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These objects enable reliability-weighted basic probability assignments via Dempster-Shafer theory for symbolic evidence fusion to estimate belief and residual uncertainty, while the same objects support LLM-based semantic inference; a reliability-aware arbitration module then reconciles the two outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency.
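
To make the evidence-object idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names, the assumption that every score lies in [0, 1], and the choice of reliability as source quality times relevance are all hypothetical. The one standard ingredient is Shafer-style discounting, where any mass not committed to the supported hypothesis is assigned to the whole frame (ignorance).

```python
from dataclasses import dataclass

THETA = "THETA"  # label for the whole answer frame (ignorance mass)


@dataclass
class EvidenceObject:
    """One retrieved document, reduced to the fields the paper lists."""
    clinical_attributes: dict  # e.g. study design, population (hypothetical keys)
    source_quality: float      # assumed to lie in [0, 1]
    relevance: float           # assumed to lie in [0, 1]
    support_strength: float    # assumed to lie in [0, 1]
    hypothesis: str            # candidate answer this evidence supports


def to_bpa(ev: EvidenceObject) -> dict:
    """Build a simple-support mass function with Shafer-style discounting.

    Assumption: reliability is the product of source quality and relevance.
    """
    reliability = ev.source_quality * ev.relevance
    committed = reliability * ev.support_strength
    # Whatever is not committed to the hypothesis stays on the whole frame.
    return {ev.hypothesis: committed, THETA: 1.0 - committed}


example = EvidenceObject({"design": "RCT"}, 0.9, 0.8, 0.5, "yes")
masses = to_bpa(example)
```

In this example about 0.36 of the mass goes to "yes" and about 0.64 stays on the frame, so a weakly supporting or low-reliability document mostly contributes uncertainty rather than a vote.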

What carries the argument

Structured evidence objects that provide a shared basis for symbolic Dempster-Shafer fusion and neural semantic inference, reconciled by a reliability-aware arbitration module.
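
For readers unfamiliar with the symbolic path, Dempster's rule of combination can be sketched in a few lines of Python. The rule itself is standard; the frame {yes, no, maybe} mirrors PubMedQA's answer set, and the example masses are made up for illustration rather than taken from the paper. How BELIEF orders or weights its combinations is not described on this page, so none of that is modeled here.

```python
from itertools import product


def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination for two mass functions.

    Keys are frozensets of answer labels; values are masses summing to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + x * y
        else:
            conflict += x * y  # product mass that lands on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize so the surviving masses sum to 1 again.
    return {s: v / (1.0 - conflict) for s, v in fused.items()}


def belief(m: dict, hypothesis: frozenset) -> float:
    """Bel(A): total mass of focal sets contained in A."""
    return sum(v for s, v in m.items() if s <= hypothesis)


frame = frozenset({"yes", "no", "maybe"})
m_a = {frozenset({"yes"}): 0.6, frame: 0.4}  # illustrative masses
m_b = {frozenset({"no"}): 0.5, frame: 0.5}   # illustrative masses
fused = combine(m_a, m_b)
bel_yes = belief(fused, frozenset({"yes"}))
residual = fused[frame]  # mass still assigned to the whole frame
```

Here the two sources disagree, so 0.3 of the product mass is conflict. After renormalization the belief in "yes" is about 0.43, "no" gets about 0.29, and about 0.29 remains on the whole frame as residual uncertainty.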

If this is right

Evidence reliability becomes an explicit factor in the final answer selection.
The system can quantify and report residual uncertainty in its decisions.
Performance gains appear across different general-purpose LLM backbones without domain-specific pretraining.
Retrieved evidence is utilized more effectively by making structure, disagreement, and uncertainty explicit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be adapted to other fields where evidence quality varies, such as legal document analysis or scientific hypothesis testing.
Integrating this with retrieval systems that prioritize high-reliability sources might further enhance results.
Testing the framework on questions requiring multi-hop reasoning across documents would reveal its limits in handling complex evidence chains.

Load-bearing premise

Retrieved documents can be reliably converted into structured evidence objects that accurately record clinical attributes, source quality, question relevance, and support strength without introducing new errors or biases.

What would settle it

An experiment showing that errors introduced during the structuring of evidence lead to overall performance worse than using unstructured text as context.

Figures

Figures reproduced from arXiv: 2605.17435 by Chang Zong, Hao Ning, Jian Wan, Jie Huang, Siliang Tang.

**Figure 2.** Overview of BELIEF. Retrieved biomedical literature is converted into structured evidence objects, processed by symbolic D-S fusion and neural ...

**Figure 3.** Ablation results of BELIEF across different backbone models. Each radar plot corresponds to one backbone model, each vertex denotes the full model ...

**Figure 4.** Divergent Acc of arbitration on cases where the D-S and LLM paths ...

**Figure 5.** Sensitivity analysis of BELIEF with respect to retrieval depth ...

**Figure 6.** Illustration of complementary reasoning behaviors and reliability-aware arbitration in BELIEF. The upper panel shows a reasoning–answer mismatch in ...

Original abstract

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster--Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
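
The abstract names four signals used by the arbitration module (belief strength, uncertainty, evidence reliability, semantic consistency) but this page does not give the rule that combines them. The Python sketch below is therefore purely illustrative: the `SymbolicResult` fields, the three thresholds, and the use of simple answer agreement as a stand-in for semantic consistency are all assumptions, not the paper's method.

```python
from dataclasses import dataclass


@dataclass
class SymbolicResult:
    answer: str
    belief: float            # Bel(answer) after D-S fusion
    uncertainty: float       # mass left on the whole frame
    mean_reliability: float  # average reliability of the evidence used


def arbitrate(sym: SymbolicResult, llm_answer: str,
              min_belief: float = 0.6,
              max_uncertainty: float = 0.3,
              min_reliability: float = 0.5) -> str:
    """Pick a final answer from the two paths (illustrative rule only)."""
    if sym.answer == llm_answer:
        return sym.answer  # the paths agree, nothing to reconcile
    # On disagreement, trust the symbolic path only when it is confident
    # and built on evidence judged reliable; otherwise keep the LLM answer.
    trusted = (sym.belief >= min_belief
               and sym.uncertainty <= max_uncertainty
               and sym.mean_reliability >= min_reliability)
    return sym.answer if trusted else llm_answer


confident = SymbolicResult("yes", belief=0.8, uncertainty=0.1, mean_reliability=0.7)
hesitant = SymbolicResult("yes", belief=0.4, uncertainty=0.5, mean_reliability=0.7)
```

With these hypothetical thresholds, `arbitrate(confident, "no")` returns the symbolic answer, while `arbitrate(hesitant, "no")` falls back to the LLM answer. A threshold rule is only one possible design; a learned or score-weighted arbitration would fit the abstract's description equally well.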

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BELIEF structures evidence into objects for D-S fusion plus LLM arbitration in biomedical QA and reports top results in most settings, but the structuring step lacks validation that could affect the gains.

Hi there,

The punchline on this one is that BELIEF structures retrieved documents into evidence objects with clinical attributes, quality, relevance and support details, then runs them through a Dempster-Shafer symbolic fusion path and an LLM neural path, arbitrating between them based on belief strength and reliability. It claims the best score in 25 out of 30 settings across three datasets and five LLM backbones. The new part is the specific hybrid with D-S theory applied to these structured objects and the arbitration module that reconciles the two paths using uncertainty and consistency measures.

The paper does a decent job showing that this explicit handling improves utilization of retrieved evidence. The ablations, complementarity checks, and uncertainty-stratified analyses add credibility to the claims. Comparisons with domain-specific models are also included, which helps place the results.

The main soft spot is the evidence structuring process itself. The description is there, but there's little quantitative validation of how accurately those objects capture the source material or whether the conversion introduces systematic errors. If the structuring uses an LLM prompt, any issues there feed into both paths and the fusion might just be papering over them. The reported results also lack error bars or details on potential post-hoc decisions in the evaluation, which weakens the ability to judge the performance claims fully. Still, the approach doesn't seem circular or self-referential in an obvious way.

This work is aimed at people building retrieval-augmented systems for biomedical QA, especially those interested in making uncertainty explicit. A reader working on hybrid reasoning or medical applications would get practical value from the framework and the analyses.

It should go to peer review. The idea tackles a real problem in the area and the experiments are extensive enough to merit referee input, even if some methodological details need tightening.

Cheers

Referee Report

2 major / 2 minor

Summary. The manuscript presents BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Retrieved documents are converted into evidence objects recording clinical attributes, source quality, relevance, support strength, and candidate hypothesis. These support a symbolic path using Dempster-Shafer theory for belief and uncertainty estimation, and a neural path with LLM semantic inference. An arbitration module reconciles the two based on belief, uncertainty, reliability, and consistency. Empirical results on PubMedQA, MedQA, and MedMCQA with five LLMs show superiority in 25 of 30 settings, with ablations and analyses supporting improved evidence utilization.

Significance. The integration of symbolic uncertainty modeling with neural inference in a biomedical QA setting addresses a key challenge in retrieval-augmented generation where evidence quality varies. The extensive evaluation across datasets and backbones, along with uncertainty-stratified analysis, provides a solid basis for assessing the approach. If the evidence structuring proves reliable, this could influence future work on hybrid reasoning systems. The lack of quantitative validation for the structuring step, however, limits the current assessment of its broader impact.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The headline result that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings is presented without error bars, confidence intervals, or statistical significance tests. This makes it difficult to determine whether the observed improvements are robust or could be due to variance in LLM outputs or evaluation choices.
[Method] Method section (evidence object construction): The conversion of retrieved documents into structured evidence objects is described as recording clinical attributes, source quality, question relevance, and support strength, but no quantitative validation (e.g., human agreement rates, error analysis, or bias audit) is provided. Since this step feeds both the D-S BPA construction and the neural path, unmeasured errors here could confound the attribution of gains to the uncertainty-aware fusion and arbitration.

minor comments (2)

[Related Work] Related Work: Consider adding references to recent works on uncertainty estimation in LLMs for QA to better contextualize the contribution.
[Figure 2] Figure 2: The overview figure would benefit from annotations indicating the flow from evidence objects to symbolic and neural paths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of result presentation and methodological validation. We address each major comment below and have made targeted revisions to improve the manuscript.

Referee: [Abstract and Experiments] Abstract and Experiments section: The headline result that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings is presented without error bars, confidence intervals, or statistical significance tests. This makes it difficult to determine whether the observed improvements are robust or could be due to variance in LLM outputs or evaluation choices.

Authors: We agree that error bars and statistical tests would better demonstrate robustness against LLM output variance. In the revised manuscript we have added standard deviations computed over three independent runs to the main result tables and included McNemar's tests for pairwise significance between BELIEF and baselines, with p-values reported in the Experiments section. (Revision: yes)

Referee: [Method] Method section (evidence object construction): The conversion of retrieved documents into structured evidence objects is described as recording clinical attributes, source quality, question relevance, and support strength, but no quantitative validation (e.g., human agreement rates, error analysis, or bias audit) is provided. Since this step feeds both the D-S BPA construction and the neural path, unmeasured errors here could confound the attribution of gains to the uncertainty-aware fusion and arbitration.

Authors: The referee correctly notes the absence of quantitative validation for evidence structuring. We have added a human evaluation subsection reporting inter-annotator agreement (Cohen's kappa = 0.76) on a 200-instance sample in the revised Experiments section. A comprehensive bias audit remains outside the current scope and is listed as future work in the Discussion. (Revision: partial)
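
The two statistics named in the simulated rebuttal are easy to compute from raw counts, which is worth keeping in mind when judging whether such additions are a heavy ask. The Python sketch below uses only the standard library and shows an exact two-sided McNemar test and Cohen's kappa. The function names and the toy labels are hypothetical, and nothing here reproduces the figures quoted in the rebuttal.

```python
from collections import Counter
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items system A answered correctly and system B did not; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # Under the null, the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def cohens_kappa(labels_1: list, labels_2: list) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_1)
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(freq_1[k] * freq_2[k] for k in freq_1) / (n * n)
    return (observed - expected) / (1.0 - expected)


p_value = mcnemar_exact(10, 2)  # toy discordant counts
kappa = cohens_kappa(["s", "s", "r", "r"], ["s", "s", "r", "s"])  # toy labels
```

McNemar's test looks only at the questions where the two systems disagree, which is why it suits paired comparisons on a shared test set. Kappa corrects raw agreement for the agreement expected by chance, so it depends on how skewed each annotator's label distribution is.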

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirical framework that converts retrieved documents into structured evidence objects, applies standard Dempster-Shafer theory for symbolic belief fusion on one path, performs LLM semantic inference on a parallel neural path, and reconciles outputs via a reliability-aware arbitration module. All reported gains are measured through external experiments on PubMedQA, MedQA, and MedMCQA using five independent LLM backbones, with no closed-form derivation or equation chain that reduces the final performance numbers to fitted parameters or self-referential definitions by construction. The symbolic component invokes established D-S operations on externally supplied evidence attributes rather than deriving those attributes from the fusion result itself, rendering the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides insufficient detail to enumerate specific free parameters or axioms; the framework implicitly relies on the assumptions that evidence structuring preserves semantic fidelity and that D-S theory can be applied directly to the constructed basic probability assignments.

axioms (1)

Domain assumption: Retrieved documents can be converted into evidence objects that faithfully record relevance, quality, and support without substantial information loss or bias.
Central to the entire pipeline; stated in the description of converting documents to evidence objects.

pith-pipeline@v0.9.0 · 5826 in / 1354 out tokens · 45675 ms · 2026-05-20T13:09:31.495747+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

BELIEF converts retrieved documents into evidence objects that record clinical attributes, source quality, question relevance, support strength... constructs reliability-weighted basic probability assignments based on Dempster–Shafer (D-S) theory... reliability-aware arbitration module

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

[1]

Language mod- els are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[2]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474

work page 2020
[4]

Retrieval-augmented generation for large language models: A survey,

Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2024

work page 2024
[5]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[6]

Evidence based medicine: what it is and what it isn’t,

D. L. Sackett, W. M. Rosenberg, J. M. Gray, R. B. Haynes, and W. S. Richardson, “Evidence based medicine: what it is and what it isn’t,” pp. 71–72, 1996

work page 1996
[7]

Grade guidelines: 1. introduction—grade evidence profiles and summary of findings tables,

G. Guyatt, A. D. Oxman, E. A. Akl, R. Kunz, G. Vist, J. Brozek, S. Nor- ris, Y . Falck-Ytter, P. Glasziou, H. DeBeeret al., “Grade guidelines: 1. introduction—grade evidence profiles and summary of findings tables,” Journal of clinical epidemiology, vol. 64, no. 4, pp. 383–394, 2011

work page 2011
[8]

Available: https://arxiv.org/abs/2503.05777

Y . Kim, H. Jeong, S. Chen, S. S. Li, C. Park, M. Lu, K. Al- hamoud, J. Mun, C. Grau, M. Junget al., “Medical hallucinations in foundation models and their impact on healthcare,”arXiv preprint arXiv:2503.05777, 2025

work page arXiv 2025
[9]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qinet al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025

work page 2025
[10]

Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,

F. Xu, Q. Lin, J. Han, T. Zhao, J. Liu, and E. Cambria, “Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620–1634, 2025

work page 2025
[11]

Final: Combining first-order logic with natural logic for question answering,

J. Shi, X. Ding, S. C. Hui, Y . Yan, H. Zhao, T. Liu, and B. Qin, “Final: Combining first-order logic with natural logic for question answering,” IEEE Transactions on Knowledge and Data Engineering, 2025

work page 2025
[12]

Generalized divergence-based deci- sion making method with an application to pattern classification,

F. Xiao, J. Wen, and W. Pedrycz, “Generalized divergence-based deci- sion making method with an application to pattern classification,”IEEE transactions on knowledge and data engineering, vol. 35, no. 7, pp. 6941–6956, 2022

work page 2022
[13]

Upper and lower probabilities induced by a multivalued mapping,

A. P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” inClassic works of the Dempster-Shafer theory of belief functions. Springer, 2008, pp. 57–72

work page 2008
[14]

Shafer,A Mathematical Theory of Evidence

G. Shafer,A Mathematical Theory of Evidence. Princeton University Press, 1976

work page 1976
[15]

Knowledge graph neural network with spatial-aware capsule for drug-drug inter- action prediction,

X. Su, B. Zhao, G. Li, J. Zhang, P. Hu, Z. You, and L. Hu, “Knowledge graph neural network with spatial-aware capsule for drug-drug inter- action prediction,”IEEE journal of biomedical and health informatics, vol. 29, no. 3, pp. 1771–1781, 2024

work page 2024
[16]

Biomedical question answering: a survey of approaches and challenges,

Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, and S. Yu, “Biomedical question answering: a survey of approaches and challenges,”ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–36, 2022

work page 2022
[17]

Pubmedqa: A dataset for biomedical research question answering,

Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577

work page 2019
[18]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,”Applied Sciences, vol. 11, no. 14, p. 6421, 2021

work page 2021
[19]

Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,

A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,” inConference on health, inference, and learning. PMLR, 2022, pp. 248–260

work page 2022
[20]

Biomistral: A collection of open-source pretrained large language models for medical domains,

Y . Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, “Biomistral: A collection of open-source pretrained large language models for medical domains,” inFindings of the association for computational linguistics: acl 2024, 2024, pp. 5848–5864

work page 2024
[21]

Meditron-70b: Scaling medical pretraining for large language models,

Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. K ¨opf, A. Mohtashamiet al., “Meditron-70b: Scaling medical pretraining for large language models,” 2023

work page 2023
[22]

Toward expert- level medical question answering with large language models,

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert- level medical question answering with large language models,”Nature medicine, vol. 31, no. 3, pp. 943–950, 2025

work page 2025
[23]

A survey on rag meeting llms: Towards retrieval-augmented large language models,

W. Fan, Y . Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” inProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, 2024, pp. 6491–6501

work page 2024
[24]

RAT: Retrieval augmented thoughts elicit context-aware reasoning and verification in long-horizon generation,

Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y . Liang, “RAT: Retrieval augmented thoughts elicit context-aware reasoning and verification in long-horizon generation,” inNeurIPS 2024 Workshop on Open-World Agents, 2024. [Online]. Available: https: //openreview.net/forum?id=5QtKMjNkjL

work page 2024
[25]

Corrective Retrieval Augmented Generation

S.-Q. Yan, J.-C. Gu, Y . Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,”arXiv preprint arXiv:2401.15884, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Enhancing large language models reasoning via multi-path optimization on knowledge graph,

J. Liao, C. Liu, Y . Ding, H. Wang, Z. Tang, K. Li, and K. Li, “Enhancing large language models reasoning via multi-path optimization on knowledge graph,”IEEE Transactions on Knowledge and Data Engineering, 2026

work page 2026
[27]

A framework of knowledge graph-enhanced large language model based on global planning,

Y . Li, D. Song, Y . Tian, H. Wang, C. Zhou, and S. Zhang, “A framework of knowledge graph-enhanced large language model based on global planning,”IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 736–748, 2025

work page 2025
[28]

The prisma 2020 statement: an updated guideline for reporting systematic reviews,

M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, R. Chou, J. Glanville, J. M. Grimshaw, A. Hr ´objartsson, M. M. Lalu, T. Li, E. W. Loder, E. Mayo-Wilson, S. McDonald, L. A. McGuinness, L. A. Stewart, J. Thomas, A. C. Tricco, V . A. Welch, P. Whiting, and D. Moher, ...

work page 2020
[29]

The well-built clinical question: a key to evidence-based decisions

W. S. Richardson, M. C. Wilson, J. Nishikawa, and R. S. Hayward, “The well-built clinical question: a key to evidence-based decisions.” ACP journal club, vol. 123, no. 3, pp. A12–3, 1995

work page 1995
[30]

A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature,

B. Nye, J. J. Li, R. Patel, Y . Yang, I. Marshall, A. Nenkova, and B. C. Wallace, “A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 197–207

work page 2018
[31]

Evaluation of pico as a knowledge representation for clinical questions,

X. Huang, J. Lin, and D. Demner-Fushman, “Evaluation of pico as a knowledge representation for clinical questions,” inAMIA annual symposium proceedings, vol. 2006, 2006, p. 359

work page 2006
[32]

Inferring which medical treatments work from reports of clinical trials,

E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace, “Inferring which medical treatments work from reports of clinical trials,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3705–3717

work page 2019
[33]

Robotreviewer: evaluation of a system for automatically assessing bias in clinical trials,

I. J. Marshall, J. Kuiper, and B. C. Wallace, “Robotreviewer: evaluation of a system for automatically assessing bias in clinical trials,”Journal of the American Medical Informatics Association, vol. 23, no. 1, pp. 193–201, 2016

work page 2016
[34]

Tri- alstreamer: A living, automatically updated database of clinical trial reports,

I. J. Marshall, B. Nye, J. Kuiper, A. Noel-Storr, R. Marshall, R. Maclean, F. Soboczenski, A. Nenkova, J. Thomas, and B. C. Wallace, “Tri- alstreamer: A living, automatically updated database of clinical trial reports,”Journal of the American Medical Informatics Association, vol. 27, no. 12, pp. 1903–1912, 2020

work page 1903
[35]

Empowering ex- plainable artificial intelligence through case-based reasoning: A com- prehensive exploration,

P. Pradeep, M. Caro-Mart ´ınez, and A. Wijekoon, “Empowering ex- plainable artificial intelligence through case-based reasoning: A com- prehensive exploration,”IEEE Transactions on Knowledge and Data Engineering, 2025

work page 2025
[36]

Combination of evidence in dempster-shafer theory,

K. Sentz and S. Ferson, “Combination of evidence in dempster-shafer theory,” Sandia National Laboratories, Tech. Rep., 2002. 14

work page 2002
[37]

Inconsistency elimination of multi-source information fusion in smart home using the dempster-shafer evidence theory,

S. Li, H. Xu, J. Xu, X. Li, Y . Wang, J. Zeng, J. Li, X. Li, Y . Li, and W. Ai, “Inconsistency elimination of multi-source information fusion in smart home using the dempster-shafer evidence theory,”Information Processing & Management, vol. 61, no. 4, p. 103723, 2024

work page 2024
[38]

Neural-symbolic learning and reasoning: Contributions and challenges

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.

[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

[3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.

[4] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2024.

[5] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” in The Twelfth International Conference on Learning Representations, 2024.

[6] D. L. Sackett, W. M. Rosenberg, J. M. Gray, R. B. Haynes, and W. S. Richardson, “Evidence based medicine: what it is and what it isn’t,” pp. 71–72, 1996.

[7] G. Guyatt, A. D. Oxman, E. A. Akl, R. Kunz, G. Vist, J. Brozek, S. Norris, Y. Falck-Ytter, P. Glasziou, H. DeBeer et al., “Grade guidelines: 1. introduction—grade evidence profiles and summary of findings tables,” Journal of clinical epidemiology, vol. 64, no. 4, pp. 383–394, 2011.

[8] Y. Kim, H. Jeong, S. Chen, S. S. Li, C. Park, M. Lu, K. Alhamoud, J. Mun, C. Grau, M. Jung et al., “Medical hallucinations in foundation models and their impact on healthcare,” arXiv preprint arXiv:2503.05777, 2025. Available: https://arxiv.org/abs/2503.05777

[9] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025.

[10] F. Xu, Q. Lin, J. Han, T. Zhao, J. Liu, and E. Cambria, “Are large language models really good logical reasoners? a comprehensive evaluation and beyond,” IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620–1634, 2025.

[11] J. Shi, X. Ding, S. C. Hui, Y. Yan, H. Zhao, T. Liu, and B. Qin, “Final: Combining first-order logic with natural logic for question answering,” IEEE Transactions on Knowledge and Data Engineering, 2025.

[12] F. Xiao, J. Wen, and W. Pedrycz, “Generalized divergence-based decision making method with an application to pattern classification,” IEEE transactions on knowledge and data engineering, vol. 35, no. 7, pp. 6941–6956, 2022.

[13] A. P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” in Classic works of the Dempster-Shafer theory of belief functions. Springer, 2008, pp. 57–72.

[14] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, 1976.

[15] X. Su, B. Zhao, G. Li, J. Zhang, P. Hu, Z. You, and L. Hu, “Knowledge graph neural network with spatial-aware capsule for drug-drug interaction prediction,” IEEE journal of biomedical and health informatics, vol. 29, no. 3, pp. 1771–1781, 2024.

[16] Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, and S. Yu, “Biomedical question answering: a survey of approaches and challenges,” ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–36, 2022.

[17] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577.

[18] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences, vol. 11, no. 14, p. 6421, 2021.

[19] A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering,” in Conference on health, inference, and learning. PMLR, 2022, pp. 248–260.

[20] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, “Biomistral: A collection of open-source pretrained large language models for medical domains,” in Findings of the association for computational linguistics: acl 2024, 2024, pp. 5848–5864.

[21] Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. Köpf, A. Mohtashami et al., “Meditron-70b: Scaling medical pretraining for large language models,” 2023.

[22] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewis et al., “Toward expert-level medical question answering with large language models,” Nature medicine, vol. 31, no. 3, pp. 943–950, 2025.

[23] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” in Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, 2024, pp. 6491–6501.

[24] Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “RAT: Retrieval augmented thoughts elicit context-aware reasoning and verification in long-horizon generation,” in NeurIPS 2024 Workshop on Open-World Agents, 2024. [Online]. Available: https://openreview.net/forum?id=5QtKMjNkjL

[25] S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,” arXiv preprint arXiv:2401.15884, 2024.

[26] J. Liao, C. Liu, Y. Ding, H. Wang, Z. Tang, K. Li, and K. Li, “Enhancing large language models reasoning via multi-path optimization on knowledge graph,” IEEE Transactions on Knowledge and Data Engineering, 2026.

[27] Y. Li, D. Song, Y. Tian, H. Wang, C. Zhou, and S. Zhang, “A framework of knowledge graph-enhanced large language model based on global planning,” IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 736–748, 2025.

[28] M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, R. Chou, J. Glanville, J. M. Grimshaw, A. Hróbjartsson, M. M. Lalu, T. Li, E. W. Loder, E. Mayo-Wilson, S. McDonald, L. A. McGuinness, L. A. Stewart, J. Thomas, A. C. Tricco, V. A. Welch, P. Whiting, and D. Moher, “The prisma 2020 statement: an updated guideline for reporting systematic reviews,” ...

[29] W. S. Richardson, M. C. Wilson, J. Nishikawa, and R. S. Hayward, “The well-built clinical question: a key to evidence-based decisions,” ACP journal club, vol. 123, no. 3, pp. A12–3, 1995.

[30] B. Nye, J. J. Li, R. Patel, Y. Yang, I. Marshall, A. Nenkova, and B. C. Wallace, “A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 197–207.

[31] X. Huang, J. Lin, and D. Demner-Fushman, “Evaluation of pico as a knowledge representation for clinical questions,” in AMIA annual symposium proceedings, vol. 2006, 2006, p. 359.

[32] E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace, “Inferring which medical treatments work from reports of clinical trials,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3705–3717.

[33] I. J. Marshall, J. Kuiper, and B. C. Wallace, “Robotreviewer: evaluation of a system for automatically assessing bias in clinical trials,” Journal of the American Medical Informatics Association, vol. 23, no. 1, pp. 193–201, 2016.

[34] I. J. Marshall, B. Nye, J. Kuiper, A. Noel-Storr, R. Marshall, R. Maclean, F. Soboczenski, A. Nenkova, J. Thomas, and B. C. Wallace, “Trialstreamer: A living, automatically updated database of clinical trial reports,” Journal of the American Medical Informatics Association, vol. 27, no. 12, pp. 1903–1912, 2020.

[35] P. Pradeep, M. Caro-Martínez, and A. Wijekoon, “Empowering explainable artificial intelligence through case-based reasoning: A comprehensive exploration,” IEEE Transactions on Knowledge and Data Engineering, 2025.

[36] K. Sentz and S. Ferson, “Combination of evidence in dempster-shafer theory,” Sandia National Laboratories, Tech. Rep., 2002.

[37] S. Li, H. Xu, J. Xu, X. Li, Y. Wang, J. Zeng, J. Li, X. Li, Y. Li, and W. Ai, “Inconsistency elimination of multi-source information fusion in smart home using the dempster-shafer evidence theory,” Information Processing & Management, vol. 61, no. 4, p. 103723, 2024.

[38] A. S. d. Garcez, T. R. Besold, L. De Raedt, P. Földiak, P. Hitzler, T. Icard, K.-U. Kühnberger, L. C. Lamb, R. Miikkulainen, and D. L. Silver, “Neural-symbolic learning and reasoning: Contributions and challenges,” in AAAI Spring Symposia, 2015, pp. 18–21.

[39] A. d. Garcez and L. C. Lamb, “Neurosymbolic ai: The 3rd wave,” Artificial Intelligence Review, vol. 56, no. 11, pp. 12 387–12 406, 2023.

[40] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 824–24 837.

[41] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023.

[42] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023.

[43] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024.

[44] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.

[45] OpenAI, “Gpt-4o mini: Advancing cost-efficient intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.

[46] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao et al., “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025.

[47] M. Jeong, J. Sohn, M. Sung, and J. Kang, “Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models,” Bioinformatics, vol. 40, no. Supplement 1, pp. i119–i129, 2024.

[48] K. Zhang, S. Zeng, E. Hua, N. Ding, Z.-R. Chen, Z. Ma, H. Li, G. Cui, B. Qi, X. Zhu et al., “Ultramedical: Building specialized generalists in biomedicine,” Advances in Neural Information Processing Systems, vol. 37, pp. 26 045–26 081, 2024.

[49] J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, and B. Wang, “Towards medical complex reasoning with LLMs through medical verifiable problems,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 14 552–14 573.

[50] A. Sallinen, A.-J. Solergibert, M. Zhang, G. Boyé, M. Dupont-Roc, X. Theimer-Lienhard, E. Boisson, B. Bernath, H. Hadhri, A. Tran et al., “Llama-3-meditron: An open-weight suite of medical llms based on llama-3.1,” in Workshop on Large Language Models and Generative AI for Health at AAAI 2025, 2025.