pith. sign in

arxiv: 2605.17435 · v1 · pith:NYWZCKOZnew · submitted 2026-05-17 · 💻 cs.CL

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

Pith reviewed 2026-05-20 13:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords biomedical question answeringstructured evidenceuncertainty modelingDempster-Shafer theorylarge language modelsevidence fusionretrieval-augmented generation
0
0 comments X

The pith

BELIEF converts retrieved documents into structured evidence to improve biomedical question answering by fusing symbolic and neural reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that biomedical question answering can be improved by explicitly modeling the reliability and uncertainty in retrieved literature rather than feeding it as flat text to large language models. It does this by turning documents into evidence objects that capture clinical attributes, source quality, relevance, and support strength. These objects then support two paths: one using Dempster-Shafer theory to compute belief and uncertainty symbolically, and another using the LLM for semantic inference. An arbitration module then combines them based on consistency and reliability. This approach leads to better performance on standard biomedical QA benchmarks with various LLM backbones.

Core claim

BELIEF structures retrieved literature into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These objects enable reliability-weighted basic probability assignments via Dempster-Shafer theory for symbolic evidence fusion to estimate belief and residual uncertainty, while the same objects support LLM-based semantic inference; a reliability-aware arbitration module then reconciles the two outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency.

What carries the argument

Structured evidence objects that provide a shared basis for symbolic Dempster-Shafer fusion and neural semantic inference, reconciled by a reliability-aware arbitration module.

If this is right

  • Evidence reliability becomes an explicit factor in the final answer selection.
  • The system can quantify and report residual uncertainty in its decisions.
  • Performance gains appear across different general-purpose LLM backbones without domain-specific pretraining.
  • Retrieved evidence is utilized more effectively by making structure, disagreement, and uncertainty explicit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be adapted to other fields where evidence quality varies, such as legal document analysis or scientific hypothesis testing.
  • Integrating this with retrieval systems that prioritize high-reliability sources might further enhance results.
  • Testing the framework on questions requiring multi-hop reasoning across documents would reveal its limits in handling complex evidence chains.

Load-bearing premise

Retrieved documents can be reliably converted into structured evidence objects that accurately record clinical attributes, source quality, question relevance, and support strength without introducing new errors or biases.

What would settle it

An experiment showing that errors introduced during the structuring of evidence lead to overall performance worse than using unstructured text as context.

Figures

Figures reproduced from arXiv: 2605.17435 by Chang Zong, Hao Ning, Jian Wan, Jie Huang, Siliang Tang.

Figure 1
Figure 1. Figure 1: Traditional RAG treats retrieved evidence as flat context, whereas [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of BELIEF. Retrieved biomedical literature is converted into structured evidence objects, processed by symbolic D-S fusion and neural [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ablation results of BELIEF across different backbone models. Each radar plot corresponds to one backbone model, each vertex denotes the full model [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Divergent Acc of arbitration on cases where the D-S and LLM paths [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Sensitivity analysis of BELIEF with respect to retrieval depth [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of complementary reasoning behaviors and reliability-aware arbitration in BELIEF. The upper panel shows a reasoning–answer mismatch in [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster--Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Retrieved documents are converted into evidence objects recording clinical attributes, source quality, relevance, support strength, and candidate hypothesis. These support a symbolic path using Dempster-Shafer theory for belief and uncertainty estimation, and a neural path with LLM semantic inference. An arbitration module reconciles the two based on belief, uncertainty, reliability, and consistency. Empirical results on PubMedQA, MedQA, and MedMCQA with five LLMs show superiority in 25 of 30 settings, with ablations and analyses supporting improved evidence utilization.

Significance. The integration of symbolic uncertainty modeling with neural inference in a biomedical QA setting addresses a key challenge in retrieval-augmented generation where evidence quality varies. The extensive evaluation across datasets and backbones, along with uncertainty-stratified analysis, provides a solid basis for assessing the approach. If the evidence structuring proves reliable, this could influence future work on hybrid reasoning systems. The lack of quantitative validation for the structuring step, however, limits the current assessment of its broader impact.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The headline result that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings is presented without error bars, confidence intervals, or statistical significance tests. This makes it difficult to determine whether the observed improvements are robust or could be due to variance in LLM outputs or evaluation choices.
  2. [Method] Method section (evidence object construction): The conversion of retrieved documents into structured evidence objects is described as recording clinical attributes, source quality, question relevance, and support strength, but no quantitative validation (e.g., human agreement rates, error analysis, or bias audit) is provided. Since this step feeds both the D-S BPA construction and the neural path, unmeasured errors here could confound the attribution of gains to the uncertainty-aware fusion and arbitration.
minor comments (2)
  1. [Related Work] Related Work: Consider adding references to recent works on uncertainty estimation in LLMs for QA to better contextualize the contribution.
  2. [Figure 1] Figure 1: The overview figure would benefit from annotations indicating the flow from evidence objects to symbolic and neural paths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of result presentation and methodological validation. We address each major comment below and have made targeted revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The headline result that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings is presented without error bars, confidence intervals, or statistical significance tests. This makes it difficult to determine whether the observed improvements are robust or could be due to variance in LLM outputs or evaluation choices.

    Authors: We agree that error bars and statistical tests would better demonstrate robustness against LLM output variance. In the revised manuscript we have added standard deviations computed over three independent runs to the main result tables and included McNemar's tests for pairwise significance between BELIEF and baselines, with p-values reported in the Experiments section. revision: yes

  2. Referee: [Method] Method section (evidence object construction): The conversion of retrieved documents into structured evidence objects is described as recording clinical attributes, source quality, question relevance, and support strength, but no quantitative validation (e.g., human agreement rates, error analysis, or bias audit) is provided. Since this step feeds both the D-S BPA construction and the neural path, unmeasured errors here could confound the attribution of gains to the uncertainty-aware fusion and arbitration.

    Authors: The referee correctly notes the absence of quantitative validation for evidence structuring. We have added a human evaluation subsection reporting inter-annotator agreement (Cohen's kappa = 0.76) on a 200-instance sample in the revised Experiments section. A comprehensive bias audit remains outside the current scope and is listed as future work in the Discussion. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an empirical framework that converts retrieved documents into structured evidence objects, applies standard Dempster-Shafer theory for symbolic belief fusion on one path, performs LLM semantic inference on a parallel neural path, and reconciles outputs via a reliability-aware arbitration module. All reported gains are measured through external experiments on PubMedQA, MedQA, and MedMCQA using five independent LLM backbones, with no closed-form derivation or equation chain that reduces the final performance numbers to fitted parameters or self-referential definitions by construction. The symbolic component invokes established D-S operations on externally supplied evidence attributes rather than deriving those attributes from the fusion result itself, rendering the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the framework implicitly relies on the assumption that evidence structuring preserves semantic fidelity and that D-S theory can be applied directly to the constructed basic probability assignments.

axioms (1)
  • domain assumption Retrieved documents can be converted into evidence objects that faithfully record relevance, quality, and support without substantial information loss or bias.
    Central to the entire pipeline; stated in the description of converting documents to evidence objects.

pith-pipeline@v0.9.0 · 5826 in / 1354 out tokens · 45675 ms · 2026-05-20T13:09:31.495747+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

  1. [1]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  3. [3]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474

  4. [4]

    Retrieval-augmented generation for large language models: A survey,

    Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2024

  5. [5]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection,

    A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” inThe Twelfth International Conference on Learning Representations, 2024

  6. [6]

    Evidence based medicine: what it is and what it isn’t,

    D. L. Sackett, W. M. Rosenberg, J. M. Gray, R. B. Haynes, and W. S. Richardson, “Evidence based medicine: what it is and what it isn’t,” pp. 71–72, 1996

  7. [7]

    Grade guidelines: 1. introduction—grade evidence profiles and summary of findings tables,

    G. Guyatt, A. D. Oxman, E. A. Akl, R. Kunz, G. Vist, J. Brozek, S. Nor- ris, Y . Falck-Ytter, P. Glasziou, H. DeBeeret al., “Grade guidelines: 1. introduction—grade evidence profiles and summary of findings tables,” Journal of clinical epidemiology, vol. 64, no. 4, pp. 383–394, 2011

  8. [8]

    Available: https://arxiv.org/abs/2503.05777

    Y . Kim, H. Jeong, S. Chen, S. S. Li, C. Park, M. Lu, K. Al- hamoud, J. Mun, C. Grau, M. Junget al., “Medical hallucinations in foundation models and their impact on healthcare,”arXiv preprint arXiv:2503.05777, 2025

  9. [9]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qinet al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025

  10. [10]

    Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,

    F. Xu, Q. Lin, J. Han, T. Zhao, J. Liu, and E. Cambria, “Are large lan- guage models really good logical reasoners? a comprehensive evaluation and beyond,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620–1634, 2025

  11. [11]

    Final: Combining first-order logic with natural logic for question answering,

    J. Shi, X. Ding, S. C. Hui, Y . Yan, H. Zhao, T. Liu, and B. Qin, “Final: Combining first-order logic with natural logic for question answering,” IEEE Transactions on Knowledge and Data Engineering, 2025

  12. [12]

    Generalized divergence-based deci- sion making method with an application to pattern classification,

    F. Xiao, J. Wen, and W. Pedrycz, “Generalized divergence-based deci- sion making method with an application to pattern classification,”IEEE transactions on knowledge and data engineering, vol. 35, no. 7, pp. 6941–6956, 2022

  13. [13]

    Upper and lower probabilities induced by a multivalued mapping,

    A. P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” inClassic works of the Dempster-Shafer theory of belief functions. Springer, 2008, pp. 57–72

  14. [14]

    Shafer,A Mathematical Theory of Evidence

    G. Shafer,A Mathematical Theory of Evidence. Princeton University Press, 1976

  15. [15]

    Knowledge graph neural network with spatial-aware capsule for drug-drug inter- action prediction,

    X. Su, B. Zhao, G. Li, J. Zhang, P. Hu, Z. You, and L. Hu, “Knowledge graph neural network with spatial-aware capsule for drug-drug inter- action prediction,”IEEE journal of biomedical and health informatics, vol. 29, no. 3, pp. 1771–1781, 2024

  16. [16]

    Biomedical question answering: a survey of approaches and challenges,

    Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, and S. Yu, “Biomedical question answering: a survey of approaches and challenges,”ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–36, 2022

  17. [17]

    Pubmedqa: A dataset for biomedical research question answering,

    Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577

  18. [18]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,”Applied Sciences, vol. 11, no. 14, p. 6421, 2021

  19. [19]

    Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,

    A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,” inConference on health, inference, and learning. PMLR, 2022, pp. 248–260

  20. [20]

    Biomistral: A collection of open-source pretrained large language models for medical domains,

    Y . Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, “Biomistral: A collection of open-source pretrained large language models for medical domains,” inFindings of the association for computational linguistics: acl 2024, 2024, pp. 5848–5864

  21. [21]

    Meditron-70b: Scaling medical pretraining for large language models,

    Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. K ¨opf, A. Mohtashamiet al., “Meditron-70b: Scaling medical pretraining for large language models,” 2023

  22. [22]

    Toward expert- level medical question answering with large language models,

    K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert- level medical question answering with large language models,”Nature medicine, vol. 31, no. 3, pp. 943–950, 2025

  23. [23]

    A survey on rag meeting llms: Towards retrieval-augmented large language models,

    W. Fan, Y . Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” inProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, 2024, pp. 6491–6501

  24. [24]

    RAT: Retrieval augmented thoughts elicit context-aware reasoning and verification in long-horizon generation,

    Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y . Liang, “RAT: Retrieval augmented thoughts elicit context-aware reasoning and verification in long-horizon generation,” inNeurIPS 2024 Workshop on Open-World Agents, 2024. [Online]. Available: https: //openreview.net/forum?id=5QtKMjNkjL

  25. [25]

    Corrective Retrieval Augmented Generation

    S.-Q. Yan, J.-C. Gu, Y . Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,”arXiv preprint arXiv:2401.15884, 2024

  26. [26]

    Enhancing large language models reasoning via multi-path optimization on knowledge graph,

    J. Liao, C. Liu, Y . Ding, H. Wang, Z. Tang, K. Li, and K. Li, “Enhancing large language models reasoning via multi-path optimization on knowledge graph,”IEEE Transactions on Knowledge and Data Engineering, 2026

  27. [27]

    A framework of knowledge graph-enhanced large language model based on global planning,

    Y . Li, D. Song, Y . Tian, H. Wang, C. Zhou, and S. Zhang, “A framework of knowledge graph-enhanced large language model based on global planning,”IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 736–748, 2025

  28. [28]

    The prisma 2020 statement: an updated guideline for reporting systematic reviews,

    M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, R. Chou, J. Glanville, J. M. Grimshaw, A. Hr ´objartsson, M. M. Lalu, T. Li, E. W. Loder, E. Mayo-Wilson, S. McDonald, L. A. McGuinness, L. A. Stewart, J. Thomas, A. C. Tricco, V . A. Welch, P. Whiting, and D. Moher, ...

  29. [29]

    The well-built clinical question: a key to evidence-based decisions

    W. S. Richardson, M. C. Wilson, J. Nishikawa, and R. S. Hayward, “The well-built clinical question: a key to evidence-based decisions.” ACP journal club, vol. 123, no. 3, pp. A12–3, 1995

  30. [30]

    A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature,

    B. Nye, J. J. Li, R. Patel, Y . Yang, I. Marshall, A. Nenkova, and B. C. Wallace, “A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 197–207

  31. [31]

    Evaluation of pico as a knowledge representation for clinical questions,

    X. Huang, J. Lin, and D. Demner-Fushman, “Evaluation of pico as a knowledge representation for clinical questions,” inAMIA annual symposium proceedings, vol. 2006, 2006, p. 359

  32. [32]

    Inferring which medical treatments work from reports of clinical trials,

    E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace, “Inferring which medical treatments work from reports of clinical trials,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3705–3717

  33. [33]

    Robotreviewer: evaluation of a system for automatically assessing bias in clinical trials,

    I. J. Marshall, J. Kuiper, and B. C. Wallace, “Robotreviewer: evaluation of a system for automatically assessing bias in clinical trials,”Journal of the American Medical Informatics Association, vol. 23, no. 1, pp. 193–201, 2016

  34. [34]

    Tri- alstreamer: A living, automatically updated database of clinical trial reports,

    I. J. Marshall, B. Nye, J. Kuiper, A. Noel-Storr, R. Marshall, R. Maclean, F. Soboczenski, A. Nenkova, J. Thomas, and B. C. Wallace, “Tri- alstreamer: A living, automatically updated database of clinical trial reports,”Journal of the American Medical Informatics Association, vol. 27, no. 12, pp. 1903–1912, 2020

  35. [35]

    Empowering ex- plainable artificial intelligence through case-based reasoning: A com- prehensive exploration,

    P. Pradeep, M. Caro-Mart ´ınez, and A. Wijekoon, “Empowering ex- plainable artificial intelligence through case-based reasoning: A com- prehensive exploration,”IEEE Transactions on Knowledge and Data Engineering, 2025

  36. [36]

    Combination of evidence in dempster-shafer theory,

    K. Sentz and S. Ferson, “Combination of evidence in dempster-shafer theory,” Sandia National Laboratories, Tech. Rep., 2002. 14

  37. [37]

    Inconsistency elimination of multi-source information fusion in smart home using the dempster-shafer evidence theory,

    S. Li, H. Xu, J. Xu, X. Li, Y . Wang, J. Zeng, J. Li, X. Li, Y . Li, and W. Ai, “Inconsistency elimination of multi-source information fusion in smart home using the dempster-shafer evidence theory,”Information Processing & Management, vol. 61, no. 4, p. 103723, 2024

  38. [38]

    Neural-symbolic learning and reasoning: Contributions and challenges

    A. S. d. Garcez, T. R. Besold, L. De Raedt, P. F¨oldiak, P. Hitzler, T. Icard, K.-U. K ¨uhnberger, L. C. Lamb, R. Miikkulainen, and D. L. Silver, “Neural-symbolic learning and reasoning: Contributions and challenges.” inAAAI Spring Symposia, 2015, pp. 18–21

  39. [39]

    Neurosymbolic ai: The 3rd wave,

    A. d. Garcez and L. C. Lamb, “Neurosymbolic ai: The 3rd wave,” Artificial Intelligence Review, vol. 56, no. 11, pp. 12 387–12 406, 2023

  40. [40]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 824–24 837

  41. [41]

    Self-consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdh- ery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations, 2023

  42. [42]

    Re- flexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

  43. [43]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Weiet al., “Qwen2.5 technical report,”arXiv preprint arXiv:2412.15115, 2024

  44. [44]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

  45. [45]

    Gpt-4o mini: Advancing cost-efficient intelligence,

    OpenAI, “Gpt-4o mini: Advancing cost-efficient intelligence,” https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024

  46. [46]

    gpt-oss-120b & gpt-oss-20b Model Card

    S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Baoet al., “gpt-oss-120b & gpt-oss-20b model card,”arXiv preprint arXiv:2508.10925, 2025

  47. [47]

    Improving medical reason- ing through retrieval and self-reflection with retrieval-augmented large language models,

    M. Jeong, J. Sohn, M. Sung, and J. Kang, “Improving medical reason- ing through retrieval and self-reflection with retrieval-augmented large language models,”Bioinformatics, vol. 40, no. Supplement 1, pp. i119– i129, 2024

  48. [48]

    Ultramedical: Building specialized generalists in biomedicine,

    K. Zhang, S. Zeng, E. Hua, N. Ding, Z.-R. Chen, Z. Ma, H. Li, G. Cui, B. Qi, X. Zhuet al., “Ultramedical: Building specialized generalists in biomedicine,”Advances in Neural Information Processing Systems, vol. 37, pp. 26 045–26 081, 2024

  49. [49]

    Towards medical complex reasoning with LLMs through medical verifiable problems,

    J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, and B. Wang, “Towards medical complex reasoning with LLMs through medical verifiable problems,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 14 552–14 573

  50. [50]

    Llama-3-meditron: An open-weight suite of medical llms based on llama-3.1,

    A. Sallinen, A.-J. Solergibert, M. Zhang, G. Boy ´e, M. Dupont-Roc, X. Theimer-Lienhard, E. Boisson, B. Bernath, H. Hadhri, A. Tranet al., “Llama-3-meditron: An open-weight suite of medical llms based on llama-3.1,” inWorkshop on Large Language Models and Generative AI for Health at AAAI 2025, 2025