Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Pith reviewed 2026-05-14 21:40 UTC · model grok-4.3
The pith
Counterfactual edits to clinical findings let multi-agent LLMs test evidence support and raise diagnostic accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors argue that explicitly introducing counterfactual case editing to modify clinical findings, and computing the resulting probability gaps, lets multi-agent systems verify the evidence behind each diagnosis, improving both accuracy and interpretability in LLM-based clinical reasoning.
What carries the argument
The Counterfactual Probability Gap, a metric that calculates how much a diagnosis's confidence changes when specific clinical findings are edited or removed, thereby identifying which evidence supports or refutes each hypothesis.
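In its simplest plausible form, the gap is the difference between the model's confidence in a diagnosis on the original case and on the counterfactually edited case. A minimal sketch in Python, assuming a hypothetical query_confidence helper that elicits P(diagnosis | case text) from an LLM; the paper's actual elicitation mechanism is not reproduced here:

# Hypothetical sketch of a Counterfactual Probability Gap computation.
# query_confidence stands in for whatever mechanism elicits a confidence
# P(diagnosis | case text) from an LLM; it is an assumed helper, not the
# paper's implementation.
def counterfactual_probability_gap(query_confidence, case_text, edited_case, diagnosis):
    p_original = query_confidence(case_text, diagnosis)  # confidence on the original case
    p_edited = query_confidence(edited_case, diagnosis)  # confidence after the edit
    # A large positive gap suggests the edited finding was load-bearing support
    # for the diagnosis; a gap near zero suggests the finding was irrelevant.
    return p_original - p_edited

Aggregating such gaps over a battery of edits would then rank which findings carry each competing hypothesis.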
If this is right
- Accuracy improves across three diagnostic benchmarks and seven LLMs compared to prompting and prior multi-agent baselines.
- Gains are largest in complex and ambiguous diagnostic cases.
- Reasoning trajectories become more interpretable through explicit evidence testing.
- Human evaluations rate the outputs as more clinically useful, reliable, and coherent.
Where Pith is reading between the lines
- The method could be adapted to non-medical domains requiring hypothesis testing against alternatives, such as legal or scientific reasoning.
- Integrating this with real-time patient data might further validate the counterfactual assumptions in practice.
- It suggests a general strategy for reducing hallucination in LLM reasoning by enforcing evidence-grounded challenges.
Load-bearing premise
That counterfactual edits to clinical findings and the resulting probability gaps validly capture how individual pieces of evidence support or refute diagnoses in real clinical reasoning.
What would settle it
Apply the framework to clinical cases where counterfactual edits are constructed to contradict established medical knowledge and observe whether accuracy gains disappear or reverse.
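A minimal sketch of that falsification run, assuming hypothetical edit generators and a run_framework scorer that returns per-case correctness:

# Hypothetical falsification protocol: if the gap signal tracks real clinical
# causality, knowledge-contradicting edits should not yield the same gains.
def falsification_run(cases, make_valid_edits, make_contradictory_edits, run_framework):
    # run_framework(case, edits) -> True if the final diagnosis is correct.
    acc_valid = sum(run_framework(c, make_valid_edits(c)) for c in cases) / len(cases)
    acc_contra = sum(run_framework(c, make_contradictory_edits(c)) for c in cases) / len(cases)
    # If acc_contra tracks acc_valid, the gains likely come from extra
    # discussion depth rather than evidence-grounded counterfactual reasoning.
    return acc_valid, acc_contra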
Original abstract
Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a counterfactual multi-agent diagnostic framework for LLMs in clinical settings. It introduces counterfactual editing of clinical findings and defines a Counterfactual Probability Gap to quantify how individual findings support or refute competing diagnoses. These signals drive multi-round specialist agent discussions to challenge hypotheses and refine outputs. Experiments across three diagnostic benchmarks and seven LLMs report consistent accuracy gains over standard prompting and prior multi-agent baselines, with larger improvements in complex/ambiguous cases, plus positive human evaluations of reasoning quality.
Significance. If the gains are robust and attributable to the counterfactual mechanism rather than prompting depth, the work could meaningfully advance interpretable LLM-based clinical decision support by explicitly incorporating hypothesis-testing practices from medical training.
Major comments (2)
- [Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.
- [Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.
Minor comments (1)
- [Abstract] The abstract and method would be clearer if the three benchmarks and seven LLMs were named explicitly rather than left generic.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation of the Counterfactual Probability Gap and the experimental rigor. We address each point below and commit to revisions that directly respond to the concerns while preserving the core contributions of the work.
Point-by-point responses
Referee: [Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.
Authors: We agree that explicit validation of the Counterfactual Probability Gap against clinician judgments would strengthen the central claim. The existing human evaluation focuses on overall reasoning quality, clinical usefulness, and coherence but does not isolate ratings for the gap metric itself. In revision we will add (1) a direct comparison of the gap to standard differential-diagnosis heuristics drawn from the medical literature, (2) controls that test gap behavior on synthetic cases with known causal structure to check for LLM-prior artifacts, and (3) an expanded clinician rating task that specifically scores the evidential-strength signals produced by the gap. These additions do not require an entirely new large-scale clinician study.
Revision: partial
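A minimal sketch of the synthetic control promised in (2), assuming templated cases in which finding_a is causally relevant by construction and finding_b is not (all names hypothetical):

# Hypothetical prior-artifact control on synthetic cases with known causal
# structure. gap_fn(case_text, finding, diagnosis) returns the confidence
# shift when that finding is counterfactually removed.
def prior_artifact_check(gap_fn, synthetic_cases, tolerance=0.05):
    flagged = []
    for case in synthetic_cases:
        gap_irrelevant = gap_fn(case.text, case.finding_b, case.diagnosis)
        # By construction finding_b cannot affect the diagnosis, so a sizeable
        # gap here indicates the metric is reading out LLM priors, not causality.
        if abs(gap_irrelevant) > tolerance:
            flagged.append(case)
    return flagged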
Referee: [Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.
Authors: We acknowledge that the current manuscript lacks statistical tests, variance reporting, and targeted ablations. In the revised version we will (1) re-run all experiments across multiple random seeds and report means with standard deviations, (2) include paired statistical tests (e.g., McNemar or Wilcoxon) with p-values for all accuracy comparisons, (3) document exact baseline re-implementations, including prompt templates and discussion-round counts, and (4) add an ablation that replaces the Counterfactual Probability Gap signal with random values or a fixed-discussion-round baseline to isolate its contribution. These changes will make the evidence for the framework's gains transparent and reproducible.
Revision: yes
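For the paired tests in (2), a minimal sketch using statsmodels' McNemar test on per-case correctness; the two boolean arrays are assumed inputs, and this is one reasonable choice of test, not necessarily the authors':

# Paired significance test: does the method correct cases the baseline misses
# more often than the reverse?
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(baseline_correct, method_correct):
    both = sum(b and m for b, m in zip(baseline_correct, method_correct))
    only_base = sum(b and not m for b, m in zip(baseline_correct, method_correct))
    only_method = sum(m and not b for b, m in zip(baseline_correct, method_correct))
    neither = sum(not b and not m for b, m in zip(baseline_correct, method_correct))
    table = [[both, only_base], [only_method, neither]]
    return mcnemar(table, exact=True)  # result exposes .statistic and .pvalue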
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper introduces counterfactual case editing and defines the Counterfactual Probability Gap directly as a measure of confidence shifts under those edits, without any reduction to parameters fitted on the evaluation benchmarks or self-citation chains that force the result. Accuracy gains are presented as empirical outcomes across three diagnostic benchmarks and seven LLMs, independent of the framework's internal definitions. No equations, uniqueness theorems, or ansatzes collapse the claimed improvements back to the inputs by construction; the improvements rest on external benchmarks rather than on the framework's own definitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: counterfactual edits to clinical findings simulate real diagnostic hypothesis testing.
Reference graph
Works this paper leans on
- [1] John R Ball, Bryan T Miller, and Erin P Balogh. Improving diagnosis in health care. National Academies Press, 2015.
- [2] Said A Ibrahim and Peter J Pronovost. Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? In JAMA Health Forum, volume 2, pages e212430–e212430. American Medical Association, 2021.
- [3] Harold C Sox, Michael C Higgins, Douglas K Owens, and Gillian Sanders Schmidler. Medical decision making. John Wiley & Sons, 2024.
- [4] Dan P Ly, Paul G Shekelle, and Zirui Song. Evidence for anchoring bias during physician decision-making. JAMA Internal Medicine, 183(8):818–823, 2023.
- [5] Zunaid Ismail Vally, Razia AG Khammissa, Gal Feller, Raoul Ballyram, Michaela Beetge, and Liviu Feller. Errors in clinical diagnosis: a narrative review. Journal of International Medical Research, 51(8):03000605231162798, 2023.
- [6] CS Webster, S Taylor, and JM Weller. Cognitive biases in diagnosis and decision making during anaesthesia and intensive care. BJA Education, 21(11):420–425, 2021.
- [7] Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digital Medicine, 8(1):159, 2025.
- [8] Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou. MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports. arXiv preprint arXiv:2505.11733, 2025.
- [9] Xiaohong Liu, Hao Liu, Guoxing Yang, Zeyu Jiang, Shuguang Cui, Zhaoze Zhang, Huan Wang, Liyuan Tao, Yongchang Sun, Zhu Song, et al. A generalist medical language model for disease diagnosis assistance. Nature Medicine, 31(3):932–942, 2025.
- [10] Yue Guo, Wei Qiu, Yizhong Wang, and Trevor Cohen. Automated lay language summarization of biomedical scientific reviews. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 160–168, 2021.
- [11] Zhiwen You, Shruthan Radhakrishna, Shufan Ming, and Halil Kilicoglu. UIUC_BioNLP at BioLaySumm: an extract-then-summarize approach augmented with Wikipedia knowledge for biomedical lay summarization. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 132–143, 2024.
- [12] Chaoqi Yang, Zhenbang Wu, Patrick Jiang, Zhen Lin, Junyi Gao, Benjamin Danek, Jimeng Sun, et al. PyHealth: A deep learning toolkit for healthcare predictive modeling. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2023.
- [13] Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. In The Thirteenth International Conference on Learning Representations, 2025.
- [14] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [15] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 599–621, 2024.
- [16] Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. MDAgents: An adaptive collaboration of LLMs for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024.
- [17] Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, and Qing Li. Tree-of-reasoning: Towards complex medical diagnosis via multi-agent reasoning with evidence tree. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 1744–1753, 2025.
- [18] Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine, 30(9):2613–2622, 2024.
- [19] Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F Molina, and Ahmed Alaa. ER-Reason: A benchmark dataset for LLM-based clinical reasoning in the emergency room. arXiv preprint arXiv:2505.22919, 2025.
- [20] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.
- [21] Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, and Lequan Yu. MedAgentBoard: Benchmarking multi-agent collaboration with conventional methods for diverse medical tasks. arXiv preprint arXiv:2505.12371, 2025.
- [22] Meta. Llama 3.1: Open foundation and instruction-tuned models, 2024.
- [23] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [24] Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. m1: Unleash the potential of test-time scaling for medical reasoning with large language models. arXiv preprint arXiv:2504.00869, 2025.
- [25] Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025.
- [26] Verity Schaye, Louis Miller, David Kudlowitz, Jonathan Chun, Jesse Burk-Rafel, Patrick Cocks, Benedict Guzman, Yindalon Aphinyanaphongs, and Marina Marin. Development of a clinical reasoning documentation assessment tool for resident and fellow admission notes: a shared mental model for feedback. Journal of General Internal Medicine, 37(3):507–512, 2022.
- [27] Justin Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. Reverse thinking makes LLMs stronger reasoners. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025.
- [28] Jonathan G Richens, Ciarán M Lee, and Saurabh Johri. Improving the accuracy of medical diagnosis with causal machine learning. Nature Communications, 11(1):3923, 2020.
- [29] Nima Fathi, Amar Kumar, and Tal Arbel. AURA: A multi-modal medical agent for understanding, reasoning and annotation. In International Workshop on Agentic AI for Medicine, pages 105–114. Springer, 2025.
- [30] Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, et al. Large language models for disease diagnosis: A scoping review. npj Artificial Intelligence, 1(1):9, 2025.
- [31] Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. Propagation and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12503–12525, 2024.
- [32] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019.
- [33] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. MIMIC-IV. PhysioNet, available online at https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), pages 49–55, 2020.
- [34] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [35] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
- [36] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.
- [37] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [38] Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludaescher, and Jana Diesner. SciPrompt: Knowledge-augmented prompting for fine-grained categorization of scientific topics. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6087–6104, Miami, Florida, USA, 2024.
- [39] Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. Med-RLVR: Emerging medical reasoning from a 3B base model via reinforcement learning. arXiv preprint arXiv:2502.19655, 2025.
- [40] Zhiwen You and Yue Guo. PlainQAFact: Retrieval-augmented factual consistency evaluation metric for biomedical plain language summarization. Journal of Biomedical Informatics, page 105019, 2026.
- [41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Prompt excerpts
Prompt template fragments recovered from the paper's appendix (extraction truncations preserved as "..."):
- Source fidelity: extract facts only from the supplied case presentation. Do NOT invent, embellish, or "smooth out" missing data. Paraphrase narrative prose into concise bullets where helpful, but never add new facts.
- Use the XML tags exactly as shown: <case_prompt>...</case_prompt> wraps the information given to students before they generate a differential.
- What goes inside <case_prompt>: present only the facts known before a working differential was made (chief complaint, HPI, vitals, physical exam, and early investigations); do not reference Figure 1, Table 1, etc. directly, and summarize any imaging findings from what is given in the text; present the case in the order presented in the orig...
- Detect the patient's main symptom(s) and list the key problems from the case presentation.
- Based on the patient features/symptoms and clinical notes, assign UP TO FIVE most relevant specialists from the allowed list ({specialist_pool}). Constraints: "num_agents" must equal the number of objects in "assigned_special...
- COPY-PASTE FIRST: start by copying the ENTIRE original case text verbatim.
- MINIMAL EDITS ONLY: then modify ONLY the specific spans listed in target_evidence_group.
- PRESERVE EVERYTHING ELSE: do NOT change, remove, summarize, or rephrase any other part of the case.
- SAME LENGTH: the edited_case MUST be approximately the same length as the original (+/−5%).
- NO SUMMARIZATION: NEVER replace detailed information with summaries like "normal results" or "unremarkable". Operations: negate (change a positive finding to negative, e.g., "fever present" -> "no fever"); remove (delete the specific span, keeping surrounding context); replace (substitute with a different but plausible value); weaken (reduce severity, e.g...
- Decision criteria: clinical reasoning quality (which diagnosis has the strongest evidence-based reasoning from the case presentation); counterfactual hypothesis testing evidence (which diagnosis shows the strongest counterfactual evidence: high CPG scores, consistent hypothesis testing); specialist consensus on critical features (which diagnosis has the most specialists identifying the same critical features); initial symptom explanation (which diagnosis best explains WHY the initial symptoms occurred); timeline consistency (which diagnosis fits the temporal evolution of the case); diagnostic criteria matching (which diagnosis best matches established clinical criteria for that condition). WARNING: if specialists converged on a high-probability diagnosis but counterfactual hypothesis testing shows weak evidence, consider alternative diagnoses even if they have lower probability. CRITICAL VALIDATION:
- Your final_diagnosis MUST be consistent with your rationale: if your rationale supports diagnosis A, you MUST choose diagnosis A as final_diagnosis.
- You MUST select your final_diagnosis from the top 3 differential diagnoses provided; you cannot propose a diagnosis outside this list. Output JSON only with this schema: { "had_consensus": true/false, "final_diagnosis": "final chosen diagnosis label (must select from the top 3 differential diagnoses)", "winner_role": "Role name whose reasoning most st...
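A minimal sketch of the four edit operations named above (negate, remove, replace, weaken) applied at the string level; in the paper the edits are produced by an LLM following the prompt constraints, so the lookup tables and matching strategy below are purely illustrative assumptions:

# Hypothetical string-level applier for the four counterfactual edit operations.
NEGATIONS = {"fever present": "no fever"}            # assumed lookup, for illustration
WEAKENINGS = {"severe": "mild", "marked": "slight"}  # assumed severity map

def apply_edit(case_text, span, operation, replacement=None):
    if operation == "negate":
        return case_text.replace(span, NEGATIONS.get(span, "no " + span))
    if operation == "remove":
        return case_text.replace(span, "")  # keep surrounding context intact
    if operation == "replace":
        # substitute a different but plausible value supplied by the caller
        return case_text.replace(span, replacement if replacement else span)
    if operation == "weaken":
        weakened = span
        for strong, weak in WEAKENINGS.items():
            weakened = weakened.replace(strong, weak)
        return case_text.replace(span, weakened)
    raise ValueError("unknown operation: " + operation)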