Agentic AI Enhances Physician Trust in Clinical Decision Making
Pith reviewed 2026-07-01 07:14 UTC · model grok-4.3
The pith
Agentic AI earns significantly higher physician trust than non-agentic models in clinical cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Three physicians evaluated 315 multimodal clinical cases and found significantly higher cognitive and behavioral trust for the agentic model than for non-agentic baselines. Physicians preferred the agentic reasoning in 89.57 percent of treatment planning cases. Process-oriented cognitive trust showed a significant association with outcome-oriented behavioral reliance. Measurable over-reliance on incorrect agentic outputs was still observed.
What carries the argument
Direct comparison of process-oriented cognitive trust and outcome-oriented behavioral reliance when physicians choose between agentic AI outputs, which expose tool calls and reasoning steps, and non-agentic baselines.
If this is right
- Physicians prefer agentic AI reasoning over non-agentic baselines in the large majority of treatment planning tasks.
- Cognitive trust in the visible reasoning process predicts behavioral reliance on the final output.
- Transparency of tool invocations and intermediate steps increases both forms of trust compared with opaque models.
- Over-reliance on incorrect outputs can still occur even when reasoning steps are visible.
- Continuous clinician oversight remains necessary despite the trust gains from transparency.
Where Pith is reading between the lines
- The association between process trust and outcome reliance may weaken if physicians encounter more cases where the visible steps contain subtle errors.
- Expanding the evaluation to additional medical specialties could reveal whether the trust advantage holds when case types differ from the original 315.
- Designers of clinical AI systems might test whether adding explicit uncertainty signals to the visible steps further reduces over-reliance.
- The same transparency mechanism could be examined in non-medical domains such as legal or financial decision support to check for similar trust effects.
Load-bearing premise
That ratings from only three physicians on 315 selected cases reflect how physicians generally behave, and that the case selection and evaluation protocol did not favor the agentic condition.
What would settle it
A follow-up study with at least ten physicians rating a fresh set of cases that shows no significant trust difference or lower trust for the agentic model.
Original abstract
Medical AI has shifted from reasoning to agentic AI, a new paradigm that autonomously invokes external tools during reasoning, rendering intermediate reasoning steps and tool outputs transparent to users. Although proven to outperform previous models, physician trust in agentic AI remains largely unexplored. To address this, three physicians evaluated 315 multimodal clinical cases quantifying both process-oriented cognitive trust and outcome-oriented behavioral reliance. Comparing agentic AI against non-agentic baselines, physicians exhibited significantly higher cognitive and behavioral trust for the agentic model (P < 0.001). Specifically, on treatment planning tasks, physicians trusted the agentic reasoning most, preferring it in 89.57% of cases. Furthermore, process-oriented cognitive trust is significantly associated with outcome-oriented behavioral reliance (P < 0.001). However, measurable over-reliance on incorrect agentic outputs still exists, highlighting the inherent limitations of decision-logic transparency alone and underscoring the continuous need for rigorous clinician oversight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical user study in which three physicians evaluated 315 multimodal clinical cases to compare trust in agentic AI (which autonomously invokes tools with transparent reasoning) against non-agentic baselines. It claims significantly higher cognitive and behavioral trust for the agentic model (P < 0.001), with physicians preferring the agentic reasoning in 89.57% of treatment planning cases, and a significant association between process-oriented cognitive trust and outcome-oriented behavioral reliance (P < 0.001). The work also notes persistent over-reliance on incorrect agentic outputs despite transparency.
Significance. If the empirical results prove robust, the paper would offer concrete evidence that tool-use transparency in agentic AI can increase physician trust in clinical tasks, with potential implications for AI system design in healthcare. The reported link between cognitive trust and behavioral reliance would add to human-AI interaction literature. However, the extremely small evaluator sample (n=3) substantially limits the generalizability and thus the broader significance of the findings.
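To make the sample-size concern concrete, the sketch below computes a Wilson score interval for the reported 89.57% preference rate. The abstract does not state how many treatment planning ratings the percentage is based on, so the denominators used here (115 and 345) are placeholders chosen only for illustration. The calculation also assumes independent ratings, which a three-rater design does not satisfy, so the real uncertainty is likely wider than these intervals suggest.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical denominators: the source does not report the number of
# treatment planning ratings behind the 89.57% figure.
for n in (115, 345):
    k = round(0.8957 * n)  # successes implied by the reported rate
    low, high = wilson_interval(k, n)
    print(f"n={n}: {k}/{n} preferred, 95% interval {low:.3f} to {high:.3f}")
```

The interval narrows as the number of ratings grows, but adding more ratings from the same three physicians does not address rater-level variation, which is the concern raised in the major comments.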
major comments (2)
- [Abstract] The headline claims rest on evaluations performed by only three physicians. With a physician-level sample size of n=3, the reported P < 0.001 results and 89.57% preference rate cannot be extrapolated to physicians in general; individual rater biases or case-selection effects could fully explain the observed differences. This is load-bearing for the central claim that 'physicians exhibited significantly higher cognitive and behavioral trust for the agentic model'.
- [Abstract] The abstract supplies no information on study design details, including case selection criteria for the 315 cases, blinding procedures, randomization of presentation order, inter-rater agreement metrics, or the specific statistical tests employed. These omissions prevent verification that the data support the reported significance levels and preference percentages.
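One of the missing details, inter-rater agreement, can be illustrated with Fleiss' kappa, a standard agreement statistic for a fixed number of raters assigning categorical labels. The sketch below is not the authors' analysis. It assumes every case was rated by all three physicians, which the abstract does not state, and the labels in the example are invented for illustration.

```python
import math
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa; ratings[i] lists the labels each rater gave to item i."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    n_items = len(ratings)
    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean share of agreeing rater pairs per item.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        return math.nan  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Invented example: three raters choosing between two outputs on four cases.
example = [
    ["agentic", "agentic", "agentic"],
    ["agentic", "agentic", "baseline"],
    ["baseline", "baseline", "baseline"],
    ["agentic", "baseline", "agentic"],
]
print(f"Fleiss' kappa: {fleiss_kappa(example):.3f}")
```

A low kappa alongside a high pooled preference rate would suggest that the headline percentage is driven by one or two raters rather than by shared judgment.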
minor comments (1)
- The manuscript would benefit from explicit discussion of limitations arising from the small evaluator sample in a dedicated limitations paragraph.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We agree that the small evaluator sample and the abstract's lack of methodological detail are important issues. We address each comment below and will make corresponding revisions.
Point-by-point responses
- Referee: [Abstract] The headline claims rest on evaluations performed by only three physicians. With a physician-level sample size of n=3, the reported P < 0.001 results and 89.57% preference rate cannot be extrapolated to physicians in general; individual rater biases or case-selection effects could fully explain the observed differences. This is load-bearing for the central claim that 'physicians exhibited significantly higher cognitive and behavioral trust for the agentic model'.
  Authors: We agree that n=3 is a substantial limitation that precludes generalizing the results to physicians at large. The study was conceived as an initial, in-depth exploration rather than a large-scale survey. We will revise the abstract to qualify all headline claims, explicitly noting the limited sample and framing the findings as preliminary evidence. We will also expand the limitations section to discuss risks of rater bias and case-selection effects.
  Revision: yes
- Referee: [Abstract] The abstract supplies no information on study design details, including case selection criteria for the 315 cases, blinding procedures, randomization of presentation order, inter-rater agreement metrics, or the specific statistical tests employed. These omissions prevent verification that the data support the reported significance levels and preference percentages.
  Authors: We agree that the abstract should contain enough methodological information for readers to assess the reported statistics. We will expand the abstract with a concise summary of case selection criteria, blinding and randomization procedures, inter-rater agreement metrics, and the statistical tests used, drawing directly from the methods section of the manuscript.
  Revision: yes
Circularity Check
No circularity: direct empirical user study with no derivations or self-referential reductions
Full rationale
The paper reports results from a user study in which three physicians rated 315 cases for cognitive and behavioral trust in agentic vs. non-agentic AI. All reported findings (P<0.001 differences, 89.57% preference, association between trust types) are direct statistical summaries of the collected ratings. No equations, fitted parameters, predictions derived from prior fits, uniqueness theorems, or self-citations appear as load-bearing steps in the provided abstract or description. The result does not reduce to its inputs by construction; it is an observational comparison whose validity rests on study design (sample size, blinding, case selection) rather than definitional or self-referential logic. This matches the default expectation for an empirical study and receives the lowest circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: physician trust ratings collected via the described process-oriented and outcome-oriented scales validly measure the intended constructs.