Agentic AI Enhances Physician Trust in Clinical Decision Making

arxiv: 2606.30658 · v1 · pith:MNRTUD3Enew · submitted 2026-06-16 · 💻 cs.CY · cs.AI

Zhiling Yan, Zhe Fang, David J King, Ann Pongsakul, Eashan Adhikarla, Hui Ren, Sunyang Fu, Quanzheng Li, Lifang He, Xiang Li, Hongfang Liu, Yonghui Wu, Lichao Sun

Pith reviewed 2026-07-01 07:14 UTC · model grok-4.3

keywords agentic AI · physician trust · clinical decision making · cognitive trust · behavioral reliance · treatment planning · multimodal cases · over-reliance

The pith

Agentic AI earns significantly higher physician trust than non-agentic models in clinical cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether agentic AI, which shows its tool-using steps and reasoning transparently, builds more trust from physicians than standard AI models. Three physicians reviewed 315 multimodal clinical cases and gave higher ratings for both process trust and outcome reliance to the agentic version. Trust differences reached statistical significance, with an 89.57 percent preference for agentic reasoning specifically on treatment planning tasks. The study also finds that trust in the visible process strongly predicts willingness to follow the final recommendation. These results matter because they indicate that making AI steps visible can shift how doctors decide to use or ignore AI output in real medical decisions.

Core claim

Three physicians evaluated 315 multimodal clinical cases and found significantly higher cognitive and behavioral trust for the agentic model than for non-agentic baselines. Physicians preferred the agentic reasoning in 89.57 percent of treatment planning cases. Process-oriented cognitive trust showed a significant association with outcome-oriented behavioral reliance. Measurable over-reliance on incorrect agentic outputs was still observed.

What carries the argument

Direct comparison of process-oriented cognitive trust and outcome-oriented behavioral reliance when physicians choose between agentic AI outputs that expose tool calls and reasoning steps versus non-agentic baselines.

If this is right

Physicians prefer agentic AI reasoning over non-agentic baselines in the large majority of treatment planning tasks.
Cognitive trust in the visible reasoning process predicts behavioral reliance on the final output.
Transparency of tool invocations and intermediate steps increases both forms of trust compared with opaque models.
Over-reliance on incorrect outputs can still occur even when reasoning steps are visible.
Continuous clinician oversight remains necessary despite the trust gains from transparency.
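
The over-reliance point in the list above can be made concrete with a small calculation. The sketch below is not the paper's analysis. It assumes a hypothetical record format in which each rating notes whether the agentic output was correct and whether the physician followed it, and it defines over-reliance as the share of incorrect outputs that were followed anyway. The `Rating` fields and the toy values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    # Hypothetical fields; the study's actual data schema is not given here.
    ai_correct: bool          # was the AI output correct for this case?
    physician_followed: bool  # did the physician rely on the AI output?


def over_reliance_rate(ratings):
    """Share of incorrect AI outputs that the physician still followed."""
    incorrect = [r for r in ratings if not r.ai_correct]
    if not incorrect:
        raise ValueError("no incorrect AI outputs in the sample")
    followed = sum(r.physician_followed for r in incorrect)
    return followed / len(incorrect)


# Toy data for illustration only, not taken from the study.
toy = [
    Rating(True, True), Rating(True, True), Rating(True, False),
    Rating(False, True), Rating(False, False),
    Rating(False, False), Rating(False, False),
]
print(over_reliance_rate(toy))  # 0.25: one of four incorrect outputs followed
```

Conditioning on incorrect outputs keeps the measure separate from overall reliance, which would also count the cases where following the AI was the right call.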

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The association between process trust and outcome reliance may weaken if physicians encounter more cases where the visible steps contain subtle errors.
Expanding the evaluation to additional medical specialties could reveal whether the trust advantage holds when case types differ from the original 315.
Designers of clinical AI systems might test whether adding explicit uncertainty signals to the visible steps further reduces over-reliance.
The same transparency mechanism could be examined in non-medical domains such as legal or financial decision support to check for similar trust effects.

Load-bearing premise

That ratings from only three physicians on 315 selected cases reflect how physicians generally behave, and that the case selection and evaluation protocol did not favor the agentic condition.

What would settle it

A follow-up study in which at least ten physicians rate a fresh set of cases and which finds either no significant trust difference or lower trust for the agentic model.

Original abstract

Medical AI has shifted from reasoning to agentic AI, a new paradigm that autonomously invokes external tools during reasoning, rendering intermediate reasoning steps and tool outputs transparent to users. Although proven to outperform previous models, physician trust in agentic AI remains largely unexplored. To address this, three physicians evaluated 315 multimodal clinical cases quantifying both process-oriented cognitive trust and outcome-oriented behavioral reliance. Comparing agentic AI against non-agentic baselines, physicians exhibited significantly higher cognitive and behavioral trust for the agentic model (P < 0.001). Specifically, on treatment planning tasks, physicians trusted the agentic reasoning most, preferring it in 89.57% of cases. Furthermore, process-oriented cognitive trust is significantly associated with outcome-oriented behavioral reliance (P < 0.001). However, measurable over-reliance on incorrect agentic outputs still exists, highlighting the inherent limitations of decision-logic transparency alone and underscoring the continuous need for rigorous clinician oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The main finding rests on trust ratings from only three physicians, which undercuts any broad claims about physician behavior.

The paper's central result is that three physicians showed higher cognitive and behavioral trust toward agentic AI than toward non-agentic baselines across 315 cases, with an 89.57% preference on treatment planning and a reported link between process trust and outcome reliance. That is the one thing a colleague needs to know up front.

What is new is the direct comparison of trust metrics for agentic versus non-agentic models on multimodal clinical cases. The work separates cognitive trust (process-oriented) from behavioral reliance (outcome-oriented) and documents that transparency alone does not remove over-reliance on wrong outputs. Those distinctions are useful even if the numbers are preliminary.
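
One simple way to quantify a link between process trust and outcome reliance is a phi coefficient on a 2x2 table. The sketch below assumes both measures have been reduced to binary labels (high or low cognitive trust, relied or did not rely), which is an assumption made for illustration; the abstract does not say how the authors measured the association.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
      rows    = cognitive trust high / low
      columns = relied on the output yes / no
    """
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return (a * d - b * c) / denom


# Toy counts, not study data.
print(phi_coefficient(8, 2, 2, 8))  # 0.6
```

A value near 1 means high process trust almost always coincides with reliance, a value near 0 means the two vary independently.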

The execution has clear limits. The entire dataset comes from three raters, so any P-value or percentage reflects the views of that small group rather than physicians in general. The abstract gives no information on case selection criteria, blinding, randomization of presentation order, or inter-rater agreement, which leaves open the possibility that the observed differences trace to rater-specific biases or how the 315 cases were chosen. With physician-level n equal to three, the statistical claims cannot be extrapolated without strong additional assumptions.
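
A short calculation shows why the rater count matters for the reported significance. The abstract does not name the test behind P < 0.001, so this is an illustration of the rater-level limit rather than a reconstruction of the authors' analysis. It uses an exact two-sided sign test in which each physician counts once, as either favoring or not favoring the agentic model.

```python
from math import comb


def sign_test_p(n_pos, n_neg):
    """Exact two-sided sign test p-value; ties are assumed already dropped."""
    n = n_pos + n_neg
    if n == 0:
        raise ValueError("need at least one non-tied rater")
    k = max(n_pos, n_neg)
    # Probability of k or more raters agreeing under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def min_unanimous_raters(alpha):
    """Smallest number of unanimous raters that gives p < alpha."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    n = 1
    while sign_test_p(n, 0) >= alpha:
        n += 1
    return n


print(sign_test_p(3, 0))            # 0.25
print(min_unanimous_raters(0.001))  # 11
```

With three unanimous raters the smallest achievable p-value is 0.25, so a result of P < 0.001 must come from treating cases or individual ratings as the unit of analysis. That choice is exactly where rater-specific bias can enter.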

This paper is for readers who follow medical AI adoption and want an early empirical signal on trust. Someone already working on agentic systems or clinical decision support might pull the trust-association finding as a starting point for their own studies. It is not yet solid enough for design guidelines or policy.

The question is timely and the authors engage the literature honestly, so the paper deserves peer review to surface the missing protocol details and assess whether the sample can be expanded or the analysis reframed as exploratory.

Referee Report

2 major / 1 minor

Summary. The paper reports an empirical user study in which three physicians evaluated 315 multimodal clinical cases to compare trust in agentic AI (which autonomously invokes tools with transparent reasoning) against non-agentic baselines. It claims significantly higher cognitive and behavioral trust for the agentic model (P < 0.001), with physicians preferring the agentic reasoning in 89.57% of treatment planning cases, and a significant association between process-oriented cognitive trust and outcome-oriented behavioral reliance (P < 0.001). The work also notes persistent over-reliance on incorrect agentic outputs despite transparency.

Significance. If the empirical results prove robust, the paper would offer concrete evidence that tool-use transparency in agentic AI can increase physician trust in clinical tasks, with potential implications for AI system design in healthcare. The reported link between cognitive trust and behavioral reliance would add to the human-AI interaction literature. However, the extremely small evaluator sample (n=3) substantially limits the generalizability of the findings and therefore their broader significance.

major comments (2)

[Abstract] Abstract: The headline claims rest on evaluations performed by only three physicians. With a physician-level sample size of n=3, the reported P < 0.001 results and 89.57% preference rate cannot be extrapolated to physicians in general; individual rater biases or case-selection effects could fully explain the observed differences. This is load-bearing for the central claim that 'physicians exhibited significantly higher cognitive and behavioral trust for the agentic model'.
[Abstract] Abstract: The abstract supplies no information on study design details including case selection criteria for the 315 cases, blinding procedures, randomization of presentation order, inter-rater agreement metrics, or the specific statistical tests employed. These omissions prevent verification that the data support the reported significance levels and preference percentages.
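
The inter-rater agreement metric requested above could be reported in several ways. One common choice when every case is rated by the same number of raters is Fleiss' kappa, sketched below. This is an assumption about what such a report might look like, not a description of the authors' method, and the count layout and toy values are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    if not counts:
        raise ValueError("no items to score")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        raise ValueError("all ratings fall in one category; kappa is undefined")
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 4 cases, 3 raters, 2 categories (prefer agentic / prefer baseline).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(toy_counts))
```

For the toy counts the result is about 0.33: two cases have full agreement, two have a 2-to-1 split, and the overall category proportions are balanced.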

minor comments (1)

The manuscript would benefit from explicit discussion of limitations arising from the small evaluator sample in a dedicated limitations paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We agree that the small evaluator sample and the abstract's lack of methodological detail are important issues. We address each comment below and will make corresponding revisions.

Referee: [Abstract] Abstract: The headline claims rest on evaluations performed by only three physicians. With a physician-level sample size of n=3, the reported P < 0.001 results and 89.57% preference rate cannot be extrapolated to physicians in general; individual rater biases or case-selection effects could fully explain the observed differences. This is load-bearing for the central claim that 'physicians exhibited significantly higher cognitive and behavioral trust for the agentic model'.

Authors: We agree that n=3 is a substantial limitation that precludes generalizing the results to physicians at large. The study was conceived as an initial, in-depth exploration rather than a large-scale survey. We will revise the abstract to qualify all headline claims, explicitly noting the limited sample and framing the findings as preliminary evidence. We will also expand the limitations section to discuss risks of rater bias and case-selection effects.
Revision: yes

Referee: [Abstract] Abstract: The abstract supplies no information on study design details including case selection criteria for the 315 cases, blinding procedures, randomization of presentation order, inter-rater agreement metrics, or the specific statistical tests employed. These omissions prevent verification that the data support the reported significance levels and preference percentages.

Authors: We agree that the abstract should contain enough methodological information for readers to assess the reported statistics. We will expand the abstract with a concise summary of case selection criteria, blinding and randomization procedures, inter-rater agreement metrics, and the statistical tests used, drawing directly from the methods section of the manuscript.
Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical user study with no derivations or self-referential reductions

The paper reports results from a user study in which three physicians rated 315 cases for cognitive and behavioral trust in agentic vs. non-agentic AI. All reported findings (P<0.001 differences, 89.57% preference, association between trust types) are direct statistical summaries of the collected ratings. No equations, fitted parameters, predictions derived from prior fits, uniqueness theorems, or self-citations appear as load-bearing steps in the provided abstract or description. The result does not reduce to its inputs by construction; it is an observational comparison whose validity rests on study design (sample size, blinding, case selection) rather than definitional or self-referential logic. This matches the default expectation for an empirical study and receives the lowest circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from physician evaluators using defined trust metrics; no free parameters, new entities, or non-standard axioms are introduced beyond standard assumptions of statistical testing.

axioms (1)

Domain assumption: Physician trust ratings collected via the described process-oriented and outcome-oriented scales validly measure the intended constructs.
The study draws conclusions from these metrics without reporting validation of the scales or inter-rater reliability.

pith-pipeline@v0.9.1-grok · 5731 in / 1367 out tokens · 42496 ms · 2026-07-01T07:14:30.286700+00:00 · methodology
