pith. machine review for the scientific record.

arxiv: 2604.16346 · v2 · submitted 2026-03-16 · 💻 cs.HC · cs.CY

Recognition: no theorem link

DR. INFO at the Point of Care: A Prospective Pilot Study of Physician-Perceived Value of an Agentic AI Clinical Assistant

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:29 UTC · model grok-4.3

classification 💻 cs.HC cs.CY
keywords AI clinical assistant · physician perception · time efficiency · decision support · pilot study · Net Promoter Score · clinical documentation · healthcare AI

The pith

Physicians rated an agentic AI clinical assistant 4.27 out of 5 for time savings and 4.16 for decision support in a five-day pilot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports results from a small prospective pilot in which 29 physicians and medical students used DR. INFO, an agentic AI tool, during routine clinical work. Participants gave consistently high marks for perceived reductions in time spent on documentation and for help with clinical decisions, with scores holding steady across the study period. A net promoter score of 81.2 among those who completed the final survey indicated strong willingness to recommend the tool. The authors frame these self-reported outcomes as preliminary evidence that such AI assistants can address documentation burdens at the point of care. They note that larger studies with objective measures will be required to confirm the findings.

Core claim

In this single-arm feasibility study, physicians across specialties used DR. INFO v1.0 for five working days and reported mean Likert scores of 4.27 for time efficiency and 4.16 for decision support, both with confidence intervals above 3.8; ratings remained stable day to day, and the net promoter score reached 81.2 among completers.
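The NPS arithmetic behind these figures can be checked directly. A minimal sketch in Python (the 13-promoter / 3-passive split among the 16 completers is inferred from the reported 81.2 with no detractors; the paper does not state it explicitly):

```python
def nps(promoters, passives, detractors):
    """Net Promoter Score: % promoters (ratings 9-10) minus % detractors (0-6)."""
    n = promoters + passives + detractors
    return 100.0 * (promoters - detractors) / n

# Among the 16 completers: zero detractors and an NPS of 81.2 force 13 promoters.
completers = nps(13, 3, 0)         # 81.25, reported as 81.2

# Conservative sensitivity: fold the 13 non-responders in as passives,
# so the denominator becomes all 29 participants.
conservative = nps(13, 3 + 13, 0)  # ≈ 44.8, matching the sensitivity analysis
```

Counting non-responders as passives (rather than detractors, which would drive the score to zero) is the assumption consistent with the reported 44.8.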

What carries the argument

DR. INFO v1.0, an agentic AI clinical assistant that retrieves verified information and assists with documentation during patient encounters.

If this is right

  • Stable daily ratings imply the perceived benefits do not fade quickly after initial use.
  • High net promoter scores across career stages suggest the tool could see broad uptake if scaled.
  • Positive perceptions in multiple specialties support testing in diverse clinical workflows.
  • The results justify investment in controlled trials that add objective performance data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could pair the AI with electronic health record systems to measure actual minutes saved per patient.
  • Testing the tool in settings with different patient loads would reveal whether benefits hold under higher pressure.
  • Addressing source reliability concerns explicitly in larger studies would help close the gap between perception and verified accuracy.

Load-bearing premise

Self-reported perceptions on short Likert scales from a small uncontrolled pilot accurately reflect real clinical time savings and decision quality.

What would settle it

A randomized trial that directly times documentation tasks and tracks clinical error rates with versus without the AI assistant over multiple weeks.
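The primary comparison in such a trial could stay non-parametric, in the spirit of the pilot's statistics. A sketch using a permutation test on timed documentation tasks (all timing values below are synthetic placeholders, not study data):

```python
import random
import statistics

def permutation_test(control, treatment, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean task time."""
    rng = random.Random(seed)
    observed = statistics.fmean(treatment) - statistics.fmean(control)
    pooled = control + treatment
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel arms at random
        diff = statistics.fmean(pooled[:n_t]) - statistics.fmean(pooled[n_t:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical documentation times in minutes, for illustration only:
without_ai = [14.2, 12.8, 15.1, 13.7, 16.0, 12.9, 14.8, 13.3]
with_ai = [10.1, 11.4, 9.8, 12.0, 10.7, 11.9, 9.5, 10.9]
diff, p = permutation_test(without_ai, with_ai)
```

With real data the same routine applies per task category; a permutation test avoids the normality assumptions that short timing samples rarely satisfy.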

Figures

Figures reproduced from arXiv: 2604.16346 by Marta Isidoro, Michiel van der Heijden, Miguel Romano, Olivier Fail, Rogerio Corga Da Silva, Sandhanakrishnan Ravichandran, Shivesh Kumar, Tiago Mendes, Valentine Emmanuel Gnanapragasam.

Figure 1. Participant flow diagram. Flow diagram showing participant progression. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]
Figure 2. Comparison of mean diary scores between final eCRF completers and non… [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3. Mean Likert scores for time saving and decision support across the five study… [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]
Figure 4. Heatmap of individual daily Likert scores for time saving (left) and decision… [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
Figure 5. Scatter plots showing the relationship between years of practice and perceived… [PITH_FULL_IMAGE:figures/full_fig_p012_5.png]
Figure 6. Distribution of NPS scores and sensitivity analysis. (a) Distribution of NPS… [PITH_FULL_IMAGE:figures/full_fig_p013_6.png]
Figure 7. Frequency of clinical use case categories reported by participants across all diary… [PITH_FULL_IMAGE:figures/full_fig_p015_7.png]
Figure 8. Content analysis of participant feedback on areas for improvement. Frequency… [PITH_FULL_IMAGE:figures/full_fig_p015_8.png]
Original abstract

Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. This study aimed to evaluate physician-perceived time efficiency, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methodology: In this prospective, single-arm, pilot feasibility study, 29 physicians and medical students across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations (time saving and decision support) and a final Net Promoter Score (NPS). Non-parametric methods were used throughout, with bootstrap confidence intervals (CIs) and sensitivity analysis to address non-response. Results: Physicians reported high perceived time saving (mean = 4.27/5; 95% CI = 3.97-4.57) and decision support (mean = 4.16/5; 95% CI = 3.86-4.45), with ratings stable across the five-day study window. Among the 16 (55%) participants who completed the final evaluation, the NPS was 81.2, with no detractors; sensitivity analysis indicated an NPS of 44.8 under conservative non-response assumptions. Conclusions: Physicians across specialties and career stages reported positive perceptions of DR. INFO for both time efficiency and clinical decision support within the study window. These findings are preliminary and should be confirmed in larger, controlled studies that include objective performance measures and independent accuracy verification.
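The percentile-bootstrap CIs described in the abstract are straightforward to reproduce in outline. A minimal sketch, assuming one mean rating per participant (the ratings below are illustrative stand-ins, not the study's individual-level data):

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of Likert ratings."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(scores), (lo, hi)

# 29 synthetic per-participant ratings on a 1-5 scale (illustration only):
ratings = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4, 5, 4, 4, 5,
           4, 5, 3, 4, 5, 4, 4, 5, 4]
mean, (lo, hi) = bootstrap_ci(ratings)  # mean ≈ 4.28 for this synthetic sample
```

The study's reported intervals (e.g. 3.97-4.57 around the 4.27 time-saving mean) would come from resampling of this kind over the 29 participants' ratings.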

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reports results from a prospective single-arm pilot feasibility study in which 29 physicians and medical students across specialties used the DR. INFO v1.0 agentic AI clinical assistant for five working days. Daily Likert-scale ratings indicated high perceived time saving (mean 4.27/5, 95% CI 3.97-4.57) and decision support (mean 4.16/5, 95% CI 3.86-4.45), with ratings stable over the study period. Among the 16 participants completing the final evaluation, the Net Promoter Score was 81.2 (sensitivity analysis: 44.8 under conservative non-response assumptions). The authors conclude that these positive perceptions support further evaluation in larger controlled trials incorporating objective performance measures.

Significance. If the subjective perceptions track actual clinical experience, the work provides early evidence that agentic AI tools can address documentation burden and support decision-making, a high-impact area given that administrative tasks consume over half of physicians' time. The study is notable for its multi-specialty sample and use of non-parametric statistics with bootstrap CIs, but the lack of objective validation means its primary contribution is hypothesis generation and feasibility demonstration rather than definitive proof of efficiency gains.

major comments (3)
  1. [Methodology] Methodology section: The single-arm design without a control group, timed task logs, or independent accuracy verification against source documents leaves the reported time-saving and decision-support means vulnerable to novelty bias and social-desirability effects; this directly limits the strength of the efficiency claims in the Results.
  2. [Results] Results section: With only 55% completion of the final NPS evaluation, the sensitivity analysis correctly shows the drop to 44.8, yet the manuscript does not quantify how non-response may have biased the daily Likert stability findings or the overall perceived-value interpretation.
  3. [Abstract and Conclusions] Abstract and Conclusions: Framing the study as evaluating 'time efficiency' and 'clinical decision support' risks overstating subjective Likert data as evidence of objective gains; the text should more explicitly tie all claims to the self-report nature of the measures.
minor comments (1)
  1. [Methods] Methods: The exact schedule of daily Likert administrations (e.g., end-of-day vs. per-task) is not specified, which would aid reproducibility and interpretation of the five-day stability analysis.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the pilot nature of the study and reliance on subjective measures require careful framing, and we have revised the manuscript accordingly to address each point. Below we respond to the major comments.

Point-by-point responses
  1. Referee: [Methodology] Methodology section: The single-arm design without a control group, timed task logs, or independent accuracy verification against source documents leaves the reported time-saving and decision-support means vulnerable to novelty bias and social-desirability effects; this directly limits the strength of the efficiency claims in the Results.

    Authors: We agree that the single-arm design and absence of objective measures (such as timed logs or source verification) make the findings susceptible to novelty bias and social-desirability bias, limiting causal claims about efficiency. As this was explicitly a feasibility pilot, we have revised the Methods, Results, and Discussion to more explicitly describe these design constraints and to qualify all reported outcomes as perceived rather than objective. We have added text noting the lack of control group and objective validation. revision: yes

  2. Referee: [Results] Results section: With only 55% completion of the final NPS evaluation, the sensitivity analysis correctly shows the drop to 44.8, yet the manuscript does not quantify how non-response may have biased the daily Likert stability findings or the overall perceived-value interpretation.

    Authors: The daily Likert ratings were completed by all 29 participants, so the reported stability is based on complete data and is unlikely to be affected by final-survey non-response. For the overall interpretation, we have added discussion in the Results and Limitations sections acknowledging that the subset completing the NPS may have been more favorable, potentially inflating perceived value. A formal bias quantification is not possible without additional data on non-responders, which we do not have. revision: partial

  3. Referee: [Abstract and Conclusions] Abstract and Conclusions: Framing the study as evaluating 'time efficiency' and 'clinical decision support' risks overstating subjective Likert data as evidence of objective gains; the text should more explicitly tie all claims to the self-report nature of the measures.

    Authors: We have revised the Abstract, Results, and Conclusions to consistently use the phrasing 'perceived time efficiency' and 'perceived decision support' and to state explicitly that all outcomes derive from self-reported Likert scales. We have also strengthened the final sentence of the Conclusions to underscore the preliminary, subjective character of the data and the need for objective measures in future work. revision: yes

Circularity Check

0 steps flagged

No circularity: direct reporting of survey data with standard statistics

full rationale

The paper is a single-arm pilot feasibility study that collects daily Likert-scale responses on time saving and decision support from 29 participants, computes means with bootstrap CIs, and reports NPS from a subset. No equations, fitted parameters, predictive models, or derivation steps exist. All results are direct descriptive summaries of the collected responses using non-parametric methods; no self-citation, ansatz, or uniqueness claim is invoked to justify the central findings. The chain from data collection to reported means is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that self-reported Likert ratings validly measure time savings and decision support in the absence of objective benchmarks; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Likert-scale self-reports accurately reflect actual time savings and clinical decision support
    Invoked in the outcome measures and results sections without external validation against stopwatch data or chart review.

pith-pipeline@v0.9.0 · 5659 in / 1267 out tokens · 38298 ms · 2026-05-15T10:29:29.507663+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  [1] Siru Liu, Allison B. McCoy, and Adam Wright. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. Journal of the American Medical Informatics Association, 32(4):605–615, 2025. doi: 10.1093/jamia/ocaf008

  [2] Sara Berg. Doctors work fewer hours, but the EHR still follows them home. American Medical Association, August 2025. URL https://www.ama-assn.org/practice-management/physician-health/doctors-work-fewer-hours-ehr-still-follows-them-home. Accessed: 2026-03-23

  [3] Tait D. Shanafelt, Colin P. West, Christine Sinsky, Mickey Trockel, Michael Tutty, Hanhan Wang, Lindsey E. Carlasare, and Liselotte N. Dyrbye. Changes in burnout and satisfaction with work–life integration in physicians and the general US working population between 2011 and 2023. Mayo Clinic Proceedings, 100(7):1142–1158, 2025. doi: 10.1016/j.mayocp.2024.11.031

  [4] Stanford HAI. Stanford develops real-world benchmarks for healthcare AI agents, September 2025. URL https://hai.stanford.edu/news/stanford-develops-real-world-benchmarks-for-healthcare-ai-agents. Accessed: 2026-03-23

  [5] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachber, Philip Mansfield, Sushant Prakash, Bradley Green, Avinatan Hassidim, Sara Mahdavi, Greg S. Corrado, Yossi Matias, Katherine Chou, David Fleet, Laurent El Shafey, ...

  [6] John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Longhurst, Michael Hogarth, and Davey M. Smith. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Internal Medicine, 183(6):589–596, 2023...

  [7] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023. doi: 10.1038/s41591-023-02448-8

  [8] Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Jaden A. Cool, Zahir Kanjee, Andrew S. Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P. J. Olson, and Jonathan H. Chen. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open, 7(10):e2440969, 2024. doi: 10...

  [9] Arya Rao, Michael Pang, John Kim, Meghna Kamineni, Winston Lie, Anoop K. Prasad, Adam Landman, Keith Dreyer, and Marc D. Succi. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. Journal of Medical Internet Research, 25:e48659, 2023. doi: 10.2196/48659

  [10] Mahmud Omar, Reem Agbareia, Benjamin S. Glicksberg, Girish N. Nadkarni, and Eyal Klang. Benchmarking the confidence of large language models in answering clinical questions: cross-sectional evaluation study. JMIR Medical Informatics, 13:e66917, 2025. doi: 10.2196/66917

  [11] COMPETE 2030. Portugal lidera saúde digital na união europeia [Portugal leads digital health in the European Union], December 2024. URL https://www.compete2030.gov.pt/comunicacao/portugal-lidera-saude-digital-na-uniao-europeia/. Accessed: 2026-03-23

  [12] Serviços Partilhados do Ministério da Saúde (SPMS), EPE. Inteligência artificial na saúde em Portugal: regulamentação, impactos e perspetivas de futuro [Artificial intelligence in health in Portugal: regulation, impacts and future perspectives]. White paper, Ministério da Saúde, Portugal, February 2025. URL https://www.spms.min-saude.pt/wp-content/uploads/2025/03/White-Paper_Inteligencia-Artificial-na-Saude-em-Portugal_-Final2-1.pdf. Accessed...

  [13] OpenAI. Introducing HealthBench, May 2025. URL https://openai.com/index/healthbench/. Accessed: 2026-03-23

  [14] Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva, Miguel Romano, Michiel van der Heijden, Olivier Fail, and Valentine Emmanuel Gnanapragasam. OpenAI's HealthBench in action: evaluating an LLM-based medical assistant on realistic clinical queries. arXiv preprint, 2025. doi: 10.48550/arXiv.2509.02594. URL https://arxiv.org/abs/2509.02594

  [15] Lehana Thabane, Jinhui Ma, Rong Chu, Ji Cheng, Afisi Ismaila, Lorena P. Rios, Reid Robson, Marroon Thabane, Lora Giangregorio, and Charles H. Goldsmith. A tutorial on pilot studies: the what, why and how. BMC Medical Research Methodology, 10:1, 2010. doi: 10.1186/1471-2288-10-1

  [17] Gail M. Sullivan and Anthony R. Artino, Jr. Analyzing and interpreting data from Likert-type scales. Journal of Graduate Medical Education, 5(4):541–542, 2013. doi: 10.4300/JGME-5-4-18

  [18] Frederick F. Reichheld. The one number you need to grow. Harvard Business Review, 81(12):46–54, 2003

  [19] Sidney Siegel and N. John Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York, 2nd edition, 1988. doi: 10.1086/416341

  [20] Assembleia da República. Lei n.º 21/2014 de 16 de abril: Aprova a lei da investigação clínica [Law no. 21/2014 of 16 April: approving the clinical research law], 2014. URL https://diariodarepublica.pt/dr/detalhe/lei/21-2014-25344024. Accessed: 2026-03-23

  [21] Laura Kurasińska. NPS benchmarks for 2025: good net promoter scores by industry. Survicate, December 2025. URL https://survicate.com/nps-benchmarks/. Accessed: 2026-03-23

  [22] Ryan T. Hurt, Curtiss R. Stephenson, Emily A. Gilman, Caroline A. Aakre, Ivana T. Croghan, Manpreet S. Mundi, Kalyani Ghosh, and Jithinraj Edakkanambeth Varayil. The use of an artificial intelligence platform OpenEvidence to augment clinical decision-making for primary care physicians. Journal of Primary Care & Community Health, 16:21501319251332215, 2025. doi: ...

  [23] Cesar A. Gomez-Cabello, Sahar Borna, Sophia Pressman, Syed Ali Haider, Clifton R. Haider, and Antonio J. Forte. Artificial-intelligence-based clinical decision support systems in primary care: a scoping review of current clinical implementations. European Journal of Investigation in Health, Psychology and Education, 14(3):685–698, 2024. doi: 10.3390/ejihpe14030045

  [25] Kevin B. O'Reilly. How much can ambient AI scribes help cut doctor burnout? American Medical Association, October 2025. URL https://www.ama-assn.org/practice-management/physician-health/how-much-can-ambient-ai-scribes-help-cut-doctor-burnout. Accessed: 2026-03-23

  [26] Mingyang Chen, Bo Zhang, Ziting Cai, Samuel Seery, Maria J. Gonzalez, Nasra M. Ali, Ran Ren, Youlin Qiao, Peng Xue, and Yu Jiang. Acceptance of clinical artificial intelligence among physicians and medical students: a systematic review with cross-sectional survey. Frontiers in Medicine, 9:990604, 2022. doi: 10.3389/fmed.2022.990604