Recognition: no theorem link
SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment
Pith reviewed 2026-05-12 02:03 UTC · model grok-4.3
The pith
SymptomAI conversational agents produce more accurate differential diagnoses than independent clinicians when both review the same real-world patient dialogues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SymptomAI differential diagnoses were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Agentic strategies that conduct a dedicated symptom interview to elicit additional information before rendering a diagnosis perform substantially better than baseline, user-guided conversations (p < 0.001).
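The headline number is an odds ratio, which compares the odds of a correct differential diagnosis between the AI and clinician arms. A minimal sketch of how such a figure is derived from a 2x2 table of correct/incorrect counts — the counts below are hypothetical, not the study's data:

```python
# Illustrative only: hypothetical counts, not taken from the paper.
# An odds ratio like the reported OR = 2.56 compares the odds of a
# correct DDx in one arm against the odds in the other.

def odds_ratio(correct_a, wrong_a, correct_b, wrong_b):
    """Odds ratio of arm A being correct relative to arm B."""
    return (correct_a / wrong_a) / (correct_b / wrong_b)

# Hypothetical: AI correct on 400 of 517 dialogues, clinicians on 300 of 517.
or_ai_vs_clin = odds_ratio(400, 117, 300, 217)
print(round(or_ai_vs_clin, 2))  # → 2.47
```

In the paper itself the OR would come from a model fitted to the annotated comparisons, not a raw 2x2 table, but the interpretation of the ratio is the same.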
What carries the argument
Agentic conversational strategy that runs a dedicated symptom interview to gather additional information before issuing a differential diagnosis.
If this is right
- Structured interviews that actively elicit symptoms improve diagnostic accuracy over free-form user-led chats.
- Large-scale AI labeling of real-world conversations can support analysis of wearable metrics across hundreds of conditions.
- The performance advantage of dedicated interviews generalizes from wearable users to a broader U.S. population panel.
Where Pith is reading between the lines
- Consumer health apps may gain from requiring complete symptom elicitation rather than depending on user initiative.
- The results point toward hybrid systems that combine conversational interviews with direct sensor data.
- Future evaluations could test whether the same structured approach improves accuracy on rarer or more serious conditions.
Load-bearing premise
Clinician-provided diagnoses and expert-panel annotations serve as reliable ground truth even though they rest on patient self-reports and limited dialogue context.
What would settle it
A follow-up study that compares both the AI outputs and the clinician reviews against laboratory confirmation or imaging results for the same patients would settle the accuracy claim.
read the original abstract
Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.
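The abstract's wearable association (e.g. OR > 7 for influenza) is again a 2x2 odds ratio, here between a binary physiological flag and an AI-assigned diagnosis label. A minimal sketch with invented counts, using a Wald 95% confidence interval — every number below is an assumption for illustration:

```python
import math

# Hypothetical 2x2: association between a binary physiological flag
# (e.g. elevated resting heart rate) and an AI-assigned influenza label.
a, b = 120, 80   # flag present: influenza yes / no  (invented counts)
c, d = 40, 260   # flag absent:  influenza yes / no  (invented counts)

or_hat = (a * d) / (b * c)
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of log(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_log)  # Wald 95% CI lower bound
hi = math.exp(math.log(or_hat) + 1.96 * se_log)  # Wald 95% CI upper bound
print(f"OR = {or_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

At the study's scale (over 500,000 person-days), even modest per-cell counts yield tight intervals, which is what makes the reported OR > 7 notable.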
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. SymptomAI is a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx) deployed in the Fitbit app. The study randomized 13,917 participants to interact with five AI agents, finding that SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than independent clinicians in a blinded randomized comparison using self-reported clinician diagnoses as ground truth. Agentic strategies with dedicated symptom interviews outperformed baseline user-guided conversations (p < 0.001). Results were validated on a general US population panel, and wearable metrics were analyzed for associations with diagnoses across 500,000 days.
Significance. Should the findings be robust to the acknowledged limitations in ground truth, this research would highlight the advantages of agentic, interview-based approaches in consumer-facing AI for symptom assessment in everyday settings, as opposed to passive or user-directed interactions common in current LLMs. The scale of the study and the linkage to real-world wearable data provide valuable empirical support for such systems and open avenues for large-scale health monitoring.
major comments (2)
- [Abstract] The central claim of superior accuracy (OR = 2.56) relies on 1,228 self-reported clinician diagnoses and 517 panel annotations as ground truth. However, the abstract does not specify the randomization procedure, exact diagnostic criteria, inter-rater reliability of the clinician panel, or methods for handling missing data. These details are essential to substantiate the blinded randomized comparison and are load-bearing for the reported statistical results.
- [Abstract] The comparison involves 'independent clinicians given the same dialogue,' but no information is provided on the selection, training, or number of these clinicians, nor on how the expert panel's annotations were aggregated. This omission risks undermining the reliability of the accuracy metric.
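The referee's point about inter-rater reliability can be made concrete: for binary correct/incorrect panel judgments, Cohen's kappa is the standard agreement-beyond-chance statistic. A self-contained sketch with invented labels, not the study's annotations:

```python
# Hypothetical sketch: inter-rater reliability (Cohen's kappa) between two
# panel clinicians labeling each DDx as correct (1) or incorrect (0).
# The label vectors below are invented for illustration.

def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n  # raw agreement
    p1, p2 = sum(r1) / n, sum(r2) / n                   # marginal "yes" rates
    chance = p1 * p2 + (1 - p1) * (1 - p2)              # chance agreement
    return (observed - chance) / (1 - chance)

rater1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.52
```

Reporting a kappa (or an equivalent multi-rater statistic) for the 517 panel-annotated cases would directly address the referee's concern.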
minor comments (1)
- [Abstract] The abstract could more explicitly state the number and nature of the five AI agents tested to allow better understanding of the agentic vs. baseline comparison.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We address each major comment point by point below. We agree that the abstract would benefit from additional details on the study methodology to support the central claims.
read point-by-point responses
- Referee: [Abstract] The central claim of superior accuracy (OR = 2.56) relies on 1,228 self-reported clinician diagnoses and 517 panel annotations as ground truth. However, the abstract does not specify the randomization procedure, exact diagnostic criteria, inter-rater reliability of the clinician panel, or methods for handling missing data. These details are essential to substantiate the blinded randomized comparison and are load-bearing for the reported statistical results.
  Authors: We agree that these methodological details are important and currently absent from the abstract. We will revise the abstract to include information on the randomization procedure, exact diagnostic criteria, inter-rater reliability of the clinician panel, and methods for handling missing data. revision: yes
- Referee: [Abstract] The comparison involves 'independent clinicians given the same dialogue,' but no information is provided on the selection, training, or number of these clinicians, nor on how the expert panel's annotations were aggregated. This omission risks undermining the reliability of the accuracy metric.
  Authors: We concur that details on the independent clinicians and the aggregation of panel annotations are missing from the abstract. We will revise the abstract to include information on the selection, training, and number of these clinicians, and on how the expert panel's annotations were aggregated. revision: yes
Circularity Check
No circularity: empirical randomized comparison against external labels
full rationale
The paper's central claims rest on a blinded randomized study (N=13,917) that directly measures SymptomAI DDx accuracy against independent clinician judgments and expert-panel annotations on self-reported diagnoses. The reported OR=2.56 and agentic-strategy superiority (p<0.001) are computed from these external comparisons, not from any equations, fitted parameters, or self-citations that reduce the result to the inputs by construction. Secondary use of AI-generated labels for wearable-metric associations is explicitly caveated as limited by self-reported ground truth and does not feed back into the primary accuracy claims. No derivation chain, ansatz, or uniqueness theorem is invoked.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Clinician panel annotations on dialogue transcripts provide a valid proxy for true diagnostic accuracy.
Reference graph
Works this paper leans on
- [1] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775.
- [3] H. Hayat, M. Kudrautsau, E. Makarov, V. Melnichenko, T. Tsykunou, P. Varaksin, M. Pavelle, and A. Z. Oskowitz. Toward the autonomous AI doctor: Quantitative benchmarking of an autonomous agentic AI versus board-certified clinicians in a real world setting. arXiv preprint arXiv:2507.22902.
- [4] R. Heumann and S. R. Steinhubl. Associations between online search trends and outpatient visits for common medical symptoms in the United States from 2004 to 2019: Time series ecological study. JMIR Formative Research, 9(1):e77274.
- [5] D. McDuff, M. Schaekermann, T. Tu, A. Palepu, A. Wang, J. Garrison, K. Singhal, Y. Sharma, S. Azizi, K. Kulkarni, et al. Towards accurate differential diagnosis with large language models. Nature, 642(8067):451–457.
- [7] A. Palepu, V. Dhillon, P. Niravath, W.-H. Weng, P. Prasad, K. Saab, R. Tanno, Y. Cheng, H. Mai, E. Burns, et al. Exploring large language models for specialist-level oncology care. NEJM AI, 2(11):AIcs2500025, 2025a. A. Palepu, V. Liévin, W.-H. Weng, K. Saab, D. Stutz, Y. Cheng, K. Kulkarni, S. S. Mahdavi, J. Barral, D. R. Webster, et al. Towards conversat…
- [8] K. Saab, T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi, et al. Capabilities of Gemini models in medicine. arXiv preprint arXiv:2404.18416.
- [11] https://www.anthropic.com/research/claude-personal-guidance. Accessed: 2026-05-02.
- [12] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
discussion (0)