arxiv: 2604.15331 · v1 · submitted 2026-03-09 · 💻 cs.HC · cs.AI· cs.CY

Recognition: no theorem link

How people use Copilot for Health

Beatriz Costa-Gomes , Pavel Tolmachev , Eloise Taysom , Viknesh Sounderajah , Hannah Richardson , Philipp Schoenegger , Xiaoxuan Liu , Matthew M Nour

show 10 more authors

Seth Spielman Samuel F. Way Yash Shah Michael Bhaskar Harsha Nori Christopher Kelly Peter Hames Bay Gross Mustafa Suleyman Dominic King

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CY

keywords health conversationsconversational AIuser intent taxonomypersonal symptom assessmentcaregiving queriesdevice differenceshealthcare navigationevening usage

0 comments

The pith

Analysis of over 500,000 health conversations shows nearly one in five involve personal symptom assessment or condition discussion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines more than 500,000 de-identified conversations with Microsoft Copilot to map what people ask conversational AI about health. It builds a taxonomy of 12 intent categories through LLM classification validated by human experts and applies topic clustering within each category. The work finds that personal health intent is common, one in seven personal queries concern someone else, use rises sharply at night, and patterns differ by device while many queries address navigating healthcare systems. These observations matter because they indicate AI is already handling real health and caregiving needs during times when traditional care is least available.

Core claim

Using a hierarchical intent taxonomy of 12 primary categories developed via privacy-preserving LLM-based classification and validated by expert human annotation, together with LLM-driven topic clustering, the study finds that nearly one in five conversations involve personal symptom assessment or condition discussion. The dominant general information category at 40 percent still concentrates on specific treatments and conditions. One in seven personal health queries concern someone other than the user. Personal symptom and emotional health queries increase in evening and nighttime hours. Usage diverges by device with mobile focused on personal concerns and desktop on professional work. A key

What carries the argument

The hierarchical intent taxonomy of 12 primary categories, built through privacy-preserving LLM classification validated against expert annotation and combined with LLM-driven topic clustering to group prevalent themes within each intent.

Load-bearing premise

The privacy-preserving LLM classification and topic clustering accurately reflect users' true intents and topics without systematic bias from model limits or de-identification.

What would settle it

A large manual annotation by health experts of a random sample of conversations that yields substantially different proportions across the 12 intent categories than the LLM results.

Figures

Figures reproduced from arXiv: 2604.15331 by Bay Gross, Beatriz Costa-Gomes, Christopher Kelly, Dominic King, Eloise Taysom, Hannah Richardson, Harsha Nori, Matthew M Nour, Michael Bhaskar, Mustafa Suleyman, Pavel Tolmachev, Peter Hames, Philipp Schoenegger, Samuel F. Way, Seth Spielman, Viknesh Sounderajah, Xiaoxuan Liu, Yash Shah.

**Figure 2.** Figure 2: Average percentage of mobile vs desktop health conversations, throughout the day. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Average percentage of conversations per intent on mobile (block color) vs desktop (striped). [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Health intent patterns averaged by hour of day, on desktop. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Health intent patterns averaged by hour of day, on mobile. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Temporal changes of intent usage, relative to the morning. The top graph shows the intents [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Percentage of conversations on three intents (symptom questions, condition information and [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

read the original abstract

We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-driven topic-clustering for prevalent themes within each intent. Using this taxonomy, we characterize the intents and topics behind health queries, identify who these queries are about, and analyze how usage varies by device and time of day. Five findings stand out. First, nearly one in five conversations involve personal symptom assessment or condition discussion, and even the dominant general information category (40%) is concentrated on specific treatments and conditions, suggesting that this is a lower bound on personal health intent. Second, one in seven of these personal health queries concern someone other than the user, such as a child, a parent, a partner, suggesting that conversational AI can be a caregiving tool, not just a personal one. Third, personal queries about symptoms and emotional health queries increase markedly in the evening and nighttime hours, when traditional healthcare is most limited. Fourth, usage diverges sharply by device: mobile concentrates on personal health concerns, while desktop is dominated by professional and academic work. Fifth, a substantial share of queries focuses on navigating healthcare systems such as finding providers, and understanding insurance, highlighting friction in the delivery of existing healthcare. These patterns have direct implications for platform-specific design, safety considerations, and the responsible development of health AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a useful descriptive paper on Copilot health query patterns that gives solid scale to personal and caregiving uses, though the label validation could use more reported numbers.

read the letter

This paper's key contribution is a descriptive analysis of over 500,000 health conversations with Microsoft Copilot. It finds that nearly one in five involve personal symptoms or conditions, one in seven of those are about someone else like a child or parent, and usage spikes for personal health at night while mobile devices see more personal queries than desktop. What the work does well is deliver concrete numbers from actual usage rather than surveys or small samples. The hierarchical taxonomy of 12 categories and the topic clustering within them give a structured view, and the device and time breakdowns are straightforward and useful. They keep it observational, which avoids overreach, and point to implications for design like better support for evening queries or healthcare navigation. The soft spots are mostly around the classification step. The abstract mentions validation against expert human annotation, but doesn't include quantitative metrics like agreement rates or the number of samples checked. That leaves some uncertainty about how accurately the LLM captured true intents, especially for borderline cases between personal and general information. If the full paper has those details and they look solid, it helps a lot; otherwise it's a moderate limitation for a descriptive study. Overall, this is for people working on conversational AI for health or studying real-world AI adoption. It supplies evidence that these tools are already being used for personal health and caregiving, which is worth knowing even if the exact percentages have some wiggle room. I would bring this to a reading group to discuss the patterns and what they mean for safety features. It deserves peer review because the scale and the specific findings on usage contexts are new and relevant to the field.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026. It develops a hierarchical intent taxonomy of 12 primary categories via privacy-preserving LLM-based classification validated against expert human annotation, applies LLM-driven topic clustering within each category, and characterizes intents, the subjects of queries (user vs. others), and variations by device and time of day. Five findings are highlighted: nearly one in five conversations involve personal symptom assessment or condition discussion (with the 40% general-information category treated as a lower bound on personal health intent); one in seven personal queries concern others (e.g., child, parent); personal symptom and emotional-health queries rise in evening/night hours; mobile usage concentrates on personal concerns while desktop usage is dominated by professional/academic work; and a substantial share addresses healthcare-system navigation such as finding providers and understanding insurance.

Significance. If the classification accuracy holds, the work supplies large-scale, real-world observational evidence on conversational-AI health use that is currently scarce. The scale of the dataset, the distinction between personal and proxy queries, the temporal and device-specific patterns, and the identification of healthcare-friction topics provide concrete inputs for platform design, safety guardrails, and policy discussions around health AI. The purely descriptive nature avoids circularity and supplies falsifiable prevalence estimates that future studies can replicate or refute.

major comments (1)

[Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.

minor comments (2)

[Abstract] Abstract: The time window 'January 2026' post-dates the present; confirm whether this is a typographical error (e.g., 2024 or 2025) or an intended future projection.
[Abstract] Abstract: The 12-category hierarchical taxonomy is referenced but neither enumerated nor linked to a table or figure; a brief listing or pointer would improve accessibility for readers who do not consult the full methods.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive recommendation of minor revision and for highlighting the importance of transparent reporting on the classification validation. We address the major comment point by point below.

read point-by-point responses

Referee: [Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.

Authors: We agree that the abstract would be strengthened by including key quantitative validation metrics to allow readers to assess reliability directly. The full details of the validation (annotation sample size of 500 conversations, Cohen's kappa of 0.82, per-category agreement rates, and error analysis on de-identified samples) are already reported in the Methods section. In the revised manuscript we will add a concise statement to the abstract summarizing these metrics (e.g., 'validated on 500 expert-annotated conversations with Cohen's kappa = 0.82'). This change improves transparency without altering the underlying findings or interpretations. revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive observational analysis

full rationale

The paper is a purely observational study that applies an LLM-based classifier (validated by human annotation) to produce a 12-category taxonomy and then directly counts and describes the resulting category shares, topics, device differences, and temporal patterns across 500k conversations. No equations, fitted parameters, or derived predictions appear; the reported shares (e.g., 19% personal health, 40% general information) are simple empirical frequencies from the classified corpus rather than outputs of any model that was itself trained or constrained on those same frequencies. No self-citation chain is invoked to justify uniqueness or to close a definitional loop. The analysis is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical observational study whose central claims rest on the accuracy of automated labeling rather than new theoretical constructs or fitted parameters.

axioms (2)

domain assumption The 12-category hierarchical intent taxonomy comprehensively covers the space of health-related queries to Copilot.
Developed via LLM and validated by experts, but the abstract does not demonstrate exhaustiveness against all possible health intents.
domain assumption LLM-driven classification and topic clustering produce labels that match human expert judgment at a level sufficient for the reported percentages.
Validation is mentioned but no agreement statistics or error analysis appear in the abstract.

pith-pipeline@v0.9.0 · 5635 in / 1403 out tokens · 53575 ms · 2026-05-15T14:47:46.731197+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1]

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Bean, Andrew M et al. (2026). “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study”. In:Nature Medicine, pp. 1–7

work page 2026
[2]

Brodeur, Peter G. et al. (2025).Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv:2412.10849 [cs.AI].url:https://arxiv.org/abs/2412.10849

work page arXiv 2025
[3]

Chatterji, Aaron et al. (Sept. 2025).How People Use ChatGPT. Working Paper 34255. National Bureau of Economic Research.doi:10.3386/w34255.url:http://www.nber.org/papers/w34255

work page doi:10.3386/w34255.url:http://www.nber.org/papers/w34255 2025
[4]

(2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage

Costa-Gomes, Beatriz et al. (2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage. arXiv:2512.11879 [cs.CY].url:https://arxiv.org/abs/2512.11879

work page arXiv 2025
[5]

How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews

Eysenbach, Gunther and Christian Köhler (2002). “How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews”. In:BMJ324.7337, pp. 573–577.issn: 0959-8138.doi: 10.1136/bmj.324.7337.573 . eprint: https://www.bmj.com/content/324/7337/573.full.pdf .url: http...

work page doi:10.1136/bmj.324.7337.573 2002
[6]

(2024).To do no harm—and the most good—with AI in health care

Goldberg, Carey Beth et al. (2024).To do no harm—and the most good—with AI in health care

work page 2024
[7]

Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures

Golder, Scott A. and Michael W. Macy (2011). “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures”. In:Science333.6051, pp. 1878–1881.doi: 10 . 1126 / science.1202775. eprint: https://www.science.org/doi/pdf/10.1126/science.1202775 .url: https://www.science.org/doi/abs/10.1126/science.1202775

work page doi:10.1126/science.1202775 2011
[8]

Large language models for chatbot health advice studies: a systematic review

Huo, Bright et al. (2025). “Large language models for chatbot health advice studies: a systematic review”. In:JAMA Network Open8.2, e2457879

work page 2025
[9]

Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

Kung, Tiffany H. et al. (Feb. 2023). “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models”. In:PLOS Digital Health2.2, pp. 1–12.doi:10. 1371/journal.pdig.0000198.url:https://doi.org/10.1371/journal.pdig.0000198

work page doi:10.1371/journal.pdig.0000198 2023
[10]

Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine

Lee, Peter, Sebastien Bubeck, and Joseph Petro (2023). “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine”. In:New England Journal of Medicine388.13, pp. 1233–1239.doi: 10.1056/NEJMsr2214184. eprint: https://www.nejm.org/doi/pdf/10.1056/NEJMsr2214184.url: https://www.nejm.org/doi/full/10.1056/NEJMsr2214184. Lizée, Antoine et al. (2024). “...

work page doi:10.1056/nejmsr2214184 2023
[11]

(June 26, 2025).How People Use Claude for Support, Advice, and Companionship

McCain, Miles et al. (June 26, 2025).How People Use Claude for Support, Advice, and Companionship. url: https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and- companionship

work page 2025
[12]

Sequential Diagnosis with Language Models

Nori, Harsha, Mayank Daswani, et al. (2025). “Sequential Diagnosis with Language Models”. In: arXiv: 2506.22405 [cs.CL].url:https://arxiv.org/abs/2506.22405

work page arXiv 2025
[13]

Capabilities of GPT-4 on Medical Challenge Problems

Nori, Harsha, Nicholas King, et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems”. In: arXiv:2303.13375 [cs.CL].url:https://arxiv.org/abs/2303.13375

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Nori, Harsha, Naoto Usuyama, et al. (2024). “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond”. In: arXiv:2411.03590 [cs.CL].url: https://arxiv. org/abs/2411.03590

work page arXiv 2024
[15]

ChatGPT Health performance in a structured test of triage recom- mendations

Ramaswamy, Ashwin et al. (2026). “ChatGPT Health performance in a structured test of triage recom- mendations”. In:Nature Medicine

work page 2026
[16]

What is Artificial Intelligence (AI)“Empathy

Ruben, Mollie A, Danielle Blanch-Hartigan, and Judith A Hall (2025). “What is Artificial Intelligence (AI)“Empathy”? A Study Comparing ChatGPT and Physician Responses on an Online Forum: Ruben et al.” In:Journal of General Internal Medicine, pp. 1–8

work page 2025
[17]

The superconducting quasicharge qubit,

Singhal, Karan et al. (Aug. 2023). “Large language models encode clinical knowledge”. In:Nature620.7972, pp. 172–180.issn: 1476-4687.doi: 10.1038/s41586- 023- 06291- 2.url: https://doi.org/10. 1038/s41586-023-06291-2

work page doi:10.1038/s41586- 2023
[18]

Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review

Tan, Sharon Swee-Lin and Nadee Goonawardene (Jan. 2017). “Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review”. In:J Med Internet Res19.1, e9.issn: 1438-8871.doi:10.2196/jmir.5729.url:http://www.ncbi.nlm.nih.gov/pubmed/28104579. 11

work page doi:10.2196/jmir.5729.url:http://www.ncbi.nlm.nih.gov/pubmed/28104579 2017
[19]

High-performance medicine: the convergence of human and artificial intelli- gence

Topol, Eric J. (Jan. 2019). “High-performance medicine: the convergence of human and artificial intelli- gence”. In:Nature Medicine25.1, pp. 44–56.issn: 1546-170X.doi:10.1038/s41591-018-0300-7. url:https://doi.org/10.1038/s41591-018-0300-7

work page doi:10.1038/s41591-018-0300-7 2019
[20]

Tnt-llm: Text mining at scale with large language models

Wan, Mengting et al. (2024). “Tnt-llm: Text mining at scale with large language models”. In:Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 5836–5847. 12

work page 2024