pith. machine review for the scientific record. sign in

arxiv: 2604.15331 · v1 · submitted 2026-03-09 · 💻 cs.HC · cs.AI· cs.CY

Recognition: no theorem link

How people use Copilot for Health

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CY
keywords health conversationsconversational AIuser intent taxonomypersonal symptom assessmentcaregiving queriesdevice differenceshealthcare navigationevening usage
0
0 comments X

The pith

Analysis of over 500,000 health conversations shows nearly one in five involve personal symptom assessment or condition discussion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines more than 500,000 de-identified conversations with Microsoft Copilot to map what people ask conversational AI about health. It builds a taxonomy of 12 intent categories through LLM classification validated by human experts and applies topic clustering within each category. The work finds that personal health intent is common, one in seven personal queries concern someone else, use rises sharply at night, and patterns differ by device while many queries address navigating healthcare systems. These observations matter because they indicate AI is already handling real health and caregiving needs during times when traditional care is least available.

Core claim

Using a hierarchical intent taxonomy of 12 primary categories developed via privacy-preserving LLM-based classification and validated by expert human annotation, together with LLM-driven topic clustering, the study finds that nearly one in five conversations involve personal symptom assessment or condition discussion. The dominant general information category at 40 percent still concentrates on specific treatments and conditions. One in seven personal health queries concern someone other than the user. Personal symptom and emotional health queries increase in evening and nighttime hours. Usage diverges by device with mobile focused on personal concerns and desktop on professional work. A key

What carries the argument

The hierarchical intent taxonomy of 12 primary categories, built through privacy-preserving LLM classification validated against expert annotation and combined with LLM-driven topic clustering to group prevalent themes within each intent.

Load-bearing premise

The privacy-preserving LLM classification and topic clustering accurately reflect users' true intents and topics without systematic bias from model limits or de-identification.

What would settle it

A large manual annotation by health experts of a random sample of conversations that yields substantially different proportions across the 12 intent categories than the LLM results.

Figures

Figures reproduced from arXiv: 2604.15331 by Bay Gross, Beatriz Costa-Gomes, Christopher Kelly, Dominic King, Eloise Taysom, Hannah Richardson, Harsha Nori, Matthew M Nour, Michael Bhaskar, Mustafa Suleyman, Pavel Tolmachev, Peter Hames, Philipp Schoenegger, Samuel F. Way, Seth Spielman, Viknesh Sounderajah, Xiaoxuan Liu, Yash Shah.

Figure 1
Figure 1. Figure 1: Distribution of health intent usage, in percentage of conversations, for the entire dataset. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average percentage of mobile vs desktop health conversations, throughout the day. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average percentage of conversations per intent on mobile (block color) vs desktop (striped). [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Health intent patterns averaged by hour of day, on desktop. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Health intent patterns averaged by hour of day, on mobile. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Temporal changes of intent usage, relative to the morning. The top graph shows the intents [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Percentage of conversations on three intents (symptom questions, condition information and [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
read the original abstract

We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-driven topic-clustering for prevalent themes within each intent. Using this taxonomy, we characterize the intents and topics behind health queries, identify who these queries are about, and analyze how usage varies by device and time of day. Five findings stand out. First, nearly one in five conversations involve personal symptom assessment or condition discussion, and even the dominant general information category (40%) is concentrated on specific treatments and conditions, suggesting that this is a lower bound on personal health intent. Second, one in seven of these personal health queries concern someone other than the user, such as a child, a parent, a partner, suggesting that conversational AI can be a caregiving tool, not just a personal one. Third, personal queries about symptoms and emotional health queries increase markedly in the evening and nighttime hours, when traditional healthcare is most limited. Fourth, usage diverges sharply by device: mobile concentrates on personal health concerns, while desktop is dominated by professional and academic work. Fifth, a substantial share of queries focuses on navigating healthcare systems such as finding providers, and understanding insurance, highlighting friction in the delivery of existing healthcare. These patterns have direct implications for platform-specific design, safety considerations, and the responsible development of health AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026. It develops a hierarchical intent taxonomy of 12 primary categories via privacy-preserving LLM-based classification validated against expert human annotation, applies LLM-driven topic clustering within each category, and characterizes intents, the subjects of queries (user vs. others), and variations by device and time of day. Five findings are highlighted: nearly one in five conversations involve personal symptom assessment or condition discussion (with the 40% general-information category treated as a lower bound on personal health intent); one in seven personal queries concern others (e.g., child, parent); personal symptom and emotional-health queries rise in evening/night hours; mobile usage concentrates on personal concerns while desktop usage is dominated by professional/academic work; and a substantial share addresses healthcare-system navigation such as finding providers and understanding insurance.

Significance. If the classification accuracy holds, the work supplies large-scale, real-world observational evidence on conversational-AI health use that is currently scarce. The scale of the dataset, the distinction between personal and proxy queries, the temporal and device-specific patterns, and the identification of healthcare-friction topics provide concrete inputs for platform design, safety guardrails, and policy discussions around health AI. The purely descriptive nature avoids circularity and supplies falsifiable prevalence estimates that future studies can replicate or refute.

major comments (1)
  1. [Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.
minor comments (2)
  1. [Abstract] Abstract: The time window 'January 2026' post-dates the present; confirm whether this is a typographical error (e.g., 2024 or 2025) or an intended future projection.
  2. [Abstract] Abstract: The 12-category hierarchical taxonomy is referenced but neither enumerated nor linked to a table or figure; a brief listing or pointer would improve accessibility for readers who do not consult the full methods.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive recommendation of minor revision and for highlighting the importance of transparent reporting on the classification validation. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.

    Authors: We agree that the abstract would be strengthened by including key quantitative validation metrics to allow readers to assess reliability directly. The full details of the validation (annotation sample size of 500 conversations, Cohen's kappa of 0.82, per-category agreement rates, and error analysis on de-identified samples) are already reported in the Methods section. In the revised manuscript we will add a concise statement to the abstract summarizing these metrics (e.g., 'validated on 500 expert-annotated conversations with Cohen's kappa = 0.82'). This change improves transparency without altering the underlying findings or interpretations. revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive observational analysis

full rationale

The paper is a purely observational study that applies an LLM-based classifier (validated by human annotation) to produce a 12-category taxonomy and then directly counts and describes the resulting category shares, topics, device differences, and temporal patterns across 500k conversations. No equations, fitted parameters, or derived predictions appear; the reported shares (e.g., 19% personal health, 40% general information) are simple empirical frequencies from the classified corpus rather than outputs of any model that was itself trained or constrained on those same frequencies. No self-citation chain is invoked to justify uniqueness or to close a definitional loop. The analysis is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical observational study whose central claims rest on the accuracy of automated labeling rather than new theoretical constructs or fitted parameters.

axioms (2)
  • domain assumption The 12-category hierarchical intent taxonomy comprehensively covers the space of health-related queries to Copilot.
    Developed via LLM and validated by experts, but the abstract does not demonstrate exhaustiveness against all possible health intents.
  • domain assumption LLM-driven classification and topic clustering produce labels that match human expert judgment at a level sufficient for the reported percentages.
    Validation is mentioned but no agreement statistics or error analysis appear in the abstract.

pith-pipeline@v0.9.0 · 5635 in / 1403 out tokens · 53575 ms · 2026-05-15T14:47:46.731197+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

    Bean, Andrew M et al. (2026). “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study”. In:Nature Medicine, pp. 1–7

  2. [2]

    Brodeur, Peter G. et al. (2025).Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv:2412.10849 [cs.AI].url:https://arxiv.org/abs/2412.10849

  3. [3]

    Chatterji, Aaron et al. (Sept. 2025).How People Use ChatGPT. Working Paper 34255. National Bureau of Economic Research.doi:10.3386/w34255.url:http://www.nber.org/papers/w34255

  4. [4]

    (2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage

    Costa-Gomes, Beatriz et al. (2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage. arXiv:2512.11879 [cs.CY].url:https://arxiv.org/abs/2512.11879

  5. [5]

    How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews

    Eysenbach, Gunther and Christian Köhler (2002). “How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews”. In:BMJ324.7337, pp. 573–577.issn: 0959-8138.doi: 10.1136/bmj.324.7337.573 . eprint: https://www.bmj.com/content/324/7337/573.full.pdf .url: http...

  6. [6]

    (2024).To do no harm—and the most good—with AI in health care

    Goldberg, Carey Beth et al. (2024).To do no harm—and the most good—with AI in health care

  7. [7]

    Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures

    Golder, Scott A. and Michael W. Macy (2011). “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures”. In:Science333.6051, pp. 1878–1881.doi: 10 . 1126 / science.1202775. eprint: https://www.science.org/doi/pdf/10.1126/science.1202775 .url: https://www.science.org/doi/abs/10.1126/science.1202775

  8. [8]

    Large language models for chatbot health advice studies: a systematic review

    Huo, Bright et al. (2025). “Large language models for chatbot health advice studies: a systematic review”. In:JAMA Network Open8.2, e2457879

  9. [9]

    Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

    Kung, Tiffany H. et al. (Feb. 2023). “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models”. In:PLOS Digital Health2.2, pp. 1–12.doi:10. 1371/journal.pdig.0000198.url:https://doi.org/10.1371/journal.pdig.0000198

  10. [10]

    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine

    Lee, Peter, Sebastien Bubeck, and Joseph Petro (2023). “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine”. In:New England Journal of Medicine388.13, pp. 1233–1239.doi: 10.1056/NEJMsr2214184. eprint: https://www.nejm.org/doi/pdf/10.1056/NEJMsr2214184.url: https://www.nejm.org/doi/full/10.1056/NEJMsr2214184. Lizée, Antoine et al. (2024). “...

  11. [11]

    (June 26, 2025).How People Use Claude for Support, Advice, and Companionship

    McCain, Miles et al. (June 26, 2025).How People Use Claude for Support, Advice, and Companionship. url: https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and- companionship

  12. [12]

    Sequential Diagnosis with Language Models

    Nori, Harsha, Mayank Daswani, et al. (2025). “Sequential Diagnosis with Language Models”. In: arXiv: 2506.22405 [cs.CL].url:https://arxiv.org/abs/2506.22405

  13. [13]

    Capabilities of GPT-4 on Medical Challenge Problems

    Nori, Harsha, Nicholas King, et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems”. In: arXiv:2303.13375 [cs.CL].url:https://arxiv.org/abs/2303.13375

  14. [14]

    From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

    Nori, Harsha, Naoto Usuyama, et al. (2024). “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond”. In: arXiv:2411.03590 [cs.CL].url: https://arxiv. org/abs/2411.03590

  15. [15]

    ChatGPT Health performance in a structured test of triage recom- mendations

    Ramaswamy, Ashwin et al. (2026). “ChatGPT Health performance in a structured test of triage recom- mendations”. In:Nature Medicine

  16. [16]

    What is Artificial Intelligence (AI)“Empathy

    Ruben, Mollie A, Danielle Blanch-Hartigan, and Judith A Hall (2025). “What is Artificial Intelligence (AI)“Empathy”? A Study Comparing ChatGPT and Physician Responses on an Online Forum: Ruben et al.” In:Journal of General Internal Medicine, pp. 1–8

  17. [17]

    The superconducting quasicharge qubit,

    Singhal, Karan et al. (Aug. 2023). “Large language models encode clinical knowledge”. In:Nature620.7972, pp. 172–180.issn: 1476-4687.doi: 10.1038/s41586- 023- 06291- 2.url: https://doi.org/10. 1038/s41586-023-06291-2

  18. [18]

    Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review

    Tan, Sharon Swee-Lin and Nadee Goonawardene (Jan. 2017). “Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review”. In:J Med Internet Res19.1, e9.issn: 1438-8871.doi:10.2196/jmir.5729.url:http://www.ncbi.nlm.nih.gov/pubmed/28104579. 11

  19. [19]

    High-performance medicine: the convergence of human and artificial intelli- gence

    Topol, Eric J. (Jan. 2019). “High-performance medicine: the convergence of human and artificial intelli- gence”. In:Nature Medicine25.1, pp. 44–56.issn: 1546-170X.doi:10.1038/s41591-018-0300-7. url:https://doi.org/10.1038/s41591-018-0300-7

  20. [20]

    Tnt-llm: Text mining at scale with large language models

    Wan, Mengting et al. (2024). “Tnt-llm: Text mining at scale with large language models”. In:Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 5836–5847. 12