arxiv: 2603.18294 · v2 · submitted 2026-03-18 · 💻 cs.AI

The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Pith reviewed 2026-05-15 08:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords: health AI, LLM evaluation, benchmark composition, clinical queries, validity gap, query taxonomy, AI safety, clinical alignment

The pith

Health AI benchmarks exhibit a structural validity gap, with query composition misaligned to real clinical needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines 18,707 consumer health queries drawn from six public benchmarks and applies a standardized taxonomy to profile their clinical content. The analysis reveals that while benchmarks have shifted toward interactive formats, they still contain little raw diagnostic material such as lab values or medical records, almost no safety-critical topics, and minimal coverage of chronic disease management or vulnerable age groups. Aggregate accuracy scores on these benchmarks therefore risk overstating how prepared models are for actual patient encounters that involve full diagnostic complexity and longitudinal care.

Core claim

Applying a 16-field taxonomy to the full corpus shows that 42 percent of queries reference objective data, but these references skew heavily toward wellness wearables: laboratory values appear in only 5.2 percent of queries, imaging in 3.8 percent, and raw medical records in 0.6 percent. Suicide or self-harm queries make up less than 0.7 percent, chronic disease management 5.5 percent, and pediatric and older-adult queries together less than 11 percent. Together these figures establish a persistent misalignment between benchmark composition and the requirements of clinical practice.
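To give the percentages a sense of scale, the short sketch below converts them into approximate query counts. Only the corpus size (18,707) and the percentages come from the paper's abstract; the rounding is an assumption of this sketch, so the counts are estimates rather than values the authors report.

```python
# Approximate query counts implied by the reported percentages.
# CORPUS_SIZE and the percentages are taken from the abstract; the rounded
# counts are our own estimates, not figures reported by the authors.
CORPUS_SIZE = 18_707

reported_pct = {
    "objective data (any)": 42.0,
    "wearable signals": 17.7,
    "laboratory values": 5.2,
    "imaging": 3.8,
    "raw medical records": 0.6,
    "chronic disease management": 5.5,
}


def implied_count(pct: float, total: int = CORPUS_SIZE) -> int:
    """Round a percentage of the corpus to the nearest whole query."""
    return round(total * pct / 100)


implied = {name: implied_count(pct) for name, pct in reported_pct.items()}

for name, count in implied.items():
    print(f"{name:28s} ~{count:>6,d} queries")
```

On this rounding, the 0.6 percent share for raw medical records corresponds to roughly 112 of the 18,707 queries.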

What carries the argument

The 16-field taxonomy that classifies each query by clinical context, topic, and intent, applied automatically by LLMs to enable scalable, standardized profiling across benchmarks.
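Once each query carries taxonomy tags, profiling reduces to counting. The sketch below covers only that tallying step, not the LLM coding itself. The field names (`benchmark`, `intent`), the benchmark names, and the values are hypothetical placeholders, since this summary does not list the paper's 16 fields.

```python
from collections import Counter, defaultdict

# Hypothetical tagged records. The real taxonomy has 16 fields whose names are
# not given in this summary; "benchmark" and "intent" are placeholders.
tagged = [
    {"benchmark": "bench_a", "intent": "education"},
    {"benchmark": "bench_a", "intent": "education"},
    {"benchmark": "bench_a", "intent": "symptom_check"},
    {"benchmark": "bench_b", "intent": "symptom_check"},
    {"benchmark": "bench_b", "intent": "medication_information"},
]


def composition(records, field, group_by="benchmark"):
    """Return the share of each value of `field` within each group."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_by]][rec[field]] += 1
    return {
        group: {value: n / sum(c.values()) for value, n in c.items()}
        for group, c in counts.items()
    }


profile = composition(tagged, "intent")
for bench, shares in profile.items():
    print(bench, {value: f"{share:.0%}" for value, share in shares.items()})
```

Grouping by benchmark before normalizing keeps a large benchmark from masking the composition of a small one, which matters when the corpus pools several sources.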

If this is right

  • Benchmark creators must adopt standardized query-profiling methods comparable to clinical-trial inclusion reporting.
  • Future benchmarks require substantially higher fractions of raw diagnostic inputs and longitudinal chronic-care scenarios.
  • Safety-critical topics such as self-harm and management of vulnerable populations must be represented at rates closer to real clinical prevalence.
  • Aggregate performance metrics on current benchmarks cannot be treated as reliable indicators of readiness for clinical deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap implies that reported model accuracy may decline sharply when queries shift from wellness-style inputs to full diagnostic records.
  • Synthetic query generation pipelines could be tested against the same taxonomy to measure how well they close the identified composition shortfalls.
  • Regulatory review of health AI tools might incorporate mandatory benchmark-composition audits before approval.
  • Extending the taxonomy to private hospital query logs would allow direct comparison of public benchmark realism against actual clinical workloads.

Load-bearing premise

The LLM-driven application of the 16-field taxonomy produces classifications that are sufficiently accurate and free of systematic bias.

What would settle it

A human re-coding of a random sample of several hundred queries that yields materially different category proportions, especially for raw clinical artifacts or safety-critical content.
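Whether a sample of a few hundred queries can settle the rare categories depends on how wide the resulting interval is. The sketch below computes a Wilson score interval for a proportion; the sample outcome (2 hits in 300 queries) is a hypothetical illustration, not a result from the paper.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical re-coding outcome: 2 of 300 sampled queries contain raw records.
low, high = wilson_interval(hits=2, n=300)
print(f"observed {2 / 300:.2%}, 95% interval {low:.2%} to {high:.2%}")
```

With so few expected hits, the interval is wide relative to the reported 0.6 percent. A re-coding exercise aimed at the rarest categories would likely need a larger sample, or a sample stratified toward those categories.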

Figures

Figures reproduced from arXiv: 2603.18294 by Alvin Rajkomar, Angela Lai, Lily Peng, Pavan Sudarshan.

Figure 1. CONSORT-style flow diagram showing study progression from initial dataset assessment ...
Figure 2. Demographic and geographic representation gaps. (A) Age distribution: benchmarks ...
Figure 3. Benchmark intent distribution compared to real-world office-based physician visit reasons ...
Figure 4. Behavioral health crisis scenarios by benchmark generation. Crisis conditions (suicidal ...
Original abstract

Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.

Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.

Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs.

Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling--analogous to clinical trial reporting--to align evaluation with the full complexity of clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper analyzes 18,707 consumer health queries across six public benchmarks by applying a 16-field taxonomy via LLM coding to profile context, topic, and intent. It reports a structural validity gap: 42% reference objective data (mostly wellness wearables at 17.7%), but complex diagnostics are rare (laboratory values 5.2%, imaging 3.8%, raw records 0.6%), safety-critical content is nearly absent (<0.7% suicide/self-harm), chronic management is low (5.5%), and vulnerable populations are underrepresented (<11%). The central claim is that benchmarks remain misaligned with real-world clinical needs and that standardized query profiling is required.

Significance. If the classifications hold, the work provides a useful empirical baseline quantifying misalignment between health AI benchmarks and clinical practice, including underrepresentation of raw records, chronic scenarios, and vulnerable groups. The direct, non-circular analysis of public data offers a concrete taxonomy and cross-sectional snapshot that could inform benchmark design, though its impact depends on the reliability of the automated coding step.

major comments (1)
  1. [Methods] The 16-field taxonomy is applied exclusively via LLM to all 18,707 queries with no reported human validation, inter-rater reliability, gold-standard subsample, confusion matrix, or sensitivity analysis. All headline percentages (42% objective data, 0.6% raw records, 5.5% chronic management, <0.7% suicide/self-harm, <11% vulnerable populations) rest on this step; unquantified LLM bias or error would directly alter the reported validity-gap magnitudes.
minor comments (1)
  1. [Abstract, Results] The six specific benchmarks are not named, which reduces reproducibility and context for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for highlighting the importance of validating the automated coding pipeline. We address the single major comment below and commit to strengthening the methods section accordingly.

Point-by-point responses
  1. Referee: [Methods] The 16-field taxonomy is applied exclusively via LLM to all 18,707 queries with no reported human validation, inter-rater reliability, gold-standard subsample, confusion matrix, or sensitivity analysis. All headline percentages (42% objective data, 0.6% raw records, 5.5% chronic management, <0.7% suicide/self-harm, <11% vulnerable populations) rest on this step; unquantified LLM bias or error would directly alter the reported validity-gap magnitudes.

    Authors: We agree that the absence of human validation for the LLM coding step is a limitation of the current manuscript. In the revised version we will add a human validation protocol: two independent annotators (a clinician and an AI researcher) will label a stratified random subsample of 1,000 queries (approximately 5.3% of the corpus). We will report Cohen’s kappa for inter-rater reliability, LLM–human agreement rates per field, and a confusion matrix for the key binary fields that drive the headline statistics. We will also include a sensitivity analysis that re-computes the primary percentages after (a) excluding low-confidence LLM predictions and (b) using an alternative LLM. These additions will be placed in the Methods and Results sections. We do not expect them to change the overall direction of the validity-gap findings, although the reported magnitudes may shift.

    Revision: yes
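The agreement statistic named in the rebuttal can be sketched in a few lines. This is a minimal illustration of Cohen's kappa for two raters on one field, not the authors' validation code; the toy labels and the field name "objective data present" are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels for one binary field ("objective data present").
human = ["yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"]
model = ["yes", "no", "no", "no", "no", "no", "no", "no", "no", "no"]

kappa = cohens_kappa(human, model)
agreement = sum(h == m for h, m in zip(human, model)) / len(human)
print(f"raw agreement {agreement:.2f}, kappa {kappa:.2f}")
```

In this toy case raw agreement is 0.90 while kappa is about 0.62, because most of the agreement on a skewed field is expected by chance. The same effect would apply to the paper's rarest categories, so per-field agreement rates and kappa are worth reading together.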

Circularity Check

0 steps flagged

No circularity: direct empirical counts from taxonomy application

The paper conducts a descriptive cross-sectional analysis of 18,707 existing benchmark queries by applying a 16-field taxonomy via LLM coding. No derivations, equations, fitted parameters, or predictions are present that could reduce to the inputs by construction. The reported percentages (e.g., 42% objective data, 0.6% raw records) are direct outputs of the classification process on public data. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the central claims. The analysis is self-contained as an observational study of benchmark composition with no self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the validity of the chosen 16-field taxonomy and the reliability of LLM coding as a proxy for manual expert review; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: The 16-field taxonomy accurately and comprehensively captures the clinically relevant dimensions of consumer health queries.
    The taxonomy is applied directly to code all queries, and the abstract reports no validation against expert human labels.

pith-pipeline@v0.9.0 · 5570 in / 1194 out tokens · 37529 ms · 2026-05-15T08:12:15.795289+00:00 · methodology
