The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
Pith reviewed 2026-05-15 08:12 UTC · model grok-4.3
The pith
Health AI benchmarks exhibit a structural validity gap, with query composition misaligned to real clinical needs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying a 16-field taxonomy to the full corpus of 18,707 queries shows that 42 percent of queries reference objective data, but these references skew heavily toward wellness wearables (17.7 percent of the corpus). Laboratory values appear in only 5.2 percent of queries, imaging in 3.8 percent, and raw medical records in 0.6 percent. Suicide or self-harm queries make up less than 0.7 percent, chronic disease management only 5.5 percent, and pediatric and older-adult queries together less than 11 percent. The paper reads these figures as a persistent misalignment between benchmark composition and the requirements of clinical practice.
What carries the argument
The 16-field taxonomy that classifies each query by clinical context, topic, and intent, applied automatically by LLMs to enable scalable, standardized profiling across benchmarks.
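As a rough illustration of what standardized profiling could look like in code, the sketch below attaches a few taxonomy tags to each query and tallies category shares. The category labels are borrowed from the dimension names in the paper's supplementary tables; the `TaggedQuery` class, the two helper functions, and the four sample records are hypothetical, only a subset of the 16 fields is modeled, and the LLM tagging step itself is not shown.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaggedQuery:
    """One query plus a few taxonomy tags (a subset of the 16 fields)."""
    text: str
    intent: str
    population: str
    objective_data_types: List[str] = field(default_factory=list)


def category_shares(queries: List[TaggedQuery], attr: str) -> Dict[str, float]:
    """Share of queries per category for one single-valued field."""
    counts = Counter(getattr(q, attr) for q in queries)
    return {category: n / len(queries) for category, n in counts.items()}


def share_with_data_type(queries: List[TaggedQuery], data_type: str) -> float:
    """Share of queries whose tags include the given objective data type."""
    hits = sum(data_type in q.objective_data_types for q in queries)
    return hits / len(queries)


# Hypothetical, hand-written records standing in for LLM-produced tags.
sample = [
    TaggedQuery("My watch says my resting heart rate went up.",
                "General Health Advice", "Adult (Unspecified)", ["Vitals (Wearable)"]),
    TaggedQuery("What does an ALT of 85 mean?",
                "Tests and Results", "Adult (Unspecified)", ["Labs"]),
    TaggedQuery("My toddler has had a rash for three days.",
                "Symptom Check", "Pediatric (Under 5)"),
    TaggedQuery("What is eczema?", "Education", "Adult (Unspecified)"),
]

print(category_shares(sample, "population"))
print(share_with_data_type(sample, "Labs"))
```

Keeping the tags in a typed record makes it straightforward to report the same shares for every benchmark, which is the kind of comparability the taxonomy is meant to provide.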
If this is right
- Benchmark creators must adopt standardized query-profiling methods comparable to clinical-trial inclusion reporting.
- Future benchmarks require substantially higher fractions of raw diagnostic inputs and longitudinal chronic-care scenarios.
- Safety-critical topics such as self-harm and management of vulnerable populations must be represented at rates closer to real clinical prevalence.
- Aggregate performance metrics on current benchmarks cannot be treated as reliable indicators of readiness for clinical deployment.
Where Pith is reading between the lines
- The gap implies that reported model accuracy may decline sharply when queries shift from wellness-style inputs to full diagnostic records.
- Synthetic query generation pipelines could be tested against the same taxonomy to measure how well they close the identified composition shortfalls.
- Regulatory review of health AI tools might incorporate mandatory benchmark-composition audits before approval.
- Extending the taxonomy to private hospital query logs would allow direct comparison of public benchmark realism against actual clinical workloads.
Load-bearing premise
The LLM-driven application of the 16-field taxonomy produces classifications that are sufficiently accurate and free of systematic bias.
What would settle it
A human re-coding of a random sample of several hundred queries that yields materially different category proportions, especially for raw clinical artifacts or safety-critical content.
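To get a feel for how much a re-coding sample of that size could resolve, the sketch below computes Wilson score intervals for the reported shares. The sample size of 300 is an assumption (one reading of "several hundred"), and the interval method is a standard choice for small proportions rather than anything the paper specifies.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Corpus-level shares as reported in the abstract.
reported = {
    "laboratory values": 0.052,
    "imaging": 0.038,
    "raw medical records": 0.006,
}

sample_size = 300  # assumption, not a figure from the paper
for name, share in reported.items():
    expected_hits = round(share * sample_size)
    low, high = wilson_interval(expected_hits, sample_size)
    print(f"{name}: about {expected_hits} of {sample_size} expected, "
          f"95% interval {low:.3f} to {high:.3f}")
```

For the rarest categories a simple random sample of this size would contain only a handful of positive cases, so a re-coding study may need to oversample those categories to estimate them with useful precision.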
Original abstract
Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.
Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.
Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs.
Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling--analogous to clinical trial reporting--to align evaluation with the full complexity of clinical practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 18,707 consumer health queries across six public benchmarks by applying a 16-field taxonomy via LLM coding to profile context, topic, and intent. It reports a structural validity gap: 42% reference objective data (mostly wellness wearables at 17.7%), but complex diagnostics are rare (laboratory values 5.2%, imaging 3.8%, raw records 0.6%), safety-critical content is nearly absent (<0.7% suicide/self-harm), chronic management is low (5.5%), and vulnerable populations are underrepresented (<11%). The central claim is that benchmarks remain misaligned with real-world clinical needs and that standardized query profiling is required.
Significance. If the classifications hold, the work provides a useful empirical baseline quantifying misalignment between health AI benchmarks and clinical practice, including underrepresentation of raw records, chronic scenarios, and vulnerable groups. The direct, non-circular analysis of public data offers a concrete taxonomy and cross-sectional snapshot that could inform benchmark design, though its impact depends on the reliability of the automated coding step.
major comments (1)
- [Methods] Methods: The 16-field taxonomy is applied exclusively via LLM to all 18,707 queries with no reported human validation, inter-rater reliability, gold-standard subsample, confusion matrix, or sensitivity analysis. All headline percentages (42% objective data, 0.6% raw records, 5.5% chronic management, <0.7% suicide/self-harm, <11% vulnerable populations) rest on this step; unquantified LLM bias or error would directly alter the reported validity-gap magnitudes.
minor comments (1)
- [Abstract] Abstract and Results: The six specific benchmarks are not named, which reduces reproducibility and context for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for highlighting the importance of validating the automated coding pipeline. We address the single major comment below and commit to strengthening the methods section accordingly.
Referee: [Methods] Methods: The 16-field taxonomy is applied exclusively via LLM to all 18,707 queries with no reported human validation, inter-rater reliability, gold-standard subsample, confusion matrix, or sensitivity analysis. All headline percentages (42% objective data, 0.6% raw records, 5.5% chronic management, <0.7% suicide/self-harm, <11% vulnerable populations) rest on this step; unquantified LLM bias or error would directly alter the reported validity-gap magnitudes.
Authors: We agree that the absence of human validation for the LLM coding step is a limitation of the current manuscript. In the revised version we will add a human validation protocol: two independent annotators (a clinician and an AI researcher) will label a stratified random subsample of 1,000 queries (approximately 5.3% of the corpus). We will report Cohen’s kappa for inter-rater reliability, LLM–human agreement rates per field, and a confusion matrix for the key binary fields that drive the headline statistics. We will also include a sensitivity analysis that re-computes the primary percentages after (a) excluding low-confidence LLM predictions and (b) using an alternative LLM. These additions will be placed in the Methods and Results sections. We do not expect them to change the overall direction of the validity-gap findings.
revision: yes
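The agreement statistic named in the rebuttal can be sketched in a few lines. The function below is a plain implementation of Cohen's kappa for two raters; the label lists are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rare-category field: each rater flags one (different) query.
rater_a = ["yes"] + ["no"] * 99
rater_b = ["no", "yes"] + ["no"] * 98
raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement: {raw_agreement:.2f}")
print(f"kappa: {cohens_kappa(rater_a, rater_b):.3f}")

# The proposed subsample as a share of the corpus.
print(f"subsample share: {1000 / 18707:.1%}")
```

In the made-up example the raters agree on 98 of 100 items yet kappa is close to zero, because nearly all of that agreement is expected by chance when a category is rare. For fields such as raw medical records or self-harm, per-field agreement rates and confusion matrices are therefore likely to be more informative than kappa alone.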
Circularity Check
No circularity: direct empirical counts from taxonomy application
The paper conducts a descriptive cross-sectional analysis of 18,707 existing benchmark queries by applying a 16-field taxonomy via LLM coding. No derivations, equations, fitted parameters, or predictions are present that could reduce to the inputs by construction. The reported percentages (e.g., 42% objective data, 0.6% raw records) are direct outputs of the classification process on public data. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the central claims. The analysis is self-contained as an observational study of benchmark composition with no self-referential loops.
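Because the headline figures are plain counts, they can be re-derived by pooling the per-benchmark rows of the supplementary tables reproduced in the reference graph further down this page. The sketch below does that under two assumptions: benchmark names that are missing or truncated in the source are given placeholder keys, and a data type that a table does not list is treated as a count of zero.

```python
# Counts copied from the "Data Integration" and "Population" rows of the
# per-benchmark supplementary tables. Placeholder keys stand in for benchmark
# names that are missing or truncated in the source. Data types absent from a
# table are assumed to be zero.
BENCHMARKS = {
    "benchmark_1 (N=3,173)": dict(n=3173, objective=8, wearable=0, labs=0,
                                  imaging=0, peds_or_65plus=40 + 26 + 2 + 1),
    "MashQA Test": dict(n=3490, objective=133, wearable=0, labs=1,
                        imaging=0, peds_or_65plus=115 + 57 + 13 + 8),
    "MedRedQA Test": dict(n=5081, objective=3646, wearable=35, labs=898,
                          imaging=659, peds_or_65plus=468 + 117 + 97 + 5),
    "benchmark_4 (N=3,692)": dict(n=3692, objective=863, wearable=1, labs=82,
                                  imaging=43, peds_or_65plus=159 + 121 + 93 + 70),
    "benchmark_5 (N=1,521)": dict(n=1521, objective=1521, wearable=1521, labs=0,
                                  imaging=0, peds_or_65plus=429),
    "GoogleFitbit Fitness": dict(n=1750, objective=1750, wearable=1750, labs=0,
                                 imaging=0, peds_or_65plus=230),
}


def pooled_counts(benchmarks):
    """Sum every count field across benchmarks."""
    totals = {}
    for row in benchmarks.values():
        for key, value in row.items():
            totals[key] = totals.get(key, 0) + value
    return totals


def pooled_shares(benchmarks):
    """Express each pooled count as a share of the pooled query total."""
    totals = pooled_counts(benchmarks)
    n = totals["n"]
    return {key: value / n for key, value in totals.items() if key != "n"}


totals = pooled_counts(BENCHMARKS)
shares = pooled_shares(BENCHMARKS)
print(f"total queries: {totals['n']:,}")
for key, share in shares.items():
    print(f"{key}: {share:.2%}")
```

Under these assumptions the pooled total is 18,707 queries, matching the corpus size in the abstract, and the pooled shares round to the reported 42% (objective data), 17.7% (wearables), 5.2% (laboratory values), and 3.8% (imaging), with pediatric plus older-adult queries just under 11%. The check confirms the arithmetic only; it says nothing about whether the underlying LLM tags are correct.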
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 16-field taxonomy accurately and comprehensively captures the clinically relevant dimensions of consumer health queries.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We assessed tagging validity through two complementary analyses... Overall agreement averaged 90.0% (Cohen’s κ = 0.77)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Peter M. Rothwell. External validity of randomised controlled trials: “to whom do the results of this trial apply?”. Lancet (London, England), 365(9453):82–93, 2005–7. ISSN 1474-547X. doi: 10.1016/S0140-6736(04)17670-8
-
[2]
Sally Hopewell, An-Wen Chan, Gary S. Collins, Asbjørn Hróbjartsson, David Moher, Kenneth F. Schulz, Ruth Tunn, Rakesh Aggarwal, Michael Berkwits, Jesse A. Berlin, Nita Bhandari, Nancy J. Butcher, Marion K. Campbell, Runcie C. W. Chidebe, Diana Elbourne, Andrew Farmer, Dean A. Fergusson, Robert M. Golub, Steven N. Goodman, Tammy C. Hoffmann, John P. ...
-
[3]
Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richardson, Philipp Schoenegger, Xiaoxuan Liu, Matthew M Nour, Seth Spielman, Samuel F Way, Yash Shah, Michael Bhaskar, Harsha Nori, Christopher Kelly, Peter Hames, Bay Gross, Mustafa Suleyman, and Dominic King. How people use Copilot for Health
-
[4]
kffjulianm. KFF Tracking Poll on Health Information and Trust: Use of AI For Health Information and Advice, March 2026
-
[5]
Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, and Nigam H. Shah. Testing and Evaluation of Health Care Applicati...
-
[6]
Lawrence K. Q. Yan, Qian Niu, Ming Li, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Benji Peng, Ziqian Bi, Pohsun Feng, Keyu Chen, Tianyang Wang, Yunze Wang, Silin Chen, Ming Liu, and Junyu Liu. Large Language Model Benchmarks in Medical Tasks, December 2024
-
[7]
Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-M...
-
[8]
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards, May 2018. Comment: First Draft May 2018.
-
[9]
Emily M. Bender and Batya Friedman. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6:587–604, December 2018. ISSN 2307-387X. doi: 10.1162/tacl_a_00041
-
[10]
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. ...
-
[11]
Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, and Zhenchang Xing. MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, editors, Proceedings of the 13th International Joint Conference on Natural Language Processin...
-
[12]
Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.42
-
[13]
Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating Large Language Models Towards Improved Human Health, May 2025. Comment: Blog: https://openai.com/index/healthbench/ Code:...
-
[14]
Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A. Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhot...
-
[15]
Aaron Chatterji, Thomas Cunningham, David J. Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How People Use ChatGPT. NBER Working Paper Series, September 2025. doi: 10.3386/w34255
-
[16]
Jill J. Ashman, Ph.D., Loredana Santo, M.D., M.P.H., and Titilayo Okeyode, M.Sc. Products - Data Briefs - Number 408 - May 2021. https://www.cdc.gov/nchs/products/databriefs/db408.htm, May 2021
-
[17]
Anuradha Jetty, Marie Ezran, Alison N. Huffstetler, and Yalda Jabbarpour. An Evaluation of the Decline in Primary Care Physician Visits, 2010 to 2021. Journal of Primary Care & Community Health, 16:21501319251321618, February 2025. ISSN 2150-1319. doi: 10.1177/21501319251321618.
-
[18]
Cade Metz. Are A.I. Therapy Chatbots Safe to Use? The New York Times, November 2025. ISSN 0362-4331
-
[19]
Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine Learning in Medicine. New England Journal of Medicine, 380(14):1347–1358, April 2019. ISSN 0028-4793. doi: 10.1056/NEJMra1814259
-
[20]
Alvin Rajkomar, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of Internal Medicine, 169(12):866–872, December 2018. ISSN 0003-4819. doi: 10.7326/M18-1990
-
[21]
Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V. Stolyar, Katelyn Polanska, Karleigh R. McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, and Yanshan Wang. A framework for human evaluation of large language models in healthcare derived from literature review....
-
[22]
A. R. Feinstein and D. V. Cicchetti. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549, 1990. ISSN 0895-4356. doi: 10.1016/0895-4356(90)90158-l.
Tables and Figures
Figure 1: CONSORT-style flow diagram showing study progression from initial dataset assessment through tagging methods to final a...
-
[23]
A 5-point Likert rating (1 = strongly disagree to 5 = strongly agree) for the statement “This tag is reasonable for this query.” On agreement-stratum queries, this applied to the consensus tag; on disagreement-stratum queries where models agreed on a given dimension, it applied to the shared value
-
[24]
For dimensions where models disagreed, a forced-choice preference (“Both reasonable, prefer A/B/no preference,” “Only A reasonable,” “Only B reasonable,” or “Neither reasonable”), from which per-tag reasonableness was derived.
Results
Tag reasonableness. Across all 80 queries and both reviewers, the GPT-5.2 tag was judged reasonable (Likert ≥ 4) for 96.9% of...
-
[25]
Intent Distribution
Education 2,645 (83.4%): Education Explainer 2,635 (99.6%); Basic Science 10 (0.4%)
Symptom Check 380 (12.0%): Management Plan 235 (61.8%); Differential Diagnosis 107 (28.2%); Triage Disposition 38 (10.0%)
Non-Health 50 (1.6%): Offtopic Nonhealth 50 (100.0%)
General Health Advice 47 (1.5%): Nutrition/Diet 19 (40.4%); Cosmeceuticals/Topicals 10 (21...
-
[26]
Topic Distribution (Top 5)
Brain & Nerves 366 (11.5%): Other/Unspecified 212 (57.9%); Neuropathy 43 (11.8%); Cognitive Changes 39 (10.7%)
Skin & Hair 366 (11.5%): Other/Unspecified 169 (46.2%); Infections 64 (17.5%); Rash 56 (15.3%)
Muscles, Bones & Joints 274 (8.6%): Other/Unspecified 180 (65.7%); Arthritis 32 (11.7%); Sprains & Strains 19 (6.9%)
Digestive & Nutriti...
-
[27]
Context Richness (Table S11, continued)
Conversation Structure: Single Turn 3,173 (100.0%)
Narrative Detail: Short 3,173 (100.0%)
Context Depth: Low 3,171 (99.9%); High 2 (0.1%)
-
[28]
Clinical Complexity
Risk Level: Low 3,093 (97.5%); Moderate 70 (2.2%); High 10 (0.3%)
User Type: Consumer 3,173 (100.0%)
Population: Adult (Unspecified) 3,104 (97.8%); Peds Unspecified 40 (1.3%); Pediatric (Under 5) 26 (0.8%); Adult 65Plus 2 (0.1%); Peds 5To17 1 (0.0%)
Language: English 3,173 (100.0%)
Language Complexity: Lay 3,069 (96.7%); Technical 104 (3.3%)
Query...
-
[29]
Data Integration
Objective Data Present: Yes 8 (0.2%); No 3,165 (99.8%)
Objective Data Types: Diagnoses 5 (0.2%); Vitals (Basic) 3 (0.1%)
A.20.2 MashQA Test (N=3,490)
Dimension/Category, Count (%)
-
[30]
Intent Distribution
Education 2,389 (68.5%): Education Explainer 2,250 (94.2%); Basic Science 139 (5.8%)
Medication Information 332 (9.5%): Side Effects 157 (47.3%); Selection 112 (33.7%); Dosing 45 (13.6%)
General Health Advice 240 (6.9%): Nutrition/Diet 72 (30.0%); Supplements/Nutraceuticals 64 (26.7%); Fitness/Exercise 40 (16.7%)
Condition Management 157 (4.5%): Ch...
-
[31]
Topic Distribution (Top 5)
Cancer 524 (15.0%): Other/Unspecified 291 (55.5%); Lung 75 (14.3%); Breast 74 (14.1%)
Muscles, Bones & Joints 330 (9.5%): Arthritis 217 (65.8%); Other/Unspecified 41 (12.4%); Back & Neck Pain 30 (9.1%)
Brain & Nerves 283 (8.1%): Other/Unspecified 116 (41.0%); Headache/Migraine 100 (35.3%); Neuropathy 30 (10.6%)
Digestive & Nutrition 272 (7.8...
-
[32]
Context Richness (Table S12, continued)
Conversation Structure: Single Turn 3,490 (100.0%)
Narrative Detail: Short 3,490 (100.0%)
Context Depth: Low 3,484 (99.8%); High 6 (0.2%)
-
[33]
Clinical Complexity
Risk Level: Low 3,441 (98.6%); Moderate 44 (1.3%); High 5 (0.1%)
User Type: Consumer 3,490 (100.0%)
Population: Adult (Unspecified) 3,297 (94.5%); Peds Unspecified 115 (3.3%); Pediatric (Under 5) 57 (1.6%); Adult 65Plus 13 (0.4%); Peds 5To17 8 (0.2%)
Language: English 3,490 (100.0%)
Language Complexity: Lay 3,157 (90.5%); Technical 333 (9.5%)
Quer...
-
[34]
Data Integration
Objective Data Present: Yes 133 (3.8%); No 3,357 (96.2%)
Objective Data Types: Diagnoses 110 (3.1%); Medications 21 (0.6%); Procedures 15 (0.4%); Labs 1 (0.0%)
A.20.3 MedRedQA Test (N=5,081)
Dimension/Category, Count (%)
-
[35]
Intent Distribution
Symptom Check 2,656 (52.3%): Differential Diagnosis 1,235 (46.5%); Management Plan 774 (29.1%); Triage Disposition 647 (24.4%)
Tests and Results 702 (13.8%): Test Interpretation 593 (84.5%); Test Selection 109 (15.5%)
Medication Information 608 (12.0%): Side Effects 240 (39.5%); Selection 134 (22.0%); Interactions 119 (19.6%)
Condition Management...
-
[36]
Topic Distribution (Top 5)
Skin & Hair 761 (15.0%): Rash 194 (25.5%); Wounds 175 (23.0%); Other/Unspecified 157 (20.6%)
Digestive & Nutrition 521 (10.2%): Other/Unspecified 169 (32.4%); Liver Disease 75 (14.4%); Abdominal Pain 69 (13.2%)
Brain & Nerves 454 (8.9%): Other/Unspecified 118 (26.0%); Headache/Migraine 76 (16.7%); Neuropathy 63 (13.9%)
Infections (General) 4...
-
[37]
Context Richness (Table S13, continued)
Conversation Structure: Single Turn 5,072 (99.8%); Multi Turn 9 (0.2%)
Narrative Detail: Detailed 4,804 (94.5%); Short 277 (5.5%)
Context Depth: High 4,461 (87.8%); Low 620 (12.2%)
-
[38]
Clinical Complexity
Risk Level: Moderate 2,942 (57.9%); Low 1,639 (32.3%); High 500 (9.8%)
User Type: Consumer 5,081 (100.0%)
Population: Adult (Unspecified) 4,394 (86.5%); Peds 5To17 468 (9.2%); Adult 65Plus 117 (2.3%); Pediatric (Under 5) 97 (1.9%); Peds Unspecified 5 (0.1%)
Language: English 5,081 (100.0%)
Language Complexity: Lay 4,461 (87.8%); Technical 620 (12....
-
[39]
Data Integration (Table S13, continued)
Objective Data Present: Yes 3,646 (71.8%); No 1,435 (28.2%)
Objective Data Types: Diagnoses 2,230 (43.9%); Medications 2,044 (40.2%); Labs 898 (17.7%); Procedures 879 (17.3%); Imaging 659 (13.0%); Vitals (Basic) 383 (7.5%); Vitals (Wearable) 35 (0.7%)
...
-
[40]
Intent Distribution
Symptom Check 1,384 (37.5%): Management Plan 681 (49.2%); Triage Disposition 392 (28.3%); Differential Diagnosis 311 (22.5%)
Medication Information 729 (19.8%): Selection 443 (60.8%); Dosing 170 (23.3%); Side Effects 72 (9.9%)
Condition Management 297 (8.0%): Chronic Care Support 227 (76.4%); Risk/Prognosis 44 (14.8%); Acute Flare Management 26 (8...
-
[41]
Topic Distribution (Top 5)
Infections (General) 425 (11.5%): Other/Unspecified 218 (51.3%); Travel-Related 81 (19.1%); Fever (Unspecified) 66 (15.5%)
Brain & Nerves 344 (9.3%): Other/Unspecified 94 (27.3%); Headache/Migraine 88 (25.6%); Dizziness/Vertigo 59 (17.1%)
Digestive & Nutrition 339 (9.2%): Other/Unspecified 153 (45.1%); Abdominal Pain 42 (12.4%); Reflux/Hear...
-
[42]
Context Richness (Table S14, continued)
Conversation Structure: Single Turn 2,193 (59.4%); Multi Turn 1,499 (40.6%)
Narrative Detail: Short 3,237 (87.7%); Detailed 455 (12.3%)
Context Depth: Low 3,174 (86.0%); High 518 (14.0%)
-
[43]
Clinical Complexity
Risk Level: Low 2,161 (58.5%); Moderate 1,189 (32.2%); High 342 (9.3%)
User Type: Consumer 3,692 (100.0%)
Population: Adult (Unspecified) 3,249 (88.0%); Pediatric (Under 5) 159 (4.3%); Peds Unspecified 121 (3.3%); Peds 5To17 93 (2.5%); Adult 65Plus 70 (1.9%)
Language: English 3,010 (81.5%); Non-English 682 (18.5%)
Language Complexity: Lay 3,470 (9...
-
[44]
Data Integration (Table S14, continued)
Objective Data Present: Yes 863 (23.4%); No 2,829 (76.6%)
Objective Data Types: Diagnoses 546 (14.8%); Medications 257 (7.0%); Procedures 99 (2.7%); Labs 82 (2.2%); Vitals (Basic) 66 (1.8%); Imaging 43 (1.2%); Vitals (Wearable) 1 (0.0%)
A.20.5 GoogleF...
-
[45]
Intent Distribution
General Health Advice 1,521 (100.0%): Sleep Hygiene 1,521 (100.0%)
-
[46]
Topic Distribution (Top 5)
Holistic Health & Wellness 1,521 (100.0%): Sleep & Lifestyle 1,521 (100.0%)
-
[47]
Context Richness
Conversation Structure: Single Turn 1,521 (100.0%)
Narrative Detail: Detailed 1,521 (100.0%)
Context Depth: High 1,521 (100.0%)
-
[48]
Clinical Complexity
Risk Level: Low 1,521 (100.0%)
User Type: Consumer 1,521 (100.0%)
Population: Adult (Unspecified) 1,092 (71.8%); Adult 65Plus 429 (28.2%)
Language: English 1,521 (100.0%)
Language Complexity: Lay 1,521 (100.0%)
Query Subject: Self 1,521 (100.0%)
Personal Health Query: Yes 1,521 (100.0%)
-
[49]
Data Integration
Objective Data Present: Yes 1,521 (100.0%)
Objective Data Types: Vitals (Wearable) 1,521 (100.0%)
A.20.6 GoogleFitbit Fitness (N=1,750)
Dimension/Category, Count (%)
-
[50]
Intent Distribution
General Health Advice 1,750 (100.0%): Fitness/Exercise 1,750 (100.0%)
-
[51]
Topic Distribution (Top 5)
Holistic Health & Wellness 1,750 (100.0%): Sleep & Lifestyle 1,750 (100.0%)
-
[52]
Context Richness
Conversation Structure: Single Turn 1,750 (100.0%)
Narrative Detail: Detailed 1,750 (100.0%)
Context Depth: High 1,750 (100.0%)
-
[53]
Clinical Complexity
Risk Level: Low 1,750 (100.0%)
User Type: Consumer 1,750 (100.0%)
Population: Adult (Unspecified) 1,520 (86.9%); Adult 65Plus 230 (13.1%)
Language: English 1,750 (100.0%)
Language Complexity: Lay 1,750 (100.0%)
Query Subject: Self 1,750 (100.0%)
Personal Health Query: Yes 1,750 (100.0%)
-
[54]
Data Integration
Objective Data Present: Yes 1,750 (100.0%)
Objective Data Types: Vitals (Wearable) 1,750 (100.0%)