
arxiv: 2606.18596 · v1 · pith:GPBK2S5Onew · submitted 2026-06-17 · 💻 cs.HC · cs.AI

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

Pith reviewed 2026-06-26 20:06 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords sleep diary · conversational voice · LLM · adherence · self-report · behavioral sleep medicine · voice assistant · field study

The pith

LLM-powered voice diaries achieve higher adherence and richer sleep context than text-based diaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an LLM-powered conversational voice diary against a matched text-based mobile diary in a four-week between-subjects field study with 30 university students. The voice diary produced higher adherence rates and more detailed reports on routines, stressors, environmental conditions, and other sleep factors. Participants viewed the voice system as easier to fit into daily life despite longer completion times. The work identifies a trade-off in which conversational richness comes at the cost of lower completeness on some structured diary fields.

Core claim

The authors claim that an LLM-powered conversational voice diary, using proactive smart-speaker prompts, structured intake, and adaptive follow-up dialogue, produces higher adherence and richer contextual self-reports than a text-based diary with matched items and reminders, although it yields lower completeness on certain structured fields.

What carries the argument

LLM-powered conversational intake delivered through proactive smart-speaker prompts with adaptive follow-up dialogue.

If this is right

  • Daily sleep diary completion becomes more sustainable for users in behavioral sleep medicine.
  • Clinicians receive more detailed contextual information for interpreting night-to-night sleep variation.
  • Voice interfaces integrate more readily into daily routines than text entry on mobile devices.
  • Designers must address the observed trade-off between expressive richness and structured field completeness.
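
One way to reason about the trade-off in the last point is to track, per diary entry, which structured fields a conversational turn has actually filled and to ask a follow-up only for the gaps. The sketch below is illustrative and is not the authors' system: the field names in `REQUIRED_FIELDS`, the `DiaryEntry` class, and the `next_follow_up` helper are all hypothetical, and the `extracted` dictionary stands in for whatever an LLM or parser would pull out of the spoken answer.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical structured fields; the paper's actual diary items are not listed here.
REQUIRED_FIELDS = ("bedtime", "wake_time", "sleep_quality")


@dataclass
class DiaryEntry:
    values: dict = field(default_factory=dict)      # structured answers captured so far
    free_text: list = field(default_factory=list)   # raw utterances (the "rich context")

    def record(self, utterance: str, extracted: dict) -> None:
        """Store the raw utterance plus any structured values parsed from it."""
        self.free_text.append(utterance)
        for key, value in extracted.items():
            if key in REQUIRED_FIELDS and value is not None:
                self.values[key] = value

    def missing_fields(self) -> list:
        """Required fields that no utterance has filled yet, in a fixed order."""
        return [name for name in REQUIRED_FIELDS if name not in self.values]

    def completeness(self) -> float:
        """Fraction of required structured fields that have a value."""
        return len(self.values) / len(REQUIRED_FIELDS)


def next_follow_up(entry: DiaryEntry) -> Optional[str]:
    """Return a prompt for the first missing field, or None when the entry is complete."""
    missing = entry.missing_fields()
    if not missing:
        return None
    return f"Could you also tell me your {missing[0].replace('_', ' ')}?"
```

A dialogue loop built on this idea could keep accepting open-ended narration while calling `next_follow_up` after each turn, so that contextual detail and structured completeness are measured separately rather than traded against each other silently.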

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conversational approach may increase adherence for other longitudinal self-report tasks such as mood or pain tracking.
  • Testing the system over longer periods in clinical insomnia populations would clarify whether adherence gains hold outside student samples.
  • Pairing the voice diary with passive sensor data could combine rich context with objective measures of sleep.

Load-bearing premise

The university student sample is close enough to the target clinical population for the findings to transfer, and the four-week between-subjects design permits a valid comparison of adherence and context richness.

What would settle it

A replication study with diagnosed insomnia patients that finds no difference, or lower adherence, for the voice diary would undercut the central claim as it applies to clinical populations.

Figures

Figures reproduced from arXiv: 2606.18596 by Amama Mahmood, Bokyung Kim, Chien-Ming Huang, Honghao Zhao, Luis F. Buenaver, Michael T. Smith, Molly E. Atwood.

Figure 1: In this work, we propose a conversational sleep diary to address key limitations of traditional text-based …
Figure 2: Interaction flow of the LLM-powered sleep diary system.
Figure 3: 4 week deployment and data collection. (1) Introduction and consent. The experimenter explained the study procedures, data collection methods, and privacy protections, and obtained written informed consent. (2) Sleep-window profiling. The experimenter documented participants’ typical wake-up times (weekday and weekend), departure and return-home times, and usual wind-down period to configure personalized m…
Figure 4: Diary entry depth: quantitative information-density metrics
Figure 5: Diary entry depth: human-coded completeness, disclosure, and engagement
Original abstract

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the design of an LLM-powered conversational voice diary that uses proactive smart-speaker prompts, structured intake, and adaptive follow-up to deliver clinically grounded morning and evening sleep diary items. It reports results from a four-week between-subjects field study with 30 university students comparing the voice system to a matched text-based mobile diary on adherence, contextual richness of self-reports, perceived integration into routines, and completeness of structured fields.

Significance. If the empirical comparison holds after appropriate statistical reporting, the work provides direct evidence that conversational voice interfaces can increase adherence and elicit richer contextual detail than static text diaries in a longitudinal self-report setting. This is a concrete strength for HCI and health applications, as the study uses matched items, windows, and reminders in a real-world field deployment rather than a lab simulation.

major comments (2)
  1. [Abstract] The central claim that the conversational voice diary 'showed higher adherence' is presented without any statistical tests, exact adherence percentages, confidence intervals, or dropout handling; this directly limits verification of the magnitude and reliability of the primary outcome.
  2. [Abstract, participant description] The between-subjects comparison with a university-student convenience sample is used to position the system for behavioral sleep medicine and CBT-I applications, yet clinical insomnia patients differ systematically in sleep severity, daytime impairment, motivation, and comorbidities that affect adherence and the nature of contextual reporting; this extrapolation is load-bearing for the applicability claim.
minor comments (1)
  1. [Abstract] Participant demographics are described only as 'university students' with no further detail on age range, gender balance, baseline sleep profiles, or device familiarity.
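
To make the first major comment concrete, the kind of reporting it asks for can be sketched as a per-participant adherence rate plus a percentile bootstrap confidence interval for the between-group difference in means. This is a minimal illustration under stated assumptions, not the paper's analysis: the function names are hypothetical, the bootstrap is one of several reasonable interval methods, and no numbers from the study are used.

```python
import random
from statistics import mean


def adherence_rate(completed: int, expected: int) -> float:
    """Share of expected diary entries that a participant actually completed."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return completed / expected


def bootstrap_diff_ci(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b).

    Returns (point_estimate, lower, upper). Each group is a list of
    per-participant adherence rates; participants are resampled with
    replacement within their own group.
    """
    rng = random.Random(seed)  # seeded so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(group_a) - mean(group_b), lower, upper
```

As a usage note, if one assumes two diary entries per day over 28 days, the hypothetical denominator passed as `expected` would be 56 for every participant, and dropout handling then becomes a visible choice: either keep the full denominator for participants who stop early or report a separate rate over their active days.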

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and note the planned revisions to the manuscript.

  1. Referee: [Abstract] The central claim that the conversational voice diary 'showed higher adherence' is presented without any statistical tests, exact adherence percentages, confidence intervals, or dropout handling; this directly limits verification of the magnitude and reliability of the primary outcome.

    Authors: The results section reports adherence with statistical tests, exact percentages, confidence intervals, and dropout handling. We agree the abstract summarizes the finding at too high a level. We will revise the abstract to include the key quantitative details from the results.
    Revision: yes

  2. Referee: [Abstract, participant description] The between-subjects comparison with a university-student convenience sample is used to position the system for behavioral sleep medicine and CBT-I applications, yet clinical insomnia patients differ systematically in sleep severity, daytime impairment, motivation, and comorbidities that affect adherence and the nature of contextual reporting; this extrapolation is load-bearing for the applicability claim.

    Authors: We accept that the sample is a university-student convenience sample and that clinical insomnia populations differ on key dimensions. The study is framed as an initial field evaluation of feasibility and adherence. We will revise the abstract and discussion to explicitly state the sample limitations, reduce the strength of applicability claims to CBT-I, and note the need for future clinical validation.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical field evaluation with observed outcomes


This paper reports a between-subjects field study measuring adherence rates, completeness of diary entries, and qualitative participant feedback on a conversational voice diary versus a text diary. No equations, fitted parameters, predictive models, or derivation chains are present. All central claims (higher adherence, richer context) are presented as direct empirical observations from the 30-participant sample rather than outputs of any self-referential process. No self-citations function as load-bearing premises for uniqueness or ansatz choices. The study is self-contained against external benchmarks of adherence and context richness.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

As an empirical HCI field study, the claims rest on standard domain assumptions about self-report validity and study design rather than new free parameters or invented entities.

axioms (2)
  • Domain assumption: Self-reported adherence and contextual details accurately reflect participants' actual behavior and experiences.
    The study measures adherence and richness via participant reports without independent verification such as device logs or observer data.
  • Domain assumption: The between-subjects assignment with matched diary items and reminder intervals sufficiently controls for confounding variables.
    The design assumes random assignment and matching eliminate major individual or timing differences between groups.

pith-pipeline@v0.9.1-grok · 5751 in / 1384 out tokens · 40377 ms · 2026-06-26T20:06:54.439046+00:00 · methodology

