Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

Amama Mahmood; Bokyung Kim; Chien-Ming Huang; Honghao Zhao; Luis F. Buenaver; Michael T. Smith; Molly E. Atwood

arxiv: 2606.18596 · v1 · pith:GPBK2S5Onew · submitted 2026-06-17 · 💻 cs.HC · cs.AI

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

Pith reviewed 2026-06-26 20:06 UTC · model grok-4.3

keywords sleep diary · conversational voice · LLM · adherence · self-report · behavioral sleep medicine · voice assistant · field study

The pith

LLM-powered voice diaries achieve higher adherence and richer sleep context than text-based diaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an LLM-powered conversational voice diary against a matched text-based mobile diary in a four-week between-subjects field study with 30 university students. The voice diary produced higher adherence rates and more detailed reports on routines, stressors, environmental conditions, and other sleep factors. Participants viewed the voice system as easier to fit into daily life despite longer completion times. The work identifies a trade-off in which conversational richness comes at the cost of lower completeness on some structured diary fields.

Core claim

The authors claim that an LLM-powered conversational voice diary, using proactive smart-speaker prompts, structured intake, and adaptive follow-up dialogue, produces higher adherence and richer contextual self-reports than a text-based diary with matched items and reminders, although it yields lower completeness on certain structured fields.

What carries the argument

LLM-powered conversational intake delivered through proactive smart-speaker prompts with adaptive follow-up dialogue.
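
As a rough illustration of how structured intake and adaptive follow-up can fit together, the sketch below walks a fixed list of diary items and asks one follow-up when an answer is missing or very short. Every name here (`DIARY_ITEMS`, `needs_follow_up`, `run_intake`, the `ask` callback) is hypothetical, and a simple length heuristic stands in for the LLM; the paper's actual prompts and dialogue policy are not described in this summary.

```python
from typing import Callable, Dict, List

# Hypothetical morning items; the paper's actual question set is not reproduced here.
DIARY_ITEMS: List[str] = [
    "What time did you get into bed?",
    "How long did it take you to fall asleep?",
    "How would you rate your sleep quality?",
]


def needs_follow_up(answer: str, min_words: int = 2) -> bool:
    """Stand-in for the LLM's judgment: flag empty or very short answers."""
    return len(answer.split()) < min_words


def run_intake(ask: Callable[[str], str], items: List[str] = DIARY_ITEMS) -> Dict[str, str]:
    """Ask each item once, then at most one follow-up if the answer looks thin."""
    record: Dict[str, str] = {}
    for item in items:
        answer = ask(item).strip()
        if needs_follow_up(answer):
            extra = ask(f"Could you say a bit more? {item}").strip()
            answer = f"{answer} {extra}".strip()
        record[item] = answer
    return record


if __name__ == "__main__":
    # Scripted replies simulate a participant; a real system would use speech I/O.
    scripted = iter(["around 11 pm", "um", "maybe twenty minutes", "pretty good overall"])
    result = run_intake(lambda prompt: next(scripted))
    for question, answer in result.items():
        print(f"{question} -> {answer}")
```

Capping follow-ups at one per item is a design choice made for this sketch: it keeps every structured item on the agenda while limiting how much a session can lengthen.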

If this is right

Daily sleep diary completion becomes more sustainable for users in behavioral sleep medicine.
Clinicians receive more detailed contextual information for interpreting night-to-night sleep variation.
Voice interfaces integrate more readily into daily routines than text entry on mobile devices.
Designers must address the observed trade-off between expressive richness and structured field completeness.
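
The trade-off in the last point can be made concrete with two simple per-entry measures: the share of structured fields that were filled in, and the length of the free-text context. The sketch below is illustrative only. The field names, the `DiaryEntry` type, and word count as a proxy for richness are assumptions, not the paper's coding scheme (the paper reports information-density metrics and human-coded completeness in its Figures 4 and 5).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical structured fields; the study's actual diary items are not listed here.
STRUCTURED_FIELDS: Tuple[str, ...] = ("bedtime", "sleep_latency_min", "wake_time", "quality")


@dataclass
class DiaryEntry:
    fields: Dict[str, Optional[str]] = field(default_factory=dict)
    context: str = ""  # free-text or transcribed remarks


def completeness(entry: DiaryEntry) -> float:
    """Fraction of structured fields with a non-empty value."""
    filled = sum(1 for name in STRUCTURED_FIELDS if entry.fields.get(name))
    return filled / len(STRUCTURED_FIELDS)


def context_words(entry: DiaryEntry) -> int:
    """Crude richness proxy: number of whitespace-separated words."""
    return len(entry.context.split())


if __name__ == "__main__":
    # Invented entries for illustration; not taken from the study.
    voice_like = DiaryEntry(
        fields={"bedtime": "23:30", "quality": "ok"},
        context="Roommate had friends over so I fell asleep late and woke up once",
    )
    text_like = DiaryEntry(
        fields={"bedtime": "23:30", "sleep_latency_min": "20", "wake_time": "07:00", "quality": "ok"},
        context="noisy",
    )
    for label, entry in (("voice-like", voice_like), ("text-like", text_like)):
        print(label, completeness(entry), context_words(entry))
```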

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same conversational approach may increase adherence for other longitudinal self-report tasks such as mood or pain tracking.
Testing the system over longer periods in clinical insomnia populations would clarify whether adherence gains hold outside student samples.
Pairing the voice diary with passive sensor data could combine rich context with objective measures of sleep.

Load-bearing premise

The university student sample and four-week between-subjects design represent the target clinical population and permit valid comparison of adherence and context richness.

What would settle it

A replication study with diagnosed insomnia patients that finds equal or lower adherence for the voice diary would undercut the central claim as it applies to clinical sleep medicine.

Figures

Figures reproduced from arXiv: 2606.18596 by Amama Mahmood, Bokyung Kim, Chien-Ming Huang, Honghao Zhao, Luis F. Buenaver, Michael T. Smith, Molly E. Atwood.

**Figure 1.** In this work, we propose a conversational sleep diary to address key limitations of traditional text-based …

**Figure 2.** Interaction flow of the LLM-powered sleep diary system.

**Figure 3.** 4-week deployment and data collection. (1) Introduction and consent. The experimenter explained the study procedures, data collection methods, and privacy protections, and obtained written informed consent. (2) Sleep-window profiling. The experimenter documented participants’ typical wake-up times (weekday and weekend), departure and return-home times, and usual wind-down period to configure personalized m…

**Figure 4.** Diary entry depth: quantitative information-density metrics.

**Figure 5.** Diary entry depth: human-coded completeness, disclosure, and engagement.

Original abstract

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note · private letter to a colleague

Voice diary beats text on adherence and context in 30 students, but student sample undercuts claims for clinical sleep medicine use.

The core result is that their LLM voice diary produced higher adherence and pulled richer contextual details than a matched text diary over four weeks. The system uses proactive smart-speaker prompts, structured intake, and adaptive LLM follow-ups for morning and evening questions.

They ran a between-subjects field study with 30 university students, keeping diary items, windows, and reminders the same across conditions. The voice version got more reports on routines, stressors, environment, and other factors; participants said it fit daily life better despite longer completion time. They also flag the clear downside that some structured fields had lower completeness. That trade-off is useful to see.

The work is new in the specific combination: LLM conversational voice for sleep diaries with a longitudinal proactive field comparison. It does a straightforward job of testing the idea against a text baseline and reporting both gains and limits.

The soft spot is the sample. University students are not the insomnia patients who use CBT-I diaries in practice; those patients differ in sleep severity, motivation, comorbidities, and routine stability, all of which affect adherence and the kind of context they provide. A between-subjects design with small per-cell numbers adds risk that unmeasured individual differences drive the result. The abstract gives no stats, exact adherence numbers, demographics beyond students, or dropout handling, which makes the claims harder to weigh.

This paper is for HCI people building voice interfaces for health tracking and for sleep researchers who want to see how conversational tools perform in the field. A reader working on digital self-report or longitudinal health data will find concrete design lessons and the adherence-context trade-off.

It deserves a serious referee. The empirical comparison is direct and the problem is practical; the limitations are fixable with clearer methods and a note on population scope. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the design of an LLM-powered conversational voice diary that uses proactive smart-speaker prompts, structured intake, and adaptive follow-up to deliver clinically grounded morning and evening sleep diary items. It reports results from a four-week between-subjects field study with 30 university students comparing the voice system to a matched text-based mobile diary on adherence, contextual richness of self-reports, perceived integration into routines, and completeness of structured fields.

Significance. If the empirical comparison holds after appropriate statistical reporting, the work provides direct evidence that conversational voice interfaces can increase adherence and elicit richer contextual detail than static text diaries in a longitudinal self-report setting. This is a concrete strength for HCI and health applications, as the study uses matched items, windows, and reminders in a real-world field deployment rather than a lab simulation.

major comments (2)

[Abstract] Abstract: the central claim that the conversational voice diary 'showed higher adherence' is presented without any statistical tests, exact adherence percentages, confidence intervals, or dropout handling; this directly limits verification of the magnitude and reliability of the primary outcome.
[Abstract] Abstract / participant description: the between-subjects comparison with a university-student convenience sample is used to position the system for behavioral sleep medicine and CBT-I applications, yet clinical insomnia patients differ systematically in sleep severity, daytime impairment, motivation, and comorbidities that affect adherence and the nature of contextual reporting; this extrapolation is load-bearing for the applicability claim.
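
The kind of reporting the first comment asks for can be sketched briefly. The example below computes per-participant adherence as completed entries over expected entries, then a percentile bootstrap confidence interval for the difference in group means. The counts are made-up placeholders, not data from the paper; the 56 expected entries assume two diaries a day for 28 days, and the function names are hypothetical. The authors' actual analysis may differ.

```python
import random
from statistics import mean
from typing import List, Tuple

EXPECTED_ENTRIES = 56  # assumption: 2 diaries/day x 28 days


def adherence(completed: List[int], expected: int = EXPECTED_ENTRIES) -> List[float]:
    """Per-participant adherence rate in [0, 1]."""
    return [c / expected for c in completed]


def bootstrap_diff_ci(
    a: List[float], b: List[float], n_boot: int = 5000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (observed mean difference a - b, CI lower, CI upper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample participants with replacement within each group.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return mean(a) - mean(b), lower, upper


if __name__ == "__main__":
    # Placeholder counts for illustration only; NOT the study's data.
    voice = adherence([50, 48, 53, 45, 51, 47, 52, 49])
    text = adherence([40, 44, 35, 46, 38, 42, 33, 41])
    diff, lo, hi = bootstrap_diff_ci(voice, text)
    print(f"mean difference = {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling at the participant level, rather than the entry level, respects the fact that entries from the same person are not independent, which matters with only about 15 participants per condition.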

minor comments (1)

[Abstract] Abstract: participant demographics are described only as 'university students' with no further detail on age range, gender balance, baseline sleep profiles, or device familiarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and note the planned revisions to the manuscript.

Referee: [Abstract] Abstract: the central claim that the conversational voice diary 'showed higher adherence' is presented without any statistical tests, exact adherence percentages, confidence intervals, or dropout handling; this directly limits verification of the magnitude and reliability of the primary outcome.

Authors: The results section reports adherence with statistical tests, exact percentages, confidence intervals, and dropout handling. We agree the abstract summarizes the finding at too high a level. We will revise the abstract to include the key quantitative details from the results. Revision: yes.

Referee: [Abstract] Abstract / participant description: the between-subjects comparison with a university-student convenience sample is used to position the system for behavioral sleep medicine and CBT-I applications, yet clinical insomnia patients differ systematically in sleep severity, daytime impairment, motivation, and comorbidities that affect adherence and the nature of contextual reporting; this extrapolation is load-bearing for the applicability claim.

Authors: We accept that the sample is a university-student convenience sample and that clinical insomnia populations differ on key dimensions. The study is framed as an initial field evaluation of feasibility and adherence. We will revise the abstract and discussion to explicitly state the sample limitations, reduce the strength of applicability claims to CBT-I, and note the need for future clinical validation. Revision: yes.

Circularity Check

0 steps flagged

No circularity: direct empirical field evaluation with observed outcomes

This paper reports a between-subjects field study measuring adherence rates, completeness of diary entries, and qualitative participant feedback on a conversational voice diary versus a text diary. No equations, fitted parameters, predictive models, or derivation chains are present. All central claims (higher adherence, richer context) are presented as direct empirical observations from the 30-participant sample rather than outputs of any self-referential process. No self-citations function as load-bearing premises for uniqueness or ansatz choices. The study is self-contained against external benchmarks of adherence and context richness.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

As an empirical HCI field study, the claims rest on standard domain assumptions about self-report validity and study design rather than new free parameters or invented entities.

axioms (2)

domain assumption: Self-reported adherence and contextual details accurately reflect participants' actual behavior and experiences.
The study measures adherence and richness via participant reports without independent verification such as device logs or observer data.

domain assumption: The between-subjects assignment with matched diary items and reminder intervals sufficiently controls for confounding variables.
The design assumes random assignment and matching eliminate major individual or timing differences between groups.

Reference graph

Works this paper leans on

77 extracted references · 59 canonical work pages

[1]

Schoevers

Marije Aan Het Rot, Koen Hogenelst, and Robert A. Schoevers. 2012. Mood Disorders in Everyday Life: A Systematic Review of Experience Sampling and Ecological Momentary Assessment Studies.Clinical Psychology Review32, 6 (2012), 510–523. https://doi.org/10.1016/j.cpr.2012.05.007

work page doi:10.1016/j.cpr.2012.05.007 2012
[2]

Tessa Aarts, Panos Markopoulos, Lars Giling, Tudor Vacaretu, and Sigrid Pillen. 2022. Snoozy: A Chatbot-Based Sleep Diary for Children Aged Eight to Twelve. InProceedings of the 21st Annual ACM Interaction Design and Children Conference (IDC ’22). Association for Computing Machinery, 297–307. https://doi.org/10.1145/3501712.3529718

work page doi:10.1145/3501712.3529718 2022
[3]

Ramokapane, and Jose M

Noura Abdi, Kopo M. Ramokapane, and Jose M. Such. 2019. More than smart speakers: security and privacy perceptions of smart home personal assistants. InProceedings of the Fifteenth USENIX Conference on Usable Privacy and Security (Santa Clara, CA, USA)(SOUPS’19). USENIX Association, USA, 451–466

2019
[4]

Abdalsalam Almzayyen, Angel Vela de la Garza Evia, Nick Coronato, and Mehdi Boukhechba. 2022. Voice-Based Conversational Agents for self-reporting fluid consumption and sleep quality. https://arxiv.org/abs/2202.02186

arXiv 2022
[5]

Sonia Ancoli-Israel, Roger Cole, Cathy Alessi, Mark Chambers, William Moorcroft, and Charles P. Pollak. 2003. The Role of Actigraphy in the Study of Sleep and Circadian Rhythms.Sleep26, 3 (2003), 342–392. https://doi.org/10.1093/ sleep/26.3.342

2003
[6]

Frank Bentley, Chris Luvogt, Max Silverman, Rushani Wirasinghe, Brooke White, and Danielle Lottridge. 2018. Understanding the Long-Term Use of Smart Speaker Assistants.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies2, 3, Article 91 (2018), 24 pages. https://doi.org/10.1145/3264901

work page doi:10.1145/3264901 2018
[7]

Timothy Bickmore and Tony Giorgino. 2006. Health Dialog Systems for Patients and Consumers.Journal of Biomedical Informatics39, 5 (2006), 556–571. https://doi.org/10.1016/j.jbi.2005.12.004

work page doi:10.1016/j.jbi.2005.12.004 2006
[8]

Timothy Bickmore, Daniel Schulman, and Langxuan Yin. 2010. Maintaining Engagement in Long-Term Interventions with Relational Agents.Applied Artificial Intelligence24, 6 (2010), 648–666. https://doi.org/10.1080/08839514.2010. 492259

work page doi:10.1080/08839514.2010 2010
[9]

Bickmore and Rosalind W

Timothy W. Bickmore and Rosalind W. Picard. 2005. Establishing and Maintaining Long-Term Human-Computer Relationships.ACM Transactions on Computer-Human Interaction12, 2 (2005), 293–327. https://doi.org/10.1145/ 1067860.1067867

arXiv 2005
[10]

Bickmore, Ha Trinh, Stefan Olafsson, Teresa K

Timothy W. Bickmore, Ha Trinh, Stefan Olafsson, Teresa K. O’Leary, Reza Asadi, Nina M. Rickles, and Ricardo Cruz. 2018. Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant.Journal of Medical Internet Research20, 9 (2018), e11510. https://doi.org/10.2196/11510

work page doi:10.2196/11510 2018
[11]

Niall Bolger, Angelina Davis, and Eshkol Rafaeli. 2003. Diary Methods: Capturing Life as It Is Lived.Annual Review of Psychology54 (2003), 579–616. https://doi.org/10.1146/annurev.psych.54.101601.145030

work page doi:10.1146/annurev.psych.54.101601.145030 2003
[12]

Hudson, Ehsan Adeli, Russ Altman, et al

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, et al . 2021. On the Opportunities and Risks of Foundation Models.arXiv preprint arXiv:2108.07258(2021). arXiv:2108.07258

Pith/arXiv arXiv 2021
[13]

Virginia Braun, Victoria Clarke, Nikki Hayfield, Louise Davey, and Elizabeth Jenkinson. 2023. Doing Reflexive Thematic Analysis. InSupporting Research in Counselling and Psychotherapy: Qualitative, Quantitative, and Mixed Methods Research, Sofie Bager-Charleson and Alistair G. McBeath (Eds.). Palgrave Macmillan, Cham, 19–38. https: //doi.org/10.1007/978-3...

work page doi:10.1007/978-3-031-13942-0_2 2023
[14]

Quick and Dirty

John Brooke. 1996. SUS: A “Quick and Dirty” Usability Scale. InUsability Evaluation in Industry, Patrick W. Jordan, Bruce Thomas, Bernard A. Weerdmeester, and Ian L. McClelland (Eds.). Taylor & Francis, 189–194. https://doi.org/10. 1201/9781498710411-35

1996
[15]

Kemper, Ruth Herman, and Michael A

Cati Brown, Tony Snodgrass, Susan J. Kemper, Ruth Herman, and Michael A. Covington. 2008. Automatic measurement of propositional idea density from part-of-speech tagging.Behavior Research Methods40, 2 (2008), 540–545. https: //doi.org/10.3758/BRM.40.2.540

work page doi:10.3758/brm.40.2.540 2008
[16]

Carney, Daniel J

Colleen E. Carney, Daniel J. Buysse, Sonia Ancoli-Israel, Jack D. Edinger, Andrew D. Krystal, Kenneth L. Lichstein, and Charles M. Morin. 2012. The Consensus Sleep Diary: Standardizing Prospective Sleep Self-Monitoring.Sleep35, 2 (2012), 287–302. https://doi.org/10.5665/sleep.1642

work page doi:10.5665/sleep.1642 2012
[17]

2008.Text Complexity and Reading Comprehension Tests

Erik Castello. 2008.Text Complexity and Reading Comprehension Tests. Number 85 in Linguistic Insights. Peter Lang, Bern

2008
[18]

Stevie Chancellor and Munmun De Choudhury. 2020. Methods in Predictive Techniques for Mental Health Status on Social Media: A Critical Review.npj Digital Medicine3, 1 (2020), 1–11. https://doi.org/10.1038/s41746-020-0233-7

work page doi:10.1038/s41746-020-0233-7 2020
[19]

Shanshan Chen, Panos Markopoulos, and Jun Hu. 2024. Dozzz: Exploring the Feasibility of a Voice-Based Sleep Diary for Children. InProceedings of BCS HCI 2024. BCS, The Chartered Institute for IT. https://doi.org/10.14236/ewic/ BCSHCI2024.10

work page doi:10.14236/ewic/ 2024
[20]

Lee, Bongshin Lee, Wanda Pratt, and Julie A

Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, and Julie A. Kientz. 2014. Understanding Quantified- Selfers’ Practices in Collecting and Exploring Personal Data. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1143–1152. https://doi.org/10.1145/2556288.2557372 28 Mahmood et al

work page doi:10.1145/2556288.2557372 2014
[21]

1988.Statistical Power Analysis for the Behavioral Sciences(2 ed.)

Jacob Cohen. 1988.Statistical Power Analysis for the Behavioral Sciences(2 ed.). Lawrence Erlbaum Associates, Hillsdale, NJ

1988
[22]

Karuna Datta. 2022. Use of a Sleep Diary. InMaking Sense of Sleep Medicine: A Hands-On Guide, Karuna Datta and Deepak Shrivastava (Eds.). CRC Press, 109–120. https://doi.org/10.1201/9781003093381-20

work page doi:10.1201/9781003093381-20 2022
[23]

Edinger, J

Jack D. Edinger, J. Todd Arnedt, Suzanne M. Bertisch, Colleen E. Carney, John J. Harrington, Kenneth L. Lichstein, Michael J. Sateia, Wendy M. Troxel, Eric S. Zhou, Uzma Kazmi, Jonathan L. Heald, and Jennifer L. Martin. 2021. Behavioral and Psychological Treatments for Chronic Insomnia Disorder in Adults: An American Academy of Sleep Medicine Clinical Pra...

2021
[24]

Epstein, An Ping, James Fogarty, and Sean A

Daniel A. Epstein, An Ping, James Fogarty, and Sean A. Munson. 2015. A Lived Informatics Model of Personal Informatics. InProceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 731–742. https://doi.org/10.1145/2750858.2804250

work page doi:10.1145/2750858.2804250 2015
[25]

Alena Ermolina and Victor Tiberius. 2021. Voice-Controlled Intelligent Personal Assistants in Health Care: International Delphi Study.Journal of Medical Internet Research23, 4 (2021), e25312. https://doi.org/10.2196/25312

work page doi:10.2196/25312 2021
[26]

Andrea Grimes, Desney Tan, and Dan Morris. 2009. Toward Technologies That Support Family Reflections on Health. InProceedings of the ACM 2009 International Conference on Supporting Group Work (GROUP ’09). Association for Computing Machinery, New York, NY, USA, 311–320. https://doi.org/10.1145/1531674.1531721

work page doi:10.1145/1531674.1531721 2009
[27]

Hart and Lowell E

Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research.Advances in Psychology52 (1988), 139–183. https://doi.org/10.1016/S0166-4115(08)62386-9

work page doi:10.1016/s0166-4115(08)62386-9 1988
[28]

Allison G. Harvey. 2002. A Cognitive Model of Insomnia.Behaviour Research and Therapy40, 8 (2002), 869–893. https://doi.org/10.1016/S0005-7967(01)00061-4

work page doi:10.1016/s0005-7967(01)00061-4 2002
[29]

Harvey, Kerrie Stinson, Katriina L

Allison G. Harvey, Kerrie Stinson, Katriina L. Whitaker, Damian Moskovitz, and Harmehr Virk. 2008. The Subjective Meaning of Sleep Quality: A Comparison of Individuals with and without Insomnia.Sleep31, 3 (2008), 383–393. https://doi.org/10.1093/sleep/31.3.383

work page doi:10.1093/sleep/31.3.383 2008
[30]

Hufford, Saul Shiffman, Jean Paty, and Arthur A

Michael R. Hufford, Saul Shiffman, Jean Paty, and Arthur A. Stone. 2001. Ecological Momentary Assessment: Real- World, Real-Time Measurement of Patient Experience. InProgress in Ambulatory Assessment: Computer-Assisted Psychological and Psychophysiological Methods in Monitoring and Field Studies, Jochen Fahrenberg and Michael Myrtek (Eds.). Hogrefe & Hube...

2001
[31]

Vanessa Ibáñez, Josep Silva, and Omar Cauli. 2018. A Survey on Sleep Questionnaires and Diaries.Sleep Medicine42 (2018), 90–96. https://doi.org/10.1016/j.sleep.2017.08.026

work page doi:10.1016/j.sleep.2017.08.026 2018
[32]

Michael R. Irwin. 2015. Why Sleep Is Important for Health: A Psychoneuroimmunology Perspective.Annual Review of Psychology66 (2015), 143–172. https://doi.org/10.1146/annurev-psych-010213-115205

work page doi:10.1146/annurev-psych-010213-115205 2015
[33]

Zhiqiu Jiang, Mashrur Rashik, Kunjal Panchal, Mahmood Jasim, Ali Sarvghad, Pari Riahi, Erica DeWitt, Fey Thurber, and Narges Mahyar. 2023. CommunityBots: Creating and Evaluating A Multi-Agent Chatbot Platform for Public Input Elicitation.Proceedings of the ACM on Human-Computer Interaction7, CSCW1 (2023), 1–32. https://doi.org/10.1145/ 3579469

2023
[34]

Soomin Kim, Jinsu Lee, and Gahgene Gweon. 2019. Comparing Data from Chatbot and Web Surveys: Effects of Platform and Conversational Style on Survey Response Quality. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12. https://doi.org/10.1145/3290605.3300316

work page doi:10.1145/3290605.3300316 2019
[35]

Terry K Koo and Mae Y Li. 2016. A guideline of selecting and reporting intraclass correlation coefficients for reliability research.Journal of chiropractic medicine15, 2 (2016), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012

work page doi:10.1016/j.jcm.2016.02.012 2016
[36]

Arnardottir

Hlín Kristbergsdóttir, Anna Sigridur Islind, Lisa Schmitz, and Erna S. Arnardottir. 2023. Working Towards a Novel Digital Sleep Diary Standard.ERJ Open Research9, suppl 11 (2023), 73. https://doi.org/10.1183/23120541.sleepandbreathing- 2023.73

work page doi:10.1183/23120541.sleepandbreathing- 2023
[37]

Dunn, Huong Ly Tong, Ahmet Baki Kocaballi, Jessica A

Liliana Laranjo, Adam G. Dunn, Huong Ly Tong, Ahmet Baki Kocaballi, Jessica A. Chen, Rabia Bashir, Didi Surian, Blanca Gallego, Farah Magrabi, Annie Y. S. Lau, and Enrico W. Coiera. 2018. Conversational Agents in Healthcare: A Systematic Review.Journal of the American Medical Informatics Association25, 9 (2018), 1248–1258. https: //doi.org/10.1093/jamia/ocy072

work page doi:10.1093/jamia/ocy072 2018
[38]

Josephine Lau, Benjamin Zimmerman, and Florian Schaub. 2018. Alexa, Are You Listening?: Privacy Perceptions, Concerns and Privacy-Seeking Behaviors with Smart Speakers.Proceedings of the ACM on Human-Computer Interaction 2, CSCW, Article 102 (2018), 31 pages. https://doi.org/10.1145/3274371

work page doi:10.1145/3274371 2018
[39]

Lauderdale, Kristen L

Diane S. Lauderdale, Kristen L. Knutson, Lijing L. Yan, Kiang Liu, and Paul J. Rathouz. 2008. Self-Reported and Measured Sleep Duration: How Similar Are They?Epidemiology19, 6 (2008), 838–845. https://doi.org/10.1097/EDE. 0b013e318187a7b0

work page doi:10.1097/ede 2008
[40]

Amanda Lazar, Christian Koehler, Joshua Tanenbaum, and David H. Nguyen. 2015. Why We Use and Abandon Smart Devices. (2015), 635–646. https://doi.org/10.1145/2750858.2804288 Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep 29

work page doi:10.1145/2750858.2804288 2015
[41]

Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine388, 13 (2023), 1233–1239. https://doi.org/10.1056/NEJMsr2214184

work page doi:10.1056/nejmsr2214184 2023
[42]

Starling, Eric S

Daniel Lewin, Claire M. Starling, Eric S. Zhou, Daniel Greenberg, Callen Shaw, and Hannah Arem. 2024. A Novel Voice Interactive Sleep Log: Concurrent Validity with Actigraphy and Sleep Diaries.Journal of Clinical Sleep Medicine 20, 2 (2024), 309–312. https://doi.org/10.5664/jcsm.10878

work page doi:10.5664/jcsm.10878 2024
[43]

Lloyd-Jones, Norrina B

Donald M. Lloyd-Jones, Norrina B. Allen, Cheryl A. M. Anderson, Tiffany Black, LaPrincess C. Brewer, Randi E. Foraker, Michael A. Grandner, Helen Lavretsky, Amanda M. Perak, Garima Sharma, and Wayne Rosamond. 2022. Life’s Essential 8: Updating and Enhancing the American Heart Association’s Construct of Cardiovascular Health: A Presidential Advisory From t...

2022
[44]

Irene Lopatovska and Harriet Williams. 2018. Personification of the Amazon Alexa: BFF or a Mindless Companion. In Proceedings of the 2018 Conference on Human Information Interaction and Retrieval. 265–268. https://doi.org/10.1145/ 3176349.3176868

arXiv 2018
[45]

Roshan Maharjan, Kate O’Doherty, David A Rohani, Patrick Bekgaard, and Jakob E Bardram. 2022. Experiences of a speech-enabled conversational agent for the self-report of well-being among people living with affective disorders: an in-the-wild study.ACM Transactions on Interactive Intelligent Systems12, 2 (2022), 1–31. https://doi.org/10.1145/3484508

work page doi:10.1145/3484508 2022
[46]

Amama Mahmood, Junxiang Wang, and Chien-Ming Huang. 2026. Situated Understanding of Errors in Older Adults’ Interactions with Voice Assistants: A Month-Long, In-Home Study.ACM Transactions on Accessible Computing19, 1, Article 2 (March 2026), 36 pages. https://doi.org/10.1145/3796236

work page doi:10.1145/3796236 2026
[47]

Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang. 2025. User Interaction Patterns and Breakdowns in Conversing with LLM-Powered Voice Assistants.International Journal of Human-Computer Studies195 (2025), 103406. https://doi.org/10.1016/j.ijhcs.2024.103406

work page doi:10.1016/j.ijhcs.2024.103406 2025
[48]

Nathan Malkin, Joe Deatrick, Allen Tong, Primal Wijesekera, Serge Egelman, and David Wagner. 2019. Privacy Attitudes of Smart Speaker Users.Proceedings on Privacy Enhancing Technologies2019, 4 (2019), 250–271. https: //doi.org/10.2478/popets-2019-0068

work page doi:10.2478/popets-2019-0068 2019
[49]

Mary L. McHugh. 2012. Interrater Reliability: The Kappa Statistic.Biochemia Medica22, 3 (2012), 276–282

2012
[50]

Alexa, I Just Ate a Donut

Louise A. C. Millard, Laura Johnson, Samuel R. Neaves, Peter A. Flach, Kate Tilling, and Deborah A. Lawlor. 2022. “Alexa, I Just Ate a Donut”: A Pilot Study Collecting Food and Drink Intake Data with Voice Input.medRxiv(2022). https://doi.org/10.1101/2022.06.28.22276999 Preprint

work page doi:10.1101/2022.06.28.22276999 2022
[51]

Moore and Raphael Arar

Robert J. Moore and Raphael Arar. 2019.Conversational UX Design: A Practitioner’s Guide to the Natural Conversation Framework. Association for Computing Machinery. https://doi.org/10.1145/3304087

work page doi:10.1145/3304087 2019
[52]

Morin and Ruth Benca

Charles M. Morin and Ruth Benca. 2012. Chronic Insomnia.The Lancet379, 9821 (2012), 1129–1141. https: //doi.org/10.1016/S0140-6736(11)60750-2

work page doi:10.1016/s0140-6736(11)60750-2 2012
[53]

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on Medical Challenge Problems.arXiv preprint arXiv:2303.13375(2023). https://arxiv.org/abs/2303.13375

Pith/arXiv arXiv 2023
[54]

OpenAI. 2023. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774(2023). arXiv:2303.08774

Pith/arXiv arXiv 2023
[55]

Martin Pielot, Karen Church, and Rodrigo De Oliveira. 2014. An In-Situ Study of Mobile Phone Notifications. In Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices and Services. 233–242. https://doi.org/10.1145/2628363.2628364

work page doi:10.1145/2628363.2628364 2014
[56]

Fischer, Stuart Reeves, and Sarah Sharples

Martin Porcheron, Joel E. Fischer, Stuart Reeves, and Sarah Sharples. 2018. Voice Interfaces in Everyday Life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12. https://doi.org/10.1145/3173574. 3174214

work page doi:10.1145/3173574 2018
[57]

Accessibility Came by Accident

Alisha Pradhan, Kanika Mehta, and Leah Findlater. 2018. “Accessibility Came by Accident”: Use of Voice-Controlled Intelligent Personal Assistants by People with Disabilities. InProceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/ 3173574.3174033

arXiv 2018
[58]

Simon Provoost, Ho Ming Lau, Jeroen Ruwaard, and Heleen Riper. 2017. Embodied Conversational Agents in Clinical Psychology: A Scoping Review.Journal of Medical Internet Research19, 5 (2017), e151. https://doi.org/10.2196/jmir.6553

work page doi:10.2196/jmir.6553 2017
[59]

Rebecca Robbins, Azizi Seixas, Lillian Walton Masters, Nicholas Chanko, Faiyaz Diaby, Dorice Vieira, and Girardin Jean-Louis. 2019. Sleep Tracking: A Systematic Review of the Research Using Commercially Available Technology. Current Sleep Medicine Reports5, 3 (2019), 156–163. https://doi.org/10.1007/s40675-019-00150-1

[60]

Harshita Sahijwani. 2022. Adaptive Dialogue Management for Conversational Information Elicitation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 3495. https://doi.org/10.1145/3477495.3531684

[61]

Saul Shiffman, Arthur A. Stone, and Michael R. Hufford. 2008. Ecological Momentary Assessment. Annual Review of Clinical Psychology 4 (2008), 1–32. https://doi.org/10.1146/annurev.clinpsy.3.022806.091415

[62]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj G...

doi:10.1038/s41586-023-06291-2 (2023)
[63]

Gabriel Skantze. 2021. Turn-Taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language 67 (2021), 101178. https://doi.org/10.1016/j.csl.2020.101178

[64]

Claire M. Starling, Daniel Greenberg, Daniel Lewin, Callen Shaw, Eric S. Zhou, Daniel Lieberman, and Hannah Arem. 2024. Voice-Activated Cognitive Behavioral Therapy for Insomnia: A Randomized Clinical Trial. JAMA Network Open 7, 9 (2024), e2435011. https://doi.org/10.1001/jamanetworkopen.2024.35011

[66]

Arthur A. Stone, Saul Shiffman, Joseph E. Schwartz, Joan E. Broderick, and Michael R. Hufford. 2002. Patient Non-Compliance with Paper Diaries. BMJ 324, 7347 (2002), 1193–1194. https://doi.org/10.1136/bmj.324.7347.1193

[67]

Arthur A. Stone, Saul Shiffman, Joseph E. Schwartz, Joan E. Broderick, and Michael R. Hufford. 2003. Patient Compliance with Paper and Electronic Diaries. Controlled Clinical Trials 24, 2 (2003), 182–199. https://doi.org/10.1016/S0197-2456(02)00320-3

[68]

Jacob E. Sunshine. 2022. Smart Speakers: The Next Frontier in mHealth. JMIR mHealth and uHealth 10, 2 (2022), e28686. https://doi.org/10.2196/28686

[69]

Linkai Tao, Myrte Elise Thoolen, Bram de Vogel, Loe M. G. Feijs, Wei Chen, and Jun Hu. 2019. EVE: A Combined Physical-Digital Interface for Insomnia Sleep Diary. In Intelligent Systems and Applications: Proceedings of the 2018 Intelligent Systems Conference (IntelliSys), Volume 2 (Advances in Intelligent Systems and Computing, Vol. 869). Springer, Cham, 46...

doi:10.1007/978-3-030-01057-7_37 (2019)
[70]

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large Language Models in Medicine. Nature Medicine 29, 8 (2023), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8

[71]

Lorainne Tudor Car, Dharshini A. Dhinagaran, Bhone Myint Kyaw, Tobias Kowatsch, J. S. Rayhan, Yin-Leng Theng, and Rifat Atun. 2020. Conversational Agents in Health Care: Scoping Review and Conceptual Analysis. Journal of Medical Internet Research 22, 8 (2020), e17158. https://doi.org/10.2196/17158

[72]

Leyao Wang, Zhiyu Wan, Congning Ni, Qingyuan Song, Yang Li, Ellen Clayton, Bradley Malin, and Zhijun Yin. 2024. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. Journal of Medical Internet Research 26 (2024), e22769. https://doi.org/10.2196/22769

[73]

Carolin Wienrich, Clemens Reitelbach, and Astrid Carolus. 2021. The Trustworthiness of Voice Assistants in the Context of Healthcare: Investigating the Effect of Perceived Expertise on the Trustworthiness of Voice Assistants, Providers, Data Receivers, and Automatic Speech Recognition. Frontiers in Computer Science 3 (2021). https://doi.org/10.3389/fcomp.2...

doi:10.3389/fcomp.2021.685250 (2021)
[74]

Michael Wornow, Yizhe Xu, Rachana Thapa, Bhavik Patel, Elissa Steinberg, Sarah Fleming, Marc A. Pfeffer, Jason Fries, and Nigam H. Shah. 2023. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digital Medicine 6, 1 (2023), 135. https://doi.org/10.1038/s41746-023-00879-8

[75]

Ziang Xiao, Michelle X. Zhou, Q. Vera Liao, Gloria Mark, Changyan Chi, Wenxi Chen, and Huahai Yang. 2020. Tell Me About Yourself: Using an AI-Powered Chatbot to Conduct Conversational Surveys with Open-Ended Questions. ACM Transactions on Computer-Human Interaction 27, 3, Article 15 (2020), 37 pages. https://doi.org/10.1145/3381804

[76]

Nima Zargham, Leon Reicherts, Michael Bonfert, Sarah Theres Völkel, Johannes Schöning, Rainer Malaka, and Yvonne Rogers. 2022. Understanding Circumstances for Desirable Proactive Behaviour of Voice Assistants: The Proactivity Dilemma. In Proceedings of the 4th Conference on Conversational User Interfaces (CUI ’22). Association for Computing Machinery, New ...

doi:10.1145/3543829.3543834 (2022)
[77]

Wayne Xin Zhao, Kun Zhou, Junyi Li, et al. 2026. A Survey of Large Language Models. Frontiers of Computer Science 20, 12 (2026), 2012627. https://doi.org/10.1007/s11704-026-60308-3

A Sleep diary questions

Morning Sleep Diary (1) Time the user physi...

