arxiv: 2605.08590 · v1 · submitted 2026-05-09 · 💻 cs.HC · cs.AI· cs.CL· cs.CY

Recognition: no theorem link

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

Shanshan Zhu , Han Zhang , J. Doris Chi , Subigya Nepal , Koustuv Saha

Authors on Pith no claims yet

Pith reviewed 2026-05-12 00:49 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CLcs.CY

keywords epistemic overreachLLM explanationspersonal sensingsensor datacausal attributionanomaly explanationcollege student dataevidential grounding

0 comments

The pith

LLMs frequently generate causal explanations for personal sensing anomalies that go beyond what the sensor data can support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how large language models create natural-language stories explaining unusual days from personal sensor traces like activity and mood. The authors define epistemic overreach as when these explanations claim more than the evidence allows. They test this on thousands of scenarios from three college student datasets using multiple LLM families and different prompting styles. The results show that unsupported causal attributions are common, extra data does not fix it reliably, and even prompts to stick to the data only partially help. This matters because such explanations are used in personal sensing apps, so users might get misleading accounts of their own behavior.

Core claim

We introduce epistemic overreach (EO) as a measure for cases where a generated explanation implies more than the available sensing evidence can justify. To audit how often and in what forms EO occurs, we obtained anomalous-day scenarios from three longitudinal sensing datasets of college students. Across activity, sleep, and affect anomalies, we generated 14,922 explanations using three LLM families under two prompting conditions. We find that LLMs routinely attribute anomalous days to causes without sufficient support from the data, and that this pattern replicates across datasets, anomaly types, and model families. Further, providing richer context does not reliably reduce EO; bounded 0,0,

What carries the argument

Epistemic overreach, measured by a structured rubric that scores explanations on unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference.

If this is right

Evidential grounding should be a primary criterion for evaluating LLM-generated personal sensing explanations, alongside fluency and plausibility.
Personal sensing explanation systems need to explicitly distinguish what is observed, what is inferred, and what remains unknown.
The pattern of overreach holds across different anomaly types and data sources, suggesting it is a general property of current LLMs in this use case.
Bounded prompting reduces but does not remove epistemic overreach, indicating a need for stronger controls or architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users of personal sensing apps that use LLMs for explanations may need warnings or alternative non-causal summaries to avoid overinterpreting their data.
Similar audits could be applied to LLM use in other data-driven storytelling domains like financial reports or health diagnostics.
Future models might incorporate explicit uncertainty tracking or retrieval from raw data to reduce overreach.
Developers should test for EO before deploying LLM explainers in high-stakes personal data contexts.

Load-bearing premise

The structured rubric used to score epistemic overreach accurately reflects when explanations exceed the evidence, without major bias or inconsistency in how evaluators apply it.

What would settle it

A follow-up study that applies the same scenarios and models but uses a different set of human evaluators or an automated metric for overreach and finds substantially lower rates of unsupported attributions would challenge the finding that overreach is routine.

Figures

Figures reproduced from arXiv: 2605.08590 by Han Zhang, J. Doris Chi, Koustuv Saha, Shanshan Zhu, Subigya Nepal.

**Figure 1.** Figure 1: Empirical audit workflow. We construct participant-relative anomalous-day scenarios from three longitudinal sensing [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 2.** Figure 2: Mean normalized EO score by generation model and prompt policy. EO score is computed as the proportion of 16 [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: EO by rubric dimension across dataset, generation model, and prompt policy. Scores are shown on the original 0–1 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Relative percentage difference in EO score by anomaly type and evidence tier, shown separately by generation model [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

LLMs are increasingly used to explain personal sensing data, translating traces of activity and mood into natural-language accounts of why an anomalous day may have occurred. However, such explanations can sound coherent and personally meaningful even when the underlying evidence is sparse or missing. We introduce epistemic overreach (EO) as a measure for cases where a generated explanation implies more than the available sensing evidence can justify. To audit how often and in what forms EO occurs, we obtained anomalous-day scenarios from three longitudinal sensing datasets of college students: StudentLife, GLOBEM, and CollegeExperience. Across activity, sleep, and affect anomalies, we generated 14,922 explanations using three LLM families -- Llama, Qwen, and GPT -- under two prompting conditions: one minimally constrained prompt and another prompt explicitly instructing models to bound claims to the data. For each scenario, we varied the amount of behavioral evidence available to the model to examine whether more evidence reduces EO. We evaluated each explanation using a structured rubric, decomposing EO into the dimensions of unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference. We find that LLMs routinely attribute anomalous days to causes without sufficient support from the data, and that this pattern replicates across datasets, anomaly types, and model families. Further, providing richer context does not reliably reduce EO; bounded prompting helps but does not eliminate it. These findings suggest that evidential grounding should be a first-order evaluation criterion for LLM-generated personal sensing explanations, alongside fluency and plausibility. We argue that personal sensing explanations require evidential discipline: systems must distinguish what is observed, what is inferred, and what remains unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs routinely overreach when explaining personal sensor anomalies, and the large multi-model audit is the useful part while the rubric evaluation needs reliability checks.

read the letter

The main takeaway is that LLMs frequently generate causal explanations for anomalous days in activity, sleep, or affect data that go beyond what the sensor traces actually support, and richer context or explicit bounding prompts only reduce the problem without removing it. This holds across the three datasets and model families they tested. The paper defines epistemic overreach clearly and runs a controlled audit at scale—nearly 15k generations with systematic variation in evidence levels and prompt types—which gives a concrete sense of how widespread the issue is in this setting. That controlled design and the decomposed rubric are the parts that work well and make the results worth looking at for anyone building personal sensing tools. The soft spot is the evaluation step. The headline claims rest on human scoring of the five rubric dimensions, yet the abstract gives no inter-rater reliability numbers or validation against expert gold labels for what counts as sufficient support. If scorers share background assumptions about plausible personal causes, the consistent patterns across models could partly reflect that rather than an intrinsic LLM property. The paper would benefit from showing how they operationalized the dimensions on real traces. This is for HCI and responsible-AI readers who work on explanation interfaces or health-data systems. Someone evaluating LLM outputs for personal data would get direct value from the audit numbers and the practical takeaway about evidential grounding. It deserves a serious referee because the empirical scale and the replication across conditions are strong enough to justify review time, even with the rubric questions.

Referee Report

2 major / 2 minor

Summary. The paper introduces epistemic overreach (EO) as a measure of LLM-generated explanations for anomalous days in personal sensing data that imply more than the available sensor traces justify. Using three public longitudinal datasets (StudentLife, GLOBEM, CollegeExperience), it generates 14,922 explanations across activity, sleep, and affect anomalies with Llama, Qwen, and GPT models under minimal and bounded prompting conditions while varying available behavioral evidence. Each explanation is scored on a five-dimension rubric (unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, diagnostic inference), yielding the finding that EO occurs routinely, replicates across datasets/models/anomaly types, richer context does not reliably reduce it, and bounded prompting helps but does not eliminate it.

Significance. If the results hold, the work is significant for establishing evidential grounding as a necessary evaluation criterion for LLM use in personal sensing applications. It earns credit for its large generation volume (14,922 explanations), use of three public longitudinal datasets, and decomposed rubric applied across multiple conditions and model families, which together provide solid empirical grounding for the replication claims.

major comments (2)

[Evaluation section / Rubric description] Evaluation section / Rubric description: The paper applies a five-dimension rubric to score EO but reports no inter-rater reliability statistics, no validation against domain-expert gold labels, and no explicit operationalization of 'sufficient support from the data' for activity/sleep/affect anomalies. Because the central claims about routine EO and the effects of context/prompting rest entirely on these scores, the absence of reliability metrics is load-bearing.
[Results on context variation] Results on context variation: The claim that 'providing richer context does not reliably reduce EO' depends on how the amount of behavioral evidence was varied in the prompts; without a precise description of the context levels (e.g., which sensor features were added or withheld) or controls for prompt length, it is difficult to isolate the effect of evidence volume from other prompt properties.

minor comments (2)

[Abstract] Abstract: The statement that 'bounded prompting helps but does not eliminate it' would be clearer if it included the magnitude of the observed reduction (e.g., percentage-point change in EO rate) rather than a qualitative description.
[Methods] Methods: The exact wording of the 'minimally constrained prompt' and the 'bounded prompt' should be provided verbatim in the main text or an appendix so readers can assess how the bounding instructions were phrased.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments identify important areas for clarification and strengthening of the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Evaluation section / Rubric description] The paper applies a five-dimension rubric to score EO but reports no inter-rater reliability statistics, no validation against domain-expert gold labels, and no explicit operationalization of 'sufficient support from the data' for activity/sleep/affect anomalies. Because the central claims about routine EO and the effects of context/prompting rest entirely on these scores, the absence of reliability metrics is load-bearing.

Authors: We agree that greater transparency around the rubric is warranted. The five dimensions were derived from the sensor traces available in the datasets, with 'sufficient support' operationalized as the presence of direct quantitative evidence (e.g., step-count deviation or location change for activity anomalies; sleep-duration or timing data for sleep anomalies; self-report affect scores for affect anomalies). Scoring was performed by the author team with consensus resolution of borderline cases. We did not collect external expert labels or compute inter-rater reliability statistics. In the revised manuscript we will (1) add an expanded Evaluation section that provides explicit operational definitions and one annotated scoring example per anomaly type, (2) include a table mapping each rubric dimension to the specific sensor features that constitute support, and (3) add a limitations paragraph noting the absence of external validation and IRR as a direction for subsequent work. These changes increase transparency without altering the core findings. revision: partial
Referee: [Results on context variation] The claim that 'providing richer context does not reliably reduce EO' depends on how the amount of behavioral evidence was varied in the prompts; without a precise description of the context levels (e.g., which sensor features were added or withheld) or controls for prompt length, it is difficult to isolate the effect of evidence volume from other prompt properties.

Authors: We appreciate the request for precision. Context was varied across three explicitly defined levels: minimal (anomaly flag and date only), aggregate (daily summary statistics such as total steps, sleep duration, and average affect), and full (time-series traces plus location and app-usage features). Structured prompt templates were used to keep token length comparable across conditions. In the revision we will insert a table that lists the exact sensor features supplied at each level, report mean and standard-deviation token counts per condition, and add a supplementary analysis confirming that prompt length does not explain the observed EO patterns. These additions will allow readers to isolate the contribution of evidence volume. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical audit with no derivations or self-referential predictions

full rationale

The paper conducts an empirical audit of LLM-generated explanations for sensor data anomalies. It generates 14,922 explanations across datasets and models, then applies a human-defined rubric to score epistemic overreach dimensions. No equations, fitted parameters, uniqueness theorems, or predictive claims appear in the abstract or described methods. The central findings derive directly from the generated text evaluated against provided traces, with no step reducing by construction to prior inputs or self-citations. The rubric is an evaluation tool, not a derived result. This matches the default expectation of no significant circularity for non-mathematical empirical studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that the chosen rubric dimensions validly operationalize epistemic overreach and that the selected anomalous-day scenarios from college-student datasets are representative of real personal sensing use cases.

axioms (1)

domain assumption The structured rubric (unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, diagnostic inference) reliably identifies explanations that exceed available sensing evidence.
All quantitative findings rest on human application of this rubric; its validity is not independently validated in the abstract.

invented entities (1)

epistemic overreach (EO) no independent evidence
purpose: A composite measure for LLM explanations that imply more causal or diagnostic content than the sensor traces justify.
Newly defined construct introduced to frame the audit; no external falsifiable test is provided.

pith-pipeline@v0.9.0 · 5631 in / 1345 out tokens · 48116 ms · 2026-05-12T00:49:28.789800+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 3 internal anchors

[1]

Lim, and Mohan Kankan- halli

Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. InProceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–18. https://doi.org/1...

work page doi:10.1145/3173574.3174156 2018
[2]

Gregory D Abowd, Anind K Dey, Peter J Brown, Nigel Davies, Mark Smith, and Pete Steggles. 1999. Towards a better understanding of context and context-awareness. InInternational Symposium on Handheld and Ubiquitous Computing. Springer, 304–307

work page 1999
[3]

Adler, Vincent W.-S

Daniel A. Adler, Vincent W.-S. Tseng, Gengmo Qi, Joseph Scarpa, Srijan Sen, and Tanzeem Choudhury. 2021. Identifying Mobile Sensing Indicators of Stress-Resilience.Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.5, 2, Article 51 (2021). https://doi.org/10.1145/ 3463528

work page 2021
[4]

In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems

Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems(Glasgow, Scotland, UK)(CHI ’19). Associa...

work page doi:10.1145/3290605.3300233 2019
[5]

D’Mello, Munmun De Choudhury, Gregory D

Mehrab Bin Morshed, Koustuv Saha, Richard Li, Sidney K. D’Mello, Munmun De Choudhury, Gregory D. Abowd, and Thomas Plötz

work page
[6]

Prediction of Mood Instability with Passive Sensing.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies3, 3 (2019)

work page 2019
[7]

Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. 2021. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making.Proceedings of the ACM on Human-Computer Interaction5, CSCW1 (2021), 1–21

work page 2021
[8]

Wenhu Chen, Xinyi Wang, and William Yang Wang. 2021. A Dataset for Answering Time-Sensitive Questions. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1. https://arxiv.org/abs/2108.06314

work page arXiv 2021
[9]

Eun Kyoung Choe, Bongshin Lee, Haining Zhu, Nathalie Henry Riche, and Dominikus Baur. 2017. Understanding Self-Reflection: How People Reflect on Personal Data through Visual Data Exploration.Proceedings of the 11th EAI International Conference on Pervasive Computing Technologies for Healthcare(2017), 173–182. https://doi.org/10.1145/3154862.3154881

work page doi:10.1145/3154862.3154881 2017
[10]

Shaan Chopra, Katherine Juarez, James Fogarty, and Sean A Munson. 2025. Engagements with Generative AI and Personal Health Informatics: Opportunities for Planning, Tracking, Reflecting, and Acting around Personal Health Data.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies9, 3 (2025), 1–33

work page 2025
[11]

Akshat Choube, Sohini Bhattacharya, Rahul Majethia, Jiachen Li, Vedant Das Swain, and Varun Mishra. 2025. Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies9, 4 (2025), 1–30

work page 2025
[12]

Akshat Choube, Ha Le, Jiachen Li, Kaixin Ji, Vedant Das Swain, and Varun Mishra. 2025. GLOSS: Group of LLMs for open-ended sensemaking of passive sensing data for health and wellbeing.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies9, 3 (2025), 1–32. Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-...

work page 2025
[13]

Zheng Chu, Jingchang Chen, Qianglon Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. 2024. TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 1197...

work page 2024
[14]

Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Yun Liu, Shyam A Tailor, Tim Althoff, John B Hernandez, Yossi Matias, Greg Corrado, et al. 2024. Towards a Personal Health Large Language Model. InAdvancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond

work page 2024
[15]

Aykut Coşkun and Armağan Karahanoğlu. 2023. Data Sensemaking in Self-Tracking: Towards a New Generation of Self-Tracking Tools. International Journal of Human–Computer Interaction(2023). https://doi.org/10.1080/10447318.2022.2075637

work page doi:10.1080/10447318.2022.2075637 2023
[16]

In this online environment, we’re limited

Vedant Das Swain, Victor Chen, Shrija Mishra, Stephen M. Mattingly, Gregory D. Abowd, and Munmun De Choudhury. 2022. Semantic Gap in Predicting Mental Wellbeing through Passive Sensing. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 374, 16 pages. ht...

work page doi:10.1145/3491102 2022
[17]

Gregg, Suwen Lin, Gonzalo J

Vedant Das Swain, Koustuv Saha, Hemang Rajvanshy, Anusha Sirigiri, Julie M. Gregg, Suwen Lin, Gonzalo J. Martinez, Stephen M. Mattingly, Shayan Mirjafari, Raghu Mulukutla, and et al. 2019. A Multisensor Person-Centered Approach to Understand the Role of Daily Activities in Job Performance with Organizational Personas.Proc. ACM IMWUT(2019)

work page 2019
[18]

Morris, Chun-Cheng Chang, Xuhai Xu, Xin Liu, Shwetak Patel, and Vikram Iyer

Zachary Englhardt, Chengqian Ma, Margaret E. Morris, Chun-Cheng Chang, Xuhai Xu, Xin Liu, Shwetak Patel, and Vikram Iyer. 2024. From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models.Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.8, 2, Article 56 (2024). https://do...

work page doi:10.1145/3659604 2024
[19]

Epstein, Monica Caraway, Chuck Johnston, An Ping, James Fogarty, and Sean A

Daniel A. Epstein, Monica Caraway, Chuck Johnston, An Ping, James Fogarty, and Sean A. Munson. 2016. Beyond Abandonment to Next Steps: Understanding and Designing for Life after Personal Informatics Tool Use. InProceedings of the 2016 CHI Conference on Human Factors in Computing Systems(San Jose, California, USA)(CHI ’16). Association for Computing Machin...

work page doi:10.1145/2858036.2858045 2016
[20]

InProceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’15)

Daniel A. Epstein, An Ping, James Fogarty, and Sean A. Munson. 2015. A Lived Informatics Model of Personal Informatics. InProceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing(Osaka, Japan)(UbiComp ’15). Association for Computing Machinery, New York, NY, USA, 731–742. https://doi.org/10.1145/2750858.2804250

work page doi:10.1145/2750858.2804250 2015
[21]

Alexander Richard Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2587–2601

work page 2022
[22]

Cathy Mengying Fang, Valdemar Danry, Nathan Whitmore, Andria Bao, Andrew Hutchison, Cayden Pierce, and Pattie Maes. 2024. PhysioLLM: Supporting Personalized Health Insights with Wearables and Large Language Models. In2024 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 1–8. https://doi.org/10.1109/BHI62660.2024.10913781

work page doi:10.1109/bhi62660.2024.10913781 2024
[23]

Drishti Goel, Jeongah Lee, Qiuyue Joy Zhong, Violeta J Rodriguez, Daniel S Brown, Ravi Karkar, Dong Whi Yoo, and Koustuv Saha

work page
[24]

RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions.arXiv preprint arXiv:2601.13235(2026)

work page arXiv 2026
[25]

Miriam Greis, Jessica Hullman, Michael Correll, Matthew Kay, and Orit Shaer. 2017. Designing for Uncertainty in HCI: When Does Uncertainty Help?. InProceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’17). Association for Computing Machinery, New York, NY, USA, 593–600. https://doi.org/10.1145/3027063.3027091

work page doi:10.1145/3027063.3027091 2017
[26]

Harari, Sandrine S

Gabriella M. Harari, Sandrine S. Vaid, Sandrine R. Müller, Clemens Stachl, Zachary Marrero, Ramona Schoedel, Markus Bühner, and Samuel D. Gosling. 2020. Personality Sensing for Theory Development and Assessment in the Digital Age.European Journal of Personality34, 5, 649–669. https://doi.org/10.1002/per.2273

work page doi:10.1002/per.2273 2020
[27]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Transactions on Information Systems43, 2, Article 43 (2025). https://doi.org/10.1145/3703155

work page doi:10.1145/3703155 2025
[28]

Thomas R Insel. 2017. Digital phenotyping: technology for a new science of behavior.Jama318, 13 (2017), 1215–1216

work page 2017
[29]

Sijie Ji, Xinzhe Zheng, Jiawei Sun, Renqi Chen, Wei Gao, and Mani Srivastava. 2024. MindGuard: Towards Accessible and Sitgma-free Mental Health First Aid via Edge LLM. arXiv:2409.10064 [cs.HC] https://arxiv.org/abs/2409.10064

work page arXiv 2024
[30]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Wenliang Dai, Andrea Madotto, and Pascale Fung

work page
[31]

ACM Comput

Survey of Hallucination in Natural Language Generation.Comput. Surveys55, 12, Article 248 (2023). https://doi.org/10.1145/3571730

work page doi:10.1145/3571730 2023
[32]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

I Didn’t Know I Looked Angry

Harmanpreet Kaur, Daniel McDuff, Alex C Williams, Jaime Teevan, and Shamsi T Iqbal. 2022. “I Didn’t Know I Looked Angry”: Characterizing Observed Emotion and Reported Affect at Work. InCHI Conference on Human Factors in Computing Systems. 1–18. 22•Anonymous Author(s)

work page 2022
[34]

Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. 2020. Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New ...

work page doi:10.1145/3313831.3376219 2020
[35]

Mathew V Kiang, Jarvis T Chen, Nancy Krieger, Caroline O Buckee, Monica J Alexander, Justin T Baker, Randy L Buckner, Garth Coombs III, Janet W Rich-Edwards, Kenzie W Carlson, et al. 2021. Sociodemographic characteristics of missing data in digital phenotyping. Scientific Reports11, 1 (2021), 15408

work page 2021
[36]

Jiwon Kim, Violeta J Rodriguez, Dong Whi Yoo, Eshwar Chandrasekharan, and Koustuv Saha. 2026. PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support.arXiv preprint arXiv:2601.12754(2026)

work page arXiv 2026
[37]

Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. 2024. Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data. InProceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Association for Computing Machinery. https://doi.org/10.1145/3698587.3701355

work page doi:10.1145/3698587.3701355 2024
[38]

2019.Content Analysis

Klaus Krippendorff. 2019.Content Analysis. SAGE Publications, Inc. https://doi.org/10.4135/9781071878781

work page doi:10.4135/9781071878781 2019
[39]

Ian Li, Anind Dey, and Jodi Forlizzi. 2010. A Stage-Based Model of Personal Informatics Systems. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Atlanta, Georgia, USA)(CHI ’10). Association for Computing Machinery, New York, NY, USA, 557–566. https://doi.org/10.1145/1753326.1753409

work page doi:10.1145/1753326.1753409 2010
[40]

Dey, and Jodi Forlizzi

Ian Li, Anind K. Dey, and Jodi Forlizzi. 2011. Understanding My Data, Myself: Supporting Self-Reflection with Ubicomp Technologies. InProceedings of the 13th International Conference on Ubiquitous Computing(Beijing, China)(UbiComp ’11). Association for Computing Machinery, New York, NY, USA, 405–414. https://doi.org/10.1145/2030112.2030166

work page doi:10.1145/2030112.2030166 2011
[41]

Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 6449–6464

work page 2023
[42]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al . 2022. Holistic Evaluation of Language Models.Transactions on Machine Learning Research. https: //arxiv.org/abs/2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229

work page doi:10.18653/v1/2022.acl-long.229 2022
[44]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2511–2522. https://doi.org/10.18653/v1/2023.emnlp-main.153

work page doi:10.18653/v1/2023.emnlp-main.153 2023
[45]

Yuhan Luo, Xinning Gui, Xianghua Ding, Xi Zheng, Rie Helene Hernandez, Zhuoyang Li, and Qiurong Song. 2025. Reflecting Upon The Unintended Consequences of Personal Informatics Systems: A Systematic Review of Empirical Studies. InProceedings of the 2025 ACM Designing Interactive Systems Conference (DIS ’25). Association for Computing Machinery, New York, N...

work page doi:10.1145/3715336.3735746 2025
[46]

Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 9004–9017

work page 2023
[47]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1906–1919. https://doi.org/10.18653/v1/2020.acl-main.173

work page doi:10.18653/v1/2020.acl-main.173 2020
[48]

Lakmal Meegahapola, William Droz, Peter Kun, Amalia de Götzen, Chaitanya Nutakki, Shyam Diwakar, Salvador Ruiz Correa, Donglei Song, Hao Xu, Miriam Bidoglia, et al. 2023. Generalization and Personalization of Mobile Sensing-Based Mood Inference Models: An Analysis of College Students in Eight Countries.Proc. ACM Interact. Mob. Wearable Ubiquitous Technol....

work page doi:10.1145/3569483 2023
[49]

Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Jake Chan, Shyam Dash, Xin Liu, Daniel McDuff, and Tim Althoff

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Jake Chan, Shyam Dash, Xin Liu, Daniel McDuff, and Tim Althoff. 2025. Transforming Wearable Data into Health Insights using Large Language Model Agents.Nature Communications16, Article 548 (2025). https://doi.org/10.1038/s41467-025-67922-y

work page doi:10.1038/s41467-025-67922-y 2025
[50]

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences.Artificial intelligence267 (2019), 1–38

work page 2019
[51]

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long-Form Text Generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Ling...

work page doi:10.18653/v1/2023.emnlp-main.741 2023
[52]

Campbell, Nitesh V

Shayan Mirjafari, Kizito Masaba, Ted Grover, Weichen Wang, Pino Audia, Andrew T. Campbell, Nitesh V. Chawla, Vedant Das Swain, Munmun De Choudhury, Anind K. Dey, and et al. 2019. Differentiating Higher and Lower Job Performers in the Workplace Using Mobile Sensing.Proc. ACM IMWUT(2019). Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LL...

work page 2019
[53]

Mohr, Mi Zhang, and Stephen M

David C. Mohr, Mi Zhang, and Stephen M. Schueller. 2017. Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning.Annual Review of Clinical Psychology13 (2017), 23–47. https://doi.org/10.1146/annurev-clinpsy-032816-044949

work page doi:10.1146/annurev-clinpsy-032816-044949 2017
[54]

Huckins, Courtney Rogers, Meghan L

Subigya Nepal, Wenjun Liu, Arvind Pillai, Weichen Wang, Vlado Vojdanovski, Jeremy F. Huckins, Courtney Rogers, Meghan L. Meyer, and Andrew T. Campbell. 2024. Capturing the College Experience: A Four-Year Mobile Sensing Study of Mental Health, Resilience and Behavior of College Students during the Pandemic.Proceedings of the ACM on Interactive, Mobile, Wea...

work page doi:10.1145/3643501 2024
[55]

Subigya Nepal, Arvind Pillai, William Campbell, Talie Massachi, Michael V Heinz, Ashmita Kunwar, Eunsol Soul Choi, Xuhai Xu, Joanna Kuc, Jeremy F Huckins, et al. 2024. MindScape study: integrating LLM and behavioral sensing for personalized AI-driven journaling experiences.Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies...

work page 2024
[56]

Thomas Plötz. 2021. Applying Machine Learning for Sensor Data Analysis in Interactive Systems: Common Pitfalls of Pragmatic Use and Ways to Avoid Them.Comput. Surveys54, 6, Article 134 (2021), 25 pages. https://doi.org/10.1145/3459666

work page doi:10.1145/3459666 2021
[57]

Mashfiqui Rabbi, Shahid Ali, Tanzeem Choudhury, and Ethan Berke. 2011. Passive and In-Situ Assessment of Mental and Physical Well-Being Using Mobile Sensors. InProceedings of the 13th International Conference on Ubiquitous Computing(Beijing, China)(UbiComp ’11). Association for Computing Machinery, New York, NY, USA, 385–394. https://doi.org/10.1145/20301...

work page doi:10.1145/2030112.2030164 2011
[58]

John Rooksby, Mattias Rost, Alistair Morrison, and Matthew Chalmers. 2014. Personal Tracking As Lived Informatics. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Toronto, Ontario, Canada)(CHI ’14). ACM, New York, NY, USA, 1163–1172. https://doi.org/10.1145/2556288.2557039

work page doi:10.1145/2556288.2557039 2014
[59]

Ognjen Rudovic, Jaeryoung Lee, Miles Dai, Björn Schuller, and Rosalind W Picard. 2018. Personalized machine learning for robot perception of affect and engagement in autism therapy.Science Robotics3, 19 (2018)

work page 2018
[60]

Daniel M Russell, Mark J Stefik, Peter Pirolli, and Stuart K Card. 1993. The cost structure of sensemaking. InProceedings of the INTERACT’93 and CHI’93 conference on Human factors in computing systems. 269–276

work page 1993
[61]

Koustuv Saha, Larry Chan, Kaya De Barbaro, Gregory D Abowd, and Munmun De Choudhury. 2017. Inferring mood instability on social media by leveraging ecological momentary assessments.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies1, 3 (2017), 95

work page 2017
[62]

Koustuv Saha and Munmun De Choudhury. 2017. Modeling stress with social media around incidents of gun violence on college campuses.PACM Human-Computer InteractionCSCW (2017)

work page 2017
[63]

Mattingly, Vedant Das swain, Pranshu Gupta, Gonzalo J

Koustuv Saha, Ted Grover, Stephen M. Mattingly, Vedant Das swain, Pranshu Gupta, Gonzalo J. Martinez, Pablo Robles-Granda, Gloria Mark, Aaron Striegel, and Munmun De Choudhury. 2021. Person-Centered Predictions of Psychological Constructs with Social Media Contextualized by Multimodal Sensing.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubi...

work page doi:10.1145/3448117 2021
[64]

Reddy, Vedant das Swain, Julie M

Koustuv Saha, Manikanta D. Reddy, Vedant das Swain, Julie M. Gregg, Ted Grover, Suwen Lin, Gonzalo J. Martinez, Stephen M. Mattingly, Shayan Mirjafari, Raghu Mulukutla, Kari Nies, Pablo Robles-Granda, Anusha Sirigiri, Dong Whi Yoo, Pino Audia, Andrew T. Campbell, Nitesh V. Chawla, Sidney K. D’Mello, Anind K. Dey, Kaifeng Jiang, Qiang Liu, Gloria Mark, Edw...

work page doi:10.1109/acii.2019.8925479 2019
[65]

2025.The Coding Manual for Qualitative Researchers

Johnny Saldaña. 2025.The Coding Manual for Qualitative Researchers. SAGE Publications Ltd. https://doi.org/10.4135/9781036235611

work page doi:10.4135/9781036235611 2025
[66]

Wong, Yoko Akama, and Ann Light

Robert Soden, Laura Devendorf, Richmond Y. Wong, Yoko Akama, and Ann Light. 2022. Modes of Uncertainty in HCI.Foundations and Trends in Human–Computer Interaction15, 4 (2022), 317–426. https://doi.org/10.1561/1100000085

work page doi:10.1561/1100000085 2022
[67]

Salminen, C

Konstantin R. Strömel, Stanislas Henry, Tim Johansson, Jasmin Niess, and Paweł W. Woźniak. 2024. Narrating Fitness: Leveraging Large Language Models for Reflective Fitness Tracker Data Interpretation. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, N...

work page doi:10.1145/3613904.3642032 2024
[68]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. InAdvances in Neural Information Processing Systems (NeurIPS 2023, Vol. 36). Curran Associates, Inc. https://arxiv.org/abs/2305.04388

work page arXiv 2023
[69]

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in {GPT} Models.Neural Information Processing Systems Datasets; Benchmarks Track(2023)

work page 2023
[70]

Rui Wang, Min S. H. Aung, Saeed Abdullah, Rachel Brian, Andrew T. Campbell, Tanzeem Choudhury, Marta Hauser, John Kane, Michael Merrill, and Emily A. Scherer. 2016. CrossCheck: Toward Passive Sensing and Detection of Mental Health Changes in People with Schizophrenia. InProceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous...

work page doi:10.1145/2971648.2971740 2016
[71]

Campbell

Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. 2014. StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students Using Smartphones. InProceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computin...

work page doi:10.1145/2632048.2632054 2014
[72]

Huckins, William M

Rui Wang, Weichen Wang, Alex da Silva, Jeremy F. Huckins, William M. Kelley, Todd F. Heatherton, and Andrew T. Campbell. 2018. Tracking Depression Dynamics in College Students Using Mobile Phone and Wearable Sensing.Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.2, 1, Article 43 (2018). https://doi.org/10.1145/3191775

work page doi:10.1145/3191775 2018
[73]

Weichen Wang, Gabriella M Harari, Rui Wang, Sandrine R Müller, Shayan Mirjafari, Kizito Masaba, and Andrew T Campbell. 2018. Sensing Behavioral Change over Time: Using Within-Person Variability Features from Mobile Sensing to Predict Personality Traits. IMWUT(2018)

work page 2018
[74]

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, et al. 2024. Long-form factuality in large language models.Advances in Neural Information Processing Systems37 (2024), 80756–80827

work page 2024
[75]

Dutcher, Yasaman S

Xuhai Xu, Prerna Chikersal, Janine M. Dutcher, Yasaman S. Sefidgar, Woosuk Seo, Michael J. Tumminia, Daniella K. Villalba, Sheldon Cohen, Kasey G. Creswell, J. David Creswell, Jennifer Mankoff, Afsaneh Doryab, and Anind K. Dey. 2021. Leveraging Collaborative- Filtering for Personalized Behavior Modeling: A Case Study of Depression Detection Among College ...

work page doi:10.1145/3448107 2021
[76]

Kuehn, Jeremy F

Xuhai Xu, Xin Liu, Han Zhang, Weichen Wang, Subigya Nepal, Yasaman Sefidgar, Woosuk Seo, Kevin S. Kuehn, Jeremy F. Huckins, Margaret E. Morris, Paula S. Nurius, Eve A. Riskin, Shwetak Patel, Tim Althoff, Andrew Campbell, Anind K. Dey, and Jennifer Mankoff

work page
[77]

GLOBEM: Cross-Dataset Generalization of Longitudinal Human Behavior Modeling.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies6, 4, Article 190 (Dec. 2023). https://doi.org/10.1145/3569485

work page doi:10.1145/3569485 2023
[78]

Bufang Yang, Siyang Jiang, Lilin Xu, Kaiwei Liu, Hai Li, Guoliang Xing, Hao Chen, Xiaofan Jiang, and Zheng Yan. 2024. DrHouse: An LLM-Empowered Diagnostic Reasoning System through Harnessing Outcomes from Sensor Data and Expert Knowledge. InProceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Vol. 8. Association for Computi...

work page doi:10.1145/3699765 2024
[79]

Xiaofan Yu, Lanxiang Hu, Benjamin Reichman, Dylan Chu, Rushil Chandrupatla, Xiyuan Zhang, Larry Heck, and Tajana S Rosing. 2025. Sensorchat: Answering qualitative and quantitative questions during long-term multimodal sensor interactions.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies9, 3 (2025), 1–35

work page 2025
[80]

Yiran Zhao, Jinghan Zhang, I Chern, Siyang Gao, Pengfei Liu, Junxian He, et al. 2023. Felm: Benchmarking factuality evaluation of large language models.Advances in Neural Information Processing Systems36 (2023), 44502–44523

work page 2023

Showing first 80 references.