arxiv: 2512.23025 · v2 · submitted 2025-12-28 · 💻 cs.CL · cs.AI

LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Pith reviewed 2026-05-16 18:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: multimodal sensing, language models, mental health narratives, ecological momentary assessment, sensor-text alignment, depression and anxiety, patch-level encoder

The pith

LENS aligns raw sensor streams with language models to generate mental health narratives from over 100,000 EMA-derived pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LENS as a way to turn numerical time-series sensor data into natural-language descriptions of depression and anxiety symptoms. It first converts Ecological Momentary Assessment (EMA) responses into text to build a large paired dataset, then trains a patch-level encoder that maps raw sensor signals directly into an LLM's representation space. With this setup the model produces narratives that score higher than baselines on standard NLP metrics and on symptom-severity accuracy measures. A user study with 13 mental-health professionals rated the outputs as comprehensive and clinically meaningful, suggesting a route for LLMs to interpret behavioral signals directly.

Core claim

LENS constructs a dataset of more than 100,000 sensor-text QA pairs by transforming EMA responses on depression and anxiety symptoms into natural-language descriptions, then trains a patch-level encoder to project raw multimodal sensor signals directly into an LLM's representation space, enabling the generation of clinically grounded narratives that outperform baselines on NLP metrics and symptom-severity accuracy while receiving positive ratings from mental health professionals for being comprehensive and meaningful.

What carries the argument

The patch-level encoder, which projects raw sensor signals directly into an LLM's representation space to support native integration of time-series data with language models.
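As a rough illustration of what patch-level projection means, the sketch below splits one sensor stream into fixed-length patches and maps each patch to a vector with a single linear layer, so that each patch becomes one token-like embedding. The patch length, stride, embedding width, random weights, synthetic heart-rate data, and the helper name `patchify` are all assumptions made for illustration; this summary does not specify the paper's actual encoder.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Split a 1-D series into patches of length patch_len taken every stride steps."""
    n_patches = 1 + (len(series) - patch_len) // stride
    # Row i holds the indices stride*i .. stride*i + patch_len - 1.
    idx = np.arange(patch_len)[None, :] + stride * np.arange(n_patches)[:, None]
    return series[idx]  # shape: (n_patches, patch_len)

rng = np.random.default_rng(0)

# Hypothetical input: 240 heart-rate readings (synthetic, not real data).
heart_rate = rng.normal(70.0, 5.0, size=240)

# Standardize the stream so the projection sees comparable scales.
normalized = (heart_rate - heart_rate.mean()) / (heart_rate.std() + 1e-8)

# Non-overlapping patches of 16 samples each (assumed values).
patches = patchify(normalized, patch_len=16, stride=16)

# Hypothetical linear projection into an embedding space of width 64.
# A real LLM's hidden width would be much larger, and the weights would be learned.
embed_dim = 64
W = rng.normal(0.0, 0.02, size=(16, embed_dim))
b = np.zeros(embed_dim)
tokens = patches @ W + b  # one embedding per patch

print(patches.shape, tokens.shape)
```

Under these assumptions the 240-sample stream becomes 15 embeddings, which is the general reason patching shortens the sequence the language model has to attend over.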

If this is right

  • The model can process long-duration sensor streams that current LLMs cannot handle natively.
  • Generated narratives achieve higher scores on both general NLP metrics and specific symptom-severity accuracy measures than strong baselines.
  • Mental health professionals rate the outputs as comprehensive and clinically meaningful in direct user testing.
  • The approach supplies a scalable route for LLMs to reason over raw behavioral signals and inform clinical decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment technique could extend to other time-series health signals such as sleep patterns or physical activity for broader wellness applications.
  • Real-time deployment on wearable devices might enable continuous narrative summaries that flag symptom changes without requiring manual EMA input.
  • Privacy safeguards would need explicit design when scaling the sensor-to-text mapping to personal longitudinal data streams.

Load-bearing premise

Converting EMA responses into natural-language descriptions produces training pairs accurate enough to teach the encoder clinically valid links between sensor streams and symptom narratives.
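To make this premise concrete, the template step described in the Figure 2 caption (EMA responses mapped to templates populated with frequency phrases) might look roughly like the sketch below. The response scale, the frequency phrases, the item keys, and the template wording are all hypothetical; the summary does not give the paper's actual templates, and the subsequent LLM rewriting step is omitted.

```python
# Hypothetical mapping from an ordinal EMA score to a frequency phrase.
FREQUENCY_PHRASES = {
    0: "not at all",
    1: "a little",
    2: "moderately",
    3: "a great deal",
}

# Hypothetical item-level templates keyed by symptom name.
TEMPLATES = {
    "depressed_mood": "In the past 4 hours, the user felt down or hopeless {phrase}.",
    "fatigue": "In the past 4 hours, the user felt tired or low in energy {phrase}.",
}

def item_narrative(item: str, score: int) -> str:
    """Fill one symptom template with the phrase for the given score."""
    return TEMPLATES[item].format(phrase=FREQUENCY_PHRASES[score])

def summary_narrative(responses: dict[str, int]) -> str:
    """Concatenate item-level narratives into a summary-level narrative."""
    return " ".join(item_narrative(item, score) for item, score in responses.items())

print(summary_narrative({"depressed_mood": 2, "fatigue": 0}))
```

Because every training target descends from a mapping like this one, any mismatch between the chosen phrases and how participants understood the scale would propagate into the encoder, which is why the premise is load-bearing.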

What would settle it

A controlled comparison in which experts rate LENS narratives lower on clinical accuracy than human-written summaries of the same sensor data from the same participants.

Figures

Figures reproduced from arXiv: 2512.23025 by Amanda C Collins, Andrew Campbell, Arvind Pillai, Daniel M Mackin, Michael V Heinz, Nicholas C Jacobson, Subigya Nepal, Tess Z Griffin, Wenxuan Xu.

Figure 1
Figure 1: Illustration of the LENS idea. Mobile and wearable sensing signals, combined with a question, are passed to LENS, which produces a natural-language description. Clinicians can then view an interpretable snapshot of the user’s mental state instead of raw sensor streams. …high burden on clinicians, dependence on retrospective self-reports, and reduced ecological validity because they are administered in con…
Figure 2
Figure 2: Overview of the narrative synthesis pipeline. EMA questions and responses are mapped to templates populated with corresponding frequency phrases. Subsequently, an LLM refines the text at both the item and summary levels (concatenated narratives) to enhance fluency and lexical diversity. …high-quality QA datasets from self-reported EMA responses to serve as ground truth for model training and evaluation. Be…
Figure 3
Figure 3: LENS dataset construction pipeline. EMA responses are first converted into item-level and summary template narratives, which are then rewritten into fluent, enhanced narratives using Qwen2.5-14B. A multi-agent LLM-as-a-judge system conducts automatic quality control, routing failed cases back for regeneration until they satisfy all criteria. The final accepted narratives are then paired with paraphrased qu…
Figure 4
Figure 4: LENS Architecture. The model accepts multimodal inputs consisting of description text (e.g., "Heart rate..."), instruction text (e.g., "Summarize the user’s current mental well-being?") and raw time-series sensor streams (e.g., heart rate). The text is processed by a frozen LLM text embedder ($f^{\mathrm{emb}}_{\phi}$), while time-series data is encoded by a trainable patch-based encoder ($f_{\theta}$). The resulting embeddings are …
Figure 5
Figure 5: User Study: Comprehensiveness. Expert ratings compare how many symptoms each model’s narrative successfully covers. From a clinical perspective, LENS achieves the highest alignment with ground-truth diagnoses. In summary-level generation, it attains a Symptom Coverage score of 0.801 and a Presence Alignment score of 0.601. This indicates that LENS is more capable of capturing the full spectrum of patient…
Figure 7
Figure 7: User Study: Clinical Utility & Language Cohesion. Comparing expert ratings indicating the usefulness of the narratives and the cohesiveness of the language. …strates superior computational efficiency, requiring approximately 930 tokens per sample, a reduction of roughly 94% compared to verbose text serialization and 4× relative to vision-based models (…
Figure 8
Figure 8: Computational Efficiency Analysis. Comparison of token consumption across three modalities: Text (Qwen-2.5), Vision-Language (TS-Image/Qwen2.5-VL), and Time-Series (LENS). (a) Mean prefill tokens per prompt. (b) Total dataset token consumption (in millions) for Narrative and QA datasets. …trained on 50% data achieved the highest coverage (0.823), slightly outperforming the full model (0.801) and the 10% m…
Figure 9
Figure 9: Qualitative examples for narrative QA (top) and single-question QA (bottom). Each example shows the full prompt context, the form of sensor input provided to the model, and the generated output. LENS directly consumes raw multivariate time-series via a patch-based time-series encoder. VLM-based baselines receive the same signals rendered as multi-panel plots, while text-based baselines process serialized n…
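The efficiency figures quoted in the Figure 7 excerpt (approximately 930 tokens per sample, roughly 94% fewer than text serialization, and 4× fewer than vision-based models) can be turned into implied per-sample token counts for the two baselines. The short calculation below is only a back-of-the-envelope check that takes those rounded figures at face value and assumes the 4× is a simple ratio; the paper's exact counts are not given here, and the variable names are illustrative.

```python
# Figures as quoted in the excerpt; all are described as approximate.
lens_tokens = 930       # tokens per sample for LENS
text_reduction = 0.94   # fractional reduction versus text serialization
vision_ratio = 4        # assumed to mean vision baselines use 4x the tokens of LENS

# If LENS uses (1 - 0.94) of the text-serialization budget, invert that fraction.
implied_text_tokens = lens_tokens / (1 - text_reduction)

# If vision baselines use 4x the LENS budget, multiply.
implied_vision_tokens = lens_tokens * vision_ratio

print(round(implied_text_tokens), implied_vision_tokens)
```

Under these assumptions the text-serialization baseline would sit at about 15,500 tokens per sample and the vision-based baseline at about 3,720.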
Original abstract

Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LENS, a framework that constructs a large-scale paired dataset by transforming EMA responses on depression and anxiety symptoms from 258 participants into natural-language descriptions, yielding over 100,000 sensor-text QA pairs. It then trains a patch-level encoder to project raw multimodal sensor streams directly into an LLM's representation space. The work reports that LENS outperforms strong baselines on standard NLP metrics and task-specific symptom-severity accuracy measures, and a user study with 13 mental-health professionals finds the generated narratives to be comprehensive and clinically meaningful.

Significance. If the EMA-to-narrative transformation step produces clinically valid and unbiased training pairs, LENS would offer a practical route for LLMs to ingest and reason over long-duration raw sensor data without intermediate feature engineering. The patch encoder and large paired dataset address a genuine scarcity in multimodal health sensing, and the clinician ratings provide an initial signal of downstream utility for narrative-based clinical interfaces. These elements could support more scalable, interpretable mental-health sensing applications if the core alignment is shown to be grounded.

major comments (2)
  1. [Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.
  2. [Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.
minor comments (1)
  1. [User Study] The user study is described only at a high level; adding the exact rating scales, inter-rater reliability, and any statistical comparison to baselines would strengthen the qualitative evidence without altering the core claims.
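As one example of the inter-rater reliability statistic the minor comment asks for, the sketch below computes Cohen's kappa for two raters. This is an illustration only: the ratings are made up, the function name is hypothetical, and with 13 raters a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would likely be a better fit than this two-rater case.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up binary ratings (1 = narrative judged clinically useful).
ratings_a = [1, 1, 0, 0]
ratings_b = [1, 0, 0, 0]
print(cohen_kappa(ratings_a, ratings_b))
```

Reporting a statistic of this kind alongside the raw ratings would show whether the experts' judgments agree beyond what chance alone would produce.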

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and quantitative results as outlined.

Point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.

    Authors: We agree that the Dataset Construction section requires more explicit documentation to substantiate the training pairs. In the revised manuscript, we will expand this section with the specific templates for converting EMA scores to natural language, the prompting strategy and LLM used for generation, post-processing rules applied, and validation procedures including any clinician review or inter-rater agreement metrics. These additions will directly address concerns about fidelity, bias, and support for the alignment claims. revision: yes

  2. Referee: [Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.

    Authors: We concur that the Results section and abstract would be strengthened by including the specific quantitative details. In the revision, we will report the exact NLP metric values and symptom-severity accuracy scores, name the baselines compared against, include effect sizes, confidence intervals, and results of statistical tests. This will enable proper assessment of robustness and clinical relevance. revision: yes
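One common way to supply the confidence intervals mentioned in this response is a paired bootstrap over per-sample metric scores. The sketch below is a generic illustration, not the authors' analysis: the scores are synthetic, and the function name, resample count, and seed are assumptions.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sample difference (a minus b)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1 or a.size == 0:
        raise ValueError("Scores must be non-empty 1-D arrays of equal length.")
    diffs = a - b  # paired: both systems are scored on the same samples
    rng = np.random.default_rng(seed)
    # Resample sample indices with replacement, keeping each pair together.
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    resampled_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(low), float(high)

# Synthetic per-sample metric scores for two hypothetical systems.
rng = np.random.default_rng(1)
system_a = rng.uniform(0.4, 0.9, size=200)
system_b = system_a - rng.normal(0.05, 0.02, size=200)

mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
print(mean_diff, ci_low, ci_high)
```

An interval that excludes zero would indicate that the difference between the two systems is unlikely to be a resampling artifact, although it says nothing by itself about clinical relevance.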

Circularity Check

0 steps flagged

No significant circularity: empirical dataset construction and encoder training are self-contained

full rationale

The paper's core pipeline constructs >100k sensor-text pairs by transforming EMA symptom scores into natural-language descriptions, then trains a patch-level encoder to project raw multimodal streams into LLM space. All reported results (NLP metrics, symptom-severity accuracy, clinician ratings) follow from standard training and evaluation on these pairs plus a separate user study. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on the external validity of the EMA-to-text transformation rather than any internal reduction to the paper's own inputs. This is the expected non-circular outcome for an applied systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the implicit assumption that EMA responses can be reliably turned into clinically accurate natural-language descriptions; no numerical constants or new physical entities are introduced.

pith-pipeline@v0.9.0 · 5529 in / 1220 out tokens · 42769 ms · 2026-05-16T18:43:25.261776+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

    cs.HC 2026-05 unverdicted novelty 6.0

    PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...
