HEARTS: Benchmarking LLM Reasoning on Health Time Series

Ahmed Metwally; Daniel McDuff; Mihir Joshi; Shuhan Xiao; Sirui Li; Wei Wang; Yuzhe Yang

arxiv: 2603.06638 · v3 · pith:NB5P4HCQnew · submitted 2026-02-25 · 💻 cs.LG · cs.AI

keywords: reasoning, health, llms, series, time, hearts, temporal, benchmark

The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 16 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring
cs.LG 2026-05 unverdicted novelty 7.0

GlucoFM decomposes CGM traces into dual state-event streams, pretrains on 109k hours of unlabeled data, and reports superior subject-disjoint performance on seven clinical tasks across four cohorts.

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health
cs.LG 2026-05 unverdicted novelty 6.0

TimeSRL uses semantic abstractions from time-series data optimized via reinforcement learning to achieve better cross-dataset generalization than standard ML or LLM baselines in mental health prediction.