Evaluating large language models on time series feature understanding: A comprehensive taxonomy and benchmark. arXiv preprint arXiv:2404.16563, 2024

Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, Svitlana Vyetrenko · 2024 · arXiv 2404.16563

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Peak-Detector uses instruction-tuned LLMs and a condensed peak-representation of time-series data to achieve robust cross-modal peak detection with self-generated explanations across ECG, PPG, BCG, and BSG signals.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

cs.CL · 2025-11-02 · unverdicted · novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.

TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

cs.AI · 2025-10-08 · unverdicted · novelty 7.0

TS-Agent is an agentic framework that uses LLMs only for evidence-based reasoning while delegating extraction to raw time series tools, matching or exceeding baselines on four benchmarks with largest gains on reasoning tasks.

GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

GeoMind applies an agentic workflow with tool-augmented modules and process supervision to outperform static models on lithology classification from well logs while producing traceable decisions.

SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management

cs.AI · 2026-02-07 · unverdicted · novelty 6.0

SupChain-Bench reveals substantial gaps in LLM reliability for long-horizon supply chain orchestration, while the proposed SupChain-ReAct framework improves tool-calling by autonomously synthesizing procedures.

BEDTime: A Unified Benchmark for Automatically Describing Time Series

cs.CL · 2025-09-05 · conditional · novelty 6.0

BEDTime benchmark tests 17 models on describing time series structure and finds vision-language models outperform dedicated time-series-language models and language-only approaches, with all models fragile to robustness tests.

citing papers explorer

Showing 6 of 6 citing papers.

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign · cs.LG · 2026-05-15 · unverdicted · none · ref 26
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence · cs.CL · 2025-11-02 · unverdicted · none · ref 24
TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering · cs.AI · 2025-10-08 · unverdicted · none · ref 2
GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation · cs.AI · 2026-04-23 · unverdicted · none · ref 12
SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management · cs.AI · 2026-02-07 · unverdicted · none · ref 1
BEDTime: A Unified Benchmark for Automatically Describing Time Series · cs.CL · 2025-09-05 · conditional · none · ref 24
