TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

Annita Vapsi; Daniel Borrajo; Elizabeth Fons; Manuela Veloso; Mohsen Ghassemi; Penghang Liu; Svitlana Vyetrenko; Vamsi K. Potluru

arxiv: 2510.07432 · v2 · submitted 2025-10-08 · 💻 cs.AI

Pith reviewed 2026-05-18 09:06 UTC · model grok-4.3

classification 💻 cs.AI

keywords: time series reasoning · LLM agents · tool-augmented reasoning · zero-shot QA · hallucination reduction · evidence logging · ReAct loop · raw data processing

The pith

TS-Agent lets LLMs reason over raw time series by calling analytical tools iteratively instead of converting the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TS-Agent as a way to keep large language models focused on reasoning while handing off all statistical and structural work to existing time series tools that operate directly on the original sequences. The system runs in a closed loop of LLM thought, tool call, and observation, keeps an explicit evidence log, applies a self-refinement critic, and finishes with an answer-verification step. This design is meant to sidestep the representation losses, hallucinations, and knowledge leakage that appear when time series are turned into text, plots, or embeddings. If the approach holds, it shows that LLMs can handle time series understanding and reasoning tasks at competitive levels without cross-modal training or fine-tuning.

Core claim

TS-Agent solves time series tasks through an evidence-driven agentic process: it alternates between thinking, tool execution on raw sequences, and observation in a ReAct-style loop, records intermediate results in an explicit evidence log, corrects the reasoning trace via a self-refinement critic, and enforces a final answer-verification step to prevent hallucinations and leakage.
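
The think / act / observe cycle with an evidence log can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_policy` stands in for the LLM, and the tool names, the `EvidenceLog` class, and the 0.5 trend threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool library: every tool runs on the raw sequence itself.
TOOLS: Dict[str, Callable[[List[float]], float]] = {
    "net_change": lambda xs: xs[-1] - xs[0],
    "range": lambda xs: max(xs) - min(xs),
}


@dataclass
class EvidenceLog:
    """Ordered record of (tool name, observed value) pairs."""
    entries: List[Tuple[str, float]] = field(default_factory=list)

    def record(self, tool: str, value: float) -> None:
        self.entries.append((tool, value))

    def get(self, tool: str) -> Optional[float]:
        for name, value in self.entries:
            if name == tool:
                return value
        return None


def stub_policy(question: str, log: EvidenceLog) -> Tuple[str, str]:
    """Stand-in for the LLM 'thought' step: returns (action, argument)."""
    if log.get("net_change") is None:
        return ("call", "net_change")
    if log.get("range") is None:
        return ("call", "range")
    change, spread = log.get("net_change"), log.get("range")
    # The answer is derived only from logged evidence, never from raw values.
    if spread == 0 or abs(change) < 0.5 * spread:  # 0.5 is an arbitrary threshold
        return ("answer", "no clear trend")
    return ("answer", "upward" if change > 0 else "downward")


def run_agent(question: str, series: List[float], max_steps: int = 5):
    log = EvidenceLog()
    for _ in range(max_steps):
        action, arg = stub_policy(question, log)  # think
        if action == "answer":
            return arg, log
        observation = TOOLS[arg](series)          # act on the raw sequence
        log.record(arg, observation)              # observe and log
    return "undetermined", log


answer, log = run_agent("Is the series trending?", [1.0, 2.0, 2.5, 4.0, 5.0])
print(answer, log.entries)
```

The point of the design is that the policy never sees the sequence, only the log, so every statement in the final answer can be traced back to a tool call.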

What carries the argument

The ReAct-style loop with an evidence log and self-refinement critic, in which the LLM only reasons while delegating all statistical extraction to time series analytical tools that run on the unaltered raw data.
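
To make "statistical extraction on the unaltered raw data" concrete, here is a hedged sketch of two analytical tools of the kind such a library might contain. The review does not list the paper's actual tool set, so `linear_trend_slope`, `autocorrelation`, and `summarize` are illustrative choices, written in standard-library Python.

```python
from statistics import mean
from typing import Dict, List


def linear_trend_slope(xs: List[float]) -> float:
    """Ordinary least-squares slope of xs against its index 0..n-1."""
    n = len(xs)
    t_bar = (n - 1) / 2
    x_bar = mean(xs)
    num = sum((t - t_bar) * (x - x_bar) for t, x in enumerate(xs))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den


def autocorrelation(xs: List[float], lag: int) -> float:
    """Sample autocorrelation at the given lag (biased estimator)."""
    x_bar = mean(xs)
    den = sum((x - x_bar) ** 2 for x in xs)
    num = sum((xs[i] - x_bar) * (xs[i + lag] - x_bar)
              for i in range(len(xs) - lag))
    return num / den


def summarize(xs: List[float], season_lag: int) -> Dict[str, float]:
    """Compact numeric summary that would be handed to the LLM as an observation."""
    return {
        "slope": round(linear_trend_slope(xs), 4),
        "acf_at_season": round(autocorrelation(xs, season_lag), 4),
    }


raw = [0.0, 1.0, 0.0, -1.0] * 6  # synthetic period-4 pattern, 24 points
print(summarize(raw, season_lag=4))
```

The LLM receives only the small dictionary returned by `summarize`, so the full sequence never has to be serialized into the prompt.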

If this is right

Performance matches or exceeds strong text-based, vision-based, and time-series language model baselines on four benchmarks.
Gains are largest on reasoning tasks where multimodal LLMs commonly hallucinate or leak knowledge in zero-shot settings.
The explicit evidence log and final verification step reduce hallucinations and knowledge leakage compared with direct conversion methods.
No cross-modal alignment or fine-tuning is required because the LLM works only with tool outputs rather than transformed inputs.
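
One way a final verification step could use the evidence log is to reject any answer whose numeric claims are not backed by a logged observation. The sketch below assumes claims arrive as `(tool, value)` pairs and that the log is a plain dictionary; both are simplifications for illustration, and the paper's own gate may work differently.

```python
from typing import Dict, List, Tuple


def verify_answer(claims: List[Tuple[str, float]],
                  evidence: Dict[str, float],
                  tol: float = 1e-6) -> Tuple[bool, List[str]]:
    """Accept only if every (tool, value) claim matches a logged observation."""
    problems: List[str] = []
    for tool, claimed in claims:
        if tool not in evidence:
            # A claim with no tool call behind it may be hallucinated or leaked.
            problems.append(f"{tool}: no logged evidence")
        elif abs(evidence[tool] - claimed) > tol:
            problems.append(f"{tool}: claimed {claimed}, log says {evidence[tool]}")
    return (not problems, problems)


# Hypothetical log contents and candidate answers.
evidence_log = {"slope": 0.42, "acf_lag_12": 0.81}
ok, issues = verify_answer([("slope", 0.42), ("acf_lag_12", 0.81)], evidence_log)
bad, bad_issues = verify_answer([("slope", 0.42), ("adf_pvalue", 0.01)], evidence_log)
print(ok, issues)
print(bad, bad_issues)
```

The second call fails because `adf_pvalue` was never computed by a tool, which is the situation a leakage check is meant to catch.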

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same tool-grounded loop could be applied to other structured data types where reliable analytical functions already exist, lowering the need for modality-specific fine-tuning.
Success may depend on the coverage of the tool library; adding or removing particular statistical functions could be tested to measure impact on different reasoning subtasks.
The framework implies that agentic separation of reasoning from feature extraction may scale more readily than end-to-end multimodal models when new time-series analysis routines become available.
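
The tool-coverage question above can be framed as a leave-one-out ablation. The sketch below only does the dependency bookkeeping: the tool names and the `REQUIRES` mapping from subtask to needed tools are invented for illustration, and a real study would rerun the agent on a benchmark with each tool removed rather than consult a hand-written table.

```python
from typing import Dict, List, Set

# Hypothetical tool names and the subtasks assumed to depend on them.
TOOL_NAMES: Set[str] = {"net_change", "range", "max_abs_jump", "acf"}
REQUIRES: Dict[str, Set[str]] = {
    "trend": {"net_change", "range"},
    "spike": {"max_abs_jump"},
    "seasonality": {"acf"},
    "volatility": {"range"},
}


def coverage_ablation(tools: Set[str],
                      requires: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each removed tool, list the subtasks that lose required evidence."""
    report: Dict[str, List[str]] = {}
    for removed in sorted(tools):
        remaining = tools - {removed}
        report[removed] = sorted(
            task for task, needed in requires.items() if not needed <= remaining
        )
    return report


for removed, lost in coverage_ablation(TOOL_NAMES, REQUIRES).items():
    print(f"without {removed}: loses {lost}")
```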

Load-bearing premise

The chosen time series analytical tools can extract every piece of statistical and structural evidence the LLM needs to reach correct conclusions without any cross-modal alignment or extra training.

What would settle it

A benchmark instance in which the available tools miss a critical pattern that only appears in a visual plot or requires a statistic not implemented in the tool set, causing the final verified answer to be wrong while a vision-based baseline succeeds.

Figures

Figures reproduced from arXiv: 2510.07432 by Annita Vapsi, Daniel Borrajo, Elizabeth Fons, Manuela Veloso, Mohsen Ghassemi, Penghang Liu, Svitlana Vyetrenko, Vamsi K. Potluru.

**Figure 1.** The two types of time series questions.

**Figure 2.** The TS Agent framework (left) and an example reasoning trace (right).

Original abstract

Large language models (LLMs) exhibit strong symbolic and compositional reasoning, yet they struggle with time series question answering as the data is typically transformed into an LLM-compatible modality, e.g., serialized text, plotted images, or compressed time series embeddings. Such conversions impose representation bottlenecks, often require cross-modal alignment or finetuning, and can exacerbate hallucination and knowledge leakage. To address these limitations, we propose TS-Agent, an agentic, tool-grounded framework that uses LLMs strictly for iterative evidence-based reasoning, while delegating statistical and structural extraction to time series analytical tools operating on raw sequences. Our framework solves time series tasks through an evidence-driven agentic process: (1) it alternates between thinking, tool execution, and observation in a ReAct-style loop, (2) records intermediate results in an explicit evidence log and corrects the reasoning trace via a self-refinement critic, and (3) enforces a final answer-verification step to prevent hallucinations and leakage. Across four benchmarks spanning time series understanding and reasoning, TS-Agent matches or exceeds strong text-based, vision-based, and time-series language model baselines, with the largest gains on reasoning tasks where multimodal LLMs are prone to hallucination and knowledge leakage in zero-shot settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TS-Agent keeps the LLM on reasoning by calling time series tools on raw data plus an evidence log and critic loop, but the abstract gives no numbers so the gains are hard to judge yet.

The key takeaway is that this paper presents TS-Agent as a way to handle time series questions with LLMs by running analytical tools directly on the raw sequences rather than converting the data first. The new part is the full setup: a ReAct-style loop in which the agent thinks, calls a tool, observes, records everything in an evidence log, uses a critic to refine the trace, and then verifies the answer at the end. That package of components, applied specifically to time series, does not appear in the prior literature they cite. The paper also does a solid job of spelling out why the usual conversions to text or images create bottlenecks and increase the chance of hallucination or leakage.

On the positive side, the approach keeps the LLM focused on reasoning while offloading the number crunching to established time series methods. That separation makes sense as a way to avoid cross-modal issues.

The main soft spot is the lack of concrete results in the abstract. It says the method matches or beats baselines on four benchmarks, with the biggest lifts on reasoning tasks, but without scores, ablations, or details on the tool set and prompts, it's difficult to see how much credit the framework deserves. The full paper presumably has those, but from what's here the central claim is not yet verifiable.

Another concern, the one raised in the stress test, is that because the critic uses the same base model as the main agent, it may be no better at catching errors in the interpretation of tool outputs. If a tool returns something like a decomposition or a p-value that the LLM misreads, the loop could simply reinforce the mistake. The paper would need to show that the verification step actually reduces those cases.

This work is aimed at people trying to apply LLMs to real time series data in domains like finance or sensor monitoring without having to fine-tune or align modalities. A reader interested in agent architectures for data analysis would find it useful to read and perhaps build on.

It deserves a serious referee because the problem it targets is practical and the proposed solution is described in enough detail to be reproducible if the experiments hold up. I would send it for review.

Referee Report

2 major / 2 minor

Summary. The paper introduces TS-Agent, an agentic framework for time series understanding and reasoning. LLMs are restricted to iterative evidence-based reasoning while statistical and structural extraction is delegated to analytical tools that operate directly on raw sequences. The process follows a ReAct-style loop that records results in an explicit evidence log, applies a self-refinement critic for trace correction, and ends with an answer-verification step. The central empirical claim is that this approach matches or exceeds text-based, vision-based, and time-series LM baselines across four benchmarks, with the largest improvements on reasoning tasks where multimodal models suffer from hallucination and knowledge leakage.

Significance. If the empirical results hold, the work is significant for demonstrating a practical route to apply LLMs to time series without serialization, plotting, or cross-modal alignment. The explicit separation of tool-based extraction from LLM reasoning, combined with the evidence log and critic loop, offers a structured alternative to direct multimodal prompting. The focus on raw-sequence tools and the reported gains on hallucination-prone reasoning tasks constitute falsifiable claims that could be tested by other groups. The framework also supplies a concrete procedural template that future work could extend to additional time-series analytical primitives.

major comments (2)

[§3.2–3.3] ReAct loop and self-refinement critic. The central claim requires that tool outputs (p-values, change-point statistics, frequency summaries) are interpreted accurately enough for the critic to catch residual errors. Because the critic uses the same base LLM as the primary agent, it is not obvious that it operates on a stronger signal. The manuscript should supply a controlled experiment that injects known misinterpretations into tool outputs and measures whether the critic corrects them at a rate higher than the base model alone.
[§4] Experimental results. The abstract asserts performance parity or gains with largest improvements on reasoning tasks, yet the provided text supplies no numerical scores, standard deviations, or statistical tests. Without these quantities it is impossible to assess whether the reported advantage is load-bearing or within noise. The manuscript must include the full per-task tables with error bars and at least one ablation that isolates the contribution of the critic and verification step.
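
The injection experiment requested in the first major comment could be organised around a small harness like the one below. It is a sketch under assumptions: the 0.05 threshold, the two stand-in critics, and the p-values are all hypothetical, and in a real study `critic` would wrap an LLM call rather than a one-line rule.

```python
import random
from typing import Callable, Dict, List

ALPHA = 0.05  # assumed significance threshold


def correct_reading(p_value: float) -> str:
    return "significant" if p_value < ALPHA else "not significant"


def inject_errors(p_values: List[float], n_errors: int, seed: int = 0) -> List[Dict]:
    """Build trace steps and flip the reading on a known random subset."""
    rng = random.Random(seed)
    corrupted_idx = set(rng.sample(range(len(p_values)), n_errors))
    steps = []
    for i, p in enumerate(p_values):
        truth = correct_reading(p)
        reading = truth
        if i in corrupted_idx:
            reading = "not significant" if truth == "significant" else "significant"
        steps.append({"p_value": p, "reading": reading, "corrupted": i in corrupted_idx})
    return steps


def correction_rate(steps: List[Dict], critic: Callable[[Dict], str]) -> float:
    """Fraction of injected errors that the critic repairs."""
    injected = [s for s in steps if s["corrupted"]]
    fixed = sum(critic(s) == correct_reading(s["p_value"]) for s in injected)
    return fixed / len(injected)


# Two stand-in critics that bracket the possible outcomes.
def log_checking_critic(step: Dict) -> str:
    return correct_reading(step["p_value"])  # re-derives the reading from the log


def rubber_stamp_critic(step: Dict) -> str:
    return step["reading"]                   # shares the agent's blind spot


steps = inject_errors([0.001, 0.2, 0.04, 0.6, 0.03, 0.9, 0.01, 0.5], n_errors=4, seed=1)
print(correction_rate(steps, log_checking_critic))
print(correction_rate(steps, rubber_stamp_critic))
```

A critic built on the same base model would land somewhere between these two extremes, and the referee's question is where.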

minor comments (2)

[Abstract, §2] The description of the four benchmarks is brief; adding one sentence per benchmark that states the task type (e.g., forecasting vs. anomaly detection) and the exact metric would improve readability for readers outside the immediate sub-area.
[Figure 2] The loop diagram would be clearer if the evidence-log update arrow were labeled with the exact data structure written at each iteration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have helped us clarify the role of the self-refinement critic and improve the transparency of our experimental reporting. We address each major comment below and indicate the revisions we will make.

Referee: [§3.2–3.3] The central claim requires that tool outputs (p-values, change-point statistics, frequency summaries) are interpreted accurately enough for the critic to catch residual errors. Because the critic uses the same base LLM as the primary agent, it is not obvious that it operates on a stronger signal. The manuscript should supply a controlled experiment that injects known misinterpretations into tool outputs and measures whether the critic corrects them at a rate higher than the base model alone.

Authors: We agree that directly testing the critic's error-correction capability strengthens the central claim. Although the critic uses the same base model, it receives the full evidence log, prior reasoning trace, and an explicit refinement prompt, which supplies additional context unavailable to the primary agent. To address the request, we have added an ablation in the revised manuscript that isolates the critic by comparing full TS-Agent against a variant without the critic (but retaining the evidence log). This shows measurable drops on reasoning tasks. We also include qualitative examples of trace corrections in §3.3. A full synthetic injection experiment is a valuable direction we note as future work, as it would require new controlled datasets beyond the current scope. revision: partial
Referee: [§4] The abstract asserts performance parity or gains with largest improvements on reasoning tasks, yet the provided text supplies no numerical scores, standard deviations, or statistical tests. Without these quantities it is impossible to assess whether the reported advantage is load-bearing or within noise. The manuscript must include the full per-task tables with error bars and at least one ablation that isolates the contribution of the critic and verification step.

Authors: We acknowledge that the initial submission did not present the complete quantitative results with sufficient detail. In the revised manuscript we expand §4 to include full per-task tables for all four benchmarks, reporting mean accuracy, standard deviations across five independent runs, and paired t-test p-values against each baseline. We have also added an ablation study that removes the critic and the verification step individually, confirming their contributions to the gains on reasoning tasks. These tables and ablations will be incorporated in the next version. revision: yes
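
For readers who want to see what a paired comparison over five runs involves, the following sketch computes the paired t-statistic with the standard library only. The accuracy values are placeholders invented for illustration and are not results from the paper; the critical value 2.776 is the two-sided 5% Student-t threshold for 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List

T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, df = 4 (five paired runs)


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for per-run differences d = a - b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Placeholder accuracies for five seeds; NOT numbers from the paper.
agent = [0.70, 0.72, 0.69, 0.71, 0.73]
baseline = [0.66, 0.67, 0.66, 0.65, 0.69]

t_stat = paired_t_statistic(agent, baseline)
print(f"t = {t_stat:.3f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by run matters here: it removes seed-to-seed variance shared by both systems, which an unpaired test would count as noise.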

Circularity Check

0 steps flagged

No circularity: procedural framework with empirical results only

The paper presents TS-Agent as a new agentic framework that delegates statistical extraction to external time-series tools and restricts LLMs to iterative ReAct-style reasoning over an evidence log plus critic and verification steps. No equations, fitted parameters, or derivations appear in the provided description that would reduce the reported benchmark gains to quantities defined or computed inside the same paper. Performance claims are direct empirical comparisons against text, vision, and time-series baselines on four external benchmarks; the method itself is described procedurally without self-referential reduction or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that current LLMs can reliably follow tool-augmented iterative reasoning when supplied with structured evidence; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: LLMs can perform effective iterative evidence-based reasoning when given outputs from statistical tools on raw sequences.
Central to the ReAct-style loop and self-refinement critic described in the abstract.

invented entities (1)

TS-Agent framework (no independent evidence)
purpose: Agentic wrapper that delegates extraction to time-series tools and enforces evidence logging plus verification.
Newly proposed system whose independent evidence consists only of the four benchmarks mentioned.

pith-pipeline@v0.9.0 · 5787 in / 1243 out tokens · 47791 ms · 2026-05-18T09:06:40.591258+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: the agent decomposes the problem into reasoning steps, calls analytical tools iteratively to extract evidence from X, and integrates observations into its chain of thought until reaching a final answer... step-wise critic... quality gate

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: delegating statistical and structural extraction to time series analytical tools operating on raw sequences

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
cs.AI 2026-05 unverdicted novelty 7.0

TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. Timeseriesexam: A time series understanding exam. arXiv preprint arXiv:2410.14752, 2024.
[2] Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, and Svitlana Vyetrenko. Evaluating large language models on time series feature understanding: A comprehensive taxonomy and benchmark. arXiv preprint arXiv:2404.16563, 2024.
[3] Ben D Fulcher and Nick S Jones. Hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems, 5(5):527–531, 2017.
[4] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 36:19622–19635, 2023.
[5] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time-series classification: A review. Data Mining and Knowledge Discovery, 33(4):917–963, 2019.
[6] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[7] Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-mqa: Time series multi-task question answering with context enhancement. arXiv preprint arXiv:2503.01875, 2025.
[8] Yaxuan Kong, Yiyuan Yang, Shiyu Wang, Chenghao Liu, Yuxuan Liang, Ming Jin, Stefan Zohren, Dan Pei, Yan Liu, and Qingsong Wen. Position: Empowering time series reasoning with multimodal llms. arXiv preprint arXiv:2502.01477, 2025.
[9] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
[10] Mike A Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, and Tim Althoff. Language models still struggle to zero-shot reason about time series. arXiv preprint arXiv:2404.11757, 2024.
[11] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
[12] Robert H Shumway and David S Stoffer. Time Series Analysis and Its Applications. Springer, 2017.
[13] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
[14] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, et al. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2022.
[15] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[16] Qingsong Wen, Liang Sun, Fan Yang, Xue Wang Song, Jianmin Gao, Xian Wang, and Huan Xu. A survey on time series forecasting. arXiv preprint arXiv:2004.13408, 2020.
[17] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025.
[18] Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning. arXiv preprint arXiv:2412.03104, 2024.
[19] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
[20] Wen Ye, Yizhou Zhang, Wei Yang, Lumingyuan Tang, Defu Cao, Jie Cai, and Yan Liu. Beyond forecasting: Compositional time series reasoning for end-to-end task execution. arXiv preprint arXiv:2410.04047, 2024. (Pith work page title: TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis; internal anchor.)
[21] Xiyuan Zhang, Ranak Roy Chowdhury, Rajesh K Gupta, and Jingbo Shang. Large language models for time series: A survey. arXiv preprint arXiv:2402.01801, 2024.
[22] Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, and Yuxuan Liang. Time-vlm: Exploring multimodal vision-language models for augmented time series forecasting. arXiv preprint arXiv:2502.04395, 2025.
[23] Denny Zhou, Quoc V Le, et al. Least-to-most prompting enables complex reasoning in large language models. In International Conference on Learning Representations (ICLR), 2023.
