arxiv: 2410.04047 · v6 · submitted 2024-10-05 · 💻 cs.LG · cs.AI

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Pith reviewed 2026-05-23 19:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series analysis · large language models · multi-step reasoning · domain-specific agents · inference agents · computational tools · error feedback

The pith

TS-Reasoner integrates LLM reasoning with time-series tools and feedback loops to outperform general models on multi-step inference tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a specialized agent that pairs large language models with domain-specific computational tools and an error-correction loop to perform multi-step time series analysis. It tests this agent on basic concept understanding and on a new dataset that requires both compositional reasoning and numerical precision. The central demonstration is that the combined system produces more accurate and constraint-aware results than standalone general-purpose language models. This matters for applications where time series data must be interpreted iteratively rather than in a single pass. The work positions domain-specialized agents as a practical route to automated analytical workflows.

Core claim

TS-Reasoner is a domain-specialized agent that integrates LLM reasoning with domain-specific computational tools and an error feedback loop, enabling domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. Experiments on TimeSeriesExam and a new multi-step inference dataset show that this approach outperforms standalone general-purpose LLMs in both fundamental time series concept understanding and complex inference tasks.

What carries the argument

TS-Reasoner agent that fuses LLM reasoning with domain-specific computational tools and an error feedback loop.
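
The paper's actual tool set and feedback format are not reproduced on this page, so the following is only a minimal Python sketch of the general pattern, under stated assumptions. The tool registry `TOOLS`, the `Step` type, `stub_planner` (a stand-in for the LLM), and `run_with_feedback` are all hypothetical names. The sketch shows a planner proposing tool calls, an executor running them, and any execution error being handed back to the planner for a retry.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: names and signatures are illustrative only.
TOOLS: Dict[str, Callable[[List[float]], float]] = {
    "mean": lambda xs: mean(xs),
    "max": lambda xs: max(xs),
}


@dataclass
class Step:
    tool: str  # name of the tool the planner wants to call


def stub_planner(task: str, feedback: Optional[str]) -> List[Step]:
    """Stand-in for the LLM: the first plan is wrong, the retry uses the feedback."""
    if feedback is None:
        return [Step(tool="average")]  # deliberately invalid tool name
    return [Step(tool="mean")]


def run_with_feedback(
    task: str,
    series: List[float],
    planner: Callable[[str, Optional[str]], List[Step]],
    max_rounds: int = 3,
) -> Tuple[List[float], int]:
    """Execute the planner's steps; on failure, pass the error text back and retry."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        plan = planner(task, feedback)
        try:
            return [TOOLS[step.tool](series) for step in plan], round_idx
        except KeyError as exc:
            feedback = f"unknown tool {exc}; available: {sorted(TOOLS)}"
        except (ValueError, TypeError) as exc:
            feedback = f"execution error: {exc}"
    raise RuntimeError(f"no valid plan after {max_rounds} rounds: {feedback}")


results, rounds = run_with_feedback("average load", [1.0, 2.0, 3.0], stub_planner)
print(results, rounds)
```

In this toy run the first plan names a tool that does not exist, the executor reports the error, and the second plan succeeds. A real system would replace `stub_planner` with an LLM call and would also check domain constraints on the outputs, not only execution errors.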

If this is right

  • The agent achieves higher accuracy on basic time series concept questions than general LLMs.
  • It completes multi-step inference tasks that require both compositional logic and exact numerical computation more reliably.
  • The resulting workflows stay within domain constraints while mixing symbolic steps and numerical evaluation.
  • The design supports automated real-world time series reasoning without manual intervention at each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tool-plus-feedback structures could be applied to other data modalities that mix language and precise calculation.
  • The approach suggests a template for agents in scientific domains where general models currently fall short on numerical fidelity.
  • Performance gains may depend on how well the feedback loop identifies and corrects specific classes of numerical or logical errors.

Load-bearing premise

Combining language-model reasoning with domain tools and feedback loops produces genuinely better constraint-aware workflows than general models alone.

What would settle it

An experiment in which TS-Reasoner achieves equal or lower accuracy than a general LLM on the same multi-step time series tasks and datasets.

Figures

Figures reproduced from arXiv: 2410.04047 by Defu Cao, Jie Cai, Lumingyuan Tang, Wei Yang, Wen Ye, Yan Liu, Yizhou Zhang.

Figure 1: A time series of daily search frequency for the keyword "reasoning". To address these challenges, we call for domain specialization of LLM [20] and introduce the Domain-Oriented Time Series Agent, TS-Reasoner, for multi-step time series inference. TS-Reasoner integrates language-based reasoning with precise numerical execution by decomposing high-level instructions into structured workflows composed of …
Figure 2: The pipeline of TS-Reasoner. The LLM works as a task decomposer, which learns from …
Figure 3: Strict Accuracy of TS-Reasoner and general purpose LLMs on the TimeSeriesExam. Dataset. To address the underexplored area of complex time series reasoning, we construct a multi-step time series inference dataset categorized into two classes: predictive task and diagnostic task. Each class presents unique challenges requiring both compositional reasoning and precise numerical computation and demonstrates…
Figure 4: Performance on Multi-Step Diagnostic Tasks. A small jittering noise of 0.01 is added to …
Figure 5: Error distribution of different approaches on electricity prediction task without covariates.
Figure 6: Ablation Study on Electricity Prediction w/ Covariates task. We remove each component …
Figure 7: Illustration of an example TS-Reasoner workflow
Figure 8: Example Result Errors. I have historical Advertising Spend (A), Sales (B), Economic Factors (C), Customer Sentiment (D) data and want to get the causal relationship between each pair of the variables. I know that 41.66666666666667% of the variable pairs have relationship. Consider the potential influence of each variable on the others in this variable list: ['Advertising Spend (A)', 'Sales (B)', 'Economic F…
Figure 9: Example Execution Errors. D Additional Error Analysis. E Task Instance Templates. In this section, we provide an outline of templates used for each type of tasks. The exact template for each sub question type may vary from each other to best reflect the available information: with and without covariate versions, with and without large amount of data, with or without anomaly free samples). 7 https://climatele…
Figure 10: Error distribution of different approaches on electricity prediction task with covariates.
Figure 11: Error distribution of different approaches on electricity prediction task across multiple …
Figure 12: Error distribution of different approaches on extreme weather detection with anomaly free …
Figure 13: Error distribution of different approaches on extreme weather detection task with known …
Figure 14: Error distribution of different approaches on causal discovery tasks.
Original abstract

Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain limited to either single-step inference or are constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system's capabilities along two axes: (1) fundamental time series understanding assessed by TimeSeriesExam and (2) complex, multi-step inference evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding as well as the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TS-Reasoner, a domain-specialized agent that integrates LLM reasoning with domain-specific computational tools and an error feedback loop to enable multi-step time series inference. It evaluates performance on TimeSeriesExam for basic concept understanding and on a newly introduced dataset for compositional multi-step reasoning, claiming consistent outperformance over standalone general-purpose LLMs.

Significance. If the performance gains can be attributed to the agent architecture rather than tool access alone, the work would provide evidence that hybrid LLM-tool systems with feedback can improve automated analysis on tasks requiring both symbolic and numerical precision, with potential implications for domain-specific agent design in scientific ML.

major comments (3)
  1. [Experiments] Experiments section (and abstract): the baselines are described only as 'standalone general-purpose LLMs' with no indication that they receive access to the same domain-specific computational tools used by TS-Reasoner. Because the central claim attributes gains to the combination of LLM reasoning, tools, and error feedback loop, the absence of tool-augmented LLM baselines (or component ablations) means the results do not isolate whether the reported improvements require the agent scaffolding.
  2. [§4] §4 (evaluation on new dataset): the manuscript provides insufficient detail on dataset construction, task distribution, and metrics for compositional reasoning versus computational precision, making it difficult to verify that the new benchmark genuinely stresses multi-step inference beyond what single-step tool use would achieve.
  3. [Results tables] Table 1 / results tables: no statistical significance tests, confidence intervals, or variance across runs are reported for the claimed outperformance, which is load-bearing for the assertion that the approach 'outperforms' on both axes.
minor comments (2)
  1. [§3] Notation for the error feedback loop and tool interfaces is introduced without a clear diagram or pseudocode, reducing reproducibility.
  2. [Abstract] The abstract states outperformance but the full experimental design details (prompt templates, tool definitions, number of trials) appear only later; moving a concise summary of controls to the abstract would improve clarity.
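
The first major comment asks for tool-augmented baselines or component ablations. As a rough illustration of what such an experimental grid could look like, the Python sketch below enumerates system variants by switching components on and off. The component names in `COMPONENTS`, and the assumption that the multi-step loop and error feedback only make sense when tools are available, are illustrative choices made here and do not come from the paper.

```python
from itertools import product
from typing import Dict, List

# Hypothetical component names, chosen for illustration only.
COMPONENTS = ("tools", "multi_step_loop", "error_feedback")


def ablation_grid() -> List[Dict[str, bool]]:
    """Enumerate on/off combinations, skipping ones that need tools but lack them."""
    configs = []
    for flags in product([False, True], repeat=len(COMPONENTS)):
        cfg = dict(zip(COMPONENTS, flags))
        # Assumption: the loop and the feedback both act on tool calls.
        if not cfg["tools"] and (cfg["multi_step_loop"] or cfg["error_feedback"]):
            continue
        configs.append(cfg)
    return configs


def label(cfg: Dict[str, bool]) -> str:
    """Human-readable name for one configuration."""
    enabled = [name for name in COMPONENTS if cfg[name]]
    return "standalone LLM" if not enabled else "LLM + " + " + ".join(enabled)


configs = ablation_grid()
for cfg in configs:
    print(label(cfg))
```

Under these assumptions the grid has five variants, running from the standalone LLM through the tool-only baseline to the full system. Comparing adjacent variants is what would separate the effect of tool access from the effect of the agent scaffolding.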

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that additional baselines, expanded dataset details, and statistical reporting are needed to strengthen the claims, and we will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Experiments] Experiments section (and abstract): the baselines are described only as 'standalone general-purpose LLMs' with no indication that they receive access to the same domain-specific computational tools used by TS-Reasoner. Because the central claim attributes gains to the combination of LLM reasoning, tools, and error feedback loop, the absence of tool-augmented LLM baselines (or component ablations) means the results do not isolate whether the reported improvements require the agent scaffolding.

    Authors: We agree that the current evaluation does not fully isolate the contribution of the agent scaffolding from tool access alone. In the revision we will add tool-augmented LLM baselines (general-purpose LLMs given the same computational tools but without the multi-step agent loop or error feedback) as well as component ablations. These new results will be reported in the experiments section and referenced in the abstract. revision: yes

  2. Referee: [§4] §4 (evaluation on new dataset): the manuscript provides insufficient detail on dataset construction, task distribution, and metrics for compositional reasoning versus computational precision, making it difficult to verify that the new benchmark genuinely stresses multi-step inference beyond what single-step tool use would achieve.

    Authors: We will substantially expand §4 to include the dataset construction methodology, the breakdown of task types (compositional reasoning vs. computational precision), and the precise metrics used for each axis. This will clarify how the benchmark evaluates multi-step inference beyond single-step tool calls. revision: yes

  3. Referee: [Results tables] Table 1 / results tables: no statistical significance tests, confidence intervals, or variance across runs are reported for the claimed outperformance, which is load-bearing for the assertion that the approach 'outperforms' on both axes.

    Authors: We acknowledge the omission. We will rerun the key experiments across multiple random seeds, compute confidence intervals and standard deviations, and add statistical significance tests (e.g., paired t-tests) to all reported performance differences in the revised tables. revision: yes
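
The third response promises variance across seeds, confidence intervals, and paired tests. The Python sketch below shows one way such a paired comparison could be computed using only the standard library. The per-seed scores are synthetic placeholders invented for this illustration and are not results from the paper. The function names `paired_t_statistic` and `bootstrap_ci` are likewise hypothetical.

```python
import math
import random
from statistics import mean, stdev
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def bootstrap_ci(
    a: List[float], b: List[float], n_boot: int = 2000,
    alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic placeholder accuracies, one per random seed (NOT from the paper).
agent_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
baseline_scores = [0.70, 0.71, 0.69, 0.73, 0.72]

t_stat, df = paired_t_statistic(agent_scores, baseline_scores)
lo, hi = bootstrap_ci(agent_scores, baseline_scores)
print(f"t = {t_stat:.3f} (df = {df}), 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide. `scipy.stats.ttest_rel` is one option if SciPy is available. With only a handful of seeds the bootstrap interval is coarse, so it should be read as a rough indication, not a precise bound.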

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarks

full rationale

The paper introduces an agent architecture (LLM + domain tools + error feedback) and evaluates it on TimeSeriesExam plus a new compositional dataset. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or description. The central claim is experimental outperformance over standalone LLMs; the benchmarks are described as external and independent, with no indication that results reduce to the inputs by construction. This is a standard system paper whose validity hinges on experimental controls rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper's approach rests on the assumption that LLM-based agents augmented with tools will outperform general models, which is a domain assumption not independently verified in the abstract.

axioms (1)
  • domain assumption: Domain-specific tools can be effectively integrated with LLMs for precise numerical analysis in time series tasks
    Central to enabling the constraint-aware workflows.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

  2. TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

    cs.AI · 2026-04 · conditional · novelty 7.0

    TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

  3. TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

    cs.AI · 2025-10 · unverdicted · novelty 7.0

    TS-Agent is an agentic framework that uses LLMs only for evidence-based reasoning while delegating extraction to raw time series tools, matching or exceeding baselines on four benchmarks with largest gains on reasoning tasks.

  4. TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation

    cs.IR · 2026-04 · unverdicted · novelty 6.0

    TimeMM proposes a time-as-operator spectral filtering framework with adaptive mixing and modality routing to model non-stationary multimodal user preferences in recommendation systems.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 4 Pith papers · 11 internal anchors

  1. [1]

    A refined comprehensive earthquake focal mechanism catalog for southern california derived with deep learning algorithms

    Yifang Cheng, Zachary Ross, Egill Hauksson, and Yehuda Ben-Zion. A refined comprehensive earthquake focal mechanism catalog for southern california derived with deep learning algorithms. In AGU Fall Meeting Abstracts, volume 2021, pages S32A–05, 2021

  2. [2]

    Identifying coordinated accounts on social media through hidden influence and group behaviours

    Karishma Sharma, Yizhou Zhang, Emilio Ferrara, and Yan Liu. Identifying coordinated accounts on social media through hidden influence and group behaviours. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1441–1451, 2021

  3. [3]

    Vigdet: Knowledge informed neural temporal point process for coordination detection on social media

    Yizhou Zhang, Karishma Sharma, and Yan Liu. Vigdet: Knowledge informed neural temporal point process for coordination detection on social media. Advances in Neural Information Processing Systems, 34:3218–3231, 2021

  4. [4]

    Time series analysis

    James D Hamilton. Time series analysis. Princeton university press, 2020

  5. [5]

    Time-series forecasting with deep learning: a survey

    Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021

  6. [6]

    Deep learning for time series classification: a review

    Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time series classification: a review. Data mining and knowledge discovery, 33(4):917–963, 2019

  7. [7]

    Deep learning for time series anomaly detection: A survey

    Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu Aggarwal, and Mahsa Salehi. Deep learning for time series anomaly detection: A survey. ACM Computing Surveys, 57(1):1–42, 2024

  8. [8]

    Chronos: Learning the Language of Time Series

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815, 2024

  9. [9]

    Moirai-moe: Empowering time series foundation models with sparse mixture of experts

    Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Moirai-moe: Empowering time series foundation models with sparse mixture of experts. arXiv preprint arXiv:2410.10469, 2024

  10. [10]

    Timegpt-1

    Azul Garza and Max Mergenthaler-Canseco. Timegpt-1. arXiv preprint arXiv:2310.03589, 2023

  11. [11]

    Moment: A family of open time-series foundation models

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024

  12. [12]

    Informer: Beyond efficient transformer for long sequence time-series forecasting

    Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021

  13. [13]

    220kv city power grid maximum loadability determination with static security-constraints

    Ke-qiu WANG, Si-guang SUN, Hong-yi WANG, Chang-xu JIANG, and Zhao-xia JING. 220kv city power grid maximum loadability determination with static security-constraints. Power, Energy Engineering and Management (PEEM2016), page 1, 2016

  14. [14]

    Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey

    Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024

  15. [15]

    Evaluating large language models at evaluating instruction following

    Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641, 2023

  16. [16]

    The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models

    Dimitris Spathis and Fahim Kawsar. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. Journal of the American Medical Informatics Association, 31(9):2151–2158, 2024

  17. [17]

    Mechanics of next token prediction with self-attention

    Yingcong Li, Yixiao Huang, Muhammed E Ildiz, Ankit Singh Rawat, and Samet Oymak. Mechanics of next token prediction with self-attention. In International Conference on Artificial Intelligence and Statistics, pages 685–693. PMLR, 2024

  18. [18]

    The future is different: Large pre-trained language models fail in prediction tasks

    Kostadin Cvejoski, Ramsés J Sánchez, and César Ojeda. The future is different: Large pre-trained language models fail in prediction tasks. arXiv preprint arXiv:2211.00384, 2022

  19. [19]

    Why large language models fail at precision regression

    Karthick Panner Selvam. Why large language models fail at precision regression, 2025

  20. [20]

    Domain specialization as the key to make large language models disruptive: A comprehensive survey

    Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, et al. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703, 2023

  21. [21]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  22. [22]

    Timeseriesexam: A time series understanding exam

    Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. Timeseriesexam: A time series understanding exam. arXiv preprint arXiv:2410.14752, 2024

  23. [23]

    Real-time load variability control using energy storage system for demand-side management in south korea

    Kyo Beom Han, Jaesung Jung, and Byung O Kang. Real-time load variability control using energy storage system for demand-side management in south korea. Energies, 14(19):6292, 2021

  24. [24]

    Short-term scheduling of electric power systems under minimum load conditions

    Claudia Greif, Raymond B Johnson, Chao an Li, Alva J Svoboda, and K Andrijeski Uemura. Short-term scheduling of electric power systems under minimum load conditions. IEEE transactions on power systems, 14(1):280–286, 1999

  25. [25]

    Learning semantic context from normal samples for unsupervised anomaly detection

    Xudong Yan, Huaidong Zhang, Xuemiao Xu, Xiaowei Hu, and Pheng-Ann Heng. Learning semantic context from normal samples for unsupervised anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 3110–3118, 2021

  26. [26]

    Parkca: Causal inference with partially known causes

    Raquel Aoki and Martin Ester. Parkca: Causal inference with partially known causes. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, pages 196–207. World Scientific, 2020

  27. [27]

    Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting

    Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 459–469, New York, NY, USA, 2023. Association for Computing Machinery

  28. [28]

    Csdi: Conditional score-based diffusion models for probabilistic time series imputation

    Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021

  29. [29]

    Convolutional neural networks for time series classification

    Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu. Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics, 28(1):162–169, 2017

  30. [30]

    Anomaly transformer: Time series anomaly detection with association discrepancy

    Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642, 2021

  31. [31]

    Unified training of universal time series forecasting transformers

    Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning, 2024

  32. [32]

    Large language models are zero-shot time series forecasters

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 36, 2024

  33. [33]

    Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023

  34. [34]

    Time-moe: Billion-scale time series foundation models with mixture of experts

    Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-moe: Billion-scale time series foundation models with mixture of experts. In The Thirteenth International Conference on Learning Representations, 2025

  35. [35]

    Tempo: Prompt-based generative pre-trained transformer for time series forecasting

    Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948, 2023

  36. [36]

    Towards Reasoning in Large Language Models: A Survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022

  37. [37]

    Reasoning with language model prompting: A survey

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022

  38. [38]

    Large language models for mathematical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024

  39. [39]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  40. [40]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  41. [41]

    Beyond chain-of-thought, effective graph-of-thought reasoning in language models

    Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning in language models. arXiv preprint arXiv:2305.16582, 2023

  42. [42]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  43. [43]

    Recursive introspection: Teaching language model agents how to self-improve

    Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. arXiv preprint arXiv:2407.18219, 2024

  44. [44]

    Decomposed Prompting: A Modular Approach for Solving Complex Tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

  45. [45]

    Faithful reasoning using large language models

    Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

  46. [46]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023

  47. [47]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

  48. [48]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  49. [49]

    Honeycomb: A flexible llm-based agent system for materials science

    Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, and Bang Liu. Honeycomb: A flexible llm-based agent system for materials science. arXiv preprint arXiv:2409.00135, 2024

  50. [50]

    Crispr-gpt: An llm agent for automated design of gene-editing experiments

    Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. Crispr-gpt: An llm agent for automated design of gene-editing experiments. arXiv preprint arXiv:2404.18021, 2024

  51. [51]

    Agentic feedback loop modeling improves recommendation and user simulation

    Shihao Cai, Jizhi Zhang, Keqin Bao, Chongming Gao, Qifan Wang, Fuli Feng, and Xiangnan He. Agentic feedback loop modeling improves recommendation and user simulation. In Proceedings of the 48th International ACM SIGIR conference on Research and Development in Information Retrieval, 2025

  52. [52]

    Adaplanner: Adaptive planning from feedback with language models

    Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. Advances in neural information processing systems, 36:58202–58245, 2023

  53. [53]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  54. [54]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  55. [55]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  56. [56]

    Executable code actions elicit better LLM agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning, 2024

  57. [57]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  58. [58]

    AgentScope: A Flexible yet Robust Multi-Agent Platform

    Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Ze Yu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. AgentScope: A flexible yet robust multi-agent platform. CoRR, abs/2402.14034, 2024

  59. [59]

    Application of Deep Convolutional Neural Networks for Detecting Extreme Weather in Climate Datasets

    Yunjie Liu, Evan Racah, Joaquin Correa, Amir Khosrowshahi, David Lavers, Kenneth Kunkel, Michael Wehner, William Collins, et al. Application of deep convolutional neural networks for detecting extreme weather in climate datasets. arXiv preprint arXiv:1605.01156, 2016

  60. [60]

    Electricity load forecasting: a systematic review

    Isaac Kofi Nti, Moses Teimeh, Owusu Nyarko-Boateng, and Adebayo Felix Adekoya. Electricity load forecasting: a systematic review. Journal of Electrical Systems and Information Technology, 7:1–19, 2020

  61. [61]

    PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids

    Xiangtian Zheng, Nan Xu, Loc Trinh, Dongqi Wu, Tong Huang, S Sivaranjani, Yan Liu, and Le Xie. PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids. arXiv preprint arXiv:2110.06324, 2021

A Dataset Compilation

Since complex time series reasoning remains largely underexplored, we construct a compl...

  62. [62]

    TEMPO",

    2. I require that the system load is maintained above a minimum of {load value} MW.
    3. I must monitor the load ramp rate to ensure it does not exceed {constraint value} MW for each time step.
    4. I need to manage the load variability so that it does not exceed {constraint value} MW over the given period.]

    Think about how {influence variables} influence {targe...
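The load constraints in the prompt template above (minimum load, per-step ramp rate, variability over the period) could be checked against a predicted load series roughly as follows. This is only an illustrative sketch, not the paper's evaluator: the function name `check_load_constraints` and its parameters are hypothetical, "above a minimum" is read as greater than or equal to the threshold, and "variability" is assumed to mean the population standard deviation, which the source does not specify.

```python
from statistics import pstdev


def check_load_constraints(load, min_load=None, max_ramp=None, max_variability=None):
    """Return {constraint_name: satisfied?} for the constraints that were given.

    All names and definitions here are assumptions for illustration only.
    """
    load = [float(x) for x in load]
    results = {}
    if min_load is not None:
        # Every time step must stay at or above the minimum load (MW).
        results["min_load"] = all(x >= min_load for x in load)
    if max_ramp is not None:
        # Ramp rate: absolute change between consecutive time steps (MW per step).
        ramps = [abs(b - a) for a, b in zip(load, load[1:])]
        results["ramp_rate"] = all(r <= max_ramp for r in ramps)
    if max_variability is not None:
        # Variability: assumed here to be the population standard deviation (MW).
        results["variability"] = pstdev(load) <= max_variability
    return results


if __name__ == "__main__":
    print(check_load_constraints([100, 105, 103, 110],
                                 min_load=95, max_ramp=5, max_variability=5))
    # {'min_load': True, 'ramp_rate': False, 'variability': True}
```

In this toy series the last step jumps by 7 MW, so only the ramp-rate constraint fails.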

  63. [63]

    data correlation: the multi variable should be correlated, sample: which A first influence B, then B have influence on C or D, there should be some time delay, as the influence on other staff needs time

  64. [64]

    data trend: there should be some trend in the data, like the data is increasing or decreasing

  65. [65]

    data seasonality: there should be some seasonality in the data, like the data is periodic

  66. [66]

    data noise: the noise should be added to the data, as the real world data is not perfect

  67. [67]

    data background: the data should have some real world background, you should first think about different real world data, and provide a description for the variable and time series data, then generate the data using the code.

    CoT Sample:
    Q: Approximate Relation Ratio: 0.5
    Relation Matrix:
          A  B  C  D
       A  1  1  0  1
       B  0  1  0  1
       C  0  1  1  1
       D  0  0  0  1
    • A influences B and D, and itself
    ...

  68. [68]

    V AL" and the expected anomaly rate is to be stored in the variable

    1. Advertising (A): The level of advertising spend directly impacts the sales of each store. After a delay, this starts influencing sales.
    2. Sales (B): The sales numbers for each store are influenced by both the advertising and local seasonal events.
    3. Economic Factors (C): Broader economic trends, like GDP growth or unemployment rates, also impact sales. ...
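The generation criteria listed above (lagged influence between variables according to a relation matrix, plus trend, seasonality, and noise) can be illustrated with a small synthetic generator. This is a hedged sketch under stated assumptions, not the code the paper's prompt actually produced: `generate_series`, its parameters, the fixed lag, and the linear coupling are all hypothetical choices. Row `i`, column `j` of the matrix is read as "variable `i` influences variable `j`", matching the sample's note that A influences B and D, and itself.

```python
import math
import random

# Relation matrix from the CoT sample; rows/columns are A, B, C, D.
RELATION = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]


def generate_series(relation, n_steps=200, lag=3, coupling=0.5,
                    trend=0.02, period=24, noise_std=0.1, seed=0):
    """Generate one series per variable (all parameters are illustrative)."""
    rng = random.Random(seed)
    n_vars = len(relation)
    series = []
    for _ in range(n_vars):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        # Base signal: linear trend + sinusoidal seasonality + Gaussian noise.
        series.append([
            trend * t + math.sin(2.0 * math.pi * t / period + phase)
            + rng.gauss(0.0, noise_std)
            for t in range(n_steps)
        ])
    # Delayed influence: a source variable affects its targets `lag` steps later.
    for t in range(lag, n_steps):
        for target in range(n_vars):
            for source in range(n_vars):
                if source != target and relation[source][target]:
                    series[target][t] += coupling * series[source][t - lag]
    return series


if __name__ == "__main__":
    data = generate_series(RELATION, n_steps=48)
    for name, values in zip("ABCD", data):
        print(name, [round(v, 2) for v in values[:5]])
```

Because the influence only ever reads values from `lag` steps in the past, the same loop also works for relation matrices that contain cycles, although a large `coupling` could then make the series grow without bound.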