TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Defu Cao; Jie Cai; Lumingyuan Tang; Wei Yang; Wen Ye; Yan Liu; Yizhou Zhang

arxiv: 2410.04047 · v6 · submitted 2024-10-05 · 💻 cs.LG · cs.AI

Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, Yan Liu

Pith reviewed 2026-05-23 19:43 UTC · model grok-4.3

keywords: time series analysis · large language models · multi-step reasoning · domain-specific agents · inference agents · computational tools · error feedback

The pith

TS-Reasoner integrates LLM reasoning with time-series tools and feedback loops to outperform general models on multi-step inference tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a specialized agent that pairs large language models with domain-specific computational tools and an error-correction loop to perform multi-step time series analysis. It tests this agent on basic concept understanding and on a new dataset that requires both compositional reasoning and numerical precision. The central demonstration is that the combined system produces more accurate and constraint-aware results than standalone general-purpose language models. This matters for applications where time series data must be interpreted iteratively rather than in a single pass. The work positions domain-specialized agents as a practical route to automated analytical workflows.

Core claim

TS-Reasoner is a domain-specialized agent that integrates LLM reasoning with domain-specific computational tools and an error feedback loop, enabling domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. Experiments on TimeSeriesExam and a new multi-step inference dataset show that this approach outperforms standalone general-purpose LLMs in both fundamental time series concept understanding and complex inference tasks.

What carries the argument

The TS-Reasoner agent, which fuses LLM reasoning with domain-specific computational tools and an error feedback loop.
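The tool-plus-feedback machinery named above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `stub_planner` stands in for the LLM, and `moving_average`, `TOOLS`, `Step`, and `run_with_feedback` are hypothetical names invented here. The only idea it shows is that a failed tool call is turned into an error message that the planner sees on its next attempt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


def moving_average(series: List[float], window: int) -> List[float]:
    """Hypothetical domain tool: simple moving average with input validation."""
    if window <= 0 or window > len(series):
        raise ValueError(f"window must be in 1..{len(series)}, got {window}")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Hypothetical tool registry; a real agent would expose many more tools.
TOOLS: Dict[str, Callable[..., List[float]]] = {"moving_average": moving_average}


@dataclass
class Step:
    tool: str      # name of the tool to call
    kwargs: dict   # keyword arguments for that tool


def stub_planner(task: str, feedback: Optional[str]) -> Step:
    """Stand-in for the LLM: proposes a tool call and revises it after an error."""
    if feedback is None:
        return Step("moving_average", {"window": 10})  # first guess, too large
    return Step("moving_average", {"window": 3})       # revised after feedback


def run_with_feedback(task: str, series: List[float],
                      planner: Callable[[str, Optional[str]], Step],
                      max_attempts: int = 3) -> Tuple[List[float], int]:
    """Execute the planner's step; on failure, feed the error back and retry."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        step = planner(task, feedback)
        try:
            return TOOLS[step.tool](series, **step.kwargs), attempt
        except (KeyError, ValueError, TypeError) as err:
            feedback = f"{type(err).__name__}: {err}"  # error message for the planner
    raise RuntimeError(f"no valid plan after {max_attempts} attempts: {feedback}")


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed, attempts = run_with_feedback("smooth the series", series, stub_planner)
print(smoothed, attempts)  # prints [2.0, 3.0, 4.0] 2
```

In this toy run the first proposed window exceeds the series length, the tool raises, and the second attempt succeeds. Removing the `feedback` argument from the planner would give the "tools without feedback" variant that an ablation would compare against.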

If this is right

The agent achieves higher accuracy on basic time series concept questions than general LLMs.
It completes multi-step inference tasks that require both compositional logic and exact numerical computation more reliably.
The resulting workflows stay within domain constraints while mixing symbolic steps and numerical evaluation.
The design supports automated real-world time series reasoning without manual intervention at each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar tool-plus-feedback structures could be applied to other data modalities that mix language and precise calculation.
The approach suggests a template for agents in scientific domains where general models currently fall short on numerical fidelity.
Performance gains may depend on how well the feedback loop identifies and corrects specific classes of numerical or logical errors.

Load-bearing premise

Combining language-model reasoning with domain tools and feedback loops produces genuinely better constraint-aware workflows than general models alone.

What would settle it

An experiment in which TS-Reasoner achieves equal or lower accuracy than a general LLM on the same multi-step time series tasks and datasets.

Figures

Figures reproduced from arXiv: 2410.04047 by Defu Cao, Jie Cai, Lumingyuan Tang, Wei Yang, Wen Ye, Yan Liu, Yizhou Zhang.

**Figure 1.** A time series of daily search frequency for the keyword "reasoning". To address these challenges, we call for domain specialization of LLMs [20] and introduce the Domain-Oriented Time Series Agent, TS-Reasoner, for multi-step time series inference. TS-Reasoner integrates language-based reasoning with precise numerical execution by decomposing high-level instructions into structured workflows composed of …

**Figure 2.** The pipeline of TS-Reasoner. The LLM works as a task decomposer, which learns from …

**Figure 3.** Strict accuracy of TS-Reasoner and general-purpose LLMs on TimeSeriesExam. Dataset: To address the underexplored area of complex time series reasoning, we construct a multi-step time series inference dataset categorized into two classes: predictive tasks and diagnostic tasks. Each class presents unique challenges requiring both compositional reasoning and precise numerical computation and demonstrates…

**Figure 4.** Performance on multi-step diagnostic tasks. A small jittering noise of 0.01 is added to …

**Figure 5.** Error distribution of different approaches on the electricity prediction task without covariates.

**Figure 6.** Ablation study on the electricity prediction task with covariates. We remove each component …

**Figure 7.** Illustration of an example TS-Reasoner workflow.

**Figure 8.** Example Result Errors. I have historical Advertising Spend (A), Sales (B), Economic Factors (C), Customer Sentiment (D) data and want to get the causal relationship between each pair of the variables. I know that 41.66666666666667% of the variable pairs have relationship. Consider the potential influence of each variable on the others in this variable list: ['Advertising Spend (A)', 'Sales (B)', 'Economic F…

**Figure 9.** Example Execution Errors. D Additional Error Analysis. E Task Instance Templates. In this section, we provide an outline of the templates used for each type of task. The exact template for each sub-question type may vary to best reflect the available information (with and without covariates, with and without a large amount of data, with or without anomaly-free samples). 7 https://climatele…

**Figure 10.** Error distribution of different approaches on the electricity prediction task with covariates.

**Figure 11.** Error distribution of different approaches on the electricity prediction task across multiple …

**Figure 12.** Error distribution of different approaches on extreme weather detection with anomaly-free …

**Figure 13.** Error distribution of different approaches on the extreme weather detection task with known …

**Figure 14.** Error distribution of different approaches on causal discovery tasks.

Original abstract

Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain limited to either single-step inference or are constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system's capabilities along two axes: (1) fundamental time series understanding assessed by TimeSeriesExam and (2) complex, multi-step inference evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding as well as the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TS-Reasoner combines an LLM with domain tools and a feedback loop for multi-step time series tasks, but the reported gains over plain LLMs do not yet isolate whether the agent scaffolding adds anything beyond tool access.

The paper's main contribution is TS-Reasoner, an agent that layers an LLM with time-series-specific computational tools and an error feedback loop to support multi-step inference instead of single-step or purely linguistic outputs. It evaluates basic concept understanding on TimeSeriesExam and compositional plus numerical precision on a new dataset. The system description is concrete and the move toward constraint-aware workflows that mix symbolic and numerical steps is a reasonable direction given the limits of standalone LLMs on these tasks. The new dataset itself is a useful addition for testing exactly the multi-step setting the authors target. The experiments claim clear outperformance over general-purpose LLMs.

The soft spot is that the abstract and claim language give no sign of tool-equipped LLM baselines or component ablations. If the comparison is only against LLMs with no tool access at all, the gains could be explained by tool availability rather than the agent architecture or feedback loop. That matches the stress-test concern exactly, and without those controls the central claim stays under-supported. The citation pattern looks standard for LLM-agent work and there is no obvious circularity in the evaluation setup.

This is aimed at people already working on domain-specialized LLM agents or automated scientific workflows. A reader interested in time-series automation would get value from the system sketch and the new test cases, even if the results need stronger isolation of the proposed components. It is worth sending to peer review because the idea is well-motivated and the evaluation axes are sensible, but the experimental design will need tightening before the performance claims can be taken at face value.

Referee Report

3 major / 2 minor

Summary. The paper introduces TS-Reasoner, a domain-specialized agent that integrates LLM reasoning with domain-specific computational tools and an error feedback loop to enable multi-step time series inference. It evaluates performance on TimeSeriesExam for basic concept understanding and on a newly introduced dataset for compositional multi-step reasoning, claiming consistent outperformance over standalone general-purpose LLMs.

Significance. If the performance gains can be attributed to the agent architecture rather than tool access alone, the work would provide evidence that hybrid LLM-tool systems with feedback can improve automated analysis on tasks requiring both symbolic and numerical precision, with potential implications for domain-specific agent design in scientific ML.

major comments (3)

[Experiments] Experiments section (and abstract): the baselines are described only as 'standalone general-purpose LLMs' with no indication that they receive access to the same domain-specific computational tools used by TS-Reasoner. Because the central claim attributes gains to the combination of LLM reasoning, tools, and error feedback loop, the absence of tool-augmented LLM baselines (or component ablations) means the results do not isolate whether the reported improvements require the agent scaffolding.
[§4] §4 (evaluation on new dataset): the manuscript provides insufficient detail on dataset construction, task distribution, and metrics for compositional reasoning versus computational precision, making it difficult to verify that the new benchmark genuinely stresses multi-step inference beyond what single-step tool use would achieve.
[Results tables] Table 1 / results tables: no statistical significance tests, confidence intervals, or variance across runs are reported for the claimed outperformance, which is load-bearing for the assertion that the approach 'outperforms' on both axes.

minor comments (2)

[§3] Notation for the error feedback loop and tool interfaces is introduced without a clear diagram or pseudocode, reducing reproducibility.
[Abstract] The abstract states outperformance but the full experimental design details (prompt templates, tool definitions, number of trials) appear only later; moving a concise summary of controls to the abstract would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that additional baselines, expanded dataset details, and statistical reporting are needed to strengthen the claims, and we will revise the manuscript accordingly.

Referee: [Experiments] Experiments section (and abstract): the baselines are described only as 'standalone general-purpose LLMs' with no indication that they receive access to the same domain-specific computational tools used by TS-Reasoner. Because the central claim attributes gains to the combination of LLM reasoning, tools, and error feedback loop, the absence of tool-augmented LLM baselines (or component ablations) means the results do not isolate whether the reported improvements require the agent scaffolding.

Authors: We agree that the current evaluation does not fully isolate the contribution of the agent scaffolding from tool access alone. In the revision we will add tool-augmented LLM baselines (general-purpose LLMs given the same computational tools but without the multi-step agent loop or error feedback) as well as component ablations. These new results will be reported in the experiments section and referenced in the abstract.
Revision: yes

Referee: [§4] §4 (evaluation on new dataset): the manuscript provides insufficient detail on dataset construction, task distribution, and metrics for compositional reasoning versus computational precision, making it difficult to verify that the new benchmark genuinely stresses multi-step inference beyond what single-step tool use would achieve.

Authors: We will substantially expand §4 to include the dataset construction methodology, the breakdown of task types (compositional reasoning vs. computational precision), and the precise metrics used for each axis. This will clarify how the benchmark evaluates multi-step inference beyond single-step tool calls.
Revision: yes

Referee: [Results tables] Table 1 / results tables: no statistical significance tests, confidence intervals, or variance across runs are reported for the claimed outperformance, which is load-bearing for the assertion that the approach 'outperforms' on both axes.

Authors: We acknowledge the omission. We will rerun the key experiments across multiple random seeds, compute confidence intervals and standard deviations, and add statistical significance tests (e.g., paired t-tests) to all reported performance differences in the revised tables.
Revision: yes
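To make the statistical reporting discussed above concrete, here is a minimal sketch of a paired comparison across random seeds, using only the Python standard library. The accuracy values are made up for illustration and are not results from the paper. The function name `paired_t` is hypothetical, and the critical value 2.776 is the two-sided 95% value of Student's t with 4 degrees of freedom, which matches the five seeds assumed here.

```python
import math
import statistics
from typing import List, Tuple

# Made-up per-seed accuracies for illustration only; NOT results from the paper.
agent_acc = [0.80, 0.82, 0.78, 0.81, 0.79]
baseline_acc = [0.70, 0.73, 0.69, 0.70, 0.68]


def paired_t(a: List[float], b: List[float],
             t_crit: float) -> Tuple[float, float, Tuple[float, float]]:
    """Return mean paired difference, t statistic, and a confidence interval."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]        # per-seed differences
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error
    if se == 0:
        raise ValueError("all differences are identical; t statistic undefined")
    half_width = t_crit * se
    return mean_diff, mean_diff / se, (mean_diff - half_width, mean_diff + half_width)


T_CRIT_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom
mean_diff, t_stat, ci = paired_t(agent_acc, baseline_acc, T_CRIT_DF4)
print(f"mean diff = {mean_diff:.3f}, t = {t_stat:.2f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The difference counts as significant at the 5% level when the absolute t statistic exceeds the critical value, which is equivalent to the confidence interval excluding zero. With a different number of seeds the critical value must be changed to match the new degrees of freedom.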

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarks

The paper introduces an agent architecture (LLM + domain tools + error feedback) and evaluates it on TimeSeriesExam plus a new compositional dataset. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or description. The central claim is experimental outperformance over standalone LLMs; the benchmarks are described as external and independent, with no indication that results reduce to the inputs by construction. This is a standard system paper whose validity hinges on experimental controls rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's approach rests on the assumption that LLM-based agents augmented with tools will outperform general models, which is a domain assumption not independently verified in the abstract.

axioms (1)

domain assumption: Domain-specific tools can be effectively integrated with LLMs for precise numerical analysis in time series tasks.
Central to enabling the constraint-aware workflows.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
cs.AI · 2026-05 · unverdicted · novelty 7.0
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
cs.AI · 2026-04 · conditional · novelty 7.0
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering
cs.AI · 2025-10 · unverdicted · novelty 7.0
TS-Agent is an agentic framework that uses LLMs only for evidence-based reasoning while delegating extraction to raw time series tools, matching or exceeding baselines on four benchmarks with largest gains on reasoning tasks.

TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation
cs.IR · 2026-04 · unverdicted · novelty 6.0
TimeMM proposes a time-as-operator spectral filtering framework with adaptive mixing and modality routing to model non-stationary multimodal user preferences in recommendation systems.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 4 Pith papers · 11 internal anchors

[1] Yifang Cheng, Zachary Ross, Egill Hauksson, and Yehuda Ben-Zion. A refined comprehensive earthquake focal mechanism catalog for southern california derived with deep learning algorithms. In AGU Fall Meeting Abstracts, volume 2021, pages S32A–05, 2021.
[2] Karishma Sharma, Yizhou Zhang, Emilio Ferrara, and Yan Liu. Identifying coordinated accounts on social media through hidden influence and group behaviours. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1441–1451, 2021.
[3] Yizhou Zhang, Karishma Sharma, and Yan Liu. Vigdet: Knowledge informed neural temporal point process for coordination detection on social media. Advances in Neural Information Processing Systems, 34:3218–3231, 2021.
[4] James D Hamilton. Time series analysis. Princeton university press, 2020.
[5] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.
[6] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time series classification: a review. Data mining and knowledge discovery, 33(4):917–963, 2019.
[7] Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu Aggarwal, and Mahsa Salehi. Deep learning for time series anomaly detection: A survey. ACM Computing Surveys, 57(1):1–42, 2024.
[8] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815, 2024.
[9] Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Moirai-moe: Empowering time series foundation models with sparse mixture of experts. arXiv preprint arXiv:2410.10469, 2024.
[10] Azul Garza and Max Mergenthaler-Canseco. Timegpt-1. arXiv preprint arXiv:2310.03589, 2023.
[11] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024.
[12] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021.
[13] Ke-qiu WANG, Si-guang SUN, Hong-yi WANG, Chang-xu JIANG, and Zhao-xia JING. 220kv city power grid maximum loadability determination with static security-constraints. Power, Energy Engineering and Management (PEEM2016), page 1, 2016.
[14] Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024.
[15] Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641, 2023.
[16] Dimitris Spathis and Fahim Kawsar. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. Journal of the American Medical Informatics Association, 31(9):2151–2158, 2024.
[17] Yingcong Li, Yixiao Huang, Muhammed E Ildiz, Ankit Singh Rawat, and Samet Oymak. Mechanics of next token prediction with self-attention. In International Conference on Artificial Intelligence and Statistics, pages 685–693. PMLR, 2024.
[18] Kostadin Cvejoski, Ramsés J Sánchez, and César Ojeda. The future is different: Large pre-trained language models fail in prediction tasks. arXiv preprint arXiv:2211.00384, 2022.
[19] Karthick Panner Selvam. Why large language models fail at precision regression, 2025.
[20] Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, et al. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703, 2023.
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[22] Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. Timeseriesexam: A time series understanding exam. arXiv preprint arXiv:2410.14752, 2024.
[23] Kyo Beom Han, Jaesung Jung, and Byung O Kang. Real-time load variability control using energy storage system for demand-side management in south korea. Energies, 14(19):6292, 2021.
[24] Claudia Greif, Raymond B Johnson, Chao an Li, Alva J Svoboda, and K Andrijeski Uemura. Short-term scheduling of electric power systems under minimum load conditions. IEEE transactions on power systems, 14(1):280–286, 1999.
[25] Xudong Yan, Huaidong Zhang, Xuemiao Xu, Xiaowei Hu, and Pheng-Ann Heng. Learning semantic context from normal samples for unsupervised anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 3110–3118, 2021.
[26] Raquel Aoki and Martin Ester. Parkca: Causal inference with partially known causes. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, pages 196–207. World Scientific, 2020.
[27] Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, pages 459–469, New York, NY, USA, 2023. Association for Computing Machinery.
[28] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.
[29] Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu. Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics, 28(1):162–169, 2017.
[30] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642, 2021.
[31] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning, 2024.
[32] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 36, 2024.
[33] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
[34] Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-moe: Billion-scale time series foundation models with mixture of experts. In The Twenty-First International Conference on Learning Representations, 2025.
[35] Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948, 2023.
[36] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
[37] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.
[38] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024.
[39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[40] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
[41] Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning in language models. arXiv preprint arXiv:2305.16582, 2023.
[42] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[43] Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. arXiv preprint arXiv:2407.18219, 2024.
[44] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

[45] Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

[46] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023

[47] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023

[48] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[49] Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, and Bang Liu. Honeycomb: A flexible LLM-based agent system for materials science. arXiv preprint arXiv:2409.00135, 2024

[50] Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. CRISPR-GPT: An LLM agent for automated design of gene-editing experiments. arXiv preprint arXiv:2404.18021, 2024

[51] Shihao Cai, Jizhi Zhang, Keqin Bao, Chongming Gao, Qifan Wang, Fuli Feng, and Xiangnan He. Agentic feedback loop modeling improves recommendation and user simulation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025

[52] Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. AdaPlanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems, 36:58202–58245, 2023

[53] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

[54] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

[55] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

[56] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning, 2024

[57] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[58] Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, Liuyi Yao, Hongyi Peng, Ze Yu Zhang, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. AgentScope: A flexible yet robust multi-agent platform. CoRR, abs/2402.14034, 2024

[59] Yunjie Liu, Evan Racah, Joaquin Correa, Amir Khosrowshahi, David Lavers, Kenneth Kunkel, Michael Wehner, William Collins, et al. Application of deep convolutional neural networks for detecting extreme weather in climate datasets. arXiv preprint arXiv:1605.01156, 2016

[60] Isaac Kofi Nti, Moses Teimeh, Owusu Nyarko-Boateng, and Adebayo Felix Adekoya. Electricity load forecasting: a systematic review. Journal of Electrical Systems and Information Technology, 7:1–19, 2020

[61] Xiangtian Zheng, Nan Xu, Loc Trinh, Dongqi Wu, Tong Huang, S Sivaranjani, Yan Liu, and Le Xie. PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids. arXiv preprint arXiv:2110.06324, 2021

A Dataset Compilation

Since complex time series reasoning remains largely underexplored, we construct a compl...

2. I require that the system load is maintained above a minimum of {load value} MW.
3. I must monitor the load ramp rate to ensure it does not exceed {constraint value} MW for each time step.
4. I need to manage the load variability so that it does not exceed {constraint value} MW over the given period.]

Think about how {influence variables} influence {targe...

data correlation: the multi variable should be correlated, sample: which A first influence B, then B have influence on C or D, there should be some time delay, as the influence on other staff needs time

data trend: there should be some trend in the data, like the data is increasing or decreasing

data seasonality: there should be some seasonality in the data, like the data is periodic

data noise: the noise should be added to the data, as the real world data is not perfect

data background: the data should have some real world background, you should first think about different real world data, and provide a description for the variable and time series data, then generate the data using the code.

CoT Sample:
Q: Approximate Relation Ratio: 0.5
Relation Matrix:
  A B C D
A 1 1 0 1
B 0 1 0 1
C 0 1 1 1
D 0 0 0 1
• A influences B and D, and itself ...

1. Advertising (A): The level of advertising spend directly impacts the sales of each store. After a delay, this starts influencing sales.
2. Sales (B): The sales numbers for each store are influenced by both the advertising and local seasonal events.
3. Economic Factors (C): Broader economic trends, like GDP growth or unemployment rates, also impact sales. ...
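
The data requirements listed above (delayed correlation between variables, trend, seasonality, noise, and a relation matrix) can be illustrated with a small sketch. This is not the authors' generator. The function name `generate_series`, the parameters `lag`, `period`, `coupling`, and `noise_std`, and the additive linear form are all assumptions made for illustration; only the relation matrix comes from the CoT sample, read so that the row variable influences the column variable.

```python
import numpy as np

# Relation matrix from the CoT sample; entry [i, j] == 1 is read as
# "variable i influences variable j" (rows and columns ordered A, B, C, D).
RELATION = np.array([
    [1, 1, 0, 1],  # A influences itself, B and D
    [0, 1, 0, 1],  # B influences itself and D
    [0, 1, 1, 1],  # C influences B, itself and D
    [0, 0, 0, 1],  # D influences only itself
])


def generate_series(relation, n_steps=200, lag=3, period=24,
                    coupling=0.5, noise_std=0.1, seed=0):
    """Hypothetical generator: trend + seasonality + noise per variable,
    plus delayed cross-variable influence given by `relation`."""
    rng = np.random.default_rng(seed)
    n_vars = relation.shape[0]
    t = np.arange(n_steps)

    # Independent base signal for every variable (all values are assumptions).
    slopes = rng.uniform(0.01, 0.05, size=n_vars)      # data trend
    phases = rng.uniform(0.0, 2 * np.pi, size=n_vars)  # seasonal phase
    noise = rng.normal(0.0, noise_std, size=(n_vars, n_steps))  # data noise
    base = (slopes[:, None] * t
            + np.sin(2 * np.pi * t / period + phases[:, None])
            + noise)

    # Keep only cross-variable links; self-influence is already in `base`.
    cross = relation.astype(float).copy()
    np.fill_diagonal(cross, 0.0)

    # Delayed influence: the value of source i at time t - lag is added to
    # target j at time t, so effects propagate along chains such as A -> B -> D.
    data = base.copy()
    for step in range(lag, n_steps):
        data[:, step] += coupling * cross.T @ data[:, step - lag]
    return data


series = generate_series(RELATION)
print(series.shape)
```

In this sketch A and C receive no incoming influence, so they stay equal to their base signals, while B and D pick up delayed contributions. The sample matrix has no feedback cycles among different variables; with cyclic relations and a large `coupling` the recursion could grow without bound, so a real generator would need a stability check.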
