Harnessing Generalist Agents for Contextualized Time Series

Avaneesh Kumar; Baoyu Jing; Hanghang Tong; Jiaru Zou; Jingrui He; Kaifeng Jin; Mengting Ai; Xuying Ning; Yanjun Zhao; Yuanchen Bei

arXiv: 2606.05404 · v1 · pith:4WZJZVUTnew · submitted 2026-06-03 · 💻 cs.AI · cs.CL · cs.LG

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

Pith reviewed 2026-06-28 05:43 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords time series analysis · LLM agents · agentic frameworks · contextual reasoning · temporal tools · multimodal memory · capability evolution

The pith

TimeClaw equips generalist LLM agents with temporal tools, evolutionary routines, and multimodal memory for open-ended contextual time series reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TimeClaw as an agentic framework that adds time series-native support to generalist LLM agents. It combines executable temporal tools for analysis that can be checked, a mechanism that evolves reusable analytical routines from experience, and episodic multimodal memory that stores and retrieves reasoning traces. These elements together aim to let agents handle full workflows on time series data that sit inside rich real-world contexts, rather than limiting them to isolated forecasting tasks. A reader would care if the approach succeeds because many practical problems in energy, finance, weather and traffic require exactly this kind of grounded, context-aware temporal work.

Core claim

TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces, thereby unlocking harnessed open-ended temporal reasoning with contextual information.

What carries the argument

The TimeClaw agentic harness, which supplies generalist LLM agents with three integrated components: executable temporal tools, experience-driven capability evolution, and episodic multimodal memory.

If this is right

Agents can perform end-to-end workflows that treat forecasting as only one step inside broader analysis loops.
Analysis becomes grounded and auditable because tools execute directly on temporal data.
Analytical routines become reusable across tasks once evolved from experience.
Relevant past reasoning traces can be retrieved via multimodal memory to inform new problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same harness structure might transfer to other structured data types that currently frustrate text-only agents.
If the evolution mechanism works, it could reduce the need to hand-craft new prompts for each domain.
Episodic memory might allow agents to build long-term expertise within a single deployment rather than resetting per session.

Load-bearing premise

That giving generalist LLM agents time series-native runtime support through tools, evolution, and memory will produce measurable gains on real-world tasks without being blocked by the base models' limits on structured temporal signals.

What would settle it

An experiment in which TimeClaw-augmented agents show no performance advantage over unmodified generalist LLM agents across the same set of energy, finance, weather and traffic benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.05404 by Avaneesh Kumar, Baoyu Jing, Hanghang Tong, Jiaru Zou, Jingrui He, Kaifeng Jin, Mengting Ai, Xuying Ning, Yanjun Zhao, Yuanchen Bei, Zihao Li.

**Figure 2.** Overview of TIMECLAW. Given contextualized time series, TIMECLAW operates through a time-series-native runtime that provides server-side numerical execution, an evolving toolbox for grounded analysis, and auditable solution trajectories (Section 3.2). The agent further improves over time through capability evolution (Section 3.3) and retrieves relevant experience from multimodal memory via multimodal finge…

**Figure 3.** Performance-efficiency trade-off on CiK.

**Figure 4.** Performance on TSAIA. TIMECLAW substantially outperforms both general agentic baselines and finance-specific agents, demonstrating its practical applicability in financial time-series analysis.

**Figure 5.** Ablation studies of TIMECLAW. (a) Retrieval size ablation shows the benefit of enabling memory retrieval. (b) Backbone model ablation shows that TIMECLAW consistently benefits from stronger LLM backbones. (c) Component ablation verifies the contributions of the tool harness, capability evolution, and memory.

**Figure 6.** Across three backbone model families, re…

**Figure 8.** Overall accuracy on TSRBench as a function…

**Figure 9.** Case study on a paired SVAR task from Context-is-Key (…

**Figure 10.** Causal maps of the case study on memory transfer in TSRBench river flood causal reasoning (R stands…

**Figure 11.** Case study of TIMECLAW on a pair of CiK SVAR tasks that share their numerical history and ground truth but differ only in the textual description of the data-generating process. With the explicit SVAR equation in context, the agent iterates the recursion on the known X0 step schedule and recovers the ground-truth ramp (sMAPE 0.44%). With only a qualitative parent list, it extrapolates a smooth low-amplitu…

**Figure 12.** Case study on a TSRBench river-flood causal-discovery instance

**Figure 13.** Case study of TIMECLAW on TSAIA’s finance MC split. Starting from a generic time-series toolbox, TIMECLAW observes recurring successful training trajectories and evolves finance-specific tools for portfolio evaluation and market-factor regression. At test time, these evolved tools replace brittle manual arithmetic with direct executable routines, improving accuracy while reducing unnecessary exploratory c…

**Figure 14.** General prompt skeleton used in TIMECLAW. A small set of optional slots covers all benchmarks; each concrete builder fills the slots appropriate to its task type and tool-availability regime. Prompt Template B.2: TSRBench Prompt. TSRBench questions come in three surface forms: (i) “perception” questions with an inline <ts><ts/> placeholder where the series should be substituted, (ii) multiple-cho…

**Figure 15.** Specific TSRBench prompt template. For Context-is-Key and TSAIA prompt template, please refer to…

**Figure 16.** The retrieval prefix (top) surfaces the analytic spine of nearest-neighbor trajectories from the memory…

Original abstract

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TimeClaw adds temporal tools, experience evolution, and multimodal memory to LLM agents for time series workflows, but the abstract leaves the size of the gains unclear.

Colleague,

The core of this paper is TimeClaw, a harness that gives generalist LLM agents three pieces of time-series-native support: executable tools for grounded analysis, an experience-driven loop that turns past runs into reusable routines, and episodic multimodal memory that stores and retrieves reasoning traces. The claim is that these let agents handle contextualized temporal tasks end-to-end instead of staying stuck in text.

The framework is a reasonable response to the mismatch the authors flag between textual agents and structured signals. The tools make outputs auditable, the evolution mechanism tries to accumulate capability without constant prompting, and the memory component keeps relevant history accessible. Releasing the code is a concrete step that lets others test the pieces directly.

The soft spot is the evaluation. The abstract states that extensive tests across energy, finance, weather, and traffic show improved performance, yet supplies no information on the baselines, the metrics, the statistical tests, or whether the controls already included temporal prompting or retrieval. Without those details it is difficult to tell whether the three components produce real lifts or whether the base model’s difficulty with structured data still caps the results. The stress-test note about the alignment gap therefore lands; the paper would be stronger if it showed direct evidence that the harness overcomes the textual limitation rather than wrapping it.

This is work for researchers who build or apply agent systems to real temporal data. It deserves a serious referee because it ships code, targets a practical gap, and lays out a clear architecture even if the experiments need tighter reporting.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces TimeClaw, a framework equipping generalist LLM agents with three components—executable temporal tools for grounded analysis, experience-driven capability evolution for reusable routines, and episodic multimodal memory for retrieving traces—to enable contextualized time series reasoning beyond pure textual spaces. The central claim is that these elements together unlock open-ended temporal reasoning with context, supported by extensive evaluations showing improved performance on benchmarks spanning energy, finance, weather, traffic, and other real-world domains.

Significance. If the empirical results are robust, the work could meaningfully advance agentic approaches to time series by providing runtime support that aligns generalist models with structured temporal signals. The open availability of code at the provided GitHub link is a strength for reproducibility and further testing.

major comments (1)

[Abstract] The claim that 'extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw' supplies no information on baselines, metrics, statistical tests, data splits, or quantitative gains. This absence makes it impossible to assess whether the central empirical claim holds or whether the three components overcome the alignment gap noted in the same paragraph.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that greater specificity is needed to allow readers to evaluate the empirical claims and will revise the abstract accordingly.

Referee: [Abstract] The claim that 'extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw' supplies no information on baselines, metrics, statistical tests, data splits, or quantitative gains. This absence makes it impossible to assess whether the central empirical claim holds or whether the three components overcome the alignment gap noted in the same paragraph.

Authors: We agree with the observation. The current abstract is intentionally high-level and does not include the requested quantitative details. In the revised manuscript we will expand the abstract to report the main baselines (vanilla LLM agents without TimeClaw components and standard time-series models), primary metrics (MAE, RMSE, and task-specific accuracy measures), data-split protocols, and observed quantitative gains (percentage improvements with statistical significance where computed). These specifics already appear in the Experiments section; the revision will simply surface them concisely in the abstract so that the central claim can be assessed at a glance.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework evaluated on external benchmarks with no derivation chain

The paper introduces an agentic framework (TimeClaw) consisting of tools, evolution mechanisms, and memory components for LLM agents on time series tasks. No mathematical derivations, equations, fitted parameters, or predictions are described. The central claims rest on empirical evaluation across external benchmarks in energy, finance, weather, and traffic domains, not on any internal self-definition or reduction to inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This matches the default case of a self-contained engineering contribution whose value is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

Based only on the abstract, the central claim rests on the assumption that the three introduced components can be integrated into generalist agents to produce better temporal reasoning; no free parameters are mentioned.

axioms (1)

domain assumption: Generalist LLM agents can be effectively augmented with domain-specific temporal runtime support to achieve contextualized reasoning.
This premise is required for the claim that the harness unlocks improved performance; it is invoked when the abstract states the components 'unlock' the reasoning.

invented entities (4)

TimeClaw framework (no independent evidence)
purpose: Agentic harness equipping LLM agents with time series support
Newly introduced system whose value is asserted in the abstract.

executable temporal tools (no independent evidence)
purpose: Grounded and auditable analysis of temporal signals
Component introduced as part of the framework.

experience-driven capability evolution (no independent evidence)
purpose: Creating reusable analytical routines from experience
Component introduced as part of the framework.

episodic multimodal memory (no independent evidence)
purpose: Retrieving relevant reasoning traces
Component introduced as part of the framework.

pith-pipeline@v0.9.1-grok · 5762 in / 1515 out tokens · 35327 ms · 2026-06-28T05:43:34.876847+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 4 linked inside Pith

[1]

In Advances in Neural Information Processing Systems, volume 29

Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc. Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2024. A decoder-only foundation model for time-series forecasting. In Proceedings of the 41st International Confere...

Pith/arXiv arXiv 2024
[2]

Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco

The causal chambers: Real physical systems as a testbed for AI methodology. arXiv preprint arXiv:2404.11341. Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. 2023. TimeGPT-1. arXiv preprint arXiv:2310.03589. Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso

arXiv 2023
[3]

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski

Monash time series forecasting archive. arXiv preprint arXiv:2105.06643. Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. MOMENT: A family of open time-series foundation models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, page...

arXiv 2024
[4]

In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21690–21698

Harnessing vision-language models for time series anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21690–21698. Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, and 1 others. 2025. Data interpreter: An LLM agent for data science. I...

Pith/arXiv arXiv 2025
[5]

In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 20271–20309

MLAgentBench: Evaluating language agents on machine learning experimentation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 20271–20309. PMLR. Qihe Huang, Zhengyang Zhou, Yangze Li, Kuo Yang, Binwu Wang, and Yang Wang. 2026a. Many minds, one goal: Time series forecast...

arXiv 2017
[6]

In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, KDD 2025, Toronto ON, Canada, August 3-7, 2025, pages 6043–6053

Multi-modal time series analysis: A tutorial and survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V.2, KDD 2025, Toronto ON, Canada, August 3-7, 2025, pages 6043–6053. ACM. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and 1 ot...

Pith/arXiv arXiv 2025
[7]

In Machine Learning for Health, ML4H@NeurIPS 2023, 10 December 2023, New Orleans, Louisiana, USA, Proceedings of Machine Learning Research, pages 244–255

Multimodal pretraining of medical time series and notes. In Machine Learning for Health, ML4H@NeurIPS 2023, 10 December 2023, New Orleans, Louisiana, USA, Proceedings of Machine Learning Research, pages 244–255. PMLR. Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, and 1 others. 2...

Pith/arXiv arXiv 2023
[8]

In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 5351–5362

UrbanGPT: Spatio-temporal large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 5351–5362. Zihao Li, Xiao Lin, Zhining Liu, Jiaru Zou, Ziwei Wu, Lecheng Zheng, Dongqi Fu, Yada Zhu, Hendrik F. Hamann, Hanghang Tong, and Jingrui He. 2025c. Language in the flow of time: Time-series-paired texts w...

arXiv 2024
[9]

In International Conference on Learning Representations, volume 2024, pages 37854–37881

TEST: Text prototype aligned embedding to activate LLM’s ability for time series. In International Conference on Learning Representations, volume 2024, pages 37854–37881. Mingtian Tan, Mike A Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen. 2024. Are language models actually useful for time series forecasting? In The Thirty-eighth Annual Conferen...

arXiv 2024
[10]

Yingqian Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, and Zhongyu Wei

Timeart: Towards agentic time series reasoning via tool-augmentation. CoRR, abs/2601.13653. Yingqian Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, and Zhongyu Wei. 2025. FinTeam: A multi-agent collaborative intelligence system for comprehensive financial scenarios. In Natural Language Processi...

arXiv 2025
[11]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis

Springer. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024a. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875–21895. Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. 2024b. Tradingagents: Multi-agents LLM financial trading framework. CoRR, abs/...

arXiv 2024
[12]

CoRR, abs/2509.01822

When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference. CoRR, abs/2509.01822. Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, and Tianyi Zhou. 2026. TSRBench: A comprehensive multi-task multi-modal time series reasoning benchmark for generalist models. CoRR, abs...

arXiv 2026
[13]

AAAI Press

Are transformers effective for time series forecasting? In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 11121–1...

arXiv 2023
[14]

strawberry

Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115. Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, and 1 others. 2023. One fits all: Power general time series analysis by pretrained LM. Advances in neural information processing syste...

arXiv 2023
[15]

struct” features describe the channel pool as a whole; “per-ch

Because the outputs are unit-norm, cosine similarity reduces to a dot product, $\cos(\phi_q, \phi_m) = \phi_q^{\top} \phi_m$, which we exploit at retrieval time. If the encoder rejects an input as too long, the descriptor is halved and re-submitted, with at most two rounds of halving before the call is treated as a hard failure. Storage. At bank-building time, the embedding is ...

arXiv 2023
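
The unit-norm shortcut and the halving retry described in this excerpt can be sketched in a few lines of plain Python. This is an illustration only, not the authors' code: `InputTooLong`, `toy_encoder`, and `embed_with_halving` are hypothetical names, the toy encoder stands in for whatever embedding service is actually used, and halving by character count is an assumption (the excerpt does not say which unit is halved).

```python
import math


class InputTooLong(Exception):
    """Hypothetical error raised by an encoder that rejects long inputs."""


def l2_normalize(vec):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def toy_encoder(text, max_chars=8):
    """Stand-in encoder (assumption): rejects inputs longer than max_chars."""
    if len(text) > max_chars:
        raise InputTooLong(len(text))
    return [float(len(text)), 1.0]


def embed_with_halving(descriptor, encoder=toy_encoder, max_halvings=2):
    """Retry with a halved descriptor, at most max_halvings times, then fail hard."""
    text = descriptor
    for attempt in range(max_halvings + 1):
        try:
            return l2_normalize(encoder(text))
        except InputTooLong:
            if attempt == max_halvings:
                raise RuntimeError("descriptor too long after halving") from None
            text = text[: max(1, len(text) // 2)]
```

With the toy encoder, a 30-character descriptor succeeds after two halvings (30, 15, 7 characters), while a 100-character descriptor exhausts both rounds and raises.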
[16]

With k = 1 on this task both stages converge on the same family (FULLCAUSALCONTEXTIMPLICITEQUATIONBIVARLINSVAR) seen during training

Two-stage retrieval. TIMECLAW embeds the task’s background + scenario + constraints block with text-embedding-3-small, takes the top-K_text records by cosine similarity from the training bank, then re-ranks the survivors by L2 distance on the 20-dim numerical fingerprint of past_time. With k = 1 on this task both stages converge on the same family (FULLCAUSALCONTEX...
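
The two-stage retrieval described here (a cosine shortlist on text embeddings, then an L2 re-rank on numerical fingerprints) can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields `text_emb` and `fingerprint`, the function name `two_stage_retrieve`, and the tiny two-dimensional vectors are all hypothetical. The excerpt's fingerprint is 20-dimensional and its embeddings come from text-embedding-3-small.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_stage_retrieve(query_text_emb, query_fingerprint, bank, top_k_text, k):
    """Stage 1: shortlist by text cosine. Stage 2: re-rank shortlist by fingerprint L2."""
    shortlist = sorted(
        bank,
        key=lambda rec: cosine(query_text_emb, rec["text_emb"]),
        reverse=True,
    )[:top_k_text]
    reranked = sorted(shortlist, key=lambda rec: l2(query_fingerprint, rec["fingerprint"]))
    return reranked[:k]


# Toy memory bank (hypothetical records, 2-dim vectors for readability).
BANK = [
    {"id": "a", "text_emb": [1.0, 0.0], "fingerprint": [5.0, 5.0]},
    {"id": "b", "text_emb": [0.9, 0.1], "fingerprint": [0.1, 0.0]},
    {"id": "c", "text_emb": [0.0, 1.0], "fingerprint": [0.0, 0.0]},
]

top = two_stage_retrieve([1.0, 0.0], [0.0, 0.0], BANK, top_k_text=2, k=1)
```

In the toy bank, record `c` has the closest fingerprint but is dropped in the text stage, so the re-rank picks `b` over `a`. That is the intended effect of ordering the stages this way: numerical similarity only breaks ties among records whose context already matches.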
[17]

Reference compression. The retrieved trajectory is compressed by summarize_trajectory into the trainer’s analytic spine: a context→forecast explanation, together with the sequence of MCP tool calls and their truncated responses. This summary is then injected before the test prompt as a REFERENCES FROM PRIOR TRAINING block, surfacing the transferable rule that th...
[18]

Context parsing. The agent reads the explicit linear SVAR equation from the prompt ($X_t^{1} = 1.322\,X_{t-1}^{0} - 0.604\,X_{t-1}^{1} + 0.926\,X_{t-2}^{0} + 0.763\,X_{t-2}^{1} - 0.851\,X_{t-3}^{0} + 0.623\,X_{t-3}^{1}$) and the piecewise-constant schedule for X0 over the 32-step horizon
[19]

1")→compute_acf(

Tool-grounded historical inspection. Guided by the retrieved spine, the agent invokes the MCP analysis server: series_overview() → channel_stats("1") → compute_acf("1", max_lag=20) → channel_values("1", 120, 128), confirming the recent X1 levels needed to seed the recursion
[20]

parents for X1 at lag k ∈ {1,2,3} are X0, X1,

Closed-form forecast. With the equation, the recent lagged values, and the future X0 schedule all known, the agent unrolls the recursion deterministically, producing a 32-step forecast that tracks the ground-truth ramp from 0.035 → 1.03 almost exactly. Counterfactual: Minimal Context, Same Series. On the matched MINIMAL-context variant, the prompt only states t...
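
The deterministic unrolling described in this step can be sketched as below. The coefficients are the ones quoted in the context-parsing excerpt, read as three lags of X0 and X1 driving X1; that reading of the flattened equation is an assumption. The function name `unroll_svar` and the toy inputs are hypothetical and are not the benchmark series.

```python
# lag -> (coefficient on X0 at that lag, coefficient on X1 at that lag)
COEFFS = {
    1: (1.322, -0.604),
    2: (0.926, 0.763),
    3: (-0.851, 0.623),
}


def unroll_svar(x0_hist, x1_hist, x0_future):
    """Iterate the X1 recursion forward over the known future X0 schedule."""
    if len(x0_hist) != len(x1_hist) or len(x1_hist) < max(COEFFS):
        raise ValueError("need aligned histories with at least three points")
    x0 = list(x0_hist) + list(x0_future)
    x1 = list(x1_hist)
    start = len(x1)
    for t in range(start, start + len(x0_future)):
        value = sum(a * x0[t - lag] + b * x1[t - lag] for lag, (a, b) in COEFFS.items())
        x1.append(value)  # each forecast feeds the next step's lagged terms
    return x1[start:]


# Toy example: constant X0 = 1, X1 starting at rest.
forecast = unroll_svar([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0])
```

No model fitting is involved: once the equation and the X0 schedule are given in context, the forecast is pure arithmetic. That is why the excerpt contrasts this case with the minimal-context variant, where the equation is not available.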
[21]

Then TIMECLAW retrieves k = 3 records

Two-stage retrieval. TIMECLAW embeds the test prompt with text-embedding-3-small and selects the top-K_text records from the training bank by cosine, then re-ranks the survivors by L2 distance on the fingerprint of the loaded series. Then TIMECLAW retrieves k = 3 records
[22]

the mechanism mapping is upstream-to-downstream edges

Rule extraction via summarize_trajectory. Each retrieved trajectory is compressed to a tool-call spine plus a context_to_action block written during training. One block of the retrieved action in memory reads verbatim: “the mechanism mapping is upstream-to-downstream edges.” This is the transferable decision rule that primes the test agent
[23]

Execution Stage

Prompt injection. The compressed references are concatenated as a REFERENCES FROM PRIOR TRAINING block in front of the test prompt, carrying only the analytic spine and decision rules. Execution Stage
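
The injection step can be pictured as simple string assembly. The sketch below is a guess at the shape, not the authors' template: `build_prompt` and the `tool_spine` field are hypothetical, while the header text and the `context_to_action` name follow the excerpts above.

```python
def build_prompt(test_prompt, references, header="REFERENCES FROM PRIOR TRAINING"):
    """Prepend compressed references (tool-call spine plus rule) to the test prompt."""
    if not references:
        # k = 0: no memory, so the prompt is passed through unchanged.
        return test_prompt
    lines = [header]
    for i, ref in enumerate(references, start=1):
        lines.append(f"[{i}] tool calls: " + " -> ".join(ref["tool_spine"]))
        lines.append(f"    rule: {ref['context_to_action']}")
    return "\n".join(lines) + "\n\n" + test_prompt


# Hypothetical reference built from the tool names and rule quoted in the excerpts.
REFS = [
    {
        "tool_spine": ["list_channels", "series_overview", "compute_acf"],
        "context_to_action": "the mechanism mapping is upstream-to-downstream edges.",
    }
]

prompt = build_prompt("Which option matches the causal structure?", REFS)
```

Passing an empty reference list returns the test prompt untouched, which mirrors a no-memory run in which only the references block is removed.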
[24]

It skips the exhaustive per-channel statistics, peak-finding, and periodicity-detection that the no-memory agent runs through

Targeted tool use. Primed by the retrieved rule, the agent runs only the inspections needed to confirm direction: list_channels → series_overview → five compute_acf probes, one per river. It skips the exhaustive per-channel statistics, peak-finding, and periodicity-detection that the no-memory agent runs through
[25]

The rule then determines option D, in which the sink row R176 is densely populated with parents from the tributaries and the tributaries themselves are source rows

Mapping the rule onto the new river set. The agent matches its own observations to the retrieved rule: the channels with mean flow ∼660–680 (R175, R176) form the downstream sinks; the three with mean flow < 4 (R77, R546, R1071) are upstream tributaries. The rule then determines option D, in which the sink row R176 is densely populated with parents from the trib...
[26]

Counterfactual: Same Series, No Memory. Re-running the identical task at k = 0 removes only the REFERENCES block

Answer realisation. The agent emits the final letter <answer>D</answer>, with a brief justification citing the retrieved cascade rule. Counterfactual: Same Series, No Memory. Re-running the identical task at k = 0 removes only the REFERENCES block. Without memory, the agent resorts to brute-force inspection, issuing 22 tool calls across channel_stats, compute_acf, fi...
[27]

Each successful trajectory is summarized into a compact routine description that records the input conventions, intermediate computations, and final decision rule

Routine extraction via summarize_trajectory. During training, three recurring analytical routines appear across successful trajectories: portfolio risk-adjusted return estimation, portfolio risk estimation, and market-factor regression. Each successful trajectory is summarized into a compact routine description that records the input conventions, intermedia...
[28]

parametric

Evolving finance-specific tools. From these recurring routines, TIMECLAW evolves three finance-specific tool schemas and adds them to the agent’s reusable toolbox:
• portfolio_sharpe(channels, weights, risk_free, period_per_year): compute risk-adjusted return for each portfolio option and compare the returned ratios.
• portfolio_var(channels, weights, horizon...
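
To make the first schema concrete, here is one way a tool with that signature could compute an annualized Sharpe ratio. Only the name and parameter list come from the excerpt. The body is an assumption: it treats each channel as a list of per-period simple returns, treats `risk_free` as an annual rate, and uses the sample standard deviation. The evolved tool may take channel names and handle prices or other conventions differently.

```python
import math
import statistics


def portfolio_sharpe(channels, weights, risk_free, period_per_year):
    """Annualized Sharpe ratio of a weighted portfolio (illustrative assumptions)."""
    if len(channels) != len(weights):
        raise ValueError("one weight per channel is required")
    n = len(channels[0])
    if n < 2 or any(len(c) != n for c in channels):
        raise ValueError("channels must share a length of at least two periods")
    per_period_rf = risk_free / period_per_year
    # Excess return of the weighted portfolio in each period.
    excess = [
        sum(w * c[t] for w, c in zip(weights, channels)) - per_period_rf
        for t in range(n)
    ]
    spread = statistics.stdev(excess)
    if spread == 0.0:
        raise ValueError("zero volatility; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(period_per_year)


# Toy quarterly returns (hypothetical data) for two identical assets.
RETURNS = [[0.01, 0.03, 0.02], [0.01, 0.03, 0.02]]
ratio = portfolio_sharpe(RETURNS, [0.5, 0.5], risk_free=0.0, period_per_year=4)
```

Comparing portfolio options then amounts to calling the function once per option and picking the largest ratio, which is the kind of repeated arithmetic the excerpt says the evolved tools take over from the agent.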
[29]

Inspect the series with whatever tools are available and walk through the reasoning step by step
[30]

scenario says ‘heat wave for 2 hours’ → ground truth has a 4× spike at the stated start → because air-conditioning load scales with cooling demand

Output a <context_to_action>...</context_to_action> block (≤3 sentences) that explicitly states: • which sentences in the context (background / scenario / constraints / question / options) drive the answer’s shape, AND • the rule that maps those sentences to that shape (e.g. “scenario says ‘heat wave for 2 hours’ → ground truth has a 4× spike at the stated sta...

[1] [1]

InAdvances in Neural Information Processing Sys- tems, volume 29

Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. InAdvances in Neural Information Processing Sys- tems, volume 29. Curran Associates, Inc. Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2024. A decoder-only foundation model for time-series forecasting. InProceedings of the 41st International Confere...

Pith/arXiv arXiv 2024

[2] [2]

Azul Garza, Cristian Challu, and Max Mergenthaler- Canseco

The causal chambers: Real physical systems as a testbed for ai methodology.arXiv preprint arXiv:2404.11341. Azul Garza, Cristian Challu, and Max Mergenthaler- Canseco. 2023. TimeGPT-1.arXiv preprint arXiv:2310.03589. Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso

arXiv 2023

[3] [3]

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski

Monash time series forecasting archive.arXiv preprint arXiv:2105.06643. Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. MO- MENT: A family of open time-series foundation models. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, page...

arXiv 2024

[4] [4]

InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21690–21698

Harnessing vision-language models for time series anomaly detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21690–21698. Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, and 1 others. 2025. Data interpreter: An LLM agent for data science. I...

Pith/arXiv arXiv 2025

[5] [5]

InProceed- ings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 20271–20309

MLAgentBench: Evaluating language agents on machine learning experimentation. InProceed- ings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 20271–20309. PMLR. Qihe Huang, Zhengyang Zhou, Yangze Li, Kuo Yang, Binwu Wang, and Yang Wang. 2026a. Many minds, one goal: Time series forecast...

arXiv 2017

[6] [6]

InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, V .2, KDD 2025, Toronto ON, Canada, August 3-7, 2025, pages 6043–6053

Multi-modal time series analysis: A tutorial and survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, V .2, KDD 2025, Toronto ON, Canada, August 3-7, 2025, pages 6043–6053. ACM. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and 1 ot...


[7]

Multimodal pretraining of medical time series and notes. In Machine Learning for Health, ML4H@NeurIPS 2023, 10 December 2023, New Orleans, Louisiana, USA, Proceedings of Machine Learning Research, pages 244–255. PMLR. Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, and 1 others. 2...


[8]

Urbangpt: Spatio-temporal large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 5351–5362. Zihao Li, Xiao Lin, Zhining Liu, Jiaru Zou, Ziwei Wu, Lecheng Zheng, Dongqi Fu, Yada Zhu, Hendrik F. Hamann, Hanghang Tong, and Jingrui He. 2025c. Language in the flow of time: Time-series-paired texts w...


[9]

Test: Text prototype aligned embedding to activate llm’s ability for time series. In International Conference on Learning Representations, volume 2024, pages 37854–37881. Mingtian Tan, Mike A Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen. 2024. Are language models actually useful for time series forecasting? In The Thirty-eighth Annual Conferen...


[10]

Timeart: Towards agentic time series reasoning via tool-augmentation. CoRR, abs/2601.13653. Yingqian Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, and Zhongyu Wei. 2025. Finteam: A multi-agent collaborative intelligence system for comprehensive financial scenarios. In Natural Language Processi...


[11]

Springer. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024a. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875–21895. Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. 2024b. Tradingagents: Multi-agents LLM financial trading framework. CoRR, abs/...


[12]

When LLM meets time series: Can llms perform multi-step time series reasoning and inference. CoRR, abs/2509.01822. Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, and Tianyi Zhou. 2026. Tsrbench: A comprehensive multi-task multi-modal time series reasoning benchmark for generalist models. CoRR, abs...


[13]

AAAI Press

Are transformers effective for time series forecasting? In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 11121–1...


[14]

Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115. Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, and 1 others. 2023. One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing syste...


[15]

struct” features describe the channel pool as a whole; “per-ch

Because the outputs are unit-norm, cosine similarity reduces to a dot product, $\cos(\phi_q, \phi_m) = \phi_q^{\top} \phi_m$, which we exploit at retrieval time. If the encoder rejects an input as too long, the descriptor is halved and re-submitted, with at most two rounds of halving before the call is treated as a hard failure.

Storage. At bank-building time, the embedding is ...
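The two mechanisms described above (dot-product similarity on unit-norm embeddings, and the halve-and-retry policy for over-long descriptors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `cosine_unit`, `embed_with_halving`, and `InputTooLong` are hypothetical, the encoder is passed in as a plain callable, and keeping the first half of the descriptor when halving is an assumption.

```python
import math


class InputTooLong(Exception):
    """Hypothetical error raised by an encoder that rejects over-long input."""


def unit_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_unit(phi_q, phi_m):
    """Cosine similarity for vectors that are ALREADY unit-norm: a plain dot product."""
    return sum(a * b for a, b in zip(phi_q, phi_m))


def embed_with_halving(descriptor, encode, max_halvings=2):
    """Call encode(descriptor); on rejection, halve the text and retry.

    At most `max_halvings` rounds of halving are attempted. If the encoder
    still rejects the input, the error propagates as a hard failure.
    Assumption: the first half of the descriptor is the part that is kept.
    """
    text = descriptor
    for attempt in range(max_halvings + 1):
        try:
            return encode(text)
        except InputTooLong:
            if attempt == max_halvings:
                raise
            text = text[: len(text) // 2]
```

With `max_halvings=2` the encoder is called at most three times: once on the full descriptor and once after each of the two halvings.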


[16]

Two-stage retrieval. TIMECLAW embeds the task’s background + scenario + constraints block with text-embedding-3-small, takes the top-$K_{\text{text}}$ records by cosine similarity from the training bank, then re-ranks the survivors by L2 distance on the 20-dim numerical fingerprint of past_time. With k = 1 on this task both stages converge on the same family (FULLCAUSALCONTEXTIMPLICITEQUATIONBIVARLINSVAR) seen during training...
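The retrieve-then-re-rank pattern described above can be sketched in a few lines. This is an illustrative sketch only: the record layout (`emb`, `fingerprint`) and the function name `two_stage_retrieve` are assumptions, and the embeddings are taken to be unit-norm so that cosine similarity is a dot product.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance between two numerical fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_stage_retrieve(query_emb, query_fp, bank, k_text, k):
    """Stage 1: keep the k_text records most similar in text-embedding space.
    Stage 2: re-rank those survivors by fingerprint distance, return the top k.

    `bank` is a list of dicts with hypothetical keys "emb" and "fingerprint".
    """
    survivors = sorted(bank, key=lambda r: dot(query_emb, r["emb"]), reverse=True)[:k_text]
    reranked = sorted(survivors, key=lambda r: l2(query_fp, r["fingerprint"]))
    return reranked[:k]
```

Note that a record whose fingerprint is closest to the query can still be excluded if it does not survive the text stage, which is the intended effect of filtering on context first.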

[17]

Reference compression. The retrieved trajectory is compressed by summarize_trajectory into the trainer’s analytic spine: a context→forecast explanation, together with the sequence of MCP tool calls and their truncated responses. This summary is then injected before the test prompt as a REFERENCES FROM PRIOR TRAINING block, surfacing the transferable rule that th...

[18]

Context parsing. The agent reads the explicit linear SVAR equation from the prompt,

$$X^{1}_{t} = 1.322\,X^{0}_{t-1} - 0.604\,X^{1}_{t-1} + 0.926\,X^{0}_{t-2} + 0.763\,X^{1}_{t-2} - 0.851\,X^{0}_{t-3} + 0.623\,X^{1}_{t-3},$$

and the piece-wise constant schedule for $X^{0}$ over the 32-step horizon.
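Once the equation and the future schedule of the driver channel are known, the forecast is a deterministic recursion. The sketch below assumes the equation is read as lags 1 to 3 of channels X0 and X1 acting on X1 (the function name `unroll_svar` and the list-based interface are hypothetical); it illustrates the unrolling step, not the agent's actual code.

```python
# Coefficients read from the prompt equation, ordered by lag k = 1, 2, 3.
COEF_X0 = (1.322, 0.926, -0.851)   # weights on X0 at t-1, t-2, t-3
COEF_X1 = (-0.604, 0.763, 0.623)   # weights on X1 at t-1, t-2, t-3


def unroll_svar(x0_hist, x1_hist, x0_future):
    """Forecast X1 over len(x0_future) steps by unrolling the recursion.

    x0_hist and x1_hist are aligned histories (at least 3 points each);
    x0_future is the known schedule of X0 over the forecast horizon.
    """
    if len(x0_hist) != len(x1_hist) or len(x1_hist) < 3:
        raise ValueError("need aligned histories with at least 3 points")
    x0 = list(x0_hist) + list(x0_future)
    x1 = list(x1_hist)
    n_hist = len(x1_hist)
    for step in range(len(x0_future)):
        t = n_hist + step
        value = sum(
            a * x0[t - k] + b * x1[t - k]
            for k, (a, b) in enumerate(zip(COEF_X0, COEF_X1), start=1)
        )
        x1.append(value)  # each forecast feeds the next step's lags
    return x1[n_hist:]
```

Because only lagged values enter the equation, every step depends on quantities that are already known or already forecast, so no optimisation or sampling is needed.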

[19]

Tool-grounded historical inspection. Guided by the retrieved spine, the agent invokes the MCP analysis server: series_overview() → channel_stats("1") → compute_acf("1", max_lag=20) → channel_values("1", 120, 128), confirming the recent $X^{1}$ levels needed to seed the recursion.

[20]

parents for $X^{1}$ at lag $k \in \{1,2,3\}$ are $X^{0}$, $X^{1}$,

Closed-form forecast. With the equation, the recent lagged values, and the future $X^{0}$ schedule all known, the agent unrolls the recursion deterministically, producing a 32-step forecast that tracks the ground-truth ramp from 0.035 → 1.03 almost exactly.

Counterfactual: Minimal Context, Same Series. On the matched MINIMAL-context variant, the prompt only states t...

[21]

Two-stage retrieval. TIMECLAW embeds the test prompt with text-embedding-3-small and selects the top-$K_{\text{text}}$ records from the training bank by cosine similarity, then re-ranks the survivors by L2 distance on the fingerprint of the loaded series. TIMECLAW then retrieves k = 3 records.

[22]

Rule extraction via summarize_trajectory. Each retrieved trajectory is compressed to a tool-call spine plus a context_to_action block written during training. One block of the retrieved action in memory reads verbatim: “the mechanism mapping is upstream-to-downstream edges.” This is the transferable decision rule that primes the test agent.
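As a rough illustration of this compression, the sketch below reduces a stored trajectory to a tool-call spine with truncated responses plus its context_to_action rule, then renders the result as a REFERENCES FROM PRIOR TRAINING block placed in front of a test prompt. The trajectory fields, the 80-character truncation length, the block layout, and all function names are assumptions made for the example; the source does not specify them.

```python
def summarize_trajectory_sketch(trajectory, max_chars=80):
    """Keep only the tool-call spine (with truncated responses) and the rule.

    `trajectory` is a dict with hypothetical keys:
      "tool_calls": list of {"name", "args", "response"}
      "context_to_action": the rule text written during training
    """
    spine = []
    for call in trajectory["tool_calls"]:
        response = call["response"]
        if len(response) > max_chars:
            response = response[:max_chars] + "..."
        spine.append(f'{call["name"]}({call["args"]}) -> {response}')
    return {"spine": spine, "context_to_action": trajectory["context_to_action"]}


def build_references_block(summaries):
    """Render compressed references as one text block (layout is an assumption)."""
    lines = ["REFERENCES FROM PRIOR TRAINING"]
    for i, summary in enumerate(summaries, start=1):
        lines.append(f"[{i}] rule: {summary['context_to_action']}")
        lines.extend(f"    {step}" for step in summary["spine"])
    return "\n".join(lines)


def inject_references(references_block, test_prompt):
    """Place the references block in front of the test prompt."""
    return references_block + "\n\n" + test_prompt
```

Only the spine and the rule reach the test prompt in this sketch; raw series values and full tool outputs are left out, which keeps the injected context short.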

[23]

Prompt injection. The compressed references are concatenated as a REFERENCES FROM PRIOR TRAINING block in front of the test prompt; only the analytic spine and decision rules are included.

Execution Stage

[24]

Targeted tool use. Primed by the retrieved rule, the agent runs only the inspections needed to confirm direction: list_channels → series_overview → five compute_acf probes, one per river. It skips the exhaustive per-channel statistics, peak-finding, and periodicity-detection that the no-memory agent runs through.

[25]

Mapping the rule onto the new river set. The agent matches its own observations to the retrieved rule: the channels with mean flow ∼660–680 (R175, R176) form the downstream sinks; the three with mean flow < 4 (R77, R546, R1071) are upstream tributaries. The rule then determines option D, in which the sink row R176 is densely populated with parents from the tributaries and the tributaries themselves are source rows...
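The mapping step can be pictured as a split of channels by mean flow followed by drawing edges from the low-flow group to the high-flow group. The sketch below is a hypothetical illustration of that reading: the relative threshold (`ratio`), the function name, and the choice to connect every tributary to every sink are assumptions, and the mean-flow numbers in the usage note are placeholders within the ranges quoted above, not measured values.

```python
def split_by_mean_flow(mean_flow, ratio=10.0):
    """Split channels into downstream sinks and upstream tributaries.

    A channel counts as a sink when its mean flow is within a factor `ratio`
    of the largest mean flow (an assumed heuristic). Edges follow the
    retrieved rule: upstream (tributary) -> downstream (sink).
    """
    top = max(mean_flow.values())
    sinks = sorted(ch for ch, m in mean_flow.items() if m >= top / ratio)
    tributaries = sorted(ch for ch in mean_flow if ch not in sinks)
    edges = [(up, down) for up in tributaries for down in sinks]
    return sinks, tributaries, edges
```

For example, with placeholder means `{"R175": 660.0, "R176": 680.0, "R77": 3.0, "R546": 2.0, "R1071": 1.0}`, the two high-flow channels are classified as sinks and every edge points from a tributary to a sink, so tributaries have no parents.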

[26]

Answer realisation. The agent emits the final letter <answer>D</answer>, with a brief justification citing the retrieved cascade rule.

Counterfactual: Same Series, No Memory. Re-running the identical task at k = 0 removes only the REFERENCES block. Without memory, the agent resorts to brute-force inspection, issuing 22 tool calls across channel_stats, compute_acf, fi...

[27]

Routine extraction via summarize_trajectory. During training, three recurring analytical routines appear across successful trajectories: portfolio risk-adjusted return estimation, portfolio risk estimation, and market-factor regression. Each successful trajectory is summarized into a compact routine description that records the input conventions, intermediate computations, and final decision rule...

[28]

parametric

Evolving finance-specific tools. From these recurring routines, TIMECLAW evolves three finance-specific tool schemas and adds them to the agent’s reusable toolbox:
• portfolio_sharpe(channels, weights, risk_free, period_per_year): compute risk-adjusted return for each portfolio option and compare the returned ratios.
• portfolio_var(channels, weights, horizon...
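To make the first schema concrete, here is a minimal sketch of an annualised Sharpe ratio with the same argument names. It is not the evolved tool itself: the function name `portfolio_sharpe_sketch`, the assumption that each channel is a list of per-period simple returns, the treatment of `risk_free` as an annual rate, and the default of 252 periods per year are all choices made for this example.

```python
import math
import statistics


def portfolio_sharpe_sketch(channels, weights, risk_free=0.0, period_per_year=252):
    """Annualised Sharpe ratio of a weighted portfolio.

    channels: list of equally long lists of per-period returns (assumption).
    weights:  one weight per channel, in the same order.
    risk_free: annual risk-free rate, converted to a per-period rate.
    """
    if len(channels) != len(weights):
        raise ValueError("one weight per channel is required")
    n_periods = len(channels[0])
    portfolio = [
        sum(w * ch[t] for w, ch in zip(weights, channels)) for t in range(n_periods)
    ]
    excess = [r - risk_free / period_per_year for r in portfolio]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(period_per_year)
```

Comparing portfolio options then amounts to calling the function once per weight vector and ranking the returned ratios.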

[29]

Inspect the series with whatever tools are available and walk through the reasoning step by step

[30]

Output a <context_to_action>...</context_to_action> block (≤ 3 sentences) that explicitly states:
• which sentences in the context (background / scenario / constraints / question / options) drive the answer’s shape, AND
• the rule that maps those sentences to that shape (e.g. “scenario says ‘heat wave for 2 hours’ → ground truth has a 4× spike at the stated start → because air-conditioning load scales with cooling demand...
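A trainer that consumes this output needs to pull the block out of the model's response and check the length limit. The sketch below shows one way to do that with Python's standard `re` module; the function name is hypothetical and the sentence splitter is a crude punctuation-based heuristic, so the sentence count should be treated as approximate.

```python
import re

_BLOCK = re.compile(r"<context_to_action>(.*?)</context_to_action>", re.DOTALL)


def extract_context_to_action(response_text, max_sentences=3):
    """Return (rule, within_limit), or None when no block is present.

    rule: block contents with whitespace collapsed to single spaces.
    within_limit: True when the heuristic sentence count is <= max_sentences.
    """
    match = _BLOCK.search(response_text)
    if match is None:
        return None
    rule = " ".join(match.group(1).split())
    # Heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", rule) if s]
    return rule, len(sentences) <= max_sentences
```

Returning `None` for a missing block lets the caller decide whether to discard the trajectory or store it without a rule.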