pith. machine review for the scientific record.

arxiv: 2605.10038 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links


TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series · exploratory execution · AI agent · LLM agent · tool use · hierarchical experience · forecasting · reasoning tasks
0 comments

The pith

TimeClaw turns exploratory executions into reusable hierarchical patterns that improve time-series agent performance without retraining the base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TimeClaw to help AI agents handle time-series tasks such as forecasting and reasoning in finance and weather. Instead of stopping at the first successful execution, the system explores multiple tool-use paths, compares them using task metrics, distills the useful structures into hierarchical experience, and reinjects that experience on later tasks. This approach keeps the underlying language model frozen and avoids any online adaptation during testing. The method matters because early successes in numeric domains often lock agents into suboptimal tool choices, limiting accuracy even when better procedures exist. Evaluation on an MTBench-aligned set of 17 tasks shows steady improvements over standard baselines.

Core claim

TimeClaw is an exploratory execution learning framework built around a four-stage loop of Explore, Compare, Distill, and Reinject. It applies metric-supervised comparison across candidate executions, together with task-aware tool dropout, to extract reusable hierarchical distilled experience. This experience is stored and reinjected at inference time to guide future decisions. The base model remains frozen throughout, with no test-time adaptation. In an evaluation spanning 17 tasks aligned with MTBench in finance, weather prediction, and reasoning, the system produces consistent gains over baselines. The authors conclude that the limiting factor in these scientific systems is less raw execution-time capability than how exploratory experience is compared, distilled, and reused.

What carries the argument

The four-stage Explore-Compare-Distill-Reinject loop that converts multiple exploratory executions into reusable hierarchical distilled experience through metric supervision and task-aware tool dropout.
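The loop above can be sketched in a few lines. Everything here is an illustrative stand-in, not the paper's API: `toy_execute`, the dict fields, and the top-k distillation rule are assumptions made for the sketch.

```python
import random

def explore(task, execute, rng, n_candidates=4):
    # Explore: sample several tool-use executions instead of
    # stopping at the first valid one.
    return [execute(task, rng) for _ in range(n_candidates)]

def compare(candidates, metric, target):
    # Compare: rank candidate executions by the task metric
    # (lower error is better).
    return sorted(candidates, key=lambda c: metric(c["prediction"], target))

def distill(ranked, top_k=2):
    # Distill: keep the tool chains of the best executions as a
    # stand-in for hierarchical distilled experience.
    return [c["trace"] for c in ranked[:top_k]]

def reinject(memory, patterns):
    # Reinject: store patterns for reuse on later tasks; the base
    # model itself is never updated.
    for p in patterns:
        if p not in memory:
            memory.append(p)
    return memory

def toy_execute(task, rng):
    # Hypothetical executor: picks a two-tool chain and returns a
    # noisy prediction of the task target.
    chain = tuple(rng.sample(["forecast", "indicator_macd", "correlation"], 2))
    return {"trace": chain, "prediction": task["target"] + rng.uniform(-3, 3)}

rng = random.Random(0)
task = {"target": 10.0}
ranked = compare(explore(task, toy_execute, rng),
                 lambda p, t: abs(p - t), task["target"])
memory = reinject([], distill(ranked))
```

One iteration leaves the tool chains of the two best executions in `memory`, ready to prime later tasks while the model stays frozen.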

If this is right

  • Produces consistent gains over baselines across 17 tasks in finance, weather prediction, and reasoning.
  • Avoids tool-prior collapse by continuing exploration even after early valid executions.
  • Enables reuse of hierarchical patterns at inference time while the base model stays frozen.
  • Shifts the bottleneck in scientific AI agents from execution capability to how exploratory experience is compared, distilled, and reused.
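The tool-prior-collapse point can be made concrete with a minimal dropout rule. Masking the top tools by historical usage count is an assumed simplification of the paper's task-aware mechanism; the function name and `mask_top` default are placeholders.

```python
def task_aware_tool_dropout(visible_tools, usage_counts, mask_top=2):
    # Temporarily hide the most-used tools so exploration does not
    # collapse onto early favorites. The rule (mask the top `mask_top`
    # tools by historical count) is illustrative only.
    by_count = sorted(visible_tools,
                      key=lambda t: usage_counts.get(t, 0), reverse=True)
    masked = set(by_count[:mask_top])
    return [t for t in visible_tools if t not in masked]

counts = {"forecast": 120, "indicator_macd": 80, "correlation": 5}
tools = ["forecast", "indicator_macd", "correlation", "trend_past"]
print(task_aware_tool_dropout(tools, counts))
# → ['correlation', 'trend_past']
```

The two dominant tools disappear from the visible set for this round, forcing the agent to try the long-tail tools it would otherwise ignore.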

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be applied to other tool-using agents in sequential decision domains such as code synthesis or robotic planning.
  • Hierarchical distillation may allow static models to accumulate cross-task memory without periodic retraining.
  • Focusing on experience reuse rather than model updates could simplify deployment and reduce compute costs in production settings.

Load-bearing premise

Metric-supervised comparison of exploratory executions can reliably extract reusable hierarchical patterns that improve later performance without selection bias or any need to adapt the base model.

What would settle it

On the same 17-task benchmark, removing the Compare and Distill stages or replacing metric comparison with random selection of executions eliminates the reported gains and returns performance to baseline levels.
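The random-selection half of that test reduces to a simple comparison: if metric-supervised selection carries the gains, picking the best candidate by its task metric should beat picking one at random. The harness below uses synthetic uniform candidate errors purely for illustration; nothing about the distribution comes from the paper.

```python
import random
import statistics

def selection_gap(n_tasks=200, n_candidates=4, seed=0):
    # Simulated ablation: per task, draw candidate errors, then
    # compare metric-based selection (keep the minimum-error
    # candidate) against the ablated random selection.
    rng = random.Random(seed)
    metric_err, random_err = [], []
    for _ in range(n_tasks):
        errors = [rng.uniform(0.0, 1.0) for _ in range(n_candidates)]
        metric_err.append(min(errors))          # Compare stage keeps the best
        random_err.append(rng.choice(errors))   # ablated: random selection
    return statistics.mean(random_err) - statistics.mean(metric_err)
```

A clearly positive gap is what the paper's mechanism predicts; a gap near zero on the real benchmark would support the referee's worry that the Compare stage is not load-bearing.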

Figures

Figures reproduced from arXiv: 2605.10038 by Dongyuan Li, Hangchen Liu, Jiewen Deng, Renhe Jiang, Weiwei Ye, Yoshihide Sekimoto.

Figure 1. Overview of TimeClaw. The framework follows four stages: exploration-time learning […]
Figure 2. Case study of post-shock MACD forecasting. The figure compares TimeClaw Unexplored […]
Figure 3. Tool-dropout mechanism. Frequently used tools are temporarily masked according to their historical usage counts to encourage broader exploration over candidate tools.
Figure 4. Tool exploration analysis. We evaluate whether tool dropout mitigates tool-prior collapse during exploratory execution learning on four tasks: forecast, indicator_macd, correlation, and trend_past. We use two behavioral metrics: tool coverage rate, measuring how many visible tools are invoked within an exploration prefix, and Top-5 tool share, measuring concentration on the five most-used tools […]
Figure 5. Execution sample on weather forecast task.
Figure 6. Execution sample on finance forecast task.
Figure 7. Execution sample on weather future-trend task.
Figure 8. Execution sample on finance trend task.
Figure 9. Execution sample on weather past-trend task.
Figure 10. Inference sample on weather forecast task.
Figure 11. Inference sample on finance trend task.
Figure 12. Inference sample on weather future-trend task.
Figure 13. Inference sample on weather future-trend task.
Figure 14. Inference sample on finance forecast task.
read the original abstract

Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TimeClaw, an exploratory execution learning framework for time-series AI agents. It uses a four-stage Explore-Compare-Distill-Reinject loop to convert exploratory executions into reusable hierarchical distilled experience, incorporating metric-supervised comparison, task-aware tool dropout, and inference-time reinjection while keeping the base model frozen. The central claim is that this approach yields consistent performance gains over baselines on 17 MTBench-aligned tasks spanning finance, weather prediction, and reasoning domains.

Significance. If the empirical results hold with proper controls, TimeClaw could advance LLM-based agents in numeric domains by addressing tool-prior collapse and enabling reuse of exploratory experience without online adaptation or fine-tuning. The frozen-base-model design and focus on verifiable numeric settings are strengths that distinguish it from typical execution-centric systems.

major comments (3)
  1. [Abstract and Evaluation] The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.
  2. [Method, Compare stage] The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.
  3. [Experiments] No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.
minor comments (2)
  1. [Abstract] The term 'MTBench-aligned' is used without a clear definition or citation in the abstract; a brief explanation or reference should be added in the introduction.
  2. [Method] Notation for the hierarchical distilled experience could be clarified with a small diagram or pseudocode in the Method section to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.

    Authors: We agree that the abstract would be improved by including quantitative anchors for the central claim. In the revised manuscript we have updated the abstract to reference the specific tables in the Experiments section that report average improvements, effect sizes, standard deviations across runs, baseline implementation details, and statistical test results. The full quantitative support remains in the evaluation, now more explicitly signposted from the abstract. revision: yes

  2. Referee: [Method, Compare stage] The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.

    Authors: We acknowledge that the original description of the Compare stage was insufficiently explicit about bias controls. The revised Method section now details the sampling rule for high-metric paths (top-k selection with a task-metric threshold), the precise application of task-aware tool dropout to reduce lucky-execution bias, and the cross-instance validation procedure used to confirm that distilled patterns transfer beyond the source instance. These additions clarify how the loop favors reusable hierarchical patterns. revision: yes

  3. Referee: [Experiments] No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.

    Authors: We agree that greater transparency is required. The revised Experiments section now contains an explicit table enumerating all 17 tasks with their domains and metrics, a description of the MTBench alignment procedure (category mapping and prompt adaptation for time-series tool use), and new ablation experiments that isolate the Distill and Reinject stages against a pure-exploration baseline. These ablations demonstrate the incremental contribution of each component. revision: yes
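The rebuttal's stated Compare-stage rule, top-k selection under a task-metric threshold, can be sketched directly. The function name, the `k` and `threshold` defaults, and the example tool chains are all placeholders, not reported settings from the paper.

```python
def select_for_distillation(candidates, errors, k=2, threshold=1.0):
    # Keep at most k candidate executions whose task-metric error
    # falls under the threshold, ordered best-first. Candidates over
    # the threshold never reach the Distill stage, which is one way
    # to limit the "lucky execution" bias the referee raises.
    paired = [(e, c) for e, c in zip(errors, candidates) if e <= threshold]
    paired.sort(key=lambda p: p[0])
    return [c for _, c in paired[:k]]

# Hypothetical candidate tool chains and their task-metric errors:
chains = ["sundial->timesfm", "macd->forecast", "forecast", "correlation"]
errs = [0.3, 2.0, 0.1, 0.9]
print(select_for_distillation(chains, errs))
# → ['forecast', 'sundial->timesfm']
```

The over-threshold chain (`macd->forecast`, error 2.0) is filtered before ranking, so it cannot be distilled no matter how few candidates survive.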

Circularity Check

0 steps flagged

No circularity: empirical gains from independent framework on external benchmarks

full rationale

The paper describes an empirical four-stage loop (Explore-Compare-Distill-Reinject) evaluated on 17 MTBench-aligned tasks spanning finance, weather, and reasoning, reporting consistent gains over baselines while keeping the base model frozen. No equations, derivations, or first-principles claims are present in the provided text that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central claim rests on external task performance rather than any internal reduction to the method's own inputs by construction. This is a standard self-contained empirical result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions from LLM agent literature (tool-use validity, metric-based ranking) plus the new claim that distilled experience can be hierarchically reused without adaptation; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Multiple distinct tool-use sequences can be valid for the same time-series task yet differ in quantitative quality.
    Invoked to justify the need for the exploration and comparison stages.

pith-pipeline@v0.9.0 · 5556 in / 1251 out tokens · 59345 ms · 2026-05-12T02:45:07.962803+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 11 internal anchors

  1. [1]

    TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,

    P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y . Xing, J. Tang, and B. Dumoulin, “TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,” inThe Fourteenth International Conference on Learning Representations,

  2. [2]

    Available: https://openreview.net/forum?id=TZWnWvsQ0X

    [Online]. Available: https://openreview.net/forum?id=TZWnWvsQ0X

  3. [3]

    Continuous-time value iteration for multi-agent reinforcement learning,

    X. Wang, L. Zhang, H. Pu, A. H. Qureshi, and H. Li, “Continuous-time value iteration for multi-agent reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7N0ugLE17t

  4. [4]

    SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,

    B. Liu, S. Yu, Z. Liu, L. Guertler, P. Qi, D. Balcells, M. Liu, C. Tan, W. Shi, M. Lin, W. S. Lee, and N. Jaques, “SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7...

  5. [5]

    CoMind: Towards community-driven agents for machine learning engineering,

    S. Li, W. Sun, S. Li, A. Talwalkar, and Y . Yang, “CoMind: Towards community-driven agents for machine learning engineering,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=P479BoN8BD

  6. [6]

    AgentBench: Evaluating LLMs as agents,

    X. Liu, H. Yu, H. Zhang, X. Li, X. Dai, Y . Dong, A. Zenget al., “AgentBench: Evaluating LLMs as agents,” inThe Twelfth International Conference on Learning Representations, 2024

  7. [7]

    WebArena: A realistic web environ- ment for building autonomous agents,

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, G. Neubiget al., “WebArena: A realistic web environ- ment for building autonomous agents,” inThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jianget al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,”arXiv preprint arXiv:2308.08155, 2023

  9. [9]

    Zephyrus: An agentic framework for weather science,

    S. Varambally, M. Fisher, J. Thakker, Y . Chen, Z. Xia, R. Niu, Y . Jafari, V . V . Manivannan, Z. Novack, L. Han, S. Eranky, S. Rühling Cachay, T. Berg-Kirkpatrick, D. Watson-Parris, Y . Ma, and R. Yu, “Zephyrus: An agentic framework for weather science,” inNeurIPS 2025 AI for Science Workshop, 2025. [Online]. Available: https://openreview.net/forum?id=F...

  10. [10]

    USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,

    S. Lai, Y . Ning, Z. Yuan, Z. Chen, and H. Liu, “USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=ETzBStUFJy

  11. [11]

    STAIRS-former: Spatio-temporal attention with interleaved recursive structure transformer for offline mulit-task multi-agent reinforcement learning,

    J. Jeon, M. Cho, and Y . Sung, “STAIRS-former: Spatio-temporal attention with interleaved recursive structure transformer for offline mulit-task multi-agent reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=Biz1vpQeLI

  12. [12]

    NetArena: Dynamic benchmarks for AI agents in network automation,

    Y . Zhou, J. Ruan, E. S. Wang, S. Fouladi, F. Y . Yan, K. Hsieh, and Z. Liu, “NetArena: Dynamic benchmarks for AI agents in network automation,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=BPVPOtzoOz

  13. [13]

    Mtbench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

    J. Chen, A. Feng, Z. Zhao, J. Garza, G. Nurbek, C. Qin, A. Maatouk, L. Tassiulas, Y . Gao, and R. Ying, “MTBench: A multimodal time series benchmark for temporal reasoning and question answering,”arXiv preprint arXiv:2503.16858, 2025

  14. [14]

    TemporalBench: A benchmark for evaluating LLM-based agents on contextual and event-informed time series tasks,

    M. Weng, D. Cao, W. Yang, Y . Sharma, and Y . Liu, “TemporalBench: A benchmark for evaluating LLM-based agents on contextual and event-informed time series tasks,”arXiv preprint arXiv:2602.13272, 2026

  15. [15]

    TimeART: Towards agentic time series reasoning via tool-augmentation,

    X. Wu, J. Lu, Z. Li, X. Qiu, J. Hu, C. Guo, C. S. Jensen, and B. Yang, “TimeART: Towards agentic time series reasoning via tool-augmentation,”arXiv preprint arXiv:2601.13653, 2026

  16. [16]

    TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

    P. Liu, E. Fons, A. Vapsi, M. Ghassemi, S. Vyetrenko, D. Borrajo, V . K. Potluru, and M. Veloso, “TS-Agent: Understanding and reasoning over raw time series via iterative insight gathering,” arXiv preprint arXiv:2510.07432, 2025. 10

  17. [17]

    TimeSeriesScientist: A general-purpose AI agent for time series analysis,

    H. Zhao, X. Zhang, J. Wei, Y . Xu, Y . He, S. Sun, and C. You, “TimeSeriesScientist: A general- purpose AI agent for time series analysis,”arXiv preprint arXiv:2510.01538, 2025

  18. [18]

    TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents,

    G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen, “TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents,” inProceedings of the AAAI Conference on Artificial Intelligence, 2025

  19. [19]

    Cast- R1: Learning tool-augmented sequential decision policies for time series forecasting,

    X. Tao, M. Cheng, C. Jiang, T. Gao, H. Zhang, and Y . Liu, “Cast-R1: Learning tool-augmented sequential decision policies for time series forecasting,”arXiv preprint arXiv:2602.13802, 2026

  20. [20]

    Timexl: Explainable multi-modal time series prediction with llm-in- the-loop.arXiv preprint arXiv:2503.01013, 2025

    Y . Jiang, W. Yu, G. Lee, D. Song, K. Shin, W. Cheng, Y . Liu, and H. Chen, “TimeXL: Explainable multi-modal time series prediction with LLM-in-the-loop,”arXiv preprint arXiv:2503.01013, 2025

  21. [21]

    Empowering time series forecasting with llm-agents

    C.-C. M. Yeh, V . Lai, U. S. Saini, X. Fan, Y . Fan, J. Wang, X. Dai, and Y . Zheng, “Empowering time series forecasting with LLM-agents,”arXiv preprint arXiv:2508.04231, 2025

  22. [22]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” inAdvances in Neural Information Processing Systems, vol. 36, 2023

  23. [23]

    Gorilla: Large language model connected with massive APIs,

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” inAdvances in Neural Information Processing Systems, vol. 37, 2024

  24. [24]

    ToolLLM: Facili- tating large language models to master 16000+ real-world APIs,

    Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facili- tating large language models to master 16000+ real-world APIs,” inThe Twelfth International Conference on Learning Representations, 2024

  25. [25]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang, “MemSkill: Learning and evolving memory skills for self-evolving agents,”arXiv preprint arXiv:2602.02474, 2026

  26. [26]

    Asda: Automated skill distillation and adaptation for financial reasoning.arXiv preprint arXiv:2603.16112, 2026

    T. Y . Yim, W. Tan, S. Y . Chan, T.-W. Lam, and S. M. Yiu, “ASDA: Automated skill distillation and adaptation for financial reasoning,”arXiv preprint arXiv:2603.16112, 2026

  27. [27]

    Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

    H. Wang, G. Wang, H. Xiao, Y . Zhou, Y . Pan, J. Wang, K. Xu, Y . Wen, X. Ruan, X. Chen, and H. Qi, “Skill-SD: Skill-conditioned self-distillation for multi-turn LLM agents,”arXiv preprint arXiv:2604.10674, 2026

  28. [28]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, 2023

  29. [29]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2023

  30. [30]

    OpenClaw-RL: Train Any Agent Simply by Talking

    Y . Wang, X. Chen, X. Jin, M. Wang, and L. Yang, “OpenClaw-RL: Train any agent simply by talking,”arXiv preprint arXiv:2603.10165, 2026

  31. [31]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu

    T. Chen, Y . Li, M. Solodko, S. Wang, N. Jiang, T. Cui, J. Hao, J. Ko, S. Abdali, L. Xu, S. Zheng, H. Fan, P. Cameron, J. Wagle, and K. Koishida, “CUA-Skill: Develop skills for computer using agent,”arXiv preprint arXiv:2601.21123, 2026

  32. [32]

    Clawkeeper: Comprehensive safety protection for openclaw agents through skills, plugins, and watchers.arXiv preprint arXiv:2603.24414,

    S. Liu, C. Li, C. Wang, J. Hou, Z. Chen, L. Zhang, Z. Liu, Q. Ye, Y . Hei, X. Zhang, and Z. Wang, “ClawKeeper: Comprehensive safety protection for OpenClaw agents through skills, plugins, and watchers,”arXiv preprint arXiv:2603.24414, 2026

  33. [33]

    TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

    W. Ye, W. Yang, D. Cao, Y . Zhang, L. Tang, J. Cai, and Y . Liu, “Domain-oriented time series inference agents for reasoning and automated analysis,”arXiv preprint arXiv:2410.04047, 2024

  34. [34]

    FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,

    G. Jalori, P. Verma, and S. O. Arık, “FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,” inFindings of the Association for Computational Linguistics: EMNLP 2025, 2025. 11

  35. [35]

    TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,

    P. Trirat, J. M. Kwak, J. Heo, H. Lee, and S. J. Hwang, “TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,”arXiv preprint arXiv:2601.19151, 2026

  36. [36]

    Visual reasoning over time series via multi-agent system,

    W. Ruan and Y . Liang, “Visual reasoning over time series via multi-agent system,”arXiv preprint arXiv:2602.03026, 2026

  37. [37]

    When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,

    W. Ye, J. Liu, D. Cao, W. Yang, and Y . Liu, “When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,”arXiv preprint arXiv:2509.01822, 2025

  38. [38]

    Can competition enhance the proficiency of agents powered by large language models in the realm of news-driven time series forecasting?

    Y . Zhang, Y . Feng, D. Li, K. Zhang, J. Chen, and B. Deng, “Can competition enhance the proficiency of agents powered by large language models in the realm of news-driven time series forecasting?”arXiv preprint arXiv:2504.10210, 2025

  39. [39]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”arXiv preprint arXiv:2305.16291, 2023

  40. [40]

    MemGPT: Towards LLMs as Operating Systems

    C. Packer, S. Wooders, K. Lin, V . Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,”arXiv preprint arXiv:2310.08560, 2023

  41. [41]

    STaR: Bootstrapping reasoning with reason- ing,

    E. Zelikman, Y . Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reason- ing,” inAdvances in Neural Information Processing Systems, vol. 35, 2022

  42. [42]

    Reinforced Self-Training (ReST) for Language Modeling

    C. Gulcehre, T. Le Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Reinforced self-training (ReST) for language modeling,”arXiv preprint arXiv:2308.08998, 2023

  43. [43]

    ClawSafety: "Safe" LLMs, Unsafe Agents

    B. Wei, Y . Zhang, J. Pan, K. Mei, X. Wang, J. Hamm, Z. Zhu, and Y . Ge, “ClawSafety: “safe” LLMs, unsafe agents,”arXiv preprint arXiv:2604.01438, 2026

  44. [44]

    SafeClaw-R: Towards safe and secure multi-agent personal assistants,

    H. Wang, Z. Xiao, Y . Zhang, C. M. Poskitt, and J. Sun, “SafeClaw-R: Towards safe and secure multi-agent personal assistants,”arXiv preprint arXiv:2603.28807, 2026

  45. [45]

    Chronos: Learning the Language of Time Series

    A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchuret al., “Chronos: Learning the language of time series,”arXiv preprint arXiv:2403.07815, 2024

  46. [46]

    Unified training of universal time series forecasting transformers.arXiv preprint arXiv:2402.02592, 2024

    G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, “Moirai: Unified training of universal time series forecasting transformers,”arXiv preprint arXiv:2402.02592, 2024

  47. [47]

    A decoder- only foundation model for time-series forecasting.arXiv preprint arXiv:2310.10688,

    A. Das, W. Kong, R. Sen, and Y . Zhou, “A decoder-only foundation model for time-series forecasting,”arXiv preprint arXiv:2310.10688, 2024

  48. [48]

    Time-LLM: Time series forecasting by reprogramming large language models,

    M. Jin, S. Wang, L. Ma, Z. Chu, J. Y . Zhang, X. Shi, P.-Y . Chen, Y . Liang, Y .-F. Li, S. Pan, and Q. Wen, “Time-LLM: Time series forecasting by reprogramming large language models,” in The Twelfth International Conference on Learning Representations, 2024

  49. [49]

    Moment: A family of open time-series foundation models

    M. Goswami, K. Szafer, A. Choudhry, Y . Cai, S. Li, and A. Dubrawski, “MOMENT: A family of open time-series foundation models,”arXiv preprint arXiv:2402.03885, 2024

  50. [50]

    Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

    K. Rasul, A. Ashok, A. R. Williams, H. Ghonia, R. Bhagwatkar, A. Khorasani, M. J. Darvishi Bayazi et al., “Lag-Llama: Towards foundation models for probabilistic time series forecasting,” arXiv preprint arXiv:2310.08278, 2023.

  51. [51]

    Sundial: A Family of Highly Capable Time Series Foundation Models

    Y. Liu, G. Qin, Z. Shi, Z. Chen, C. Yang, X. Huang, J. Wang, and M. Long, “Sundial: A family of highly capable time series foundation models,” arXiv preprint arXiv:2502.00816, 2025.

  52. [52]

    Timer: Transformers for Time Series Analysis at Scale

    Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: Transformers for time series analysis at scale,” arXiv preprint arXiv:2402.02368, 2024.

  53. [53]

    Aurora: Towards Universal Generative Multimodal Time Series Forecasting

    X. Wu, J. Jin, W. Qiu, P. Chen, Y. Shu, B. Yang, and C. Guo, “Aurora: Towards universal generative multimodal time series forecasting,” arXiv preprint arXiv:2509.22295, 2025.

A Implementation Details

This section records implementation-specific details that are not central to the main text. Exploratory learning and downstream inference share the same c...

  54. [54]

    Forecasts should preserve a believable 24-hour temperature rhythm across all 72 hourly steps

  55. [55]

    Treat storm text as a short-lived modifier unless it clearly signals a real regime break such as a cold front or snow

  56. [56]

    Start the forecast close to the last observed temperature before continuing the next diurnal cycle. Tool chain: sundial_base_128m_forecast -> timesfm2_forecast. Step records:

  57. [57]

    1. sundial_base_128m_forecast: produced a smooth 72-step temperature path that preserved overnight cooling and daytime warming. 2. timesfm2_forecast: produced a second real 72-step forecast candidate for the same horizon.

  58. [58]

    Final prediction metrics in the row artifact: MAE = 1.687, RMSE = 2.323, MAPE = 8.090, MSE = 5.398.

    execution_context field: The historical series shows a strong repeated 24-hour temperature rhythm with only a brief storm event near the boundary. sundial_base_128m_forecast preserved smooth overnight cooling and daytime warming while softening the warm endpo...
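The row-artifact metrics above (MAE, RMSE, MAPE, MSE) match the standard point-forecast error definitions. A minimal sketch of how such a metric block could be computed; the function name and dict layout are illustrative, not taken from the paper:

```python
import math

def forecast_metrics(y_true, y_pred):
    # Standard point-forecast error metrics; MAPE is in percent
    # and assumes the ground-truth values are nonzero.
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / len(errs)
    mse = sum(e * e for e in errs) / len(errs)
    rmse = math.sqrt(mse)
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errs, y_true)) / len(errs)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MSE": mse}
```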

  59. [59]

    Weight the last regime and endpoint behavior more than the full-month trend

  60. [60]

    Match the chosen bucket to the implied percentage move from the latest price

  61. [61]

    Treat routine bullish commentary as weaker magnitude evidence than a fresh company catalyst. Tool chain: value_at -> text_to_event -> llm_reasoning_classification. Step records: 1. value_at: latest observed price = 19.375, with pct_change_vs_reference = 35.55% from the first observed price. 2. text_to_event: events = [], n_events = 0. 3. llm_reasoning_classification...
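The value_at step above reports a percentage move of the latest price relative to the first observed price, and the surrounding guidelines match that move to a magnitude bucket. A minimal sketch of both computations; the bucket boundaries and labels are hypothetical, not from the paper:

```python
def pct_change_vs_reference(latest: float, reference: float) -> float:
    # Implied percentage move of the latest price vs. a reference price.
    return 100.0 * (latest - reference) / reference

def match_bucket(pct_move: float, buckets):
    # Return the label of the first bucket whose [low, high) range
    # contains the move; boundaries here are illustrative only.
    for label, low, high in buckets:
        if low <= pct_move < high:
            return label
    return "out_of_range"

# Hypothetical magnitude buckets for a classification task.
BUCKETS = [
    ("large_down", -100.0, -10.0),
    ("small_move", -10.0, 10.0),
    ("large_up", 10.0, 100.0),
]
```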

  62. [62]

    Classify from the mean of the last observed 24 hours versus the mean of the first predicted 24 hours using the exact ±0.5 thresholds
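The rule above compares two 24-hour means against exact ±0.5 cutoffs. One way this could look in code; the label names and the default threshold argument are assumptions, only the ±0.5 cutoff comes from the source:

```python
def classify_day_transition(last_obs_24h, first_pred_24h, threshold=0.5):
    # Difference between the first forecast-day mean and the last
    # observed-day mean, labeled with the exact +/-0.5 cutoffs.
    delta = (sum(first_pred_24h) / len(first_pred_24h)
             - sum(last_obs_24h) / len(last_obs_24h))
    if delta > threshold:
        return "increase"
    if delta < -threshold:
        return "decrease"
    return "stable"
```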

  63. [63]

    Prioritize the latest 24-hour regime and any end-aligned storm context over the full 14-day slope

  64. [64]

    Segment the series into exact daily windows before reasoning when the task is explicitly about comparing daily means. Tool chain: segment -> window_stats -> llm_reasoning_classification. Step records: 1. segment: the 336 hourly values were divided into 14 exact daily windows. 2. window_stats: on the last observed 24 hours, mean = 16.435416666666665. 3. llm_reasoning...
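The segment -> window_stats chain above splits 336 hourly values into 14 exact daily windows and takes per-window means. A minimal sketch of that step pair; the function names are illustrative, not the paper's tool implementations:

```python
def segment_daily(values, hours_per_day=24):
    # Split an hourly series into exact daily windows; the length must
    # divide evenly, as in the 336 -> 14-window example above.
    if len(values) % hours_per_day != 0:
        raise ValueError("series length must be a multiple of hours_per_day")
    return [values[i:i + hours_per_day]
            for i in range(0, len(values), hours_per_day)]

def window_mean(window):
    # window_stats-style mean of one daily window.
    return sum(window) / len(window)
```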

  65. [65]

    Base the label on the difference between the observed last-day mean and the first forecast-day mean rather than on the full-series slope

  66. [66]

    Treat compressed last-day windows near the ±0.5 cutoff as threshold-sensitive cases

  67. [67]

    When boundary weather makes the mean shift near the cutoff, compare more than one plausible next-day forecast scenario before classifying. Tool chain: segment -> chronos2_forecast -> window_stats -> window_stats -> aurora_forecast -> llm_reasoning_classification. Step records: 1. window_stats on the last observed 24-hour window gave mean = 20.239583333333332. 2...

  68. [68]

    Use aligned news as a light directional calibration unless it contains a clear near-horizon catalyst

  69. [69]

    Return exactly the prompt-specified number of float prices in the required format

  70. [70]

    Prefer timesfm2_forecast when a 5-minute financial series needs smooth fine-grained continuation rather than quantized level holding. Tool chain: timesfm2_forecast -> moirai1_1_r_base_forecast. Step records: 1. timesfm2_forecast: generated a smooth 78-step path clustered near 51.41 and slowly easing lower. 2. moirai1_1_r_base_forecast: generated a second 78-step cand...

  71. [71]

    Final prediction metrics in the row artifact: MAE = 0.190, RMSE = 0.219, MAPE = 0.371, MSE = 0.048.

    execution_context field: timesfm2_forecast generated a smooth 78-step path clustered near 51.41 and slowly easing lower. moirai1_1_r_base_forecast generated a much choppier 78-step path with repeated sharp oscillations and spikes. The news text was generic Nasdaq...