pith. machine review for the scientific record.

arxiv: 2605.10038 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links


TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series · exploratory execution · AI agent · LLM agent · tool use · hierarchical experience · forecasting · reasoning tasks
0 comments

The pith

TimeClaw turns exploratory executions into reusable hierarchical patterns that improve time-series agent performance without retraining the base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TimeClaw to help AI agents handle time-series tasks such as forecasting and reasoning in finance and weather. Instead of stopping at the first successful execution, the system explores multiple tool-use paths, compares them using task metrics, distills the useful structures into hierarchical experience, and reinjects that experience on later tasks. This approach keeps the underlying language model frozen and avoids any online adaptation during testing. The method matters because early successes in numeric domains often lock agents into suboptimal tool choices, limiting accuracy even when better procedures exist. Evaluation on an MTBench-aligned set of 17 tasks shows steady improvements over standard baselines.

Core claim

TimeClaw is an exploratory execution learning framework built around a four-stage loop of Explore, Compare, Distill, and Reinject. It applies metric-supervised comparison across candidate executions, together with task-aware tool dropout, to extract reusable hierarchical distilled experience. This experience is stored and reinjected at inference time to guide future decisions. The base model remains frozen throughout, with no test-time adaptation. In an evaluation spanning 17 tasks aligned with MTBench in finance, weather prediction, and reasoning, the system produces consistent gains over baselines. The authors conclude that the limiting factor in these scientific systems is less raw execution-time capability than how exploratory experience is compared, distilled, and reused.

What carries the argument

The four-stage Explore-Compare-Distill-Reinject loop that converts multiple exploratory executions into reusable hierarchical distilled experience through metric supervision and task-aware tool dropout.
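The loop above can be sketched in a few lines. Everything here is an illustrative stand-in, not the paper's API: `toy_execute`, the dict fields, and the top-k distillation rule are assumptions made for the sketch.

```python
import random

def explore(task, execute, rng, n_candidates=4):
    # Explore: sample several tool-use executions instead of
    # stopping at the first valid one.
    return [execute(task, rng) for _ in range(n_candidates)]

def compare(candidates, metric, target):
    # Compare: rank candidate executions by the task metric
    # (lower error is better).
    return sorted(candidates, key=lambda c: metric(c["prediction"], target))

def distill(ranked, top_k=2):
    # Distill: keep the tool chains of the best executions as a
    # stand-in for hierarchical distilled experience.
    return [c["trace"] for c in ranked[:top_k]]

def reinject(memory, patterns):
    # Reinject: store patterns for reuse on later tasks; the base
    # model itself is never updated.
    for p in patterns:
        if p not in memory:
            memory.append(p)
    return memory

def toy_execute(task, rng):
    # Hypothetical executor: picks a two-tool chain and returns a
    # noisy prediction of the task target.
    chain = tuple(rng.sample(["forecast", "indicator_macd", "correlation"], 2))
    return {"trace": chain, "prediction": task["target"] + rng.uniform(-3, 3)}

rng = random.Random(0)
task = {"target": 10.0}
ranked = compare(explore(task, toy_execute, rng),
                 lambda p, t: abs(p - t), task["target"])
memory = reinject([], distill(ranked))
```

One iteration leaves the tool chains of the two best executions in `memory`, ready to prime later tasks while the model stays frozen.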

If this is right

  • Produces consistent gains over baselines across 17 tasks in finance, weather prediction, and reasoning.
  • Avoids tool-prior collapse by continuing exploration even after early valid executions.
  • Enables reuse of hierarchical patterns at inference time while the base model stays frozen.
  • Shifts the bottleneck in scientific AI agents from execution capability to how exploratory experience is compared, distilled, and reused.
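The tool-prior-collapse point can be made concrete with a minimal dropout rule. Masking the top tools by historical usage count is an assumed simplification of the paper's task-aware mechanism; the function name and `mask_top` default are placeholders.

```python
def task_aware_tool_dropout(visible_tools, usage_counts, mask_top=2):
    # Temporarily hide the most-used tools so exploration does not
    # collapse onto early favorites. The rule (mask the top `mask_top`
    # tools by historical count) is illustrative only.
    by_count = sorted(visible_tools,
                      key=lambda t: usage_counts.get(t, 0), reverse=True)
    masked = set(by_count[:mask_top])
    return [t for t in visible_tools if t not in masked]

counts = {"forecast": 120, "indicator_macd": 80, "correlation": 5}
tools = ["forecast", "indicator_macd", "correlation", "trend_past"]
print(task_aware_tool_dropout(tools, counts))
# → ['correlation', 'trend_past']
```

The two dominant tools disappear from the visible set for this round, forcing the agent to try the long-tail tools it would otherwise ignore.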

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be applied to other tool-using agents in sequential decision domains such as code synthesis or robotic planning.
  • Hierarchical distillation may allow static models to accumulate cross-task memory without periodic retraining.
  • Focusing on experience reuse rather than model updates could simplify deployment and reduce compute costs in production settings.

Load-bearing premise

Metric-supervised comparison of exploratory executions can reliably extract reusable hierarchical patterns that improve later performance without selection bias or any need to adapt the base model.

What would settle it

On the same 17-task benchmark, removing the Compare and Distill stages or replacing metric comparison with random selection of executions eliminates the reported gains and returns performance to baseline levels.
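The random-selection half of that test reduces to a simple comparison: if metric-supervised selection carries the gains, picking the best candidate by its task metric should beat picking one at random. The harness below uses synthetic uniform candidate errors purely for illustration; nothing about the distribution comes from the paper.

```python
import random
import statistics

def selection_gap(n_tasks=200, n_candidates=4, seed=0):
    # Simulated ablation: per task, draw candidate errors, then
    # compare metric-based selection (keep the minimum-error
    # candidate) against the ablated random selection.
    rng = random.Random(seed)
    metric_err, random_err = [], []
    for _ in range(n_tasks):
        errors = [rng.uniform(0.0, 1.0) for _ in range(n_candidates)]
        metric_err.append(min(errors))          # Compare stage keeps the best
        random_err.append(rng.choice(errors))   # ablated: random selection
    return statistics.mean(random_err) - statistics.mean(metric_err)
```

A clearly positive gap is what the paper's mechanism predicts; a gap near zero on the real benchmark would support the referee's worry that the Compare stage is not load-bearing.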

Figures

Figures reproduced from arXiv: 2605.10038 by Dongyuan Li, Hangchen Liu, Jiewen Deng, Renhe Jiang, Weiwei Ye, Yoshihide Sekimoto.

Figure 1. Overview of TimeClaw. The framework follows four stages: exploration-time learning […]
Figure 2. Case study of post-shock MACD forecasting. The figure compares TimeClaw Unexplored […]
Figure 3. Tool-dropout mechanism. Frequently used tools are temporarily masked according to their historical usage counts to encourage broader exploration over candidate tools.
Figure 4. Tool exploration analysis. We evaluate whether tool dropout mitigates tool-prior collapse during exploratory execution learning on four tasks: forecast, indicator_macd, correlation, and trend_past. We use two behavioral metrics: tool coverage rate, measuring how many visible tools are invoked within an exploration prefix, and Top-5 tool share, measuring concentration on the five most-used tools […]
Figure 5. Execution sample on weather forecast task.
Figure 6. Execution sample on finance forecast task.
Figure 7. Execution sample on weather future-trend task.
Figure 8. Execution sample on finance trend task.
Figure 9. Execution sample on weather past-trend task.
Figure 10. Inference sample on weather forecast task.
Figure 11. Inference sample on finance trend task.
Figure 12. Inference sample on weather future-trend task.
Figure 13. Inference sample on weather future-trend task.
Figure 14. Inference sample on finance forecast task.
read the original abstract

Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TimeClaw, an exploratory execution learning framework for time-series AI agents. It uses a four-stage Explore-Compare-Distill-Reinject loop to convert exploratory executions into reusable hierarchical distilled experience, incorporating metric-supervised comparison, task-aware tool dropout, and inference-time reinjection while keeping the base model frozen. The central claim is that this approach yields consistent performance gains over baselines on 17 MTBench-aligned tasks spanning finance, weather prediction, and reasoning domains.

Significance. If the empirical results hold with proper controls, TimeClaw could advance LLM-based agents in numeric domains by addressing tool-prior collapse and enabling reuse of exploratory experience without online adaptation or fine-tuning. The frozen-base-model design and focus on verifiable numeric settings are strengths that distinguish it from typical execution-centric systems.

major comments (3)
  1. [Abstract and Evaluation] The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.
  2. [Method, Compare stage] The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.
  3. [Experiments] No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.
minor comments (2)
  1. [Abstract] The term 'MTBench-aligned' is used without a clear definition or citation in the abstract; a brief explanation or reference should be added in the introduction.
  2. [Method] Notation for the hierarchical distilled experience could be clarified with a small diagram or pseudocode in the Method section to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.

    Authors: We agree that the abstract would be improved by including quantitative anchors for the central claim. In the revised manuscript we have updated the abstract to reference the specific tables in the Experiments section that report average improvements, effect sizes, standard deviations across runs, baseline implementation details, and statistical test results. The full quantitative support remains in the evaluation, now more explicitly signposted from the abstract. revision: yes

  2. Referee: [Method, Compare stage] The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.

    Authors: We acknowledge that the original description of the Compare stage was insufficiently explicit about bias controls. The revised Method section now details the sampling rule for high-metric paths (top-k selection with a task-metric threshold), the precise application of task-aware tool dropout to reduce lucky-execution bias, and the cross-instance validation procedure used to confirm that distilled patterns transfer beyond the source instance. These additions clarify how the loop favors reusable hierarchical patterns. revision: yes

  3. Referee: [Experiments] No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.

    Authors: We agree that greater transparency is required. The revised Experiments section now contains an explicit table enumerating all 17 tasks with their domains and metrics, a description of the MTBench alignment procedure (category mapping and prompt adaptation for time-series tool use), and new ablation experiments that isolate the Distill and Reinject stages against a pure-exploration baseline. These ablations demonstrate the incremental contribution of each component. revision: yes
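The rebuttal's stated Compare-stage rule, top-k selection under a task-metric threshold, can be sketched directly. The function name, the `k` and `threshold` defaults, and the example tool chains are all placeholders, not reported settings from the paper.

```python
def select_for_distillation(candidates, errors, k=2, threshold=1.0):
    # Keep at most k candidate executions whose task-metric error
    # falls under the threshold, ordered best-first. Candidates over
    # the threshold never reach the Distill stage, which is one way
    # to limit the "lucky execution" bias the referee raises.
    paired = [(e, c) for e, c in zip(errors, candidates) if e <= threshold]
    paired.sort(key=lambda p: p[0])
    return [c for _, c in paired[:k]]

# Hypothetical candidate tool chains and their task-metric errors:
chains = ["sundial->timesfm", "macd->forecast", "forecast", "correlation"]
errs = [0.3, 2.0, 0.1, 0.9]
print(select_for_distillation(chains, errs))
# → ['forecast', 'sundial->timesfm']
```

The over-threshold chain (`macd->forecast`, error 2.0) is filtered before ranking, so it cannot be distilled no matter how few candidates survive.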

Circularity Check

0 steps flagged

No circularity: empirical gains from independent framework on external benchmarks

full rationale

The paper describes an empirical four-stage loop (Explore-Compare-Distill-Reinject) evaluated on 17 MTBench-aligned tasks spanning finance, weather, and reasoning, reporting consistent gains over baselines while keeping the base model frozen. No equations, derivations, or first-principles claims are present in the provided text that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central claim rests on external task performance rather than any internal reduction to the method's own inputs by construction. This is a standard self-contained empirical result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions from LLM agent literature (tool-use validity, metric-based ranking) plus the new claim that distilled experience can be hierarchically reused without adaptation; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Multiple distinct tool-use sequences can be valid for the same time-series task yet differ in quantitative quality.
    Invoked to justify the need for the exploration and comparison stages.

pith-pipeline@v0.9.0 · 5556 in / 1251 out tokens · 59345 ms · 2026-05-12T02:45:07.962803+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 11 internal anchors

  1. [1]

    TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,

    P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y . Xing, J. Tang, and B. Dumoulin, “TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,” inThe Fourteenth International Conference on Learning Representations,

  2. [2]

    Available: https://openreview.net/forum?id=TZWnWvsQ0X

    [Online]. Available: https://openreview.net/forum?id=TZWnWvsQ0X

  3. [3]

    Continuous-time value iteration for multi-agent reinforcement learning,

    X. Wang, L. Zhang, H. Pu, A. H. Qureshi, and H. Li, “Continuous-time value iteration for multi-agent reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7N0ugLE17t

  4. [4]

    SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,

    B. Liu, S. Yu, Z. Liu, L. Guertler, P. Qi, D. Balcells, M. Liu, C. Tan, W. Shi, M. Lin, W. S. Lee, and N. Jaques, “SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7...

  5. [5]

    CoMind: Towards community-driven agents for machine learning engineering,

    S. Li, W. Sun, S. Li, A. Talwalkar, and Y . Yang, “CoMind: Towards community-driven agents for machine learning engineering,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=P479BoN8BD

  6. [6]

    AgentBench: Evaluating LLMs as agents,

    X. Liu, H. Yu, H. Zhang, X. Li, X. Dai, Y . Dong, A. Zenget al., “AgentBench: Evaluating LLMs as agents,” inThe Twelfth International Conference on Learning Representations, 2024

  7. [7]

    WebArena: A realistic web environ- ment for building autonomous agents,

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, G. Neubiget al., “WebArena: A realistic web environ- ment for building autonomous agents,” inThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jianget al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,”arXiv preprint arXiv:2308.08155, 2023

  9. [9]

    Zephyrus: An agentic framework for weather science,

    S. Varambally, M. Fisher, J. Thakker, Y . Chen, Z. Xia, R. Niu, Y . Jafari, V . V . Manivannan, Z. Novack, L. Han, S. Eranky, S. Rühling Cachay, T. Berg-Kirkpatrick, D. Watson-Parris, Y . Ma, and R. Yu, “Zephyrus: An agentic framework for weather science,” inNeurIPS 2025 AI for Science Workshop, 2025. [Online]. Available: https://openreview.net/forum?id=F...

  10. [10]

    USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,

    S. Lai, Y . Ning, Z. Yuan, Z. Chen, and H. Liu, “USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=ETzBStUFJy

  11. [11]

    STAIRS-former: Spatio-temporal attention with interleaved recursive structure transformer for offline mulit-task multi-agent reinforcement learning,

    J. Jeon, M. Cho, and Y . Sung, “STAIRS-former: Spatio-temporal attention with interleaved recursive structure transformer for offline mulit-task multi-agent reinforcement learning,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=Biz1vpQeLI

  12. [12]

    NetArena: Dynamic benchmarks for AI agents in network automation,

    Y . Zhou, J. Ruan, E. S. Wang, S. Fouladi, F. Y . Yan, K. Hsieh, and Z. Liu, “NetArena: Dynamic benchmarks for AI agents in network automation,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=BPVPOtzoOz

  13. [13]

    Mtbench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

    J. Chen, A. Feng, Z. Zhao, J. Garza, G. Nurbek, C. Qin, A. Maatouk, L. Tassiulas, Y . Gao, and R. Ying, “MTBench: A multimodal time series benchmark for temporal reasoning and question answering,”arXiv preprint arXiv:2503.16858, 2025

  14. [14]

    TemporalBench: A benchmark for evaluating LLM-based agents on contextual and event-informed time series tasks,

    M. Weng, D. Cao, W. Yang, Y . Sharma, and Y . Liu, “TemporalBench: A benchmark for evaluating LLM-based agents on contextual and event-informed time series tasks,”arXiv preprint arXiv:2602.13272, 2026

  15. [15]

    TimeART: Towards agentic time series reasoning via tool-augmentation,

    X. Wu, J. Lu, Z. Li, X. Qiu, J. Hu, C. Guo, C. S. Jensen, and B. Yang, “TimeART: Towards agentic time series reasoning via tool-augmentation,”arXiv preprint arXiv:2601.13653, 2026

  16. [16]

    TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

    P. Liu, E. Fons, A. Vapsi, M. Ghassemi, S. Vyetrenko, D. Borrajo, V . K. Potluru, and M. Veloso, “TS-Agent: Understanding and reasoning over raw time series via iterative insight gathering,” arXiv preprint arXiv:2510.07432, 2025. 10

  17. [17]

    TimeSeriesScientist: A general-purpose AI agent for time series analysis,

    H. Zhao, X. Zhang, J. Wei, Y . Xu, Y . He, S. Sun, and C. You, “TimeSeriesScientist: A general- purpose AI agent for time series analysis,”arXiv preprint arXiv:2510.01538, 2025

  18. [18]

    TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents,

    G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen, “TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents,” inProceedings of the AAAI Conference on Artificial Intelligence, 2025

  19. [19]

    Cast- R1: Learning tool-augmented sequential decision policies for time series forecasting,

    X. Tao, M. Cheng, C. Jiang, T. Gao, H. Zhang, and Y . Liu, “Cast-R1: Learning tool-augmented sequential decision policies for time series forecasting,”arXiv preprint arXiv:2602.13802, 2026

  20. [20]

    Timexl: Explainable multi-modal time series prediction with llm-in- the-loop.arXiv preprint arXiv:2503.01013, 2025

    Y . Jiang, W. Yu, G. Lee, D. Song, K. Shin, W. Cheng, Y . Liu, and H. Chen, “TimeXL: Explainable multi-modal time series prediction with LLM-in-the-loop,”arXiv preprint arXiv:2503.01013, 2025

  21. [21]

    Empowering time series forecasting with llm-agents

    C.-C. M. Yeh, V . Lai, U. S. Saini, X. Fan, Y . Fan, J. Wang, X. Dai, and Y . Zheng, “Empowering time series forecasting with LLM-agents,”arXiv preprint arXiv:2508.04231, 2025

  22. [22]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” inAdvances in Neural Information Processing Systems, vol. 36, 2023

  23. [23]

    Gorilla: Large language model connected with massive APIs,

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” inAdvances in Neural Information Processing Systems, vol. 37, 2024

  24. [24]

    ToolLLM: Facili- tating large language models to master 16000+ real-world APIs,

    Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facili- tating large language models to master 16000+ real-world APIs,” inThe Twelfth International Conference on Learning Representations, 2024

  25. [25]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang, “MemSkill: Learning and evolving memory skills for self-evolving agents,”arXiv preprint arXiv:2602.02474, 2026

  26. [26]

    Asda: Automated skill distillation and adaptation for financial reasoning.arXiv preprint arXiv:2603.16112, 2026

    T. Y . Yim, W. Tan, S. Y . Chan, T.-W. Lam, and S. M. Yiu, “ASDA: Automated skill distillation and adaptation for financial reasoning,”arXiv preprint arXiv:2603.16112, 2026

  27. [27]

    Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

    H. Wang, G. Wang, H. Xiao, Y . Zhou, Y . Pan, J. Wang, K. Xu, Y . Wen, X. Ruan, X. Chen, and H. Qi, “Skill-SD: Skill-conditioned self-distillation for multi-turn LLM agents,”arXiv preprint arXiv:2604.10674, 2026

  28. [28]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, 2023

  29. [29]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2023

  30. [30]

    OpenClaw-RL: Train Any Agent Simply by Talking

    Y . Wang, X. Chen, X. Jin, M. Wang, and L. Yang, “OpenClaw-RL: Train any agent simply by talking,”arXiv preprint arXiv:2603.10165, 2026

  31. [31]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu

    T. Chen, Y . Li, M. Solodko, S. Wang, N. Jiang, T. Cui, J. Hao, J. Ko, S. Abdali, L. Xu, S. Zheng, H. Fan, P. Cameron, J. Wagle, and K. Koishida, “CUA-Skill: Develop skills for computer using agent,”arXiv preprint arXiv:2601.21123, 2026

  32. [32]

    Clawkeeper: Comprehensive safety protection for openclaw agents through skills, plugins, and watchers.arXiv preprint arXiv:2603.24414,

    S. Liu, C. Li, C. Wang, J. Hou, Z. Chen, L. Zhang, Z. Liu, Q. Ye, Y . Hei, X. Zhang, and Z. Wang, “ClawKeeper: Comprehensive safety protection for OpenClaw agents through skills, plugins, and watchers,”arXiv preprint arXiv:2603.24414, 2026

  33. [33]

    TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

    W. Ye, W. Yang, D. Cao, Y . Zhang, L. Tang, J. Cai, and Y . Liu, “Domain-oriented time series inference agents for reasoning and automated analysis,”arXiv preprint arXiv:2410.04047, 2024

  34. [34]

    FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,

    G. Jalori, P. Verma, and S. O. Arık, “FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,” inFindings of the Association for Computational Linguistics: EMNLP 2025, 2025. 11

  35. [35]

    TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,

    P. Trirat, J. M. Kwak, J. Heo, H. Lee, and S. J. Hwang, “TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,”arXiv preprint arXiv:2601.19151, 2026

  36. [36]

    Visual reasoning over time series via multi-agent system,

    W. Ruan and Y . Liang, “Visual reasoning over time series via multi-agent system,”arXiv preprint arXiv:2602.03026, 2026

  37. [37]

    When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,

    W. Ye, J. Liu, D. Cao, W. Yang, and Y . Liu, “When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,”arXiv preprint arXiv:2509.01822, 2025

  38. [38]

    Can competition enhance the proficiency of agents powered by large language models in the realm of news-driven time series forecasting?

    Y . Zhang, Y . Feng, D. Li, K. Zhang, J. Chen, and B. Deng, “Can competition enhance the proficiency of agents powered by large language models in the realm of news-driven time series forecasting?”arXiv preprint arXiv:2504.10210, 2025

  39. [39]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”arXiv preprint arXiv:2305.16291, 2023

  40. [40]

    MemGPT: Towards LLMs as Operating Systems

    C. Packer, S. Wooders, K. Lin, V . Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,”arXiv preprint arXiv:2310.08560, 2023

  41. [41]

    STaR: Bootstrapping reasoning with reason- ing,

    E. Zelikman, Y . Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reason- ing,” inAdvances in Neural Information Processing Systems, vol. 35, 2022

  42. [42]

    Reinforced Self-Training (ReST) for Language Modeling

    C. Gulcehre, T. Le Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Reinforced self-training (ReST) for language modeling,”arXiv preprint arXiv:2308.08998, 2023

  43. [43]

    ClawSafety: "Safe" LLMs, Unsafe Agents

    B. Wei, Y . Zhang, J. Pan, K. Mei, X. Wang, J. Hamm, Z. Zhu, and Y . Ge, “ClawSafety: “safe” LLMs, unsafe agents,”arXiv preprint arXiv:2604.01438, 2026

  44. [44]

    SafeClaw-R: Towards safe and secure multi-agent personal assistants,

    H. Wang, Z. Xiao, Y . Zhang, C. M. Poskitt, and J. Sun, “SafeClaw-R: Towards safe and secure multi-agent personal assistants,”arXiv preprint arXiv:2603.28807, 2026

  45. [45]

    Chronos: Learning the Language of Time Series

    A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchuret al., “Chronos: Learning the language of time series,”arXiv preprint arXiv:2403.07815, 2024

  46. [46]

    Unified training of universal time series forecasting transformers.arXiv preprint arXiv:2402.02592, 2024

    G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, “Moirai: Unified training of universal time series forecasting transformers,”arXiv preprint arXiv:2402.02592, 2024

  47. [47]

    A decoder- only foundation model for time-series forecasting.arXiv preprint arXiv:2310.10688,

    A. Das, W. Kong, R. Sen, and Y . Zhou, “A decoder-only foundation model for time-series forecasting,”arXiv preprint arXiv:2310.10688, 2024

  48. [48]

    Time-LLM: Time series forecasting by reprogramming large language models,

    M. Jin, S. Wang, L. Ma, Z. Chu, J. Y . Zhang, X. Shi, P.-Y . Chen, Y . Liang, Y .-F. Li, S. Pan, and Q. Wen, “Time-LLM: Time series forecasting by reprogramming large language models,” in The Twelfth International Conference on Learning Representations, 2024

  49. [49]

    Moment: A family of open time-series foundation models

    M. Goswami, K. Szafer, A. Choudhry, Y . Cai, S. Li, and A. Dubrawski, “MOMENT: A family of open time-series foundation models,”arXiv preprint arXiv:2402.03885, 2024

  50. [50]

    Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

    K. Rasul, A. Ashok, A. R. Williams, H. Ghonia, R. Bhagwatkar, A. Khorasani, M. J. Darvishi Bayazi et al., “Lag-Llama: Towards foundation models for probabilistic time series forecasting,” arXiv preprint arXiv:2310.08278, 2023.

  51. [51]

    Sundial: A Family of Highly Capable Time Series Foundation Models

    Y. Liu, G. Qin, Z. Shi, Z. Chen, C. Yang, X. Huang, J. Wang, and M. Long, “Sundial: A family of highly capable time series foundation models,” arXiv preprint arXiv:2502.00816, 2025.

  52. [52]

    Timer: Transformers for Time Series Analysis at Scale

    Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: Transformers for time series analysis at scale,” arXiv preprint arXiv:2402.02368, 2024.

  53. [53]

    Aurora: Towards Universal Generative Multimodal Time Series Forecasting

    X. Wu, J. Jin, W. Qiu, P. Chen, Y. Shu, B. Yang, and C. Guo, “Aurora: Towards universal generative multimodal time series forecasting,” arXiv preprint arXiv:2509.22295, 2025.

A Implementation Details

This section records implementation-specific details that are not central to the main text. Exploratory learning and downstream inference share the same c...

  54. [54]

    Forecasts should preserve a believable 24-hour temperature rhythm across all 72 hourly steps

  55. [55]

    Treat storm text as a short-lived modifier unless it clearly signals a real regime break such as a cold front or snow

  56. [56]

    Start the forecast close to the last observed temperature before continuing the next diurnal cycle. Tool chain: sundial_base_128m_forecast -> timesfm2_forecast. Step records:

  57. [57]

    1. sundial_base_128m_forecast: produced a smooth 72-step temperature path that preserved overnight cooling and daytime warming. 2. timesfm2_forecast: produced a second real 72-step forecast candidate for the same horizon.

  58. [58]

    Final prediction metrics in the row artifact: MAE = 1.687, RMSE = 2.323, MAPE = 8.090, MSE = 5.398.

    execution_context field: The historical series shows a strong repeated 24-hour temperature rhythm with only a brief storm event near the boundary. sundial_base_128m_forecast preserved smooth overnight cooling and daytime warming while softening the warm endpo...
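The row-artifact metrics above (MAE, RMSE, MAPE, MSE) match the standard point-forecast error definitions. A minimal sketch of how such a metric block could be computed; the function name and dict layout are illustrative, not taken from the paper:

```python
import math

def forecast_metrics(y_true, y_pred):
    # Standard point-forecast error metrics; MAPE is in percent
    # and assumes the ground-truth values are nonzero.
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / len(errs)
    mse = sum(e * e for e in errs) / len(errs)
    rmse = math.sqrt(mse)
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errs, y_true)) / len(errs)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MSE": mse}
```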

  59. [59]

    Weight the last regime and endpoint behavior more than the full-month trend

  60. [60]

    Match the chosen bucket to the implied percentage move from the latest price

  61. [61]

    Treat routine bullish commentary as weaker magnitude evidence than a fresh company catalyst. Tool chain: value_at -> text_to_event -> llm_reasoning_classification. Step records: 1. value_at: latest observed price = 19.375, with pct_change_vs_reference = 35.55% from the first observed price. 2. text_to_event: events = [], n_events = 0. 3. llm_reasoning_classification...
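The value_at step above reports a percentage move of the latest price relative to the first observed price, and the surrounding guidelines match that move to a magnitude bucket. A minimal sketch of both computations; the bucket boundaries and labels are hypothetical, not from the paper:

```python
def pct_change_vs_reference(latest: float, reference: float) -> float:
    # Implied percentage move of the latest price vs. a reference price.
    return 100.0 * (latest - reference) / reference

def match_bucket(pct_move: float, buckets):
    # Return the label of the first bucket whose [low, high) range
    # contains the move; boundaries here are illustrative only.
    for label, low, high in buckets:
        if low <= pct_move < high:
            return label
    return "out_of_range"

# Hypothetical magnitude buckets for a classification task.
BUCKETS = [
    ("large_down", -100.0, -10.0),
    ("small_move", -10.0, 10.0),
    ("large_up", 10.0, 100.0),
]
```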

  62. [62]

    Classify from the mean of the last observed 24 hours versus the mean of the first predicted 24 hours using the exact ±0.5 thresholds
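The rule above compares two 24-hour means against exact ±0.5 cutoffs. One way this could look in code; the label names and the default threshold argument are assumptions, only the ±0.5 cutoff comes from the source:

```python
def classify_day_transition(last_obs_24h, first_pred_24h, threshold=0.5):
    # Difference between the first forecast-day mean and the last
    # observed-day mean, labeled with the exact +/-0.5 cutoffs.
    delta = (sum(first_pred_24h) / len(first_pred_24h)
             - sum(last_obs_24h) / len(last_obs_24h))
    if delta > threshold:
        return "increase"
    if delta < -threshold:
        return "decrease"
    return "stable"
```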

  63. [63]

    Prioritize the latest 24-hour regime and any end-aligned storm context over the full 14-day slope

  64. [64]

    Segment the series into exact daily windows before reasoning when the task is explicitly about comparing daily means. Tool chain: segment -> window_stats -> llm_reasoning_classification. Step records: 1. segment: the 336 hourly values were divided into 14 exact daily windows. 2. window_stats: on the last observed 24 hours, mean = 16.435416666666665. 3. llm_reasoning...
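The segment -> window_stats chain above splits 336 hourly values into 14 exact daily windows and takes per-window means. A minimal sketch of that step pair; the function names are illustrative, not the paper's tool implementations:

```python
def segment_daily(values, hours_per_day=24):
    # Split an hourly series into exact daily windows; the length must
    # divide evenly, as in the 336 -> 14-window example above.
    if len(values) % hours_per_day != 0:
        raise ValueError("series length must be a multiple of hours_per_day")
    return [values[i:i + hours_per_day]
            for i in range(0, len(values), hours_per_day)]

def window_mean(window):
    # window_stats-style mean of one daily window.
    return sum(window) / len(window)
```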

  65. [65]

    Base the label on the difference between the observed last-day mean and the first forecast-day mean rather than on the full-series slope

  66. [66]

    Treat compressed last-day windows near the ±0.5 cutoff as threshold-sensitive cases

  67. [67]

    When boundary weather makes the mean shift near the cutoff, compare more than one plausible next-day forecast scenario before classifying. Tool chain: segment -> chronos2_forecast -> window_stats -> window_stats -> aurora_forecast -> llm_reasoning_classification. Step records: 1. window_stats on the last observed 24-hour window gave mean = 20.239583333333332. 2...

  68. [68]

    Use aligned news as a light directional calibration unless it contains a clear near-horizon catalyst

  69. [69]

    Return exactly the prompt-specified number of float prices in the required format

  70. [70]

    Prefer timesfm2_forecast when a 5-minute financial series needs smooth fine-grained continuation rather than quantized level holding. Tool chain: timesfm2_forecast -> moirai1_1_r_base_forecast. Step records: 1. timesfm2_forecast: generated a smooth 78-step path clustered near 51.41 and slowly easing lower. 2. moirai1_1_r_base_forecast: generated a second 78-step cand...

  71. [71]

    Final prediction metrics in the row artifact: MAE = 0.190, RMSE = 0.219, MAPE = 0.371, MSE = 0.048.

    execution_context field: timesfm2_forecast generated a smooth 78-step path clustered near 51.41 and slowly easing lower. moirai1_1_r_base_forecast generated a much choppier 78-step path with repeated sharp oscillations and spikes. The news text was generic Nasdaq...