TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3
The pith
TimeClaw turns exploratory executions into reusable hierarchical patterns that improve time-series agent performance without retraining the base model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TimeClaw is an exploratory execution learning framework built around a four-stage loop of Explore, Compare, Distill, and Reinject. It applies metric-supervised comparison across candidate executions together with task-aware tool dropout to extract reusable hierarchical distilled experience. This experience is stored and reinjected at inference time to guide future decisions. The base model remains frozen throughout, with no test-time adaptation. In an evaluation spanning 17 tasks aligned with MTBench in finance, weather prediction, and reasoning, the system produces consistent gains over baselines. The authors conclude that the limiting factor in these scientific systems is less raw execution-time capability than how exploratory experience is compared, distilled, and reused.
What carries the argument
The four-stage Explore-Compare-Distill-Reinject loop that converts multiple exploratory executions into reusable hierarchical distilled experience through metric supervision and task-aware tool dropout.
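The loop that carries the argument can be sketched in a few lines. The toy agent, record format, and function signatures below are illustrative assumptions, not the paper's actual interfaces:

```python
# A minimal runnable sketch of the Explore-Compare-Distill-Reinject loop.
# The candidate record format ({"tools", "note", "error"}) is an assumption
# for illustration; the paper's real interfaces are not shown in this review.

def explore(agent, task, n=3, disabled_tools=()):
    # Explore: collect several candidate executions; task-aware tool dropout
    # is modeled by disabling some tools so exploration cannot collapse onto
    # an early-successful tool prior.
    return [agent(task, i, disabled_tools) for i in range(n)]

def compare(candidates):
    # Compare: rank candidates by a verifiable task metric (lower error
    # first), rather than stopping at the first valid execution.
    return sorted(candidates, key=lambda c: c["error"])

def distill(ranked, k=1):
    # Distill: keep the tool chains and rationales of the top-k executions
    # as reusable experience.
    return [{"tool_chain": c["tools"], "note": c["note"]} for c in ranked[:k]]

def reinject(task, experience):
    # Reinject: surface stored experience at inference time; the base model
    # itself stays frozen.
    return {"task": task, "hints": [e["note"] for e in experience]}
```

A caller would thread one task through all four stages: `reinject(task, distill(compare(explore(agent, task))))`.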
If this is right
- Produces consistent gains over baselines across 17 tasks in finance, weather prediction, and reasoning.
- Avoids tool-prior collapse by continuing exploration even after early valid executions.
- Enables reuse of hierarchical patterns at inference time while the base model stays frozen.
- Shifts the bottleneck in scientific AI agents from execution capability to how exploratory experience is compared, distilled, and reused.
Where Pith is reading between the lines
- The same loop could be applied to other tool-using agents in sequential decision domains such as code synthesis or robotic planning.
- Hierarchical distillation may allow static models to accumulate cross-task memory without periodic retraining.
- Focusing on experience reuse rather than model updates could simplify deployment and reduce compute costs in production settings.
Load-bearing premise
Metric-supervised comparison of exploratory executions can reliably extract reusable hierarchical patterns that improve later performance without selection bias or any need to adapt the base model.
What would settle it
On the same 17-task benchmark, removing the Compare and Distill stages or replacing metric comparison with random selection of executions eliminates the reported gains and returns performance to baseline levels.
Original abstract
Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TimeClaw, an exploratory execution learning framework for time-series AI agents. It uses a four-stage Explore-Compare-Distill-Reinject loop to convert exploratory executions into reusable hierarchical distilled experience, incorporating metric-supervised comparison, task-aware tool dropout, and inference-time reinjection while keeping the base model frozen. The central claim is that this approach yields consistent performance gains over baselines on 17 MTBench-aligned tasks spanning finance, weather prediction, and reasoning domains.
Significance. If the empirical results hold with proper controls, TimeClaw could advance LLM-based agents in numeric domains by addressing tool-prior collapse and enabling reuse of exploratory experience without online adaptation or fine-tuning. The frozen-base-model design and focus on verifiable numeric settings are strengths that distinguish it from typical execution-centric systems.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation section: The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.
- [Method] Method section (Compare stage): The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.
- [Experiments] Experiments section: No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.
minor comments (2)
- [Abstract] The term 'MTBench-aligned' is used without a clear definition or citation in the abstract; a brief explanation or reference should be added in the introduction.
- [Method] Notation for the hierarchical distilled experience could be clarified with a small diagram or pseudocode in the Method section to improve readability.
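For the second minor comment, a record structure consistent with the fields visible elsewhere in this review (tool chains, step records, an execution_context summary, and metric rows) might look like the following sketch. All field names here are assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of one unit of hierarchical distilled experience,
# inferred only from fields quoted in this review's appendix excerpts.
@dataclass
class DistilledExperience:
    task_type: str                  # e.g. "weather_forecast_72h" (assumed key)
    rules: list[str]                # high-level reusable guidance strings
    tool_chain: list[str]           # ordered tool calls that worked well
    step_records: list[str]         # per-step observations from the run
    execution_context: str = ""     # narrative summary of the execution
    metrics: dict[str, float] = field(default_factory=dict)  # MAE, RMSE, ...
```

The hierarchy is carried by the split between coarse `rules` (reusable across tasks) and fine-grained `tool_chain`/`step_records` (specific to one execution).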
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: The claim that TimeClaw 'delivers consistent gains over the baselines' on 17 tasks is presented without any quantitative details on effect sizes, variance across runs, specific baseline implementations, or statistical tests. This absence leaves the magnitude and reliability of the improvements unassessable and directly weakens support for the central claim.
Authors: We agree that the abstract would be improved by including quantitative anchors for the central claim. In the revised manuscript we have updated the abstract to reference the specific tables in the Experiments section that report average improvements, effect sizes, standard deviations across runs, baseline implementation details, and statistical test results. The full quantitative support remains in the evaluation, now more explicitly signposted from the abstract. revision: yes
-
Referee: [Method] Method section (Compare stage): The metric-supervised comparison procedure lacks description of controls against selection bias, such as how high-metric paths are sampled, whether dropout prevents over-representation of lucky executions, or validation that distilled patterns generalize beyond the current instance. Without these, the Explore-Compare-Distill-Reinject loop risks amplifying instance-specific successes rather than learning reusable hierarchical patterns.
Authors: We acknowledge that the original description of the Compare stage was insufficiently explicit about bias controls. The revised Method section now details the sampling rule for high-metric paths (top-k selection with a task-metric threshold), the precise application of task-aware tool dropout to reduce lucky-execution bias, and the cross-instance validation procedure used to confirm that distilled patterns transfer beyond the source instance. These additions clarify how the loop favors reusable hierarchical patterns. revision: yes
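The sampling rule the authors describe (top-k selection with a task-metric threshold) might look like the following; `k` and the threshold value are illustrative parameters, not the paper's settings:

```python
# Sketch of the rebuttal's bias-controlled selection: filter candidate
# executions by a task-metric threshold, then keep only the top-k, so a
# single lucky low-error run cannot dominate distillation on its own.

def select_high_metric_paths(executions, k=3, max_error=1.0):
    # Threshold first: discard executions whose error exceeds the cutoff.
    eligible = [e for e in executions if e["error"] <= max_error]
    # Then top-k by error, ascending (best first).
    return sorted(eligible, key=lambda e: e["error"])[:k]
```

Distilling from a small set of threshold-passing runs, rather than the single best run, is what gives the dropout and cross-instance validation something to average over.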
-
Referee: [Experiments] Experiments section: No details are provided on the exact composition of the 17 tasks, how MTBench alignment was performed, or ablation studies isolating the contribution of the Distill and Reinject stages versus simple exploration. This makes it impossible to verify that the reported gains stem from the proposed learning mechanism rather than other factors.
Authors: We agree that greater transparency is required. The revised Experiments section now contains an explicit table enumerating all 17 tasks with their domains and metrics, a description of the MTBench alignment procedure (category mapping and prompt adaptation for time-series tool use), and new ablation experiments that isolate the Distill and Reinject stages against a pure-exploration baseline. These ablations demonstrate the incremental contribution of each component. revision: yes
Circularity Check
No circularity: empirical gains from independent framework on external benchmarks
full rationale
The paper describes an empirical four-stage loop (Explore-Compare-Distill-Reinject) evaluated on 17 MTBench-aligned tasks spanning finance, weather, and reasoning, reporting consistent gains over baselines while keeping the base model frozen. No equations, derivations, or first-principles claims are present in the provided text that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central claim rests on external task performance rather than any internal reduction to the method's own inputs by construction. This is a standard self-contained empirical result with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multiple distinct tool-use sequences can be valid for the same time-series task yet differ in quantitative quality.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "four-stage loop: Explore, Compare, Distill, and Reinject... metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience"
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3) · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "9-tuple memory rule r=(κ,σ,χ,T+,T−,ρ,E,c,ι)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,
P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y. Xing, J. Tang, and B. Dumoulin, “TRAJECT-bench: A trajectory-aware benchmark for evaluating agentic tool use,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=TZWnWvsQ0X
-
[3]
Continuous-time value iteration for multi-agent reinforcement learning,
X. Wang, L. Zhang, H. Pu, A. H. Qureshi, and H. Li, “Continuous-time value iteration for multi-agent reinforcement learning,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7N0ugLE17t
-
[4]
B. Liu, S. Yu, Z. Liu, L. Guertler, P. Qi, D. Balcells, M. Liu, C. Tan, W. Shi, M. Lin, W. S. Lee, and N. Jaques, “SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=7...
-
[5]
CoMind: Towards community-driven agents for machine learning engineering,
S. Li, W. Sun, S. Li, A. Talwalkar, and Y. Yang, “CoMind: Towards community-driven agents for machine learning engineering,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=P479BoN8BD
-
[6]
AgentBench: Evaluating LLMs as agents,
X. Liu, H. Yu, H. Zhang, X. Li, X. Dai, Y. Dong, A. Zeng et al., “AgentBench: Evaluating LLMs as agents,” in The Twelfth International Conference on Learning Representations, 2024
-
[7]
WebArena: A realistic web environment for building autonomous agents,
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, G. Neubig et al., “WebArena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, 2024
-
[8]
AutoGen: Enabling next-gen LLM applications via multi-agent conversation,
Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,” arXiv preprint arXiv:2308.08155, 2023
-
[9]
Zephyrus: An agentic framework for weather science,
S. Varambally, M. Fisher, J. Thakker, Y. Chen, Z. Xia, R. Niu, Y. Jafari, V. V. Manivannan, Z. Novack, L. Han, S. Eranky, S. Rühling Cachay, T. Berg-Kirkpatrick, D. Watson-Parris, Y. Ma, and R. Yu, “Zephyrus: An agentic framework for weather science,” in NeurIPS 2025 AI for Science Workshop, 2025. [Online]. Available: https://openreview.net/forum?id=F...
-
[10]
USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,
S. Lai, Y. Ning, Z. Yuan, Z. Chen, and H. Liu, “USTBench: Benchmarking and dissecting spatiotemporal reasoning capabilities of LLMs as urban agents,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=ETzBStUFJy
-
[11]
J. Jeon, M. Cho, and Y. Sung, “STAIRS-former: Spatio-temporal attention with interleaved recursive structure transformer for offline multi-task multi-agent reinforcement learning,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=Biz1vpQeLI
-
[12]
NetArena: Dynamic benchmarks for AI agents in network automation,
Y. Zhou, J. Ruan, E. S. Wang, S. Fouladi, F. Y. Yan, K. Hsieh, and Z. Liu, “NetArena: Dynamic benchmarks for AI agents in network automation,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=BPVPOtzoOz
-
[13]
MTBench: A multimodal time series benchmark for temporal reasoning and question answering,
J. Chen, A. Feng, Z. Zhao, J. Garza, G. Nurbek, C. Qin, A. Maatouk, L. Tassiulas, Y. Gao, and R. Ying, “MTBench: A multimodal time series benchmark for temporal reasoning and question answering,” arXiv preprint arXiv:2503.16858, 2025
-
[14]
M. Weng, D. Cao, W. Yang, Y. Sharma, and Y. Liu, “TemporalBench: A benchmark for evaluating LLM-based agents on contextual and event-informed time series tasks,” arXiv preprint arXiv:2602.13272, 2026
-
[15]
TimeART: Towards agentic time series reasoning via tool-augmentation,
X. Wu, J. Lu, Z. Li, X. Qiu, J. Hu, C. Guo, C. S. Jensen, and B. Yang, “TimeART: Towards agentic time series reasoning via tool-augmentation,” arXiv preprint arXiv:2601.13653, 2026
-
[16]
TS-Agent: Understanding and reasoning over raw time series via iterative insight gathering,
P. Liu, E. Fons, A. Vapsi, M. Ghassemi, S. Vyetrenko, D. Borrajo, V. K. Potluru, and M. Veloso, “TS-Agent: Understanding and reasoning over raw time series via iterative insight gathering,” arXiv preprint arXiv:2510.07432, 2025
-
[17]
TimeSeriesScientist: A general-purpose AI agent for time series analysis,
H. Zhao, X. Zhang, J. Wei, Y. Xu, Y. He, S. Sun, and C. You, “TimeSeriesScientist: A general-purpose AI agent for time series analysis,” arXiv preprint arXiv:2510.01538, 2025
-
[18]
G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen, “TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025
-
[19]
Cast-R1: Learning tool-augmented sequential decision policies for time series forecasting,
X. Tao, M. Cheng, C. Jiang, T. Gao, H. Zhang, and Y. Liu, “Cast-R1: Learning tool-augmented sequential decision policies for time series forecasting,” arXiv preprint arXiv:2602.13802, 2026
-
[20]
Y. Jiang, W. Yu, G. Lee, D. Song, K. Shin, W. Cheng, Y. Liu, and H. Chen, “TimeXL: Explainable multi-modal time series prediction with LLM-in-the-loop,” arXiv preprint arXiv:2503.01013, 2025
-
[21]
Empowering time series forecasting with LLM-agents,
C.-C. M. Yeh, V. Lai, U. S. Saini, X. Fan, Y. Fan, J. Wang, X. Dai, and Y. Zheng, “Empowering time series forecasting with LLM-agents,” arXiv preprint arXiv:2508.04231, 2025
-
[22]
Toolformer: Language models can teach themselves to use tools,
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems, vol. 36, 2023
-
[23]
Gorilla: Large language model connected with massive APIs,
S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” in Advances in Neural Information Processing Systems, vol. 37, 2024
-
[24]
ToolLLM: Facilitating large language models to master 16000+ real-world APIs,
Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” in The Twelfth International Conference on Learning Representations, 2024
-
[25]
MemSkill: Learning and evolving memory skills for self-evolving agents,
H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang, “MemSkill: Learning and evolving memory skills for self-evolving agents,” arXiv preprint arXiv:2602.02474, 2026
-
[26]
T. Y. Yim, W. Tan, S. Y. Chan, T.-W. Lam, and S. M. Yiu, “ASDA: Automated skill distillation and adaptation for financial reasoning,” arXiv preprint arXiv:2603.16112, 2026
-
[27]
Skill-SD: Skill-conditioned self-distillation for multi-turn LLM agents,
H. Wang, G. Wang, H. Xiao, Y. Zhou, Y. Pan, J. Wang, K. Xu, Y. Wen, X. Ruan, X. Chen, and H. Qi, “Skill-SD: Skill-conditioned self-distillation for multi-turn LLM agents,” arXiv preprint arXiv:2604.10674, 2026
-
[28]
ReAct: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in The Eleventh International Conference on Learning Representations, 2023
-
[29]
Reflexion: Language agents with verbal reinforcement learning,
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems, vol. 36, 2023
-
[30]
OpenClaw-RL: Train any agent simply by talking,
Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang, “OpenClaw-RL: Train any agent simply by talking,” arXiv preprint arXiv:2603.10165, 2026
-
[31]
CUA-Skill: Develop skills for computer using agent,
T. Chen, Y. Li, M. Solodko, S. Wang, N. Jiang, T. Cui, J. Hao, J. Ko, S. Abdali, L. Xu, S. Zheng, H. Fan, P. Cameron, J. Wagle, and K. Koishida, “CUA-Skill: Develop skills for computer using agent,” arXiv preprint arXiv:2601.21123, 2026
-
[32]
S. Liu, C. Li, C. Wang, J. Hou, Z. Chen, L. Zhang, Z. Liu, Q. Ye, Y. Hei, X. Zhang, and Z. Wang, “ClawKeeper: Comprehensive safety protection for OpenClaw agents through skills, plugins, and watchers,” arXiv preprint arXiv:2603.24414, 2026
-
[33]
TS-Reasoner: Domain-oriented time series inference agents for reasoning and automated analysis,
W. Ye, W. Yang, D. Cao, Y. Zhang, L. Tang, J. Cai, and Y. Liu, “Domain-oriented time series inference agents for reasoning and automated analysis,” arXiv preprint arXiv:2410.04047, 2024
-
[34]
FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,
G. Jalori, P. Verma, and S. O. Arık, “FLAIRR-TS: Forecasting LLM-agents with iterative refinement and retrieval for time series,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025
-
[35]
TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,
P. Trirat, J. M. Kwak, J. Heo, H. Lee, and S. J. Hwang, “TS-Debate: Multimodal collaborative debate for zero-shot time series reasoning,” arXiv preprint arXiv:2601.19151, 2026
-
[36]
Visual reasoning over time series via multi-agent system,
W. Ruan and Y. Liang, “Visual reasoning over time series via multi-agent system,” arXiv preprint arXiv:2602.03026, 2026
-
[37]
When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,
W. Ye, J. Liu, D. Cao, W. Yang, and Y. Liu, “When LLM meets time series: Can LLMs perform multi-step time series reasoning and inference,” arXiv preprint arXiv:2509.01822, 2025
-
[38]
Y. Zhang, Y. Feng, D. Li, K. Zhang, J. Chen, and B. Deng, “Can competition enhance the proficiency of agents powered by large language models in the realm of news-driven time series forecasting?” arXiv preprint arXiv:2504.10210, 2025
-
[39]
Voyager: An open-ended embodied agent with large language models,
G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023
-
[40]
MemGPT: Towards LLMs as operating systems,
C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,” arXiv preprint arXiv:2310.08560, 2023
-
[41]
STaR: Bootstrapping reasoning with reasoning,
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reasoning,” in Advances in Neural Information Processing Systems, vol. 35, 2022
-
[42]
Reinforced self-training (ReST) for language modeling,
C. Gulcehre, T. Le Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Reinforced self-training (ReST) for language modeling,” arXiv preprint arXiv:2308.08998, 2023
-
[43]
ClawSafety: “Safe” LLMs, unsafe agents,
B. Wei, Y. Zhang, J. Pan, K. Mei, X. Wang, J. Hamm, Z. Zhu, and Y. Ge, “ClawSafety: “safe” LLMs, unsafe agents,” arXiv preprint arXiv:2604.01438, 2026
-
[44]
SafeClaw-R: Towards safe and secure multi-agent personal assistants,
H. Wang, Z. Xiao, Y. Zhang, C. M. Poskitt, and J. Sun, “SafeClaw-R: Towards safe and secure multi-agent personal assistants,” arXiv preprint arXiv:2603.28807, 2026
-
[45]
Chronos: Learning the language of time series,
A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur et al., “Chronos: Learning the language of time series,” arXiv preprint arXiv:2403.07815, 2024
-
[46]
G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, “Moirai: Unified training of universal time series forecasting transformers,” arXiv preprint arXiv:2402.02592, 2024
-
[47]
A decoder-only foundation model for time-series forecasting,
A. Das, W. Kong, R. Sen, and Y. Zhou, “A decoder-only foundation model for time-series forecasting,” arXiv preprint arXiv:2310.10688, 2024
-
[48]
Time-LLM: Time series forecasting by reprogramming large language models,
M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan, and Q. Wen, “Time-LLM: Time series forecasting by reprogramming large language models,” in The Twelfth International Conference on Learning Representations, 2024
-
[49]
MOMENT: A family of open time-series foundation models,
M. Goswami, K. Szafer, A. Choudhry, Y. Cai, S. Li, and A. Dubrawski, “MOMENT: A family of open time-series foundation models,” arXiv preprint arXiv:2402.03885, 2024
-
[50]
K. Rasul, A. Ashok, A. R. Williams, H. Ghonia, R. Bhagwatkar, A. Khorasani, M. J. Darvishi Bayazi et al., “Lag-Llama: Towards foundation models for probabilistic time series forecasting,” arXiv preprint arXiv:2310.08278, 2023
-
[51]
Sundial: A family of highly capable time series foundation models,
Y. Liu, G. Qin, Z. Shi, Z. Chen, C. Yang, X. Huang, J. Wang, and M. Long, “Sundial: A family of highly capable time series foundation models,” arXiv preprint arXiv:2502.00816, 2025
-
[52]
Timer: Transformers for time series analysis at scale,
Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: Transformers for time series analysis at scale,” arXiv preprint arXiv:2402.02368, 2024
-
[53]
Aurora: Towards universal generative multimodal time series forecasting,
X. Wu, J. Jin, W. Qiu, P. Chen, Y. Shu, B. Yang, and C. Guo, “Aurora: Towards universal generative multimodal time series forecasting,” arXiv preprint arXiv:2509.22295, 2025
Appendix A, Implementation Details: This section records implementation-specific details that are not central to the main text. Exploratory learning and downstream inference share the same c...
Distilled experience excerpts
- Forecasts should preserve a believable 24-hour temperature rhythm across all 72 hourly steps.
- Treat storm text as a short-lived modifier unless it clearly signals a real regime break such as a cold front or snow.
- Start the forecast close to the last observed temperature before continuing the next diurnal cycle. Tool chain: sundial_base_128m_forecast -> timesfm2_forecast. Step records: 1. sundial_base_128m_forecast produced a smooth 72-step temperature path that preserved overnight cooling and daytime warming; 2. timesfm2_forecast produced a second real 72-step forecast candidate for the same horizon.
- Final prediction metrics in the row artifact: MAE = 1.687, RMSE = 2.323, MAPE = 8.090, MSE = 5.398. execution_context field: the historical series shows a strong repeated 24-hour temperature rhythm with only a brief storm event near the boundary; sundial_base_128m_forecast preserved smooth overnight cooling and daytime warming while softening the warm endpo...
- Weight the last regime and endpoint behavior more than the full-month trend.
- Match the chosen bucket to the implied percentage move from the latest price.
- Treat routine bullish commentary as weaker magnitude evidence than a fresh company catalyst. Tool chain: value_at -> text_to_event -> llm_reasoning_classification. Step records: 1. value_at: latest observed price = 19.375, with pct_change_vs_reference = 35.55% from the first observed price; 2. text_to_event: events = [], n_events = 0; 3. llm_reasoning_classification...
- Classify from the mean of the last observed 24 hours versus the mean of the first predicted 24 hours using the exact ±0.5 thresholds.
- Prioritize the latest 24-hour regime and any end-aligned storm context over the full 14-day slope.
- Segment the series into exact daily windows before reasoning when the task is explicitly about comparing daily means. Tool chain: segment -> window_stats -> llm_reasoning_classification. Step records: 1. segment: the 336 hourly values were divided into 14 exact daily windows; 2. window_stats: on the last observed 24 hours, mean = 16.435416666666665; 3. llm_reasoning...
- Base the label on the difference between the observed last-day mean and the first forecast-day mean rather than on the full-series slope.
- Treat compressed last-day windows near the ±0.5 cutoff as threshold-sensitive cases.
- When boundary weather makes the mean shift near the cutoff, compare more than one plausible next-day forecast scenario before classifying. Tool chain: segment -> chronos2_forecast -> window_stats -> window_stats -> aurora_forecast -> llm_reasoning_classification. Step records: 1. window_stats on the last observed 24-hour window gave mean = 20.239583333333332...
- Use aligned news as a light directional calibration unless it contains a clear near-horizon catalyst.
- Return exactly the prompt-specified number of float prices in the required format.
- Prefer timesfm2_forecast when a 5-minute financial series needs smooth fine-grained continuation rather than quantized level holding. Tool chain: timesfm2_forecast -> moirai1_1_r_base_forecast. Step records: 1. timesfm2_forecast generated a smooth 78-step path clustered near 51.41 and slowly easing lower; 2. moirai1_1_r_base_forecast generated a second 78-step cand...
- Final prediction metrics in the row artifact: MAE = 0.190, RMSE = 0.219, MAPE = 0.371, MSE = 0.048. execution_context field: timesfm2_forecast generated a smooth 78-step path clustered near 51.41 and slowly easing lower, while moirai1_1_r_base_forecast generated a much choppier 78-step path with repeated sharp oscillations and spikes; the news text was generic Nasdaq...
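The row artifacts report MAE, RMSE, MAPE, and MSE for each final prediction. Under the standard definitions of these metrics, and assuming MAPE is reported in percent (a convention the excerpts do not confirm), they can be recomputed as:

```python
import math

# Recomputes the four metrics reported in the row artifacts from a pair of
# series. Assumes nonzero targets for MAPE; the paper's exact conventions
# are not shown in the excerpts.

def forecast_metrics(y_true, y_pred):
    n = len(y_true)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / n                 # mean absolute error
    mse = sum(sq_err) / n                  # mean squared error
    rmse = math.sqrt(mse)                  # root mean squared error
    # MAPE in percent, relative to the absolute target at each step.
    mape = 100.0 * sum(a / abs(t) for a, t in zip(abs_err, y_true)) / n
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MSE": mse}
```

Note that RMSE is always the square root of MSE, which the artifact rows above are consistent with (e.g. 2.323 ≈ sqrt(5.398)).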