STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

Flora D. Salim; Hao Xue; Lihuan Li; Ruiyi Yang

arxiv: 2605.10057 · v3 · pith:CZGDPYCFnew · submitted 2026-05-11 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-19 14:40 UTC · model grok-4.3

keywords failure-aware routing · multi-agent systems · spatiotemporal reasoning · Markovian routing · recovery transitions · execution traces · LLM tool augmentation · agent routing matrix

The pith

STAR models inter-agent routing as a Markovian transition policy conditioned on typed failure states to learn specific recovery transitions from unsuccessful traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STAR as a framework that externalizes routing decisions among heterogeneous specialist agents in spatiotemporal reasoning tasks. Instead of leaving recovery implicit in language generation, it uses a routing matrix that blends expert-defined nominal paths with transitions learned from both successful and failed executions. The matrix distinguishes failure categories such as malformed outputs, missing dependencies, and tool mismatches, so the system can respond differently rather than issuing generic retries. Retaining unsuccessful traces during training expands the policy's coverage of error states, which the authors show produces measurable gains on queries that deviate from expected routes. This approach is tested across three benchmarks and eight backbone models, with the largest benefits appearing precisely where nominal routing breaks.

Core claim

STAR externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At its center is an agent routing matrix that fuses expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states rather than collapsing them, the router can select different recoveries for malformed outputs, missing dependencies, and tool-query mismatches. Retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. This yields improvements over prior

What carries the argument

The agent routing matrix, a state-conditioned transition policy that mixes nominal routes with learned recoveries conditioned on typed failure categories.
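A minimal sketch of such a matrix, assuming a dictionary keyed by (agent, task type, status) and a simple convex mix between expert and learned distributions. The status labels are borrowed from the Figure 6 caption; the agent names, task names, `fuse`, `route`, and the mixing weight `lam` are hypothetical and not taken from the paper.

```python
# Status labels as listed in the Figure 6 caption.
STATUSES = ("SUCCESS", "FAIL", "INFO_MISSING", "BLOCKED")

def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()} if total > 0 else {}

def fuse(nominal, learned, lam=0.5):
    """Convex mix of an expert prior and a learned distribution.

    lam is an assumed knob; the paper's actual fusion rule may differ.
    """
    nominal, learned = normalize(nominal), normalize(learned)
    if not learned:
        return nominal
    if not nominal:
        return learned
    agents = set(nominal) | set(learned)
    return {a: lam * nominal.get(a, 0.0) + (1 - lam) * learned.get(a, 0.0)
            for a in agents}

def route(matrix, agent, task_type, status):
    """Return the most probable next agent for (agent, task_type, status)."""
    dist = matrix.get((agent, task_type, status), {})
    return max(dist, key=dist.get) if dist else None

# Expert-specified nominal route (hypothetical).
nominal = {("geometric", "trajectory_query", "SUCCESS"): {"temporal": 1.0}}
# Transition counts gathered from execution traces (hypothetical).
learned = {
    ("geometric", "trajectory_query", "SUCCESS"): {"temporal": 8, "trajectory": 2},
    ("geometric", "trajectory_query", "FAIL"): {"geometric": 3, "topological": 1},
    ("geometric", "trajectory_query", "INFO_MISSING"): {"temporal": 4, "geometric": 1},
}
matrix = {
    state: fuse(nominal.get(state, {}), learned.get(state, {}))
    for state in set(nominal) | set(learned)
}

print(route(matrix, "geometric", "trajectory_query", "FAIL"))
print(route(matrix, "geometric", "trajectory_query", "INFO_MISSING"))
```

In this toy data the two error statuses lead to different successors from the same agent, which is the behavior the typed conditioning is meant to allow.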

If this is right

The routing policy acquires explicit support on error states, allowing recovery transitions absent from success-only training.
Improvements appear most clearly on queries whose execution deviates from the nominal routing path.
Typed failure-aware routing, rather than specialist composition alone, drives the observed gains across benchmarks.
The blackboard protocol for intermediate results supports downstream fusion once recovery transitions restore valid state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same matrix structure could be applied to other multi-agent domains where execution paths have qualitatively different failure modes.
Explicit state tracking may reduce reliance on prompt-based recovery heuristics in tool-augmented LLM systems.
Retaining failure traces suggests a general training principle for policy learning in environments with sparse success signals.

Load-bearing premise

Failure states can be accurately and consistently typed into distinct categories such as malformed outputs or missing dependencies during execution, so the matrix can learn type-specific recoveries instead of treating all errors as one signal.

What would settle it

A controlled run on the same benchmarks where failure types are deliberately collapsed into a single generic error signal or where unsuccessful traces are discarded, showing that the reported gains on deviated queries disappear.
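The collapsed-type control can be sketched as follows. This is an illustrative toy, assuming transitions are logged as (agent, status, next agent) triples; the function name `estimate`, the `ERROR` label, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def estimate(transitions, collapse=False):
    """Most frequent successor per (agent, status).

    With collapse=True every non-SUCCESS status is merged into one
    generic ERROR signal, mimicking the untyped control.
    """
    counts = defaultdict(Counter)
    for agent, status, nxt in transitions:
        if collapse and status != "SUCCESS":
            status = "ERROR"
        counts[(agent, status)][nxt] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical log: retries follow FAIL, a hand-off follows INFO_MISSING.
transitions = (
    [("geometric", "FAIL", "geometric")] * 3
    + [("geometric", "INFO_MISSING", "temporal")] * 2
)

typed = estimate(transitions)
untyped = estimate(transitions, collapse=True)
print(typed)
print(untyped)
```

In the collapsed variant the less frequent recovery is absorbed by the more frequent one, so the router can no longer hand off to a different agent when information is missing. Whether that loss explains the reported gains is what the proposed control would measure.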

Figures

Figures reproduced from arXiv: 2605.10057 by Flora D. Salim, Hao Xue, Lihuan Li, Ruiyi Yang.

**Figure 1.** STAR architecture. Queries are parsed into a task profile; the failure-aware routing matrix selects specialists conditioned on the current agent, task type, and execution status; and specialists execute through an extract-compute-deposit protocol over a shared blackboard before final fusion.

**Figure 2.** State-conditioned routing matrix slices for a representative task type. Each panel visualizes …

**Figure 3.** Failure-aware routing and execution feedback in …

**Figure 4.** Routing example through the dual-system kernel. System 1 nominal routes (green arrows) …

**Figure 5.** Empirical precision–coverage trade-off of …

**Figure 6.** Learned transition matrix M averaged across task types, rendered as four status-conditional heatmaps (SUCCESS / FAIL / INFO_MISSING / BLOCKED). Rows index the from_agent, columns index the to_agent, and each cell is P(next | from, status). Each benchmark produces a qualitatively different SUCCESS map (because task taxonomies differ), but every benchmark exhibits the same structural property: error-state rows…

**Figure 7.** Top-3 successors per failure status, grouped by originating agent. For every …
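The averaging described in the Figure 6 caption can be illustrated with a short sketch. It assumes the matrix is stored as a dictionary keyed by (from_agent, task type, status) and that task types are weighted uniformly; both are assumptions, and `status_heatmaps` is a hypothetical helper.

```python
def status_heatmaps(matrix, agents, statuses):
    """Average P(next | from, task, status) over task types.

    Returns heatmaps[status][from_agent][to_agent] = P(next | from, status).
    Rows with no observations are left as all zeros.
    """
    heatmaps = {}
    for status in statuses:
        rows = {}
        for frm in agents:
            # Gather this row's distribution under every task type that defines it.
            dists = [d for (a, _task, s), d in matrix.items()
                     if a == frm and s == status]
            rows[frm] = {
                to: (sum(d.get(to, 0.0) for d in dists) / len(dists) if dists else 0.0)
                for to in agents
            }
        heatmaps[status] = rows
    return heatmaps

agents = ["geometric", "temporal"]
matrix = {
    ("geometric", "distance_query", "FAIL"): {"geometric": 1.0},
    ("geometric", "duration_query", "FAIL"): {"geometric": 0.5, "temporal": 0.5},
    ("geometric", "distance_query", "SUCCESS"): {"temporal": 1.0},
}
heatmaps = status_heatmaps(matrix, agents, ["SUCCESS", "FAIL"])
fail_row = heatmaps["FAIL"]["geometric"]
print(fail_row)
```

Because each per-task row is a distribution, the uniform average of the observed rows is again a distribution, so each non-empty heatmap row still sums to one.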

Original abstract

Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool–query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract–compute–deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines, with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
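The extract, compute, deposit cycle over a shared blackboard might look roughly like the sketch below. The abstract does not describe the protocol's interface, so the `Blackboard` class, the two specialists, the query fields, and the returned status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from math import hypot

@dataclass
class Blackboard:
    """Shared store that specialists write intermediate results to."""
    entries: dict = field(default_factory=dict)

    def deposit(self, key, value, source):
        self.entries[key] = {"value": value, "source": source}

    def get(self, key):
        entry = self.entries.get(key)
        return None if entry is None else entry["value"]

def distance_specialist(query, board):
    # Extract: pull structured arguments out of the query.
    try:
        a, b = query["points"]
    except (KeyError, TypeError, ValueError):
        return "INFO_MISSING"
    # Compute: call a deterministic tool rather than generating the number.
    d = hypot(b[0] - a[0], b[1] - a[1])
    # Deposit: write the result where downstream agents can read it.
    board.deposit("distance", d, source="geometric")
    return "SUCCESS"

def speed_specialist(query, board):
    # Extract: this agent depends on a value another agent must have deposited.
    d = board.get("distance")
    dt = query.get("duration_s")
    if d is None or dt is None:
        return "INFO_MISSING"
    if dt <= 0:
        return "FAIL"
    board.deposit("speed", d / dt, source="temporal")
    return "SUCCESS"

board = Blackboard()
query = {"points": ((0, 0), (3, 4)), "duration_s": 2.0}
print(speed_specialist(query, board))     # dependency not yet on the board
print(distance_specialist(query, board))
print(speed_specialist(query, board))
print(board.get("speed"))
```

Each specialist returns a typed status instead of raising, which is the kind of signal a router could condition on when deciding where to go next.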

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STAR adds explicit typed-failure Markov routing to multi-agent LLM systems, but the abstract leaves the experimental claims hard to assess.

The main thing here is that STAR externalizes routing among specialist agents as a state-conditioned Markov matrix that mixes expert nominal routes with recoveries learned from traces that include failures. This lets the system treat malformed outputs, missing dependencies, and tool mismatches as distinct signals instead of a generic retry. The blackboard and extract-compute-deposit protocol keep intermediate results structured for downstream fusion in spatiotemporal tasks. That framing addresses a real pain point in composing heterogeneous agents where execution paths often deviate. The idea of enlarging policy support on error states by retaining unsuccessful traces is a straightforward but useful distinction from success-only training. The paper does a clean job spelling out why implicit language-model routing is hard to interpret or optimize.

The stress-test concern about failure-typing reliability is on target; if the typing step is noisy or inconsistent, the learned transitions lose their type-specific advantage and the gains on deviated queries become hard to attribute. The abstract reports improvements across three benchmarks and eight LLMs with clearest benefits on non-nominal paths, yet supplies no numbers, controls, or statistical details. Without those, it is difficult to judge whether the typed matrix is doing the work or whether other factors in the framework explain the results. The assumption that failures can be typed accurately and consistently during trace collection is load-bearing and needs more evidence.

This paper is for researchers building multi-agent pipelines for compositional reasoning, especially when reliability on error paths matters. A reader looking for concrete mechanisms to make routing decisions more controllable would get value from the matrix construction and the trace-learning angle. It deserves a serious referee because the core problem is genuine and the proposed structure is clear enough to evaluate. I would send it to peer review with a request for fuller experimental reporting and checks on typing consistency.

Referee Report

2 major / 2 minor

Summary. The paper introduces STAR, a failure-aware Markovian routing framework for multi-agent spatiotemporal reasoning. It externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At its core is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces (including unsuccessful ones). The central claim is that conditioning on distinct failure types (malformed outputs, missing dependencies, tool-query mismatches) allows type-specific recoveries that enlarge policy support on error states, unlike success-only training or generic signals. Empirical results across three spatiotemporal benchmarks and eight backbone LLMs show improvements over baselines, with clearest gains on queries whose execution deviates from the nominal path.

Significance. If the empirical claims and the role of typed failure-aware routing hold after proper controls, this would be a meaningful contribution to multi-agent LLM systems. It provides an explicit, optimizable mechanism for handling qualitatively different failures rather than ad-hoc language-based recovery, and the use of unsuccessful traces to learn recovery transitions is a potentially useful idea for enlarging policy support on error states.

major comments (2)

[Abstract / Results] Abstract and Results: the claim that retaining unsuccessful traces 'enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent' is load-bearing for the central contribution, yet the abstract provides no details on experimental setup, statistical significance, controls, or how success-only baselines were constructed. Without these, it is impossible to verify whether the reported gains on deviated paths are attributable to the typed recovery transitions rather than other factors.
[Routing matrix / failure typing] Routing matrix description (central mechanism): the framework assumes failure states can be accurately and consistently typed into distinct categories during trace collection so that the matrix can learn type-specific recoveries. The paper should provide evidence (e.g., typing accuracy, inter-rater agreement, or an ablation on untyped vs. typed failures) because systematic misclassification would cause the learned transitions to collapse, rendering the gains on deviated paths indistinguishable from a generic-retry or success-only baseline.

minor comments (2)

[Abstract] The abstract is dense; splitting the description of the routing matrix from the empirical claims would improve readability.
[Method] Notation for the transition policy and routing matrix should be introduced with a clear equation or diagram early in the method section to avoid ambiguity when discussing nominal vs. recovery transitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important aspects of clarity and validation that will strengthen the manuscript. We address each major comment below and commit to revisions that incorporate the requested details and evidence.

Referee: [Abstract / Results] Abstract and Results: the claim that retaining unsuccessful traces 'enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent' is load-bearing for the central contribution, yet the abstract provides no details on experimental setup, statistical significance, controls, or how success-only baselines were constructed. Without these, it is impossible to verify whether the reported gains on deviated paths are attributable to the typed recovery transitions rather than other factors.

Authors: We agree that the abstract and results presentation would benefit from greater specificity to support the central claim. In the revised manuscript we will expand the abstract to summarize the experimental setup (three spatiotemporal benchmarks, eight backbone LLMs, multiple independent runs), note that statistical significance was evaluated with paired tests across runs, and briefly describe the success-only baseline construction (routing matrix trained exclusively on successful traces). The results section will be augmented with an explicit subsection on controls, including direct comparison to a generic-retry baseline and reporting of standard deviations and p-values. These additions will make the attribution of gains on deviated paths to typed recovery transitions explicit and verifiable. revision: yes
Referee: [Routing matrix / failure typing] Routing matrix description (central mechanism): the framework assumes failure states can be accurately and consistently typed into distinct categories during trace collection so that the matrix can learn type-specific recoveries. The paper should provide evidence (e.g., typing accuracy, inter-rater agreement, or an ablation on untyped vs. typed failures) because systematic misclassification would cause the learned transitions to collapse, rendering the gains on deviated paths indistinguishable from a generic-retry or success-only baseline.

Authors: We accept that explicit validation of the failure-typing process is required. The manuscript currently defines three failure categories from execution traces (malformed outputs, missing dependencies, tool-query mismatches) via a combination of deterministic rules and LLM-assisted labeling. In revision we will add (1) an ablation that trains and evaluates an untyped variant in which all failure states share a single recovery transition, and (2) quantitative evidence of typing reliability: accuracy on a held-out manually annotated subset together with inter-annotator agreement (Cohen’s kappa). The ablation will directly test whether type-specific transitions provide benefit beyond a collapsed generic-retry policy; if misclassification were dominant, the typed and untyped curves would be statistically indistinguishable. revision: yes
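For reference, the inter-annotator agreement statistic named in the response, Cohen's kappa, can be computed as in the sketch below. The label lists are invented for illustration and do not come from the paper; only the three category names echo the text.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical failure-type annotations for four traces.
annotator_1 = ["malformed_output", "malformed_output", "missing_dependency", "tool_mismatch"]
annotator_2 = ["malformed_output", "missing_dependency", "missing_dependency", "tool_mismatch"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on three of four traces, and chance agreement from the marginals is 5/16, so kappa is (0.75 − 0.3125) / (1 − 0.3125) = 7/11.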

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained

The paper presents STAR as an empirical framework that learns recovery transitions from execution traces including failures and evaluates improvements on separate spatiotemporal benchmarks against baselines. The abstract describes the routing matrix as combining expert nominal routes with trace-learned recoveries conditioned on typed failures, but this is a modeling choice whose performance gains are measured externally rather than defined into existence. No equations, self-citations, or fitted quantities are shown reducing the reported results to the inputs by construction. The derivation chain therefore remains independent of the evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the ability to identify and categorize execution failures into distinct types and on the existence of execution traces that capture both nominal and error states for training the recovery transitions.

axioms (1)

domain assumption: Specialists execute through a tool-grounded extract-compute-deposit protocol and write results to a shared blackboard.
Invoked as the execution mechanism for heterogeneous agents in the routing framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool–query mismatches

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and orbit embedding · unclear

Theorem 1 (Recovery Reachability Dominance). Let Mα be the transition matrix trained with w(r)=r+α(1−r) for α>0, and let M0 be the success-only matrix. ... supp Mα[a,s,t,·] ≥ supp M0[a,s,t,·]
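The intuition behind the quoted statement can be checked on a toy example. The sketch assumes that r is a trace-level success score in [0, 1], that every transition in a trace contributes weight w(r) to its matrix cell, and that the ≥ in the excerpt denotes support inclusion; the excerpt is elided, so all three are assumptions, and the traces below are invented.

```python
from collections import defaultdict

def trace_weight(r, alpha):
    """w(r) = r + alpha * (1 - r), as in the quoted theorem statement."""
    return r + alpha * (1 - r)

def fit(traces, alpha):
    """Accumulate weighted transition mass per state (agent, status, task)."""
    mass = defaultdict(lambda: defaultdict(float))
    for r, steps in traces:
        w = trace_weight(r, alpha)
        for agent, status, task, nxt in steps:
            mass[(agent, status, task)][nxt] += w
    return mass

def support(mass, state):
    """Successors that receive strictly positive mass from a state."""
    return {nxt for nxt, m in mass.get(state, {}).items() if m > 0}

# One successful trace (r = 1) and one unsuccessful trace (r = 0).
traces = [
    (1, [("geometric", "SUCCESS", "q", "temporal")]),
    (0, [("geometric", "FAIL", "q", "topological"),
         ("topological", "SUCCESS", "q", "temporal")]),
]

m_zero = fit(traces, alpha=0.0)   # success-only training
m_alpha = fit(traces, alpha=0.3)  # failed traces retained with weight alpha

error_state = ("geometric", "FAIL", "q")
print(support(m_zero, error_state))
print(support(m_alpha, error_state))
```

With α = 0 the unsuccessful trace carries zero weight, so the error state has empty support. With any α > 0 it gains the recovery successor, while every successor reachable under success-only training keeps positive mass.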

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning.arXiv preprint arXiv:2310.03249, 2023

Mohamed Aghzal, Erion Plaku, and Ziyu Yao. Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning.arXiv preprint arXiv:2310.03249, 2023

work page arXiv 2023

[2]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024.

[3]

Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495, 2025.

[4]

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review. arXiv preprint arXiv:2504.19678, 2025.

[5]

Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Yassine Benajiba, Monica Sunkara, and Yi Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18974–18988, 2025.

[6]

Md Arafat Habib, Pedro Enrique Iturria Rivera, Yigit Ozcan, Medhat Elsayed, Majid Bavand, Raimundus Gaigalas, and Melike Erol-Kantarci. Llm-based intent processing and network optimization using attention-based hierarchical reinforcement learning. In 2025 IEEE Wireless Communications and Networking Conference (WCNC), pages 1–6. IEEE, 2025.

[7]

Bochen Han and Songmao Zhang. Exploring advanced llm multi-agent systems based on blackboard architecture. arXiv preprint arXiv:2507.01701, 2025.

[8]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations, 2023.

[9]

Wenbin Li, Di Yao, Ruibo Zhao, Wenjie Chen, Zijie Xu, Chengxue Luo, Chang Gong, Quanliang Jing, Haining Tan, and Jingping Bi. Stbench: Assessing the ability of large language models in spatio-temporal analysis. In Companion Proceedings of the ACM on Web Conference 2025, pages 749–752, 2025.

[10]

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024.

[11]

Zechen Li, Baiyu Chen, Hao Xue, and Flora D. Salim. Zara: Training-free motion time-series reasoning via evidence-grounded llm agents. arXiv preprint arXiv:2508.04038, 2026.

[12]

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023.

[13]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems, 36:46534–46594, 2023.

[14]

Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. Omnirouter: Budget and performance controllable multi-llm routing. ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025.

[15]

Juntong Ni, Shiyu Wang, Ming Jin, Qi He, and Wei Jin. Streasoner: Empowering llms for spatio-temporal reasoning in time series via spatial-aware reinforcement learning. arXiv preprint arXiv:2601.03248, 2026.

[16]

Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023.

[17]

Pengrui Quan, Brian Wang, Kang Yang, Liying Han, and Mani Srivastava. Benchmarking spatiotemporal reasoning in llms and reasoning models: Capabilities and challenges. arXiv preprint arXiv:2505.11618, 2025.

[18]

Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024.

[19]

Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Tomas Pfister, and Hamid Palangi. Llm-based multi-agent blackboard system for information discovery in data science. arXiv preprint arXiv:2510.01285, 2025.

[20]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023.

[21]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[22]

ChengAo Shen, Zhengzhang Chen, Dongsheng Luo, Dongkuan Xu, Haifeng Chen, and Jingchao Ni. Exploring multi-modal data with tool-augmented llm agents for precise causal discovery. In Findings of the Association for Computational Linguistics: ACL 2025, pages 636–660, 2025.

[23]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.

[24]

Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212–223. Springer, 2002.

[25]

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025.

[26]

Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, and Dawei Yin. Beyond react: A planner-centric framework for complex tool-augmented llm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33845–33853, 2026.

[27]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024.

[28]

Siheng Xiong, Ali Payani, Ramana Kompella, and Faramarz Fekri. Large language models can learn temporal reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10452–10470, 2024.
