DART: Semantic Recoverability for Structured Tool Agents

Huaxi Huang; Kejin Xu; Ke Yang; Panpan Li; Xiaoshui Huang; Zonghan Wu

arxiv: 2605.23311 · v1 · pith:R4TTZ2UGnew · submitted 2026-05-22 · 💻 cs.AI

Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3

classification 💻 cs.AI

keywords: semantic recoverability · tool agents · local recovery · rollback · admissibility check · commitment-sensitive · dependency constraints · effect constraints

The pith

An explicit semantic admissibility check allows safe local recovery in structured tool agents without invalidating downstream commitments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a tool agent fails mid-execution, restoring a local checkpoint can leave downstream consumers tied to an upstream history that no longer exists, producing invalid states after they have already acted on the output. DART addresses this by localizing the failed instance, certifying the boundaries of semantically recoverable states from dependency and effect constraints, aligning checkpoints to those boundaries, and selecting a restore point that preserves committed work or blocking the recovery. This matters in commitment-sensitive settings because replaying the entire task is safe but inefficient while mechanical rollback alone provides no criterion for semantic validity. Evaluation across three domains plus external validation shows DART succeeds on all tested cases where baselines fail, with a safety audit confirming no unsafe rollbacks are admitted.
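The failure mode described above can be illustrated with a toy trace. This is a minimal sketch, not the paper's implementation: the `Step` class, the `run` and `stale_consumers` helpers, the version counter, and the step names `search_flights` and `book_flight` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical step record; not a structure taken from the paper."""
    name: str
    output_version: int = 0  # bumped every time the step runs or re-runs
    consumed: dict[str, int] = field(default_factory=dict)  # upstream name -> version read
    committed: bool = False  # True once the step has acted on what it read


def run(step, upstream=()):
    """Execute a step: produce a new output version and record what was read."""
    step.output_version += 1
    step.consumed = {u.name: u.output_version for u in upstream}


def stale_consumers(steps):
    """Committed steps whose recorded upstream version no longer exists."""
    by_name = {s.name: s for s in steps}
    return [
        s.name
        for s in steps
        if s.committed
        and any(by_name[u].output_version != v for u, v in s.consumed.items())
    ]


search = Step("search_flights")
book = Step("book_flight")

run(search)            # upstream produces version 1
run(book, [search])    # downstream reads version 1 ...
book.committed = True  # ... and commits against it

run(search)            # naive local recovery: re-run only the upstream step

stale = stale_consumers([search, book])
print(stale)
```

Here `book_flight` is flagged because it committed against version 1 of `search_flights`, which the naive re-run replaced with version 2. The rollback itself was mechanically legal; the resulting state is what is invalid.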

Core claim

DART formalizes semantic recoverability and implements a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints or blocks otherwise. The results establish that controller legality does not imply semantic validity and that sound local recovery requires an explicit admissibility check.
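One way to picture the admit-or-block decision is the sketch below. It is an illustration under stated assumptions, not DART's actual algorithm: the `Step` type, the `select_restore_point` function, the `root_index` and `committed_reads` parameters, and the reading of "dependency constraint" as "no committed consumer has read an output that would be regenerated" and of "effect constraint" as "no irreversible step would be re-run" are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical step inside the failed instance."""
    name: str
    outputs: frozenset[str] = frozenset()
    irreversible: bool = False  # assumed effect constraint: must not be re-run


def select_restore_point(steps, checkpoints, root_index, committed_reads) -> Optional[int]:
    """Return k, meaning 'restore the state before steps[k]', or None to block.

    checkpoints: set of step indices that have an aligned checkpoint.
    root_index: index of the step where the fault was localized.
    committed_reads: output names already consumed by committed downstream work.
    """
    candidates = [c for c in checkpoints if c <= root_index]
    if not candidates:
        return None  # no checkpoint early enough to repair the fault
    k = max(candidates)  # latest usable checkpoint re-runs the least work
    rerun = steps[k:]
    breaks_dependency = any(s.outputs & committed_reads for s in rerun)
    repeats_effect = any(s.irreversible for s in rerun)
    if breaks_dependency or repeats_effect:
        return None  # block: an earlier checkpoint would re-run even more
    return k


steps = [
    Step("plan"),
    Step("fetch_quote", frozenset({"quote"})),
    Step("reserve", frozenset({"reservation_id"}), irreversible=True),
    Step("summarize", frozenset({"summary"})),
]
checkpoints = {0, 2, 3}
committed_reads = {"reservation_id"}  # a downstream consumer already used this

late_fault = select_restore_point(steps, checkpoints, 3, committed_reads)
early_fault = select_restore_point(steps, checkpoints, 1, committed_reads)
print(late_fault, early_fault)
```

In this toy model a fault localized at `summarize` is recoverable from checkpoint 3, while a fault localized at `fetch_quote` is blocked: the only usable checkpoint would re-run `reserve`, whose output committed downstream work depends on. Both restores would be legal for a controller that only knows checkpoints exist, which is the gap between controller legality and semantic validity that the claim points at.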

What carries the argument

The admissibility check that certifies semantically recoverable boundaries from dependency and effect constraints and selects valid restore points or blocks recovery.

Load-bearing premise

Semantic recoverability boundaries can be reliably certified from dependency and effect constraints alone.

What would settle it

A commitment-sensitive case where DART admits a restore point that produces inconsistent downstream state despite satisfying the dependency and effect constraints.
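A search for such a case could be framed as an audit over recorded recoveries. The following is a hedged sketch only: the `RecoveryCase` record, its fields, and the `counterexamples` function are hypothetical, and "inconsistent downstream state" is approximated here as a committed consumer's input no longer matching the upstream value present after recovery.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCase:
    """Hypothetical audit record for one recovery attempt."""
    case_id: str
    admitted: bool                   # verdict of the admissibility check
    consumed: dict[str, str]         # output name -> value committed consumers acted on
    after_recovery: dict[str, str]   # output name -> value present after restore and re-run


def counterexamples(cases):
    """Admitted cases in which some committed input no longer matches upstream state."""
    found = []
    for case in cases:
        if not case.admitted:
            continue  # a blocked recovery cannot be an unsafe admission
        mismatched = sorted(
            name
            for name, value in case.consumed.items()
            if case.after_recovery.get(name) != value
        )
        if mismatched:
            found.append((case.case_id, mismatched))
    return found


cases = [
    RecoveryCase("A", True, {"quote": "q1"}, {"quote": "q1"}),
    RecoveryCase("B", True, {"quote": "q1"}, {"quote": "q2"}),
    RecoveryCase("C", False, {"quote": "q1"}, {"quote": "q2"}),
]
unsafe = counterexamples(cases)
print(unsafe)
```

Case B is the kind of record that would settle the question against the claim: the check admitted the restore, yet a committed consumer is left holding a value the recovered history no longer contains. An empty result over a finite audit set shows only that no such case was observed in that set.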

Figures

Figures reproduced from arXiv: 2605.23311 by Huaxi Huang, Kejin Xu, Ke Yang, Panpan Li, Xiaoshui Huang, Zonghan Wu.

**Figure 2.** Recovery method overview. After failure, the runtime identifies the failed instance, checks ...

**Figure 3.** Runtime sidecar overview. Reviewed boundaries define recovery contracts; the sidecar lifts ...

Original abstract

When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DART names a real gap in agent recovery but the abstract supplies no methods or data to check whether the constraint-based check actually works.

The paper's core move is to separate mechanical rollback from semantic recoverability: a local restore can be technically possible yet invalid once downstream consumers have already acted on the failed instance's output. DART tries to close that gap by localizing the failure, certifying recoverable boundaries from dependency and effect constraints, aligning checkpoints, and either picking an admissible restore or blocking it. That distinction is useful and the modular runtime sketch is a reasonable way to operationalize it for structured tool agents. The LangGraph validation mention shows they at least tried an external substrate. Those are the parts that land cleanly.

The rest is thin. The abstract asserts that DART recovers every evaluated commitment-sensitive case where baselines fail and that a five-domain audit found no unsafe admissions, yet it gives zero information on domain construction, constraint formalization, how implicit LLM side effects were handled, or what the audit actually measured. The stress-test worry about unmodeled commitments is therefore live; three LLM domains plus an audit do not obviously demonstrate completeness. Without the full methods, derivations, or counterexample analysis, the central claim that dependency and effect constraints alone suffice remains untestable.

This work is aimed at people building production agent runtimes who already deal with checkpointing and need a way to reason about downstream commitments. A reader in that niche could borrow the terminology and the high-level architecture even if the current evidence does not yet support deployment. It is coherent enough on its own terms to deserve referee time, though any review would need to press hard on evaluation scope and constraint soundness.

Referee Report

2 major / 1 minor

Summary. The paper introduces the concept of semantic recoverability for structured tool agents that fail mid-execution in commitment-sensitive settings. It presents DART, a modular runtime that localizes the failed instance, certifies recoverable boundaries using dependency and effect constraints, aligns checkpoints accordingly, and selects an admissible restore point that preserves downstream committed work or blocks the restore. Empirical claims state that across three LLM-driven domains with external LangGraph validation, DART recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. The paper concludes that controller legality does not imply semantic validity.

Significance. If the results hold, the work is significant for distinguishing mechanical rollback from semantically valid recovery in agent systems and for proposing an explicit admissibility check based on constraints. This could improve reliability in tool-using agents where downstream actions depend on prior outputs. The modular design and external substrate validation are positive elements if the constraint completeness can be established.

major comments (2)

[Abstract] The claims that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' are stated without methods, data, metrics, or derivation details; the central claims cannot be verified from the available text.

[Evaluation (three LLM-driven domains and five-domain audit)] The evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing that the admissibility check blocks all invalid restores outside the evaluated set.

minor comments (1)

[Abstract] The description of DART's four steps is compressed and would benefit from explicit enumeration or a diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, providing clarifications from the full manuscript and indicating planned revisions where appropriate.

Referee: [Abstract] The claim that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' supplies no methods, data, metrics, or derivation details; central claims cannot be verified from the available text.

Authors: The abstract provides a concise summary of the results. The full manuscript details the methods, domains, metrics (recovery success and safety audit outcomes), and evaluation protocol in Section 5, including the three LLM-driven domains, LangGraph external validation, and the five-domain audit. To improve standalone verifiability of the abstract, we will revise it to briefly reference the evaluation setup and metrics.

revision: partial

Referee: [Evaluation (three LLM-driven domains and five-domain audit)] The evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing the admissibility check blocks all invalid restores outside the evaluated set.

Authors: The evaluation is empirical and demonstrates that, with explicitly provided dependency and effect constraints, DART recovers all tested commitment-sensitive cases and admits no unsafe rollbacks in the audit. We agree that no formal completeness proof or exhaustive counterexample analysis is included, as the work focuses on the runtime mechanism rather than proving constraint sufficiency in all cases. LLM non-determinism is addressed via the external LangGraph substrate validation. We will add a Limitations subsection acknowledging that constraint completeness relies on domain modeling and that unmodeled effects remain possible outside the evaluated set.

revision: partial

Circularity Check

0 steps flagged

No circularity: derivation is self-contained description and evaluation

The manuscript introduces semantic recoverability as a new formalization and describes DART's modular runtime components (localization, boundary certification, checkpoint alignment, admissibility check) under dependency and effect constraints. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to a definition or input by construction. The evaluation claims (recovery of all commitment-sensitive cases, zero unsafe rollbacks in five-domain audit) rest on external validation across domains and LangGraph substrate rather than on any self-referential reduction. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or explicit assumptions; ledger is empty by necessity.

pith-pipeline@v0.9.0 · 5737 in / 1066 out tokens · 20660 ms · 2026-05-25T04:28:40.044083+00:00 · methodology
