DART: Semantic Recoverability for Structured Tool Agents
Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3
The pith
An explicit semantic admissibility check allows safe local recovery in structured tool agents without invalidating downstream commitments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DART formalizes semantic recoverability and implements a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints. If no admissible restore point exists, it blocks local recovery. The results establish that controller legality does not imply semantic validity and that sound local recovery requires an explicit admissibility check.
What carries the argument
The admissibility check that certifies semantically recoverable boundaries from dependency and effect constraints and selects valid restore points or blocks recovery.
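To make the shape of such a check concrete, here is a minimal sketch in Python. It is not DART's implementation: the `Checkpoint` and `RecoveryContext` types, their fields, and the policy of preferring the latest admissible checkpoint are all assumptions made for illustration. The sketch only shows how a dependency constraint and an effect constraint could jointly admit a restore point or block recovery.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Checkpoint:
    """Candidate restore point in the failed instance (hypothetical model)."""
    step: int                      # position in the instance's history
    outputs_fixed: FrozenSet[str]  # outputs already produced at this step


@dataclass
class RecoveryContext:
    """Illustrative inputs to the check; every field name is an assumption."""
    committed_reads: Set[str]  # outputs that committed downstream work consumed
    # external effect name -> step at which the effect was performed
    irreversible_effects: Dict[str, int] = field(default_factory=dict)


def is_admissible(cp: Checkpoint, ctx: RecoveryContext) -> bool:
    # Dependency constraint: every output consumed by committed downstream
    # work must already be fixed at the checkpoint, so re-execution after
    # the restore cannot change it.
    if not ctx.committed_reads <= cp.outputs_fixed:
        return False
    # Effect constraint: the restore must not rewind past an irreversible
    # external effect, because re-execution would perform it a second time.
    return all(step <= cp.step for step in ctx.irreversible_effects.values())


def select_restore_point(
    checkpoints: List[Checkpoint], ctx: RecoveryContext, failed_step: int
) -> Optional[Checkpoint]:
    """Return the latest admissible checkpoint before the failure, or None to block."""
    candidates = [
        cp for cp in checkpoints
        if cp.step < failed_step and is_admissible(cp, ctx)
    ]
    return max(candidates, key=lambda cp: cp.step, default=None)
```

Returning `None` rather than falling back to the nearest checkpoint mirrors the "blocks recovery" branch: under these assumptions, a mechanically legal restore is refused whenever either constraint fails.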
Load-bearing premise
Semantic recoverability boundaries can be reliably certified from dependency and effect constraints alone.
What would settle it
A commitment-sensitive case where DART admits a restore point that produces inconsistent downstream state despite satisfying the dependency and effect constraints.
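A search for such a case can be phrased as a simple filter. The sketch below assumes two hypothetical oracles that the review does not specify: `admits`, which reports whether the runtime admitted a restore point for a case, and `downstream_consistent`, which reports whether downstream state was consistent after the restore. Any case that passes the first and fails the second is a counterexample.

```python
from typing import Callable, Iterable, List, TypeVar

Case = TypeVar("Case")


def find_counterexamples(
    cases: Iterable[Case],
    admits: Callable[[Case], bool],
    downstream_consistent: Callable[[Case], bool],
) -> List[Case]:
    """Return cases where a restore was admitted yet downstream state is inconsistent.

    Both callables are hypothetical oracles supplied by the evaluator.
    """
    return [c for c in cases if admits(c) and not downstream_consistent(c)]
```

An empty result over a finite case set supports the claim only for that set; it says nothing about cases outside it.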
Original abstract
When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.
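The failure mode the abstract describes, committed downstream work tied to an upstream history that no longer exists, can be illustrated with a small provenance check. This is a sketch under assumptions rather than anything taken from the paper: it supposes that each downstream commitment records a digest of the upstream output it consumed, and the names `digest` and `stale_commitments` are hypothetical.

```python
import hashlib
from typing import Dict, List


def digest(value: str) -> str:
    """Content digest used as a stand-in for upstream output identity."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def stale_commitments(
    provenance: Dict[str, str], regenerated: Dict[str, str]
) -> List[str]:
    """List upstream outputs whose regenerated value no longer matches
    what committed downstream work consumed.

    provenance:  output name -> digest recorded when downstream committed
    regenerated: output name -> value produced after the local restore
    """
    stale = []
    for name, committed_digest in sorted(provenance.items()):
        new_value = regenerated.get(name)
        # A missing or changed output means the downstream commitment now
        # refers to an upstream history that no longer exists.
        if new_value is None or digest(new_value) != committed_digest:
            stale.append(name)
    return stale
```

For example, if a downstream booking was committed against an itinerary and re-execution after a local restore produces a different itinerary, the restore was mechanically legal yet `stale_commitments` reports the itinerary as stale.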
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the concept of semantic recoverability for structured tool agents that fail mid-execution in commitment-sensitive settings. It presents DART, a modular runtime that localizes the failed instance, certifies recoverable boundaries using dependency and effect constraints, aligns checkpoints accordingly, and selects an admissible restore point that preserves downstream committed work or blocks the restore. Empirical claims state that across three LLM-driven domains with external LangGraph validation, DART recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. The paper concludes that controller legality does not imply semantic validity.
Significance. If the results hold, the work is significant for distinguishing mechanical rollback from semantically valid recovery in agent systems and for proposing an explicit admissibility check based on constraints. This could improve reliability in tool-using agents where downstream actions depend on prior outputs. The modular design and external substrate validation are positive elements if the constraint completeness can be established.
major comments (2)
- [Abstract] Abstract: the claim that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' supplies no methods, data, metrics, or derivation details; central claims cannot be verified from the available text.
- [Evaluation (three LLM-driven domains and five-domain audit)] The three LLM-driven domains and five-domain audit: the evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing the admissibility check blocks all invalid restores outside the evaluated set.
minor comments (1)
- [Abstract] Abstract: the description of DART's four steps is compressed and would benefit from explicit enumeration or a diagram for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, providing clarifications from the full manuscript and indicating planned revisions where appropriate.
- Referee: [Abstract] Abstract: the claim that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' supplies no methods, data, metrics, or derivation details; central claims cannot be verified from the available text.
  Authors: The abstract provides a concise summary of the results. The full manuscript details the methods, domains, metrics (recovery success and safety audit outcomes), and evaluation protocol in Section 5, including the three LLM-driven domains, LangGraph external validation, and the five-domain audit. To improve standalone verifiability of the abstract, we will revise it to briefly reference the evaluation setup and metrics.
  Revision: partial
- Referee: [Evaluation (three LLM-driven domains and five-domain audit)] The three LLM-driven domains and five-domain audit: the evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing the admissibility check blocks all invalid restores outside the evaluated set.
  Authors: The evaluation is empirical and demonstrates that, with explicitly provided dependency and effect constraints, DART recovers all tested commitment-sensitive cases and admits no unsafe rollbacks in the audit. We agree that no formal completeness proof or exhaustive counterexample analysis is included, as the work focuses on the runtime mechanism rather than proving constraint sufficiency in all cases. LLM non-determinism is addressed via the external LangGraph substrate validation. We will add a Limitations subsection acknowledging that constraint completeness relies on domain modeling and that unmodeled effects remain possible outside the evaluated set.
  Revision: partial
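The limitation conceded here, that unmodeled effects remain possible, suggests a simple cross-check between what the domain model declares and what tool calls are observed to do. The sketch below is illustrative only. It assumes tool calls can be instrumented to emit `(tool, effect)` pairs, which the review does not establish, and the function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def unmodeled_effects(
    declared: Dict[str, Set[str]],
    observed: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Report observed effects that the constraint model does not declare.

    declared: tool name -> effects listed in the domain model
    observed: (tool name, effect) pairs from a hypothetical execution trace
    """
    gaps: Dict[str, Set[str]] = {}
    for tool, effect in observed:
        if effect not in declared.get(tool, set()):
            # An admissibility check built only on declared effects would
            # never see this one, so flag it for the domain modeler.
            gaps.setdefault(tool, set()).add(effect)
    return gaps
```

A non-empty result does not show that an admitted restore was unsafe; it only marks where the constraint set is known to be incomplete.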
Circularity Check
No circularity: derivation is self-contained description and evaluation
The manuscript introduces semantic recoverability as a new formalization and describes DART's modular runtime components (localization, boundary certification, checkpoint alignment, admissibility check) under dependency and effect constraints. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to a definition or input by construction. The evaluation claims (recovery of all commitment-sensitive cases, zero unsafe rollbacks in five-domain audit) rest on external validation across domains and LangGraph substrate rather than on any self-referential reduction. This matches the default expectation of a non-circular paper.