
arxiv: 2605.23311 · v1 · pith:R4TTZ2UGnew · submitted 2026-05-22 · 💻 cs.AI

DART: Semantic Recoverability for Structured Tool Agents

Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords semantic recoverability · tool agents · local recovery · rollback · admissibility check · commitment-sensitive · dependency constraints · effect constraints

The pith

An explicit semantic admissibility check allows safe local recovery in structured tool agents without invalidating downstream commitments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a tool agent fails mid-execution, restoring a local checkpoint can leave downstream consumers tied to an upstream history that no longer exists, producing invalid states after they have already acted on the output. DART addresses this by localizing the failed instance, certifying the boundaries of semantically recoverable states from dependency and effect constraints, aligning checkpoints to those boundaries, and selecting a restore point that preserves committed work or blocking the recovery. This matters in commitment-sensitive settings because replaying the entire task is safe but inefficient while mechanical rollback alone provides no criterion for semantic validity. Evaluation across three domains plus external validation shows DART succeeds on all tested cases where baselines fail, with a safety audit confirming no unsafe rollbacks are admitted.

Core claim

DART formalizes semantic recoverability and implements a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints or blocks otherwise. The results establish that controller legality does not imply semantic validity and that sound local recovery requires an explicit admissibility check.

What carries the argument

The admissibility check that certifies semantically recoverable boundaries from dependency and effect constraints and selects valid restore points or blocks recovery.
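The shape of such a check can be illustrated with a minimal sketch. This is not DART's implementation: the names `Checkpoint`, `Commitment`, `is_admissible`, and `select_restore_point` are hypothetical, and the rule used here (an irreversible downstream commitment must still find the upstream output it consumed after the restore) is an assumed simplification of the paper's dependency and effect constraints.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Checkpoint:
    step: int                  # position in the failed instance's history
    preserved: FrozenSet[str]  # upstream outputs that still exist after restoring here


@dataclass(frozen=True)
class Commitment:
    consumer: str     # downstream component that already acted
    depends_on: str   # upstream output it consumed (assumed dependency constraint)
    reversible: bool  # whether its effect can be undone (assumed effect constraint)


def is_admissible(cp: Checkpoint, commitments: List[Commitment]) -> bool:
    # A restore is admissible only if every irreversible commitment
    # still finds the output it depended on among the preserved outputs.
    return all(c.reversible or c.depends_on in cp.preserved for c in commitments)


def select_restore_point(
    checkpoints: List[Checkpoint],
    commitments: List[Commitment],
    failed_step: int,
) -> Optional[Checkpoint]:
    # Consider only checkpoints taken before the failure, and prefer the
    # latest admissible one to minimise rework. None means "block recovery".
    candidates = [
        cp for cp in checkpoints
        if cp.step < failed_step and is_admissible(cp, commitments)
    ]
    return max(candidates, key=lambda cp: cp.step, default=None)


# Hypothetical scenario: a booking was irreversibly made from the "quote" output.
checkpoints = [
    Checkpoint(step=1, preserved=frozenset()),
    Checkpoint(step=3, preserved=frozenset({"quote"})),
]
commitments = [Commitment(consumer="booking", depends_on="quote", reversible=False)]
chosen = select_restore_point(checkpoints, commitments, failed_step=5)
blocked = select_restore_point(checkpoints[:1], commitments, failed_step=5)
```

In this toy scenario `chosen` is the step-3 checkpoint, because restoring there keeps the quote the booking relied on, while `blocked` is `None`: with only the step-1 checkpoint available, a local restore would erase the history the committed booking depends on, so the sketch refuses rather than rolling back.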

Load-bearing premise

Semantic recoverability boundaries can be reliably certified from dependency and effect constraints alone.

What would settle it

A commitment-sensitive case where DART admits a restore point that produces inconsistent downstream state despite satisfying the dependency and effect constraints.
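A search for such a case could be organised as a small falsification harness. The sketch below is illustrative only: `find_counterexample`, `admit`, and `consistent_after` are hypothetical names, and the toy cases assume that a counterexample would arise when the modeled constraints omit a dependency that exists in the real execution.

```python
from typing import Any, Callable, Dict, Iterable, Optional

Case = Dict[str, Any]


def find_counterexample(
    cases: Iterable[Case],
    admit: Callable[[Case], Optional[str]],
    consistent_after: Callable[[Case, str], bool],
) -> Optional[Case]:
    # Return the first case where a restore point is admitted
    # but the downstream state is inconsistent afterwards.
    for case in cases:
        restore = admit(case)
        if restore is not None and not consistent_after(case, restore):
            return case
    return None


def admit(case: Case) -> Optional[str]:
    # Stand-in admissibility check: it sees only the *modeled* dependencies.
    return "cp" if case["modeled"] <= case["preserved"] else None


def consistent_after(case: Case, restore: str) -> bool:
    # Ground-truth oracle: it sees the *actual* dependencies, modeled or not.
    return case["actual"] <= case["preserved"]


cases = [
    {"name": "fully modeled", "preserved": {"quote"},
     "modeled": {"quote"}, "actual": {"quote"}},
    {"name": "hidden side effect", "preserved": {"quote"},
     "modeled": {"quote"}, "actual": {"quote", "email_sent"}},
]
witness = find_counterexample(cases, admit, consistent_after)
```

Here `witness` is the "hidden side effect" case: the stand-in check admits the restore because every modeled dependency is preserved, yet the oracle reports an inconsistency from the unmodeled `email_sent` effect. Finding a real case of this kind against DART itself is what would settle the question.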

Figures

Figures reproduced from arXiv: 2605.23311 by Huaxi Huang, Kejin Xu, Ke Yang, Panpan Li, Xiaoshui Huang, Zonghan Wu.

Figure 1: Commitment-sensitive recovery regime. Whole-task rerun is correct but expensive because ...
Figure 2: Recovery method overview. After failure, the runtime identifies the failed instance, checks ...
Figure 3: Runtime sidecar overview. Reviewed boundaries define recovery contracts; the sidecar lifts ...
Original abstract

When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks the restore otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the concept of semantic recoverability for structured tool agents that fail mid-execution in commitment-sensitive settings. It presents DART, a modular runtime that localizes the failed instance, certifies recoverable boundaries using dependency and effect constraints, aligns checkpoints accordingly, and selects an admissible restore point that preserves downstream committed work or blocks the restore. Empirical claims state that across three LLM-driven domains with external LangGraph validation, DART recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. The paper concludes that controller legality does not imply semantic validity.

Significance. If the results hold, the work is significant for distinguishing mechanical rollback from semantically valid recovery in agent systems and for proposing an explicit admissibility check based on constraints. This could improve reliability in tool-using agents where downstream actions depend on prior outputs. The modular design and external substrate validation are positive elements if the constraint completeness can be established.

major comments (2)
  1. [Abstract] Abstract: the claim that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' supplies no methods, data, metrics, or derivation details; central claims cannot be verified from the available text.
  2. [Evaluation (three LLM-driven domains and five-domain audit)] The three LLM-driven domains and five-domain audit: the evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing the admissibility check blocks all invalid restores outside the evaluated set.
minor comments (1)
  1. [Abstract] Abstract: the description of DART's four steps is compressed and would benefit from explicit enumeration or a diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, providing clarifications from the full manuscript and indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails' and that 'a five-domain safety audit finds no unsafe admitted rollbacks' supplies no methods, data, metrics, or derivation details; central claims cannot be verified from the available text.

    Authors: The abstract provides a concise summary of the results. The full manuscript details the methods, domains, metrics (recovery success and safety audit outcomes), and evaluation protocol in Section 5, including the three LLM-driven domains, LangGraph external validation, and the five-domain audit. To improve standalone verifiability of the abstract, we will revise it to briefly reference the evaluation setup and metrics.
    revision: partial

  2. Referee: [Evaluation (three LLM-driven domains and five-domain audit)] The three LLM-driven domains and five-domain audit: the evaluation does not address whether dependency and effect constraints alone suffice to certify recoverable boundaries. LLM-driven execution introduces non-determinism and potential implicit commitments (e.g., unmodeled side effects or data flows) that may not be captured by explicit constraints; no counterexample analysis or completeness argument is provided showing the admissibility check blocks all invalid restores outside the evaluated set.

    Authors: The evaluation is empirical and demonstrates that, with explicitly provided dependency and effect constraints, DART recovers all tested commitment-sensitive cases and admits no unsafe rollbacks in the audit. We agree that no formal completeness proof or exhaustive counterexample analysis is included, as the work focuses on the runtime mechanism rather than proving constraint sufficiency in all cases. LLM non-determinism is addressed via the external LangGraph substrate validation. We will add a Limitations subsection acknowledging that constraint completeness relies on domain modeling and that unmodeled effects remain possible outside the evaluated set.
    revision: partial

Circularity Check

0 steps flagged

No circularity: derivation is self-contained description and evaluation

The manuscript introduces semantic recoverability as a new formalization and describes DART's modular runtime components (localization, boundary certification, checkpoint alignment, admissibility check) under dependency and effect constraints. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to a definition or input by construction. The evaluation claims (recovery of all commitment-sensitive cases, zero unsafe rollbacks in five-domain audit) rest on external validation across domains and LangGraph substrate rather than on any self-referential reduction. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or explicit assumptions; ledger is empty by necessity.

pith-pipeline@v0.9.0 · 5737 in / 1066 out tokens · 20660 ms · 2026-05-25T04:28:40.044083+00:00 · methodology

