pith. machine review for the scientific record.

arxiv: 2605.07935 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.MA

Recognition: no theorem link

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:23 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent coordination · LLM agents · protocol repair · TLA+ verification · counterexample-guided repair · PlusCal · runtime monitoring · deadlock prevention

The pith

LLM agents can synthesize multi-agent coordination protocols and repair them to full TLA+ verification in at most four iterations, producing executable prompts that raise task completion and roughly halve deadlock rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an LLM can first output a structured protocol topology and PlusCal logic, then use counterexamples from the TLA+ model checker to repair the logic until every state is verified. This matters because unverified LLM protocols frequently deadlock or livelock, limiting reliable multi-agent work. Once verified, the logic is compiled into per-agent prompts and run under a monitor that blocks out-of-protocol actions. Across 48 tasks the method reaches verification in every case, 62.5 percent on the first attempt, and the resulting systems average 89.4 percent task completion while cutting deadlock/livelock rates from 31.1 percent to 14.1 percent.

Core claim

TraceFix lets an agent produce a protocol topology as an intermediate representation, emit PlusCal coordination code, and feed TLC counterexamples back into an LLM repair loop until the entire model checks; the verified code is then compiled into system prompts that agents execute under a runtime monitor enforcing the original topology.
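The paper does not publish its IR schema, so the shape below is an assumption: a minimal sketch of what a protocol topology IR might look like, with every class and field name hypothetical. The point is only that the same artifact can drive both PlusCal generation and the runtime check.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    """One permitted coordination operation in the topology."""
    sender: str    # agent that initiates the operation
    receiver: str  # agent on the other end
    channel: str   # named channel the message travels on

@dataclass
class TopologyIR:
    """Hypothetical intermediate representation: the set of agents plus
    the coordination edges they are allowed to use."""
    agents: set[str] = field(default_factory=set)
    edges: set[Edge] = field(default_factory=set)

    def allows(self, sender: str, receiver: str, channel: str) -> bool:
        return Edge(sender, receiver, channel) in self.edges

# Example: a two-agent request/reply topology.
ir = TopologyIR(
    agents={"planner", "worker"},
    edges={Edge("planner", "worker", "tasks"),
           Edge("worker", "planner", "results")},
)
assert ir.allows("planner", "worker", "tasks")
assert not ir.allows("worker", "planner", "tasks")  # out-of-topology
```

On this reading, "compiling" the verified protocol means rendering each agent's edges into its system prompt, while the monitor keeps the same `allows` check at execution time.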

What carries the argument

Counterexample-guided iterative repair loop that converts TLC traces into targeted fixes for PlusCal process bodies until model checking passes.
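The control flow of that loop can be sketched as follows. The `java -cp tla2tools.jar tlc2.TLC` invocation is the real TLC command line, but the helper names (`run_tlc`, the `check`/`fix` callables) are hypothetical stand-ins for the paper's components; the four-round cap mirrors the reported worst case.

```python
import subprocess

def run_tlc(spec_path: str) -> tuple[bool, str]:
    """Invoke the TLC model checker (assumes tla2tools.jar is available).
    On failure, stdout carries the counterexample trace."""
    proc = subprocess.run(
        ["java", "-cp", "tla2tools.jar", "tlc2.TLC", spec_path],
        capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout

def repair_until_verified(spec, check, fix, max_iters=4):
    """Counterexample-guided repair loop.

    check(spec) -> (verified, trace); fix(spec, trace) -> repaired spec.
    Returns (verified, final_spec) after at most max_iters rounds."""
    for _ in range(max_iters):
        verified, trace = check(spec)
        if verified:
            return True, spec
        spec = fix(spec, trace)  # e.g. an LLM prompted with the trace
    return False, spec
```

In the paper's setting, `check` would wrap `run_tlc` and `fix` would prompt the repair LLM with the failing trace and the current PlusCal process bodies.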

If this is right

  • Every one of the 48 tasks reaches full TLC verification, most on the first attempt and none needing more than four repair cycles.
  • Verified protocols deliver 89.4 percent average task completion and 81.5 percent full completion across the test suite.
  • Deadlock and livelock rates drop from 31.1 percent to 14.1 percent when the verified protocol is used instead of plain prompts.
  • Runtime performance degrades at roughly half the rate of prompt-only baselines as underlying model capability is reduced.
  • Verification finishes in under 60 seconds even when state spaces differ by six orders of magnitude.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same counterexample loop could be applied to other formalisms such as Promela or Alloy to widen the set of coordination patterns that can be automatically hardened.
  • Runtime monitoring appears essential; without it, even a verified protocol could be violated by an LLM that hallucinates an extra message.
  • The approach separates protocol design from execution, so future work could swap the synthesis LLM for a human-written skeleton while keeping the repair and verification stages.
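Such a monitor can be sketched in a few lines: intercept every coordination call and reject any (sender, receiver, channel) triple outside the verified topology. The class and method names here are hypothetical, not the paper's API.

```python
class OutOfTopologyError(Exception):
    """Raised when an agent attempts a coordination call that the
    verified topology does not allow."""

class TopologyMonitor:
    """Minimal runtime monitor: permit only (sender, receiver, channel)
    triples present in the verified topology; log deliveries and count
    blocked attempts."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.log = []       # delivered messages
        self.rejected = 0   # blocked out-of-topology attempts

    def send(self, sender, receiver, channel, payload):
        if (sender, receiver, channel) not in self.allowed:
            self.rejected += 1
            raise OutOfTopologyError(f"{sender} -> {receiver} on {channel}")
        self.log.append((sender, receiver, channel, payload))
```

Note what this enforces and what it does not: structural deviations (a hallucinated extra message) are blocked, but the full TLC-verified invariants are not re-checked at runtime, which is exactly the gap the referee flags below.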

Load-bearing premise

The TLA+ model built from the protocol description captures every coordination behavior and failure mode that actually appears when the prompts run on real LLMs.

What would settle it

A verified protocol that still produces a deadlock or livelock when executed by the target LLMs under the runtime monitor would show the modeling step missed critical behaviors.
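One concrete form of that test is a trace-conformance check: replay each runtime trace against the verified model's transition relation and flag any step the model never allowed. The three-state request/reply model below is a hypothetical stand-in for a TLC-checked spec.

```python
def trace_conforms(trace, transitions, initial):
    """Replay a runtime trace against a verified transition relation;
    any step outside the relation falsifies the modeling assumption."""
    if not trace:
        return True
    if trace[0] != initial:
        return False
    state = trace[0]
    for nxt in trace[1:]:
        if (state, nxt) not in transitions:
            return False  # a behavior the model checker never verified
        state = nxt
    return True

# Hypothetical three-state request/reply model.
MODEL = {("idle", "requested"), ("requested", "acked"), ("acked", "idle")}
assert trace_conforms(["idle", "requested", "acked"], MODEL, "idle")
assert not trace_conforms(["idle", "acked"], MODEL, "idle")
```

A deadlocked run whose trace nevertheless conforms step-by-step would point at missing liveness modeling rather than a bad transition relation.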

Figures

Figures reproduced from arXiv: 2605.07935 by Jorge Ortiz, Qiwei Li, Shuren Xia, Taqiya Ehsan.

Figure 1. TraceFix pipeline overview. At design time (Stages 1–4), an orchestration agent synthesizes a protocol topology …
Figure 2. Root-cause distribution of the 29 repair attempts.
Figure 3. Distribution of repair-requiring tasks (18 of 48) by …
Figure 5. TLC distinct states (circles, left axis, log scale) and …
Figure 6. Average and full simulation completion for each …
Figure 7. Average simulation completion for Topology …
original abstract

We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TraceFix, a verification-first pipeline for LLM multi-agent coordination. An LLM synthesizes a protocol topology as a structured IR from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts executed under a runtime monitor that rejects out-of-topology operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification (62.5% on first attempt, none requiring more than four iterations), with verification completing in under 60s despite state spaces spanning six orders of magnitude. A 3,456-run runtime comparison shows verified protocols achieve 89.4% average task completion (81.5% full) and reduce deadlock/livelock from 31.1% to 14.1% versus baselines, with slower degradation under reduced model capability.

Significance. If the central assumption holds, the work offers a novel combination of formal verification with LLM agent synthesis, using machine-checked TLA+ models and counterexample-guided repair to improve coordination reliability. The scale of the evaluation (48 tasks, 3,456 runs) and the demonstration that verification succeeds rapidly across diverse scenarios are strengths that could guide future verifiable multi-agent systems. The approach provides falsifiable predictions via the TLA+ invariants and a concrete runtime monitor.

major comments (2)
  1. [3,456-run runtime comparison] The claim that verified protocols improve runtime outcomes (89.4% completion, DL/LL reduced to 14.1%) is load-bearing on the unvalidated assumption that the TLA+ specification (PlusCal processes, invariants, state transitions) faithfully captures coordination behaviors when the compiled prompts are executed by LLMs. The runtime monitor only rejects out-of-topology calls and does not enforce the full verified invariants; no section shows that actual runtime traces satisfy the TLC-verified properties. Residual DL/LL at 14.1% may indicate unmodeled LLM deviations (stochastic misinterpretation or state drift) rather than protocol issues.
  2. [Repair pipeline description] Details are insufficient on the mechanism translating TLC counterexamples into repairs (how the LLM modifies the IR or PlusCal code based on the counterexample), the exact implementation of the prompt-only and chat-only baselines, and any statistical significance testing for the reported improvements in the 3,456-run comparison. These omissions undermine assessment of the repair pipeline's reproducibility and the robustness of the empirical claims.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction introduce terms such as 'structured protocol topology IR' and 'runtime monitor' without early formal definitions or examples, which would aid readability.
  2. The paper would benefit from an explicit limitations section addressing the fidelity gap between the TLA+ model and LLM execution, as well as potential failure modes not captured by the topology monitor.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our evaluation and presentation. We address each major comment below with clarifications and indicate planned revisions.

point-by-point responses
  1. Referee: The claim that verified protocols improve runtime outcomes (89.4% completion, DL/LL reduced to 14.1%) is load-bearing on the unvalidated assumption that the TLA+ specification faithfully captures coordination behaviors when the compiled prompts are executed by LLMs. The runtime monitor only rejects out-of-topology calls and does not enforce the full verified invariants; no section shows that actual runtime traces satisfy the TLC-verified properties. Residual DL/LL at 14.1% may indicate unmodeled LLM deviations rather than protocol issues.

    Authors: We acknowledge that the runtime monitor enforces topology constraints derived from the verified IR rather than replaying every TLA+ invariant at execution time. The TLA+ model captures the coordination logic (process bodies, state transitions, and safety/liveness properties) that the LLM prompts are generated to implement; the monitor prevents structural deviations that would violate the topology. The 14.1% residual DL/LL rate is consistent with stochastic LLM deviations from the prompt instructions, and the paired ablation shows that verified protocols still halve the DL/LL rate relative to baselines. We agree that an explicit trace-to-model correspondence is not demonstrated in the current manuscript. In revision we will add a dedicated limitations subsection discussing the gap between model-level verification and runtime enforcement, together with illustrative runtime trace excerpts aligned to the corresponding TLA+ states. revision: partial

  2. Referee: Details are insufficient on the mechanism translating TLC counterexamples into repairs (how the LLM modifies the IR or PlusCal code based on the counterexample), the exact implementation of the prompt-only and chat-only baselines, and any statistical significance testing for the reported improvements in the 3,456-run comparison.

    Authors: We will expand Section 3.3 (Repair Pipeline) with the precise prompt templates used to translate TLC counterexamples into IR edits and subsequent PlusCal regeneration. For the baselines we will include the exact system-prompt wording and conversation structure for both prompt-only and chat-only conditions. We will also report statistical significance (paired t-tests or Wilcoxon tests with p-values) for the key completion-rate and DL/LL differences in the 3,456-run evaluation. These additions will appear in the revised manuscript and supplementary material to support reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical pipeline or verification results

full rationale

The paper presents an empirical pipeline for synthesizing, verifying, and executing LLM agent protocols, with all central claims consisting of measured success rates (e.g., 62.5% first-attempt verification, 89.4% task completion, DL/LL reduction to 14.1%) obtained from direct runs on 48 tasks and 3,456 runtime comparisons. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; the TLA+ model and runtime monitor are described as engineering components whose fidelity is an external validity assumption rather than a self-referential reduction. The derivation chain is therefore self-contained against the reported benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on the assumption that TLA+ specifications can be generated from LLM outputs and that the runtime monitor enforces the verified topology without introducing new errors. No free parameters or invented physical entities are described.

axioms (2)
  • domain assumption: TLA+ model checker (TLC) counterexamples provide sufficient information for an LLM to repair coordination logic.
    Invoked in the repair-loop description; no independent evidence supplied in the abstract.
  • domain assumption: the generated topology IR fully captures the coordination constraints needed for correct execution.
    Central to both synthesis and the runtime monitor; treated as given.
invented entities (2)
  • structured protocol topology IR (no independent evidence)
    purpose: intermediate representation that the LLM uses to generate PlusCal and that the runtime monitor enforces.
    New artifact introduced by the pipeline; no external falsifiable handle mentioned.
  • runtime monitor (no independent evidence)
    purpose: rejects out-of-topology coordination operations at execution time.
    New enforcement mechanism; no independent evidence of its correctness beyond the verification claim.

pith-pipeline@v0.9.0 · 5531 in / 1481 out tokens · 47749 ms · 2026-05-11T03:23:21.158350+00:00 · methodology

discussion (0)

