pith. sign in

arxiv: 2606.28690 · v1 · pith:WATDE7ASnew · submitted 2026-06-27 · 💻 cs.CR

Formal Security Analysis of Agent Protocol Composition

Pith reviewed 2026-06-30 10:08 UTC · model grok-4.3

classification 💻 cs.CR
keywords agent protocolssecurity analysisprotocol compositionTLA+responsibility gapsformal verificationAI agentsSDK testing
0
0 comments X

The pith

Security problems in AI agent protocols stem from unassigned responsibilities across specs, SDKs, and deployments rather than isolated errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agent protocols suffer from gaps in who must enforce security when multiple protocols interact, and it introduces AgentThread to detect these gaps systematically. AgentThread turns specification text into TLA+ invariants, checks them in a model checker, then replays counterexamples on real SDKs via adapters while tracing each result back to the original source text. A sympathetic reader would care because current agent deployments combine protocols without any single document saying who handles cross-protocol threats, so the same failures keep appearing in production. The work separates violated requirements from missing recommendations and unassigned duties, showing that thirty failures surface only under composition and that only one of five protocols actually enforces a relevant control in code.

Core claim

Across five agent protocols, AgentThread records 35 specification-level findings and 80 implementation tests, then uncovers thirty additional failures that appear solely when protocols are composed; only one protocol enforces any security-relevant control in practice and none assign responsibility for cross-protocol behavior, so insecurity arises from responsibility gaps rather than specification or implementation problems alone.

What carries the argument

AgentThread, a source-linked framework that defines a layered security scope, encodes protocol-derived checks as TLA+ invariants, and uses a two-phase checker to compile specifications into models and replay executable counterexamples on real SDKs through protocol adapters.

If this is right

  • Protocol specifications must separate requirements from recommendations and record cross-protocol duties.
  • SDKs and reference servers must be tested together rather than in isolation to expose composition failures.
  • Only one of the five studied protocols currently implements a security control that the analysis checks.
  • No studied protocol assigns enforcement for behavior that crosses protocol boundaries.
  • Findings are supported by source-text links so each violation can be traced to its originating requirement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams building agent systems would need a shared responsibility matrix before combining protocols instead of fixing issues after deployment.
  • The same two-phase method could be applied to other multi-protocol domains such as cloud service meshes or distributed ledgers.
  • If responsibility gaps remain unaddressed, security tooling will continue to focus on single-protocol bugs while composition attacks increase.

Load-bearing premise

Protocol adapters replay model-checker counterexamples against real SDKs without introducing artifacts or missing relevant behaviors.

What would settle it

A protocol document or SDK that explicitly assigns enforcement responsibility for security behavior spanning multiple protocols would contradict the central claim.

Figures

Figures reproduced from arXiv: 2606.28690 by Christophe Hauser, Haonan Li, Qifan Zhang, Shenghan Zheng, Zheng Zhang.

Figure 1
Figure 1. Figure 1: Motivating multi-MCP server composition. A hid [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layered security scope used by AgentThread. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: AgentThread workflow. Protocol artifacts are nor [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-protocol attack chain: MCP prompt injection [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Responsibility gap heatmap across layer-derived [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

AI agent protocols define how agents use tools, delegate work, and coordinate across software systems, but their security requirements remain incomplete and inconsistently enforced across deployments. We present AgentThread, a source-linked framework for security assurance analysis of agent protocols, from specification text to running SDKs. AgentThread contributes a layered security scope, protocol-derived checks formalized as TLA+ invariants, and a two-phase checker that compiles protocol specifications into model-checkable models and replays executable counterexamples against real SDKs through protocol adapters. For each finding, AgentThread records the source text behind the check and separates violated protocol requirements from missing recommendations, hardening gaps, and unassigned cross-protocol responsibilities. Across five emerging agent protocols, AgentThread identifies 35 specification-level findings, supports them with 80 implementation tests against production SDKs and reference servers, and finds 30 additional failures that emerge only under protocol composition. We further show that only one protocol enforces a security-relevant control in practice and no protocol assigns enforcement for cross-protocol behavior. Insecurity in agent protocols is therefore not only a specification or implementation problem, but also a responsibility gap across protocols, SDKs, and deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AgentThread, a source-linked framework that formalizes security checks for AI agent protocols as TLA+ invariants, compiles specifications into model-checkable models, and uses a two-phase checker to replay counterexamples against production SDKs via protocol adapters. Across five protocols it reports 35 specification-level findings, 80 implementation tests, and 30 additional failures that appear only under composition; it concludes that only one protocol enforces a relevant control in practice and that no protocol assigns responsibility for cross-protocol behavior, establishing a responsibility gap across protocols, SDKs, and deployments.

Significance. If the adapter replay step is shown to be faithful, the work supplies concrete, source-linked evidence that security shortfalls in agent systems arise not only from individual specifications or implementations but from unassigned cross-protocol responsibilities. The combination of TLA+ invariants with executable replay on real SDKs and the separation of violated requirements from missing recommendations constitute a useful methodological contribution for protocol analysis.

major comments (3)
  1. [Abstract / two-phase checker] Abstract and two-phase checker description: the 30 composition failures (and therefore the responsibility-gap claim) rest on the assertion that protocol adapters replay TLA+ counterexamples against production SDKs without introducing artifacts or omitting behaviors. No completeness argument, construction details, or validation that adapters preserve the model’s observable behaviors is supplied; this is load-bearing for the central conclusion.
  2. [Abstract] Abstract: the counts (35 findings, 80 tests, 30 composition failures) are presented without derivation details, exclusion criteria, or a table mapping each failure to the specific invariant violated and the source text that generated it. This prevents independent assessment of whether the 30 failures are genuinely emergent under composition rather than artifacts of the adapter layer.
  3. [Abstract] Abstract claim that “only one protocol enforces a security-relevant control in practice”: the supporting evidence is the 80 implementation tests, yet no section enumerates which control was tested, which SDKs were used, or the pass/fail criteria applied to each of the five protocols.
minor comments (1)
  1. Notation for the layered security scope and the distinction between “violated protocol requirements,” “missing recommendations,” “hardening gaps,” and “unassigned cross-protocol responsibilities” is introduced in the abstract but never given explicit definitions or an example mapping in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of presentation and evidence that we will address through targeted revisions. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [Abstract / two-phase checker] Abstract and two-phase checker description: the 30 composition failures (and therefore the responsibility-gap claim) rest on the assertion that protocol adapters replay TLA+ counterexamples against production SDKs without introducing artifacts or omitting behaviors. No completeness argument, construction details, or validation that adapters preserve the model’s observable behaviors is supplied; this is load-bearing for the central conclusion.

    Authors: We agree that adapter fidelity is load-bearing for the composition-only failure claims and the responsibility-gap conclusion. Section 4 of the manuscript describes the two-phase checker and the construction of protocol adapters that map TLA+ actions to SDK API calls while preserving message formats and state transitions. However, we acknowledge the absence of an explicit completeness argument or dedicated validation experiments. We will add a new subsection (4.3) providing construction details, a formal argument that adapters preserve observable behaviors (by construction from the protocol specification), and validation results showing that counterexamples replay identically on reference servers. This revision will directly support the 30 composition failures. revision: yes

  2. Referee: [Abstract] Abstract: the counts (35 findings, 80 tests, 30 composition failures) are presented without derivation details, exclusion criteria, or a table mapping each failure to the specific invariant violated and the source text that generated it. This prevents independent assessment of whether the 30 failures are genuinely emergent under composition rather than artifacts of the adapter layer.

    Authors: The counts are obtained by applying the layered security scope (Section 3) to the specification text of each of the five protocols, producing 35 invariant violations, then executing the corresponding checks via the two-phase checker to yield 80 implementation tests and 30 composition-only failures. Exclusion criteria (security-relevant invariants only; non-security recommendations separated) are stated in Section 3.2. To enable independent assessment we will add a summary table (new Table 2) and an appendix that maps every finding to the violated invariant, the exact source-text excerpt, the protocol, and whether the failure appears only under composition. The abstract will be updated with a brief derivation sentence. revision: yes

  3. Referee: [Abstract] Abstract claim that “only one protocol enforces a security-relevant control in practice”: the supporting evidence is the 80 implementation tests, yet no section enumerates which control was tested, which SDKs were used, or the pass/fail criteria applied to each of the five protocols.

    Authors: Section 6 enumerates the security-relevant controls tested (authentication, authorization, and session-integrity invariants), lists the five protocols together with the production SDKs and reference servers used for each, and defines pass/fail criteria as whether the running implementation satisfies the TLA+ invariant under the replayed counterexample. The claim that only one protocol passes follows directly from these results. To improve clarity we will insert a compact summary table (new Table 3) in Section 6 that lists, for each protocol, the control tested, the SDKs, and the pass/fail outcome. The abstract wording will remain but will now reference this table. revision: partial

Circularity Check

0 steps flagged

No significant circularity; external formal methods applied to independent specs and SDKs

full rationale

The derivation relies on TLA+ invariants checked against protocol specifications, followed by adapter-based replay of counterexamples on production SDKs. The 35 specification findings and 30 composition failures are outputs of this external model-checking process and 80 implementation tests, not reductions of fitted parameters or self-citations. No self-definitional equations, uniqueness theorems from prior author work, or ansatzes smuggled via citation appear in the abstract or described method. The central responsibility-gap claim follows from the independent checker results rather than by construction from its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim depends on domain assumptions about model fidelity and protocol representativeness rather than free parameters or new physical entities; the framework itself is an invented analysis method without independent falsifiable evidence outside the paper.

axioms (2)
  • domain assumption TLA+ models derived from protocol text accurately capture security-relevant behaviors for invariant checking.
    Invoked in the description of compiling specifications into model-checkable models.
  • domain assumption The five selected emerging agent protocols are representative of the broader class for drawing general conclusions about responsibility gaps.
    Basis for generalizing from the reported findings to all agent protocols.
invented entities (1)
  • AgentThread two-phase checker with protocol adapters no independent evidence
    purpose: To link formal counterexamples back to executable SDK behavior
    New component introduced to bridge specification analysis and implementation testing.

pith-pipeline@v0.9.1-grok · 5734 in / 1501 out tokens · 42821 ms · 2026-06-30T10:08:01.488972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    A2A Project. 2025. Agent-to-Agent Protocol (A2A). https://github.com/a2aproj ect/A2A. Announced April 2025; now under Linux Foundation. Retrieved March 2026

  2. [2]

    ACP Contributors. 2025. Agent Communication Protocol. https://agentcommu nicationprotocol.dev/introduction/welcome. Under active development via RFDs. Retrieved March 2026

  3. [3]

    William Adamson and Nicolas Williams. 2016. Remote Procedure Call (RPC) Security Version 3. RFC 7861

  4. [4]

    Agent Client Protocol Contributors. 2025. Agent Client Protocol (ACP). https: //github.com/agentclientprotocol/agent-client-protocol. retrieved March 2026

  5. [5]

    Agent Network Protocol Team. 2025. AgentNetworkProtocol (ANP): An Open Protocol for Agent Communication. https://github.com/agent- network- protocol/AgentNetworkProtocol. Retrieved March 2026

  6. [6]

    Abdullah Al Ishtiaq, Sarkar Snigdha Sarathi Das, Syed Md Mukit Rashid, Ali Ran- jbar, Kai Tu, Tianwei Wu, Zhezheng Song, Weixuan Wang, Mujtahid Akon, Rui Zhang, et al. 2024. Hermes: Unlocking security analysis of cellular network pro- tocols by synthesizing finite state machines from natural language specifications. In33rd USENIX Security Symposium (USENI...

  7. [7]

    Zeynab Anbiaee, Mahdi Rabbani, Mansur Mirani, Gunjan Piya, Igor Opushnyev, Ali Ghorbani, and Sajjad Dadkhah. 2026. Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP. arXiv preprint arXiv: 2602.11327(2026)

  8. [8]

    Anthropic. 2024. Model Context Protocol Specification. https://modelcontextpr otocol.io/specification/2025-11-25. retrieved March 2026

  9. [9]

    Linard Arquint, Malte Schwerhoff, Vaibhav Mehta, and Peter Müller. 2023. A Generic Methodology for the Modular Verification of Security Protocol Imple- mentations. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security(Copenhagen, Denmark)(CCS ’23). Association for Computing Machinery, New York, NY, USA, 1377–1391. doi:10....

  10. [10]

    David Basin, Jannik Dreier, Lucca Hirschi, Saša Radomirovic, Ralf Sasse, and Vincent Stettler. 2018. A Formal Analysis of 5G Authentication. InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Toronto, Canada)(CCS ’18). Association for Computing Machinery, New York, NY, USA, 1383–1396. doi:10.1145/3243734.3243846

  11. [11]

    McMillan, Kedar S

    David Basin, Nate Foster, Kenneth L. McMillan, Kedar S. Namjoshi, Cristina Nita-Rotaru, Jonathan M. Smith, Pamela Zave, and Lenore D. Zuck. 2025. It Takes a Village: Bridging the Gaps between Current and Formal Specifications for Protocols.Commun. ACM68, 8 (July 2025), 50–61. doi:10.1145/3706572

  12. [12]

    Karthikeyan Bhargavan, Davor Obradovic, and Carl A. Gunter. 2002. Formal verification of standards for distance vector routing protocols.J. ACM49, 4 (July 2002), 538–576. doi:10.1145/581771.581775

  13. [13]

    Steve Bishop, Matthew Fairbairn, Michael Norrish, Peter Sewell, Michael Smith, and Keith Wansbrough. 2005. Rigorous specification and conformance testing techniques for network protocols, as applied to TCP, UDP, and sockets, Vol. 35. Association for Computing Machinery, New York, NY, USA, 265–276. doi:10.114 5/1090191.1080123

  14. [14]

    B Blanchet et al. [n. d.]. An efficient cryptographic protocol verifier based on prolog rules. 2014.doi10, 82–96

  15. [15]

    Bochmann and Alexandre Petrenko

    Gregor V. Bochmann and Alexandre Petrenko. 1994. Protocol testing: review of methods and relevance for software testing. InProceedings of the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis(Seattle, Washington, USA)(ISSTA ’94). Association for Computing Machinery, New York, NY, USA, 109–124. doi:10.1145/186258.187153

  16. [16]

    Scott Bradner. 1997. Key words for use in RFCs to Indicate Requirement Levels. RFC 2119

  17. [17]

    Moshe Siman Tov Bustan, Mustafa Naamnih, Nir Zadok, and Roni Bar. 2026. The Mother of All AI Supply Chains: Critical, Systemic Vulnerability at the Core of Anthropic’s MCP. OX Security Blog, https://www.ox.security/blog/the-mother- of-all-ai-supply-chains-critical-systemic-vulnerability-at-the-core-of-the- mcp/. Accessed June 2026

  18. [18]

    Yi Chen, Di Tang, Yepeng Yao, Mingming Zha, XiaoFeng Wang, Xiaozhong Liu, Haixu Tang, and Dongfang Zhao. 2022. Seeing the Forest for the Trees: Understanding Security Hazards in the 3GPP Ecosystem through Intelligent Analysis on Change Requests. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 17–34. https://www.useni...

  19. [19]

    Cloud Security Alliance AI Safety Initiative. 2026. MCP by Design: RCE Across the AI Agent Ecosystem. Cloud Security Alliance Lab Space, https://labs.clo udsecurityalliance.org/research/csa-research-note-mcp-by-design-rce-ox- security-20260420-csa/. Accessed June 2026

  20. [20]

    Cas Cremers, Alexander Dax, and Aurora Naska. 2023. Formal Analysis of SPDM: Security Protocol and Data Model version 1.2. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 6611–6628. https://www.usenix.org/conference/usenixsecurity23/presentation/cremers- spdm

  21. [21]

    Cas Cremers, Benjamin Kiesl, and Niklas Medinger. 2020. A Formal Analysis of IEEE 802.11’s WPA2: Countering the Kracks Caused by Cracking the Counters. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1–

  22. [22]

    https://www.usenix.org/conference/usenixsecurity20/presentation/cremers

  23. [23]

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. Agentdojo: A dynamic environment to eval- uate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems37, 82895–82920

  24. [24]

    Vincent Diemunsch, Lucca Hirschi, and Steve Kremer. 2025. A comprehensive formal security analysis of OPC UA. InProceedings of the 34th USENIX Conference on Security Symposium(Seattle, WA, USA)(SEC ’25). USENIX Association, USA, Article 363, 20 pages

  25. [25]

    Danny Dolev and Andrew Yao. 2003. On the security of public key protocols. IEEE Transactions on information theory29, 2 (2003), 198–208

  26. [26]

    Pengfei Gao and Chao Peng. 2025. More with less: An empirical study of turn- control strategies for efficient coding agents.arXiv preprint arXiv:2510.16786 (2025). 10

  27. [27]

    Jacob Ginesin, Max von Hippel, Evan Defloor, Cristina Nita-Rotaru, and Michael Tüxen. 2024. A Formal Analysis of SCTP: Attack Synthesis and Patch Verification. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 3099–3116. https://www.usenix.org/conference/usenixsecuri ty24/presentation/ginesin

  28. [28]

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real- world llm-integrated applications with indirect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security. 79–90

  29. [29]

    Allenoush Hayrapetian and Rajeev Raje. 2018. Empirically Analyzing and Eval- uating Security Features in Software Requirements. InProceedings of the 11th Innovations in Software Engineering Conference(Hyderabad, India)(ISEC ’18). Association for Computing Machinery, New York, NY, USA, Article 9, 11 pages. doi:10.1145/3172871.3172879

  30. [30]

    Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S. Yu. 2026. The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies. ACM Comput. Surv.58, 6 (2026), 162:1–162:36. doi:10.1145/3773080

  31. [31]

    Charlie Jacomme, Elise Klein, Steve Kremer, and Maïwenn Racouchot. 2023. A comprehensive, formal and automated analysis of the EDHOC protocol. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 5881–5898. https://www.usenix.org/conference/usenixsecurity 23/presentation/jacomme

  32. [32]

    Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Xiang Chen, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Chunming Wu, Muham- mad Khurram Khan, and Meng Han. 2025. A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures.arX...

  33. [33]

    Tomas Kulik, Brijesh Dongol, Peter Gorm Larsen, Hugo Daniel Macedo, Steve Schneider, Peter W. V. Tran-Jørgensen, and James Woodcock. 2022. A Survey of Practical Formal Methods for Security.Form. Asp. Comput.34, 1, Article 5 (July 2022), 39 pages. doi:10.1145/3522582

  34. [34]

    2002.Specifying systems

    Leslie Lamport. 2002.Specifying systems. Vol. 388. Addison-Wesley Boston

  35. [35]

    Christine P Lee, David Porfirio, Xinyu Jessica Wang, Kevin Chenkai Zhao, and Bilge Mutlu. 2025. Veriplan: Integrating formal verification and llms into end- user planning. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19

  36. [36]

    Hui Li, Zhen Dong, Siao Wang, Hui Zhang, Liwei Shen, Xin Peng, and Dongdong She. 2025. Extracting Formal Specifications From Documents Using LLMS for Test Automation. In2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, 1–12

  37. [37]

    Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. 2025. Swe-debate: Competitive multi-agent debate for software issue resolution.arXiv preprint arXiv:2507.23348(2025)

  38. [38]

    Qiaomu Li and Ying Xie. 2025. From glue-code to protocols: A critical anal- ysis of a2a and mcp integration for scalable agent systems.arXiv preprint arXiv:2505.03864(2025)

  39. [39]

    Felix Linker, Ralf Sasse, and David Basin. 2025. A formal analysis of apple’s iMessage PQ3 protocol. InProceedings of the 34th USENIX Conference on Security Symposium(Seattle, WA, USA)(SEC ’25). USENIX Association, USA, Article 258, 20 pages

  40. [40]

    R. J. Linn. 2006. Conformance evaluation methodology and protocol testing. IEEE J.Sel. A. Commun.7, 7 (Sept. 2006), 1143–1158. doi:10.1109/49.44561

  41. [41]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadel- phia, PA, 1831–1847. https://www.usenix.org/conference/usenixsecurity24/pre sentation/liu-yupei

  42. [42]

    Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, and Neil Zhenqiang Gong. 2025. WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents. arXiv preprint arXiv: 2510.01354(2025)

  43. [43]

    Gavin Lowe. 1996. Breaking and fixing the Needham-Schroeder public-key protocol using FDR. InInternational Workshop on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 147–166

  44. [44]

    Simon Meier, Benedikt Schmidt, Cas Cremers, and David Basin. 2013. The TAMARIN prover for the symbolic analysis of security protocols. InInternational conference on computer aided verification. Springer, 696–701

  45. [45]

    National Institute of Standards and Technology. 2024. Open Security Controls Assessment Language (OSCAL). https://pages.nist.gov/OSCAL/. Accessed April 2026

  46. [46]

    National Vulnerability Database. 2025. CVE-2025-49596: MCP Inspector remote code execution vulnerability. https://nvd.nist.gov/vuln/detail/CVE-2025-49596

  47. [47]

    National Vulnerability Database. 2025. CVE-2025-68143: Anthropic MCP Git server path handling vulnerability. https://nvd.nist.gov/vuln/detail/CVE-2025- 68143

  48. [48]

    National Vulnerability Database. 2025. CVE-2025-68144: Anthropic MCP Git server argument injection vulnerability. https://nvd.nist.gov/vuln/detail/CVE- 2025-68144

  49. [49]

    National Vulnerability Database. 2025. CVE-2025-68145: Anthropic MCP Git server repository boundary vulnerability. https://nvd.nist.gov/vuln/detail/CVE- 2025-68145

  50. [50]

    Raul Onitza-Klugman. 2025. Prompt Injection Meets MCP: A New Exploitation Vector Emerging? Snyk Labs, https://labs.snyk.io/resources/prompt-injection- mcp/. Accessed June 2026

  51. [51]

    OpenAI. 2026. Introducing the OpenAI Safety Bug Bounty Program. https: //openai.com/index/safety-bug-bounty/. Accessed June 2026

  52. [52]

    Maria Leonor Pacheco, Max von Hippel, Ben Weintraub, Dan Goldwasser, and Cristina Nita-Rotaru. 2022. Automated attack synthesis by extracting finite state machines from protocol specification documents. In2022 IEEE Symposium on Security and Privacy (SP). IEEE, 51–68

  53. [53]

    Eric Rescorla and Brian Korver. 2003. Guidelines for Writing RFC Text on Security Considerations. RFC 3552

  54. [54]

    Shubham Saboo and Kristopher Overholt. 2026. A Developers’ Guide to AI Agent Protocols. https://developers.googleblog.com/developers-guide-to-ai-agent- protocols/. Accessed April 2026

  55. [55]

    B Sarikaya. 1986. Formal specification-based conformance testing. InProceedings of the ACM SIGCOMM Conference on Communications Architectures & Protocols (Stowe, Vermont, USA)(SIGCOMM ’86). Association for Computing Machinery, New York, NY, USA, 236–240. doi:10.1145/18172.18199

  56. [56]

    Prakhar Sharma and Vinod Yegneswaran. 2023. Prosper: Extracting protocol specifications using large language models. InProceedings of the 22nd ACM workshop on hot topics in networks. 41–47

  57. [57]

    Min Shi, Jing Chen, Kun He, Haoran Zhao, Meng Jia, and Ruiying Du. 2023. Formal Analysis and Patching of BLE-SC Pairing. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 37–52. https://www.usenix.org/conference/usenixsecurity23/presentation/shi-min

  58. [58]

    Robert Thurlow. 2009. RPC: Remote Procedure Call Protocol Specification Version

  59. [59]

    Shuren Xia, Qiwei Li, Taqiya Ehsan, and Jorge Ortiz. 2026. TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples. InProceedings of the ACM Conference on AI and Agentic Systems. 181–196

  60. [60]

    Jane Yen, Tamás Lévai, Qinyuan Ye, Xiang Ren, Ramesh Govindan, and Barath Raghavan. 2021. Semi-automated protocol disambiguation and code generation (SIGCOMM ’21). Association for Computing Machinery, New York, NY, USA, 272–286. doi:10.1145/3452296.3472910

  61. [61]

    Yuan Yu, Panagiotis Manolios, and Leslie Lamport. 1999. Model checking TLA+ specifications. InAdvanced research working conference on correct hardware design and verification methods. Springer, 54–66

  62. [62]

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024. 10471–10506

  63. [63]

    Mingwei Zheng, Danning Xie, Qingkai Shi, Chengpeng Wang, and Xiangyu Zhang. 2025. Validating Network Protocol Parsers with Traceable RFC Document Interpretation.Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA078, 23 pages. doi:10.1145/3728955 11