pith. sign in

arxiv: 2604.21477 · v1 · submitted 2026-04-23 · 💻 cs.CR

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Pith reviewed 2026-05-09 21:34 UTC · model grok-4.3

classification 💻 cs.CR
keywords MCPModel Context ProtocolLLM agentstool serverssecurity testinghardeningtrace validationmulti-vector attacks
0
0 comments X

The pith

Hardening MCP tool servers eliminates all Tier-1 security findings at a cost of 27 lines of code while agent self-reports mismatch traces in 63 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP Pitfall Lab, a testing framework that turns common developer mistakes in Model Context Protocol tool servers into reproducible attack scenarios across email, document, and crypto workflows. It evaluates six server variants against three attack families using actual protocol traces and objective checks rather than relying on the agent's own description of events. Static analysis catches four pitfall classes with perfect accuracy, and applying the suggested fixes removes every high-severity issue and drops the overall risk score to zero. The work also shows that in preliminary tests the agents' written accounts of their actions disagree with the real execution traces in most runs and in every case involving a dangerous sink action. This supplies developers with concrete, low-cost remediation steps for a protocol whose growing use in LLM agents widens the attack surface across metadata, cross-tool flows, and multimodal inputs.

Core claim

By operationalizing developer pitfalls as three workflow challenges with baseline and hardened server variants, the framework demonstrates that recommended hardening eliminates all Tier-1 findings (29 to 0) and reduces the risk score from 10.0 to 0.0 at a mean cost of 27 lines of code, while a 19-run corpus reveals agent narratives diverging from trace evidence in 63.2 percent of runs and 100 percent of sink-action runs.

What carries the argument

The MCP Pitfall Lab framework, which defines reproducible scenarios for tool-metadata poisoning, puppet servers, and multimodal image-to-tool chains, then validates outcomes with protocol traces and objective validators instead of agent self-report.

If this is right

  • Static analysis reaches F1 of 1.0 on four statically checkable pitfall classes.
  • Hardening removes every Tier-1 finding and brings the framework risk score to zero.
  • Agent narratives diverge from trace evidence in 63.2 percent of runs and all sink-action runs.
  • Trace-grounded auditing and regression testing become necessary for accurate security assessment of MCP servers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future MCP tool servers should log and expose execution traces by default so that security audits can avoid dependence on agent self-report.
  • The low code cost of hardening suggests similar lightweight fixes could be catalogued for other tool-calling protocols used by LLM agents.
  • Integrating trace-based validation into continuous testing pipelines would catch cross-tool and multimodal leaks that static checks miss.
  • The divergence between narrative and trace points to a broader need for objective ground truth when evaluating any autonomous agent that can invoke external tools.

Load-bearing premise

The three chosen workflow challenges and six server variants adequately stand in for real-world MCP deployments and the three attack families cover the primary threats developers will face.

What would settle it

Applying the recommended hardening to a production MCP server and then testing it against a new attack vector or a larger corpus of runs that shows agent narratives matching traces more than 36.8 percent of the time.

Figures

Figures reproduced from arXiv: 2604.21477 by Run Hao, Zhuoran Tan.

Figure 1
Figure 1. Figure 1: Threat Model - Multi-Vector Attack Surface [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MCP Pitfall Lab Architecture Lab enables objective and reproducible assessment without relying on agent self-report. Multi-input awareness. Pitfall Lab treats tool-returned con￾tent and visual artifacts (e.g., attachments and screenshots) as untrusted inputs and records their provenance. When image extraction is enabled, the trace links image-derived content to subsequent tool invocations, supporting audit… view at source ↗
read the original abstract

Model Context Protocol (MCP) is increasingly adopted for tool-integrated LLM agents, but its multi-layer design and third-party server ecosystem expand risks across tool metadata, untrusted outputs, cross-tool flows, multimodal inputs, and supply-chain vectors. Existing MCP benchmarks largely measure robustness to malicious inputs but offer limited remediation guidance. We present MCP Pitfall Lab, a protocol-aware security testing framework that operationalizes developer pitfalls as reproducible scenarios and validates outcomes with MCP traces and objective validators (rather than agent self-report). We instantiate three workflow challenges (email, document, crypto) with six server variants (baseline and hardened) and model three attack families: tool-metadata poisoning, puppet servers, and multimodal image-to-tool chains, in a unified, trace-grounded evaluation. In Tier-1 static analysis over six variants (36 binary labels), our analyzer achieves F1 = 1.0 on four statically checkable pitfall classes (P1, P2, P5, P6) and flags cross-tool forwarding and image-to-tool leakage (P3, P4) as trace/dataflow-dependent. Applying recommended hardening eliminates all Tier-1 findings (29 to 0) and reduces the framework risk score (10.0 to 0.0) at a mean cost of 27 lines of code (LOC). Finally, in a preliminary 19-run corpus from the email system challenge (tool poisoning and puppet attacks), agent narratives diverge from trace evidence in 63.2% of runs and 100% of sink-action runs, motivating trace-based auditing and regression testing. Overall, Pitfall Lab enables practical, end-to-end assessment and hardening of MCP tool servers under realistic multi-vector conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MCP Pitfall Lab, a protocol-aware security testing framework for Model Context Protocol (MCP) tool servers in LLM agents. It operationalizes developer pitfalls as reproducible scenarios across three workflow challenges (email, document, crypto) with six server variants (baseline and hardened) and evaluates three attack families: tool-metadata poisoning, puppet servers, and multimodal image-to-tool chains. Validation relies on MCP traces and objective validators rather than agent self-reports. Reported results include a Tier-1 static analyzer achieving F1=1.0 on four pitfall classes (P1, P2, P5, P6), complete elimination of Tier-1 findings (29 to 0) and framework risk score reduction (10.0 to 0.0) after applying recommended hardening at a mean cost of 27 LOC, and agent narrative divergence from trace evidence in 63.2% of 19 runs (100% of sink-action runs).

Significance. If the results hold, the work supplies a concrete, trace-grounded framework that addresses the gap in remediation guidance for MCP security, demonstrating measurable improvements from hardening and exposing discrepancies between agent narratives and actual traces. This is a strength for practical developer use in tool-integrated LLM agents. The approach of using objective validators and reproducible scenarios adds value over input-robustness benchmarks alone.

major comments (2)
  1. The central claim that Pitfall Lab enables 'practical, end-to-end assessment' of MCP tool servers under realistic multi-vector conditions (Abstract) depends on the three workflow challenges and six server variants sufficiently representing real-world MCP deployments and primary threat vectors. No external justification, coverage argument, or comparison to production deployments is provided; if other patterns (e.g., long-running stateful tools or multi-turn chaining with persistent context) contain additional pitfalls, the 29-to-0 elimination of findings would not generalize.
  2. The headline metrics (F1=1.0 on four classes, 29-to-0 Tier-1 reduction, 10.0-to-0.0 risk score) are presented without implementation details, full attack payloads, or raw data (Abstract). This prevents independent verification of the evaluation, which is load-bearing for the soundness of the hardening and divergence claims.
minor comments (1)
  1. The divergence result is labeled 'preliminary' from a 19-run corpus (Abstract); specifying run selection criteria, statistical measures, and whether the corpus covers all attack families would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline the revisions we will make to improve the manuscript's clarity on scope and reproducibility.

read point-by-point responses
  1. Referee: The central claim that Pitfall Lab enables 'practical, end-to-end assessment' of MCP tool servers under realistic multi-vector conditions (Abstract) depends on the three workflow challenges and six server variants sufficiently representing real-world MCP deployments and primary threat vectors. No external justification, coverage argument, or comparison to production deployments is provided; if other patterns (e.g., long-running stateful tools or multi-turn chaining with persistent context) contain additional pitfalls, the 29-to-0 elimination of findings would not generalize.

    Authors: We acknowledge that the manuscript lacks an explicit coverage argument or direct mapping to production MCP deployments. The three workflow challenges were selected to span representative categories of tool usage (communication, document handling, and cryptographic operations) drawn from the MCP specification and common agent patterns. The six variants enable controlled baseline-to-hardened comparisons. We agree that unexamined patterns such as persistent stateful tools could surface additional issues. In the revised manuscript we will add a dedicated Limitations subsection that provides a coverage rationale based on the MCP protocol documentation, lists the evaluated patterns, and explicitly states that the 29-to-0 Tier-1 reduction and risk-score improvement apply only to the tested variants. This will qualify the 'practical, end-to-end assessment' claim without overstating generalizability. revision: yes

  2. Referee: The headline metrics (F1=1.0 on four classes, 29-to-0 Tier-1 reduction, 10.0-to-0.0 risk score) are presented without implementation details, full attack payloads, or raw data (Abstract). This prevents independent verification of the evaluation, which is load-bearing for the soundness of the hardening and divergence claims.

    Authors: We agree that the abstract and main evaluation summary omit sufficient implementation details for independent verification. The full manuscript describes the Tier-1 analyzer rules, hardening steps (with per-variant LOC counts), risk-scoring formula, and trace-validation procedure. To resolve this, we will release the complete artifact package (all six server variants, attack payloads, raw MCP traces from the 19 runs, and analyzer source) upon acceptance. In the revised version we will insert a 'Reproducibility' subsection that references the artifacts, supplies example payloads and trace excerpts, and explains how the F1, 29-to-0, and 10.0-to-0.0 figures were computed. This directly supports verification of the hardening and narrative-divergence results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are direct measurements on instantiated scenarios

full rationale

The paper defines a testing framework, selects three concrete workflow challenges and six server variants, applies three attack families, and reports direct outcomes from static analysis (F1=1.0 on four pitfall classes) and trace validation on a 19-run corpus. Hardening results (29 to 0 Tier-1 findings, risk score 10.0 to 0.0 at 27 LOC) are measured on the chosen instances rather than derived from any fitted parameter or self-referential definition. No equations, self-citations, or uniqueness theorems appear in the provided text; the evaluation relies on external validators and objective checks. The representativeness of the chosen workflows is an unproven assumption but does not create a circular reduction in the reported derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central contribution is the new testing framework itself; no free parameters are fitted, the main axioms are standard assumptions about MCP risk surfaces, and the only invented entity is the framework and its pitfall taxonomy.

axioms (1)
  • domain assumption MCP's multi-layer design and third-party server ecosystem expand risks across tool metadata, untrusted outputs, cross-tool flows, multimodal inputs, and supply-chain vectors.
    Stated in the opening of the abstract as the motivation for the work.
invented entities (1)
  • MCP Pitfall Lab no independent evidence
    purpose: Protocol-aware security testing framework that operationalizes pitfalls as reproducible scenarios with trace validation.
    New artifact introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5612 in / 1480 out tokens · 35194 ms · 2026-05-09T21:34:09.279551+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    Donating the model context protocol and establishing the agentic ai foundation, December 2025

    Anthropic. Donating the model context protocol and establishing the agentic ai foundation, December 2025

  2. [2]

    (ab) using images and sounds for indirect instruction injection in multi-modal llms,

    Eugene Bagdasarian, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. (ab)using images and sounds for indirect instruction injection in multi-modal llms.ArXiv, abs/2307.10490, 2023

  3. [3]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InThe Thirty-eight Conference on Neural Information Process- ing Systems Datasets and Benchmarks Track, 2024

  4. [4]

    OWASP Top 10 for Large Lan- guage Model Applications, 2025

    OWASP Foundation. OWASP Top 10 for Large Lan- guage Model Applications, 2025. Version 2025

  5. [5]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024

  6. [6]

    Developers are not the enemy!: The need for usable security apis.IEEE Security & Privacy, 14(5):40–46, 2016

    Matthew Green and Matthew Smith. Developers are not the enemy!: The need for usable security apis.IEEE Security & Privacy, 14(5):40–46, 2016

  7. [7]

    arXiv preprint arXiv:2410.22770 , year=

    Hao Li, Xiaogeng Liu, and Chaowei Xiao. Injecguard: Benchmarking and mitigating over-defense in prompt in- jection guardrail models.ArXiv, abs/2410.22770, 2024

  8. [8]

    Gentel-safe: A uni- fied benchmark and shielding framework for defending against prompt injection attacks, 2024

    Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, and Meng Han. Gentel-safe: A uni- fied benchmark and shielding framework for defending against prompt injection attacks, 2024

  9. [9]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24), pages 1831– 1847, Philadelphia, PA, August 2024. USENIX Associ- ation

  10. [10]

    Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks, 2020

    Marc Ohm, Henrik Plate, Arnold Sykosch, and Michael Meier. Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks, 2020

  11. [11]

    Beyond the protocol: Unveiling attack vectors in the model context protocol (mcp) ecosystem, 2025

    Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, and Jiachi Chen. Beyond the protocol: Unveiling attack vectors in the model context protocol (mcp) ecosystem, 2025

  12. [12]

    Your clawdbot (moltbot) ai assistant has shell access and one prompt injection away from disaster, January 2026

    Liran Tal. Your clawdbot (moltbot) ai assistant has shell access and one prompt injection away from disaster, January 2026

  13. [13]

    Marnerides

    Zhuoran Tan, Shameem Puthiya Parambath, Chris- tos Anagnostopoulos, Jeremy Singer, and Angelos K. Marnerides. Advanced persistent threats based on sup- ply chain vulnerabilities: Challenges, solutions, and future directions.IEEE Internet of Things Journal, 12(6):6371–6395, 2025

  14. [14]

    The instruction hier- archy: Training llms to prioritize privileged instructions, 2024

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Jo- hannes Heidecke, and Alex Beutel. The instruction hier- archy: Training llms to prioritize privileged instructions, 2024

  15. [15]

    Raccoon: Prompt extraction benchmark of LLM- integrated applications

    Junlin Wang, Tianyi Yang, Roy Xie, and Bhuwan Dhin- gra. Raccoon: Prompt extraction benchmark of LLM- integrated applications. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 13349– 13365, Bangkok, Thailand, August 2024. Association for Computational Linguistics

  16. [16]

    Manipulating multimodal agents via cross-modal prompt injection

    Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Manipulating multimodal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multime- dia, MM ’25, page 10955–10964, New York, NY , USA,

  17. [17]

    Association for Computing Machinery

  18. [18]

    Mcptox: A benchmark for tool poisoning attack on real-world mcp servers, 2025

    Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning attack on real-world mcp servers, 2025

  19. [19]

    Bench- marking and defending against indirect prompt injection attacks on large language models

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Bench- marking and defending against indirect prompt injection attacks on large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discov- ery and Data Mining V .1, KDD ’25, page 1809–1820, New York, NY , USA, 2025. Association for Computing...

  20. [20]

    InjecAgent: Benchmarking indirect prompt in- jections in tool-integrated large language model agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt in- jections in tool-integrated large language model agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, Bangkok, Thailand, August 2024. Association for Comp...

  21. [21]

    MCP security bench (MSB): Benchmarking attacks against model context protocol in LLM agents

    Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Pei Pei Li, and Wenjun Xu. MCP security bench (MSB): Benchmarking attacks against model context protocol in LLM agents. InThe Fourteenth International Confer- ence on Learning Representations, 2026. 13

  22. [22]

    MCP-safetybench: A benchmark for safety evaluation of large language models with real-world MCP servers

    Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, and Chao Yang. MCP-safetybench: A benchmark for safety evaluation of large language models with real-world MCP servers. InThe Fourteenth International Confer- ence on Learning Representations, 2026. 14