arxiv: 2606.30317 · v1 · pith:EBUPTWNJnew · submitted 2026-06-29 · 💻 cs.SE · cs.AI

MCP Server Architecture Patterns for LLM-Integrated Applications

Pith reviewed 2026-06-30 05:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords MCP · Model Context Protocol · architectural patterns · LLM · server architecture · design patterns · LLM integration · software patterns

The pith

Five recurring patterns structure MCP servers for LLM-integrated applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs five architectural patterns observed in MCP servers that connect LLMs to external tools and services. These patterns were identified through examination of fifteen servers, five from a production voice AI platform and ten from the public registry. Each pattern is detailed with its context, the problem it addresses, the proposed solution, and resulting consequences, following the format of established design pattern literature. The authors also document anti-patterns and cross-cutting issues including authentication and observability. Measurements on pattern classification reliability, communication overhead, and tool selection performance support the taxonomy.

Core claim

Five recurring MCP server architectural patterns (Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, and Domain-Specific Adapter) are observed across an enumerated corpus of fifteen servers. Each pattern is described in the structured form of context, problem, solution, and consequences. The evaluation reports inter-rater reliability (Cohen's kappa = 0.76 on 54 held-out servers), transport overhead, and the effect of tool count on tool-selection accuracy.
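For readers unfamiliar with the reliability statistic, the following is a minimal sketch of how Cohen's kappa can be computed for two raters who each assign one pattern label per server. The function name and the toy labels are assumptions made for illustration; they are not the paper's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical, NOT the paper's 54-server data).
rater_a = ["gateway", "gateway", "orchestrator", "orchestrator"]
rater_b = ["gateway", "gateway", "orchestrator", "gateway"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

In this toy case the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, which gives kappa = 0.5.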

What carries the argument

The taxonomy of five MCP server architectural patterns, each specified via context-problem-solution-consequences.

If this is right

  • Developers building MCP servers can draw from these patterns rather than starting from scratch.
  • Tool-selection accuracy drops below 90% between 10 and 15 tools per context for Claude Haiku 4.5 and between 20 and 30 tools for Sonnet 4, so the number of exposed tools is a measurable design constraint.
  • Common concerns such as authentication, versioning, and observability apply across all patterns.
  • Four anti-patterns should be avoided when implementing MCP servers.
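The tool-count finding is reported as a bracket between two measured bucket sizes rather than a single number. A minimal sketch of how such a bracket can be located from per-bucket accuracies is shown here; the function name and the accuracy values are made-up placeholders, not measurements from the paper.

```python
def first_bucket_below(accuracy_by_tool_count, threshold=0.90):
    """Return (last_ok_count, first_failing_count) for the first bucket
    whose accuracy is strictly below `threshold`, scanning buckets in
    order of increasing tool count. Returns None if no bucket falls
    below the threshold. last_ok_count is None when even the smallest
    bucket is below the threshold.
    """
    previous = None
    for count in sorted(accuracy_by_tool_count):
        if accuracy_by_tool_count[count] < threshold:
            return (previous, count)
        previous = count
    return None


# Placeholder accuracies keyed by number of tools in context
# (hypothetical values for illustration only).
toy_accuracy = {2: 0.99, 4: 0.95, 8: 0.85}
bracket = first_bucket_below(toy_accuracy)
print(bracket)
```

With these placeholder values the crossing lies between 4 and 8 tools. The true crossing point inside a bracket is unknown unless intermediate bucket sizes are also measured.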

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of these patterns could lead to more interoperable and maintainable LLM tool ecosystems.
  • The noted ambiguities in pattern boundaries suggest opportunities for more precise definitions in future work.
  • The overhead measurements may guide choices between local and remote MCP server deployments.

Load-bearing premise

The sample of fifteen servers adequately represents the structures found in the broader set of MCP servers being developed.

What would settle it

Analysis of a larger set of MCP servers revealing that the majority cannot be classified into any of the five patterns.

Figures

Figures reproduced from arXiv: 2606.30317 by Carson Rodrigues, Oysturn Vas.

Figure 1. MCP transport latency (p50/p95/p99, log scale) by configuration; row labels indicate measured vs. modeled.
[Plot text extracted alongside this caption: panel title "Accuracy vs. Tool Count (N=200 requests per bucket)"; x-axis "Number of Tools in Context"; y-axes "Tool Selection Accuracy (%)" and "Median Latency (ms)"; series Claude Haiku 4.5 and Claude Sonnet 4; 90% threshold line.]
Figure 2. Tool count vs. accuracy and latency (Claude Haiku 4.5 and Claude …)
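Figure 1 summarizes latency as p50/p95/p99. A minimal sketch of computing those summaries from raw timing samples with the nearest-rank method follows; the function names and the sample data are assumptions for illustration, and the paper may use a different percentile definition.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return the p50/p95/p99 summary for a list of latencies in ms."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Synthetic latencies of 1..100 ms (hypothetical, not measured data).
summary = summarize_latency(list(range(1, 101)))
print(summary)
```

Tail percentiles such as p99 need many samples to be stable; with only a few dozen requests, p99 is effectively the maximum.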
Original abstract

The Model Context Protocol (MCP), introduced by Anthropic in November 2024, defines a standardized interface for connecting large language models (LLMs) to external tools, data sources, and services. Within months of release, hundreds of community-built MCP servers appeared on GitHub, but no software-maintenance literature has yet described how the ecosystem is being structured in production. This industry experience paper catalogues five recurring MCP server architectural patterns observed across an enumerated corpus of fifteen independently developed servers (five production servers from the ANSYR voice AI platform plus ten public servers from the official MCP registry): Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, and Domain-Specific Adapter. Each pattern is described in the structured form of Gamma et al.: context, problem, solution, and consequences. We also document four anti-patterns and a set of cross-cutting concerns around authentication, versioning, and observability. The quantitative evaluation contributes three measurements: inter-rater reliability of the taxonomy across two independent LLM raters on 54 held-out servers (Cohen's kappa = 0.76), which also localizes three pattern-boundary ambiguities; transport overhead measured end-to-end on loopback and modeled for cross-host paths; and a tool-count study showing tool-selection accuracy drops below 90% between 10 and 15 tools per context for Claude Haiku 4.5 and between 20 and 30 tools for Sonnet 4. Code, corpus, and prompts are released as a replication package.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to identify and catalogue five recurring architectural patterns for MCP servers (Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, Domain-Specific Adapter) observed in an enumerated corpus of fifteen servers, using the Gamma et al. template for each; it also documents four anti-patterns and cross-cutting concerns on authentication/versioning/observability. Quantitative contributions include Cohen's kappa = 0.76 inter-rater reliability on 54 held-out servers, end-to-end transport overhead measurements, and tool-count thresholds where LLM tool-selection accuracy drops below 90%.

Significance. If the taxonomy is robust, the work supplies a practical, structured vocabulary for an emerging protocol ecosystem that currently lacks software-engineering literature, directly aiding maintainability of LLM-integrated applications. The release of the replication package (code, corpus, prompts) is a clear strength for reproducibility and further validation.

major comments (1)
  1. [Abstract] Abstract (and the corpus description in the methods section): the central claim that the five patterns were 'observed across an enumerated corpus of fifteen independently developed servers' is load-bearing for the taxonomy's validity, yet the parenthetical breakdown states that five servers are production servers from the single ANSYR voice AI platform. Servers sharing the same development organization or codebase do not constitute independent development efforts, so the evidence that the patterns recur across distinct contexts (rather than internal platform conventions) is weakened; the remaining ten public-registry servers cannot compensate for this non-independence in the derivation corpus.
minor comments (1)
  1. [Abstract] The abstract states that 'Code, corpus, and prompts are released as a replication package' but does not provide the repository URL, commit hash, or DOI; this should be added for immediate accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The point about corpus independence is well-taken and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and the corpus description in the methods section): the central claim that the five patterns were 'observed across an enumerated corpus of fifteen independently developed servers' is load-bearing for the taxonomy's validity, yet the parenthetical breakdown states that five servers are production servers from the single ANSYR voice AI platform. Servers sharing the same development organization or codebase do not constitute independent development efforts, so the evidence that the patterns recur across distinct contexts (rather than internal platform conventions) is weakened; the remaining ten public-registry servers cannot compensate for this non-independence in the derivation corpus.

    Authors: We acknowledge that the five ANSYR servers were developed within a single organization and therefore do not meet a strict definition of independent development efforts. While they were produced by separate teams for distinct voice-AI use cases and did not share MCP-specific code beyond the protocol specification itself, this does not fully address the concern. We will revise the abstract and the methods section to remove the unqualified phrase "independently developed servers" and instead describe the corpus explicitly as "five production servers from the ANSYR voice AI platform together with ten servers drawn from the public MCP registry." We will also add a short paragraph in the threats-to-validity section noting the organizational provenance of the ANSYR subset and stating that the ten public-registry servers constitute the primary source of cross-context evidence. The taxonomy itself remains grounded in the observed designs; the revision affects only the wording of the claim.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; taxonomy is observational from external corpus with independent validation steps.

Full rationale

The paper's central claim is an observational catalog of five patterns drawn from an enumerated corpus of fifteen servers (five ANSYR production servers plus ten from the public MCP registry), presented in Gamma et al. format. Additional measurements include Cohen's kappa on 54 held-out servers, transport overhead, and tool-count accuracy thresholds. No equations, fitted parameters, self-citations, or ansatzes appear in the derivation; the patterns are not defined in terms of themselves, no prediction reduces to a fit by construction, and no uniqueness theorem or prior author work is invoked to force the taxonomy. Issues of corpus independence or representativeness are sampling concerns, not circular reductions of the claimed derivation to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen corpus is representative and that patterns can be reliably extracted from it.

axioms (1)
  • Domain assumption: the fifteen servers examined are representative of the broader MCP ecosystem.
    Used to generalize observed patterns to production use.

pith-pipeline@v0.9.1-grok · 5798 in / 1081 out tokens · 30575 ms · 2026-06-30T05:16:04.866592+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 5 canonical work pages · 2 internal anchors

  1. E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.
  2. Anthropic, “Model context protocol specification,” November 2024. [Online]. Available: https://modelcontextprotocol.io/specification
  3. Anthropic, “Model context protocol reference servers,” 2025. [Online]. Available: https://github.com/modelcontextprotocol/servers
  4. G. Hohpe and B. Woolf, Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Professional, 2003.
  5. OpenAI, “Function calling and other api updates,” 2023. [Online]. Available: https://openai.com/blog/function-calling-and-other-api-updates
  6. Anthropic, “Tool use (function calling) — anthropic documentation,” 2024. [Online]. Available: https://docs.anthropic.com/en/docs/build-with-claude/tool-use
  7. Microsoft, “Language server protocol specification,” 2016. [Online]. Available: https://microsoft.github.io/language-server-protocol/
  8. M. Fowler, Patterns of Enterprise Application Architecture. Addison-Wesley Professional, 2002.
  9. R. T. Fielding, “Architectural styles and the design of network-based software architectures,” Ph.D. dissertation, University of California, Irvine, 2000.
  10. O. Hartig and J. Pérez, “Semantics and complexity of GraphQL,” in Proc. The Web Conference (WWW), 2018, pp. 1155–1164.
  11. T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023, published version; arXiv:2302.04761; DOI verified on CrossRef.
  12. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.03629
  13. S. Gravitas, “Auto-gpt: An autonomous gpt-4 experiment,” 2023. [Online]. Available: https://github.com/Significant-Gravitas/Auto-GPT
  14. H. Chase, “Langchain: Building applications with llms through composability,” 2022. [Online]. Available: https://github.com/langchain-ai/langchain
  15. Anthropic, “Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku,” 2024. [Online]. Available: https://www.anthropic.com/news/3-5-models-and-computer-use
  16. X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions,” ACM Transactions on Software Engineering and Methodology, 2026, [DOI verified on CrossRef].
  17. M. M. Hasan, H. Li, E. Fallahzadeh, G. K. Rajbahadur, B. Adams, and A. E. Hassan, “Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers,” ACM Transactions on Software Engineering and Methodology, 2026, [DOI verified on CrossRef].
  18. H. Guo, Y. Hao, Y. Zhang, M. Xu, P. Lv, J. Chen, and X. Cheng, “A measurement study of model context protocol ecosystem,” 2025. [Online]. Available: https://arxiv.org/abs/2509.25292
  19. J. Saldaña, The Coding Manual for Qualitative Researchers, 4th ed. SAGE Publications, 2021, [Verified on CrossRef].
  20. T. Gan and Q. Sun, “RAG-MCP: Mitigating prompt bloat in LLM tool selection via retrieval-augmented generation,” arXiv preprint arXiv:2505.03275, 2025.
  21. Anthropic, “Model context protocol python sdk,” https://github.com/modelcontextprotocol/python-sdk, 2024, official Python implementation of the MCP specification.
  22. Daily, “Pipecat: Open source framework for voice and multimodal ai agents,” https://github.com/pipecat-ai/pipecat, 2024, GitHub repository; used for MCP transport benchmarking.
  23. K. Kate, T. Pedapati, K. Basu et al., “LongFuncEval: Measuring the effectiveness of long context models for function calling,” arXiv preprint arXiv:2505.10570, 2025.