Evaluating Tool Cloning in Agentic-AI Ecosystems
Pith reviewed 2026-05-20 22:26 UTC · model grok-4.3
The pith
Tool cloning is pervasive in agentic AI ecosystems, with most high-similarity repository pairs confirmed as clones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cloning is not an isolated artifact: high-similarity regions appear across comparison settings, and 60% of high-Jaccard candidates and 85% of high-ssdeep candidates in the MCP ecosystem are manually verified as clones. These results indicate that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems.
What carries the argument
repository-level auditing pipeline that computes pairwise similarity using complementary lexical (Jaccard) and fuzzy-structural (ssdeep) metrics on extracted tool code
If this is right
- Raw counts of tools in marketplaces substantially overstate actual ecosystem diversity.
- Benchmark splits that ignore repository provenance become contaminated by duplicate implementations.
- Vulnerable code can propagate more widely through cloned tools.
- Measurements of tool-use generalization become biased when cloned variants are treated as independent.
- Provenance, attribution, and intellectual-property questions arise for derived tools.
Where Pith is reading between the lines
- Evaluation protocols for agent tool use should incorporate similarity-based deduplication before reporting performance numbers.
- Marketplace operators could add automated clone detection to surface derivative tools for users.
- Security scanning efforts would benefit from treating high-similarity clusters as single entities rather than independent codebases.
Load-bearing premise
The 100 sampled high-similarity pairs per ecosystem represent the full distribution of clones and manual review reliably separates true cloning from coincidental similarity or shared templates.
What would settle it
A full manual audit of every high-similarity pair that finds verification rates well below 60 percent for Jaccard and 85 percent for ssdeep would falsify the claim of pervasive cloning.
Figures
read the original abstract
Agent tools are becoming a core interface through which LLM agents access external data, services, and execution environments. As these tools are distributed through public marketplaces, raw tool counts may substantially overstate ecosystem diversity if many repositories are cloned, lightly modified, or derived from shared templates. Such hidden duplication can contaminate benchmark splits, propagate vulnerable implementations, bias measurements of tool-use generalization, and raise provenance, attribution, and intellectual-property concerns. We present, to our knowledge, the first large-scale measurement study of tool cloning in agentic AI ecosystems. We curate a unified dataset from multiple public platforms, covering 7,508 Model Context Protocol (MCP) repositories with 87,564 extracted tools and 1,353 Skills repositories with 12,447 tools, for a total of 8,861 repositories and 100,011 tool entries. To measure implementation-level duplication, we build a repository-level auditing pipeline using complementary lexical and fuzzy-structural similarity metrics, and compute pairwise similarity across MCP-to-MCP, Skills-to-Skills, and MCP-to-Skills repository pairs. We further manually verify 100 sampled pairs per MCP and Skills ecosystem across similarity-score buckets to calibrate how often high similarity reflects true code cloning. Our analysis shows that cloning is not an isolated artifact: high-similarity regions appear across comparison settings, and 60\% of high-Jaccard candidates and 85\% of high-ssdeep candidates in the MCP ecosystem are manually verified as clones. These results indicate that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems. They further suggest that agent-tool datasets and benchmarks should account for repository provenance and implementation similarity when measuring tool diversity or constructing evaluation splits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first large-scale empirical measurement of tool cloning in agentic-AI ecosystems. It curates a unified dataset of 8,861 repositories and 100,011 tools from the Model Context Protocol (MCP) and Skills platforms, computes pairwise repository-level similarities using complementary lexical (Jaccard) and fuzzy-structural (ssdeep) metrics across MCP-to-MCP, Skills-to-Skills, and cross-ecosystem pairs, and manually verifies 100 sampled high-similarity pairs per ecosystem across similarity-score buckets. The central finding is that cloning is pervasive rather than isolated, with 60% of high-Jaccard candidates and 85% of high-ssdeep candidates in the MCP ecosystem manually confirmed as clones, implying that raw tool counts overstate diversity and that benchmarks must account for provenance and implementation similarity.
Significance. If the sampling and verification procedures prove robust, the work supplies a valuable baseline for an emerging problem at the intersection of software engineering, AI agents, and tool ecosystems. The scale of the curated dataset, the use of multiple complementary similarity metrics, and the explicit linkage to downstream concerns (benchmark contamination, vulnerability propagation, IP) are clear strengths. The empirical focus on hidden duplication addresses a practical gap that existing tool-use literature has largely overlooked.
major comments (2)
- Abstract and auditing-pipeline description: The headline clone rates (60% high-Jaccard, 85% high-ssdeep in MCP) are computed solely from the 100 manually inspected pairs per ecosystem. The text states that sampling occurs “across similarity-score buckets” but supplies neither the bucket boundaries, the total number of candidate pairs falling into each bucket, the exact sampling procedure within buckets, nor the verification rubric used to distinguish true implementation clones from shared templates or coincidental similarity. Because these rates are the primary quantitative support for the pervasiveness claim, the absence of this information prevents assessment of representativeness and reproducibility.
- Manual-verification procedure: No inter-rater agreement statistic, number of annotators, or explicit decision criteria for labeling a pair as a clone are reported. Without these details it is impossible to gauge the reliability of the 60% and 85% figures or to determine whether the manual labels systematically separate cloning from boilerplate reuse.
minor comments (2)
- The abstract would be clearer if it stated the concrete similarity thresholds that define the “high-Jaccard” and “high-ssdeep” buckets from which the 100 pairs were drawn.
- The manuscript should report the total number of pairwise comparisons performed in each setting (MCP-to-MCP, Skills-to-Skills, MCP-to-Skills) so readers can contextualize the scale of the high-similarity regions.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We agree that additional details on the sampling and verification procedures are important for assessing the robustness of our findings. We address each major comment below and will incorporate the requested information into the revised manuscript.
read point-by-point responses
-
Referee: Abstract and auditing-pipeline description: The headline clone rates (60% high-Jaccard, 85% high-ssdeep in MCP) are computed solely from the 100 manually inspected pairs per ecosystem. The text states that sampling occurs “across similarity-score buckets” but supplies neither the bucket boundaries, the total number of candidate pairs falling into each bucket, the exact sampling procedure within buckets, nor the verification rubric used to distinguish true implementation clones from shared templates or coincidental similarity. Because these rates are the primary quantitative support for the pervasiveness claim, the absence of this information prevents assessment of representativeness and reproducibility.
Authors: We acknowledge that these methodological details were not included in the submitted manuscript. To address this, we will add a new subsection in the Methods section that specifies the similarity-score buckets used (for example, for Jaccard similarity: [0.7, 0.8), [0.8, 0.9), [0.9, 1.0]; similar for ssdeep), the total number of repository pairs in each bucket for the MCP ecosystem, the stratified sampling approach (selecting a fixed number of pairs from each bucket to ensure coverage across the similarity range), and the detailed verification rubric. The rubric defines a true clone as a pair where one repository is a direct copy or minor modification of the other, excluding cases of shared templates or coincidental similarity based on code structure and functionality. This will allow readers to evaluate the representativeness of the 100 sampled pairs. revision: yes
-
Referee: Manual-verification procedure: No inter-rater agreement statistic, number of annotators, or explicit decision criteria for labeling a pair as a clone are reported. Without these details it is impossible to gauge the reliability of the 60% and 85% figures or to determine whether the manual labels systematically separate cloning from boilerplate reuse.
Authors: We agree that reporting these details is essential for transparency. The manual verification was conducted by two independent annotators (both authors with expertise in software engineering), who reviewed each pair and labeled it as clone or not based on explicit criteria. Disagreements were resolved through discussion with a third author. We will report the inter-rater agreement using Cohen's kappa statistic in the revised version. The decision criteria include: (1) substantial code overlap (>60% after removing comments and whitespace), (2) evidence of direct copying such as identical file structures and function names with minor variable changes, or (3) one repository being a fork with limited modifications. Pairs with only shared dependencies or boilerplate code were not labeled as clones. We will include these details and the agreement statistic in the updated manuscript. revision: yes
Circularity Check
Empirical measurement study with no derivations or self-referential reductions
full rationale
The paper is a large-scale empirical measurement study that curates datasets, applies standard lexical and fuzzy similarity metrics (Jaccard, ssdeep), computes pairwise scores, and reports direct counts from manual verification of 100 sampled high-similarity pairs per ecosystem. No equations, fitted parameters, predictions, or derivations appear in the provided text. The reported clone fractions (60% Jaccard, 85% ssdeep) are literal outcomes of the verification step rather than quantities forced by construction from the similarity scores themselves. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. The analysis therefore remains self-contained and does not reduce to its inputs by definition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption High lexical or fuzzy-structural similarity between repositories indicates cloning or derivation rather than independent implementation of similar functionality.
- domain assumption The public marketplaces sampled are representative of the broader agent-tool ecosystem.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We build a repository-level auditing pipeline using complementary lexical and fuzzy-structural similarity metrics, and compute pairwise similarity across MCP-to-MCP, Skills-to-Skills, and MCP-to-Skills repository pairs. We further manually verify 100 sampled pairs per MCP and Skills ecosystem across similarity-score buckets
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Clone detection using abstract syntax trees , author=. Proceedings. International Conference on Software Maintenance , year=
-
[2]
Queen’s School of computing TR , year=
A survey on software clone detection research , author=. Queen’s School of computing TR , year=
-
[3]
Science of computer programming , year=
Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , author=. Science of computer programming , year=
- [4]
-
[5]
IEEE Transactions on software engineering , year=
Comparison and evaluation of clone detection tools , author=. IEEE Transactions on software engineering , year=
-
[6]
2009 IEEE 31st International Conference on Software Engineering , pages=
Do code clones matter? , author=. 2009 IEEE 31st International Conference on Software Engineering , pages=. 2009 , organization=
work page 2009
- [7]
-
[8]
The distribution of the flora in the alpine zone. 1 , author=. New phytologist , volume=. 1912 , publisher=
work page 1912
-
[9]
Digital investigation , volume=
Identifying almost identical files using context triggered piecewise hashing , author=. Digital investigation , volume=. 2006 , publisher=
work page 2006
- [10]
-
[11]
Finding near-duplicate web pages: a large-scale evaluation of algorithms , author=. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval , pages=
- [12]
-
[13]
Empirical Software Engineering , volume=
Empirical study of android repackaged applications , author=. Empirical Software Engineering , volume=. 2019 , publisher=
work page 2019
-
[14]
The 2014 ACM international conference on Measurement and modeling of computer systems , pages=
A measurement study of google play , author=. The 2014 ACM international conference on Measurement and modeling of computer systems , pages=
work page 2014
-
[15]
GraphCodeBERT: Pre-training Code Representations with Data Flow
Graphcodebert: Pre-training code representations with data flow , author=. arXiv preprint arXiv:2009.08366 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[16]
Findings of the association for computational linguistics: EMNLP 2020 , pages=
Codebert: A pre-trained model for programming and natural languages , author=. Findings of the association for computational linguistics: EMNLP 2020 , pages=
work page 2020
-
[17]
Introducing the Model Context Protocol , author =. 2024 , howpublished =
work page 2024
-
[18]
Equipping Agents for the Real World with Agent Skills , author =. 2025 , howpublished =
work page 2025
-
[20]
The twelfth international conference on learning representations , year=
Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. The twelfth international conference on learning representations , year=
-
[21]
International Conference on Learning Representations (ICLR) , year=
ReAct: Synergizing Reasoning and Acting in Language Models , author=. International Conference on Learning Representations (ICLR) , year=
-
[22]
Advances in neural information processing systems , volume=
Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=
-
[23]
Advances in Neural Information Processing Systems , volume=
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face , author=. Advances in Neural Information Processing Systems , volume=
-
[24]
Advances in neural information processing systems , volume=
Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
-
[25]
2012 IEEE symposium on security and privacy , pages=
Dissecting android malware: Characterization and evolution , author=. 2012 IEEE symposium on security and privacy , pages=. 2012 , organization=
work page 2012
-
[26]
Llama 4: Open Foundation Models for Multimodal and Efficient AI , author =. 2025 , howpublished =
work page 2025
-
[27]
Advances in Neural Information Processing Systems , year=
Webshop: Towards scalable real-world web interaction with grounded language agents , author=. Advances in Neural Information Processing Systems , year=
-
[28]
The Twelfth International Conference on Learning Representations , year=
AgentBench: Evaluating LLMs as Agents , author=. The Twelfth International Conference on Learning Representations , year=
-
[29]
The twelfth international conference on learning representations , year=
Swe-bench: Can language models resolve real-world github issues? , author=. The twelfth international conference on learning representations , year=
-
[30]
Advances in Neural Information Processing Systems , year=
Gorilla: Large language model connected with massive apis , author=. Advances in Neural Information Processing Systems , year=
-
[31]
IEEE transactions on software engineering , year=
CCFinder: A multilinguistic token-based code clone detection system for large scale source code , author=. IEEE transactions on software engineering , year=
-
[32]
29th International Conference on Software Engineering (ICSE'07) , year=
Deckard: Scalable and accurate tree-based detection of code clones , author=. 29th International Conference on Software Engineering (ICSE'07) , year=
-
[33]
Proceedings of the 38th international conference on software engineering , year=
Sourcerercc: Scaling code clone detection to big-code , author=. Proceedings of the 38th international conference on software engineering , year=
-
[34]
IEEE Transactions on software Engineering , year=
CP-Miner: Finding copy-paste and related bugs in large-scale software code , author=. IEEE Transactions on software Engineering , year=
-
[35]
Proceedings of the second ACM conference on Data and Application Security and Privacy , year=
Detecting repackaged smartphone applications in third-party android marketplaces , author=. Proceedings of the second ACM conference on Data and Application Security and Privacy , year=
-
[36]
European Symposium on Research in Computer Security , year=
Attack of the clones: Detecting cloned applications on android markets , author=. European Symposium on Research in Computer Security , year=
-
[37]
European Symposium on Research in Computer Security , year=
Andarwin: Scalable detection of semantically similar android applications , author=. European Symposium on Research in Computer Security , year=
-
[38]
MCP.so , year = 2025, howpublished =
work page 2025
-
[39]
MCPServers.org , year = 2025, howpublished =
work page 2025
-
[40]
MCP Market , title =
-
[41]
Introducing the model context protocol
Anthropic . Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024
work page 2024
-
[42]
Equipping agents for the real world with agent skills
Anthropic . Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025
work page 2025
-
[43]
Clone detection using abstract syntax trees
Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant'Anna, and Lorraine Bier. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance, 1998
work page 1998
-
[44]
Comparison and evaluation of clone detection tools
Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo. Comparison and evaluation of clone detection tools. IEEE Transactions on software engineering, 2007
work page 2007
-
[45]
Attack of the clones: Detecting cloned applications on android markets
Jonathan Crussell, Clint Gibler, and Hao Chen. Attack of the clones: Detecting cloned applications on android markets. In European Symposium on Research in Computer Security, 2012
work page 2012
-
[46]
Andarwin: Scalable detection of semantically similar android applications
Jonathan Crussell, Clint Gibler, and Hao Chen. Andarwin: Scalable detection of semantically similar android applications. In European Symposium on Research in Computer Security, 2013
work page 2013
-
[47]
Deckard: Scalable and accurate tree-based detection of code clones
Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In 29th International Conference on Software Engineering (ICSE'07), 2007
work page 2007
-
[48]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The twelfth international conference on learning representations, 2023
work page 2023
-
[49]
Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, and Stefan Wagner. Do code clones matter? In 2009 IEEE 31st International Conference on Software Engineering, pages 485--495. IEEE, 2009
work page 2009
-
[50]
Ccfinder: A multilinguistic token-based code clone detection system for large scale source code
Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. Ccfinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE transactions on software engineering, 2002
work page 2002
-
[51]
Identifying almost identical files using context triggered piecewise hashing
Jesse Kornblum. Identifying almost identical files using context triggered piecewise hashing. Digital investigation, 3: 0 91--97, 2006
work page 2006
-
[52]
Survey of research on software clones
Rainer Koschke. Survey of research on software clones. 2007
work page 2007
-
[53]
Cp-miner: Finding copy-paste and related bugs in large-scale software code
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. Cp-miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on software Engineering, 2006
work page 2006
-
[54]
Agentbench: Evaluating llms as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
- [55]
-
[56]
Llama 4: Open foundation models for multimodal and efficient ai
Meta AI . Llama 4: Open foundation models for multimodal and efficient ai. https://ai.meta.com/llama/, 2025. Accessed: 2026-05-06
work page 2025
-
[57]
Gorilla: Large language model connected with massive apis
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[58]
Toolllm: Facilitating large language models to master 16000+ real-world apis
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The twelfth international conference on learning representations, 2023
work page 2023
-
[59]
Comparison and evaluation of code clone detection techniques and tools: A qualitative approach
Chanchal K Roy, James R Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of computer programming, 2009
work page 2009
-
[60]
A survey on software clone detection research
Chanchal Kumar Roy and James R Cordy. A survey on software clone detection research. Queen’s School of computing TR, 2007
work page 2007
-
[61]
Sourcerercc: Scaling code clone detection to big-code
Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina V Lopes. Sourcerercc: Scaling code clone detection to big-code. In Proceedings of the 38th international conference on software engineering, 2016
work page 2016
-
[62]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36: 0 68539--68551, 2023
work page 2023
-
[63]
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36: 0 38154--38180, 2023
work page 2023
-
[64]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36: 0 8634--8652, 2023
work page 2023
- [65]
-
[66]
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [67]
-
[68]
MCP.so. Mcp.so. https://mcp.so/, 2025
work page 2025
-
[69]
Webshop: Towards scalable real-world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[70]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
work page 2023
-
[71]
Detecting repackaged smartphone applications in third-party android marketplaces
Wu Zhou, Yajin Zhou, Xuxian Jiang, and Peng Ning. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the second ACM conference on Data and Application Security and Privacy, 2012
work page 2012
-
[72]
Dissecting android malware: Characterization and evolution
Yajin Zhou and Xuxian Jiang. Dissecting android malware: Characterization and evolution. In 2012 IEEE symposium on security and privacy, pages 95--109. IEEE, 2012
work page 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.