Uncovering Similar but Different Packages in PyPI and Potential Security Threats
Pith reviewed 2026-06-30 05:40 UTC · model grok-4.3
The pith
Replication in PyPI redistributes code from popular packages, hides vulnerabilities, and enables malware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By analyzing one-third of the PyPI repository (200K packages), the study shows that replication frequently redistributes substantial portions of existing packages under different maintainers, creates vulnerability blind spots that current detection tools rarely catch, and serves as an attack vector for malware distribution. The evidence is 1,361 replicated packages of the top 3K popular projects, 256 previously unknown replicated vulnerable packages, and 7 previously unknown replicated malicious packages.
What carries the argument
Package replication, the duplication of most of the codebase from existing packages under different maintainers.
If this is right
- Replication of popular packages redistributes substantial portions of existing packages under different maintainers.
- Replication creates vulnerability blind spots that current detection tools rarely catch.
- Replication serves as an attack vector for malware distribution through minor modifications and code injection.
- Of the 3,883 known malicious packages analyzed, 186 (4.79 percent) replicated popular ones.
Where Pith is reading between the lines
- Package indexes could add automated similarity checks during upload to reduce developer confusion from near-duplicates.
- Vulnerability scanners might improve coverage by cross-referencing new packages against known originals rather than analyzing them in isolation.
- Malware detectors could flag packages that match popular ones except for small injected changes as higher-risk candidates.
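The checks suggested in this list can be pictured as a file-level comparison between an uploaded package and a popular original. The sketch below is illustrative only and is not the paper's detector: the function names (`file_digests`, `compare_packages`), the in-memory package representation (a dict of relative path to file bytes), and the 0.8 cutoff are all assumptions made for the example.

```python
import hashlib


def file_digests(files):
    """Map each relative path to the SHA-256 digest of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def compare_packages(original, candidate, min_overlap=0.8):
    """Compare two packages given as {relative_path: bytes} dicts.

    Returns the share of the original's files found byte-identical in the
    candidate, plus the candidate files with no identical counterpart.
    min_overlap is a hypothetical cutoff, not a value from the paper.
    """
    orig_hashes = set(file_digests(original).values())
    cand_digests = file_digests(candidate)
    shared = orig_hashes & set(cand_digests.values())
    overlap = len(shared) / len(orig_hashes) if orig_hashes else 0.0
    changed = sorted(p for p, h in cand_digests.items() if h not in orig_hashes)
    return {
        "overlap": overlap,
        "changed_files": changed,
        # High overlap plus a few unmatched files is the pattern of interest.
        "flag": overlap >= min_overlap and bool(changed),
    }


# Toy data: five files copied verbatim, one modified, one injected.
original = {f"pkg/mod{i}.py": f"def f{i}(): return {i}\n".encode() for i in range(6)}
candidate = dict(original)
candidate["pkg/mod0.py"] = b"import os\ndef f0(): return 0\n"
candidate["pkg/extra.py"] = b"print('injected')\n"

report = compare_packages(original, candidate)
print(report)
```

Exact hashing only catches byte-identical files, so a reformatted or renamed copy would slip through; a real check would likely need a more tolerant similarity measure.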
Load-bearing premise
The criteria and algorithm used to classify packages as replicated rather than independently similar implementations are accurate and were applied consistently.
What would settle it
A manual audit of the reported replicated packages that finds a substantial fraction were developed independently instead of copied would reduce or eliminate the reported counts of security impacts.
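As a rough illustration of how such an audit would translate into adjusted counts, the sketch below computes a Wilson score interval for the precision observed in an audited sample and scales the reported total by it. The sample size and number of confirmed copies are invented for the example; only the 1,361 figure comes from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


REPORTED = 1361   # replicated popular packages reported in the paper
audited = 100     # hypothetical audit sample size
confirmed = 90    # hypothetical number judged to be genuine copies

low, high = wilson_interval(confirmed, audited)
print(f"precision: {confirmed / audited:.2f} (95% CI {low:.2f} to {high:.2f})")
print(f"implied genuine replicas: {REPORTED * low:.0f} to {REPORTED * high:.0f}")
```

The wider the interval, the less an audit of that size can say about the headline counts, which is one way to decide how many packages need to be inspected.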
Original abstract
In this study, we present a large-scale, in-depth study of package replication in PyPI. As a vital platform, PyPI streamlines Python package distribution for developers. However, beyond small-scale code cloning, we observe that many replicated packages exist on PyPI, which duplicate most of the codebase from existing packages. Such replication not only confuses developers but also propagates known vulnerabilities and enables the creation of new malicious packages. To address this issue, we comprehensively examine the characteristics and potential threats of replicated packages. Using one-third of the entire PyPI repository (200K packages), we investigate replication from three perspectives: replication of popular packages, vulnerable packages, and malicious packages. Our experiments reveal three critical findings about package replication in PyPI: (1) by identifying 1,361 replicated packages of the top 3K popular projects, we show that replication frequently redistributes substantial portions of existing packages under different maintainers; (2) by uncovering 256 previously unknown replicated vulnerable packages, we demonstrate that replication creates vulnerability blind spots that current detection tools rarely catch; (3) by analyzing 3,883 known malicious packages, we found that 186 (4.79%) replicated popular ones, and this pattern further led us to identify seven previously unknown replicated malicious packages, highlighting its role as an attack vector for malware distribution through minor modifications and code injection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a large-scale observational study of package replication in PyPI, analyzing 200K packages (one-third of the repository). It identifies 1,361 replicated packages of the top 3K popular projects and 256 previously unknown replicated vulnerable packages. Of 3,883 known malicious packages, 186 (4.79%) replicated popular ones, a pattern that led the authors to 7 previously unknown replicated malicious packages. The manuscript concludes that replication redistributes substantial code under new maintainers, creates vulnerability blind spots missed by current tools, and serves as a malware attack vector via minor modifications and code injection.
Significance. If the replication detection procedure is accurate, reproducible, and validated against false positives, the results would provide concrete empirical evidence of a systemic security risk in the Python ecosystem, quantifying how code redistribution can propagate vulnerabilities and enable malware. The scale (200K packages) and specific counts could inform improvements to package managers, vulnerability scanners, and malware detection, representing a meaningful contribution to software supply-chain security research.
major comments (2)
- [Methods] Methods section: The similarity metric, threshold, handling of package metadata/dependencies, and validation steps (e.g., precision/recall on a labeled sample or manual audit) used to classify packages as 'replicated' versus independent implementations are not described. All headline counts (1,361 replicated popular packages, 256 replicated vulnerable packages, 7 new malicious packages) rest on this unstated procedure; without it, it is impossible to assess whether the security-threat conclusions follow from the data.
- [Results on popular packages] § on replication of popular packages: The claim that replication 'frequently redistributes substantial portions of existing packages' requires supporting quantitative detail such as the distribution of code-overlap percentages or similarity scores across the 1,361 cases; the current presentation leaves open whether many cases are superficially similar rather than true redistributions.
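The quantitative detail requested in the second comment could be reported as a simple bucketed distribution of per-package overlap. The sketch below uses made-up overlap values and a hypothetical `overlap_histogram` helper; it does not reproduce any data from the paper.

```python
from collections import Counter

BIN_WIDTH = 10  # percentage points per bin (an arbitrary choice for the sketch)


def overlap_histogram(overlaps, bin_width=BIN_WIDTH):
    """Count overlap percentages (0 to 100) per bin of bin_width points.

    A value of exactly 100 is placed in the top bin.
    """
    counts = Counter()
    top = 100 - bin_width
    for pct in overlaps:
        if not 0 <= pct <= 100:
            raise ValueError(f"overlap out of range: {pct}")
        lower = min(int(pct // bin_width) * bin_width, top)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical overlap percentages for a handful of replicated packages.
sample = [99.5, 100, 92.0, 87.3, 71.0, 95.1, 64.8, 100, 83.0]
hist = overlap_histogram(sample)
for lower, n in hist.items():
    print(f"{lower:3d}-{lower + BIN_WIDTH:3d}%: {'#' * n} ({n})")
```

A distribution concentrated in the top bins would support the "substantial portions" wording, while a long tail of low-overlap cases would point to superficial similarity.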
minor comments (2)
- [Abstract] Abstract: Clarify the sampling method for the 200K packages (random, time-based, or popularity-stratified) and the exact date or total size of PyPI at the time of data collection.
- [Results] The paper would benefit from a table or figure summarizing the replication similarity distribution and any false-positive rate estimates.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the presentation of our methods and results.
- Referee: [Methods] Methods section: The similarity metric, threshold, handling of package metadata/dependencies, and validation steps (e.g., precision/recall on a labeled sample or manual audit) used to classify packages as 'replicated' versus independent implementations are not described. All headline counts (1,361 replicated popular packages, 256 replicated vulnerable packages, 7 new malicious packages) rest on this unstated procedure; without it, it is impossible to assess whether the security-threat conclusions follow from the data.
  Authors: We agree that the current manuscript does not provide adequate detail on the replication detection procedure. In the revised version, we will expand the Methods section to fully describe the similarity metric employed, the threshold for classifying replication, the handling of package metadata and dependencies, and the validation steps including any precision/recall assessment on a labeled sample or manual audit. This addition will enable readers to evaluate the reliability of the reported counts and the resulting security conclusions.
  Revision: yes
- Referee: [Results on popular packages] § on replication of popular packages: The claim that replication 'frequently redistributes substantial portions of existing packages' requires supporting quantitative detail such as the distribution of code-overlap percentages or similarity scores across the 1,361 cases; the current presentation leaves open whether many cases are superficially similar rather than true redistributions.
  Authors: We acknowledge that additional quantitative support is needed to substantiate the claim of substantial code redistribution. In the revision, we will include the distribution of code-overlap percentages and similarity scores across the 1,361 replicated popular packages. This will provide concrete evidence that the identified cases involve meaningful code duplication rather than superficial similarities.
  Revision: yes
Circularity Check
No circularity: direct empirical counts from repository scan
The paper reports observational counts (1,361 replicated popular packages, 256 vulnerable, 7 malicious) obtained by scanning one-third of PyPI. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described method. The replication classification procedure is a data-processing step whose validity is external to the reported numbers; it does not reduce the outputs to the inputs by construction. This is a standard empirical measurement study whose central claims rest on the accuracy of the (unspecified here) similarity detector rather than on any definitional or self-referential loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- replication similarity threshold
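Because the threshold is a free parameter, the reported counts depend on where it is set. A minimal sketch of a sensitivity sweep follows, using invented similarity scores and a hypothetical `count_replicated` helper; the paper's actual metric and threshold are not given in this summary.

```python
def count_replicated(scores, threshold):
    """Number of candidate pairs whose similarity meets the threshold."""
    return sum(1 for s in scores if s >= threshold)


# Invented similarity scores in [0, 1] for candidate package pairs.
scores = [0.99, 0.97, 0.91, 0.88, 0.85, 0.79, 0.74, 0.62, 0.55, 0.41]

sweep = {t: count_replicated(scores, t) for t in (0.6, 0.7, 0.8, 0.9)}
for t, n in sweep.items():
    print(f"threshold {t:.1f}: {n} pairs classified as replicated")
```

A steep change in the count between neighbouring thresholds would signal that the headline numbers are sensitive to this choice.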
axioms (1)
- domain assumption: The 200K-package sample is representative of replication patterns across the full PyPI repository.
Published in Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE210 (July 2026), by Sunha Park, Soojin Han, and Seunghoon Woo. Received 2026-02-25; accepted 2026-03-24.