An Empirical Analysis of the Python Package Index (PyPI)
Pith reviewed 2026-05-24 16:00 UTC · model grok-4.3
The pith
PyPI has grown at a 47% compound annual rate for active packages over 15 years, with most packages from single individuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Within PyPI, the growth of the repository has been robust under all measures, with a compound annual growth rate of 47% for active packages, 39% for new authors, and 61% for new import statements over the last 15 years. As with many similar social systems, a number of highly right-skewed distributions are found, including the distribution of releases per package, packages and releases per author, imports per package, and size per package and release. However, most packages are contributed by single individuals, not multiple individuals or organizations. The data provides an anchor for public discourse on PyPI and a foundation for future research on the Python software ecosystem.
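The growth figures above are compound annual growth rates (CAGRs). As a minimal sketch of the arithmetic, assuming a simple two-endpoint calculation (the function names and the example counts are hypothetical and not taken from the paper), a CAGR is the constant yearly rate that carries a starting count to an ending count over a given number of years:

```python
def cagr(start_count: float, end_count: float, years: float) -> float:
    """Constant annual growth rate that turns start_count into end_count."""
    if start_count <= 0 or end_count <= 0 or years <= 0:
        raise ValueError("counts and years must be positive")
    return (end_count / start_count) ** (1.0 / years) - 1.0


def growth_factor(rate: float, years: float) -> float:
    """Total multiplicative growth implied by a constant annual rate."""
    return (1.0 + rate) ** years


# Hypothetical counts chosen only to illustrate the formula:
# doubling over two years corresponds to sqrt(2) - 1 per year.
example_rate = cagr(start_count=100, end_count=200, years=2)

# Total growth implied by the reported rates if held for 15 years.
implied = {
    name: growth_factor(rate, 15)
    for name, rate in [
        ("active packages", 0.47),
        ("new authors", 0.39),
        ("new import statements", 0.61),
    ]
}
```

Held for 15 years, a 47% annual rate implies a total growth factor of roughly 320, which gives a sense of how small the early-year counts are relative to the final ones.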
What carries the argument
A comprehensive snapshot of PyPI package metadata and source code used to compute counts and trends for packages, releases, dependencies, licenses, imports, authors, and organizations.
If this is right
- Growth rates across multiple metrics indicate sustained expansion of the Python ecosystem.
- Right-skewed distributions mean that a small number of packages account for a disproportionate share of releases and imports.
- Single-individual contributions suggest decentralized, individual-driven development in PyPI.
- The provided data serves as a foundation for future research and public discourse on software repositories.
Where Pith is reading between the lines
- If the growth trends continue, the number of packages on PyPI would keep compounding, and the diversity and complexity of the ecosystem would likely grow with it.
- Single-author dominance may imply challenges in maintenance for many packages.
- Comparisons with other language ecosystems could reveal whether similar single-contributor patterns hold elsewhere.
- The skewed distributions suggest potential for studying power laws in software contributions.
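One hedged way to quantify the skew discussed in these points is a Gini coefficient over per-package or per-author counts. The sketch below is an illustration on invented counts, not the paper's computation; the function name `gini` and the sample data are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0 or xs[0] < 0:
        raise ValueError("need non-negative values with a positive sum")
    # Rank-weighted formula on ascending-sorted data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Invented releases-per-package counts: most packages have few releases,
# one package has many.
releases_per_package = [1, 1, 1, 2, 2, 3, 5, 40]
skew = gini(releases_per_package)
```

A value near 0 would indicate that releases are spread evenly across packages, while a value approaching 1 would indicate that a handful of packages dominate.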
Load-bearing premise
The scraped snapshot of PyPI at the time of collection is assumed to be complete and free of significant missing packages, erroneous metadata, or inconsistent historical records that would materially alter the reported counts and growth rates.
What would settle it
Discovery of a substantial number of missing packages or historical records that, when included, reduce the calculated compound annual growth rate for active packages below 40%.
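To make this test concrete, the sketch below inverts the growth-rate formula and asks how much larger the earliest-year count would have to be, with the final count held fixed, for the rate to fall from 47% to 40% over 15 years. It assumes a simple two-endpoint CAGR, which may differ from the paper's actual calculation; the function name is hypothetical.

```python
def start_inflation_needed(reported_rate: float, target_rate: float, years: float) -> float:
    """Multiple by which the starting count must grow, holding the final
    count fixed, for a two-endpoint CAGR to drop from reported_rate to
    target_rate."""
    if target_rate >= reported_rate:
        raise ValueError("target_rate must be below reported_rate")
    return ((1.0 + reported_rate) / (1.0 + target_rate)) ** years


# Reported rate for active packages versus the 40% threshold named above.
needed = start_inflation_needed(reported_rate=0.47, target_rate=0.40, years=15)
```

Under this assumption the starting count would need to roughly double. Records missing from the final year would have the opposite effect, since adding them raises the two-endpoint rate rather than lowering it.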
Original abstract
In this research, we provide a comprehensive empirical summary of the Python Package Repository, PyPI, including both package metadata and source code covering 178,592 packages, 1,745,744 releases, 76,997 contributors, and 156,816,750 import statements. We provide counts and trends for packages, releases, dependencies, category classifications, licenses, and package imports, as well as authors, maintainers, and organizations. As one of the largest and oldest software repositories as of publication, PyPI provides insight not just into the Python ecosystem today, but also trends in software development and licensing more broadly over time. Within PyPI, we find that the growth of the repository has been robust under all measures, with a compound annual growth rate of 47% for active packages, 39% for new authors, and 61% for new import statements over the last 15 years. As with many similar social systems, we find a number of highly right-skewed distributions, including the distribution of releases per package, packages and releases per author, imports per package, and size per package and release. However, we also find that most packages are contributed by single individuals, not multiple individuals or organizations. The data, methods, and calculations herein provide an anchor for public discourse on PyPI and serve as a foundation for future research on the Python software ecosystem.
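The abstract counts import statements found in package source code. As a hedged illustration of how such counting can work, the sketch below uses Python's standard `ast` module to tally top-level module names from `import` and `from ... import` statements. The choice of tooling and the function name `count_imports` are assumptions, not a description of the paper's pipeline.

```python
import ast
from collections import Counter


def count_imports(source: str) -> Counter:
    """Tally top-level module names imported anywhere in the source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                counts[alias.name.split(".")[0]] += 1
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) refer to the package itself; skip them.
            if node.level == 0 and node.module:
                counts[node.module.split(".")[0]] += 1
    return counts


sample = """
import os, sys
import numpy as np
from os import path
from . import sibling
"""
import_counts = count_imports(sample)
```

One limitation of this approach is that `ast.parse` only accepts syntax valid for the running interpreter, so Python 2-only source files would raise `SyntaxError` and need separate handling.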
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a large-scale descriptive empirical analysis of the PyPI repository based on a single snapshot, covering 178,592 packages, 1,745,744 releases, 76,997 contributors, and 156,816,750 import statements. It reports counts, distributions, and compound annual growth rates (CAGRs) over 15 years, including 47% for active packages, 39% for new authors, and 61% for new import statements, along with observations on right-skewed distributions and the predominance of single-individual contributions. The work positions the dataset as a foundation for future research on the Python ecosystem.
Significance. If the scrape completeness and metric definitions hold, the paper supplies a valuable baseline of descriptive statistics for one of the largest open-source repositories. The scale of the dataset (nearly 180k packages and over 156M imports) deserves explicit credit. This provides an anchor for discourse on software ecosystems, licensing trends, and contribution patterns without relying on fitted models or predictions.
major comments (2)
- [Data collection subsection] Data collection and methods: The central growth-rate claims (e.g., 47% CAGR for active packages) rest on the assumption that the scraped snapshot is complete and that the definition of 'active packages' is robust; however, the manuscript provides no validation of scrape completeness, cross-checks against external package counts, or sensitivity analysis on metric definitions, which is load-bearing for all reported aggregates and trends.
- [Contributor analysis section] Results on contributor distributions: The claim that most packages are contributed by single individuals (rather than organizations or teams) is presented as a key finding, yet the manuscript does not detail the heuristics used to classify authors vs. organizations or report uncertainty in these classifications, which directly supports the social-systems observation.
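The first major comment asks for a sensitivity analysis on the definition of 'active packages'. A minimal sketch of such a check follows, assuming a hypothetical definition in which a package is active in a given year if it had at least one release within the preceding `window` years. The paper's actual definition is not given in this review, and the release histories below are invented.

```python
def active_counts(release_years, years, window):
    """Count packages with at least one release in (year - window, year]."""
    counts = {}
    for year in years:
        counts[year] = sum(
            any(year - window < r <= year for r in releases)
            for releases in release_years.values()
        )
    return counts


def cagr(start, end, periods):
    """Two-endpoint compound annual growth rate."""
    return (end / start) ** (1.0 / periods) - 1.0


# Invented release histories: package name -> years with at least one release.
release_years = {
    "pkg_a": [2015, 2016, 2017, 2018],
    "pkg_b": [2015],
    "pkg_c": [2016, 2018],
    "pkg_d": [2017],
    "pkg_e": [2018],
    "pkg_f": [2018],
}

# Growth rate of active packages from 2015 to 2018 under each window.
sensitivity = {}
for window in (1, 2, 3):
    counts = active_counts(release_years, range(2015, 2019), window)
    sensitivity[window] = cagr(counts[2015], counts[2018], periods=3)
```

Even on this toy data the growth rate changes with the window, which is the kind of dependence a sensitivity table would expose.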
minor comments (2)
- [Abstract] Abstract and results: Growth rates are stated without accompanying uncertainty quantification or error bars; adding these (or at minimum a limitations paragraph) would improve interpretability of the CAGR figures.
- [Discussion] The manuscript would benefit from a brief explicit limitations subsection addressing potential metadata inconsistencies in historical PyPI records.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below.
- Referee: [Data collection subsection] Data collection and methods: The central growth-rate claims (e.g., 47% CAGR for active packages) rest on the assumption that the scraped snapshot is complete and that the definition of 'active packages' is robust; however, the manuscript provides no validation of scrape completeness, cross-checks against external package counts, or sensitivity analysis on metric definitions, which is load-bearing for all reported aggregates and trends.
  Authors: We agree that the absence of explicit validation and sensitivity checks is a limitation. In the revision we will expand the Data collection subsection to document the exact snapshot date and scraping procedure (PyPI JSON API), note known sources of incompleteness, add a cross-check against contemporaneous public PyPI statistics, clarify the precise definition of 'active packages', and include a short sensitivity table showing how the reported CAGRs change under alternative activity thresholds.
  Revision: yes
- Referee: [Contributor analysis section] Results on contributor distributions: The claim that most packages are contributed by single individuals (rather than organizations or teams) is presented as a key finding, yet the manuscript does not detail the heuristics used to classify authors vs. organizations or report uncertainty in these classifications, which directly supports the social-systems observation.
  Authors: We accept the criticism. The current text relies on an ad-hoc rule-based heuristic applied to the author/maintainer metadata fields. In the revision we will describe the heuristic in full (including the list of organizational indicators and decision rules), supply illustrative examples, and report the fraction of cases flagged as ambiguous together with a brief discussion of classification uncertainty.
  Revision: yes
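The rebuttal mentions a rule-based heuristic over author and maintainer metadata but does not state its rules. The following is a purely hypothetical sketch of what such a classifier might look like, with an explicit "ambiguous" outcome so that classification uncertainty can be reported. The indicator list, decision rules, and sample names are all invented for illustration.

```python
import re

# Hypothetical organizational indicators; not the authors' list.
ORG_INDICATORS = frozenset({
    "inc", "llc", "ltd", "gmbh", "corp", "foundation",
    "team", "project", "developers", "contributors", "university",
})


def classify_author(author_field: str) -> str:
    """Label an author metadata string as organization, multiple,
    individual, or ambiguous."""
    text = (author_field or "").strip()
    if not text:
        return "ambiguous"
    tokens = re.findall(r"[a-z]+", text.lower())
    if any(token in ORG_INDICATORS for token in tokens):
        return "organization"
    # Commas, "&", or "and" suggest several people listed in one field.
    if re.search(r",|&|\band\b", text, flags=re.IGNORECASE):
        return "multiple"
    # Two to four words is treated here as a plausible personal name.
    if 2 <= len(text.split()) <= 4:
        return "individual"
    return "ambiguous"


sample_authors = ["Jane Doe", "Example Foundation", "Jane Doe and John Roe", "", "jdoe42"]
labels = [classify_author(author) for author in sample_authors]
ambiguous_fraction = labels.count("ambiguous") / len(labels)
```

Reporting `ambiguous_fraction` alongside the headline share of single-individual packages would show how much of that share depends on cases the rules cannot resolve.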
Circularity Check
Purely descriptive empirical summary with no circular derivations
The manuscript reports counts, distributions, and CAGR values computed directly from a single scraped PyPI snapshot (178k packages, 1.7M releases, etc.). No equations, fitted models, predictions, or self-citations appear; the headline growth rates (47% active packages, 39% new authors, 61% new imports) are simple aggregates of the collected metadata and import statements. The analysis is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The PyPI data snapshot at collection time accurately captures all packages, releases, and metadata without material omissions or errors.
Forward citations
Cited by 2 Pith papers
- Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
  First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.
- Analyzing the Availability of E-Mail Addresses for PyPI Libraries
  79.1% of PyPI libraries provide at least one valid email address, primarily from PyPI metadata, with high coverage extending to dependency chains.