arxiv: 1907.11073 · v2 · pith:M2PEY5LSnew · submitted 2019-07-25 · 💻 cs.SE · cs.CY · physics.soc-ph

An Empirical Analysis of the Python Package Index (PyPI)

Pith reviewed 2026-05-24 16:00 UTC · model grok-4.3

classification 💻 cs.SE · cs.CY · physics.soc-ph
keywords PyPI · Python packages · software repository analysis · growth rates · open source contributions · package imports · empirical study · single author packages

The pith

PyPI has grown at a 47% compound annual rate for active packages over 15 years, with most packages from single individuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines metadata and source code from 178,592 Python packages on PyPI. It measures growth in active packages, new authors, and new import statements, finding compound annual growth rates of 47%, 39%, and 61%, respectively, over 15 years. The analysis also reveals highly right-skewed distributions of releases, contributions, imports, and sizes. Most packages are developed by single individuals rather than teams or organizations. These findings offer a baseline for understanding trends in open source Python development.

Core claim

Within PyPI, the growth of the repository has been robust under all measures, with a compound annual growth rate of 47% for active packages, 39% for new authors, and 61% for new import statements over the last 15 years. As with many similar social systems, a number of highly right-skewed distributions are found, including the distribution of releases per package, packages and releases per author, imports per package, and size per package and release. However, most packages are contributed by single individuals, not multiple individuals or organizations. The data provides an anchor for public discourse on PyPI and a foundation for future research on the Python software ecosystem.
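For readers who want to check the arithmetic behind a figure such as 47%, a compound annual growth rate is conventionally computed from a starting value, an ending value, and a number of yearly periods. The sketch below illustrates that standard definition only; the paper's exact computation is not shown on this page, and the counts are hypothetical values chosen so that the result lands at 47%.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / periods) - 1."""
    if start_value <= 0 or end_value <= 0 or periods <= 0:
        raise ValueError("start_value, end_value, and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Hypothetical counts (NOT taken from the paper): a quantity that starts at
# 100 and grows by 47% per year for 15 yearly periods.
hypothetical_start = 100.0
hypothetical_end = hypothetical_start * 1.47 ** 15

rate = cagr(hypothetical_start, hypothetical_end, periods=15)
print(f"CAGR: {rate:.1%}")
```

Under this definition, a 47% rate sustained over 15 periods corresponds to more than a 300-fold increase between the first and last observation, which is why small changes in the endpoint counts matter less than the choice of window.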

What carries the argument

A comprehensive snapshot of PyPI package metadata and source code used to compute counts and trends for packages, releases, dependencies, licenses, imports, authors, and organizations.
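One common way to count import statements in Python source is the standard library's `ast` module. The sketch below is an assumption about how such counting could be done, not a description of the paper's pipeline; the function name `count_imports`, the choice to count only top-level module names, and the sample source are all illustrative.

```python
import ast
from collections import Counter


def count_imports(source: str) -> Counter:
    """Count absolute imports in `source`, keyed by top-level module name."""
    counts = Counter()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Source that the running interpreter cannot parse (for example,
        # Python 2 syntax under Python 3) is skipped in this sketch.
        return counts
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                counts[alias.name.split(".")[0]] += 1
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import such as "from . import x";
            # those point inside the package and are ignored here.
            if node.level == 0 and node.module:
                counts[node.module.split(".")[0]] += 1
    return counts


# Hypothetical sample source, for illustration only.
sample_source = (
    "import os\n"
    "import os.path\n"
    "from sys import argv\n"
    "from . import sibling\n"
)
import_counts = count_imports(sample_source)
print(import_counts)
```

Skipping unparseable files is a design choice with consequences: for a repository that spans both Python 2 and Python 3 code, a single-interpreter parser would undercount older packages, so a real analysis would need a fallback parser.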

If this is right

  • Growth rates across multiple metrics indicate sustained expansion of the Python ecosystem.
  • Right-skewed distributions mean a small number of packages account for most releases and imports.
  • Single-individual contributions suggest decentralized, individual-driven development in PyPI.
  • The provided data serves as a foundation for future research and public discourse on software repositories.
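The right-skew in the second point above can be made concrete with two simple summaries: the share of a total held by the top fraction of items, and the Gini coefficient. The sketch below uses hypothetical release counts, not the paper's data, and the helper names `top_share` and `gini` are illustrative.

```python
from statistics import mean, median


def top_share(values, fraction):
    """Share of the total held by the largest `fraction` of items."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical releases-per-package counts (NOT from the paper).
hypothetical_releases = [100, 5, 3, 2, 1, 1, 1, 1, 1, 1]

print("mean:", mean(hypothetical_releases))
print("median:", median(hypothetical_releases))
print("top 10% share:", top_share(hypothetical_releases, 0.10))
print("gini:", gini(hypothetical_releases))
```

A mean far above the median, together with a large top-decile share, is the usual signature of a right-skewed distribution; in this toy sample a single package accounts for most of the releases.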

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the growth trends continue, PyPI could see exponential increases in package diversity and complexity.
  • Single-author dominance may imply challenges in maintenance for many packages.
  • Comparisons with other language ecosystems could reveal whether similar single-contributor patterns hold elsewhere.
  • The skewed distributions suggest potential for studying power laws in software contributions.

Load-bearing premise

The scraped snapshot of PyPI at the time of collection is assumed to be complete and free of significant missing packages, erroneous metadata, or inconsistent historical records that would materially alter the reported counts and growth rates.
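One way to probe this premise is to compare the project names in a snapshot against an independent listing of the index. The sketch below assumes both lists are already available as plain strings; the list contents are hypothetical, and only the name normalization rule (runs of `-`, `_`, and `.` collapsed to a single `-`, then lowercased) follows PEP 503.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name following the rule given in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_from_snapshot(index_names, snapshot_names):
    """Return normalized names present in the index but absent from the snapshot."""
    snapshot = {normalize_name(n) for n in snapshot_names}
    index = {normalize_name(n) for n in index_names}
    return sorted(index - snapshot)


# Hypothetical name lists, for illustration only.
hypothetical_index = ["Django", "numpy", "zope.interface", "example_pkg"]
hypothetical_snapshot = ["django", "NumPy", "zope-interface"]

missing = missing_from_snapshot(hypothetical_index, hypothetical_snapshot)
coverage = 1 - len(missing) / len(hypothetical_index)
print(missing, coverage)
```

Normalizing before comparing avoids counting spelling variants of the same project as missing, which would otherwise overstate incompleteness.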

What would settle it

Discovery of a substantial number of missing packages or historical records that, when included, reduce the calculated compound annual growth rate for active packages below 40%.
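To gauge how large such a discovery would have to be, assume (hypothetically) that the 47% figure is an endpoint-based compound rate over 15 yearly periods. The sketch below computes the factor by which the end-to-start ratio would need to shrink for the rate to fall to 40%; the function name and the endpoint assumption are illustrative and are not taken from the paper.

```python
def required_ratio_scale(reported_rate: float, threshold_rate: float, periods: int) -> float:
    """Factor applied to the end/start ratio that moves the CAGR
    from `reported_rate` to `threshold_rate` over `periods` periods."""
    return ((1.0 + threshold_rate) / (1.0 + reported_rate)) ** periods


# Assumption: an endpoint-based CAGR over 15 yearly periods.
scale = required_ratio_scale(reported_rate=0.47, threshold_rate=0.40, periods=15)
print(f"end/start ratio must shrink to {scale:.3f} of its reported value")
```

Under that assumption the ratio would have to fall to roughly 0.48 of its reported value, which is equivalent to the starting count being about twice as large as recorded. Missing records would therefore need to be concentrated in the early years and be substantial in number to overturn the headline rate.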

Figures

Figures reproduced from arXiv: 1907.11073 by Ethan Bommarito, Michael Bommarito.

Figure 1: Dependency Identification Flowchart
Adjacent text (Section 2.3, License Identification and Normalization Procedure): Package licensing is not a simple topic; packages may split or combine, change ownership, change license, offer multiple licenses for the entire package, license subsets of a package separately, or vendor other packages with other licenses. From a legal perspective, the proper unit of analysis may be context-depende…
Figure 2: License Identification Flowchart
Adjacent text (Section 3, Results): This paper is intended to provide a high-level empirical overview of the PyPI ecosystem as of May 2019; as such, our results largely consist of raw counts and proportions by categorical dimension or over time. While a limitless number of more detailed causal or normative questions could be asked and answered, we limit the scope of this paper to provide simple, di…
Original abstract

In this research, we provide a comprehensive empirical summary of the Python Package Repository, PyPI, including both package metadata and source code covering 178,592 packages, 1,745,744 releases, 76,997 contributors, and 156,816,750 import statements. We provide counts and trends for packages, releases, dependencies, category classifications, licenses, and package imports, as well as authors, maintainers, and organizations. As one of the largest and oldest software repositories as of publication, PyPI provides insight not just into the Python ecosystem today, but also trends in software development and licensing more broadly over time. Within PyPI, we find that the growth of the repository has been robust under all measures, with a compound annual growth rate of 47% for active packages, 39% for new authors, and 61% for new import statements over the last 15 years. As with many similar social systems, we find a number of highly right-skewed distributions, including the distribution of releases per package, packages and releases per author, imports per package, and size per package and release. However, we also find that most packages are contributed by single individuals, not multiple individuals or organizations. The data, methods, and calculations herein provide an anchor for public discourse on PyPI and serve as a foundation for future research on the Python software ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a large-scale descriptive empirical analysis of the PyPI repository based on a single snapshot, covering 178,592 packages, 1,745,744 releases, 76,997 contributors, and 156,816,750 import statements. It reports counts, distributions, and compound annual growth rates (CAGRs) over 15 years, including 47% for active packages, 39% for new authors, and 61% for new import statements, along with observations on right-skewed distributions and the predominance of single-individual contributions. The work positions the dataset as a foundation for future research on the Python ecosystem.

Significance. If the scrape is complete and the metric definitions hold, the paper supplies a valuable baseline of descriptive statistics for one of the largest open-source repositories. The scale of the dataset (nearly 180k packages and over 156M import statements) deserves explicit credit. This provides an anchor for discourse on software ecosystems, licensing trends, and contribution patterns without relying on fitted models or predictions.

major comments (2)
  1. [Data collection subsection] Data collection and methods: The central growth-rate claims (e.g., 47% CAGR for active packages) rest on the assumption that the scraped snapshot is complete and that the definition of 'active packages' is robust; however, the manuscript provides no validation of scrape completeness, cross-checks against external package counts, or sensitivity analysis on metric definitions, which is load-bearing for all reported aggregates and trends.
  2. [Contributor analysis section] Results on contributor distributions: The claim that most packages are contributed by single individuals (rather than organizations or teams) is presented as a key finding, yet the manuscript does not detail the heuristics used to classify authors vs. organizations or report uncertainty in these classifications, which directly supports the social-systems observation.
minor comments (2)
  1. [Abstract] Abstract and results: Growth rates are stated without accompanying uncertainty quantification or error bars; adding these (or at minimum a limitations paragraph) would improve interpretability of the CAGR figures.
  2. [Discussion] The manuscript would benefit from a brief explicit limitations subsection addressing potential metadata inconsistencies in historical PyPI records.
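The sensitivity analysis requested in major comment 1 can be sketched as follows. The definition of "active" used here (at least one release within a trailing window of years) is a hypothetical stand-in, since the paper's definition is not given on this page, and the release histories are invented for illustration.

```python
def active_count(release_years_by_package, year, window):
    """Count packages with at least one release in the `window` years ending at `year`."""
    return sum(
        any(year - window < y <= year for y in years)
        for years in release_years_by_package.values()
    )


def cagr(start, end, periods):
    """Endpoint-based compound annual growth rate."""
    return (end / start) ** (1.0 / periods) - 1.0


# Hypothetical release histories (NOT from the paper).
hypothetical_releases = {
    "pkg_a": [2015, 2016, 2017, 2018],
    "pkg_b": [2015],
    "pkg_c": [2016, 2018],
    "pkg_d": [2017],
    "pkg_e": [2018],
    "pkg_f": [2018],
}

# One row per activity window: (start count, end count, CAGR over 3 periods).
sensitivity = {}
for window in (1, 2, 3):
    start = active_count(hypothetical_releases, 2015, window)
    end = active_count(hypothetical_releases, 2018, window)
    sensitivity[window] = (start, end, cagr(start, end, 3))

for window, (start, end, rate) in sensitivity.items():
    print(f"window={window}y start={start} end={end} cagr={rate:.1%}")
```

Even in this toy example the growth rate shifts when the window changes, which is the effect a sensitivity table would quantify for the real data.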

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below.

point-by-point responses
  1. Referee: [Data collection subsection] Data collection and methods: The central growth-rate claims (e.g., 47% CAGR for active packages) rest on the assumption that the scraped snapshot is complete and that the definition of 'active packages' is robust; however, the manuscript provides no validation of scrape completeness, cross-checks against external package counts, or sensitivity analysis on metric definitions, which is load-bearing for all reported aggregates and trends.

    Authors: We agree that the absence of explicit validation and sensitivity checks is a limitation. In the revision we will expand the Data collection subsection to document the exact snapshot date and scraping procedure (PyPI JSON API), note known sources of incompleteness, add a cross-check against contemporaneous public PyPI statistics, clarify the precise definition of 'active packages', and include a short sensitivity table showing how the reported CAGRs change under alternative activity thresholds. revision: yes

  2. Referee: [Contributor analysis section] Results on contributor distributions: The claim that most packages are contributed by single individuals (rather than organizations or teams) is presented as a key finding, yet the manuscript does not detail the heuristics used to classify authors vs. organizations or report uncertainty in these classifications, which directly supports the social-systems observation.

    Authors: We accept the criticism. The current text relies on an ad-hoc rule-based heuristic applied to the author/maintainer metadata fields. In the revision we will describe the heuristic in full (including the list of organizational indicators and decision rules), supply illustrative examples, and report the fraction of cases flagged as ambiguous together with a brief discussion of classification uncertainty. revision: yes
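The kind of rule-based heuristic described in the second response could look like the sketch below. The indicator list, the decision rules, and the category names are all hypothetical illustrations; the actual rules applied to the author and maintainer fields are not given on this page.

```python
import re
from collections import Counter

# Hypothetical organizational indicators (NOT the authors' list).
ORG_PATTERN = re.compile(
    r"\b(inc|llc|ltd|gmbh|corp|foundation|team|developers|contributors|project|group)\b",
    re.IGNORECASE,
)
# Hypothetical markers that a field names more than one person.
MULTI_PATTERN = re.compile(r",|;|&|\band\b", re.IGNORECASE)


def classify_author(field):
    """Classify a raw author metadata string into one of four labels."""
    if field is None or not field.strip():
        return "ambiguous"
    if ORG_PATTERN.search(field):
        return "organization"
    if MULTI_PATTERN.search(field):
        return "multiple"
    return "individual"


# Hypothetical metadata values, for illustration only.
hypothetical_fields = [
    "Jane Doe",
    "Example Foundation",
    "Acme, Inc.",
    "Jane Doe and John Roe",
    "",
]
labels = Counter(classify_author(f) for f in hypothetical_fields)
ambiguous_fraction = labels["ambiguous"] / len(hypothetical_fields)
print(labels, ambiguous_fraction)
```

Checking organizational indicators before the multiple-author markers keeps names such as "Acme, Inc." from being misread as two people. Reporting the ambiguous fraction alongside the labels is one simple way to expose classification uncertainty.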

Circularity Check

0 steps flagged

Purely descriptive empirical summary with no circular derivations

full rationale

The manuscript reports counts, distributions, and CAGR values computed directly from a single scraped PyPI snapshot (178k packages, 1.7M releases, etc.). No equations, fitted models, predictions, or self-citations appear; the headline growth rates (47% active packages, 39% new authors, 61% new imports) are simple aggregates of the collected metadata and import statements. The analysis is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the collected PyPI snapshot is representative and that the chosen operational definitions of packages, releases, authors, and imports are stable across the 15-year window.

axioms (1)
  • domain assumption: The PyPI data snapshot at collection time accurately captures all packages, releases, and metadata without material omissions or errors.
    All reported counts, growth rates, and distributional claims depend directly on this completeness.

pith-pipeline@v0.9.0 · 5781 in / 1265 out tokens · 25485 ms · 2026-05-24T16:00:46.659759+00:00

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

    cs.SE · 2025-06 · conditional · novelty 8.0

    First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

  2. Analyzing the Availability of E-Mail Addresses for PyPI Libraries

    cs.SE · 2026-01 · unverdicted · novelty 3.0

    79.1% of PyPI libraries provide at least one valid email address, primarily from PyPI metadata, with high coverage extending to dependency chains.
