
arxiv: 2605.16701 · v1 · pith:WO3EHEABnew · submitted 2026-05-15 · 💻 cs.SE

What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects

Pith reviewed 2026-05-20 15:44 UTC · model grok-4.3

classification 💻 cs.SE
keywords GitHub repositories · empirical analysis · repository contents · software evolution · CI/CD · configuration formats · Docker · generative AI

The pith

An analysis of 10,000 GitHub repositories finds that README.md, .gitignore, and LICENSE have consolidated as standard files and that GitHub Actions has become the dominant CI/CD platform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to document the typical contents of GitHub repositories by examining files and directories in a large sample and tracking how they have changed over the past ten years. It reports that certain files have become standard, that CI/CD has shifted toward GitHub Actions, that configuration formats have moved away from XML toward TOML, YAML, and JSON, and that new content related to AI tools is appearing. A reader would care because these patterns indicate how platform standards are influencing open source practices and how technologies rise and fall in real projects.

Core claim

Our results show major changes in GitHub over the last decade including the consolidation of README.md, .gitignore, and LICENSE as standard artifacts, the rise of GitHub Actions as the dominant CI/CD platform, the growth of configuration formats such as TOML, YAML, and JSON alongside a decline in XML, new trends such as the growth of Dockerfile, and emerging content related to LLMs and generative AI such as AGENTS.md.

What carries the argument

Empirical examination of the files, directories, and file extensions present in 10,000 GitHub repositories along with their changes over a ten-year period
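The kind of counting this describes can be illustrated with a minimal Python sketch. It assumes each repository has already been reduced to a list of tracked file paths; the function name `prevalence` and the toy input are hypothetical and are not taken from the paper, whose actual tooling is not described here.

```python
from collections import Counter
from pathlib import PurePosixPath


def prevalence(repos: dict[str, list[str]]) -> tuple[Counter, Counter]:
    """Count, per file name and per extension, how many repositories contain it."""
    name_counts: Counter = Counter()
    ext_counts: Counter = Counter()
    for paths in repos.values():
        names: set[str] = set()
        exts: set[str] = set()
        for raw in paths:
            path = PurePosixPath(raw)
            names.add(path.name)
            # Dotfiles such as .gitignore have no suffix and are counted by name only.
            if path.suffix:
                exts.add(path.suffix.lower())
        # Sets ensure a repository contributes at most once per name or extension.
        name_counts.update(names)
        ext_counts.update(exts)
    return name_counts, ext_counts


# Hypothetical toy input: repository id -> tracked file paths.
sample = {
    "repo-a": ["README.md", ".gitignore", "LICENSE", "src/main.py"],
    "repo-b": ["README.md", ".github/workflows/ci.yml", "Dockerfile"],
    "repo-c": ["README.md", "LICENSE", "pom.xml", "docs/README.md"],
}
names, exts = prevalence(sample)
share_readme = names["README.md"] / len(sample)
```

Counting repositories rather than files keeps a project with many README.md copies from inflating the share, which is the quantity the trend claims are about.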

If this is right

  • Open source is evolving not only organically but also under the guidance of GitHub's standards.
  • Technologies and file formats experience rises and declines in usage over time.
  • New repository contents tied to generative AI are beginning to appear.
  • These observations can aid in the design and interpretation of mining software repository studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Project creators could adopt the identified standard files to better match current community practices.
  • Software tools and platforms may need to prioritize support for YAML, TOML, and GitHub Actions workflows.
  • Future studies of repository mining could build on these trends to forecast structural changes in open source projects.
  • The appearance of AI-related files suggests repositories are becoming venues for integrating generative models.

Load-bearing premise

That the 10,000 repositories form a representative sample of GitHub projects overall, and that the available metadata permits reliable reconstruction of content evolution across the full ten-year period.

What would settle it

Repeating the analysis on a new sample of repositories, or over an extended time period, and finding either no consolidation of README.md, .gitignore, and LICENSE or a continued prevalence of XML over YAML and JSON.

Original abstract

GitHub is the largest code hosting platform, with millions of repositories spanning multiple technologies. Despite this, little is known about the actual contents of GitHub's repositories in the wild. This paper presents an initial empirical analysis to better understand the contents of real-world GitHub repositories. We analyze the files, directories, and extensions present in 10,000 GitHub repositories, as well as their evolution over ten years. Our results show major changes in GitHub over the last decade: (1) the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; (2) the rise of GitHub Actions as the dominant CI/CD platform; (3) the growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML; (4) new trends, such as the growth of Dockerfile; and (5) emerging content related to LLMs and generative AI (e.g., AGENTS.md). Based on our findings, we discuss implications, including that open source is not only evolving organically but also increasingly guided by GitHub's standards, the rise and fall of technologies, and the potential support for mining software repository studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical analysis of the contents of 10,000 GitHub repositories, examining files, directories, extensions, and their evolution over ten years. Key findings include the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; the rise of GitHub Actions as the dominant CI/CD platform; growth in TOML, YAML, and JSON configuration formats with a decline in XML; increased use of Dockerfiles; and emerging content related to LLMs such as AGENTS.md. The authors discuss implications for open source evolution and mining software repository studies.

Significance. If the methodological details support the representativeness of the sample and the accuracy of the temporal reconstruction, the results could provide important insights into trends in open-source software development on GitHub. The large-scale analysis of real-world repositories is a strength, offering data-driven observations that could guide future research and platform improvements. However, the absence of detailed sampling and validation methods limits the immediate impact.

major comments (3)
  1. §3 (Data Collection and Sampling): The sampling strategy for the 10,000 repositories is not described. No details are given on selection criteria, stratification by creation date, repository size, primary language, or popularity metrics. This is load-bearing for all longitudinal claims about decade-long trends, as non-stratified sampling (e.g., via search API defaults or popularity signals) would over-represent recent or popular projects.
  2. §4 (Evolution Analysis): The method used to reconstruct file presence and content evolution over ten years is unspecified. It is unclear whether dating relies on git commit timestamps, file metadata, or current repository state, and how deleted files, early forks, or incomplete histories are handled. Without this, the reported consolidation of README.md/.gitignore/LICENSE and shifts in configuration formats cannot be reliably attributed to temporal change rather than sampling artifacts.
  3. Abstract and §5 (Results): No sampling method, exclusion criteria, statistical tests, confidence intervals, or error analysis are reported for the observed trends. This leaves the central claims (e.g., rise of GitHub Actions, decline of XML) without visible evidential support for assessing statistical significance or generalizability.
minor comments (2)
  1. Table 1 or equivalent summary table: Add a column for the total number of repositories per year or per language so readers can assess the distribution underlying the trend percentages.
  2. Figure captions: Ensure all trend figures include axis labels with units (e.g., percentage of repositories) and a note on the underlying sample size per time bin.
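
Major comment 2 lists git commit timestamps as one possible basis for dating. A minimal Python sketch of that option follows, assuming a full (non-shallow) local clone and the `git` CLI on the PATH. The helper names are hypothetical, and this is not presented as the paper's method.

```python
import subprocess
from datetime import datetime, timezone
from typing import Optional


def earliest_year(log_output: str) -> Optional[int]:
    """Return the UTC year of the oldest UNIX timestamp in the text, or None."""
    stamps = [int(token) for token in log_output.split()]
    if not stamps:
        return None
    # min() rather than "last line": author dates are not guaranteed to be ordered.
    return datetime.fromtimestamp(min(stamps), tz=timezone.utc).year


def first_added_year(repo_dir: str, path: str) -> Optional[int]:
    """Year of the earliest commit that added `path`, following renames."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--follow", "--diff-filter=A",
         "--format=%at", "--", path],
        capture_output=True, text=True, check=True,
    )
    return earliest_year(result.stdout)
```

A call such as `first_added_year("path/to/clone", "README.md")` (a hypothetical path) would return a year, or `None` when the file never appears in the history. Author timestamps can be rewritten or wrong, which is one reason the dating method needs to be stated explicitly.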

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important areas where methodological transparency can be improved. We address each major comment below and indicate the revisions we plan to make.

point-by-point responses
  1. Referee: The sampling strategy for the 10,000 repositories is not described. No details are given on selection criteria, stratification by creation date, repository size, primary language, or popularity metrics. This is load-bearing for all longitudinal claims about decade-long trends, as non-stratified sampling (e.g., via search API defaults or popularity signals) would over-represent recent or popular projects.

    Authors: We acknowledge this omission and agree that a clear description of the sampling process is essential for interpreting the longitudinal trends. In our study, repositories were sampled using the GitHub Search API with queries designed to retrieve repositories created in each year from 2014 to 2023. We stratified the sample by creation year, selecting approximately 1,000 repositories per year. Selection criteria included a minimum of 5 stars to focus on non-trivial projects and excluded forks to avoid duplication. We will revise Section 3 to include a full description of these criteria, the API queries used, and any limitations due to API rate limits or search result ordering. revision: yes

  2. Referee: The method used to reconstruct file presence and content evolution over ten years is unspecified. It is unclear whether dating relies on git commit timestamps, file metadata, or current repository state, and how deleted files, early forks, or incomplete histories are handled. Without this, the reported consolidation of README.md/.gitignore/LICENSE and shifts in configuration formats cannot be reliably attributed to temporal change rather than sampling artifacts.

    Authors: We thank the referee for pointing this out. The temporal analysis was performed by examining the git commit history for each repository. For every file type of interest, we identified the earliest commit that introduced the file using git log --follow. This allows us to attribute the introduction of files like README.md or GitHub Actions workflows to specific years. Repositories with incomplete histories (e.g., shallow clones) were re-cloned with full history where possible. Deleted files were excluded from the analysis as we focused on the presence and introduction of files rather than their removal. We will expand Section 4 to detail this reconstruction process, including how we handled edge cases such as renamed files and merge commits. revision: yes

  3. Referee: No sampling method, exclusion criteria, statistical tests, confidence intervals, or error analysis are reported for the observed trends. This leaves the central claims (e.g., rise of GitHub Actions, decline of XML) without visible evidential support for assessing statistical significance or generalizability.

    Authors: The study is primarily descriptive and exploratory, presenting observed proportions of repositories containing certain files or using specific technologies over time. No formal statistical hypothesis tests were conducted, as the goal was to identify broad trends rather than test specific hypotheses. However, we recognize the value of providing measures of uncertainty. In the revision, we will add to Section 5 (and the abstract if appropriate) a description of the sampling method and exclusion criteria, along with year-by-year sample sizes and, where relevant, simple confidence intervals for the reported percentages based on the binomial proportion. This will help readers assess the reliability of the trends. revision: yes
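
Response 3 mentions simple confidence intervals based on the binomial proportion without naming a formula. As one possible choice, the Python sketch below uses the Wilson score interval; that choice, the function name, and the example counts are assumptions and do not come from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical year bin: 620 of 1,000 sampled repositories contain the file.
low, high = wilson_interval(620, 1000)
```

With about 1,000 repositories per year bin, an interval like this spans a few percentage points, so small year-to-year differences in a trend line may not be distinguishable from sampling noise.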

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis

full rationale

This paper performs direct empirical observation of file contents, directories, extensions, and temporal presence across 10,000 GitHub repositories. No equations, fitted parameters, model predictions, or derivation steps exist. Claims about trends (README consolidation, GitHub Actions rise, config format shifts, Dockerfile growth, AGENTS.md emergence) are presented as direct counts and comparisons from the collected data, with no reduction to self-defined inputs or self-citation chains. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an observational empirical study; the central claims rest on sampling representativeness rather than mathematical axioms or new theoretical entities.

free parameters (1)
  • Repository sample size
    Chosen as a large yet tractable number for manual and automated inspection.
axioms (1)
  • domain assumption: The 10,000 repositories constitute a representative sample of GitHub projects.
    Required to generalize observed trends to the entire platform.
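
Whether the representativeness assumption holds depends on how the sample was drawn. The simulated rebuttal describes strata by creation year (2014 to 2023, roughly 1,000 repositories per year, at least 5 stars). A minimal Python sketch of how such strata might be expressed as GitHub Search API requests follows. It only builds URLs and sends nothing; the helper names are hypothetical, and fork exclusion and authentication are left out.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def yearly_queries(first_year: int, last_year: int, min_stars: int) -> dict[int, str]:
    """One search query per creation year, using the created: and stars: qualifiers."""
    return {
        year: f"created:{year}-01-01..{year}-12-31 stars:>={min_stars}"
        for year in range(first_year, last_year + 1)
    }


def page_urls(query: str, per_year: int, per_page: int = 100) -> list[str]:
    """Paged request URLs needed to retrieve `per_year` results for one query."""
    pages = -(-per_year // per_page)  # ceiling division
    return [
        f"{SEARCH_URL}?{urlencode({'q': query, 'per_page': per_page, 'page': page})}"
        for page in range(1, pages + 1)
    ]


queries = yearly_queries(2014, 2023, min_stars=5)
urls_2014 = page_urls(queries[2014], per_year=1000)
```

Each search query returns at most 1,000 results, ordered by the API's ranking, so a design like this takes the top-ranked slice of each year rather than a random draw. That is the kind of bias the referee's first comment is concerned with.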


