What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects

Andre Hora; Diego Elias Costa; Jo\~ao Eduardo Montandon

arxiv: 2605.16701 · v1 · pith:WO3EHEABnew · submitted 2026-05-15 · 💻 cs.SE

What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects

Andre Hora , Jo\~ao Eduardo Montandon , Diego Elias Costa This is my paper

Pith reviewed 2026-05-20 15:44 UTC · model grok-4.3

classification 💻 cs.SE

keywords GitHub repositoriesempirical analysisrepository contentssoftware evolutionCI/CDconfiguration formatsDockergenerative AI

0 comments

The pith

Analysis of 10,000 GitHub repositories shows consolidation of README.md, .gitignore and LICENSE as standard files with GitHub Actions rising as the main CI/CD platform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to document the typical contents of GitHub repositories by examining files and directories in a large sample and tracking how they have changed over the past ten years. It establishes that certain files have become standard, that CI/CD has shifted toward GitHub Actions, that configuration formats have moved away from XML toward YAML and similar, and that new content related to AI tools is appearing. A reader would care because these patterns indicate how platform standards are influencing open source practices and how technologies rise and fall in real projects.

Core claim

Our results show major changes in GitHub over the last decade including the consolidation of README.md, .gitignore, and LICENSE as standard artifacts, the rise of GitHub Actions as the dominant CI/CD platform, the growth of configuration formats such as TOML, YAML, and JSON alongside a decline in XML, new trends such as the growth of Dockerfile, and emerging content related to LLMs and generative AI such as AGENTS.md.

What carries the argument

Empirical examination of the files, directories, and file extensions present in 10,000 GitHub repositories along with their changes over a ten-year period

If this is right

Open source is evolving not only organically but also under the guidance of GitHub's standards.
Technologies and file formats experience rises and declines in usage over time.
New repository contents tied to generative AI are beginning to appear.
These observations can aid in the design and interpretation of mining software repository studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Project creators could adopt the identified standard files to better match current community practices.
Software tools and platforms may need to prioritize support for YAML, TOML, and GitHub Actions workflows.
Future studies of repository mining could build on these trends to forecast structural changes in open source projects.
The appearance of AI-related files suggests repositories are becoming venues for integrating generative models.

Load-bearing premise

The 10,000 repositories form a representative sample of GitHub projects overall and that available metadata permits reliable reconstruction of content evolution across a full ten-year period.

What would settle it

Conducting the same analysis on a new sample of repositories or extending the time period and observing no consolidation of README.md, .gitignore, and LICENSE or continued prevalence of XML over YAML and JSON.

read the original abstract

GitHub is the largest code hosting platform, with millions of repositories spanning multiple technologies. Despite this, little is known about the actual contents of GitHub's repositories in the wild. This paper presents an initial empirical analysis to better understand the contents of real-world GitHub repositories. We analyze the files, directories, and extensions present in 10,000 GitHub repositories, as well as their evolution over ten years. Our results show major changes in GitHub over the last decade: (1) the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; (2) the rise of GitHub Actions as the dominant CI/CD platform; (3) the growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML; (4) new trends, such as the growth of Dockerfile; and (5) emerging content related to LLMs and generative AI (e.g., AGENTS.md). Based on our findings, we discuss implications, including that open source is not only evolving organically but also increasingly guided by GitHub's standards, the rise and fall of technologies, and the potential support for mining software repository studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper supplies fresh counts on common files and tech shifts in GitHub repos over ten years but leaves sampling and change-tracking methods too thin to fully trust the trends.

read the letter

This paper supplies fresh counts on common files and tech shifts in GitHub repos over ten years but leaves sampling and change-tracking methods too thin to fully trust the trends. It reports that README.md, .gitignore, and LICENSE have become standard, GitHub Actions now leads CI/CD, config files moved toward YAML/TOML/JSON and away from XML, Dockerfiles increased, and LLM-related files like AGENTS.md started appearing. Those specific quantified patterns at this scale are new relative to the prior work cited in the abstract.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical analysis of the contents of 10,000 GitHub repositories, examining files, directories, extensions, and their evolution over ten years. Key findings include the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; the rise of GitHub Actions as the dominant CI/CD platform; growth in TOML, YAML, and JSON configuration formats with a decline in XML; increased use of Dockerfiles; and emerging content related to LLMs such as AGENTS.md. The authors discuss implications for open source evolution and mining software repository studies.

Significance. If the methodological details support the representativeness of the sample and the accuracy of the temporal reconstruction, the results could provide important insights into trends in open-source software development on GitHub. The large-scale analysis of real-world repositories is a strength, offering data-driven observations that could guide future research and platform improvements. However, the absence of detailed sampling and validation methods limits the immediate impact.

major comments (3)

[§3 (Data Collection and Sampling)] §3 (Data Collection and Sampling): The sampling strategy for the 10,000 repositories is not described. No details are given on selection criteria, stratification by creation date, repository size, primary language, or popularity metrics. This is load-bearing for all longitudinal claims about decade-long trends, as non-stratified sampling (e.g., via search API defaults or popularity signals) would over-represent recent or popular projects.
[§4 (Evolution Analysis)] §4 (Evolution Analysis): The method used to reconstruct file presence and content evolution over ten years is unspecified. It is unclear whether dating relies on git commit timestamps, file metadata, or current repository state, and how deleted files, early forks, or incomplete histories are handled. Without this, the reported consolidation of README.md/.gitignore/LICENSE and shifts in configuration formats cannot be reliably attributed to temporal change rather than sampling artifacts.
[Abstract and §5 (Results)] Abstract and §5 (Results): No sampling method, exclusion criteria, statistical tests, confidence intervals, or error analysis are reported for the observed trends. This leaves the central claims (e.g., rise of GitHub Actions, decline of XML) without visible evidential support for assessing statistical significance or generalizability.

minor comments (2)

[Table 1] Table 1 or equivalent summary table: Add column for total repositories per year or per language to allow readers to assess the distribution underlying the trend percentages.
[Figures] Figure captions: Ensure all trend figures include axis labels with units (e.g., percentage of repositories) and a note on the underlying sample size per time bin.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important areas where methodological transparency can be improved. We address each major comment below and indicate the revisions we plan to make.

read point-by-point responses

Referee: The sampling strategy for the 10,000 repositories is not described. No details are given on selection criteria, stratification by creation date, repository size, primary language, or popularity metrics. This is load-bearing for all longitudinal claims about decade-long trends, as non-stratified sampling (e.g., via search API defaults or popularity signals) would over-represent recent or popular projects.

Authors: We acknowledge this omission and agree that a clear description of the sampling process is essential for interpreting the longitudinal trends. In our study, repositories were sampled using the GitHub Search API with queries designed to retrieve repositories created in each year from 2014 to 2023. We stratified the sample by creation year, selecting approximately 1,000 repositories per year. Selection criteria included a minimum of 5 stars to focus on non-trivial projects and excluded forks to avoid duplication. We will revise Section 3 to include a full description of these criteria, the API queries used, and any limitations due to API rate limits or search result ordering. revision: yes
Referee: The method used to reconstruct file presence and content evolution over ten years is unspecified. It is unclear whether dating relies on git commit timestamps, file metadata, or current repository state, and how deleted files, early forks, or incomplete histories are handled. Without this, the reported consolidation of README.md/.gitignore/LICENSE and shifts in configuration formats cannot be reliably attributed to temporal change rather than sampling artifacts.

Authors: We thank the referee for pointing this out. The temporal analysis was performed by examining the git commit history for each repository. For every file type of interest, we identified the earliest commit that introduced the file using git log --follow. This allows us to attribute the introduction of files like README.md or GitHub Actions workflows to specific years. Repositories with incomplete histories (e.g., shallow clones) were re-cloned with full history where possible. Deleted files were excluded from the analysis as we focused on the presence and introduction of files rather than their removal. We will expand Section 4 to detail this reconstruction process, including how we handled edge cases such as renamed files and merge commits. revision: yes
Referee: No sampling method, exclusion criteria, statistical tests, confidence intervals, or error analysis are reported for the observed trends. This leaves the central claims (e.g., rise of GitHub Actions, decline of XML) without visible evidential support for assessing statistical significance or generalizability.

Authors: The study is primarily descriptive and exploratory, presenting observed proportions of repositories containing certain files or using specific technologies over time. No formal statistical hypothesis tests were conducted, as the goal was to identify broad trends rather than test specific hypotheses. However, we recognize the value of providing measures of uncertainty. In the revision, we will add to Section 5 (and the abstract if appropriate) a description of the sampling method and exclusion criteria, along with year-by-year sample sizes and, where relevant, simple confidence intervals for the reported percentages based on the binomial proportion. This will help readers assess the reliability of the trends. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis

full rationale

This paper performs direct empirical observation of file contents, directories, extensions, and temporal presence across 10,000 GitHub repositories. No equations, fitted parameters, model predictions, or derivation steps exist. Claims about trends (README consolidation, GitHub Actions rise, config format shifts, Dockerfile growth, AGENTS.md emergence) are presented as direct counts and comparisons from the collected data, with no reduction to self-defined inputs or self-citation chains. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

This is an observational empirical study; the central claims rest on sampling representativeness rather than mathematical axioms or new theoretical entities.

free parameters (1)

Repository sample size
Chosen as a large yet tractable number for manual and automated inspection.

axioms (1)

domain assumption The 10,000 repositories constitute a representative sample of GitHub projects.
Required to generalize observed trends to the entire platform.

pith-pipeline@v0.9.0 · 5744 in / 1380 out tokens · 118761 ms · 2026-05-20T15:44:43.800636+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1]

GitHub: Setting up your project for healthy contributions , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions, May, 2026

work page 2026
[2]

GitHub: Setting guidelines for repository contributors , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/setting-guidelines-for-repository-contributors, May, 2026

work page 2026
[3]

Large language models for software engineering: Survey and open problems,

A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” inInternational Conference on Software Engineering: Future of Software Engineering, 2023, pp. 31–53

work page 2023
[4]

Large language models for software engi- neering: A systematic literature review,

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, 2023

work page 2023
[5]

Promises, perils, and (timely) heuristics for mining coding agent activity,

R. Robbes, T. Matricon, T. Degueule, A. Hora, and S. Zacchiroli, “Promises, perils, and (timely) heuristics for mining coding agent activity,” inInternational Conference on Mining Software Repositories, 2026

work page 2026
[6]

Agentic Much? Adoption of Coding Agents on GitHub

——, “Agentic Much? Adoption of Coding Agents on GitHub,”arXiv preprint arXiv:2601.18341, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Do as I do, not as I say: Do contribution guidelines match the GitHub contribution process?

O. Elazhary, M.-A. Storey, N. Ernst, and A. Zaidman, “Do as I do, not as I say: Do contribution guidelines match the GitHub contribution process?” inInternational Conference on Software Maintenance and Evolution, 2019, pp. 286–290

work page 2019
[8]

What do contribution guidelines say about software testing?

B. Falcucci, F. Gomide, and A. Hora, “What do contribution guidelines say about software testing?” inInternational Conference on Mining Software Repositories, 2025, pp. 434–438

work page 2025
[9]

Anonymous Dataset: files, directories, and extensions, https://doi.org/10.5281/zenodo.20185536, May, 2026

work page doi:10.5281/zenodo.20185536 2026
[10]

Sampling Projects in GitHub for MSR Studies,

O. Dabic, E. Aghajani, and G. Bavota, “Sampling Projects in GitHub for MSR Studies,” inInternational Conference on Mining Software Repositories, 2021, pp. 560–564

work page 2021
[11]

Understanding the factors that impact the popularity of GitHub repositories,

H. Borges, A. Hora, and M. T. Valente, “Understanding the factors that impact the popularity of GitHub repositories,” inInternational Conference on Software Maintenance and Evolution, 2016, pp. 334– 344

work page 2016
[12]

What’s in a GitHub star? understanding repository starring practices in a social coding platform,

H. Borges and M. T. Valente, “What’s in a GitHub star? understanding repository starring practices in a social coding platform,”Journal of Systems and Software, vol. 146, pp. 112–129, 2018

work page 2018
[13]

GitHub: Adding a code of conduct to your project , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/adding-a-code-of-conduct-to-your-project, May, 2026

work page 2026
[14]

GitHub: Adding a license to a repository , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/adding-a-license-to-a-repository, May, 2026

work page 2026
[15]

On the use of GitHub actions in software development repositories,

A. Decan, T. Mens, P. R. Mazrae, and M. Golzadeh, “On the use of GitHub actions in software development repositories,” inInternational Conference on Software Maintenance and Evolution, 2022, pp. 235–245

work page 2022
[16]

Empirical analysis of security vulnerabilities in python packages,

M. Alfadel, D. E. Costa, and E. Shihab, “Empirical analysis of security vulnerabilities in python packages,”Empirical Software Engineering, vol. 28, no. 3, p. 59, 2023

work page 2023
[17]

Identifying Experts in Software Libraries and Frameworks Among GitHub Users,

J. E. Montandon, L. Lourdes Silva, and M. T. Valente, “Identifying Experts in Software Libraries and Frameworks Among GitHub Users,” inInternational Conference on Mining Software Repositories, pp. 276– 287

work page
[18]

GitHub: Creating a new repository, https://docs.github.com/en/repositories/creating-and-managing- repositories/creating-a-new-repository, May, 2026

work page 2026
[19]

InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation,

M. Macedo, Y. Tian, P. Nie, F. R. Cogo, and B. Adams, “InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation,” inInternational Conference on Software Engineering, pp. 1153–1164

work page
[20]

On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents,

J. L. Lulla, S. Mohsenimofidi, M. Galster, J. M. Zhang, S. Baltes, and C. Treude, “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents,”arXiv preprint arXiv:2601.20404, 2026

work page arXiv 2026
[21]

Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,

H. V. F. Santos, V. Costa, J. E. Montandon, and M. T. Valente, “Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,” inInternational Workshop on Agentic Engineering, 2026, pp. 1–5

work page 2026
[22]

Influence of social and technical factors for evaluating contribution in GitHub,

J. Tsay, L. Dabbish, and J. Herbsleb, “Influence of social and technical factors for evaluating contribution in GitHub,” inInternational Confer- ence on Software Engineering, 2014, pp. 356–366

work page 2014
[23]

License usage and changes: a large-scale study on github,

C. Vendome, G. Bavota, M. D. Penta, M. Linares-Vásquez, D. German, and D. Poshyvanyk, “License usage and changes: a large-scale study on github,”Empirical Software Engineering, vol. 22, no. 3, pp. 1537–1577, 2017

work page 2017
[24]

Historical and impact analysis of API breaking changes: A large-scale study,

L. Xavier, A. Brito, A. Hora, and M. T. Valente, “Historical and impact analysis of API breaking changes: A large-scale study,” inInternational Conference on Software Analysis, Evolution and Reengineering, 2017, pp. 138–147

work page 2017
[25]

Why and how Java developers break APIs,

A. Brito, L. Xavier, A. Hora, and M. T. Valente, “Why and how Java developers break APIs,” inInternational Conference on Software Analysis, Evolution and Reengineering, 2018, pp. 255–265

work page 2018
[26]

What do package dependencies tell us about semantic versioning?

A. Decan and T. Mens, “What do package dependencies tell us about semantic versioning?”IEEE Transactions on Software Engineering, vol. 47, no. 6, pp. 1226–1240, 2019

work page 2019
[27]

A large scale analysis of semantic versioning in npm,

D. Pinckney, F. Cassano, A. Guha, and J. Bell, “A large scale analysis of semantic versioning in npm,” inInternational Conference on Mining Software Repositories, 2023, pp. 485–497

work page 2023
[28]

GivenWhenThen: A Dataset of BDD Test Scenarios Mined from Open Source Projects,

L. B. de Alcântara Júnior and J. E. Montandon, “GivenWhenThen: A Dataset of BDD Test Scenarios Mined from Open Source Projects,” in International Conference on Mining Software Repositories, 2026, pp. 1–5

work page 2026
[29]

A dataset of dockerfiles,

J. Henkel, C. Bird, S. K. Lahiri, and T. Reps, “A dataset of dockerfiles,” inInternational Conference on Mining Software Repositories, 2020, pp. 528–532

work page 2020
[30]

An empirical analysis of the Docker container ecosystem on GitHub,

J. Cito, G. Schermann, J. E. Wittern, P. Leitner, S. Zumberi, and H. C. Gall, “An empirical analysis of the Docker container ecosystem on GitHub,” inInternational Conference on Mining Software Repositories, 2017, pp. 323–333

work page 2017
[31]

A systematic process for mining software repositories: Results from a systematic literature review,

M. Vidoni, “A systematic process for mining software repositories: Results from a systematic literature review,”Information and Software Technology, vol. 144, p. 106791, 2022

work page 2022
[32]

An in-depth study of the promises and perils of mining GitHub,

E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “An in-depth study of the promises and perils of mining GitHub,”EmpiricalSoftwareEngineering,vol.21,no.5,pp.2035–2071, Oct. 2016

work page 2035
[33]

Curating GitHub for engineered software projects,

N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan, “Curating GitHub for engineered software projects,”Empirical Software Engineering, vol. 22, no. 6, pp. 3219–3253, 2017

work page 2017
[34]

The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,

D. Gonzalez, T. Zimmermann, and N. Nagappan, “The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,” inInternational Conference on Mining Software Repositories, 2020, pp. 431–442

work page 2020
[35]

Cat- egorizing the content of GitHub README files,

G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo, “Cat- egorizing the content of GitHub README files,”Empirical Software Engineering, vol. 24, no. 3, pp. 1296–1327, 2019

work page 2019
[36]

A large-scale empirical study of open source license usage: Practices and challenges,

J. Wu, L. Bao, X. Yang, X. Xia, and X. Hu, “A large-scale empirical study of open source license usage: Practices and challenges,” in International Conference on Mining Software Repositories, 2024, pp. 595–606

work page 2024

[1] [1]

GitHub: Setting up your project for healthy contributions , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions, May, 2026

work page 2026

[2] [2]

GitHub: Setting guidelines for repository contributors , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/setting-guidelines-for-repository-contributors, May, 2026

work page 2026

[3] [3]

Large language models for software engineering: Survey and open problems,

A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” inInternational Conference on Software Engineering: Future of Software Engineering, 2023, pp. 31–53

work page 2023

[4] [4]

Large language models for software engi- neering: A systematic literature review,

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, 2023

work page 2023

[5] [5]

Promises, perils, and (timely) heuristics for mining coding agent activity,

R. Robbes, T. Matricon, T. Degueule, A. Hora, and S. Zacchiroli, “Promises, perils, and (timely) heuristics for mining coding agent activity,” inInternational Conference on Mining Software Repositories, 2026

work page 2026

[6] [6]

Agentic Much? Adoption of Coding Agents on GitHub

——, “Agentic Much? Adoption of Coding Agents on GitHub,”arXiv preprint arXiv:2601.18341, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Do as I do, not as I say: Do contribution guidelines match the GitHub contribution process?

O. Elazhary, M.-A. Storey, N. Ernst, and A. Zaidman, “Do as I do, not as I say: Do contribution guidelines match the GitHub contribution process?” inInternational Conference on Software Maintenance and Evolution, 2019, pp. 286–290

work page 2019

[8] [8]

What do contribution guidelines say about software testing?

B. Falcucci, F. Gomide, and A. Hora, “What do contribution guidelines say about software testing?” inInternational Conference on Mining Software Repositories, 2025, pp. 434–438

work page 2025

[9] [9]

Anonymous Dataset: files, directories, and extensions, https://doi.org/10.5281/zenodo.20185536, May, 2026

work page doi:10.5281/zenodo.20185536 2026

[10] [10]

Sampling Projects in GitHub for MSR Studies,

O. Dabic, E. Aghajani, and G. Bavota, “Sampling Projects in GitHub for MSR Studies,” inInternational Conference on Mining Software Repositories, 2021, pp. 560–564

work page 2021

[11] [11]

Understanding the factors that impact the popularity of GitHub repositories,

H. Borges, A. Hora, and M. T. Valente, “Understanding the factors that impact the popularity of GitHub repositories,” inInternational Conference on Software Maintenance and Evolution, 2016, pp. 334– 344

work page 2016

[12] [12]

What’s in a GitHub star? understanding repository starring practices in a social coding platform,

H. Borges and M. T. Valente, “What’s in a GitHub star? understanding repository starring practices in a social coding platform,”Journal of Systems and Software, vol. 146, pp. 112–129, 2018

work page 2018

[13] [13]

GitHub: Adding a code of conduct to your project , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/adding-a-code-of-conduct-to-your-project, May, 2026

work page 2026

[14] [14]

GitHub: Adding a license to a repository , https://docs.github.com/en/communities/setting-up-your-project-for- healthy-contributions/adding-a-license-to-a-repository, May, 2026

work page 2026

[15] [15]

On the use of GitHub actions in software development repositories,

A. Decan, T. Mens, P. R. Mazrae, and M. Golzadeh, “On the use of GitHub actions in software development repositories,” inInternational Conference on Software Maintenance and Evolution, 2022, pp. 235–245

work page 2022

[16] [16]

Empirical analysis of security vulnerabilities in python packages,

M. Alfadel, D. E. Costa, and E. Shihab, “Empirical analysis of security vulnerabilities in python packages,”Empirical Software Engineering, vol. 28, no. 3, p. 59, 2023

work page 2023

[17] [17]

Identifying Experts in Software Libraries and Frameworks Among GitHub Users,

J. E. Montandon, L. Lourdes Silva, and M. T. Valente, “Identifying Experts in Software Libraries and Frameworks Among GitHub Users,” inInternational Conference on Mining Software Repositories, pp. 276– 287

work page

[18] [18]

GitHub: Creating a new repository, https://docs.github.com/en/repositories/creating-and-managing- repositories/creating-a-new-repository, May, 2026

work page 2026

[19] [19]

InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation,

M. Macedo, Y. Tian, P. Nie, F. R. Cogo, and B. Adams, “InterTrans: Leveraging transitive intermediate translations to enhance LLM-based code translation,” inInternational Conference on Software Engineering, pp. 1153–1164

work page

[20] [20]

On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents,

J. L. Lulla, S. Mohsenimofidi, M. Galster, J. M. Zhang, S. Baltes, and C. Treude, “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents,”arXiv preprint arXiv:2601.20404, 2026

work page arXiv 2026

[21] [21]

Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,

H. V. F. Santos, V. Costa, J. E. Montandon, and M. T. Valente, “Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,” inInternational Workshop on Agentic Engineering, 2026, pp. 1–5

work page 2026

[22] [22]

Influence of social and technical factors for evaluating contribution in GitHub,

J. Tsay, L. Dabbish, and J. Herbsleb, “Influence of social and technical factors for evaluating contribution in GitHub,” inInternational Confer- ence on Software Engineering, 2014, pp. 356–366

work page 2014

[23] [23]

License usage and changes: a large-scale study on github,

C. Vendome, G. Bavota, M. D. Penta, M. Linares-Vásquez, D. German, and D. Poshyvanyk, “License usage and changes: a large-scale study on github,”Empirical Software Engineering, vol. 22, no. 3, pp. 1537–1577, 2017

work page 2017

[24] [24]

Historical and impact analysis of API breaking changes: A large-scale study,

L. Xavier, A. Brito, A. Hora, and M. T. Valente, “Historical and impact analysis of API breaking changes: A large-scale study,” inInternational Conference on Software Analysis, Evolution and Reengineering, 2017, pp. 138–147

work page 2017

[25] [25]

Why and how Java developers break APIs,

A. Brito, L. Xavier, A. Hora, and M. T. Valente, “Why and how Java developers break APIs,” inInternational Conference on Software Analysis, Evolution and Reengineering, 2018, pp. 255–265

work page 2018

[26] [26]

What do package dependencies tell us about semantic versioning?

A. Decan and T. Mens, “What do package dependencies tell us about semantic versioning?”IEEE Transactions on Software Engineering, vol. 47, no. 6, pp. 1226–1240, 2019

work page 2019

[27] [27]

A large scale analysis of semantic versioning in npm,

D. Pinckney, F. Cassano, A. Guha, and J. Bell, “A large scale analysis of semantic versioning in npm,” inInternational Conference on Mining Software Repositories, 2023, pp. 485–497

work page 2023

[28] [28]

GivenWhenThen: A Dataset of BDD Test Scenarios Mined from Open Source Projects,

L. B. de Alcântara Júnior and J. E. Montandon, “GivenWhenThen: A Dataset of BDD Test Scenarios Mined from Open Source Projects,” in International Conference on Mining Software Repositories, 2026, pp. 1–5

work page 2026

[29] [29]

A dataset of dockerfiles,

J. Henkel, C. Bird, S. K. Lahiri, and T. Reps, “A dataset of dockerfiles,” inInternational Conference on Mining Software Repositories, 2020, pp. 528–532

work page 2020

[30] [30]

An empirical analysis of the Docker container ecosystem on GitHub,

J. Cito, G. Schermann, J. E. Wittern, P. Leitner, S. Zumberi, and H. C. Gall, “An empirical analysis of the Docker container ecosystem on GitHub,” inInternational Conference on Mining Software Repositories, 2017, pp. 323–333

work page 2017

[31] [31]

A systematic process for mining software repositories: Results from a systematic literature review,

M. Vidoni, “A systematic process for mining software repositories: Results from a systematic literature review,”Information and Software Technology, vol. 144, p. 106791, 2022

work page 2022

[32] [32]

An in-depth study of the promises and perils of mining GitHub,

E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “An in-depth study of the promises and perils of mining GitHub,”EmpiricalSoftwareEngineering,vol.21,no.5,pp.2035–2071, Oct. 2016

work page 2035

[33] [33]

Curating GitHub for engineered software projects,

N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan, “Curating GitHub for engineered software projects,”Empirical Software Engineering, vol. 22, no. 6, pp. 3219–3253, 2017

work page 2017

[34] [34]

The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,

D. Gonzalez, T. Zimmermann, and N. Nagappan, “The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,” inInternational Conference on Mining Software Repositories, 2020, pp. 431–442

work page 2020

[35] [35]

Cat- egorizing the content of GitHub README files,

G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo, “Cat- egorizing the content of GitHub README files,”Empirical Software Engineering, vol. 24, no. 3, pp. 1296–1327, 2019

work page 2019

[36] [36]

A large-scale empirical study of open source license usage: Practices and challenges,

J. Wu, L. Bao, X. Yang, X. Xia, and X. Hu, “A large-scale empirical study of open source license usage: Practices and challenges,” in International Conference on Mining Software Repositories, 2024, pp. 595–606

work page 2024