pith. machine review for the scientific record.

arxiv: 2605.08435 · v1 · submitted 2026-05-08 · 💻 cs.SE


A Dataset of Agentic AI Coding Tool Configurations


Pith reviewed 2026-05-12 01:26 UTC · model grok-4.3

classification 💻 cs.SE
keywords: agentic AI · coding tools · configuration artifacts · dataset · GitHub · context engineering · AI adoption · human-AI collaboration

The pith

This paper presents a large public dataset of configuration artifacts for agentic AI coding tools collected from thousands of open-source repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fill a gap in understanding how developers steer agentic AI coding tools like Claude Code and OpenAI Codex through configuration files such as context files, skills, rules, and hooks. The authors collected metadata from over 40,000 repositories, filtered them down to engineered software projects, and ultimately extracted configurations from 4,738 repositories. A sympathetic reader would care because this enables new research into context engineering and human-AI collaboration patterns without having to build such a collection from scratch. The dataset includes full file contents and the associated AI-co-authored commits, making it immediately usable for analysis.
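How a commit counts as AI-co-authored is not spelled out in this review; one widely used convention is the Co-authored-by: trailer that tools like Claude Code append to commit messages. A minimal detection sketch under that assumption (the trailer patterns below are illustrative, not the paper's actual rules):

```python
import re
import subprocess

# Illustrative trailer patterns for agentic tools; the paper's actual
# attribution rules are not specified in this review.
AI_COAUTHOR_PATTERNS = [
    r"co-authored-by:.*claude",          # Claude Code
    r"co-authored-by:.*copilot",         # GitHub Copilot
    r"co-authored-by:.*(openai|codex)",  # OpenAI Codex
    r"co-authored-by:.*cursor",          # Cursor
    r"co-authored-by:.*gemini",          # Gemini
]

def ai_coauthored_commits(repo_path: str) -> list[str]:
    """Return hashes of commits whose messages carry an AI co-author trailer."""
    # %x1f / %x1e emit field and record separators so multi-line bodies parse safely.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x1f%B%x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for record in log.split("\x1e"):
        sha, _, body = record.strip().partition("\x1f")
        if sha and any(re.search(p, body, re.IGNORECASE) for p in AI_COAUTHOR_PATTERNS):
            hits.append(sha)
    return hits
```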

Core claim

The authors have systematically identified and compiled 15,591 configuration artifacts along with the full content of 18,167 associated configuration files and 148,519 AI-co-authored commits from 4,738 open-source repositories using a pipeline of metadata filtering, GPT-based classification, and automated detection of configuration mechanisms across five AI coding tools.

What carries the argument

The systematic detection of repository-level configuration artifacts (Context Files, Skills, Rules, and Hooks) in actively maintained GitHub repositories, after filtering and classifying projects with GPT-5.2.
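Each tool documents where it reads repository-level configuration (for example, CLAUDE.md for Claude Code and AGENTS.md for OpenAI Codex). A minimal sketch of file-name-based detection built on those public conventions; the paper's detector may additionally use content signatures, so treat the pattern table as an assumption:

```python
from pathlib import Path

# Repository-relative locations where each tool reads its configuration,
# following the tools' public documentation; not the paper's exact detector.
CONFIG_PATTERNS = {
    "Claude Code":    ["CLAUDE.md", ".claude/settings.json", ".claude/skills/**/SKILL.md"],
    "GitHub Copilot": [".github/copilot-instructions.md", ".github/instructions/*.instructions.md"],
    "OpenAI Codex":   ["AGENTS.md"],
    "Cursor":         [".cursorrules", ".cursor/rules/*.mdc"],
    "Gemini":         ["GEMINI.md"],
}

def detect_artifacts(repo_root: str) -> dict[str, list[str]]:
    """Map each tool to the configuration files found under repo_root."""
    root = Path(repo_root)
    found: dict[str, list[str]] = {}
    for tool, patterns in CONFIG_PATTERNS.items():
        matches = [str(p.relative_to(root))
                   for pattern in patterns
                   for p in root.glob(pattern)
                   if p.is_file()]
        if matches:
            found[tool] = sorted(matches)
    return found
```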

If this is right

  • Researchers can study adoption patterns of different AI coding tools across software projects.
  • The data supports analysis of context engineering practices for multi-step coding tasks.
  • Insights into human-AI collaboration can be drawn from the co-authored commits linked to these configurations.
  • The public availability allows replication and extension of studies on AI tool usage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might use the dataset to discover effective configuration strategies that improve AI tool performance.
  • Tool builders could analyze common patterns to design better default configurations or interfaces.
  • Future work might link specific configurations to code quality outcomes in the associated commits.

Load-bearing premise

That the combination of metadata filtering and GPT-5.2 classification reliably selects only engineered software projects and that the detection of configuration artifacts accurately captures them without major omissions or errors.

What would settle it

An independent manual review of a random sample of the 36,710 classified repositories that finds a substantial portion are not engineered software projects or that many configuration files were missed in the detection process.
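Such a review starts from a reproducible sample. A minimal stratified-sampling sketch, assuming each repository is a dict with a stratum field such as its primary language (the field name is hypothetical):

```python
import random

def stratified_sample(repos: list[dict], stratum_key: str,
                      per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw an equal-size random sample from each stratum for manual review."""
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    by_stratum: dict[str, list[dict]] = {}
    for repo in repos:
        by_stratum.setdefault(repo[stratum_key], []).append(repo)
    sample: list[dict] = []
    for _, members in sorted(by_stratum.items()):
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# e.g. stratified_sample(classified_repos, "language", per_stratum=10)
```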

Figures

Figures reproduced from arXiv: 2605.08435 by Christoph Treude, Jai Lal Lulla, Levi Böhme, Matthias Galster, Muhammad Auwal Abubakar, Sebastian Baltes, Seyedmoein Mohsenimofidi.

Figure 1. Number of repositories with detected configuration …
Figure 2. Number of configuration artifacts by type. For each …
Figure 3. Configuration mechanisms per tool. Each cell shows …
Figure 4. Co-occurrence of configuration mechanisms across repositories. The bar chart shows how many repositories use each …
Figure 5. Three-stage data collection pipeline: repository sampling from the SEART GitHub Search dataset, LLM-based classification …
Figure 6. Cumulative adoption of selected AI configuration artifacts over time. Each curve shows the cumulative number of …
Figure 7. Reference network between context file types. Each …
Figure 8. Monthly volume of AI-co-authored commits across the 4,738 repositories with detected configuration artifacts.
Original abstract

Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration artifacts, the full content of 18,167 configuration files associated with these configuration artifacts, and 148,519 AI-co-authored commits. The dataset and the construction pipeline are publicly available on Zenodo under CC BY 4.0. An interactive website allows researchers to browse and explore the data. This data supports research on context engineering, AI tool adoption patterns, and human-AI collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a dataset of configuration artifacts for agentic AI coding tools collected from open-source GitHub repositories. It describes selecting 40,585 actively maintained repositories via metadata filtering, using GPT-5.2 to classify 36,710 as engineered software projects, systematically detecting artifacts across five tools and eight mechanisms in 4,738 repositories, and releasing 15,591 artifacts, 18,167 associated configuration files, and 148,519 AI-co-authored commits. The full dataset, construction pipeline, and an interactive exploration website are made publicly available on Zenodo under CC BY 4.0.

Significance. If the collection pipeline is reliable, this would be a valuable contribution as the first large-scale curated dataset of repository-level configurations for AI coding assistants. It directly enables empirical research on context engineering, tool adoption patterns, and human-AI collaboration in software development. The public release, reproducibility of the pipeline, and interactive website are explicit strengths that lower barriers for follow-on work.

major comments (2)
  1. [§3] §3 (Data Collection Pipeline): The headline counts (40,585 repositories filtered, 36,710 labeled engineered, 4,738 containing artifacts, 15,591 artifacts collected) rest on two automated steps—metadata filtering plus GPT-5.2 binary classification and heuristic-based detection of the eight configuration mechanisms—yet no precision, recall, error rates, or manual validation results are reported for either step. This is load-bearing for the central claim that the released dataset accurately represents configurations in engineered projects.
  2. [§4] §4 (Artifact Detection and Collection): The systematic detection logic (file-name patterns, directory heuristics, content signatures) is described at a high level but is not accompanied by any audit, inter-annotator agreement, or false-positive/false-negative estimates. Without these, it is impossible to assess whether the 15,591 artifacts and 18,167 files materially over- or under-count the true population.
minor comments (2)
  1. [§3.2] Clarify the exact GPT model version and prompting strategy used for classification; 'GPT-5.2' is non-standard and should be documented with the precise prompt template and temperature settings.
  2. The interactive website is mentioned but its features and data export options are not described in the text; adding a short subsection or figure would improve usability for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative validation of our automated pipeline steps. We agree these metrics are important to support the dataset's claims and will add them in the revision. We address each major comment below.

Point-by-point responses
  1. Referee: [§3] §3 (Data Collection Pipeline): The headline counts (40,585 repositories filtered, 36,710 labeled engineered, 4,738 containing artifacts, 15,591 artifacts collected) rest on two automated steps—metadata filtering plus GPT-5.2 binary classification and heuristic-based detection of the eight configuration mechanisms—yet no precision, recall, error rates, or manual validation results are reported for either step. This is load-bearing for the central claim that the released dataset accurately represents configurations in engineered projects.

    Authors: We agree that validation metrics are necessary to substantiate the headline counts. The metadata filter applied established criteria (activity, size, language) drawn from prior repository-mining literature, and the GPT-5.2 prompt was engineered with explicit definitions of engineered software projects. At this scale, exhaustive manual review was not initially performed. In the revised manuscript we will add a validation section reporting precision, recall, and F1-score from a manual audit of a stratified random sample of 150 repositories for the classification step, together with an error analysis of common misclassifications; a minimal sketch of these metrics follows the responses. Revision: yes.

  2. Referee: [§4] §4 (Artifact Detection and Collection): The systematic detection logic (file-name patterns, directory heuristics, content signatures) is described at a high level but is not accompanied by any audit, inter-annotator agreement, or false-positive/false-negative estimates. Without these, it is impossible to assess whether the 15,591 artifacts and 18,167 files materially over- or under-count the true population.

    Authors: We concur that false-positive and false-negative estimates are required to evaluate detection quality. The heuristics were derived from official tool documentation and preliminary manual inspection. The revised manuscript will include a new validation subsection describing a manual audit of 120 stratified repositories, with reported false-positive and false-negative rates for the overall detection process and inter-annotator agreement statistics for the manual labels; a sketch of one common agreement statistic follows below. The detection scripts and validation annotations will be released to enable community verification. Revision: yes.
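For reference, the precision, recall, and F1 promised in response 1 reduce to simple counts over the audited sample. A minimal sketch, assuming each audited repository yields a (predicted, true) label pair:

```python
def classification_metrics(audit: list[tuple[bool, bool]]) -> dict[str, float]:
    """Precision/recall/F1 over (predicted_engineered, truly_engineered) pairs,
    one per manually reviewed repository."""
    tp = sum(1 for pred, true in audit if pred and true)
    fp = sum(1 for pred, true in audit if pred and not true)
    fn = sum(1 for pred, true in audit if not pred and true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```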
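The inter-annotator agreement promised in response 2 is commonly reported as Cohen's kappa; the choice of statistic here is an assumption, since the authors do not name one. A minimal sketch for two annotators' binary labels:

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two annotators' binary labels on the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n  # per-annotator positive rates
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)     # chance agreement
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```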

Circularity Check

0 steps flagged

No circularity: dataset construction is empirical collection from external GitHub sources

Full rationale

The paper presents a data collection pipeline that selects repositories via metadata filtering, applies GPT-5.2 classification to label engineered projects, detects configuration artifacts via systematic search, and reports resulting counts (40,585 repositories filtered, 36,710 labeled, 4,738 with artifacts, 15,591 artifacts collected). No equations, predictions, fitted parameters, or derivations are claimed. The numbers are direct outputs of the described process applied to external GitHub data, not reductions of inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained as a dataset release effort with no mathematical chain that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about repository metadata and automated classification reliability rather than new entities or many fitted parameters.

axioms (2)
  • domain assumption GitHub metadata filtering can select actively maintained repositories suitable for analysis.
    Used to select the initial 40,585 repositories.
  • domain assumption GPT-5.2 classification accurately identifies engineered software projects.
    Used to narrow to 36,710 projects before artifact detection.

pith-pipeline@v0.9.0 · 5544 in / 1216 out tokens · 63439 ms · 2026-05-12T01:26:32.200363+00:00 · methodology



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    Sebastian Baltes, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Matthias Galster. 2026. A Dataset of Agentic AI Coding Tool Configurations. doi:10.5281/zenodo.19375880

  2. [2]

    Sebastian Baltes, Seyedmoein Mohsenimofidi, Levi Böhme, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Matthias Galster. 2026. A Dataset of Agentic AI Coding Tool Configurations (Pipeline). doi:10.5281/zenodo.19375429

  3. [3]

    Worawalan Chatlatanagulchai, Hao Li, Yutaro Kashiwa, Brittany Reid, Kundjanasith Thonglek, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Bram Adams, Ahmed E. Hassan, and Hajimu Iida. 2025. Agent READMEs: An Empirical Study of Context Files for Agentic Coding. arXiv:2511.12884 [cs.SE] doi:10.48550/arXiv.2511.12884

  4. [4]

    Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling Projects in GitHub for MSR Studies. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021, Madrid, Spain, May 17-19, 2021. IEEE, Madrid, Spain, 560–564. doi:10.1109/MSR52588.2021.00074

  5. [5]

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-Collaboration Code Generation via ChatGPT. ACM Trans. Softw. Eng. Methodol. 33, 7 (2024), 189:1–189:38. doi:10.1145/3672459

  6. [6]

    Matthias Galster, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes. 2026. Configuring Agentic AI Coding Tools: An Exploratory Study. arXiv:2602.14690 [cs.SE] doi:10.48550/arXiv.2602.14690. To appear at the 3rd ACM International Conference on AI-powered Software (AIware 2026).

  7. [7]

    Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap. arXiv:2509.06216 [cs.SE] doi:10.48550/arXiv.2509.06216

  8. [8]

    Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2026. Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects. arXiv:2511.04427 [cs.SE] doi:10.48550/arXiv.2511.04427. To appear at the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026).

  9. [9]

    Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, and Ahmed E. Hassan. 2025. Agentic Refactoring: An Empirical Study of AI Coding Agents. arXiv:2511.04824 [cs.SE] doi:10.48550/arXiv.2511.04824

  10. [10]

    Shaokang Jiang and Daye Nam. 2026. Beyond the Prompt: An Empirical Study of Cursor Rules. arXiv:2512.18925 [cs.SE] doi:10.48550/arXiv.2512.18925. To appear at the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026).

  11. [11]

    Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. Germán, and Daniela E. Damian. 2014. The promises and perils of mining GitHub. In 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, Premkumar T. Devanbu, Sung Kim, and Martin Pinzger (Eds.). ACM, Hyderabad, Ind...

  12. [12]

    Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2026. AIDev: Studying AI Coding Agents on GitHub. arXiv:2602.09185 [cs.SE] doi:10.48550/arXiv.2602.09185

  13. [13]

    Seyedmoein Mohsenimofidi, Matthias Galster, Christoph Treude, and Sebastian Baltes. 2026. Context Engineering for AI Agents in Open-Source Software. arXiv:2510.21413 [cs.SE] doi:10.48550/arXiv.2510.21413. To appear at the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026).

  14. [14]

    Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects. Empir. Softw. Eng. 22, 6 (2017), 3219–3253. doi:10.1007/S10664-017-9512-6

  15. [15]

    Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung, Thushari Atapattu, and David Lo. 2019. Categorizing the Content of GitHub README Files. Empir. Softw. Eng. 24, 3 (2019), 1296–1327. doi:10.1007/S10664-018-9660-3

  16. [16]

    Romain Robbes, Théo Matricon, Thomas Degueule, André C. Hora, and Stefano Zacchiroli. 2026. Agentic Much? Adoption of Coding Agents on GitHub. arXiv:2601.18341 [cs.SE] doi:10.48550/arXiv.2601.18341

  17. [17]

    Ranjan Sapkota, Konstantinos I. Roumeliotis, and Manoj Karkee. 2025. AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. arXiv:2505.10468 [cs.AI] doi:10.48550/arXiv.2505.10468

  18. [18]

    Stack Exchange Inc. 2025. Stack Overflow Developer Survey 2025: AI Agent out-of-the-box tools. https://survey.stackoverflow.co/2025/ai/#3-ai-agent-out-of-the-box-tools

  19. [19]

    Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E. Hassan. 2025. On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub. arXiv:2509.14745 [cs.SE] doi:10.48550/arXiv.2509.14745

  20. [20]

    Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, and Sebastian Baltes. 2025. Self-Admitted GenAI Usage in Open-Source Software. arXiv:2507.10422 [cs.SE]

  21. [21]

    Yangtian Zi, Zixuan Wu, Aleksander Boruch-Gruszecki, Jonathan Bell, and Arjun Guha. 2025. AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans. arXiv:2509.21891 [cs.SE]