pith. sign in

arxiv: 2607.00911 · v1 · pith:XPK23WBMnew · submitted 2026-07-01 · 💻 cs.SE

From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained

Pith reviewed 2026-07-02 08:26 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI agent skillssoftware reusemaintenanceempirical studyGitHub repositoriesskill registriescustomisationSKILL.md
0
0 comments X

The pith

AI agent skills are mostly copied once from registries into personal repositories and then left largely unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper mines thousands of skills from a public registry and from GitHub repositories to trace how they move from shared sources into individual projects. It finds that reuse consists mainly of a single copy step, after which more than half the skills receive no further edits and any changes that do occur add local details or domain knowledge without touching the core interaction rules. This view treats skills as ordinary software artefacts rather than fixed agent capabilities, so the results matter for deciding where maintenance effort and tooling improvements should focus. The analysis covers both the content categories of skills and the specific kinds of modifications developers make after adoption.

Core claim

By identifying 3,709 reuse links across 18,463 registry skills and 23,199 GitHub skills, the study shows that reuse is largely a one-time copy operation: most reused skills remain near-verbatim, 53% are never modified after adoption, and subsequent local maintenance is overwhelmingly additive. Customisation primarily adapts skills to local environments, whereas evolution accretes new inline domain knowledge. Across both, a stable behavioural contract remains almost untouched.

What carries the argument

Reuse links between registry skills and GitHub skills, classified by LLM into SWEBOK areas and coded thematically for modification themes.

If this is right

  • Maintenance effort should concentrate on project-specific bindings instead of rewriting the shared skill body.
  • Registries and supporting tools should help consolidate the domain knowledge that developers currently re-author separately.
  • Software Construction is the dominant knowledge area, with a long tail of specialised areas also present.
  • Modifications most often rework operational specifications or adapt knowledge and resources rather than altering user interaction rules.
  • Local changes tend to be additive, leaving the original skill structure intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated diff tools could flag when local additions might usefully flow back into the registry version.
  • The observed stability of behavioural contracts may produce consistent agent behaviour even when the same skill is used across unrelated projects.
  • The pattern could be tested in private or enterprise skill collections to see whether the one-time-copy habit persists outside public registries.

Load-bearing premise

The sampled registry skills, GitHub skills, and identified reuse links form a representative picture of real-world skill creation, copying, and change.

What would settle it

Finding a sizable collection of reuse cases in which the behavioural contract is routinely rewritten after copying would contradict the reported stability pattern.

Figures

Figures reproduced from arXiv: 2607.00911 by Christoph Treude, Haoyu Gao, Hong Yi Lin, Jai Lal Lulla, Mansooreh Zahedi, Sebastian Baltes.

Figure 1
Figure 1. Figure 1: Study design: we collect centralised and personal-use skills and recover their reuse links, then answer RQ1 (SWEBOK knowledge-area classification), [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SWEBOK KA distribution of agent skills and 22.1% of personal-use skills. This is followed by Software Engineering Operations (≈ 5% to 7%), Software Testing (≈ 5% to 6%), and Software Configuration Management (≈ 4% to 9%). The remaining 14 KAs form a long tail. Overall, SE skills concentrate on concrete code authoring and execution, whereas domains concerning planning, management, and professional practices… view at source ↗
Figure 3
Figure 3. Figure 3: SKILL.md token length and header count TABLE I CONFORMANCE AND FEATURE PRESENCE Provision Cent. (%) Pers. (%) Mandatory provisions: conformance rate SKILL.md exists at skill root 100.00 100.00 File opens with valid YAML frontmatter 99.80 99.99 name present and non-empty 99.79 98.37 name matches required char. format 96.35 95.04 name equals parent directory name 97.65 91.23 description present, non-empty, s… view at source ↗
read the original abstract

AI coding agents increasingly rely on skills: structured context bundles, typically a SKILL.md file with a YAML header and Markdown body, loaded on demand for domain knowledge, workflows, and scripts. Public registries such as skills.sh now host tens of thousands of skills, making them an emerging unit of reuse in agent-based software engineering. Yet skills have largely been viewed as agent capabilities rather than software artefacts whose content and evolution shape agent behaviour. We present the first empirical study of AI agent skills as engineered artefacts that are authored, reused, customised and maintained, across public registries and personal-use repositories. We mined 18,463 skills from skills.sh and 23,199 personal-use skills from 5,876 GitHub repositories, identifying 3,709 reuse links. LLM-based classification into SWEBOK knowledge areas (KAs) shows Software Construction dominates alongside a long tail of specialised areas. A thematic analysis of 180 skills identifies six content categories. Qualitative coding of 444 modifications reveals six themes, of which reworking operational specifications and adapting knowledge and resources are the primary target of change. Our findings show that reuse is largely a one-time copy operation: most reused skills remain near-verbatim, 53\% are never modified after adoption, and subsequent local maintenance is overwhelmingly additive. Customisation primarily adapts skills to local environments, whereas evolution accretes new inline domain knowledge. Across both, a stable behavioural contract -- how a skill interacts with users, monitors runtime state, and recovers from failures -- remains almost untouched. These results suggest maintenance effort should focus on project-specific bindings, and that registries and tool support should enable consolidating the domain knowledge skills re-author in isolation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims to conduct the first empirical study of AI agent skills as software artefacts by mining 18,463 skills from the skills.sh registry and 23,199 personal-use skills from 5,876 GitHub repositories, identifying 3,709 reuse links. It uses LLM classification to map skills to SWEBOK knowledge areas (finding Software Construction dominant), performs thematic analysis on 180 skills to derive six content categories, and qualitatively codes 444 modifications to identify six themes. The central observational claim is that reuse is largely a one-time copy operation, with 53% of reused skills never modified post-adoption, local maintenance overwhelmingly additive, customisation focused on local environments, evolution adding inline domain knowledge, and a stable behavioural contract remaining untouched across changes.

Significance. If the reuse-link detection and qualitative coding hold without selection bias or validation gaps, the work provides the first large-scale view of skills as reusable engineering artefacts rather than mere agent capabilities. The dataset scale (over 41k skills and 3.7k reuses) and the distinction between additive maintenance versus untouched contracts could usefully inform registry design and tool support for consolidating domain knowledge. No machine-checked proofs or parameter-free derivations are present, but the purely observational mining approach is appropriate for the research questions if the sampling and detection methods are shown to be unbiased.

major comments (3)
  1. [§3.2] §3.2 (Reuse Link Identification): The criteria and thresholds used to detect the 3,709 reuse links are not specified in sufficient detail. If the method relies on textual similarity, embedding cosine thresholds, or exact filename matches (standard in mining studies), then the sample is biased toward near-verbatim copies; this would render the claims that 'most reused skills remain near-verbatim' and '53% are never modified' partly definitional rather than observational, as heavily customised reuses would be systematically excluded.
  2. [§4.3] §4.3 (Qualitative Coding of Modifications): No inter-rater reliability statistics, sampling criteria for the 444 modifications, or validation details for the LLM-based SWEBOK classification are reported. This directly affects the soundness of the six themes (reworking operational specifications and adapting knowledge/resources as primary) and the inference that maintenance is 'overwhelmingly additive'.
  3. [§5] §5 (Discussion and Implications): The claim that the 18,463 + 23,199 skills and 3,709 links form a representative sample of real-world AI agent skill usage is asserted without justification or comparison to other adoption channels (e.g., private registries or non-GitHub repositories), weakening the generalisability of the 'one-time copy' and 'stable behavioural contract' conclusions.
minor comments (3)
  1. [§4] The mapping between the six content categories from the thematic analysis of 180 skills and the six modification themes is not explicitly cross-referenced in the results tables or figures, reducing clarity.
  2. [Results] Figure 2 (or equivalent distribution plot) lacks axis labels or legends clarifying the proportion of unmodified vs. modified reused skills, making the 53% statistic harder to verify visually.
  3. [Abstract and §4] The abstract states 'six content categories' and 'six themes' but the full text does not provide a concise summary table linking them to SWEBOK areas; adding one would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study of AI agent skills. We address each major comment below, indicating planned revisions to improve clarity, methodological transparency, and discussion of limitations.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Reuse Link Identification): The criteria and thresholds used to detect the 3,709 reuse links are not specified in sufficient detail. If the method relies on textual similarity, embedding cosine thresholds, or exact filename matches (standard in mining studies), then the sample is biased toward near-verbatim copies; this would render the claims that 'most reused skills remain near-verbatim' and '53% are never modified' partly definitional rather than observational, as heavily customised reuses would be systematically excluded.

    Authors: We agree the reuse-link detection criteria require fuller specification. In revision we will expand §3.2 with the exact matching procedure (including any similarity metric, threshold, or filename-based rule) and any manual validation performed. We will also add an explicit limitations paragraph noting that the method privileges detectable near-verbatim copies and that heavily customised reuses may be under-represented; the observational claim therefore applies to the class of reuses our detector can identify. revision: yes

  2. Referee: [§4.3] §4.3 (Qualitative Coding of Modifications): No inter-rater reliability statistics, sampling criteria for the 444 modifications, or validation details for the LLM-based SWEBOK classification are reported. This directly affects the soundness of the six themes (reworking operational specifications and adapting knowledge/resources as primary) and the inference that maintenance is 'overwhelmingly additive'.

    Authors: We will revise §4.3 to report (a) the sampling frame and selection criteria used to obtain the 444 modifications, (b) any inter-rater reliability statistics obtained during thematic coding, and (c) validation steps (e.g., human review of a subsample) for the LLM SWEBOK classifier. These additions will allow readers to assess the reliability of the six themes and the additive-maintenance inference. revision: yes

  3. Referee: [§5] §5 (Discussion and Implications): The claim that the 18,463 + 23,199 skills and 3,709 links form a representative sample of real-world AI agent skill usage is asserted without justification or comparison to other adoption channels (e.g., private registries or non-GitHub repositories), weakening the generalisability of the 'one-time copy' and 'stable behavioural contract' conclusions.

    Authors: We will strengthen §5 by (i) explicitly stating the scope of the sampled sources (public skills.sh registry and GitHub personal repositories), (ii) comparing coverage where public data permit, and (iii) adding a dedicated limitations subsection that qualifies generalisability to other channels such as private registries. The core observational findings will be framed as applying to the studied public corpus rather than asserted as universally representative. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical study with no derivations or self-referential reductions

full rationale

The paper conducts data mining of public registries and GitHub repositories followed by LLM classification, thematic analysis, and qualitative coding. No equations, fitted parameters, predictions derived from subsets, or self-citation chains appear in the abstract or described approach. The 3,709 reuse links and modification counts are presented as direct observational outputs rather than constructed by definition. The skeptic concern about detection thresholds cannot be evaluated as circularity without an explicit paper quote showing the identification method reduces the 'near-verbatim' claim to a definitional artifact; per rules, no such reduction is exhibited. The study is self-contained against external benchmarks of mined artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study depends on the assumption that public registry and GitHub data capture typical skill usage, that LLM classification into SWEBOK areas is sufficiently accurate, and that the sampled modifications represent real maintenance activity. No free parameters or invented entities are described.

axioms (2)
  • domain assumption SWEBOK knowledge areas provide an appropriate taxonomy for classifying AI agent skill content.
    Used for LLM-based classification of the mined skills.
  • domain assumption Identified reuse links and modification histories accurately reflect actual adoption and maintenance behavior.
    Central to the reuse and maintenance findings.

pith-pipeline@v0.9.1-grok · 5856 in / 1292 out tokens · 11553 ms · 2026-07-02T08:26:04.783805+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 14 canonical work pages · 9 internal anchors

  1. [1]

    Codereval: A benchmark of pragmatic code generation with generative pre-trained models,

    H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y . Ma, G. Liang, Y . Li, Q. Wang, and T. Xie, “Codereval: A benchmark of pragmatic code generation with generative pre-trained models,” inProceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 1–12

  2. [2]

    Codere- viewqa: The code review comprehension assessment for large language models,

    H. Y . Lin, C. Liu, H. Gao, P. Thongtanunam, and C. Treude, “Codere- viewqa: The code review comprehension assessment for large language models,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 9138–9166

  3. [3]

    Repairagent: An autonomous, llm-based agent for program repair,

    I. Bouzenia, P. Devanbu, and M. Pradel, “Repairagent: An autonomous, llm-based agent for program repair,” in2025 IEEE/ACM 47th Interna- tional Conference on Software Engineering (ICSE). IEEE, 2025, pp. 2188–2200

  4. [4]

    An empirical evaluation of using large language models for automated unit test generation,

    M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,”IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2023

  5. [5]

    Does my readme file need to be updated? exploring llm-based readme maintenance,

    H. Gao, H. Y . Lin, C. Treude, G. Gay, and M. Zahedi, “Does my readme file need to be updated? exploring llm-based readme maintenance,”arXiv preprint arXiv:2603.00489, 2026

  6. [6]

    The Agent Skills Directory — skills.sh,

    “The Agent Skills Directory — skills.sh,” https://www.skills.sh/, [Ac- cessed 28-06-2026]

  7. [7]

    Agent skills: A data- driven analysis of claude skills for extending large language model functionality,

    G. Ling, S. Zhong, and R. Huang, “Agent skills: A data-driven analysis of claude skills for extending large language model functionality,”arXiv preprint arXiv:2602.08004, 2026

  8. [8]

    Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

    Y . Liu, W. Wang, R. Feng, Y . Zhang, G. Xu, G. Deng, Y . Li, and L. Zhang, “Agent skills in the wild: An empirical study of security vulnerabilities at scale,”arXiv preprint arXiv:2601.10338, 2026

  9. [9]

    How Your Credentials Are Leaked by LLM Agent Skills: An Empirical Study

    Z. Chen, Y . Zhang, Y . Liu, G. Deng, Y . Li, Y . Zhang, J. Ning, L. Y . Zhang, L. Ma, and Z. Li, “Credential leakage in llm agent skills: A large-scale empirical study,”arXiv preprint arXiv:2604.03070, 2026

  10. [10]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    X. Li, W. Chen, Y . Liu, S. Zheng, X. Chen, Y . He, Y . Li, B. You, H. Shen, J. Sunet al., “Skillsbench: Benchmarking how well agent skills work across diverse tasks,”arXiv preprint arXiv:2602.12670, 2026

  11. [11]

    On the untriviality of trivial packages: An empirical study of npm javascript packages,

    M. A. R. Chowdhury, R. Abdalkareem, E. Shihab, and B. Adams, “On the untriviality of trivial packages: An empirical study of npm javascript packages,”IEEE Transactions on Software Engineering, vol. 48, no. 8, pp. 2695–2708, 2021

  12. [12]

    Small world with high risks: A study of security threats in the npm ecosystem,

    M. Zimmermann, C.-A. Staicu, C. Tenny, and M. Pradel, “Small world with high risks: A study of security threats in the npm ecosystem,” in28th USENIX Security symposium (USENIX security 19), 2019, pp. 995–1010

  13. [13]

    An empirical comparison of dependency network evolution in seven software packaging ecosystems,

    A. Decan, T. Mens, and P. Grosjean, “An empirical comparison of dependency network evolution in seven software packaging ecosystems,” Empirical Software Engineering, vol. 24, no. 1, pp. 381–416, 2019

  14. [14]

    Adapting installation instructions in rapidly evolving software ecosystems,

    H. Gao, C. Treude, and M. Zahedi, “Adapting installation instructions in rapidly evolving software ecosystems,”IEEE Transactions on Software Engineering, 2025

  15. [15]

    Software documentation issues unveiled,

    E. Aghajani, C. Nagy, O. L. Vega-M ´arquez, M. Linares-V ´asquez, L. Moreno, G. Bavota, and M. Lanza, “Software documentation issues unveiled,” in2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 1199–1210

  16. [16]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021

  17. [17]

    Large language models for software engi- neering: A systematic literature review,

    X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, 2024

  18. [18]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. El- nashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,”arXiv preprint arXiv:2302.11382, 2023

  19. [19]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  20. [20]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  21. [21]

    SWE-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024

  22. [22]

    SWE-bench: Can language models resolve real-world GitHub issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” inInternational Conference on Learning Representa- tions (ICLR), 2024

  23. [23]

    Large Language Model-Based Agents for Software Engineering: A Survey

    J. Liu, K. Wang, Y . Chen, X. Peng, Z. Chen, L. Zhang, and Y . Lou, “Large language model-based agents for software engineering: A sur- vey,”arXiv preprint arXiv:2409.02977, 2024

  24. [24]

    Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    X. Hou, Y . Zhao, S. Wang, and H. Wang, “Model context protocol (MCP): Landscape, security threats, and future research directions,” arXiv preprint arXiv:2503.23278, 2025

  25. [25]

    Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

    M. M. Hasan, H. Li, E. Fallahzadeh, G. K. Rajbahadur, B. Adams, and A. E. Hassan, “Model context protocol (MCP) at first glance: Study- ing the security and maintainability of MCP servers,”arXiv preprint arXiv:2506.13538, 2025

  26. [26]

    Agent READMEs: An empirical study of context files for agentic coding,

    W. Chatlatanagulchai, H. Li, Y . Kashiwa, B. Reid, K. Thonglek, P. Lee- laprute, A. Rungsawang, B. Manaskasemsak, B. Adams, A. E. Hassan, and H. Iida, “Agent READMEs: An empirical study of context files for agentic coding,”arXiv preprint arXiv:2511.12884, 2025

  27. [27]

    On the use of agentic coding manifests: An empirical study of Claude Code,

    W. Chatlatanagulchai, K. Thonglek, B. Reid, Y . Kashiwa, P. Leelaprute, A. Rungsawang, B. Manaskasemsak, and H. Iida, “On the use of agentic coding manifests: An empirical study of Claude Code,”arXiv preprint arXiv:2509.14744, 2025

  28. [28]

    On the impact of agents.md files on the efficiency of ai coding agents,

    J. L. Lulla, S. Mohsenimofidi, M. Galster, J. M. Zhang, S. Baltes, and C. Treude, “On the impact of agents.md files on the efficiency of ai coding agents,” inProceedings of the 1st Journal Ahead Workshop (JAWs) at the International Conference on Software Engineering (ICSE), 2026, accepted. [Online]. Available: https://arxiv.org/abs/2601.20404

  29. [29]

    An empirical study of usages, updates and risks of third-party libraries in Java projects,

    Y . Wang, B. Chen, K. Huang, B. Shi, C. Xu, X. Peng, Y . Wu, and Y . Liu, “An empirical study of usages, updates and risks of third-party libraries in Java projects,” in2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020

  30. [30]

    Backstabber’s knife collection: A review of open source software supply chain attacks,

    M. Ohm, H. Plate, A. Sykosch, and M. Meier, “Backstabber’s knife collection: A review of open source software supply chain attacks,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA). Springer, 2020

  31. [31]

    SoK: Taxonomy of attacks on open-source software supply chains,

    P. Ladisa, H. Plate, M. Martinez, and O. Barais, “SoK: Taxonomy of attacks on open-source software supply chains,” in2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023

  32. [32]

    An empirical study of pre-trained model reuse in the Hugging Face deep learning model registry,

    W. Jiang, N. Synovic, M. Hyatt, T. R. Schorlemmer, R. Sethi, Y .- H. Lu, G. K. Thiruvathukal, and J. C. Davis, “An empirical study of pre-trained model reuse in the Hugging Face deep learning model registry,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023

  33. [33]

    PeaTMOSS: A dataset and initial analysis of pre-trained models in open-source software,

    W. Jiang, J. Yasmin, J. Jones, N. Synovic, J. Kuo, N. Bielanski, Y . Tian, G. K. Thiruvathukal, and J. C. Davis, “PeaTMOSS: A dataset and initial analysis of pre-trained models in open-source software,” in 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR). IEEE, 2024

  34. [34]

    Programs, life cycles, and laws of software evolution,

    M. M. Lehman, “Programs, life cycles, and laws of software evolution,” Proceedings of the IEEE, vol. 68, no. 9, pp. 1060–1076, 1980

  35. [35]

    A large-scale empirical study on code-comment inconsistencies,

    F. Wen, C. Nagy, G. Bavota, and M. Lanza, “A large-scale empirical study on code-comment inconsistencies,” in2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 2019, pp. 53–64

  36. [36]

    Docable: Evaluating the executability of software tutorials,

    S. Mirhosseini and C. Parnin, “Docable: Evaluating the executability of software tutorials,” inProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 375–385

  37. [37]

    Cat- egorizing the content of GitHub README files,

    G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo, “Cat- egorizing the content of GitHub README files,”Empirical Software Engineering, vol. 24, pp. 1296–1327, 2019

  38. [38]

    Augmenting api documentation with insights from stack overflow,

    C. Treude and M. P. Robillard, “Augmenting api documentation with insights from stack overflow,” inProceedings of the 38th International Conference on Software Engineering, 2016, pp. 392–403

  39. [39]

    Improving api caveats accessibility by mining api caveats knowledge graph,

    H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, “Improving api caveats accessibility by mining api caveats knowledge graph,” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2018, pp. 183–193

  40. [40]

    J. W. Tukey,Exploratory Data Analysis. Reading, MA: Addison- Wesley, 1977

  41. [41]

    Software Engineering Body of Knowledge (SWEBOK) — com- puter.org,

    “Software Engineering Body of Knowledge (SWEBOK) — com- puter.org,” https://www.computer.org/education/bodies-of-knowledge/ software-engineering#about, [Accessed 24-06-2026]

  42. [42]

    Can llms re- place manual annotation of software engineering artifacts?

    T. Ahmed, P. Devanbu, C. Treude, and M. Pradel, “Can llms re- place manual annotation of software engineering artifacts?” in2025 IEEE/ACM 22nd International Conference on Mining Software Reposi- tories (MSR). IEEE, 2025, pp. 526–538

  43. [43]

    Interrater reliability: the kappa statistic,

    M. L. McHugh, “Interrater reliability: the kappa statistic,”Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012

  44. [44]

    Specification - Agent Skills — agentskills.io,

    “Specification - Agent Skills — agentskills.io,” https://agentskills.io/ specification, [Accessed 28-06-2026]

  45. [45]

    Sampling in software engineering research: A critical review and guidelines,

    S. Baltes and P. Ralph, “Sampling in software engineering research: A critical review and guidelines,”Empirical Software Engineering, vol. 27, no. 4, p. 94, 2022

  46. [46]

    Recommended steps for thematic synthesis in software engineering,

    D. S. Cruzes and T. Dyba, “Recommended steps for thematic synthesis in software engineering,” in2011 international symposium on empirical software engineering and measurement. IEEE, 2011, pp. 275–284

  47. [47]

    Self-organizing roles on agile soft- ware development teams,

    R. Hoda, J. Noble, and S. Marshall, “Self-organizing roles on agile soft- ware development teams,”IEEE Transactions on Software Engineering, vol. 39, no. 3, pp. 422–444, 2012

  48. [48]

    Fisher’s exact test,

    G. J. Upton, “Fisher’s exact test,”Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 155, no. 3, pp. 395–402, 1992

  49. [49]

    Controlling the false discovery rate: a practical and powerful approach to multiple testing,

    Y . Benjamini and Y . Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,”Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995

  50. [50]

    Cohen,Statistical Power Analysis for the Behavioral Sciences, 2nd ed

    J. Cohen,Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988

  51. [51]

    On a test of whether one of two random variables is stochastically larger than the other,

    H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,”The annals of mathematical statistics, pp. 50–60, 1947

  52. [52]

    Dominance statistics: Ordinal analyses to answer ordinal questions

    N. Cliff, “Dominance statistics: Ordinal analyses to answer ordinal questions.”Psychological bulletin, vol. 114, no. 3, p. 494, 1993

  53. [53]

    EASYTOOL: Enhancing LLM-based agents with concise tool instruction,

    S. Yuan, K. Song, J. Chen, X. Tan, Y . Shen, K. Ren, D. Li, and D. Yang, “EASYTOOL: Enhancing LLM-based agents with concise tool instruction,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang...

  54. [54]

    SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    H. Pu, X. Song, and L. Zhao, “Skillops: Managing llm agent skill libraries as self-maintaining software ecosystems,”arXiv preprint arXiv:2605.13716, 2026

  55. [55]

    An empirical study of the non-determinism of chatgpt in code generation,

    S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “An empirical study of the non-determinism of chatgpt in code generation,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

  56. [56]

    Let me speak freely? a study on the impact of format restrictions on large language model performance

    Z. R. Tam, C.-K. Wu, Y .-L. Tsai, C.-Y . Lin, H.-y. Lee, and Y .-N. Chen, “Let me speak freely? a study on the impact of format restrictions on large language model performance.” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024, pp. 1218–1236