Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests
Pith reviewed 2026-06-27 06:01 UTC · model grok-4.3
The pith
Creating instruction files for AI-agents yields mixed effects on pull request success across projects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes and with the efforts to merge an agentic PR. Projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections.
What carries the argument
Within-project before-and-after comparison around the creation of instruction files, along three dimensions of PR performance: success (merge rate), amount of change (code churn and number of modified files), and merge effort (merge time and comment count).
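A minimal sketch of that comparison for the merge-rate dimension follows. It is an illustration, not the authors' pipeline: the `PullRequest` fields, the day-based timestamps, and the reading of "at least 20%" as a relative change applied symmetrically to increases and decreases are all assumptions, since the summary does not say whether the threshold is relative or in percentage points.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    created_day: int  # hypothetical timestamp: days since project start
    merged: bool


def merge_rate(prs):
    """Fraction of PRs that were merged, or None when the list is empty."""
    if not prs:
        return None
    return sum(pr.merged for pr in prs) / len(prs)


def classify_project(prs, instruction_day, threshold=0.20):
    """Label one project by the relative change in merge rate.

    Assumption: PRs opened on or after instruction_day count as 'after',
    and the same threshold applies in both directions.
    """
    before = merge_rate([pr for pr in prs if pr.created_day < instruction_day])
    after = merge_rate([pr for pr in prs if pr.created_day >= instruction_day])
    if before is None or after is None or before == 0:
        return "undefined"  # no baseline to compare against
    relative_change = (after - before) / before
    if relative_change >= threshold:
        return "increase"
    if relative_change <= -threshold:
        return "decrease"
    return "stable"


# Made-up project: 2 of 4 PRs merged before the file, 3 of 4 after.
prs = [
    PullRequest(1, True), PullRequest(2, False),
    PullRequest(3, True), PullRequest(4, False),
    PullRequest(11, True), PullRequest(12, True),
    PullRequest(13, True), PullRequest(14, False),
]
label = classify_project(prs, instruction_day=10)
print(label)  # prints "increase" (0.50 -> 0.75 is a +50% relative change)
```

Repeating the classification for every project and tallying the labels would give shares comparable in form to the 27.7% and 26.35% figures, under the stated assumptions.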
If this is right
- Roughly one-quarter of projects can expect a meaningful rise in merge rate after adding instruction files, while a similar share can expect a decline.
- Positive outcomes associate with longer files that contain more sections and subsections.
- The mixed results across all three performance dimensions motivate treating instruction-file creation as an explicit software-engineering activity called Instructions-as-Code.
Where Pith is reading between the lines
- Projects may need to treat instruction files as living artifacts that receive updates and reviews like source code to capture the gains seen in the better-structured cases.
- The length-and-structure observation suggests that simple presence of a file is insufficient; future work could test whether automated checks for file structure predict later PR improvements.
- Neighboring problems such as prompt engineering for other agent tasks may benefit from the same before-after measurement approach used here.
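The automated structure check suggested above could start from something like the sketch below. It assumes the instruction file is Markdown with ATX headings (`#`, `##`, ...), treats level 1 and 2 headings as sections and deeper ones as sub-sections, and measures length in non-empty lines. None of these definitions come from the paper; they are placeholders for whatever the authors actually measured.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+\S")  # ATX heading: 1-6 hashes, then text


def structure_metrics(text):
    """Count non-empty lines, sections, and sub-sections in Markdown text."""
    nonempty_lines = sections = subsections = 0
    in_fence = False
    for line in text.splitlines():
        if line.strip():
            nonempty_lines += 1
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code fence
            continue
        if in_fence:
            continue  # '#' inside a code fence is a comment, not a heading
        match = HEADING.match(line)
        if match is None:
            continue
        if len(match.group(1)) <= 2:
            sections += 1      # assumption: levels 1-2 are sections
        else:
            subsections += 1   # assumption: levels 3+ are sub-sections
    return {
        "nonempty_lines": nonempty_lines,
        "sections": sections,
        "subsections": subsections,
    }


sample = "\n".join([
    "# Project guide",
    "",
    "## Build",
    "Run the build script first.",
    FENCE,
    "# a comment inside a code fence, not a heading",
    FENCE,
    "### Tests",
    "Run the unit tests before opening a PR.",
    "## Style",
])
metrics = structure_metrics(sample)
print(metrics)
```

Whether such counts predict later PR improvements is exactly the open question; the sketch only makes the measurement reproducible.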
Load-bearing premise
The before-and-after comparison within each project isolates the causal effect of instruction-file creation on the three PR metrics, with no other concurrent project changes confounding the observed differences.
What would settle it
A study that re-runs the before-after analysis after explicitly logging and controlling for any simultaneous changes in team size, review processes, or tooling would falsify the claim if the mixed outcome pattern disappears.
Original abstract
AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7\% of the projects increased their merge rate by at least 20\%, while 26.35\% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka, \textbf{Instructions-as-Code}).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 15,549 agentic PRs from 148 projects in the AIDev dataset to assess the impact of instruction files on AI-agent PR performance. It compares merge rate, code churn, number of modified files, merge time, and number of comments before vs. after instruction-file creation within each project, reporting mixed results (e.g., 27.7% of projects show >=20% merge-rate increase while 26.35% show a decrease). An exploratory analysis links longer, better-structured files to improvements and motivates treating instruction development as 'Instructions-as-Code'.
Significance. If the before-after design can be shown to isolate the effect of instruction files, the mixed outcomes would be a useful empirical signal for AI-assisted software engineering, underscoring that such files are not a guaranteed win and highlighting the value of structured instructions. The scale of the AIDev dataset is a positive feature.
major comments (3)
- [Abstract] Abstract: the central mixed-result claim rests on raw percentages (27.7% increase, 26.35% decrease) with no statistical tests, confidence intervals, or description of project-selection criteria for the 148 projects or definition of before/after windows.
- [Abstract] Abstract / comparison design: the within-project before-after comparison does not discuss or adjust for concurrent events (team-size changes, process updates, new tooling, or AI-agent adoption itself) that could confound the observed differences in merge rate and other metrics.
- [Exploratory analysis] Exploratory analysis: the reported association between longer, multi-section instruction files and merge-rate gains is presented without formal controls, regression, or tests for project-level covariates, limiting its ability to support the 'Instructions-as-Code' recommendation.
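To make the first comment concrete, the reported shares can be turned back into project counts and given an interval. The sketch below is a reader's illustration, not part of the paper: the counts are inferred from the percentages and the 148-project total, and the Wilson score interval is one common choice for a binomial proportion, chosen here as an assumption.

```python
import math


def implied_count(share_percent, n_projects):
    """Nearest whole number of projects implied by a reported percentage."""
    return round(share_percent / 100 * n_projects)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


N_PROJECTS = 148
increased = implied_count(27.7, N_PROJECTS)   # 41, since 41/148 = 27.70%
decreased = implied_count(26.35, N_PROJECTS)  # 39, since 39/148 = 26.35%

for name, count in [("increased", increased), ("decreased", decreased)]:
    low, high = wilson_interval(count, N_PROJECTS)
    print(f"{name}: {count}/{N_PROJECTS}, interval {low:.3f} to {high:.3f}")
```

With 148 projects each share rests on about forty projects, which is why the referee asks for intervals rather than raw percentages.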
minor comments (1)
- [Abstract] Abstract: the LaTeX markup '\textbf{Instructions-as-Code}' should be replaced with plain text or proper formatting for readability in non-LaTeX outputs.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the value of the AIDev dataset scale and the mixed-results signal. We address each major comment below and outline planned revisions to strengthen the manuscript.
point-by-point responses
-
Referee: [Abstract] Abstract: the central mixed-result claim rests on raw percentages (27.7% increase, 26.35% decrease) with no statistical tests, confidence intervals, or description of project-selection criteria for the 148 projects or definition of before/after windows.
Authors: We agree the abstract would be improved by statistical support and clearer methodology. In revision we will add paired statistical tests (e.g., Wilcoxon signed-rank) with confidence intervals for the before-after differences, summarize the selection criteria (all 148 AIDev projects that added instruction files), and define before/after windows by the commit timestamp of the instruction file. These details exist in the methods but will be condensed for the abstract. revision: yes
-
Referee: [Abstract] Abstract / comparison design: the within-project before-after comparison does not discuss or adjust for concurrent events (team-size changes, process updates, new tooling, or AI-agent adoption itself) that could confound the observed differences in merge rate and other metrics.
Authors: This is a fair observation about observational before-after designs. Within-project comparison mitigates some project-level heterogeneity, yet time-varying confounders remain possible. We will expand the limitations and threats-to-validity section to explicitly discuss these issues (team changes, tooling, adoption timing) and emphasize that results are associative. Additional data would be required for full adjustment; we will note this limitation clearly. revision: yes
-
Referee: [Exploratory analysis] Exploratory analysis: the reported association between longer, multi-section instruction files and merge-rate gains is presented without formal controls, regression, or tests for project-level covariates, limiting its ability to support the 'Instructions-as-Code' recommendation.
Authors: We concur that the current exploratory analysis lacks formal controls. The manuscript already labels it a 'first exploration' and frames Instructions-as-Code as a research motivation rather than a validated claim. In revision we will add a regression model controlling for available project covariates (size, age, contributor count) or, if data limitations prevent this, strengthen the caveats around the association. This will better ground the forward-looking recommendation. revision: partial
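The paired test named in the first response can be sketched in plain Python as follows. This is a simplified version under stated assumptions: it uses the normal approximation without tie or continuity corrections, which is rough for small samples, and the merge rates are invented integer percentages so that tied differences compare exactly. A statistics library would normally be used instead.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, n, two-sided p) via the normal approximation.

    Zero differences are dropped; tied absolute differences share the
    average of the ranks they span. No tie or continuity correction.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, n, p_value


# Hypothetical per-project merge rates (in %) before and after the file.
before = [50, 40, 70, 20, 90, 60]
after = [60, 30, 70, 45, 85, 80]
w_plus, n, p_value = wilcoxon_signed_rank(before, after)
print(w_plus, n, round(p_value, 3))
```

Here one pair is unchanged and dropped, leaving five differences with a rank sum of 11.5 for the increases against an expected 7.5 under no effect.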
Circularity Check
No circularity: purely observational before-after comparison
full rationale
The paper reports an empirical observational study that compares three PR metrics (merge rate, code churn, merge effort) within each of 148 projects before versus after instruction-file creation. No equations, fitted parameters, predictions, ansatzes, or derivation chains appear in the provided text or abstract. The central claim rests on direct data aggregation and percentage splits (27.7% increase, 26.35% decrease) rather than any self-referential modeling or self-citation load-bearing step. This is the normal case of a self-contained empirical analysis with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the AIDev dataset accurately records the timing of instruction-file creation and the associated agentic PR metrics for the 148 projects.