pith. machine review for the scientific record.

arXiv: 2604.06861 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM agents · software issue resolution · requirements engineering · patch generation · automated bug fixing · AI for software engineering · structured requirements

The pith

REAgent raises the share of software issues that LLM agents resolve by an average of 17.40% over five baselines by supplying structured requirements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLM agents fail on many software issues because raw issue descriptions often lack context or contain ambiguity that blocks accurate understanding. REAgent counters this by automatically building structured requirements from the description, detecting poor-quality ones, and refining them iteratively before patch generation. Experiments on three benchmarks with two LLMs show consistent gains over direct-input baselines. If the approach holds, it would mean that requirements-engineering techniques can raise the rate of correct automated fixes without larger models or more tools. Readers would care because it targets a concrete obstacle in practical AI-assisted debugging.

Core claim

REAgent automatically constructs structured and information-rich issue-oriented requirements, identifies low-quality requirements, and iteratively refines them to improve patch correctness when LLMs generate fixes from issue descriptions.

What carries the argument

The pipeline that extracts, structures, quality-checks, and iteratively refines issue-oriented requirements to serve as precise task specifications for the LLM agent.
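
As a rough illustration of the loop described above, the following Python sketch wires construction, quality checking, and refinement together. It is not the authors' implementation: the `Requirement` fields, the three callables, and the `max_iterations` cap are hypothetical names chosen for this example, and the toy stand-ins at the bottom replace what would be LLM calls in practice. How the paper actually detects low-quality requirements is not described on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Requirement:
    """Hypothetical structured requirement; the field names are illustrative."""
    summary: str
    expected_behavior: str = ""
    context: List[str] = field(default_factory=list)


def build_requirement(
    issue_text: str,
    construct: Callable[[str], Requirement],
    is_low_quality: Callable[[Requirement], bool],
    refine: Callable[[Requirement, str], Requirement],
    max_iterations: int = 3,
) -> Requirement:
    """Construct a requirement, then refine it while it is judged low quality."""
    requirement = construct(issue_text)
    for _ in range(max_iterations):
        if not is_low_quality(requirement):
            break  # good enough to hand to the patch-generation agent
        requirement = refine(requirement, issue_text)
    return requirement


# Toy stand-ins (assumptions): a real system would call an LLM in each step.
def toy_construct(issue_text: str) -> Requirement:
    return Requirement(summary=issue_text.strip())


def toy_is_low_quality(requirement: Requirement) -> bool:
    # Treat a requirement with no stated expected behavior as low quality.
    return not requirement.expected_behavior


def toy_refine(requirement: Requirement, issue_text: str) -> Requirement:
    return Requirement(
        summary=requirement.summary,
        expected_behavior="No exception is raised for: " + issue_text.strip(),
        context=requirement.context,
    )


result = build_requirement(
    "  parser crashes on empty input ",
    toy_construct,
    toy_is_low_quality,
    toy_refine,
)
print(result)
```

Passing the three steps in as callables keeps the control flow separate from any particular model or prompt, which also makes it easy to cap or count the extra LLM calls that the refinement loop introduces.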

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same requirement-refinement loop might help other LLM tasks that start from ambiguous natural-language inputs, such as test-case generation or code explanation.
  • Integrating execution feedback directly into the refinement step could further reduce cases where a refined requirement still leads to incorrect patches.
  • If the method generalizes beyond the tested benchmarks, it could lower the need for humans to rewrite issue reports before feeding them to automated repair systems.

Load-bearing premise

Issue descriptions commonly contain missing context or ambiguity that can be reliably detected and corrected through automated construction and iterative refinement of structured requirements.

What would settle it

A rerun of REAgent on the same three benchmarks with the same two LLMs that shows no increase, or a decrease, in the percentage of resolved issues relative to the direct-input baselines.

Figures

Figures reproduced from arXiv: 2604.06861 by Chaofan Tao, Haoli Bai, Junjie Chen, Kaiwei Lin, Lifeng Shang, Shaowei Wang, Shiqi Kuang, Zhao Tian.

Figure 1: A real-world example from SWE-bench Verified.
Figure 2: The overview of REAgent.
Figure 3: Number of uniquely resolved instances across different techniques.
Figure 4: Influence of the number of iterations (N) on % Applied (↑) and % Resolved (↑) using DeepSeek-V3.2.
Original abstract

Issue resolution aims to automatically generate patches from given issue descriptions and has attracted significant attention with the rapid advancement of large language models (LLMs). However, due to the complexity of software issues and codebases, LLM-generated patches often fail to resolve corresponding issues. Although various advanced techniques have been proposed with carefully designed tools and workflows, they typically treat issue descriptions as direct inputs and largely overlook their quality (e.g., missing critical context or containing ambiguous information), which hinders LLMs from accurate understanding and resolution. To address this limitation, we draw on principles from software requirements engineering and propose REAgent, a requirement-driven LLM agent framework that introduces issue-oriented requirements as structured task specifications to better guide patch generation. Specifically, REAgent automatically constructs structured and information-rich issue-oriented requirements, identifies low-quality requirements, and iteratively refines them to improve patch correctness. We conduct comprehensive experiments on three widely used benchmarks using two advanced LLMs, comparing against five representative or state-of-the-art baselines. The results demonstrate that REAgent consistently outperforms all baselines, achieving an average improvement of 17.40% in terms of the number of successfully-resolved issues (% Resolved).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes REAgent, a requirement-driven LLM agent framework for automated software issue resolution. Drawing on software requirements engineering, it automatically constructs structured issue-oriented requirements from issue descriptions, detects low-quality requirements, and iteratively refines them to better guide LLM patch generation. Experiments on three benchmarks with two LLMs show REAgent outperforming five baselines by an average of 17.40% in the percentage of successfully resolved issues.

Significance. If the results hold under controlled conditions, the work would meaningfully advance LLM-based automated program repair by demonstrating that explicit, structured requirements can address ambiguities and missing context in issue reports. It provides a principled bridge between requirements engineering and AI agents, with potential for broader application in other complex LLM tasks. The use of public benchmarks supports reproducibility and practical relevance.

major comments (3)
  1. [Abstract and §4 (Experiments)] The headline 17.40% average improvement in % Resolved is reported without any indication that the number of LLM calls, token budgets, or total inference steps were matched between REAgent and the five baselines. Since REAgent's workflow includes automated requirement construction, low-quality detection, and iterative refinement, the gains could arise from additional sampling opportunities rather than the requirements-engineering framing.
  2. [§4 (Experiments)] No statistical tests, confidence intervals, or measures of variance are provided for the results across benchmarks and LLMs. This leaves the claim of consistent outperformance without evidence that improvements exceed stochastic variation in LLM outputs.
  3. [§3 (Methodology)] The central assumption that missing context or ambiguity in issue descriptions can be reliably detected and corrected through automated requirement construction and refinement lacks supporting ablation studies that isolate the contribution of the refinement loop versus base construction or extra LLM queries.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by briefly naming the three benchmarks and two LLMs used, allowing readers to immediately assess scope and generalizability.
  2. [§4] Notation for metrics such as % Resolved could be defined more explicitly on first use in the results section to improve clarity for readers unfamiliar with the benchmarks.
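
Major comment 2 asks for evidence that the gains exceed chance. One common option for paired resolved/unresolved outcomes on the same benchmark instances is an exact McNemar test. The sketch below is a choice made for this note, not a test named by the paper or the referee, and the outcome lists are made-up illustrative data. It addresses instance-level pairing only; run-to-run LLM variance would need repeated runs on top of it.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], treatment: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(baseline) != len(treatment):
        raise ValueError("outcome lists must be paired, instance by instance")
    # Discordant pairs: instances resolved by exactly one of the two systems.
    only_baseline = sum(1 for b, t in zip(baseline, treatment) if b and not t)
    only_treatment = sum(1 for b, t in zip(baseline, treatment) if t and not b)
    n = only_baseline + only_treatment
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence either way
    k = min(only_baseline, only_treatment)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative data (hypothetical): True means the instance was resolved.
baseline_outcomes = [True] * 5 + [True] * 1 + [False] * 9 + [False] * 5
treatment_outcomes = [True] * 5 + [False] * 1 + [True] * 9 + [False] * 5

p_value = mcnemar_exact(baseline_outcomes, treatment_outcomes)
print(f"exact McNemar p-value: {p_value:.4f}")
```

Only the instances on which the two systems disagree carry information here, which is why a large headline gap can still be weakly supported when most of the benchmark is resolved (or missed) by both.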

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of our work's significance. We address each major comment point by point below, with clarifications and commitments to revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The headline 17.40% average improvement in % Resolved is reported without any indication that the number of LLM calls, token budgets, or total inference steps were matched between REAgent and the five baselines. Since REAgent's workflow includes automated requirement construction, low-quality detection, and iterative refinement, the gains could arise from additional sampling opportunities rather than the requirements-engineering framing.

    Authors: We acknowledge that REAgent's multi-step workflow (requirement construction, quality detection, and refinement) typically involves more LLM calls than simpler baselines. However, the five baselines include advanced agent frameworks that also rely on multi-turn tool use and iterative interactions. In the revised §4, we will report the average number of LLM calls, token consumption, and inference steps for REAgent and all baselines to enable direct comparison. We will also add qualitative analysis showing how the structured requirements reduce ambiguity beyond raw additional queries, and discuss this as a limitation if full budget-matched experiments are not feasible within the revision timeline.

    Revision: partial

  2. Referee: [§4 (Experiments)] No statistical tests, confidence intervals, or measures of variance are provided for the results across benchmarks and LLMs. This leaves the claim of consistent outperformance without evidence that improvements exceed stochastic variation in LLM outputs.

    Authors: We agree that statistical rigor is needed to support claims of consistent improvement given LLM stochasticity. In the revised manuscript, we will include variance measures (e.g., standard deviation across repeated runs where possible), confidence intervals for the % Resolved metric, and appropriate statistical tests (such as paired comparisons) across the three benchmarks and two LLMs. This will be added to §4 to demonstrate that the reported gains exceed typical variation.

    Revision: yes

  3. Referee: [§3 (Methodology)] The central assumption that missing context or ambiguity in issue descriptions can be reliably detected and corrected through automated requirement construction and refinement lacks supporting ablation studies that isolate the contribution of the refinement loop versus base construction or extra LLM queries.

    Authors: We thank the referee for highlighting the need for component-level analysis. While the current evaluation demonstrates end-to-end gains, we did not provide ablations isolating the refinement loop. In the revised §4, we will add ablation experiments evaluating: (1) base requirement construction only, (2) construction plus quality detection without iteration, and (3) variants with query limits to control for extra LLM calls. These will quantify the incremental benefit of the iterative refinement and support the requirements-engineering framing.

    Revision: yes
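
The confidence intervals promised in response 2 could take several forms. A minimal sketch, assuming a Wilson score interval on the resolved proportion of a single run (a choice made here, not one stated by the authors), with hypothetical counts:

```python
from math import sqrt
from typing import Tuple


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must lie between 0 and total")
    p = resolved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts for illustration only; these are not results from the paper.
low, high = wilson_interval(resolved=50, total=100)
print(f"% Resolved = 50.0, interval = [{100 * low:.1f}, {100 * high:.1f}]")
```

An interval of this kind reflects sampling uncertainty over benchmark instances only. Variation across repeated LLM runs, which the authors also commit to reporting, would have to be measured separately.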

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper proposes REAgent as a requirements-engineering-inspired agent workflow and reports empirical gains (17.40% average lift in % Resolved) on three public benchmarks against five external baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described method. The central result is a direct performance comparison rather than a tautological reduction of outputs to inputs. Minor self-citations, if any, are not load-bearing for the headline claim, which remains falsifiable against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or detailed axioms are stated. The central premise is treated as a domain assumption about issue-description quality.

axioms (1)
  • Domain assumption: Issue descriptions frequently lack critical context or contain ambiguous information that hinders LLM understanding.
    Explicitly stated in the abstract as the core limitation being addressed.

pith-pipeline@v0.9.0 · 5518 in / 1179 out tokens · 36152 ms · 2026-05-10T17:53:16.653308+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
